Restarting DHCP safely whilst avoiding partner-down state

Fri May 13 14:23:25 UTC 2016

On 13 May 2016 at 14:22, Chuck Anderson <cra at wpi.edu> wrote:
> FWIW, we've been using the "kill" method for over a decade without any
> noticeable side-effects (the default init.d scripts from RHEL 6
> (actually Scientific Linux 6) dhcp package).  We've never had to
> manually clean up a corrupted lease file.  We restart the services
> automatically on a 20 minute cycle, as needed.  We do one, then
> immediately do the other.  We do not wait to restart the other, and we
> do not monitor to see if failover has reconnected and rebalanced
> before restarting the other, but since we are SSH-ing into each server
> to do the restart, there might be enough of a built-in delay between
> restarting each server.

Thanks Chuck,

That's exactly our experience with SCPing the config from our IPAM
host, then using SSH to test and restart, for each instance. It has
*never* failed in practice, despite our doing this as often as every
5 minutes.
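
For anyone wanting to picture that push/test/restart cycle, here is a
minimal sketch rather than our actual scripts. The host names, the
staging path and the "service dhcpd restart" invocation are assumptions
for illustration only; "dhcpd -t -cf" is the stock configuration syntax
check. Servers are handled one at a time and the loop stops at the
first failing step, so a bad config never reaches the second server.

```python
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[Sequence[str]], int]


def run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit status."""
    return subprocess.run(list(cmd)).returncode


def deploy_steps(host: str, local_conf: str,
                 staged_conf: str = "/etc/dhcp/dhcpd.conf.new",  # assumed path
                 live_conf: str = "/etc/dhcp/dhcpd.conf",        # assumed path
                 service: str = "dhcpd") -> List[List[str]]:      # assumed name
    """Commands to copy, syntax-check, install and restart on one host."""
    return [
        ["scp", local_conf, f"{host}:{staged_conf}"],
        # dhcpd -t only parses the config; it does not start serving.
        ["ssh", host, "dhcpd", "-t", "-cf", staged_conf],
        ["ssh", host, "mv", staged_conf, live_conf],
        ["ssh", host, "service", service, "restart"],
    ]


def deploy(hosts: Sequence[str], local_conf: str, runner: Runner = run) -> bool:
    """Deploy to each host in turn; abort everything on the first failure."""
    for host in hosts:
        for cmd in deploy_steps(host, local_conf):
            if runner(cmd) != 0:
                return False
    return True
```

Something like deploy(["dhcp1", "dhcp2"], "dhcpd.conf") would do one
server and then the other. Note that it does not wait for failover to
return to "normal" between the two, which matches what Chuck describes.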

But since I have the incentive to migrate our sys-v init scripts to
systemd and produce something useful to others I am trying to set the
bar higher and do the "right thing".

> I don't know if a corrupted lease file would cause a failure to start
> the dhcp server, or if it would just go unnoticed, perhaps with a log
> message.  But like I said, we've never had a failure to start the
> server that was caused by a lease file issue.

In our experience, lease files corrupted by other means can cause a
failure to start. I don't recall whether that was due to mere
truncation though...
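
To make "mere truncation" concrete, here is a rough heuristic, not
dhcpd's own parser: it reports a leases file as suspect if it ends
inside a quoted string or with an unclosed "{ ... }" block. The
function name is made up for illustration. The authoritative check
remains dhcpd itself (the -T option tests the lease database), and a
file cut cleanly between two lease entries would pass this check.

```python
def looks_truncated(text: str) -> bool:
    """Heuristic: True if the text ends mid-string or with unbalanced braces."""
    depth = 0
    in_string = False
    escaped = False
    in_comment = False
    for ch in text:
        if in_comment:
            # '#' comments run to the end of the line.
            if ch == "\n":
                in_comment = False
        elif in_string:
            # Braces inside quoted values must not affect the depth count.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == "#":
            in_comment = True
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
    return in_string or depth != 0
```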

Thanks again for sharing your scripts.

> On Fri, May 13, 2016 at 02:00:03PM +0100, Terry Burton wrote:
>> Hi,
>>
>> I'm attempting to write a systemd .service file for my own uses of ISC
>> DHCP. However, if it can be made sufficiently generic then I would
>> intend to push this upstream or at least into distributions.
>>
>> It needs to be suitable for managing failover pairs and I'm struggling
>> with the age-old problem of restarting a dhcpd instance. From reading
>> around there does not currently appear to be a method for restarting
>> dhcpd that is both *safe* and *useful* in such a setup.
>>
>>
>> Restarting with signals:
>>
>> From AA-01043 (Last Updated: 2015-03-18): "kill is the recommended
>> option, except where there is a high turnover of leases and the
>> production environment requires a high degree of reliability from
>> DHCP. In that case, we'd suggest that administrators consider using
>> OMAPI to control the daemon instead and to request a graceful
>> shutdown. The reason for this is that there is the slight possibility
>> that by using kill, administrators may stop dhcpd in the middle of
>> appending a lease to the leases file (in which case it may become
>> corrupted). This risk, while tiny, may be significant enough for some
>> administrators to prefer to use OMAPI instead."
>>
>> In other words this is recommending that casual users take the risk
>> that their service might not recover after restarting. This may be
>> unlikely but it's still dangerous advice! The documentation does
>> indicate that a feature for "gentle shutdown" in response to a signal
>> was added in the 4.2 time frame and then subsequently removed:
>>
>> "Added support for gentle shutdown after signal is received. [ISC-Bugs
>> #32692] [ISC-Bugs 34945]"
>> "Disable the gentle shutdown functionality until we can determine the
>> best way to present it to remove or reduce the side effects. [ISC-Bugs
>> #36066]"
>>
>> Is it still the case that kill isn't suitable for production purposes?
>>
>>
>> With OMAPI:
>>
>> You can cleanly shut down via OMAPI ("set state=2", etc.); however, the
>> effect on the failover protocol is less ideal than with signals.
>>
>> OMAPI shutdown will place the partner into "partner-down" state, making
>> it become active for all leases in the failover pools, which isn't
>> ideal when briefly restarting an instance. Contrast this with the effect
>> of restarting an instance with kill, which is to briefly place the
>> partner into "communications-interrupted" state, from which it
>> immediately reverts to "normal" once the restarted instance is available
>> (with auto-partner-down taking care of things if the instance does
>> not recover).
>>
>>
>> Is there a safe way to restart DHCP that has minimal impact on the
>> failover protocol?
>>
>>
>> Thanks,
>>
>> Terry