Restarting DHCP safely whilst avoiding partner-down state

Fri May 13 14:33:26 UTC 2016

On 13 May 2016 at 15:10, Steve van der Burg <steve.vanderburg at lhsc.on.ca> wrote:
> Here we push out new configs to a partner pair from a central server.  The config for one of the partners contains an extra file (dhcpd.i.am.secondary).  Each of the partners runs this every minute (perl script):
>
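>   # (localtime)[1] is the current minute (0-59)
>   if ( -e "$spath/dhcpd.i.am.secondary" ) {
>      exit if (localtime)[1] % 2 == 0;   # marker present: odd minutes only
>   }
>   else {
>      exit if (localtime)[1] % 2 == 1;   # no marker: even minutes only
>   }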
>
>   ... continue (test new config, kill running server, start new one, etc)
>
> So the config change, stop, start, etc, can only happen on odd minutes for one server and even minutes for the other.  As long as startup time is less than a minute (and it's much, much less than that) it all works smoothly.

Thanks Steve. We've also been pushing configs around and then
restarting the servers synchronously, back-to-back (without sleeping),
for several years without incident.
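
For anyone following along, the minute-parity gate Steve describes
could be sketched in shell roughly as follows. This is only an
illustration, not Steve's actual script: the function name
restart_allowed is made up here, and only the marker file name
(dhcpd.i.am.secondary) comes from his description.

```shell
#!/bin/sh
# Sketch only: gate a restart job on minute parity so that the two
# failover partners never act in the same minute.
# restart_allowed is a hypothetical helper name.

# restart_allowed MINUTE MARKER_FILE
# Returns 0 when this host may proceed during MINUTE:
#   marker file present -> odd minutes only
#   marker file absent  -> even minutes only
restart_allowed() {
    minute=$1
    marker=$2
    # Strip one leading zero so "08" and "09" are not read as octal.
    minute=${minute#0}
    parity=$(( ${minute:-0} % 2 ))
    if [ -e "$marker" ]; then
        [ "$parity" -eq 1 ]
    else
        [ "$parity" -eq 0 ]
    fi
}

# Intended use at the top of the per-minute job (left commented out):
#   restart_allowed "$(date +%M)" "$spath/dhcpd.i.am.secondary" || exit 0
#   ... continue (test new config, stop running server, start new one)
```

Passing the minute in as an argument, rather than calling date inside
the function, keeps the check easy to exercise by hand for any minute.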

It makes me doubt whether just killing the process is really
unsafe... but then maybe we've been lucky.

As mentioned, I want to improve on what distributions are currently
doing, so I'm deliberately setting the bar high. It would be great
if ISC could provide a single, approved, safe shutdown/restart
mechanism, or describe what is required to develop such a mechanism.
Unfortunately the detail of Bug #36066 (retracting support for gentle
shutdown) isn't available; it would be interesting to see what
issues were encountered with the previous approach.

> Chuck Anderson <cra at WPI.EDU> wrote:
>> FWIW, we've been using the "kill" method for over a decade without any
>> noticeable side-effects (the default init.d scripts from RHEL 6
>> (actually Scientific Linux 6) dhcp package).  We've never had to
>> manually clean up a corrupted lease file.  We restart the services
>> automatically on a 20 minute cycle, as needed.  We do one, then
>> immediately do the other.  We do not wait to restart the other, and we
>> do not monitor to see if failover has reconnected and rebalanced
>> before restarting the other, but since we are SSH-ing into each server
>> to do the restart, there might be enough of a built-in delay between
>> restarting each server.
>>
>> I don't know if a corrupted lease file would cause a failure to start
>> the dhcp server, or if it would just go unnoticed, perhaps with a log
>> message.  But like I said, we've never had a failure to start the
>> server that was caused by a lease file issue.
>>
>> Our script does test the config file before doing the restart:
>>
>> #!/bin/bash
>> echo -n "Testing DHCP configuration: "
>> if sudo /etc/rc.d/init.d/dhcpd configtest; then
>>         echo "Restarting DHCP"
>>         sudo /etc/rc.d/init.d/dhcpd restart
>> else
>>         echo "FAIL: Not restarting DHCP"
>> fi
>>
>> which in CentOS 6 does the following:
>>
>> exec=/usr/sbin/dhcpd
>> configtest() {
>>     [ -x $exec ] || return 5
>>     [ -f $config ] || return 6
>>     $exec -q -t -cf $config
>>     RETVAL=$?
>>     if [ $RETVAL -eq 1 ]; then
>>         $exec -t -cf $config
>>     else
>>         echo "Syntax: OK" >&2
>>     fi
>>     return $RETVAL
>> }
>>
>>
>> On Fri, May 13, 2016 at 02:00:03PM +0100, Terry Burton wrote:
>>> Hi,
>>>
>>> I'm attempting to write a systemd .service file for my own uses of ISC
>>> DHCP. However, if it can be made sufficiently generic then I would
>>> intend to push this upstream or at least into distributions.
>>>
>>> It needs to be suitable for managing failover pairs and I'm struggling
>>> with the age-old problem of restarting a dhcpd instance. From reading
>>> around there does not currently appear to be a method for restarting
>>> dhcpd that is both *safe* and *useful* in such a setup.
>>>
>>>
>>> Restarting with signals:
>>>
>>> From AA-01043 (Last Updated: 2015-03-18): "kill is the recommended
>>> option, except where there is a high turnover of leases and the
>>> production environment requires a high degree of reliability from
>>> DHCP. In that case, we'd suggest that administrators consider using
>>> OMAPI to control the daemon instead and to request a graceful
>>> shutdown. The reason for this is that there is the slight possibility
>>> that by using kill, administrators may stop dhcpd in the middle of
>>> appending a lease to the leases file (in which case it may become
>>> corrupted). This risk, while tiny, may be significant enough for some
>>> administrators to prefer to use OMAPI instead."
>>>
>>> In other words this is recommending that casual users take the risk
>>> that their service might not recover after restarting. This may be
>>> unlikely but it's still dangerous advice! The documentation does
>>> indicate that a feature for "gentle shutdown" in response to a signal
>>> was added in the 4.2 time frame and then subsequently removed:
>>>
>>> "Added support for gentle shutdown after signal is received. [ISC-Bugs
>>> #32692] [ISC-Bugs 34945]"
>>> "Disable the gentle shutdown functionality until we can determine the
>>> best way to present it to remove or reduce the side effects. [ISC-Bugs
>>> #36066]"
>>>
>>> Is it still the case that kill isn't suitable for production purposes?
>>>
>>>
>>> With OMAPI:
>>>
>>> You can cleanly shut down via OMAPI ("set state=2", etc.); however,
>>> the effect on the failover protocol is less ideal than with signals.
>>>
>>> An OMAPI shutdown will place the partner into "partner-down" state,
>>> making it become active for all leases in the failover pools, which
>>> isn't ideal when briefly restarting an instance. Contrast this with
>>> the effect of restarting an instance with kill, which is to briefly
>>> place the partner into "communications-interrupted" state, from which
>>> it immediately reverts to "normal" once the restarted instance is
>>> available (with auto-partner-down taking care of things if the
>>> instance does not recover).
>>>
>>>
>>> Is there a safe way to restart DHCP that has minimal impact on the
>>> failover protocol?
>>>
>>>
>>> Thanks,
>>>
>>> Terry