Restarting DHCP safely whilst avoiding partner-down state

Fri May 13 14:42:10 UTC 2016

On 13 May 2016 at 15:37, Pallissard, Matthew
<matthew.paul at pallissard.net> wrote:
> I just tested this and it seemed to work for me.

If you tail the log on the partner, do you not find that it transitions
to "partner-down" rather than "communications-interrupted"?
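
For anyone who would rather not eyeball the log, here is a rough Python
sketch of that check. The log wording it matches ("failover peer NAME: I
move from X to Y" and "... peer moves from X to Y") is an assumption based
on what dhcpd typically writes to syslog, and the function names are
purely illustrative, so please compare it against your own logs first.

```python
import re

# Assumed dhcpd log wording; verify against your own syslog output.
MOVE_RE = re.compile(
    r"failover peer (?P<peer>[^:]+): (?P<who>I move|peer moves) "
    r"from (?P<old>[\w-]+) to (?P<new>[\w-]+)"
)


def transitions(lines):
    """Yield (peer_name, who, old_state, new_state) for each state change.

    `who` is "local" for "I move ..." lines and "peer" for "peer moves ...".
    """
    for line in lines:
        match = MOVE_RE.search(line)
        if match:
            who = "local" if match.group("who") == "I move" else "peer"
            yield match.group("peer"), who, match.group("old"), match.group("new")


def went_partner_down(lines):
    """Return True if this server logged a move into partner-down."""
    return any(
        who == "local" and new == "partner-down"
        for _peer, who, _old, new in transitions(lines)
    )
```

Feeding it the lines logged on the surviving partner around the restart
window should show whether that server passed through "partner-down" or
only "communications-interrupted".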

Thanks!

> #dhcpd4.service
> [Unit]
> Description=IPv4 DHCP server
> After=network.target
>
> [Service]
> Type=forking
> PIDFile=/run/dhcpd4.pid
> ExecStart=/usr/bin/dhcpd -4 -q -cf /etc/dhcpd.conf -pf /run/dhcpd4.pid
> ExecStop=/path/to/shutdown/script.sh
>
> [Install]
> WantedBy=multi-user.target
>
> #/path/to/shutdown/script.sh
> #copy-pasted from
> # https://kb.isc.org/article/AA-00475/0/Sending-a-Server-Shutdown-Message-Via-OMAPI.html
> #
> #!/bin/sh
>
> #  uses omshell to connect to a dhcp server on the
> #  local machine, create a control object, set the
> #  state of the control object, and update the
> #  running server to cause that server to shut down
> #  gracefully.
> #
> #  per dhcpd man page, server shutdown can take
> #  several seconds as the server waits for close
> #  on all OMAPI connections.  Watching log files
> #  for shutdown messages is recommended.
>
> omshell << END_OF_INPUT > /dev/null 2> /dev/null
> server localhost
> port 7911
> key omapi_key Ofakekeyfakekeyfakekey==
> connect
> new control
> open
> set state=2
> update
> END_OF_INPUT
>
> echo "done sending shutdown instruction to dhcp server.."
>
> Matt Pallissard
>
>
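
The quoted script discards all omshell output, so a failed connection
looks the same as a successful shutdown request. Below is a hedged Python
sketch of the same command sequence that at least returns omshell's exit
status. The function names and the injectable `run` parameter are my own
for illustration, and I have not checked how reliably omshell's exit
status reflects failure, so treat it as a starting point only.

```python
import subprocess


def omshell_shutdown_input(key_name, key_secret, host="localhost", port=7911):
    """Return the omshell commands that set state=2 on a control object."""
    commands = [
        f"server {host}",
        f"port {port}",
        f"key {key_name} {key_secret}",
        "connect",
        "new control",
        "open",
        "set state=2",
        "update",
    ]
    return "\n".join(commands) + "\n"


def request_shutdown(key_name, key_secret, run=subprocess.run, **kwargs):
    """Pipe the commands to omshell on stdin and return its exit status.

    `run` is injectable (hypothetical convenience) so the call can be
    exercised without a real omshell binary or a running dhcpd.
    """
    commands = omshell_shutdown_input(key_name, key_secret, **kwargs)
    result = run(
        ["omshell"],
        input=commands,
        text=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode
```

As with the shell version, the key name and secret must match the omapi
key configured in dhcpd.conf, and the port must match omapi-port.
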
> On 05/13/2016 09:33 AM, Terry Burton wrote:
>>
>> On 13 May 2016 at 15:10, Steve van der Burg <steve.vanderburg at lhsc.on.ca>
>> wrote:
>>>
>>> Here we push out new configs to a partner pair from a central server.
>>> The config for one of the partners contains an extra file
>>> (dhcpd.i.am.secondary).  Each of the partners runs this every minute (perl
>>> script):
>>>
>>>   if ( -e "$spath/dhcpd.i.am.secondary" ) {
>>>      exit if (localtime)[1] % 2 == 0;
>>>   }
>>>   else {
>>>      exit if (localtime)[1] % 2 == 1;
>>>   }
>>>
>>>   ... continue (test new config, kill running server, start new one, etc)
>>>
>>> So the config change, stop, start, etc, can only happen on odd minutes
>>> for one server and even minutes for the other.  As long as startup time is
>>> less than a minute (and it's much, much less than that) it all works
>>> smoothly.
>>
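
For readers who don't speak Perl, the odd/even minute gate above comes
down to something like the following Python sketch. The function name is
made up, and calling the partner without the dhcpd.i.am.secondary file
the "primary" is my assumption about Steve's setup.

```python
from datetime import datetime


def may_run_now(is_secondary, now=None):
    """Return True if this partner is allowed to act in the current minute.

    Mirrors the quoted Perl: the partner with the marker file (secondary)
    proceeds only on odd minutes, the other partner only on even minutes.
    """
    if now is None:
        now = datetime.now()
    minute_is_odd = now.minute % 2 == 1
    return minute_is_odd if is_secondary else not minute_is_odd
```

Because the two partners can never pass the gate in the same minute, a
stop/start that finishes well within a minute is always done before the
other partner gets its turn.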
>>
>> Thanks Steve. We've also been pushing configs around then
>> synchronously restarting servers back-to-back (without sleeping) for
>> several years without incident.
>>
>> It makes me a little suspicious about whether just killing the process
>> is indeed unsafe... But then maybe we've been lucky.
>>
>> As mentioned I want to improve on what distributions are currently
>> doing so I'm deliberately setting the bar high and it would be great
>> if ISC could provide a single, approved, safe shutdown/restart
>> mechanism or describe what is required to develop such a mechanism.
>> Unfortunately the detail of Bug #36066 (retracting support for gentle
>> shutdown) isn't available as it would be interesting to see what
>> issues were encountered with the previous approach.
>>
>>
>>> Chuck Anderson <cra at WPI.EDU> wrote:
>>>>
>>>> FWIW, we've been using the "kill" method for over a decade without any
>>>> noticeable side-effects (the default init.d scripts from RHEL 6
>>>> (actually Scientific Linux 6) dhcp package).  We've never had to
>>>> manually clean up a corrupted lease file.  We restart the services
>>>> automatically on a 20 minute cycle, as needed.  We do one, then
>>>> immediately do the other.  We do not wait to restart the other, and we
>>>> do not monitor to see if failover has reconnected and rebalanced
>>>> before restarting the other, but since we are SSH-ing into each server
>>>> to do the restart, there might be enough of a built-in delay between
>>>> restarting each server.
>>>>
>>>> I don't know if a corrupted lease file would cause a failure to start
>>>> the dhcp server, or if it would just go unnoticed, perhaps with a log
>>>> message.  But like I said, we've never had a failure to start the
>>>> server that was caused by a lease file issue.
>>>>
>>>> Our script does test the config file before doing the restart:
>>>>
>>>> #!/bin/bash
>>>> echo -n "Testing DHCP configuration: "
>>>> if sudo /etc/rc.d/init.d/dhcpd configtest; then
>>>>         echo "Restarting DHCP"
>>>>         sudo /etc/rc.d/init.d/dhcpd restart
>>>> else
>>>>         echo "FAIL: Not restarting DHCP"
>>>> fi
>>>>
>>>> which in CentOS 6 does the following:
>>>>
>>>> exec=/usr/sbin/dhcpd
>>>> configtest() {
>>>>     [ -x $exec ] || return 5
>>>>     [ -f $config ] || return 6
>>>>     $exec -q -t -cf $config
>>>>     RETVAL=$?
>>>>     if [ $RETVAL -eq 1 ]; then
>>>>         $exec -t -cf $config
>>>>     else
>>>>         echo "Syntax: OK" >&2
>>>>     fi
>>>>     return $RETVAL
>>>> }
>>>>
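
The test-then-restart pattern above can also be sketched in Python for
anyone driving restarts from a deployment tool. The paths and the
`-q -t -cf` flags are taken from the quoted scripts; the function name
and the injectable `run` parameter are hypothetical, and the sketch
assumes it already runs with enough privilege (no sudo).

```python
import subprocess


def restart_if_config_ok(
    config,
    dhcpd="/usr/sbin/dhcpd",
    restart_cmd=("/etc/rc.d/init.d/dhcpd", "restart"),
    run=subprocess.run,
):
    """Restart dhcpd only if `dhcpd -t` accepts the config file.

    Returns True if the restart command was run, False if the config
    test failed and the running server was left untouched.
    """
    check = run([dhcpd, "-q", "-t", "-cf", config])
    if check.returncode != 0:
        return False
    # check=True raises CalledProcessError if the restart itself fails.
    run(list(restart_cmd), check=True)
    return True
```

The point of the ordering is simply that a broken config never takes
down a server that is currently working.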
>>>>
>>>> On Fri, May 13, 2016 at 02:00:03PM +0100, Terry Burton wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I'm attempting to write a systemd .service file for my own uses of ISC
>>>>> DHCP. However, if it can be made sufficiently generic then I would
>>>>> intend to push this upstream or at least into distributions.
>>>>>
>>>>> It needs to be suitable for managing failover pairs and I'm struggling
>>>>> with the age-old problem of restarting a dhcpd instance. From reading
>>>>> around there does not currently appear to be a method for restarting
>>>>> dhcpd that is both *safe* and *useful* in such a setup.
>>>>>
>>>>>
>>>>> Restarting with signals:
>>>>>
>>>>> From AA-01043 (Last Updated: 2015-03-18): "kill is the recommended
>>>>> option, except where there is a high turnover of leases and the
>>>>> production environment requires a high degree of reliability from
>>>>> DHCP. In that case, we'd suggest that administrators consider using
>>>>> OMAPI to control the daemon instead and to request a graceful
>>>>> shutdown. The reason for this is that there is the slight possibility
>>>>> that by using kill, administrators may stop dhcpd in the middle of
>>>>> appending a lease to the leases file (in which case it may become
>>>>> corrupted). This risk, while tiny, may be significant enough for some
>>>>> administrators to prefer to use OMAPI instead."
>>>>>
>>>>> In other words this is recommending that casual users take the risk
>>>>> that their service might not recover after restarting. This may be
>>>>> unlikely but it's still dangerous advice! The documentation does
>>>>> indicate that a feature for "gentle shutdown" in response to a signal
>>>>> was added in the 4.2 time frame and then subsequently removed:
>>>>>
>>>>> "Added support for gentle shutdown after signal is received. [ISC-Bugs
>>>>> #32692] [ISC-Bugs 34945]"
>>>>> "Disable the gentle shutdown functionality until we can determine the
>>>>> best way to present it to remove or reduce the side effects. [ISC-Bugs
>>>>> #36066]"
>>>>>
>>>>> Is it still the case that kill isn't suitable for production purposes?
>>>>>
>>>>>
>>>>> With OMAPI:
>>>>>
>>>>> You can cleanly shut down via OMAPI ("set state=2", etc.); however, the
>>>>> effect on the failover protocol is less ideal than with signals.
>>>>>
>>>>> An OMAPI shutdown places the partner into "partner-down" state, making
>>>>> it become active for all leases in the failover pools, which isn't
>>>>> ideal when briefly restarting an instance. Contrast this with the effect
>>>>> of restarting an instance with kill, which is to briefly place the
>>>>> partner into "communications-interrupted" state, from which it
>>>>> immediately reverts to "normal" once the restarted instance is available
>>>>> (with auto-partner-down taking care of things if the instance does
>>>>> not recover).
>>>>>
>>>>>
>>>>> Is there a safe way to restart DHCP that has minimal impact on the
>>>>> failover protocol?
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Terry