DHCP Failover Complexity?

Tue May 19 08:13:19 UTC 2009

Greetings!

As we've been testing dhcpd failover in various scenarios, we've found
it to be very reliable (thanks!).  However, after deploying to a few
sites, we have discovered a couple of distinct failure modes
(considering server A and server B):

*  Server A or server B goes down for an extended period of time.  The
remaining server is left in 'communications-interrupted'.  If the
technician forgets to put the remaining server in partner-down, then
the site experiences an outage, because eventually the server hands
out all its leases.

Of course, if the technician puts the system in partner_down, this is
written in the leases file, and the state is persisted across bounces
of the server.  Therefore, said technician must also remember to take
the system -out- of partner_down, or else:

* After a power event or something affecting the physical servers,
server A is in recover_wait, server B in partner_down.  Server A then
wants to wait out the MCLT before giving out addresses.  This means the
technician engaged to recover the dhcp environment must manually place
the remaining server in recover_done to get things moving along.

Failover seems to solve a lot of problems, but it introduces
complexities that can be difficult to troubleshoot for technicians who
are not intimately familiar with the guts of the daemon.  :-)  So,
what other edge cases are there out there that we have not found yet?

On another note, I'm tinkering with a script which might help automate
some of the recovery actions and reduce the risk of human error.  I want
to run this bit of logic by you guys and see what you think:

<snip>
# If we are in communications-interrupted for longer than the number
# of seconds we decide, then automatically switch to partner-down.
# If we're in recover, but the partner thinks we're down, then let's
# move on to recover-done.
# The status values are compared as numbers, hence == rather than eq.
if ( $local_status == 3
     and $local_time_in_current_state >= $max_comms_inter_time ) {
        print "INFO: we are in communications interrupted and have "
            . "been for $local_time_in_current_state seconds.  Switching to "
            . "partner_down automatically.\n";
        print "executing $partner_down_script.\n";
        my $out = `$partner_down_script`;
        exit 0;
} elsif ( $local_status == 6 and $peer_status == 4
          and $local_time_in_current_state >= $max_recover_time ) {
        print "INFO: we are in recover, peer is in partner_down, and "
            . "have been in this state for $local_time_in_current_state seconds.  "
            . "Switching to recover_done automatically.\n";
        print "executing $recover_done_script.\n";
        my $out = `$recover_done_script`;
        exit 0;
}
...
</snip>
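
For discussion, the same decision could be pulled out into a pure
function, so the logic can be exercised without touching dhcpd at all.
This is only a sketch under assumptions: the sub name decide_action, the
argument names and the 900-second thresholds are made up for
illustration, and the state numbers (3, 4, 6) are simply the ones used
in the snippet above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# State numbers as used in the snippet above; the names are illustrative.
my %STATE = (
    communications_interrupted => 3,
    partner_down               => 4,
    recover                    => 6,
);

# Hypothetical helper: return the name of the state to switch to,
# or undef if no automatic action should be taken.
sub decide_action {
    my (%arg) = @_;
    my $local   = $arg{local_status};
    my $peer    = $arg{peer_status};      # may be undef if the peer is unreachable
    my $elapsed = $arg{seconds_in_state};

    # Stuck in communications-interrupted for too long: assume partner is down.
    if (   $local == $STATE{communications_interrupted}
        && $elapsed >= $arg{max_comms_inter_time} )
    {
        return 'partner_down';
    }

    # We are recovering and the peer already considers us down.
    if (   $local == $STATE{recover}
        && defined $peer
        && $peer == $STATE{partner_down}
        && $elapsed >= $arg{max_recover_time} )
    {
        return 'recover_done';
    }

    return;    # undef in scalar context: leave the state alone
}

# Example: 20 minutes in communications-interrupted with a 15-minute limit.
my $action = decide_action(
    local_status         => 3,
    peer_status          => undef,
    seconds_in_state     => 1200,
    max_comms_inter_time => 900,
    max_recover_time     => 900,
);
print defined $action ? "would switch to $action\n" : "no action\n";
```

Keeping the decision separate from the backtick calls would let the
script run in a dry-run mode first, only logging what it would have
done, before it is trusted to change state on its own.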

Is this a bad idea?  If so, why?  What other conditions should I be looking for?

Thanks in advance!

--
Matt