DHCP Failover Complexity?
Matt Causey
matt.causey at gmail.com
Tue May 19 08:13:19 UTC 2009
Greetings!
As we've been testing dhcpd failover in various scenarios, we've found
it to be very reliable (thanks!). However, after deploying to a few
sites, we have discovered a couple of distinct failure modes
(considering server A and server B):
* Server A or server B goes down for an extended period of time. The
remaining server is left in 'communications-interrupted'. If the
technician forgets to put the remaining server in partner-down, then
the site experiences an outage, because eventually the server hands
out all its leases.
Of course, if the technician puts the system in partner_down, this is
written in the leases file, and the state is persisted across bounces
of the server. Therefore, said technician must also remember to take
the system -out- of partner_down, or else:
* After a power event or something affecting the physical servers,
server A is in recover_wait and server B is in partner_down. Server A
wants to wait out the MCLT before giving out addresses. This means the
technician engaged to recover the dhcp environment must manually place
the remaining server in recover_done to get things moving along.
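For anyone who has not made one of these manual state changes before, here is a minimal sketch of how it might be driven through omshell. The peer name "dhcp-failover", port 7911, and the helper name omshell_set_state are assumptions for illustration only. The numeric state values differ between dhcpd releases, so check the failover-state section of dhcpd(8) for your version first. Key authentication is left out.

```perl
use strict;
use warnings;

# Build the omshell input that sets the local failover state.
# Assumptions: OMAPI is enabled on the server (omapi-port), and $state is
# the numeric value documented in dhcpd(8) for the installed release.
sub omshell_set_state {
    my ($peer_name, $state, $port) = @_;
    $port = 7911 unless defined $port;    # assumed omapi-port
    return join "\n",
        'server localhost',
        "port $port",
        'connect',
        'new failover-state',
        qq{set name = "$peer_name"},
        'open',
        "set local-state = $state",
        'update',
        '';
}

# Print the commands; pipe them into omshell to actually apply them.
# 4 is used here only because our script treats 4 as partner_down.
print omshell_set_state('dhcp-failover', 4);
```

Printing the commands rather than running them keeps the sketch harmless: a technician can read the output before feeding it to omshell.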
Failover seems to solve a lot of problems, but it introduces
complexities that can be difficult to troubleshoot for technicians who
are not intimately familiar with the guts of the daemon. :-) So,
what other edge cases are out there that we have not found yet?
On another note, I'm tinkering with a script which might help automate
some of the recovery actions and reduce the risk of human error. I want
to run this bit of logic by you guys and see what you think:
<snip>
# If we have been in communications-interrupted for longer than the number
# of seconds we decide, then automatically switch to partner-down.
# If we're in recover, but the partner thinks we're down, then let's
# move on to recover-done.
# State numbers: 3 = communications-interrupted, 4 = partner_down, 6 = recover.
if ( $local_status == 3
    and $local_time_in_current_state >= $max_comms_inter_time ) {
    print "INFO: we are in communications interrupted and have been for "
        . "$local_time_in_current_state seconds. "
        . "Switching to partner_down automatically.\n";
    print "executing $partner_down_script.\n";
    my $out = `$partner_down_script`;
    exit 0;
} elsif ( $local_status == 6 and $peer_status == 4
    and $local_time_in_current_state >= $max_recover_time ) {
    print "INFO: we are in recover, peer is in partner_down, and have been "
        . "in this state for $local_time_in_current_state seconds. "
        . "Switching to recover_done automatically.\n";
    print "executing $recover_done_script.\n";
    my $out = `$recover_done_script`;
    exit 0;
}
...
</snip>
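To make the logic easier to check without touching a live server, the same decision could be pulled out into a function with no side effects. This is only a sketch: the function name decide_action and the hash keys are made up for illustration, and the state numbers are simply the ones used in the snippet above.

```perl
use strict;
use warnings;

# State numbers as used in the snippet above.
use constant {
    COMMS_INTERRUPTED => 3,
    PARTNER_DOWN      => 4,
    RECOVER           => 6,
};

# Return the action to take, or undef when nothing should be done.
# Keys (hypothetical names): local, peer, elapsed, max_comms, max_recover.
sub decide_action {
    my (%s) = @_;
    return unless defined $s{local} && defined $s{elapsed};

    if ($s{local} == COMMS_INTERRUPTED && $s{elapsed} >= $s{max_comms}) {
        return 'partner_down';
    }
    if ($s{local} == RECOVER
        && defined $s{peer} && $s{peer} == PARTNER_DOWN
        && $s{elapsed} >= $s{max_recover}) {
        return 'recover_done';
    }
    return;
}

# Example: 900 seconds in communications-interrupted with a 600 second limit.
my $action = decide_action(
    local     => 3,   peer        => undef, elapsed => 900,
    max_comms => 600, max_recover => 300,
);
print defined $action ? "action: $action\n" : "no action\n";
# prints "action: partner_down"
```

Keeping the decision separate from the scripts that change state means the thresholds can be tried against made-up inputs, and an unknown peer state falls through to "no action" instead of triggering a change.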
Is this a bad idea? If so, why? What other conditions should I be looking for?
Thanks in advance!
--
Matt