DHCP failover problems - still

Wed May 6 20:18:02 UTC 2009

> We've been having a problem for quite some time with failover and  
> DHCPD server.  For weeks at a time the servers will run absolutely  
> great... then suddenly they just "lose connection" to each other and  
> NEVER try to reconnect.
> 
> The servers are sitting right next to each other on the same Cisco
> Gig-E switch, both servers are identical software run diskless via NFS...
> no other network service problems, no errors, nothing.

DHCP is a (disk) I/O-intensive activity, since dhcpd commits each lease
change to the lease file before replying, and I would be rather
sceptical about running this over NFS.

> Suddenly, one day all of our leases are consumed and the servers stop  
> handing out new leases.
> 
> After more research we found that the failover connection between the  
> two servers has been "interrupted".  Even though the logs claim that  
> the connection was interrupted, both servers are running perfectly  
> independent of each other on the same LAN.

We have seen this too.

> So question #1 is I'm not sure why connections are interrupted in the  
> first place...  The LAN never lost carrier, the servers sit on a  
> private low traffic network.  According to the syslog....

My own theory is that the server gets sufficiently busy with disk I/O
(or in your case NFS I/O) that it doesn't have time to process the
keepalive messages. See for instance

  https://lists.isc.org/pipermail/dhcp-users/2008-August/007017.html
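
The timeout involved here is max-response-delay in the failover peer
declaration: the number of seconds that may pass without a message from
the peer before dhcpd assumes the connection has failed. A sketch of
where it sits in dhcpd.conf follows. The peer name, addresses, ports and
all values are placeholders, not taken from the setup described above,
and it is only an assumption (not a tested fix) that a larger value would
ride out a burst of slow disk or NFS I/O.

```
# Primary side only; the secondary uses "secondary;" and omits mclt/split.
# Name, addresses, ports and values below are illustrative placeholders.
failover peer "dhcp-failover" {
  primary;
  address 192.0.2.1;
  port 647;
  peer address 192.0.2.2;
  peer port 647;
  max-response-delay 60;     # seconds of peer silence before the link is considered down
  max-unacked-updates 10;
  mclt 3600;
  split 128;
  load balance max seconds 3;
}
```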

> The second question is, why don't they attempt to "reconnect"?

This is a bug that I and several others have seen. I can reproduce it,
and I have tried to give ISC enough info to reproduce it (offering the
use of my lab if necessary). But so far no luck. See

  https://lists.isc.org/pipermail/dhcp-users/2008-November/007433.html

Steinar Haug, Nethelp consulting, sthaug at nethelp.no