DHCP failover problems - still
sthaug at nethelp.no
Wed May 6 20:18:02 UTC 2009
> We've been having a problem for quite some time with failover and
> DHCPD server. For weeks at a time the servers will run absolutely
> great... then suddenly they just "lose connection" to each other and
> NEVER try to reconnect.
>
> The servers are sitting right next to each other on the same Cisco Gig-
> E switch, both servers are identical software run diskless via NFS...
> no other network service problems, no errors, nothing.
DHCP service is disk I/O intensive (dhcpd syncs each lease change to its
lease file before replying), so I would be rather sceptical about
running it over NFS.
> Suddenly, one day all of our leases are consumed and the servers stop
> handing out new leases.
>
> After more research we found that the failover connection between the
> two servers has been "interrupted". Even though the logs claim that
> the connection was interrupted, both servers are running perfectly
> independent of each other on the same LAN.
We have seen this too.
> So question #1 is I'm not sure why connections are interrupted in the
> first place... The LAN never lost carrier, the servers sit on a
> private low traffic network. According to the syslog....
My own theory is that the server gets sufficiently busy with disk I/O
(or in your case NFS I/O) that it doesn't have time to process the
keepalive messages. See for instance
https://lists.isc.org/pipermail/dhcp-users/2008-August/007017.html
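One rough way to test this theory is to time synchronous writes on the
filesystem that holds the lease file. The following is only a minimal
Python sketch: the function name measure_fsync_latency, the sample count
and the payload size are made up for illustration, and a scratch file is
only an approximation of what dhcpd does with its lease file.

```python
import os
import statistics
import tempfile
import time


def measure_fsync_latency(directory, samples=50, payload=b"x" * 512):
    """Time `samples` write+fsync cycles on a scratch file in `directory`.

    Hypothetical helper: returns a list of durations in seconds.
    """
    durations = []
    # The scratch file lives in the target directory so that it sits on
    # the same filesystem (local disk or NFS mount) as the lease file.
    fd, path = tempfile.mkstemp(dir=directory, prefix="fsync-probe-")
    try:
        for _ in range(samples):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # force the data out, as a synchronous commit would
            durations.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return durations


if __name__ == "__main__":
    # Assumption: the temp directory stands in for the lease directory;
    # replace it with the directory that holds dhcpd.leases on your system.
    d = measure_fsync_latency(tempfile.gettempdir())
    print("samples=%d median=%.2f ms max=%.2f ms"
          % (len(d), statistics.median(d) * 1000, max(d) * 1000))
```

If the maximum is occasionally in the range of seconds on the NFS mount
while a local disk stays in the millisecond range, that would at least be
consistent with the server stalling long enough to miss failover traffic.
It does not prove that this is the cause.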
> The second question is, why don't they attempt to "reconnect"?
This is a bug that I and several others have seen. I can reproduce it,
and I have tried to give ISC enough information to reproduce it
(offering the use of my lab if necessary), but so far without success.
See
https://lists.isc.org/pipermail/dhcp-users/2008-November/007433.html
Steinar Haug, Nethelp consulting, sthaug at nethelp.no