DHCP failover problems - still

Sun Nov 22 21:00:13 UTC 2009

> Digging more, I noticed that there was a large spike of dhcp requests/
> acks at the beginning of one of the comms failures.  Looks like we saw
> a period of 60 requests/acks per second for a brief period.  Maybe 4-6
> offers per second during that time.
> 
> From the capture, I see that the secondary sends contacts and bind
> updates to the primary.  I can see from a TCP perspective that those  
> frames are ACK'd.  This is good news as it means the network is  
> healthy.  But it seems that during this time, the primary dhcp daemon  
> isn't sending the acknowledgements for those contacts and bind updates.
> 
> Is it possible/plausible that the dhcp daemon could be 'too busy' at  
> 65 xactions/second to respond to the failover messages?  I mean, mebbe  
> we are blocking on disk or CPU... But I've reviewed the performance  
> stats for these systems and they are not busy at all.  They're modern  
> whitebox servers, 8G RAM, quad core wirlygig with all the animal  
> options...

This is what we suspect is happening - that the TCP failover traffic
is being starved by a high rate of UDP traffic and/or disk traffic.
See

https://lists.isc.org/pipermail/dhcp-users/2008-August/007017.html
https://lists.isc.org/pipermail/dhcp-users/2008-October/007351.html
https://lists.isc.org/pipermail/dhcp-users/2008-November/007426.html

Please note that the problem of the servers not reconnecting, as
mentioned in the last two messages above, has been fixed in newer
ISC DHCP versions (e.g. 4.1.1b3). However, the possible starvation
issue has never really been resolved.
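
To make the starvation idea concrete, here is a toy model in Python. It
is only a sketch under stated assumptions and says nothing about how
dhcpd is actually structured: it assumes a single-threaded loop that
always handles pending client packets before failover messages, and a
fixed, hypothetical per-packet cost (for example, waiting on a disk
write). The function name and all numbers are made up for illustration.

```python
def failover_wait(udp_rate, service_time, burst_seconds):
    """Seconds a failover message arriving at t=0 waits before being handled.

    Toy model (assumption, not dhcpd's real design): one thread, client
    packets arrive evenly at udp_rate per second for burst_seconds, each
    costs service_time seconds, and the failover message is only handled
    once no client packet is pending.
    """
    now = 0.0
    packets = int(udp_rate * burst_seconds)
    for k in range(packets):
        arrival = k / udp_rate
        if arrival > now:
            # The loop went idle before the next client packet arrived,
            # so the failover message gets its turn here.
            return now
        # A client packet is pending: it is served first.
        now = max(now, arrival) + service_time
    # The loop never went idle during the burst.
    return now


if __name__ == "__main__":
    # 65 packets/second for 10 seconds, two hypothetical per-packet costs.
    for cost in (0.005, 0.020):
        wait = failover_wait(65, cost, 10)
        print(f"cost {cost * 1000:.0f} ms -> failover handled after {wait:.3f} s")
```

In this model, a per-packet cost below 1/65 second (about 15 ms) lets
the failover message through almost at once. A cost of 20 ms keeps the
loop busy for the whole burst, so the message waits roughly 13 seconds,
even though such a server would show almost no CPU load if the time is
spent waiting on disk.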

Steinar Haug, Nethelp consulting, sthaug at nethelp.no