Possible starvation of redundancy messages in failover pair?

Mon Aug 25 11:32:44 UTC 2008

We have two ISC DHCP 3.1.1 servers in a failover pair. Occasionally
these servers receive a rather high rate of client requests, and during
such episodes we have several times seen the primary and the backup
server lose their failover connection. Typical messages in the logs:

Aug 23 18:43:44 primary dhcpd: peer dhcp1-dhcp2: disconnected
Aug 23 18:43:44 primary dhcpd: failover peer dhcp1-dhcp2: I move from normal to communications-interrupted
Aug 23 18:44:04 primary dhcpd: failover: link startup timeout
Aug 23 18:44:29 primary dhcpd: failover: link startup timeout

Aug 23 18:44:14 backup dhcpd: peer dhcp1-dhcp2: disconnected
Aug 23 18:44:14 backup dhcpd: failover peer dhcp1-dhcp2: I move from normal to communications-interrupted
Aug 23 18:44:14 backup dhcpd: failover: link startup timeout
Aug 23 18:44:35 backup dhcpd: failover: link startup timeout

Having tried to dig into the code in omapi_one_dispatch(), which calls
the operating system's select() and processes the results, I am left
wondering what could prevent the following starvation scenario from
occurring:

- High rate of client requests, thus lots of UDP packets arriving on
the socket listening to UDP port 67.
- When omapi_one_dispatch() is called, there are UDP packet(s) ready
for reading/processing.
- Because there are so many UDP packets available, the TCP socket for
the redundancy connection isn't checked/read for a long time, and
this results in starvation/timeout.
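
To make the scenario concrete, here is a minimal sketch. It is not the
ISC code, and the function name ready_mask_after_flood() is made up for
illustration. A Unix-domain datagram socket stands in for the UDP port 67
socket and a Unix-domain stream socket stands in for the TCP failover
connection. It only shows that a single select() call reports both
descriptors as readable, even when many datagrams are queued. Whether
starvation can occur therefore depends on what the dispatcher does with
that result set, which is exactly what I am unsure about.

```c
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical stand-ins (assumptions, not dhcpd code):
 *   dgram[]  plays the role of the UDP socket on port 67,
 *   stream[] plays the role of the TCP failover connection.
 *
 * Returns a bitmask describing what one select() call reported:
 *   bit 0 set -> datagram socket readable
 *   bit 1 set -> stream socket readable
 * Returns -1 on error.
 */
int ready_mask_after_flood(int flood_count)
{
    int dgram[2], stream[2];
    int mask = -1;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, dgram) < 0)
        return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, stream) < 0) {
        close(dgram[0]);
        close(dgram[1]);
        return -1;
    }

    /* Queue many "client requests". MSG_DONTWAIT keeps the sender from
     * blocking once the receive queue is full; extra sends just fail. */
    const char req[] = "client-request";
    for (int i = 0; i < flood_count; i++)
        (void)send(dgram[1], req, sizeof req, MSG_DONTWAIT);

    /* Queue a single "failover message" on the stream socket. */
    const char msg[] = "failover-message";
    (void)send(stream[1], msg, sizeof msg, 0);

    /* One select() call covering both descriptors. */
    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(dgram[0], &rfds);
    FD_SET(stream[0], &rfds);
    int maxfd = dgram[0] > stream[0] ? dgram[0] : stream[0];
    struct timeval tv = { 1, 0 };   /* 1 second timeout */

    if (select(maxfd + 1, &rfds, NULL, NULL, &tv) >= 0) {
        mask = 0;
        if (FD_ISSET(dgram[0], &rfds))
            mask |= 1;
        if (FD_ISSET(stream[0], &rfds))
            mask |= 2;
    }

    close(dgram[0]);
    close(dgram[1]);
    close(stream[0]);
    close(stream[1]);
    return mask;
}
```

If a dispatcher handled only the first ready descriptor after each
select() call and then called select() again, the stream socket in this
sketch could wait as long as datagrams keep arriving. If it walks the
whole ready set each time, the stream socket gets serviced on every pass.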

It is entirely possible that I'm not understanding the code correctly,
so I'd like to ask whether this starvation scenario has been considered
and handled.

Steinar Haug, Nethelp consulting, sthaug at nethelp.no