DHCP4 failover pair hanging after startup

Wed Jul 16 11:32:43 UTC 2008

Hello,

Until recently we were running DHCP3 without a failover setup. Since 
moving to a failover pair, we have started to see several errors, the 
most worrying of which I will describe here.

We have a large dhcpd.conf which contains every host we wish to allow 
allocations for, with almost every pool configured with 'deny unknown 
clients'. Because of this, we have to restart dhcpd regularly (by means 
of a script) in order to apply changes to dhcpd.conf (e.g. adding new 
hosts to the config file).
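
Since every change means restarting the daemon, a restart step of this 
kind could check the new file before touching the running server. The 
following is only a minimal sketch and not our actual script: it assumes 
dhcpd's -t (test the configuration and exit) and -cf flags, and the 
config path and restart command are placeholders that depend on the 
install and OS.

```python
import subprocess

DHCPD = "/usr/local/sbin/dhcpd"                  # binary path as used elsewhere in this post
CONF = "/etc/dhcpd.conf"                         # assumed location; adjust to your install
RESTART_CMD = ["/etc/init.d/dhcpd", "restart"]   # hypothetical; depends on the OS


def safe_reload(conf=CONF, run=subprocess.run):
    """Restart dhcpd only if the configuration parses cleanly.

    Returns False if the syntax check failed (nothing is restarted),
    True once the restart command has completed.
    """
    # -t makes dhcpd parse the configuration and exit without serving.
    check = run([DHCPD, "-t", "-cf", conf])
    if check.returncode != 0:
        return False
    # check=True raises CalledProcessError if the restart itself fails.
    run(RESTART_CMD, check=True)
    return True
```

Calling safe_reload() from the existing script would leave the running 
daemon alone whenever the new file fails to parse.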

On some occasions, very soon after starting up, one or both of the 
servers will stop processing incoming requests. The daemon is still 
running and does perform "housekeeping" activities (removing ddns 
entries). However, the log shows no errors, and nothing suggests there is 
a problem other than that the flurry of DHCP requests has stopped.
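
Because the daemon stays up and logs no error, the only visible symptom 
is the absence of client traffic in the log. A rough way to detect that 
condition is sketched below. It assumes a classic syslog line layout 
("Jul 16 11:32:43 host dhcpd: MESSAGE", with an optional [pid]) and a 
two-minute quiet threshold; the function names and the threshold are 
illustrative choices of mine, not anything dhcpd provides.

```python
import re
from datetime import datetime, timedelta

# Assumed syslog layout: "Jul 16 11:32:43 host dhcpd[pid]: MESSAGE"
LINE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ dhcpd(?:\[\d+\])?: (.*)$")

# Messages that indicate client traffic is being processed.
CLIENT_PREFIXES = ("DHCPDISCOVER", "DHCPOFFER", "DHCPREQUEST", "DHCPACK", "DHCPNAK")


def last_client_activity(lines, year):
    """Return the datetime of the newest client-traffic line, or None."""
    latest = None
    for line in lines:
        m = LINE.match(line.rstrip("\n"))
        if not m or not m.group(2).startswith(CLIENT_PREFIXES):
            continue  # housekeeping or unrelated line
        # Classic syslog timestamps carry no year, so the caller supplies it.
        when = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
        if latest is None or when > latest:
            latest = when
    return latest


def is_stalled(lines, now, year, quiet=timedelta(minutes=2)):
    """True if no client traffic was logged within `quiet` before `now`."""
    latest = last_client_activity(lines, year)
    return latest is None or now - latest > quiet
```

On a busy network a short threshold is reasonable; a quiet site would 
need a longer one to avoid false alarms.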

I have saved the state to a trace file and played it back, which gives a 
more meaningful error:

---------------------START DEBUG OUTPUT--------------------------

[root@coreserv2 /]# /usr/local/sbin/dhcpd -play /home/netadmin/dhcpd.trace -lf /home/netadmin/dhcpd.leases -d
Internet Systems Consortium DHCP Server 4.0.0
Copyright 2004-2007 Internet Systems Consortium.
All rights reserved.
For info, please visit http://www.isc.org/sw/dhcp/
Wrote 0 class decls to leases file.
Wrote 0 deleted host decls to leases file.
Wrote 0 new dynamic host decls to leases file.
Wrote 0 leases to leases file.
Wrote 0 class decls to leases file.
Wrote 0 deleted host decls to leases file.
Wrote 0 new dynamic host decls to leases file.
Wrote 39817 leases to leases file.
failover peer coreserv: I move from communications-interrupted to startup
Listening on Trace/bge0/05:00:00:00:31:00:00:00:03:00:00:00:00:00:00:00/143.167.1.0/24
Sending   on Trace/bge0/05:00:00:00:31:00:00:00:03:00:00:00:00:00:00:00/143.167.1.0/24
Listening on Trace/fallback/
message length wait: unexpected error
peer coreserv: disconnected
failover peer coreserv: I move from startup to communications-interrupted
Read packet type inpacket when expecting mr-randomid
trace_mr_statp: no statp packet found.
Read packet type inpacket when expecting mr-statp
trace_mr_statp: no statp packet found.
Read packet type inpacket when expecting mr-output
trace_mr_recvfrom: no input found.
Unable to add forward map from h-0-16-6f-b-48-7b.ddns.shef.ac.uk to 172.30.14.70: connection refused
DHCPREQUEST for 172.30.14.70 from 00:16:6f:0b:48:7b (your-3ee1f19287) via 172.30.12.253
DHCPACK on 172.30.14.70 to 00:16:6f:0b:48:7b (your-3ee1f19287) via 172.30.12.253
Read packet type connection-input when expecting mr-statp
trace_mr_statp: no statp packet found.
Read packet type connection-input when expecting mr-output
trace_mr_recvfrom: no input found.
Unable to add forward map from h-0-16-6f-b-48-7b.ddns.shef.ac.uk to 172.30.14.70: connection refused
DHCPREQUEST for 172.30.14.70 from 00:16:6f:0b:48:7b (your-3ee1f19287) via 172.30.12.252
DHCPACK on 172.30.14.70 to 00:16:6f:0b:48:7b (your-3ee1f19287) via 172.30.12.252
Impossible case at failover.c:591.

If you did not get this software from ftp.isc.org, please
get the latest from ftp.isc.org and install that before
requesting help.

If you did get this software from ftp.isc.org and have not
yet read the README, please read it before requesting help.
If you intend to request help from the dhcp-server@isc.org
mailing list, please read the section on the README about
submitting bug reports and requests for help.

Please do not under any circumstances send requests for
help directly to the authors of this software - please
send them to the appropriate mailing list as described in
the README file.

exiting.

---------------------END DEBUG OUTPUT--------------------------
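
The parts of the replay output that matter most are the two failover 
state changes and the final "Impossible case" line. For anyone wanting to 
compare several replays, a small sketch for pulling those out of a saved 
log follows. The function name summarize is my own, and the patterns 
assume exactly the message wording shown in the output above.

```python
import re

# Patterns based on the message wording in the replay output above.
MOVE = re.compile(r"^failover peer (\S+): I move from (\S+) to (\S+)$")
FATAL = re.compile(r"^Impossible case at (\S+):(\d+)\.$")


def summarize(lines):
    """Return (transitions, fatal) extracted from dhcpd output lines.

    transitions: list of (old_state, new_state) tuples, in log order.
    fatal: (source_file, line_number) of the last "Impossible case"
           message, or None if there was none.
    """
    transitions = []
    fatal = None
    for raw in lines:
        line = raw.strip()
        m = MOVE.match(line)
        if m:
            transitions.append((m.group(2), m.group(3)))
            continue
        m = FATAL.match(line)
        if m:
            fatal = (m.group(1), int(m.group(2)))
    return transitions, fatal
```

Feeding it the lines of a saved replay (for example, 
summarize(open(path)) with path pointing at the captured output) gives 
the state history and the failing source location in one place.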

We have also seen this happen when running just one of the two servers, 
which rules out spurious messages sent from the peer as the cause. Note 
that the time between starting up and failing is somewhat variable, but 
it is always within one minute. During this first minute, all appears to 
be working correctly (as can be seen from the two processed requests 
before the failure).

The failover section of the configuration looks like this:

failover peer "coreserv" {
  secondary;
  address 143.167.1.11;
  port 647;
  peer address 143.167.251.11;
  peer port 647;
  max-response-delay 60;
  max-unacked-updates 10;
  load balance max seconds 3;
}

I can provide more config if it would be helpful.

As I mentioned, we are encountering other problems as well. I will 
capture debug information and report these in due course where necessary.

Regards,
Tom Griffin
Data Network Administrator
The University of Sheffield