Failover peer separation revisited

sthaug at nethelp.no sthaug at nethelp.no
Tue Nov 18 22:40:07 UTC 2008


> Everything appears to be working normally, except at random times our  
> DHCP servers appear to just disconnect from each other and they NEVER  
> reconnect.  It can take just a few days or it can take a few weeks for  
> this problem to creep up.  At first we thought maybe this was a  
> problem on the network, but everything has been verified.  Ports are  
> good, good NIC's, cabling, no errors on interfaces, no other network  
> problems reported by any other application or the switch itself.  The  
> systems are directly connected to a Cisco 6509 via 1000BaseT.
> 
> So the two questions are, why are they disconnecting?  The primary  
> server sees it as a timeout, and secondly why are they not attempting  
> to reconnect?

I have a reproducible case here where hosts in a failover pair never
reconnect after a break in the network between them. Configuration is
included at the end of this message.

What I do to reproduce the problem (no DHCP clients needed):

- Configure two hosts in a failover pair, on different network segments
and thus connected by router(s). Start dhcpd on primary and secondary.
Follow the logs with tail -f and observe that both move to state normal.
Run tcpdump or a similar sniffer on both hosts and observe that primary
and secondary have full communication.

- Configure an access list on a suitable router that blocks all the
failover traffic between primary and secondary (TCP port 519 in the
configuration below). Observe in the logs that the hosts move to state
communication-interrupted after a while.

- Watch the traffic on both hosts with the packet sniffer. Observe that
when the hosts move to state communication-interrupted they start
sending TCP SYNs (definitely expected) - but around 3 minutes after
the traffic has been blocked, the hosts send their last packets toward
each other.

- After this time the traffic stops completely, and the hosts never
try to reconnect again (as seen by the packet sniffer on both hosts).
Obviously, removing the access list which blocks the traffic doesn't
help as long as the hosts don't try to reconnect.

- Removing the access list and *restarting* the dhcpd process on
either primary or secondary (doesn't matter which one) brings the
situation back to normal again.

I suggest you run through such a scenario several times, to convince
yourself that the hosts don't actually try to reconnect any more after
around 3 minutes have passed.
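
When repeating the test it can help to separate "the path is still
blocked" from "the path is open but dhcpd has stopped trying". The
following is only a minimal sketch of such a check, not part of dhcpd:
it attempts a plain TCP connect to the peer's failover port. The peer
address and port are taken from the configuration at the end of this
message; the helper name peer_port_reachable and the 5 second timeout
are assumptions made for this example. Run it on one of the failover
hosts so that the access list sees the same source address.

```python
import socket

# Values from the dhcpd.conf at the end of this message. The helper
# name and the default timeout are assumptions made for this sketch.
PEER_ADDRESS = "193.75.4.130"   # secondary, as seen from the primary
PEER_PORT = 519                 # failover port from dhcpd.conf


def peer_port_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # The connection is closed again as soon as it is established.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    if peer_port_reachable(PEER_ADDRESS, PEER_PORT):
        print("path to peer is open")
    else:
        print("path to peer is blocked or peer is not listening")
```

If this reports an open path while the sniffer still shows no SYNs
from dhcpd itself, the hosts are in the stuck state described above.
Note that the probe opens and immediately closes a connection without
speaking the failover protocol, so the peer dhcpd may log it.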

Obviously, this is a serious error for anybody running ISC dhcpd in a
failover scenario, like we do. We have been bitten by this more than
once.

This failure scenario, with network break simulated using access lists,
has been tested on FreeBSD 7.0 and Linux with a 2.6.18 kernel, with ISC
dhcpd versions 3.0.7, 3.1.2b1 and 4.1.0b1. The behavior is identical in
all cases - after a network break of about 3 minutes, the primary and
the secondary in a failover pair never try to reconnect again, even if
the network between them is fixed.

Steinar Haug, Nethelp consulting, sthaug at nethelp.no
----------------------------------------------------------------------
primary (IP 193.75.110.74/30) dhcpd.conf:

max-lease-time 86400;
default-lease-time 86400;
min-lease-time 43200;
option domain-name-servers 193.75.75.75, 193.75.75.193;
option domain-name "bluecom.no";
option ntp-servers 193.71.1.10, 193.71.1.20;
ddns-update-style none;
authoritative;
ping-check false;
deny bootp;

failover peer "prim-sec" {
	primary;
	address 193.75.110.74;
	port 519;
	peer address 193.75.4.130;
	peer port 519;
	max-response-delay 60;
	max-unacked-updates 10;
	load balance max seconds 5;
	mclt 3600;
	split 128;
}

subnet 193.75.110.72 netmask 255.255.255.252 {}

subnet 81.191.97.0 netmask 255.255.255.0 {
	option routers 81.191.97.1;
	option broadcast-address 81.191.97.255;
	pool {
		failover peer "prim-sec";
		deny dynamic bootp clients;
		range 81.191.97.2 81.191.97.254;
	}
}

----------------------------------------------------------------------
secondary (IP 193.75.4.130/30) dhcpd.conf:

max-lease-time 86400;
default-lease-time 86400;
min-lease-time 43200;
option domain-name-servers 193.75.75.75, 193.75.75.193;
option domain-name "bluecom.no";
option ntp-servers 193.71.1.10, 193.71.1.20;
ddns-update-style none;
authoritative;
ping-check false;
deny bootp;

failover peer "prim-sec" {
	secondary;
	address 193.75.4.130;
	port 519;
	peer address 193.75.110.74;
	peer port 519;
	max-response-delay 60;
	max-unacked-updates 10;
	load balance max seconds 5;
}

subnet 193.75.4.128 netmask 255.255.255.252 {}

subnet 81.191.97.0 netmask 255.255.255.0 {
	option routers 81.191.97.1;
	option broadcast-address 81.191.97.255;
	pool {
		failover peer "prim-sec";
		deny dynamic bootp clients;
		range 81.191.97.2 81.191.97.254;
	}
}


