Failure of one DHCP server causes the second one to fail as well

Fri Nov 7 19:08:32 UTC 2014

I'm trying to set up two servers to use the DHCP failover protocol.
I've configured both of them in a similar way, and they work. One of the servers
has the IP address 192.168.1.1 (primary), the other 192.168.1.2 (secondary).

Here's the startup log from the primary server:

Nov  7 18:52:04 the-mountain dhcpd: Wrote 0 deleted host decls to leases file.
Nov  7 18:52:04 the-mountain dhcpd: Wrote 0 new dynamic host decls to leases file.
Nov  7 18:52:04 the-mountain dhcpd: Wrote 0 leases to leases file.
Nov  7 18:52:04 the-mountain dhcpd:
Nov  7 18:52:04 the-mountain dhcpd: No subnet declaration for eth0 (10.1.20.140).
Nov  7 18:52:04 the-mountain dhcpd: ** Ignoring requests on eth0.  If this is not what
Nov  7 18:52:04 the-mountain dhcpd:    you want, please write a subnet declaration
Nov  7 18:52:04 the-mountain dhcpd:    in your dhcpd.conf file for the network segment
Nov  7 18:52:04 the-mountain dhcpd:    to which interface eth0 is attached. **
Nov  7 18:52:04 the-mountain dhcpd:
Nov  7 18:52:04 the-mountain dhcpd: failover peer dhcp-failover: I move from recover to startup

At this point I started the secondary server (the log is still from the primary server):

Nov  7 18:52:06 the-mountain dhcpd: failover peer dhcp-failover: peer moves from unknown-state to recover
Nov  7 18:52:06 the-mountain dhcpd: failover peer dhcp-failover: requesting full update from peer
Nov  7 18:52:06 the-mountain dhcpd: failover peer dhcp-failover: I move from startup to recover
Nov  7 18:52:06 the-mountain dhcpd: Sent update request all message to dhcp-failover
Nov  7 18:52:06 the-mountain dhcpd: failover peer dhcp-failover: peer moves from recover to recover
Nov  7 18:52:06 the-mountain dhcpd: failover peer dhcp-failover: requesting full update from peer
Nov  7 18:52:06 the-mountain dhcpd: Update request all from dhcp-failover: sending update
Nov  7 18:52:06 the-mountain dhcpd: failover peer dhcp-failover: peer update completed.
Nov  7 18:52:06 the-mountain dhcpd: failover peer dhcp-failover: I move from recover to recover-done
Nov  7 18:52:06 the-mountain dhcpd: Sent update done message to dhcp-failover
Nov  7 18:52:06 the-mountain dhcpd: failover peer dhcp-failover: peer moves from recover to recover-done
Nov  7 18:52:06 the-mountain dhcpd: Both servers have entered recover-done!
Nov  7 18:52:06 the-mountain dhcpd: failover peer dhcp-failover: I move from recover-done to normal
Nov  7 18:52:06 the-mountain dhcpd: balancing pool 9e45f0 192.168.1.0/24  total 91  free 91  backup 0  lts 45  max-own (+/-)9
Nov  7 18:52:06 the-mountain dhcpd: balanced pool 9e45f0 192.168.1.0/24  total 91  free 46  backup 45  lts 0  max-misbal 14
Nov  7 18:52:06 the-mountain dhcpd: Sending updates to dhcp-failover.
Nov  7 18:52:06 the-mountain dhcpd: failover peer dhcp-failover: peer moves from recover-done to normal

So it looks like everything went according to plan -- both servers are up and working.
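
To make the sequence of failover states easier to follow, here is a minimal Python sketch that pulls the state transitions out of dhcpd log lines. It is not part of the setup; the regular expression is only inferred from the "I move from ... to ..." and "peer moves from ... to ..." lines shown in this post, and the names TRANSITION and transitions are made up for illustration.

```python
import re

# Pattern inferred from the log lines in this post, e.g.
#   "failover peer dhcp-failover: I move from recover to startup"
#   "failover peer dhcp-failover: peer moves from recover to recover-done"
TRANSITION = re.compile(
    r"failover peer (?P<name>\S+): (?P<who>I|peer) moves? from "
    r"(?P<old>[\w-]+) to (?P<new>[\w-]+)"
)

def transitions(lines):
    """Return (who, old_state, new_state) for every state change found."""
    found = []
    for line in lines:
        match = TRANSITION.search(line)
        if match:
            found.append((match.group("who"), match.group("old"), match.group("new")))
    return found

# Three lines copied from the primary server's log.
SAMPLE = [
    "Nov  7 18:52:06 the-mountain dhcpd: failover peer dhcp-failover: I move from recover-done to normal",
    "Nov  7 18:52:06 the-mountain dhcpd: failover peer dhcp-failover: peer moves from recover-done to normal",
    "Nov  7 18:52:06 the-mountain dhcpd: failover peer dhcp-failover: peer update completed.",
]

for who, old, new in transitions(SAMPLE):
    print(f"{who:4} {old} -> {new}")
```

Lines that mention the failover peer without a state change, such as "peer update completed.", are skipped.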

The following log is from the client machine:

Nov  7 19:01:52 red-viper dhclient: DHCPDISCOVER on wlan1 to 255.255.255.255 port 67 interval 14
Nov  7 19:01:53 red-viper dhclient: DHCPREQUEST on wlan1 to 255.255.255.255 port 67
Nov  7 19:01:53 red-viper dhclient: DHCPOFFER from 192.168.1.1
Nov  7 19:01:53 red-viper dhclient: DHCPACK from 192.168.1.1
Nov  7 19:01:53 red-viper dhclient: bound to 192.168.1.205 -- renewal in 27 seconds.
Nov  7 19:02:20 red-viper dhclient: DHCPREQUEST on wlan1 to 192.168.1.1 port 67
Nov  7 19:02:20 red-viper dhclient: DHCPACK from 192.168.1.1
Nov  7 19:02:20 red-viper dhclient: bound to 192.168.1.205 -- renewal in 27 seconds.
Nov  7 19:02:47 red-viper dhclient: DHCPREQUEST on wlan1 to 192.168.1.1 port 67
Nov  7 19:02:47 red-viper dhclient: DHCPACK from 192.168.1.1
Nov  7 19:02:47 red-viper dhclient: bound to 192.168.1.205 -- renewal in 25 seconds.

It looks like leases also work as expected.
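
The renewal intervals in the client log sit close to half of the 60-second lease time set in dhcpd.conf, which is what the default DHCP renewal timer (T1 = 0.5 x lease time) would suggest. A small Python sketch of that check, assuming the "renewal in N seconds" wording shown above; RENEWAL and renewal_delays are illustrative names only:

```python
import re

LEASE_TIME = 60  # seconds; default-lease-time / max-lease-time in dhcpd.conf

RENEWAL = re.compile(r"renewal in (\d+) seconds")

def renewal_delays(lines):
    """Collect the renewal delays (in seconds) announced by dhclient."""
    delays = []
    for line in lines:
        match = RENEWAL.search(line)
        if match:
            delays.append(int(match.group(1)))
    return delays

# Lines copied from the client log in this post.
CLIENT_LOG = [
    "Nov  7 19:01:53 red-viper dhclient: bound to 192.168.1.205 -- renewal in 27 seconds.",
    "Nov  7 19:02:20 red-viper dhclient: bound to 192.168.1.205 -- renewal in 27 seconds.",
    "Nov  7 19:02:47 red-viper dhclient: bound to 192.168.1.205 -- renewal in 25 seconds.",
]

for delay in renewal_delays(CLIENT_LOG):
    # Show each delay as a fraction of the configured lease time.
    print(f"{delay} s = {delay / LEASE_TIME:.2f} of the lease time")
```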

Then I simulated a failure of the secondary server -- I just turned it off. The client log around that time:

Nov  7 19:05:07 red-viper dhclient: DHCPACK from 192.168.1.1
Nov  7 19:05:07 red-viper dhclient: bound to 192.168.1.205 -- renewal in 30 seconds.
Nov  7 19:05:37 red-viper dhclient: DHCPREQUEST on wlan1 to 192.168.1.1 port 67
Nov  7 19:05:38 red-viper dhclient: DHCPACK from 192.168.1.1
Nov  7 19:05:38 red-viper dhclient: bound to 192.168.1.205 -- renewal in 32 seconds.

As you can see, the primary server keeps working as it should for a while, but about a minute after
the failure it notices that the other server is gone (log from the primary server):

Nov  7 19:06:10 the-mountain dhcpd: timeout waiting for failover peer dhcp-failover 
Nov  7 19:06:10 the-mountain dhcpd: peer dhcp-failover: disconnected 
Nov  7 19:06:10 the-mountain dhcpd: failover peer dhcp-failover: I move from normal to communications-interrupted 
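
The delay of about a minute is consistent with "max-response-delay 60" in the failover declaration. Here is a minimal Python sketch of that arithmetic, assuming the timeout fired exactly max-response-delay seconds after the last message from the secondary. Syslog timestamps carry no year, so 2014 is taken from the date of this post, and parse_syslog_time is just an illustrative helper name.

```python
from datetime import datetime, timedelta

MAX_RESPONSE_DELAY = timedelta(seconds=60)  # max-response-delay in dhcpd.conf

def parse_syslog_time(stamp, year=2014):
    """Parse a syslog-style 'Nov  7 19:06:10' stamp; the year is an assumption."""
    normalized = " ".join(stamp.split())  # collapse the padding before the day
    return datetime.strptime(f"{year} {normalized}", "%Y %b %d %H:%M:%S")

# Time of "timeout waiting for failover peer dhcp-failover" on the primary.
timeout_at = parse_syslog_time("Nov  7 19:06:10")

# Estimated time of the last message received from the secondary.
last_contact = timeout_at - MAX_RESPONSE_DELAY
print(last_contact)
```

Under that assumption the secondary went silent at about 19:05:10, which matches the time it was switched off.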

And after this, the client can't get a lease anymore:

Nov  7 19:06:10 red-viper dhclient: DHCPREQUEST on wlan1 to 192.168.1.1 port 67
Nov  7 19:06:15 red-viper dhclient: DHCPREQUEST on wlan1 to 192.168.1.1 port 67
Nov  7 19:06:27 red-viper dhclient: DHCPREQUEST on wlan1 to 192.168.1.1 port 67
Nov  7 19:06:39 red-viper dhclient: DHCPDISCOVER on wlan1 to 255.255.255.255 port 67 interval 8
Nov  7 19:06:47 red-viper dhclient: DHCPDISCOVER on wlan1 to 255.255.255.255 port 67 interval 13
Nov  7 19:07:00 red-viper dhclient: DHCPDISCOVER on wlan1 to 255.255.255.255 port 67 interval 10
Nov  7 19:07:10 red-viper dhclient: DHCPDISCOVER on wlan1 to 255.255.255.255 port 67 interval 12
Nov  7 19:07:22 red-viper dhclient: DHCPDISCOVER on wlan1 to 255.255.255.255 port 67 interval 18
Nov  7 19:07:40 red-viper dhclient: No DHCPOFFERS received.
Nov  7 19:07:40 red-viper dhclient: No working leases in persistent database - sleeping.

That's because the dhcpd process on the first server died, so at this point there's no DHCP server on the network.
I don't get it -- there were two DHCP servers working in failover mode, and when one of them went
offline, the second one also refused to work and died. :)

I don't think this is how it should work, so what did I do wrong?

Both servers run version 4.2.4.

Below are the two config files, one for each server:

------------------------------------------------------------------------
#
# /etc/dhcpd.conf for primary DHCP server
#

authoritative;
ddns-update-style none;

failover peer "dhcp-failover" {
	primary;
	address 192.168.1.1;
	port 520;
	peer address 192.168.1.2;
	peer port 519;
	max-response-delay 60;
	max-unacked-updates 10;
	mclt 3600;
	split 128;
	load balance max seconds 3;
}

default-lease-time 60;
min-lease-time 60;
max-lease-time 60;

subnet 192.168.1.0 netmask 255.255.255.0 {
	option routers 192.168.1.1;
	option subnet-mask 255.255.255.0;
	option broadcast-address 192.168.1.255;
	option domain-name "mhouse.lh";
	option domain-name-servers 192.168.1.1;
#	option ntp-servers 192.168.1.1;
	always-broadcast true;

	pool {
		failover peer "dhcp-failover";
		default-lease-time 60;
		min-lease-time 60;
		max-lease-time 60;
		range 192.168.1.160 192.168.1.250;
	}
}

subnet 10.1.0.0 netmask 255.255.0.0 {
}

group {
	use-host-decl-names on;

	host the-hound {
#		option host-name "the-hound";
		hardware ethernet 3c:4a:92:00:4c:5b;
		fixed-address 192.168.1.150;
	}

	host samknows {
#		option host-name "samknows";
		hardware ethernet e8:94:f6:c4:00:2a;
		fixed-address 192.168.1.20;
	}
}
------------------------------------------------------------------------
------------------------------------------------------------------------
#
# /etc/dhcpd.conf for secondary DHCP server
#

authoritative;
ddns-update-style none;

failover peer "dhcp-failover" {
	secondary;
	address 192.168.1.2;
	port 519;
	peer address 192.168.1.1;
	peer port 520;
	max-response-delay 60;
	max-unacked-updates 10;
	load balance max seconds 3;
}

default-lease-time 60;
min-lease-time 60;
max-lease-time 60;

subnet 192.168.1.0 netmask 255.255.255.0 {
	option routers 192.168.1.1;
	option subnet-mask 255.255.255.0;
	option broadcast-address 192.168.1.255;
	option domain-name "mhouse.lh";
	option domain-name-servers 192.168.1.1;
#	option ntp-servers 192.168.1.1;
	always-broadcast true;

	pool {
		failover peer "dhcp-failover";
		default-lease-time 60;
		min-lease-time 60;
		max-lease-time 60;
		range 192.168.1.160 192.168.1.250;
	}
}

subnet 10.1.0.0 netmask 255.255.0.0 {
}

group {
	use-host-decl-names on;

	host the-hound {
#		option host-name "the-hound";
		hardware ethernet 3c:4a:92:00:4c:5b;
		fixed-address 192.168.1.150;
	}

	host samknows {
#		option host-name "samknows";
		hardware ethernet e8:94:f6:c4:00:2a;
		fixed-address 192.168.1.20;
	}
}
------------------------------------------------------------------------
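
As a sanity check on the pool definition in both files, the range 192.168.1.160 to 192.168.1.250 should account for the "total 91" reported in the balancing log lines. This Python sketch uses only the standard ipaddress module; the variable names are illustrative.

```python
import ipaddress

# Range taken from the pool declaration in both config files.
first = ipaddress.IPv4Address("192.168.1.160")
last = ipaddress.IPv4Address("192.168.1.250")

# Both ends of a dhcpd range are inclusive.
total = int(last) - int(first) + 1
print(total)  # 91

# Values from the "balanced pool" log line on the primary.
free, backup = 46, 45
print(free + backup == total)  # True
```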