Nagios check_tcp kills failover, then DHCP stops answering.

Tue Dec 7 21:08:52 UTC 2010

The short:
The previous DHCP version (dhcp-4.1.1-P1) did not have this problem.
The newly installed version (dhcp-4.2.0-P1) has a failover failure problem.

Two servers, in failover mode. When Nagios runs a check_tcp
against the failover port (520), the failover mechanism logs
"no matching state", then times out, disconnects, and moves
to communications-interrupted.
After that, with the DHCP service still running, both the
primary and the secondary stop answering any requests.
Log snippets are below.

The long:
I built a new primary server on SLES11-sp1 x64;
the secondary server is SLES10-sp1 x64.
I installed the latest DHCP on both servers:
dhcpd --version
isc-dhcpd-4.2.0-P1.

The previous DHCP version did not have this problem (dhcp-4.1.1-P1).

Both servers run normally until the Nagios check_tcp check
probes the open failover port (520).
I narrowed it down to the Nagios check on that port because, while
first troubleshooting, I changed the port to 1520 and it ran great
all weekend (the upgrade was on Dec. 3). Then on Monday morning our
Nagios guy changed the check to the new port 1520, and failover
failed shortly after.
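
For anyone who wants to try this without Nagios: as far as I
understand it, a default check_tcp just opens a TCP connection to
the port and closes it again without sending anything. Below is a
minimal Python sketch of that kind of probe. The function name
probe_tcp and the timeout value are made up for illustration, and
the sketch assumes a bare connect-and-close is what the plugin
does; it is not the plugin's code.

```python
import socket


def probe_tcp(host, port, timeout=5.0):
    """Open a TCP connection to host:port and close it at once.

    No data is sent. Returns True if the connection was accepted,
    False if it was refused, timed out, or otherwise failed.
    """
    try:
        # create_connection performs the TCP handshake; leaving the
        # with-block closes the socket immediately afterwards.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling probe_tcp("10.120.11.85", 1520) from a host that is not the
failover peer would be one way to check, on a test pair, whether a
bare connect is enough to produce the "no matching state" message.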

It feels like this may be a bug.
For now, I've created an iptables rule to drop all packets to
that port from any host except the peer. This way, no port scan
or monitoring check can accidentally take the failover offline.
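
A rule pair of that kind can be sketched as follows. This is an
illustration only, not the exact rules in use: the helper name
failover_port_rules is hypothetical, the INPUT chain is an
assumption, and the addresses and port are taken from the
dhcpd.conf in this message as seen from the primary. The helper
only builds the iptables command lines; it does not run them.

```python
def failover_port_rules(peer_ip, port, chain="INPUT"):
    """Build iptables commands that allow only the peer on one port.

    Returns a list of argv lists. Order matters: the ACCEPT for the
    peer is appended before the DROP for everyone else.
    """
    port = str(port)
    return [
        # Let the failover peer reach the failover port.
        ["iptables", "-A", chain, "-p", "tcp",
         "-s", peer_ip, "--dport", port, "-j", "ACCEPT"],
        # Drop TCP to that port from any other source.
        ["iptables", "-A", chain, "-p", "tcp",
         "--dport", port, "-j", "DROP"],
    ]


# On the primary (10.120.11.85) the peer is the secondary.
for rule in failover_port_rules("10.120.11.107", 1520):
    print(" ".join(rule))
```

For the primary this prints:
iptables -A INPUT -p tcp -s 10.120.11.107 --dport 1520 -j ACCEPT
iptables -A INPUT -p tcp --dport 1520 -j DROP
The secondary would use the same pair with 10.120.11.85 as the peer.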

Here is the failover section of our dhcpd.conf on both servers:
(primary)
failover peer "failover" {
  primary; # declare this to be the primary server
  address 10.120.11.85;
  port 1520;
  peer address 10.120.11.107;
  peer port 1520;
  max-response-delay 30;
  max-unacked-updates 10;
  load balance max seconds 3;
  mclt 900;
#  mclt 604800;
  split 128;
}

(secondary)
failover peer "failover" {
  secondary; # declare secondary
  address 10.120.11.107;
  port 1520;
  peer address 10.120.11.85;
  peer port 1520;
  max-response-delay 30;
  max-unacked-updates 10;
  load balance max seconds 3;
}

Then this happens in the logs on both servers:
(primary)
Dec  6 10:11:29 cs99la59 dhcpd: DHCPREQUEST for 10.124.8.91 from 00:16:e6:05:9d:d6 (LKAD649) via eth0
Dec  6 10:11:29 cs99la59 dhcpd: DHCPACK on 10.124.8.91 to 00:16:e6:05:9d:d6 (LKAD649) via eth0
Dec  6 10:11:29 cs99la59 dhcpd: DHCPINFORM from 10.120.131.122 via 10.120.130.3
Dec  6 10:11:29 cs99la59 dhcpd: DHCPACK to 10.120.131.122 (00:1e:4f:57:7c:9e) via eth0
Dec  6 10:11:29 cs99la59 dhcpd: DHCPINFORM from 10.120.131.122 via 10.120.130.2
Dec  6 10:11:29 cs99la59 dhcpd: DHCPACK to 10.120.131.122 (00:1e:4f:57:7c:9e) via eth0
Dec  6 10:11:30 cs99la59 dhcpd: failover: listener: no matching state
Dec  6 10:11:58 cs99la59 dhcpd: timeout waiting for failover peer failover
Dec  6 10:11:58 cs99la59 dhcpd: peer failover: disconnected
Dec  6 10:11:58 cs99la59 dhcpd: failover peer failover: I move from normal to communications-interrupted

(secondary)
Dec  6 10:12:03 dss-dr93lv01-b dhcpd: DHCPINFORM from 10.114.4.203 via 10.114.7.254
Dec  6 10:12:03 dss-dr93lv01-b dhcpd: DHCPACK to 10.114.4.203 (00:08:74:d8:3a:13) via eth0
Dec  6 10:12:11 dss-dr93lv01-b dhcpd: uid lease 10.120.11.196 for client 52:41:43:e3:bc:02 is duplicate on 10.120.8.0/22
Dec  6 10:12:11 dss-dr93lv01-b dhcpd: DHCPREQUEST for 10.120.11.194 from 52:41:43:e3:bc:02 via eth0: lease owned by peer
Dec  6 10:12:11 dss-dr93lv01-b dhcpd: uid lease 10.120.11.198 for client 52:41:43:e3:bc:02 is duplicate on 10.120.8.0/22
Dec  6 10:12:11 dss-dr93lv01-b dhcpd: DHCPREQUEST for 10.120.11.194 from 52:41:43:e3:bc:02 via 10.120.11.244: lease owned by peer
Dec  6 10:12:14 dss-dr93lv01-b dhcpd: DHCPDISCOVER from 00:23:ae:2b:51:2e (53XT6K1) via 10.126.30.254
Dec  6 10:12:14 dss-dr93lv01-b dhcpd: failover: listener: no matching state
Dec  6 10:12:15 dss-dr93lv01-b dhcpd: DHCPOFFER on 10.126.30.81 to 00:23:ae:2b:51:2e (53XT6K1) via 10.126.30.254
Dec  6 10:12:24 dss-dr93lv01-b dhcpd: timeout waiting for failover peer failover
Dec  6 10:12:24 dss-dr93lv01-b dhcpd: peer failover: disconnected
Dec  6 10:12:24 dss-dr93lv01-b dhcpd: failover peer failover: I move from normal to communications-interrupted
Dec  6 10:58:52 dss-dr93lv01-b dhcpd: Wrote 0 deleted host decls to leases file.
Dec  6 10:58:52 dss-dr93lv01-b dhcpd: Wrote 0 new dynamic host decls to leases file.
Dec  6 10:58:52 dss-dr93lv01-b dhcpd: Wrote 4332 leases to leases file.
Dec  6 10:58:53 dss-dr93lv01-b dhcpd: DHCPREQUEST for 10.126.30.81 (10.120.11.107) from 00:23:ae:2b:51:2e (53XT6K1) via 10.126.30.254
Dec  6 10:58:53 dss-dr93lv01-b dhcpd: DHCPACK on 10.126.30.81 to 00:23:ae:2b:51:2e (53XT6K1) via 10.126.30.254
Dec  6 10:58:53 dss-dr93lv01-b dhcpd: DHCPDISCOVER from 00:16:e6:05:73:60 via 10.114.7.254
Dec  6 10:58:53 dss-dr93lv01-b dhcpd: failover: listener: no matching state
Dec  6 10:58:53 dss-dr93lv01-b dhcpd: Failover CONNECT from failover: time offset too large
Dec  6 10:58:53 dss-dr93lv01-b dhcpd: failover: disconnect: time offset too large
Dec  6 10:58:54 dss-dr93lv01-b dhcpd: DHCPOFFER on 10.114.6.122 to 00:16:e6:05:73:60 via 10.114.7.254
Dec  6 10:59:08 dss-dr93lv01-b dhcpd: failover: link startup timeout
Dec  6 10:59:08 dss-dr93lv01-b dhcpd: failover: link startup timeout

I've let it sit like this for anywhere from 30 minutes to 1 hour.
It never recovers by itself. It takes a manual restart of DHCP.

Help,
thanks
bb