Primary on failover pair gets "stuck" in partner state

Tue Aug 30 03:23:48 UTC 2011

I have a pair of servers running 3.1.3 in a failover relationship.
Recently we added new scopes and were testing changes to the
configuration, requiring multiple starts/restarts.  Each time, we
restarted the primary, waited for both servers to move to "normal",
then restarted the secondary.  After one of these cycles, the primary
came up fine and moved to a "normal" state.  The secondary came up and
reported that it was recovering (as is normal).  The problem is that the
secondary never finished recovering, even after 6 hours of waiting.
There was communication between the two, and the only change was the
addition of new scopes.

At one point, when the secondary was completely down, the primary was
still reporting that it had a local state of "Partner Down" but a
partner state of "Recovering".  Nothing done to the secondary caused the
states on the primary to change.
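
For anyone comparing notes, one way to see which state each server has
actually recorded is to read the failover state blocks that dhcpd writes
to its leases file.  The following is only a minimal sketch: it assumes
the usual 'failover peer "<name>" state { ... }' block layout, and the
SAMPLE text is invented for illustration (it is not taken from our
servers).  The function name last_failover_states is hypothetical.

```python
import re

# Matches one 'failover peer "<name>" state { ... }' block in a leases file.
STATE_BLOCK = re.compile(
    r'failover peer "(?P<name>[^"]+)" state \{(?P<body>.*?)\}', re.DOTALL)
# Matches 'my state <state> at <when>;' or 'partner state <state> at <when>;'.
STATE_LINE = re.compile(r'(my|partner) state ([\w-]+) at ([^;]+);')

# Invented sample data, only to show the assumed layout.
SAMPLE = '''
failover peer "failover-dhcp" state {
  my state normal at 2 2011/08/30 01:00:00;
  partner state recover at 2 2011/08/30 01:00:00;
}
failover peer "failover-dhcp" state {
  my state partner-down at 2 2011/08/30 02:00:00;
  partner state recover at 2 2011/08/30 01:00:00;
}
'''

def last_failover_states(leases_text):
    """Return {peer: {"my": (state, when), "partner": (state, when)}}.

    Later blocks overwrite earlier ones, so the result reflects the most
    recently recorded state for each peer name.
    """
    result = {}
    for block in STATE_BLOCK.finditer(leases_text):
        states = {who: (state, when.strip())
                  for who, state, when
                  in STATE_LINE.findall(block.group("body"))}
        result[block.group("name")] = states
    return result

if __name__ == "__main__":
    for peer, states in last_failover_states(SAMPLE).items():
        print(peer, states)
```

Running this against the leases file on each server (path depends on
the installation) would show whether the recorded "partner state" ever
changes while the other server is down.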

Bringing both servers down, then starting the primary, then the
secondary, fixed the problem.  It's the second time we've seen this and
we're not exactly sure why one server gets stuck in a particular state.
Time isn't an issue, as both servers are synchronized via NTP and their
clocks agree to within several microseconds.

We don't think it's a firewall or other communication problem (yes,
everyone says that, right?).  After both dhcpd processes were shut down
and then restarted, the two servers established a failover relationship
with no changes to the underlying network or iptables.

Oscar

Primary:

failover peer "failover-dhcp" {
        primary;
        address 192.168.100.34;
        port 520;
        peer address 192.168.101.34;
        peer port 520;
        max-response-delay 60;
        max-unacked-updates 10;
        mclt 120;
        split 255;
        load balance max seconds 5;
}

Secondary:

failover peer "failover-dhcp" {
        secondary;
        address 192.168.101.34;
        port 520;
        peer address 192.168.100.34;
        peer port 520;
        max-response-delay 60;
        max-unacked-updates 10;
        load balance max seconds 5;
}

Yes, I know the "split 255" statement is a little weird but we do this
to try and "prefer" the primary to facilitate troubleshooting.  The
servers are started/restarted several times each month as new networks
are defined and the split statement doesn't cause any issues.
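
To put a number on how strongly "split 255" prefers the primary, here
is a rough sketch of the arithmetic.  It assumes the load-balancing hash
sorts clients into 256 buckets and that the first "split" buckets go to
the primary; that is my understanding of the scheme, not something
confirmed from the server source.  The helper name primary_share is
hypothetical.

```python
def primary_share(split, buckets=256):
    """Fraction of hash buckets served by the primary.

    Assumption: the first `split` of `buckets` buckets belong to the
    primary and the remainder to the secondary.
    """
    if not 0 <= split <= buckets:
        raise ValueError("split must be between 0 and %d" % buckets)
    return split / buckets

if __name__ == "__main__":
    for split in (128, 255):
        print(split, primary_share(split))
    # prints:
    # 128 0.5
    # 255 0.99609375
```

Under that assumption the secondary still answers for one bucket in
256 during normal operation, so it is not completely idle.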