Primary on failover pair gets "stuck" in partner state

Tue Aug 30 03:45:17 UTC 2011

Does anything show up in the logs on either server? That is usually
the first place to look for odd failover behavior.
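
If the logs are large, something like the rough sketch below can pull
out just the failover state changes. It assumes dhcpd logs transitions
as "failover peer NAME: I move from X to Y" and "failover peer NAME:
peer moves from X to Y"; check that against what your version actually
writes. The function name is made up for illustration.

```python
import re

# Assumed message shape (verify against your own dhcpd logs):
#   failover peer failover-dhcp: I move from recover to normal
#   failover peer failover-dhcp: peer moves from normal to recover
TRANSITION = re.compile(
    r"failover peer (?P<peer>[^:]+): (?P<who>I move|peer moves) "
    r"from (?P<old>\S+) to (?P<new>\S+)"
)


def failover_transitions(lines):
    """Return (peer, side, old_state, new_state) for each matching line."""
    found = []
    for line in lines:
        match = TRANSITION.search(line)
        if match is None:
            continue  # not a failover state-change message
        # "I move" is this server's own state; "peer moves" is the partner's.
        side = "local" if match.group("who") == "I move" else "partner"
        found.append(
            (match.group("peer"), side, match.group("old"), match.group("new"))
        )
    return found
```

Feed it the lines from wherever syslog puts dhcpd output on each box
and compare the last "local" and "partner" entries on the primary with
those on the secondary.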

On Mon, Aug 29, 2011 at 9:23 PM, Oscar Ricardo Silva
<oscars at mail.utexas.edu> wrote:
> I have a pair of servers running 3.1.3 in a failover relationship.
> Recently we added new scopes and were testing changes to the
> configuration, requiring multiple starts/restarts.  Each time, we
> restarted the primary, waited for both servers to move to normal,
> then restarted the secondary.  After one of these cycles, the primary
> came up fine and moved to a "normal" state.  The secondary came up and
> reported that it was recovering (as is normal).  The problem is that the
> secondary never recovered, even after several hours (6 hours of
> waiting).  There was communication between the two and the only change
> was the addition of new scopes.
>
> At one point, when the secondary was completely down, the primary was
> still reporting that it had a local state of "Partner Down" but a
> partner state of "Recovering".  Nothing done to the secondary caused the
> states on the primary to change.
>
> Bringing both servers down, then starting the primary, then the secondary
> fixed the problem.  It's the second time we've seen this and we're not
> exactly sure why one server gets stuck in a particular state.  Time
> isn't an issue as both servers are updated via NTP and they have the
> same time down to several microseconds.
>
> We don't think it's a firewall or other communication problem (yes,
> everyone says that, right?).  After both dhcpd processes were shut down
> then restarted, the two servers established a failover relationship with
> no changes to the underlying network, or iptables.
>
>
>
>
> Oscar
>
>
> Primary:
>
> failover peer "failover-dhcp" {
>         primary;
>         address 192.168.100.34;
>         port 520;
>         peer address 192.168.101.34;
>         peer port 520;
>         max-response-delay 60;
>         max-unacked-updates 10;
>         mclt 120;
>         split 255;
>         load balance max seconds 5;
>       }
>
>
> Secondary:
>
> failover peer "failover-dhcp" {
>         secondary;
>         address 192.168.101.34;
>         port 520;
>         peer address 192.168.100.34;
>         peer port 520;
>         max-response-delay 60;
>         max-unacked-updates 10;
>         load balance max seconds 5;
>       }
>
>
>
> Yes, I know the "split 255" statement is a little weird but we do this
> to try and "prefer" the primary to facilitate troubleshooting.  The
> servers are started/restarted several times each month as new networks
> are defined and the split statement doesn't cause any issues.
>

-- 
Jason Gerfen
jason.gerfen at gmail.com

http://www.github.com/jas-
http://phpdhcpadmin.sourceforge.net