Secondary server in failover fails to come out of recover state

Tue Apr 30 19:41:35 UTC 2013

I should mention that we've written a patch to use the MAC address as 
the identifier instead of the client identifier.  We've done this so 
that a device will have the same identifier no matter what operating 
system it boots into.  This matters for multi-boot devices or when a
device boots into a PXE-boot environment, and that's why you'll see this
line in the configuration:

	key-off-mac-address true;

Operating system versions:

primary:  RHEL 6.2, kernel 2.6.32-220.7.1.el6.i686
secondary: RHEL 6.3, kernel 2.6.32-279.19.1.el6.i686

Primary server:

option domain-name-servers 192.168.50.41, 192.168.50.40 ;
option ntp-servers 192.168.50.40, 192.168.50.41;
default-lease-time 172800;
max-lease-time 172800;
one-lease-per-client true;
ddns-update-style ad-hoc;
ddns-updates off;
authoritative;
key-off-mac-address true;
if substring (option dhcp-client-identifier, 0, 5) = 01:52:41:53:20 {
         deny booting;
}
option voip-tftp-server-address code 150 = array of ip-address ;
set vendor-string = option vendor-class-identifier;
failover peer "dhcp" {
          primary;
          address 192.168.100.2;
          port 520;
          peer address 192.168.101.2;
          peer port 520;
          max-response-delay 60;
          max-unacked-updates 10;
          mclt 300;
          split 255;
          load balance max seconds 5;
        }
subnet 192.168.100.0 netmask 255.255.255.224 {
	}

Secondary:

option domain-name-servers  192.168.50.41,  192.168.50.40 ;
option ntp-servers  192.168.50.40,  192.168.50.41;
default-lease-time 172800;
max-lease-time 172800;
one-lease-per-client true;
ddns-update-style ad-hoc;
ddns-updates off;
authoritative;
key-off-mac-address true;
if substring (option dhcp-client-identifier, 0, 5) = 01:52:41:53:20 {
         deny booting;
}
option voip-tftp-server-address code 150 = array of ip-address ;
set vendor-string = option vendor-class-identifier;
failover peer "dhcp" {
          secondary;
          address 192.168.101.2;
          port 520;
          peer address 192.168.100.2;
          peer port 520;
          max-response-delay 60;
          max-unacked-updates 10;
          load balance max seconds 5;
        }
subnet 192.168.101.0 netmask 255.255.255.224 {
	}
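
As a sanity check on the two failover declarations above, here is a minimal
Python sketch that pulls the settings out of a "failover peer" block and
confirms that each server's address and port match the other server's peer
address and peer port.  It is not part of ISC DHCP; the function names
(failover_settings, mirror_problems) are made up for illustration, and the
parsing is deliberately naive (first failover block only, no comments or
nested braces).

```python
import re

def failover_settings(conf_text):
    """Return {setting: value} for the first failover peer block found."""
    block = re.search(r'failover\s+peer\s+"[^"]+"\s*\{(.*?)\}', conf_text, re.S)
    if block is None:
        return {}
    settings = {}
    for statement in block.group(1).split(";"):
        words = statement.split()
        if not words:
            continue
        if len(words) == 1:
            settings["role"] = words[0]  # bare "primary" / "secondary"
        else:
            # e.g. "peer address 192.168.101.2" -> {"peer address": "192.168.101.2"}
            settings[" ".join(words[:-1])] = words[-1]
    return settings

def mirror_problems(primary, secondary):
    """List mismatches between the two peers' role, address and port settings."""
    problems = []
    if primary.get("role") != "primary" or secondary.get("role") != "secondary":
        problems.append("roles are not primary/secondary")
    for mine, theirs in (("address", "peer address"), ("port", "peer port")):
        if primary.get(mine) != secondary.get(theirs):
            problems.append(f"primary {mine} != secondary {theirs}")
        if secondary.get(mine) != primary.get(theirs):
            problems.append(f"secondary {mine} != primary {theirs}")
    return problems

# Failover blocks copied from the two configurations in this message.
PRIMARY = '''
failover peer "dhcp" {
          primary;
          address 192.168.100.2;
          port 520;
          peer address 192.168.101.2;
          peer port 520;
          max-response-delay 60;
          max-unacked-updates 10;
          mclt 300;
          split 255;
          load balance max seconds 5;
        }
'''
SECONDARY = '''
failover peer "dhcp" {
          secondary;
          address 192.168.101.2;
          port 520;
          peer address 192.168.100.2;
          peer port 520;
          max-response-delay 60;
          max-unacked-updates 10;
          load balance max seconds 5;
        }
'''

if __name__ == "__main__":
    print(mirror_problems(failover_settings(PRIMARY),
                          failover_settings(SECONDARY)))
```

An empty list means the addresses, ports and roles line up; it says nothing
about whether the rest of the failover behaviour is correct.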

 > Date: Tue, 30 Apr 2013 19:58:06 +0100
 > From: Steven Carr <sjcarr at gmail.com>
 > To: Users of ISC DHCP <dhcp-users at lists.isc.org>
 > Subject: Re: Secondary server in failover fails to come out of recover
 > 	state
 >
 > Can you post the two full configs somewhere (or as near full as you can
 > without removing the main bulk of the config)? Or feel free to email them
 > to me directly. Also, so I can try to reproduce in our lab, what OS are
 > you running?
 >

On 04/30/2013 01:34 PM, Oscar Ricardo Silva wrote:
> OK, I've tried running the server in debug mode but I don't see any
> additional information available. This happened again today.  Also, as
> previously suggested, I have raised the mclt from 120 to 300.
>
>
> At 11am, a configuration change was made on the primary and it was
> restarted.  Here are the logs from the secondary; you'll see that at
> 11:06:55 both servers moved to a "normal" state.
>
> Apr 30 11:00:23 secondary-dhcp dhcpd: failover peer dhcp: peer moves
> from normal to shutdown
> Apr 30 11:00:23 secondary-dhcp dhcpd: failover peer dhcp: I move from
> normal to partner-down
> Apr 30 11:00:24 secondary-dhcp dhcpd: peer dhcp: disconnected
> Apr 30 11:03:36 secondary-dhcp dhcpd: failover peer dhcp: peer moves
> from shutdown to recover
> Apr 30 11:03:36 secondary-dhcp dhcpd: failover peer dhcp: peer moves
> from recover to recover
> Apr 30 11:06:55 secondary-dhcp dhcpd: failover peer dhcp: peer moves
> from recover to recover-done
> Apr 30 11:06:55 secondary-dhcp dhcpd: failover peer dhcp: I move from
> partner-down to normal
> Apr 30 11:06:55 secondary-dhcp dhcpd: failover peer dhcp: peer moves
> from recover-done to normal
>
>
>
> At 11:07:42, the secondary was restarted and these are the only entries
> in the log:
>
> Apr 30 11:07:42 secondary-dhcp dhcpd: failover peer dhcp: I move from
> normal to shutdown
> Apr 30 11:07:42 secondary-dhcp dhcpd: failover peer dhcp: peer moves
> from normal to partner-down
> Apr 30 11:07:43 secondary-dhcp dhcpd: failover peer dhcp: I move from
> shutdown to recover
> Apr 30 11:08:45 secondary-dhcp dhcpd: failover peer dhcp: I move from
> recover to startup
> Apr 30 11:08:45 secondary-dhcp dhcpd: failover peer dhcp: I move from
> startup to recover
>
> Two hours later, the secondary server is still recovering.
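
To make stuck states like this easier to spot, here is a rough Python sketch
that reads syslog lines of the form shown above and reports the last state
logged for the local server and for its peer.  It assumes each log entry is
on a single line (the entries above are wrapped by the mail client), that the
year is supplied by the caller because syslog omits it, and the names
transitions and last_states are invented for this example.

```python
import re
from datetime import datetime

# Matches e.g.
# "Apr 30 11:07:43 secondary-dhcp dhcpd: failover peer dhcp: I move from shutdown to recover"
LOG_LINE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d) (?P<host>\S+) dhcpd: "
    r"failover peer (?P<peer>\S+): (?P<who>I move|peer moves) "
    r"from (?P<old>\S+) to (?P<new>\S+)$"
)

def transitions(lines, year=2013):
    """Yield (timestamp, who, old_state, new_state) for each transition line."""
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if not m:
            continue  # skip lines such as "peer dhcp: disconnected"
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        who = "local" if m["who"] == "I move" else "peer"
        yield ts, who, m["old"], m["new"]

def last_states(lines, year=2013):
    """Return {"local": (state, timestamp), "peer": (state, timestamp)}."""
    state = {}
    for ts, who, _old, new in transitions(lines, year):
        state[who] = (new, ts)
    return state

# The secondary's log entries after its 11:07:42 restart, unwrapped.
SAMPLE = [
    "Apr 30 11:07:42 secondary-dhcp dhcpd: failover peer dhcp: I move from normal to shutdown",
    "Apr 30 11:07:42 secondary-dhcp dhcpd: failover peer dhcp: peer moves from normal to partner-down",
    "Apr 30 11:07:43 secondary-dhcp dhcpd: failover peer dhcp: I move from shutdown to recover",
    "Apr 30 11:08:45 secondary-dhcp dhcpd: failover peer dhcp: I move from recover to startup",
    "Apr 30 11:08:45 secondary-dhcp dhcpd: failover peer dhcp: I move from startup to recover",
]

if __name__ == "__main__":
    for who, (state, ts) in last_states(SAMPLE).items():
        print(f"{who}: {state} since {ts:%H:%M:%S}")
```

Run against the entries above, the last recorded local state is recover and
the last recorded peer state is partner-down, which matches what the servers
report.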
>
>
>
> Again, here's the strangest part of this issue:  when I take down the
> secondary server (dhcpd not running at all), the primary still reports
> that the secondary is in recover mode.  dhcpd was stopped on the
> secondary at 13:07:08  and here's what the primary reports:
>
> Apr 30 13:04:44 primary-dhcp dhcpd: peer dhcp: disconnected
>
>
> $Tue Apr 30 13:14:38 CDT 2013
>
> partner-state = 00:00:00:06
> local-state = 00:00:00:04
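
For anyone reading these values: a small Python sketch for turning the
colon-separated hex form into a state name.  The number-to-name table is my
understanding of the failover-state values documented in dhcpd(8), so treat
it as an assumption and check it against the man page for your version;
decode_state is a name made up for this example.

```python
# Assumed mapping, taken from the failover-state description in dhcpd(8).
FAILOVER_STATES = {
    1: "startup",
    2: "normal",
    3: "communications-interrupted",
    4: "partner-down",
    5: "potential-conflict",
    6: "recover",
    7: "paused",
    8: "shutdown",
    9: "recover-done",
    10: "resolution-interrupted",
    11: "conflict-done",
    254: "recover-wait",
}

def decode_state(value):
    """Turn a value such as '00:00:00:06' into a failover state name."""
    number = int(value.strip().replace(":", ""), 16)
    return FAILOVER_STATES.get(number, f"unknown ({number})")

if __name__ == "__main__":
    for line in ("partner-state = 00:00:00:06", "local-state = 00:00:00:04"):
        name, _, value = line.partition(" = ")
        print(f"{name}: {decode_state(value)}")
```

Under that mapping, the primary is in partner-down and believes the secondary
is in recover, which agrees with the state names in the logs.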
>
>
>
> There are router ACLs on the interfaces between the two servers, but
> traffic between the networks on which the servers reside is allowed
> without restriction.  iptables is running on each server but, again,
> places no restrictions on communication between the two.  If there were
> a firewall issue, the servers would never have returned to a "normal"
> state after the primary was restarted.
>
> Time is perfectly sync'ed between the two servers.
>
>
>
>> Message: 2
>> Date: Thu, 25 Apr 2013 00:01:45 +0100
>> From: Steven Carr <sjcarr at gmail.com>
>> To: Users of ISC DHCP <dhcp-users at lists.isc.org>
>> Subject: Re: Secondary server in failover fails to come out of recover
>>     state
>>
>> Can you crank up the logging level to debug (IIRC this needs to be done
>> via syslog) so it details exactly what it is doing when it goes into
>> RECOVER state? It may give some extra pointers.
>>
>>
>> On 24 April 2013 23:50, Oscar Ricardo Silva <oscars at mail.utexas.edu>
>> wrote:
>>
>>> I should note that while it was recovering, the primary reported:
>>>
>>> partner-state = 00:00:00:06
>>> local-state = 00:00:00:04
>>>
>>>
>>> and the secondary reported:
>>>
>>> partner-state = 00:00:00:04
>>> local-state = 00:00:00:06
>>>
>>>
>>>
>>> Following another suggestion (recreate an empty dhcpd.leases file), I
>>> shut down the secondary, but the primary still reported:
>>>
>>> partner-state = 00:00:00:06
>>> local-state = 00:00:00:04
>>>
>>>
>>>
>>>
>>> The change that was made was the addition of these two scopes:
>>>
>>>
>>> subnet 192.168.75.128 netmask 255.255.255.128 {
>>>                 pool {
>>>                         range 192.168.75.130 192.168.75.254;
>>>                         deny dynamic bootp clients ;
>>>                         failover peer "dhcp" ;
>>>                  }
>>>         option domain-name "dept.utexas.edu";
>>>         option subnet-mask 255.255.255.128;
>>>         option broadcast-address 255.255.255.255;
>>>         option routers 192.168.75.129;
>>> }
>>>
>>>
>>> subnet 192.168.228.32 netmask 255.255.255.224 {
>>>          pool {
>>>                  range 192.168.228.34 192.168.228.62;
>>>                  deny dynamic bootp clients ;
>>>                  failover peer "dhcp" ;
>>>          }
>>>          default-lease-time 7200;
>>>          max-lease-time 7200;
>>>          option domain-name "dept.utexas.edu";
>>>          option subnet-mask 255.255.255.224;
>>>          option broadcast-address 255.255.255.255;
>>>          option routers 192.168.228.33;
>>> }
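
A quick way to rule out a typo in new scopes like these is to confirm that
every range endpoint and router address falls inside its declared subnet.
A minimal Python sketch using the standard ipaddress module; the helper name
out_of_subnet is invented here, and the addresses are copied by hand from the
two scopes above rather than parsed from dhcpd.conf.

```python
import ipaddress

def out_of_subnet(network, netmask, addresses):
    """Return the addresses that do not fall inside network/netmask."""
    net = ipaddress.ip_network(f"{network}/{netmask}")
    return [a for a in addresses if ipaddress.ip_address(a) not in net]

# (subnet, netmask, [range start, range end, router]) for each new scope.
SCOPES = [
    ("192.168.75.128", "255.255.255.128",
     ["192.168.75.130", "192.168.75.254", "192.168.75.129"]),
    ("192.168.228.32", "255.255.255.224",
     ["192.168.228.34", "192.168.228.62", "192.168.228.33"]),
]

if __name__ == "__main__":
    for network, netmask, addresses in SCOPES:
        bad = out_of_subnet(network, netmask, addresses)
        print(f"{network}/{netmask}: {'ok' if not bad else bad}")
```

Both scopes pass this check, so the address arithmetic itself does not look
like the cause.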
>>>
>>>
>>> The new scopes were first added to the primary, which was then reloaded.
>>> After both servers were in a "normal" state, the corresponding change
>>> was made on the secondary and it was reloaded.
>>>
>>> Per Steven Carr's suggestion, I have increased the MCLT to 300 and both
>>> servers are still in the same state.
>>>
>>>
>>>
>>>
>>>
>>> On 04/24/2013 04:40 PM, Oscar Ricardo Silva wrote:
>>>
>>>> We have two servers in a failover relationship, both running
>>>> 4.1-ESV-R7.  After a reload of dhcpd on the secondary, it has not come
>>>> out of the recover state after almost an hour.  We've had this happen
>>>> with 3.1.3 and recently upgraded to this version.  The only thing we've
>>>> been able to do is stop both instances of dhcpd and remove "my state"
>>>> and "partner state" from dhcpd.leases.
>>>>
>>>>
>>>> Here's the timeline of what happened.
>>>>
>>>> 1.  A change was made to the configuration of the primary and dhcpd
>>>> reloaded at 15:39:14.
>>>> 2. The primary moved back to a "normal" state at 15:43:42
>>>>
>>>> Apr 24 15:39:14 primary-dhcp dhcpd: failover peer dhcp: I move from
>>>> normal to shutdown
>>>> Apr 24 15:39:15 primary-dhcp dhcpd: failover peer dhcp: peer moves from
>>>> normal to partner-down
>>>> Apr 24 15:39:15 primary-dhcp dhcpd: failover peer dhcp: I move from
>>>> shutdown to recover
>>>> Apr 24 15:40:18 primary-dhcp dhcpd: failover peer dhcp: I move from
>>>> recover to startup
>>>> Apr 24 15:40:18 primary-dhcp dhcpd: failover peer dhcp: I move from
>>>> startup to recover
>>>> Apr 24 15:43:42 primary-dhcp dhcpd: failover peer dhcp: peer update
>>>> completed.
>>>> Apr 24 15:43:42 primary-dhcp dhcpd: failover peer dhcp: I move from
>>>> recover to recover-done
>>>> Apr 24 15:43:42 primary-dhcp dhcpd: failover peer dhcp: peer moves from
>>>> partner-down to normal
>>>> Apr 24 15:43:42 primary-dhcp dhcpd: failover peer dhcp: I move from
>>>> recover-done to normal
>>>> Apr 24 15:44:53 primary-dhcp dhcpd: failover peer dhcp: peer moves from
>>>> normal to shutdown
>>>> Apr 24 15:44:53 primary-dhcp dhcpd: failover peer dhcp: I move from
>>>> normal to partner-down
>>>> Apr 24 15:44:54 primary-dhcp dhcpd: peer dhcp: disconnected
>>>> Apr 24 15:45:59 primary-dhcp dhcpd: failover peer dhcp: peer moves from
>>>> shutdown to recover
>>>> Apr 24 15:45:59 primary-dhcp dhcpd: failover peer dhcp: peer moves from
>>>> recover to recover
>>>>
>>>>
>>>>
>>>> 3.  The corresponding change was made on the secondary and it was
>>>> reloaded at 15:44:53
>>>>
>>>> 4.  At 15:44:54 it came back up into recover, then moved from recover
>>>> to startup, then from startup to recover.  That's where it's been
>>>> ever since.
>>>>
>>>> Apr 24 15:44:53 secondary-dhcp dhcpd: failover peer dhcp: I move from
>>>> normal to shutdown
>>>> Apr 24 15:44:53 secondary-dhcp dhcpd: failover peer dhcp: peer moves
>>>> from normal to partner-down
>>>> Apr 24 15:44:54 secondary-dhcp dhcpd: failover peer dhcp: I move from
>>>> shutdown to recover
>>>> Apr 24 15:45:56 secondary-dhcp dhcpd: failover peer dhcp: I move from
>>>> recover to startup
>>>> Apr 24 15:45:59 secondary-dhcp dhcpd: failover peer dhcp: I move from
>>>> startup to recover
>>>>
>>>>
>>>>
>>>> Here's dhcpd.conf for the primary:
>>>>
>>>> option domain-name-servers 192.168.50.41, 192.168.50.40 ;
>>>> option ntp-servers 192.168.50.40, 192.168.50.41;
>>>> default-lease-time 86400;
>>>> max-lease-time 86400;
>>>> one-lease-per-client true;
>>>> ddns-update-style ad-hoc;
>>>> ddns-updates off;
>>>> authoritative;
>>>> if substring (option dhcp-client-identifier, 0, 5) = 01:52:41:53:20 {
>>>>           deny booting;
>>>> }
>>>> option voip-tftp-server-address code 150 = array of ip-address ;
>>>> set vendor-string = option vendor-class-identifier;
>>>> failover peer "dhcp" {
>>>>            primary;
>>>>            address 192.168.100.2;
>>>>            port 520;
>>>>            peer address 192.168.101.2;
>>>>            peer port 520;
>>>>            max-response-delay 60;
>>>>            max-unacked-updates 10;
>>>>            mclt 120;
>>>>            split 255;
>>>>            load balance max seconds 5;
>>>>          }
>>>> subnet 192.168.100.0 netmask 255.255.255.224 {
>>>>           }
>>>> include "/dhcpd/dhcpd.network.conf";
>>>>
>>>>
>>>> and the /dhcpd/dhcpd.network.conf file holds the scope definitions.
>>>> Both servers sync time through ntp and have the same exact time.
>>>>
>>>>
>>>> Any information would be appreciated.
>>>>
>>>>
>>>>
>>>>
>>> _______________________________________________
>>> dhcp-users mailing list
>>> dhcp-users at lists.isc.org
>>> https://lists.isc.org/mailman/listinfo/dhcp-users
>>>
>>>
>