Secondary server in failover fails to come out of recover state

Tue Apr 30 19:50:07 UTC 2013

I should add that the only way we can fix this is to shut down dhcpd on
both servers, remove the "my state" and "partner state" lines from this
section of dhcpd.leases:

failover peer "dhcp" state {
   my state normal at 2 2013/04/30 18:39:28;
   partner state normal at 0 2008/01/06 16:56:50;
}

and then start dhcpd on both.  There doesn't appear to be any pattern to
when this happens.  We can go for days or weeks with no issues, and both
servers will sync up after a change, but then, after a new scope is added,
this issue may occur.
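
For anyone who wants to script the edit, here is a minimal sketch in Python
of the text change described above. The function name is made up for this
example, it only transforms text already read into memory, and it assumes
the state block looks like the one shown (opening line, state lines, closing
brace on its own line). It is meant to be used only while dhcpd is stopped
on both servers and after backing up dhcpd.leases.

```python
import re

# Opening line of a failover state declaration, e.g.
#   failover peer "dhcp" state {
STATE_OPEN = re.compile(r'^\s*failover\s+peer\s+"[^"]*"\s+state\s*\{\s*$')
# The two kinds of line the workaround removes.
STATE_LINE = re.compile(r'^\s*(my|partner)\s+state\s+.*;\s*$')


def strip_failover_state(leases_text):
    """Return leases_text without "my state" / "partner state" lines.

    Hypothetical helper: only lines inside a
    `failover peer "..." state { ... }` block are dropped; lease
    entries and everything else pass through unchanged.
    """
    kept = []
    in_state_block = False
    for line in leases_text.splitlines(keepends=True):
        if in_state_block:
            if line.strip() == "}":
                in_state_block = False      # end of the state block
            elif STATE_LINE.match(line):
                continue                    # drop the recorded state
        elif STATE_OPEN.match(line):
            in_state_block = True
        kept.append(line)
    return "".join(kept)
```

The location of dhcpd.leases differs between installations, so reading the
file and writing the result back is left to the caller.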

On 04/30/2013 02:41 PM, Oscar Ricardo Silva wrote:
> I should mention that we've written a patch to use the MAC address as
> the identifier instead of the client identifier.  We've done this so
> that a device will have the same identifier no matter what operating
> system it boots into.  This is an issue for multi-boot devices or if a
> device boots into a PXE-boot environment, and that's why you'll see this
> line in the configuration:
>
>      key-off-mac-address true;
>
>
>
> Operating system versions:
>
> primary:  RHEL 6.2, kernel 2.6.32-220.7.1.el6.i686
> secondary: RHEL 6.3, kernel 2.6.32-279.19.1.el6.i686
>
>
>
>
>
> Primary server:
>
>
> option domain-name-servers 192.168.50.41, 192.168.50.40 ;
> option ntp-servers 192.168.50.40, 192.168.50.41;
> default-lease-time 172800;
> max-lease-time 172800;
> one-lease-per-client true;
> ddns-update-style ad-hoc;
> ddns-updates off;
> authoritative;
> key-off-mac-address true;
> if substring (option dhcp-client-identifier, 0, 5) = 01:52:41:53:20 {
>          deny booting;
> }
> option voip-tftp-server-address code 150 = array of ip-address ;
> set vendor-string = option vendor-class-identifier;
> failover peer "dhcp" {
>           primary;
>           address 192.168.100.2;
>           port 520;
>           peer address 192.168.101.2;
>           peer port 520;
>           max-response-delay 60;
>           max-unacked-updates 10;
>           mclt 300;
>           split 255;
>           load balance max seconds 5;
>         }
> subnet 192.168.100.0 netmask 255.255.255.224 {
>      }
>
>
>
>
> Secondary:
>
> option domain-name-servers  192.168.50.41,  192.168.50.40 ;
> option ntp-servers  192.168.50.40,  192.168.50.41;
> default-lease-time 172800;
> max-lease-time 172800;
> one-lease-per-client true;
> ddns-update-style ad-hoc;
> ddns-updates off;
> authoritative;
> key-off-mac-address true;
> if substring (option dhcp-client-identifier, 0, 5) = 01:52:41:53:20 {
>          deny booting;
> }
> option voip-tftp-server-address code 150 = array of ip-address ;
> set vendor-string = option vendor-class-identifier;
> failover peer "dhcp" {
>           secondary;
>           address 192.168.101.2;
>           port 520;
>           peer address 192.168.100.2;
>           peer port 520;
>           max-response-delay 60;
>           max-unacked-updates 10;
>           load balance max seconds 5;
>         }
> subnet 192.168.101.0 netmask 255.255.255.224 {
>      }
>
>
>
>  > Date: Tue, 30 Apr 2013 19:58:06 +0100
>  > From: Steven Carr <sjcarr at gmail.com>
>  > To: Users of ISC DHCP <dhcp-users at lists.isc.org>
>  > Subject: Re: Secondary server in failover fails to come out of recover
>  >     state
>  > Message-ID:
>  >     <CALMep05YRY9LWtGsCQUV8uHXV0CAVOVJADiAoKYx8_A90=zHDA at mail.gmail.com>
>  > Content-Type: text/plain; charset="utf-8"
>  >
>  > Can you post the two full configs somewhere (or as near full as you can
>  > without removing the main bulk of the config)? Or feel free to email them
>  > to me directly. Also, so I can try to reproduce this in our lab, what OS
>  > are you running?
>  >
>
>
> On 04/30/2013 01:34 PM, Oscar Ricardo Silva wrote:
>> OK, I've tried running the server in debug mode but I don't see any
>> additional information available. This happened again today.  Also, as
>> previously suggested, I have raised the mclt from 120 to 300.
>>
>>
>> At 11am, a configuration change was made on the primary and it was
>> restarted.  Here are the logs from the secondary; you'll see that at
>> 11:06:55 both servers moved to a "normal" state.
>>
>> Apr 30 11:00:23 secondary-dhcp dhcpd: failover peer dhcp: peer moves from normal to shutdown
>> Apr 30 11:00:23 secondary-dhcp dhcpd: failover peer dhcp: I move from normal to partner-down
>> Apr 30 11:00:24 secondary-dhcp dhcpd: peer dhcp: disconnected
>> Apr 30 11:03:36 secondary-dhcp dhcpd: failover peer dhcp: peer moves from shutdown to recover
>> Apr 30 11:03:36 secondary-dhcp dhcpd: failover peer dhcp: peer moves from recover to recover
>> Apr 30 11:06:55 secondary-dhcp dhcpd: failover peer dhcp: peer moves from recover to recover-done
>> Apr 30 11:06:55 secondary-dhcp dhcpd: failover peer dhcp: I move from partner-down to normal
>> Apr 30 11:06:55 secondary-dhcp dhcpd: failover peer dhcp: peer moves from recover-done to normal
>>
>>
>>
>> At 11:07:42, the secondary was restarted and these are the only entries
>> in the log:
>>
>> Apr 30 11:07:42 secondary-dhcp dhcpd: failover peer dhcp: I move from normal to shutdown
>> Apr 30 11:07:42 secondary-dhcp dhcpd: failover peer dhcp: peer moves from normal to partner-down
>> Apr 30 11:07:43 secondary-dhcp dhcpd: failover peer dhcp: I move from shutdown to recover
>> Apr 30 11:08:45 secondary-dhcp dhcpd: failover peer dhcp: I move from recover to startup
>> Apr 30 11:08:45 secondary-dhcp dhcpd: failover peer dhcp: I move from startup to recover
>>
>> Two hours later, the secondary server is still recovering.
>>
>>
>>
>> Again, here's the strangest part of this issue:  when I take down the
>> secondary server (dhcpd not running at all), the primary still reports
>> that the secondary is in recover mode.  dhcpd was stopped on the
>> secondary at 13:07:08  and here's what the primary reports:
>>
>> Apr 30 13:04:44 primary-dhcp dhcpd: peer dhcp: disconnected
>>
>>
>> $Tue Apr 30 13:14:38 CDT 2013
>>
>> partner-state = 00:00:00:06
>> local-state = 00:00:00:04
>>
>>
>>
>> There are router ACLs on interfaces between the two servers, but the
>> networks on which the servers reside are completely allowed without
>> restriction.  iptables is running on each server but, again, with no
>> restrictions on communications between the two.  If there were a firewall
>> issue, the servers would never have returned to a "normal" state
>> after the primary was restarted.
>>
>> Time is perfectly sync'ed between the two servers.
>>
>>
>>
>>> Message: 2
>>> Date: Thu, 25 Apr 2013 00:01:45 +0100
>>> From: Steven Carr <sjcarr at gmail.com>
>>> To: Users of ISC DHCP <dhcp-users at lists.isc.org>
>>> Subject: Re: Secondary server in failover fails to come out of recover
>>>     state
>>> Message-ID:
>>>     <CALMep064aX_L0q5ry4A4SLGn-x=pV1ou4ECdoRGK9o8fx2_DHg at mail.gmail.com>
>>> Content-Type: text/plain; charset="utf-8"
>>>
>>> Can you crank up the logging level to debug (IIRC this needs to be done
>>> via syslog) so it details exactly what it is doing when it goes into
>>> RECOVER state? It may give some extra pointers.
>>>
>>>
>>> On 24 April 2013 23:50, Oscar Ricardo Silva <oscars at mail.utexas.edu>
>>> wrote:
>>>
>>>> I should note that while it was recovering, the primary reported:
>>>>
>>>> partner-state = 00:00:00:06
>>>> local-state = 00:00:00:04
>>>>
>>>>
>>>> and the secondary reported:
>>>>
>>>> partner-state = 00:00:00:04
>>>> local-state = 00:00:00:06
>>>>
>>>>
>>>>
>>>> Following another suggestion (recreate an empty dhcpd.leases file), I
>>>> shut down the secondary, but the primary still reported:
>>>>
>>>> partner-state = 00:00:00:06
>>>> local-state = 00:00:00:04
>>>>
>>>>
>>>>
>>>>
>>>> The change that was made was the addition of these two scopes:
>>>>
>>>>
>>>> subnet 192.168.75.128 netmask 255.255.255.128 {
>>>>                 pool {
>>>>                         range 192.168.75.130 192.168.75.254;
>>>>                         deny dynamic bootp clients ;
>>>>                         failover peer "dhcp" ;
>>>>                  }
>>>>         option domain-name "dept.utexas.edu";
>>>>         option subnet-mask 255.255.255.128;
>>>>         option broadcast-address 255.255.255.255;
>>>>         option routers 192.168.75.129;
>>>> }
>>>>
>>>>
>>>> subnet 192.168.228.32 netmask 255.255.255.224 {
>>>>          pool {
>>>>                  range 192.168.228.34 192.168.228.62;
>>>>                  deny dynamic bootp clients ;
>>>>                  failover peer "dhcp" ;
>>>>          }
>>>>          default-lease-time 7200;
>>>>          max-lease-time 7200;
>>>>          option domain-name "dept.utexas.edu";
>>>>          option subnet-mask 255.255.255.224;
>>>>          option broadcast-address 255.255.255.255;
>>>>          option routers 192.168.228.33;
>>>> }
>>>>
>>>>
>>>> The new scopes were first added to the primary, and it was then reloaded.
>>>> After both servers were in a "normal" state, the corresponding change was
>>>> made on the secondary and it was reloaded.
>>>>
>>>> Per Steven Carr's suggestion, I have increased the MCLT to 300 and both
>>>> servers are still in the same state.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 04/24/2013 04:40 PM, Oscar Ricardo Silva wrote:
>>>>
>>>>> We have two servers in a failover relationship, both running
>>>>> 4.1-ESV-R7.  After a reload of dhcpd on the secondary, it has not come
>>>>> out of the recover state after almost an hour.  We also had this happen
>>>>> with 3.1.3 before our recent upgrade to this version.  The only thing
>>>>> we've been able to do is stop both instances of dhcpd and remove
>>>>> "my state" and "partner state" from dhcpd.leases.
>>>>>
>>>>>
>>>>> Here's the timeline of what happened.
>>>>>
>>>>> 1.  A change was made to the configuration of the primary and dhcpd
>>>>> reloaded at 15:39:14.
>>>>> 2. The primary moved back to a "normal" state at 15:43:42
>>>>>
>>>>> Apr 24 15:39:14 primary-dhcp dhcpd: failover peer dhcp: I move from normal to shutdown
>>>>> Apr 24 15:39:15 primary-dhcp dhcpd: failover peer dhcp: peer moves from normal to partner-down
>>>>> Apr 24 15:39:15 primary-dhcp dhcpd: failover peer dhcp: I move from shutdown to recover
>>>>> Apr 24 15:40:18 primary-dhcp dhcpd: failover peer dhcp: I move from recover to startup
>>>>> Apr 24 15:40:18 primary-dhcp dhcpd: failover peer dhcp: I move from startup to recover
>>>>> Apr 24 15:43:42 primary-dhcp dhcpd: failover peer dhcp: peer update completed.
>>>>> Apr 24 15:43:42 primary-dhcp dhcpd: failover peer dhcp: I move from recover to recover-done
>>>>> Apr 24 15:43:42 primary-dhcp dhcpd: failover peer dhcp: peer moves from partner-down to normal
>>>>> Apr 24 15:43:42 primary-dhcp dhcpd: failover peer dhcp: I move from recover-done to normal
>>>>> Apr 24 15:44:53 primary-dhcp dhcpd: failover peer dhcp: peer moves from normal to shutdown
>>>>> Apr 24 15:44:53 primary-dhcp dhcpd: failover peer dhcp: I move from normal to partner-down
>>>>> Apr 24 15:44:54 primary-dhcp dhcpd: peer dhcp: disconnected
>>>>> Apr 24 15:45:59 primary-dhcp dhcpd: failover peer dhcp: peer moves from shutdown to recover
>>>>> Apr 24 15:45:59 primary-dhcp dhcpd: failover peer dhcp: peer moves from recover to recover
>>>>>
>>>>>
>>>>>
>>>>> 3.  The corresponding change was made on the secondary and it was
>>>>> reloaded at 15:44:53.
>>>>>
>>>>> 4.  At 15:44:54 it came back up into recover, then moved from recover to
>>>>> startup, then from startup to recover.  That's where it's been ever
>>>>> since.
>>>>>
>>>>> Apr 24 15:44:53 secondary-dhcp dhcpd: failover peer dhcp: I move from normal to shutdown
>>>>> Apr 24 15:44:53 secondary-dhcp dhcpd: failover peer dhcp: peer moves from normal to partner-down
>>>>> Apr 24 15:44:54 secondary-dhcp dhcpd: failover peer dhcp: I move from shutdown to recover
>>>>> Apr 24 15:45:56 secondary-dhcp dhcpd: failover peer dhcp: I move from recover to startup
>>>>> Apr 24 15:45:59 secondary-dhcp dhcpd: failover peer dhcp: I move from startup to recover
>>>>>
>>>>>
>>>>>
>>>>> Here's dhcpd.conf for the primary:
>>>>>
>>>>> option domain-name-servers 192.168.50.41, 192.168.50.40 ;
>>>>> option ntp-servers 192.168.50.40, 192.168.50.41;
>>>>> default-lease-time 86400;
>>>>> max-lease-time 86400;
>>>>> one-lease-per-client true;
>>>>> ddns-update-style ad-hoc;
>>>>> ddns-updates off;
>>>>> authoritative;
>>>>> if substring (option dhcp-client-identifier, 0, 5) = 01:52:41:53:20 {
>>>>>           deny booting;
>>>>> }
>>>>> option voip-tftp-server-address code 150 = array of ip-address ;
>>>>> set vendor-string = option vendor-class-identifier;
>>>>> failover peer "dhcp" {
>>>>>            primary;
>>>>>            address 192.168.100.2;
>>>>>            port 520;
>>>>>            peer address 192.168.101.2;
>>>>>            peer port 520;
>>>>>            max-response-delay 60;
>>>>>            max-unacked-updates 10;
>>>>>            mclt 120;
>>>>>            split 255;
>>>>>            load balance max seconds 5;
>>>>>          }
>>>>> subnet 192.168.100.0 netmask 255.255.255.224 {
>>>>>           }
>>>>> include "/dhcpd/dhcpd.network.conf";
>>>>>
>>>>>
>>>>> and the /dhcpd/dhcpd.network.conf file holds the scope definitions.  Both
>>>>> servers sync time through NTP and have exactly the same time.
>>>>>
>>>>>
>>>>> Any information would be appreciated.
>>>>>
>>>>>
>>>>>
>>>>>
>>>> _______________________________________________
>>>> dhcp-users mailing list
>>>> dhcp-users at lists.isc.org
>>>> https://lists.isc.org/mailman/listinfo/dhcp-users
>>>>
>>>>
>>>>
>>
>
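
For reference when reading the partner-state and local-state values quoted
above: they look like OMAPI output for the failover-state object (that is an
assumption, the thread does not say which tool printed them). The sketch
below decodes such a value using the state numbers as I understand them from
the dhcpd(8) man page; the function and table names are made up for this
example. The decoded values agree with the logs in the thread, where the
primary sees itself in partner-down and the secondary in recover.

```python
# State numbers as listed for the failover-state object in dhcpd(8),
# to the best of my understanding; check the man page for your version.
FAILOVER_STATES = {
    1: "startup",
    2: "normal",
    3: "communications-interrupted",
    4: "partner-down",
    5: "potential-conflict",
    6: "recover",
    7: "paused",
    8: "shutdown",
    9: "recover-done",
    10: "resolution-interrupted",
    11: "conflict-done",
    254: "recover-wait",
}


def decode_state(value):
    """Turn a colon-separated hex value such as '00:00:00:06' into a name."""
    number = int(value.replace(":", ""), 16)
    return FAILOVER_STATES.get(number, "unknown (%d)" % number)


if __name__ == "__main__":
    # Values reported by the primary in the thread.
    print("partner-state:", decode_state("00:00:00:06"))
    print("local-state:  ", decode_state("00:00:00:04"))
```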