pool misbalance issues on large, busy pool with frequent restarts

Thu Jun 3 03:16:50 UTC 2010

The practice I have always followed when restarting servers using 
failover is to stop and start the secondary, wait for startup to 
complete, then stop and start the primary. If you script this process, 
do a syntax check (dhcpd -t) before actually stopping either server.
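
As a rough illustration of that ordering, here is a minimal Python 
sketch. It is not a tested production script. The host names, the use of 
ssh, the init script path, and the fixed settle time (standing in for 
"wait for startup to complete") are all assumptions of mine; adjust them 
to your setup.

```python
import subprocess
import time


def rolling_restart(hosts,
                    check_cmd=("dhcpd", "-t"),
                    restart_cmd=("/etc/init.d/dhcp3-server", "restart"),
                    settle_seconds=30,
                    run=subprocess.call,
                    sleep=time.sleep):
    """Restart dhcpd on each host in the order given (secondary first).

    hosts          -- e.g. ["ns2", "ns1"]; hypothetical names, secondary first
    check_cmd      -- syntax check run on every host before anything stops
    restart_cmd    -- assumed init script; substitute whatever your hosts use
    settle_seconds -- crude fixed wait between hosts (an assumption, not a
                      real "startup complete" test)
    Returns True if every step succeeded, False as soon as one fails.
    """
    # Phase 1: syntax-check every config. If any check fails, stop here
    # so that no server has been taken down.
    for host in hosts:
        if run(["ssh", host] + list(check_cmd)) != 0:
            return False

    # Phase 2: restart one host at a time, pausing between hosts.
    for index, host in enumerate(hosts):
        if run(["ssh", host] + list(restart_cmd)) != 0:
            return False
        if index < len(hosts) - 1:
            sleep(settle_seconds)
    return True
```

Calling rolling_restart(["ns2", "ns1"]) would check both configs, restart 
ns2, wait, then restart ns1. The run and sleep parameters exist only so 
the logic can be exercised without touching real servers.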

The idea behind this is that if you restart the primary first, and you 
have configured, say, a new subnet, then there will be an error when the 
primary sends a failover message to the secondary, which doesn't yet 
know about the new subnet.

You could also add some logic to check to see if the config files 
(dhcpd.conf and any include files) have been changed since the last 
restart, and skip the restart if they are unchanged.
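
One way to do that check, sketched in Python: hash the config files and 
compare the result against a digest saved at the last restart. The 
function names and the stamp file are my own invention for illustration, 
and you would have to supply the list of include files yourself.

```python
import hashlib


def config_digest(paths):
    """Return one SHA-256 hex digest covering the contents of all paths."""
    digest = hashlib.sha256()
    for path in sorted(paths):
        # Mix in the file name so renaming a file also counts as a change.
        digest.update(path.encode("utf-8") + b"\0")
        with open(path, "rb") as handle:
            digest.update(handle.read())
    return digest.hexdigest()


def needs_restart(paths, stamp_file):
    """True if the configs differ from the digest saved at the last restart."""
    try:
        with open(stamp_file) as handle:
            previous = handle.read().strip()
    except FileNotFoundError:
        # No record of a previous restart, so assume one is needed.
        return True
    return config_digest(paths) != previous


def record_restart(paths, stamp_file):
    """Save the current digest; call this after a successful restart."""
    with open(stamp_file, "w") as handle:
        handle.write(config_digest(paths))
```

The restart job would call needs_restart() first, skip the restart when 
it returns False, and call record_restart() only after dhcpd has come 
back up cleanly.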

Because you are restarting so often, the rebalancing algorithms don't 
get to run before the next restart comes along. Restarting both 
simultaneously may also mean that one server is not ready to receive the 
update from the other while it is still restarting. Staggering the 
starts will help.

The running server knows when the other server comes up, as it receives 
a failover update message. So if you were to put it in partner-down 
state, it will automatically switch through the required states to 
normal as the other system comes up and communication is established.

There is also some information in the dhcpd.conf man page, in the 
section titled "The Failover pool balance statements". You may wish to 
fiddle with the max-balance setting if the new restart schedule doesn't 
help enough. Make this setting less than 5 minutes (it is given in 
seconds, so under 300).

regards,
-glenn

On 06/03/10 06:32, Petersen, Kirsten J - NET wrote:
> We are running ISC DHCP 3.1.1-6 on Debian Lenny.
>
> We have two dhcp servers configured for load balancing and failover.  We do a dhcpd-restart every 5 minutes on both servers at the same time to pick up the latest updates to the dhcp config.  The max-lease-time is set very short on our wireless networks (1800) because we were running out of leases and wanted them to cycle quickly.
>
> We are seeing an issue where the two servers become out-of-sync with respect to pool balances.  Sometimes when this happens, the situation gets bad enough that neither server is handing out leases even when there should be leases still available (we think).  To work around that problem, we down the secondary and whack its leases file, then bring it back up and let it recover.  This is obviously less than ideal.
>
> Based on what we are seeing in the logs, it looks like one server is sometimes trying to rebalance when its partner is restarting, and it doesn't know that its partner is unavailable.  Is that possible?
>
> Logs illustrating brokenness:
>
> May 26 01:02:07 ns2 dhcpd: balancing pool aa8c8a8 128.193.136/21  total 2038  free 895  backup 594  lts -150  max-own (+/-)45  (requesting peer rebalance!)
> May 26 01:02:07 ns2 dhcpd: balanced pool aa8c8a8 128.193.136/21  total 2038  free 895  backup 594  lts -150  max-misbal 74
> May 26 01:02:07 ns1 dhcpd: balancing pool a8258d8 128.193.136/21  total 2038  free 699  backup 790  lts -45  max-own (+/-)45
> May 26 01:03:07 ns2 dhcpd: balancing pool aa8c8a8 128.193.136/21  total 2038  free 896  backup 596  lts -150  max-own (+/-)45
> May 26 01:03:07 ns1 dhcpd: balancing pool a8258d8 128.193.136/21  total 2038  free 701  backup 791  lts -45  max-own (+/-)45
> May 26 01:03:07 ns1 dhcpd: balanced pool a8258d8 128.193.136/21  total 2038  free 701  backup 791  lts -45  max-misbal 75
>
>
> A few questions:
>
> * Is there something horribly wrong with our config or the way we are doing restarts?
>
> * Is there a best practice for restarting load-balanced dhcp servers?  One idea we had was to write a script that would do something like the following:
>
>   - shutdown primary
>   - tell secondary partner is down
>   - startup primary
>   - check status of primary
>   - tell secondary partner is up
>   - wait 10 seconds or so
>   - repeat above for secondary
>
> Does the above look reasonable, or is there a better way?
>
> * When we shutdown a server, should we set its status to "recover-wait" or something else before shutdown, so that it comes up in a state where it is not trying to hand out leases?  Or, does the software do this by default?  It looks like status information is read from the leases file...
>
> * Or, would it be better to shutdown the primary, then shutdown the secondary, then bring up the primary, and then bring up the secondary?  That way, both servers are down at the same time, and neither one tries to rebalance when its partner is unavailable.   (Obviously this would result in a brief outage.)
>
> * Is there a way to tell the servers at what times or what interval they should rebalance pools?  If not, how do the servers decide when to check for rebalance?  If there is a fixed frequency, what is it?
>
>
> The configs follow.
>
>
> # primary config
>
> failover peer "dhcp" {
>      primary;
>      address x.y.z.10;
>      port 520;
>
>      # our peer is ns2
>      peer address x.y.z.20;
>      peer port 520;
>
>      max-response-delay 60;
>      max-unacked-updates 10;
>      mclt 600;
>      split 128;
>      load balance max seconds 3;
>
>      max-lease-misbalance 5;
>      max-lease-ownership 3;
> }
>
> # secondary config
>
> failover peer "dhcp" {
>      secondary;
>      address x.y.z.20;
>      port 520;
>
>      # our peer is ns1
>      peer address x.y.z.10;
>      peer port 520;
>
>      max-response-delay 60;
>      max-unacked-updates 10;
>      load balance max seconds 3;
>
>      max-lease-misbalance 5;
>      max-lease-ownership 3;
> }
>
> # Example wireless network definition
>
> subnet x.y.136.0 netmask 255.255.248.0 {
>          max-lease-time 1800;
>          option subnet-mask 255.255.248.0;
>          option netbios-name-servers x.y.z.39;
>          option routers x.y.136.1;
>          use-host-decl-names false;
>          default-lease-time 1800;
>          option netbios-node-type 8;
>          pool {
>                  failover peer "dhcp";
>                  deny dynamic bootp clients;
>                  range x.y.136.8 x.y.143.253;
>          }
> }
>
>
>
> ________________
> Kirsten Petersen
> Network Services * Oregon State University
> http://oregonstate.edu/net * irc.oregonstate.edu #osu-is
> "Aging is bad for your health." - Bent Petersen