pool misbalance issues on large, busy pool with frequent restarts

Petersen, Kirsten J - NET Kirsten.Petersen at oregonstate.edu
Wed Jun 2 20:32:46 UTC 2010


We are running ISC DHCP 3.1.1-6 on Debian Lenny.

We have two DHCP servers configured for load balancing and failover.  We restart dhcpd on both servers every 5 minutes, at the same time, to pick up the latest updates to the DHCP config.  The max-lease-time is set very short on our wireless networks (1800 seconds) because we were running out of leases and wanted them to cycle quickly.

We are seeing an issue where the two servers get out of sync with respect to pool balances.  Sometimes when this happens, the situation gets bad enough that neither server hands out leases, even though (we think) there should still be leases available.  To work around that problem, we take down the secondary, delete its leases file, then bring it back up and let it recover.  This is obviously less than ideal.

Based on what we are seeing in the logs, it looks like one server is sometimes trying to rebalance when its partner is restarting, and it doesn't know that its partner is unavailable.  Is that possible?

Logs illustrating brokenness:

May 26 01:02:07 ns2 dhcpd: balancing pool aa8c8a8 128.193.136/21  total 2038  free 895  backup 594  lts -150  max-own (+/-)45  (requesting peer rebalance!)
May 26 01:02:07 ns2 dhcpd: balanced pool aa8c8a8 128.193.136/21  total 2038  free 895  backup 594  lts -150  max-misbal 74
May 26 01:02:07 ns1 dhcpd: balancing pool a8258d8 128.193.136/21  total 2038  free 699  backup 790  lts -45  max-own (+/-)45
May 26 01:03:07 ns2 dhcpd: balancing pool aa8c8a8 128.193.136/21  total 2038  free 896  backup 596  lts -150  max-own (+/-)45
May 26 01:03:07 ns1 dhcpd: balancing pool a8258d8 128.193.136/21  total 2038  free 701  backup 791  lts -45  max-own (+/-)45
May 26 01:03:07 ns1 dhcpd: balanced pool a8258d8 128.193.136/21  total 2038  free 701  backup 791  lts -45  max-misbal 75
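To make the disagreement in these lines easier to see, here is a minimal Python sketch that parses the two 01:02:07 "balancing pool" lines and compares the counters. The regular expression assumes exactly the log format shown above, and the helper name parse_balance_line is hypothetical. Both servers report the same free + backup sum (1489) but disagree on the split by 196 leases.

```python
import re

# Assumes the "balancing pool" / "balanced pool" line format shown in the logs above.
LINE_RE = re.compile(
    r"(?P<host>\S+) dhcpd: (?P<phase>balancing|balanced) pool \S+ (?P<net>\S+)\s+"
    r"total (?P<total>\d+)\s+free (?P<free>\d+)\s+backup (?P<backup>\d+)\s+"
    r"lts (?P<lts>-?\d+)"
)


def parse_balance_line(line):
    """Return the counters from one pool-balance log line, or None if it does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    record = match.groupdict()
    for key in ("total", "free", "backup", "lts"):
        record[key] = int(record[key])
    return record


# The two 01:02:07 "balancing pool" lines, copied from the logs above.
SAMPLE = [
    "May 26 01:02:07 ns2 dhcpd: balancing pool aa8c8a8 128.193.136/21  total 2038  "
    "free 895  backup 594  lts -150  max-own (+/-)45  (requesting peer rebalance!)",
    "May 26 01:02:07 ns1 dhcpd: balancing pool a8258d8 128.193.136/21  total 2038  "
    "free 699  backup 790  lts -45  max-own (+/-)45",
]

# One record per server, keyed by hostname.
views = {rec["host"]: rec for rec in map(parse_balance_line, SAMPLE)}

for host, rec in sorted(views.items()):
    print(host, "free", rec["free"], "backup", rec["backup"],
          "free+backup", rec["free"] + rec["backup"])

# How far apart the two servers' "free" counts are for the same pool.
print("free-count disagreement:", abs(views["ns1"]["free"] - views["ns2"]["free"]))
```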


A few questions:

* Is there something horribly wrong with our config or the way we are doing restarts?

* Is there a best practice for restarting load-balanced dhcp servers?  One idea we had was to write a script that would do something like the following:

 - shut down primary
 - tell secondary its partner is down
 - start up primary
 - check status of primary
 - tell secondary its partner is up
 - wait 10 seconds or so
 - repeat above for secondary

Does the above look reasonable, or is there a better way?  
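As a rough illustration of the sequence proposed above, here is a Python sketch of the ordering only. Everything in it is an assumption: the hook names (stop, partner_down, start, is_healthy, partner_up) are hypothetical placeholders for whatever commands would really do each step, and the dry-run hooks just record what would happen. It does not show how to actually change failover state on a running dhcpd.

```python
import time


def rolling_restart(primary, secondary, hooks, settle_seconds=10, sleep=time.sleep):
    """Restart each server in turn, following the step list in the message.

    `hooks` maps step names to callables taking a hostname. All of them are
    hypothetical placeholders; none corresponds to a real dhcpd command.
    """
    for server, partner in ((primary, secondary), (secondary, primary)):
        hooks["stop"](server)            # shut down this server
        hooks["partner_down"](partner)   # tell the other server its partner is down
        hooks["start"](server)           # start this server again
        if not hooks["is_healthy"](server):
            # Stop here rather than touch the only server still running.
            raise RuntimeError(f"{server} did not come back up; stopping")
        hooks["partner_up"](partner)     # tell the other server its partner is back
        sleep(settle_seconds)            # wait before moving on to the next server


def dry_run_hooks(log):
    """Build hooks that only append the step and hostname to `log`."""
    def record(step):
        def hook(host):
            log.append(f"{step} {host}")
            return True
        return hook
    names = ("stop", "partner_down", "start", "is_healthy", "partner_up")
    return {name: record(name) for name in names}


if __name__ == "__main__":
    steps = []
    rolling_restart("ns1", "ns2", dry_run_hooks(steps), sleep=lambda seconds: None)
    print("\n".join(steps))
```

The health check aborts the run if the restarted server does not come back, so the second server is never stopped while the first is still down.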

* When we shut down a server, should we set its status to "recover-wait" or something else before shutdown, so that it comes up in a state where it is not trying to hand out leases?  Or does the software do this by default?  It looks like status information is read from the leases file...

* Or would it be better to shut down the primary, then shut down the secondary, then bring up the primary, and then bring up the secondary?  That way, both servers are down at the same time, and neither one tries to rebalance while its partner is unavailable.  (Obviously this would result in a brief outage.)

* Is there a way to tell the servers at what times or what interval they should rebalance pools?  If not, how do the servers decide when to check for rebalance?  If there is a fixed frequency, what is it?


The configs follow.


# primary config

failover peer "dhcp" {
    primary;
    address x.y.z.10;
    port 520;

    # our peer is ns2
    peer address x.y.z.20;
    peer port 520;

    max-response-delay 60;
    max-unacked-updates 10;
    mclt 600;
    split 128;
    load balance max seconds 3;

    max-lease-misbalance 5;
    max-lease-ownership 3;
}

# secondary config

failover peer "dhcp" {
    secondary;
    address x.y.z.20;
    port 520;

    # our peer is ns1
    peer address x.y.z.10;
    peer port 520;

    max-response-delay 60;
    max-unacked-updates 10;
    load balance max seconds 3;

    max-lease-misbalance 5;
    max-lease-ownership 3;
}
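For what it is worth, the max-misbal and max-own numbers in the logs above look consistent with applying max-lease-misbalance (5) and max-lease-ownership (3) as percentages of free + backup, rounded to the nearest integer. The Python sketch below only checks that arithmetic against the logged values. The formula is inferred from those six log lines, not taken from the dhcpd source, and the function name percent_of_pool is hypothetical.

```python
def percent_of_pool(free, backup, percent):
    """Percentage of (free + backup), rounded half up using integer arithmetic.

    Inferred from the logged values only; not confirmed against the dhcpd source.
    """
    return ((free + backup) * percent + 50) // 100


# (free, backup, percent, value seen in the logs)
CHECKS = [
    (895, 594, 5, 74),  # ns2 01:02:07, max-misbal 74
    (701, 791, 5, 75),  # ns1 01:03:07, max-misbal 75
    (895, 594, 3, 45),  # ns2 01:02:07, max-own 45
    (699, 790, 3, 45),  # ns1 01:02:07, max-own 45
    (896, 596, 3, 45),  # ns2 01:03:07, max-own 45
    (701, 791, 3, 45),  # ns1 01:03:07, max-own 45
]

for free, backup, percent, logged in CHECKS:
    computed = percent_of_pool(free, backup, percent)
    print(f"free={free} backup={backup} {percent}% -> {computed} (logged {logged})")
```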

# Example wireless network definition

subnet x.y.136.0 netmask 255.255.248.0 {
        max-lease-time 1800;
        option subnet-mask 255.255.248.0;
        option netbios-name-servers x.y.z.39;
        option routers x.y.136.1;
        use-host-decl-names false;
        default-lease-time 1800;
        option netbios-node-type 8;
        pool {
                failover peer "dhcp";
                deny dynamic bootp clients;
                range x.y.136.8 x.y.143.253;
        }
}



________________
Kirsten Petersen
Network Services * Oregon State University
http://oregonstate.edu/net * irc.oregonstate.edu #osu-is
"Aging is bad for your health." - Bent Petersen


