excessive failover pool balancing, leases files getting out of sync

Gordon A. Lang glang at goalex.com
Wed Jun 15 20:15:33 UTC 2011


Please help.

I am seeing evidence in the logs that both failover peers are doing
what they should -- fast, appropriate responses.

While most clients are happily getting leases, many clients keep
retrying as if they never got the offer/acks or else they simply don't
like what they are getting.

The clients who experience trouble one day are not typically the
same clients who experience trouble the next day -- the problems
seem to be randomly and uniformly distributed across all users
(thousands of users) and all subnets (hundreds of subnets).
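
One way to put numbers on the retry pattern described above is to tally the server's own log lines per client. The following is a minimal sketch in Python, assuming the usual dhcpd syslog wording ("DHCPDISCOVER from <mac> ..." and "DHCPACK on <ip> to <mac> ..."). The function name tally_retries and the min_discovers threshold are illustrative and not part of dhcpd.

```python
import re
from collections import Counter

# Assumed dhcpd syslog wording; adjust the patterns if your build logs differently.
DISCOVER_RE = re.compile(r"DHCPDISCOVER from ([0-9a-fA-F:]{17})")
ACK_RE = re.compile(r"DHCPACK on \S+ to ([0-9a-fA-F:]{17})")


def tally_retries(lines, min_discovers=3):
    """Return {mac: (discovers, acks)} for clients that keep retrying."""
    discovers = Counter()
    acks = Counter()
    for line in lines:
        m = DISCOVER_RE.search(line)
        if m:
            discovers[m.group(1).lower()] += 1
            continue
        m = ACK_RE.search(line)
        if m:
            acks[m.group(1).lower()] += 1
    # Keep only clients that sent at least min_discovers DISCOVERs.
    return {
        mac: (count, acks[mac])
        for mac, count in discovers.items()
        if count >= min_discovers
    }
```

Feeding it the lines from the morning-rush window on each peer would show whether the retrying clients were ever ACKed, and by which server.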

Does this ring a bell with anyone?

--
Gordon A. Lang

----- Original Message ----- 
From: "Gordon A. Lang" <glang at goalex.com>
To: "Users of ISC DHCP" <dhcp-users at lists.isc.org>
Sent: Tuesday, June 14, 2011 10:25 PM
Subject: excessive failover pool balancing, leases files getting out of sync


>I need help.
>
> My log files show that the failover peers have very different ideas
> about the number of "free" and "backup" leases.  There is very
> frequent pool balancing going on.  My leases files are growing from
> 10 meg after a restart up to 120 meg within a couple of hours.
>
> We have version 3.1.1 compiled with USE_SOCKET defined, configured
> with failover, running on Sun servers with Solaris 10, and it has
> been working flawlessly for years until suddenly on May 26, things
> have become quite rough.
>
> The continuing symptom is this: at some point in the middle of the
> "morning rush," each work day between 7 am and 9 am, while most users
> get leases, many users do not.  The users who do not get a lease
> usually do get one on their second attempt (via reboot), but not
> always.  Restarting dhcpd at that time completely stops all
> DHCP-related help desk calls for the remainder of the day.
>
> The first time this occurred, on May 26, it was after a power glitch
> that sharpened the "morning rush" into a "boot storm."  Restarting
> dhcpd did not stop the help desk calls on that day.  In fact, we found
> that the problems did not stop until we shut down the primary and
> put the failover into "partner down" state.
>
> After that, things were fine for several days, until the symptoms
> resurfaced on a daily basis, and we got used to waiting for the
> first help desk call each morning.
>
> Then I rebooted both servers (with the primary dhcpd inhibited), took
> the failover dhcpd out of the "partner down" state, deleted the leases
> file on the primary, and brought the primary dhcpd back to life.
> The leases file was repopulated very quickly.  Everything worked
> flawlessly, and the leases files stayed in sync for 6 days.  But
> today we are seeing that the leases files are out of sync.
>
> And now I am getting a new message in the logs that I didn't see before:
>
> dhcpd: bind update on 10.110.1.80 got ack from nsti1-nsti2: xid mismatch.
>
>
> If anyone can give any suggestions, I would very much appreciate it.
>
> More information: We use 2 day leases, with pool sizes typically 3 to 4
> times larger than the expected number of users.
>
> I don't want to post my whole config file, but in case it is a useful
> reference, here is a drastically trimmed, sanitized version of the
> primary (nsti1) conf file:
>
> local-address 192.168.104.11;
> subnet 192.168.104.11 netmask 255.255.255.255 { }
>
> option option-62 code 62 = string;
> option option-98 code 98 = string;
> option option-116 code 116 = boolean;
> option option-117 code 117 = unsigned integer 16;
> option option-119 code 119 = string;
> option option-120 code 120 = string;
> option option-121 code 121 = string;
> option option-150 code 150 = array of ip-address;
> option option-176 code 176 = string;
>
> ddns-update-style interim;
> authoritative ;
> ddns-updates True ;
> default-lease-time 172800 ;
> log-facility local4 ;
> max-lease-time 172800 ;
> min-lease-time 172800 ;
> omapi-port 7911 ;
> pid-file-name "/export/local/etc/dhcpd.pid" ;
> ping-check True ;
> ping-timeout 1 ;
> server-identifier 192.168.104.11 ;
> update-optimization False ;
> update-static-leases True ;
> allow unknown-clients ;
> allow duplicates ;
>
> failover peer "nsti1-nsti2" {
>        primary;
>        address 192.168.104.11;
>        port 647;
>        peer address 192.168.104.21;
>        peer port 647;
>        max-response-delay 60;
>        max-unacked-updates 20;
>        mclt 3600;
>        split 255;
>        load balance max seconds 5;
> }
>
> class "Cisco IP Phones" {
> match if (substring (option vendor-class-identifier, 0, 28) = "Cisco Systems, Inc. IP Phone");
> }
>
> subnet 127.0.0.1 netmask 255.255.255.255 {
> }
>
> subnet 10.3.0.0 netmask 255.255.0.0 {
>        option subnet-mask 255.255.0.0 ;
>        option routers 10.3.1.1 ;
>        option time-servers 10.106.1.30 ;
>        option domain-name-servers 192.168.53.10 , 192.168.53.20 ;
>        option domain-name "domain.name" ;
>        option vendor-encapsulated-options 06:01:0B:08:07:AA:AA:01:0A:06:01:14:00 ;
>        option netbios-name-servers 10.105.3.39 , 10.105.3.41 ;
>        option netbios-node-type 8 ;
>        option bootfile-name "BStrap/x86pc/BStrap.0" ;
>        option slp-directory-agent true 10.105.6.241 , 10.151.209.100 ;
>        option slp-service-scope true "slp-fs" ;
>        next-server fsta4.ad.domain.name ;
>        deny bootp ;
>        pool {
>                range 10.3.4.64 10.3.7.255;
>                failover peer "nsti1-nsti2";
>                deny dynamic bootp clients;
>        }
>        host OC0001L1464807.ad.domain.name.-68B599EC9E7B-10-3-4-7 {
>                hardware ethernet 68:B5:99:EC:9E:7B;
>                fixed-address 10.3.4.7;
>        }
> }
>
> subnet 10.110.1.0 netmask 255.255.255.0 {
>        option subnet-mask 255.255.255.0 ;
>        option routers 10.110.1.1 ;
>        option time-servers 10.106.1.30 ;
>        option domain-name-servers 192.168.53.10 , 192.168.53.20 ;
>        option domain-name "domain.name" ;
>        option netbios-name-servers 10.105.3.39 , 10.105.3.41 ;
>        option netbios-node-type 8 ;
>        option slp-directory-agent true 10.105.6.241 , 10.151.209.100 ;
>        option slp-service-scope true "slp-fs" ;
>        option option-150 10.104.1.7 ;
>        pool {
>                range 10.110.1.76 10.110.1.78;
>                failover peer "nsti1-nsti2";
>                deny dynamic bootp clients;
>        }
>        pool {
>                range 10.110.1.80 10.110.1.239;
>                failover peer "nsti1-nsti2";
>                deny dynamic bootp clients;
>        }
>        pool {
>                range 10.110.1.251 10.110.1.254;
>                failover peer "nsti1-nsti2";
>                deny dynamic bootp clients;
>        }
>        host null-0006296C03CB-10-110-1-16 {
>                hardware ethernet 00:06:29:6C:03:CB;
>                fixed-address 10.110.1.16;
>        }
>
>        host null-0003BA2448D1-10-110-1-17 {
>                hardware ethernet 00:03:BA:24:48:D1;
>                fixed-address 10.110.1.17;
>        }
> }
>
>
> --
> Gordon A. Lang



