RE: Watching performance on a DHCP Server

Sat Apr 26 22:25:37 UTC 2008

Hi
The biggest problem with intense DHCP storms, caused by network outages longer than the lease time, is that if the server (or cluster of servers) is not quick enough, the clients' requests time out before the clients receive an answer. The server then only answers requests that are already "old", and no clients receive their addresses.
The solution when stuck in this situation is to block requests in the routers for a large part of the network, and then open everything up again bit by bit.
At the company where I work we have an in-house built DHCP server which is quite powerful; we have ~550,000 leased IPs in the system. Even so, when we have had long outages we have been forced to use the solution described above.
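
To illustrate why opening the network in stages helps, here is a rough sketch in Python. It assumes a deliberately simplified model: every client in a stage asks at the same moment, the server answers strictly in arrival order at a fixed rate, and a client gives up after a fixed timeout. Retries and the four-packet DHCP exchange are ignored. The function names, the rate of 1000 answers per second and the 4-second timeout are made-up illustration values, not figures from our server; only the ~550,000 leases come from the text above.

```python
import math


def answered_in_time(clients, answers_per_second, client_timeout_s):
    """Clients whose answer arrives before they give up.

    Simplified model: all clients ask at t=0 and the server answers
    first-in, first-out at a constant rate.
    """
    return min(clients, math.floor(answers_per_second * client_timeout_s))


def stages_needed(clients, answers_per_second, client_timeout_s):
    """Number of stages if each stage only admits as many clients
    as the server can answer within one client timeout."""
    per_stage = answered_in_time(clients, answers_per_second, client_timeout_s)
    if per_stage <= 0:
        raise ValueError("server cannot answer any client within the timeout")
    return math.ceil(clients / per_stage)


# Hypothetical rate and timeout; 550000 is the lease count mentioned above.
print(answered_in_time(550000, 1000, 4))
print(stages_needed(550000, 1000, 4))
```

In this model everything beyond the first `answers_per_second * client_timeout_s` requests is answered too late, which is the "old requests" effect described above.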

You should monitor the UDP input queue on the server(s) to check that the server manages to answer clients quickly enough.
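
One possible way to watch that queue, sketched in Python. This assumes a Linux host and a server that receives on an ordinary UDP socket bound to port 67; a server that reads from a raw or packet socket will not show its backlog here. The function name and the sample text are illustrative only. On Linux, the fifth column of /proc/net/udp holds tx_queue:rx_queue in hexadecimal.

```python
def udp_rx_queue_bytes(proc_net_udp_text, port=67):
    """Return the receive-queue size (bytes) of each UDP socket bound to `port`.

    `proc_net_udp_text` is the full text of /proc/net/udp (Linux only).
    """
    sizes = []
    for line in proc_net_udp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue
        # local_address looks like "00000000:0043" (hex address, hex port)
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port != port:
            continue
        # fifth column is "tx_queue:rx_queue", both hexadecimal
        sizes.append(int(fields[4].split(":")[1], 16))
    return sizes


# Made-up sample in the /proc/net/udp layout, for illustration only.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
  10: 00000000:0043 00000000:0000 07 00000000:00003A00 00:00000000 00000000     0        0 11111 2 0000000000000000 12
  11: 00000000:0035 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 22222 2 0000000000000000 0
"""

print(udp_rx_queue_bytes(SAMPLE))
```

On a live server the input would be the contents of /proc/net/udp read at regular intervals; a value that keeps growing, or sits at the socket buffer limit, suggests the server is falling behind.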

Regards Anders R 

-----Original Message-----
From: dhcp-users-bounce at isc.org [mailto:dhcp-users-bounce at isc.org] On Behalf Of Frank Bulk - iNAME
Sent: 26 April 2008 04:05
To: dhcp-users at isc.org
Subject: RE: Watching performance on a DHCP Server

It's not clear to me how 100% lease expiration, even on a network of ten
thousand clients, could lead to hours or days of downtime, even if it was a
MSFT DHCP server.  Even if the DHCP server could offer just ten leases per
second, in a 50,000 client network they would all be served in less than 90
minutes.  Is it possible that there were other contributing factors that were
the primary cause for delay?
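
The estimate above can be checked with a few lines of Python. This is only the back-of-the-envelope arithmetic, assuming a constant service rate and no retries or timeouts; the function name is made up for illustration.

```python
def drain_minutes(clients, leases_per_second):
    """Minutes needed to serve every client once at a constant lease rate."""
    if leases_per_second <= 0:
        raise ValueError("leases_per_second must be positive")
    return clients / leases_per_second / 60


# 50,000 clients at ten leases per second, the figures quoted above.
print(round(drain_minutes(50000, 10), 1))  # 83.3
```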

Frank

From: dhcp-users-bounce at isc.org [mailto:dhcp-users-bounce at isc.org] On Behalf
Of Blake Hudson
Sent: Tuesday, February 12, 2008 4:16 PM
To: dhcp-users at isc.org
Subject: Re: Watching performance on a DHCP Server

<snip>

I thought they were unlikely too, but I am choosing to plan for it since
I've seen two ISPs in the last month who have let 100% of their leases
expire. Their user bases are not 100k, but their servers/infrastructure
couldn't cope with the requests and it resulted in hours or days of
unnecessary downtime for their users. In these instances the problems were
likely 'administrator error', with either poor planning or configuration.

I'm glad that we have never experienced an issue of this magnitude, and
while administrator error can mostly be reduced through redundancy,
planning, etc., there are some things that lie outside of our control. If
something should happen, I'd like to be prepared.

I'm preparing by testing our equipment and configuration in order to
confidently state what the server's limits are and to be able to provide
information which supports my claim. I felt the performance of DHCPD 3.x was
too slow. By removing the primary bottleneck in DHCPD (high numbers of
fsyncs) I have been able to increase the server's capacity ~3x on 4 way
transactions, and ~100x on 2 way transactions while putting less load on a
server that provides other needed services. I feel confident that if there
were a large scale outage, our DHCP server would not be overwhelmed, DHCP
convergence would not be a limiting factor, and customer downtime will have
been minimized by these efforts.

-Blake

-- 
This message has been scanned for viruses and
dangerous content by MailScanner on mars.rosendal.nu,
and is believed to be clean.
