mclt Values, was 2 Instances of dhcp on Same Platform

Mon Feb 26 14:06:11 UTC 2007

>Date: Mon, 26 Feb 2007 06:35:41 -0600
>From: Martin McCormick <martin at dc.cis.okstate.edu>
>
>Glenn Satchell writes:
>> I'm curious as to why you have MCLT issues when restarting the servers.
>> 
>> I have run several large sites with failover, and we always used the
>> procedure to stop and then start the secondary. Wait for it to sync
>> (usually only a few seconds), then stop and start the primary. Both
>> servers continue to hand out leases with the normal lease times after
>> restarting this way.
>
>	Hmm.  I have a log file from a week day last Fall which
>summarizes what almost always happened when we did exactly what
>you describe.  I had an expect script that ran omshell to do an
>orderly shutdown and then ran a tail -1f on syslog until it saw
>the right message come across.  In this annotated example, landlord is the
>primary and slumlord is the secondary.  Our failover-specific
>directives were as follows:
>
>authoritative;
>ddns-update-style interim;
>
>failover peer "stw" {
>  primary; # declare this to be the primary server
>  address 10.1.2.3;
>  port 520;
>  peer address 10.1.2.5;
>  peer port 520;
>  max-response-delay 30;
>  max-unacked-updates 10;
>  load balance max seconds 3;
>  mclt 300;
>  split 128;
>}
>include "/usr/local/etc/dhcpd.conf.tested";
>
>
>	Here's a log.  We had started out with the recommended
>MCLT value and changed it to 300 seconds or 5 minutes after we
>didn't get back to normal until an hour had passed after adding
>some static bootp records:
>
>	We shutdown the secondary:
>
>Nov 28 09:14:59 landlord dhcpd: failover peer stw: peer moves from normal to shutdown
>Nov 28 09:14:59 landlord dhcpd: failover peer stw: I move from normal to partner-down
>Nov 28 09:20:00 landlord dhcpd: failover peer stw: I move from partner-down to normal
>Nov 28 09:20:00 landlord dhcpd: failover peer stw: peer moves from recover-done to normal
>
>	Okay.  We're half done.  Bounce landlord.
>
>Nov 28 09:20:16 landlord dhcpd: failover peer stw: I move from normal to shutdown
>Nov 28 09:20:16 landlord dhcpd: failover peer stw: peer moves from normal to partner-down
>Nov 28 09:25:17 landlord dhcpd: failover peer stw: peer moves from partner-down to normal
>Nov 28 09:25:17 landlord dhcpd: failover peer stw: I move from recover-done to normal
>
>	On rare occasions, the secondary came up to normal within
>a few seconds.  I think I saw once where the primary came
>immediately back up, but I never got out of one of those
>transactions without at least one MCLT wait.  Our sites are very
>busy as I am sure yours are.  On the day in question, we had 1,269,677
>lines of dhcpd messages.
>
>	On our network, the routers forward dhcp broadcasts to
>the servers and we do not appear to be having connectivity
>issues.
>
>	I am open for any suggestions as to possible causes and
>preventative measures to avoid the mclt issue on update because
>we really should keep the recommended mclt value of 30 minutes
>but even a 30-minute mclt is way too long for us.  We probably
>modify static dhcp/bootp data five to ten times a day on a random
>basis which would mean recover-wait states most of the time.:-(
>
>	The other solution I am looking at is to run a jail on
>both servers and put all the bootp data on one instance of dhcpd
>and all the dynamic operation on another.  Each box would run one
>of each for a total of 4 dhcp servers.  If everything we have can
>eventually talk to all 4, we've got it made.
>
>	Again, thanks for any thoughts.  When we had failover
>going, it was otherwise very good.  As I mentioned last week, the
>final blow was when we discovered that our present wireless
>network authentication devices could only talk to one dhcp server
>of the pair.
>
>Martin McCormick WB5AGZ  Stillwater, OK 
>Systems Engineer
>OSU Information Technology Department Network Operations Group
>
Thank you for posting an excellent description!

Here's an example of what I see when I do the shutdown. Note this is
dhcpd 3.1.0a3, so the messages might be slightly different, and it's my
home network with only two clients :) but the same theory applies.

Here I shut down the process on the secondary, lager. This is basically a
kill -TERM. These are the logs on the primary, drill:

Feb 27 00:35:50 drill dhcpd: [ID 702911 local7.info] peer Uniq14subnet: disconnected
Feb 27 00:35:50 drill dhcpd: [ID 702911 local7.info] failover peer Uniq14subnet: I move from normal to communications-interrupted

Note the primary goes into communications-interrupted, not partner-down.
Restart the secondary.

Feb 27 00:36:20 drill dhcpd: [ID 702911 local7.info] failover peer Uniq14subnet: peer moves from normal to normal
Feb 27 00:36:20 drill dhcpd: [ID 702911 local7.info] failover peer Uniq14subnet: I move from communications-interrupted to normal

Here are the logs on the secondary for the same period. It logs nothing
when it exits, but the startup looks good.

Feb 27 00:36:20 lager dhcpd: failover peer Uniq14subnet: I move from normal to startup
Feb 27 00:36:20 lager dhcpd: failover peer Uniq14subnet: peer moves from normal to communications-interrupted
Feb 27 00:36:21 lager dhcpd: failover peer Uniq14subnet: I move from startup to normal
Feb 27 00:36:21 lager dhcpd: failover peer Uniq14subnet: peer moves from communications-interrupted to normal

Now do the primary. Shut it down (logs on the secondary):

Feb 27 00:39:38 lager dhcpd: peer Uniq14subnet: disconnected
Feb 27 00:39:38 lager dhcpd: failover peer Uniq14subnet: I move from normal to communications-interrupted

Then start it up.

Feb 27 00:39:51 lager dhcpd: failover peer Uniq14subnet: peer moves from normal to normal
Feb 27 00:39:51 lager dhcpd: failover peer Uniq14subnet: I move from communications-interrupted to normal

Logs on the primary during this period:

Feb 27 00:39:51 drill dhcpd: [ID 702911 local7.info] failover peer Uniq14subnet: I move from normal to startup
Feb 27 00:39:51 drill dhcpd: [ID 702911 local7.info] failover peer Uniq14subnet: peer moves from normal to communications-interrupted
Feb 27 00:39:51 drill dhcpd: [ID 702911 local7.info] failover peer Uniq14subnet: I move from startup to normal
Feb 27 00:39:51 drill dhcpd: [ID 702911 local7.info] failover peer Uniq14subnet: peer moves from communications-interrupted to normal

So, the question is, what effect does your clean shutdown to
partner-down have? The dhcpd.conf man page is not exactly clear on
this, but maybe the switch to partner-down triggers the wait until MCLT
passes? When it is communications-interrupted it does not wait because
the remaining server cannot hand out any leases for the other server's
address ranges. I think a change to the script so that it doesn't
switch the remaining server to partner-down might work better.
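
One way to check the MCLT theory is to measure how long each server
stayed out of normal in the two sets of logs. The sketch below does
only that arithmetic; the timestamps are copied from the logs above,
the 300 comes from the quoted "mclt 300;" line, and the year is an
arbitrary placeholder because syslog does not record one. The function
name is made up for illustration.

```python
from datetime import datetime

# Syslog timestamps carry no year, so a placeholder year is prepended.
# ASSUMPTION: both timestamps of a pair fall on the same day, and the
# locale uses English month abbreviations.
PLACEHOLDER_YEAR = "2000"
FMT = "%Y %b %d %H:%M:%S"


def gap_seconds(start: str, end: str) -> int:
    """Seconds between two syslog-style timestamps such as 'Nov 28 09:14:59'."""
    t0 = datetime.strptime(f"{PLACEHOLDER_YEAR} {start}", FMT)
    t1 = datetime.strptime(f"{PLACEHOLDER_YEAR} {end}", FMT)
    return int((t1 - t0).total_seconds())


MCLT = 300  # seconds, from the quoted failover declaration

# (description, time the survivor left normal, time it returned to normal)
outages = [
    ("orderly shutdown, secondary bounced", "Nov 28 09:14:59", "Nov 28 09:20:00"),
    ("orderly shutdown, primary bounced", "Nov 28 09:20:16", "Nov 28 09:25:17"),
    ("kill -TERM, secondary bounced", "Feb 27 00:35:50", "Feb 27 00:36:20"),
    ("kill -TERM, primary bounced", "Feb 27 00:39:38", "Feb 27 00:39:51"),
]

for label, start, end in outages:
    gap = gap_seconds(start, end)
    print(f"{label}: {gap} s out of normal (mclt in quoted config: {MCLT} s)")
```

Both partner-down episodes last 301 seconds, one second more than the
configured MCLT, while the two communications-interrupted episodes
last 30 and 13 seconds. That is consistent with the partner-down
transition being what triggers the wait, though it does not prove it.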

As for the multiple instances of dhcpd, in the dynamic pools you need a
way to deny the fixed-address clients. Otherwise they could be given a
dynamic address. Unless you can come up with some sort of generic class
to do that, you are going to need to update both sets of dhcpd.conf
files with information about the static hosts, and this negates what
you're trying to do.

Alternatively what about using omapi to add/delete the fixed-address
host entries?  This avoids the server shutdown issue completely.
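
As a rough sketch of the omapi route, the helpers below only build the
text you would feed to omshell to create or remove a host object. The
command sequence follows the host example in the omshell man page. The
host name, MAC address, IP address, port 7911, and the function names
are all placeholders for illustration, and this assumes dhcpd.conf
already has an omapi-port (and, if you use one, an omapi-key)
configured.

```python
def omshell_add_host(name, mac, ip, server="localhost", port=7911,
                     key_name=None, key_secret=None):
    """Build omshell input that creates a fixed-address host object.

    ASSUMPTION: port 7911 and the key handling are illustrative; use
    whatever omapi-port / omapi-key your dhcpd.conf declares.
    """
    lines = [f"server {server}", f"port {port}"]
    if key_name and key_secret:
        lines.append(f"key {key_name} {key_secret}")
    lines += [
        "connect",
        "new host",
        f'set name = "{name}"',
        f"set hardware-address = {mac}",
        "set hardware-type = 1",  # 1 = ethernet
        f"set ip-address = {ip}",
        "create",
    ]
    return "\n".join(lines) + "\n"


def omshell_remove_host(name, server="localhost", port=7911):
    """Build omshell input that looks up a host by name and removes it."""
    lines = [
        f"server {server}",
        f"port {port}",
        "connect",
        "new host",
        f'set name = "{name}"',
        "open",
        "remove",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical host; nothing here talks to a real server.
    print(omshell_add_host("some-host", "00:80:c7:84:b1:94", "10.1.2.50"))
```

The resulting text can be piped to omshell, for example with
subprocess.run(["omshell"], input=script, text=True). Hosts created
this way are recorded by the running server rather than in dhcpd.conf,
so no restart is needed, but with failover you would presumably have
to create the host on both peers.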

Another solution might be to not run dhcp failover, but to use
something like Linux-HA (www.linux-ha.org and many others) to switch a
single IP address between two servers. Configure dhcpd with the
server-identifier parameter set to this IP address. Then your duff
clients that will only support a single dhcp server may work.
Personally I don't think that provides the same benefit as using dhcp's
built-in failover, but it may be better for your situation.

Anyway, a few things to think about and discuss.

regards,
-glenn