Catastrophic failure and recovery

Mon Jun 25 22:15:31 UTC 2018

Ok, I do see somewhat similar lease information in each lease file [from each peer.]
[I thought each peer essentially only kept track of its own leases, with some communication/coordination data.]

Yet the details in the lease records don't match exactly.

So, here's an example:

[This lease example is from the peer who has not issued the lease.]
starts 1 2018/06/25 17:13:25;
ends 1 2018/06/25 21:13:25;
tstp 6 2018/06/23 04:26:08;
tsfp 1 2018/06/25 23:13:25;
atsfp 1 2018/06/25 23:13:25;
cltt 3 2018/06/06 00:02:22;
binding state active;
next binding state expired;

[This lease example is from the peer who *has* issued the lease.]
starts 1 2018/06/25 17:13:25;
ends 1 2018/06/25 21:13:25;
tstp 1 2018/06/25 23:13:25;
tsfp 1 2018/06/25 23:13:25;
atsfp 1 2018/06/25 23:13:25;
cltt 1 2018/06/25 17:13:25;
binding state active;
next binding state expired;
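
To make the mismatch easier to see, here is a minimal Python sketch that diffs the timestamp fields of the two records above. The helper names (parse_times, diff_fields) are made up for illustration and are not part of ISC DHCP; the regular expression only covers the fields shown in these two examples.

```python
import re

# Timestamp fields copied from the two lease records above.
NON_ISSUING_PEER = """\
starts 1 2018/06/25 17:13:25;
ends 1 2018/06/25 21:13:25;
tstp 6 2018/06/23 04:26:08;
tsfp 1 2018/06/25 23:13:25;
atsfp 1 2018/06/25 23:13:25;
cltt 3 2018/06/06 00:02:22;
"""

ISSUING_PEER = """\
starts 1 2018/06/25 17:13:25;
ends 1 2018/06/25 21:13:25;
tstp 1 2018/06/25 23:13:25;
tsfp 1 2018/06/25 23:13:25;
atsfp 1 2018/06/25 23:13:25;
cltt 1 2018/06/25 17:13:25;
"""

# Matches lines such as "tstp 6 2018/06/23 04:26:08;".
FIELD_RE = re.compile(
    r"^\s*(starts|ends|tstp|tsfp|atsfp|cltt)\s+(\d)\s+(\S+\s+\S+);", re.M
)


def parse_times(text):
    """Map each timestamp field name to its 'N date time' value."""
    return {name: f"{num} {stamp}" for name, num, stamp in FIELD_RE.findall(text)}


def diff_fields(a, b):
    """Return the fields present in both records whose values differ."""
    return {k: (a[k], b[k]) for k in a if k in b and a[k] != b[k]}


differences = diff_fields(parse_times(NON_ISSUING_PEER), parse_times(ISSUING_PEER))
for name, (left, right) in sorted(differences.items()):
    print(f"{name}: non-issuing={left}  issuing={right}")
```

For these two records only tstp and cltt differ; starts, ends, tsfp and atsfp are identical on both peers.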

So, let's assume the peer that did issue the lease dies and we set up a "new" peer with only the configuration.
The failover peer that didn't issue the lease will gather enough data, simply from communicating with the still-active peer, to know that it is the peer responsible for this lease, along with the prior lease data, and will simply come back up and rebuild the lease file properly. Correct?

[That seems reasonable, given what I see in the leases file - but just wanting to be sure I've not assumed something incorrectly from the response.]

A few follow-up questions:
--Why is: tstp 6 2018/06/23 04:26:08; in one vs tstp 1 2018/06/25 23:13:25; in the other?
[The docs I see say that this "indicates what time the peer has been told the lease expires." But this would seem to indicate that the two peers think the lease expires at different times.]

--I can't find any documentation describing what the numbers after starts, ends, tstp, tsfp, atsfp, etc. mean [1, 6, 3, etc.].
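
One reading that fits all three sample values is that the number is a day-of-week code with 0 = Sunday, which is how the dhcpd.leases(5) man page appears to describe the date format. A minimal Python sketch to check that reading against the timestamps from the lease examples in this message (weekday_number is a made-up helper name):

```python
from datetime import datetime


def weekday_number(stamp):
    """Return the day of week for 'YYYY/MM/DD HH:MM:SS', with 0 = Sunday."""
    parsed = datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S")
    # datetime.weekday() counts from 0 = Monday; shift so that 0 = Sunday.
    return (parsed.weekday() + 1) % 7


# (number seen in the lease file, timestamp) pairs from the examples.
samples = [
    (1, "2018/06/25 17:13:25"),
    (6, "2018/06/23 04:26:08"),
    (3, "2018/06/06 00:02:22"),
]

for seen, stamp in samples:
    print(stamp, "lease file:", seen, "computed:", weekday_number(stamp))
```

If the computed value matches the number in the lease file for every sample, the number is only a human-readable weekday and carries no extra lease state.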

Thanks again!

pl> The way you describe is how it would work if you didn't have
pl> failover set up at all.  With failover set up, the "new" server,
pl> when it connects to the existing one, will get a list of all the
pl> current leases and such.  It will then enter the "recover" period
pl> where it won't hand any leases out.  "Recover" is the length of
pl> MCLT (from the failover configuration).  Once that period has
pl> passed, both servers will operate as normal.
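
As a rough illustration of the timing described above, this Python sketch works out when a rebuilt peer would start handing out leases again, taking the description at face value (the wait equals MCLT, counted from when the new peer connects). The function name and the 3600-second MCLT are assumptions for the example only, not values from this thread.

```python
from datetime import datetime, timedelta


def recover_wait_ends(connected_at, mclt_seconds):
    """Return the time at which the 'recover' wait of one MCLT is over."""
    return connected_at + timedelta(seconds=mclt_seconds)


# Hypothetical values: peer reconnects at 22:15:31 with mclt set to 3600.
connected_at = datetime(2018, 6, 25, 22, 15, 31)
resume_at = recover_wait_ends(connected_at, 3600)
print("normal operation resumes at", resume_at.isoformat(sep=" "))
```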

pl> ----- Original Message -----
>> From: "Gregory Sloop" <gregs at sloop.net>
>> To: "Users of ISC DHCP" <dhcp-users at lists.isc.org>
>> Sent: Monday, June 25, 2018 1:29:59 PM
>> Subject: Catastrophic failure and recovery

>> So, in the case I'm interested in here, I've
>> got a pair of peers [failover].
>> [ISC/We really should pick a different name than failover, because it's
>> essentially load-balancing with redundancy, but I digress :) ]

>> Now while I'm using two peers, I think the question I'm asking about will be the
>> same regardless of peers or a single server...

>> So, let's assume the DHCP server [or a peer] dies. Assume we lost a disk.
>> Assume I've got configs, but no leases file.

>> What's the best recovery method?

>> ---
>> I assume we'll simply put the configurations back on a "new" server. [or peer]
>> Turn it on and bring it up. [In the peer setup, let it communicate with the
>> other peer.]

>> Since it won't have a record of any leases [that the dead-peer/old-server
>> actually leased] we'll have a bit of a mess.
>> But, we'd hope that most machines would already have a lease, and would ask for
>> renewal of that lease.
>> The server, I think, would generally grant that lease renewal on the same IP.
>> [Even though it has no record of it initially.]

>> "New" machines just powered up, may/will ask for new addresses, and may "steal"
>> a lease from an active client.
>> However, if the DHCP server can use ping-check [and is set to] AND the station
>> isn't firewalled or otherwise prevented from receiving/responding to the
>> ping-check, then the DHCP server will realize there's an active client using
>> the address and will avoid leasing that address.

>> If the active lease is on a machine that's off and returns to the network
>> [before the end of the lease] I'm not sure of the result. I *think* it will
>> attempt to confirm the lease when it comes back on, will get a NAK and be
>> forced to get a new lease.

>> Thus, generally, using best practices, the result of a catastrophic loss of a
>> DHCP server shouldn't be too disruptive.
>> [Provided it can be replaced fairly quickly before too many machines lose their
>> current lease.]
>> The above setup will be a lot cleaner if there's not much/any IP address churn -
>> in that, for a particular pool, there are enough addresses to give every machine
>> an address simultaneously. If there's a lot of churn it will be substantially
>> messier, and machines will see far less stability in IP address assignment.
>> [But there wasn't a lot of stability to start with, so we've probably only
>> increased the churn rate some.]

>> Does that sound about right?
>> I'm sure there are use cases I'm not considering because I don't have those
>> configurations - but am I missing anything serious?

>> ---
>> On a side note - is it worth capturing [backing up] the leases file, say at a
>> rate of 0.5 times the lease length? [The idea would be to have a reasonably
>> current leases file that might be 80%+ right. Or is this likely to cause more
>> problems than no leases file at all.]

>> Pointers to FAQ/Docs etc gladly accepted!

>> TIA
>> -Greg.

-- 
Gregory Sloop, Principal: Sloop Network & Computer Consulting
Voice: 503.251.0452 x82
EMail: gregs at sloop.net
http://www.sloop.net