DHCP peer failure and pool exhaustion...

Mon Sep 9 18:51:12 UTC 2013

See inline

SC> On 9 September 2013 18:50, Gregory Sloop <gregs at sloop.net> wrote:
>> So, does that sound right?

SC> Yes, your assumptions are correct. It sounds like the peer ran out of
SC> leases from its side of the pool.

>> If so, is there any way to "automagically" put the peer in partner
>> down state if it's not able to be contacted in X amount of time?

SC> No, and that's generally a bad thing to do. That type of DR scenario
SC> should not be automated, as it could be triggered accidentally by a
SC> number of situations.

Yes, I can see how that could happen - and if I understand things
correctly, it would be a bad thing(tm) if the peer that was
automagically put into "partner-down" suddenly came up.

BUT!

See below.

>> Or, perhaps I should simply say - I don't want the DHCP server to end
>> up without available addresses to lease, if the peer goes off-line and
>> I'm not able to do some manual process to intervene. That's my goal.
>>
>> With that goal in mind, what's the best way to accomplish it?

SC> There is no way on the DHCP side that would resolve your current
SC> situation. The easiest way is to fix the network side of things,
SC> increase the number of IPs you have in the pools by increasing the
SC> size of the networks or add additional networks to cope with your
SC> capacity.

I understand this IS the answer - but I don't like it, and while I'm
not complaining about your giving it, Steven, I think it's a *sucky*
answer. :)

It is what it is, but IMO, it needs changing.

SC> It also seems that you have a high churn rate of clients, without
SC> enough IPs if they all came online at once. Your leases are being
SC> recycled rather than reused by the same client repeatedly, so when you
SC> get a spike there is no existing expired lease for the client to grab
SC> hold of, and you end up depleting the remaining half pool quickly.

SC> You should calculate the pools to ensure that there are enough IPs for
SC> all the clients, plus some buffer space to allow for a host to fail. I
SC> would usually size the pool at 120-150% of the number of clients
SC> depending on the size of the subnet to start with.
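
To put rough numbers on the 120-150% rule of thumb above, here is a
minimal Python sketch. The function names and the example client count
are made up for illustration; nothing here comes from dhcpd itself. It
rounds the target up and then finds the smallest IPv4 subnet whose
usable host count covers it (ignoring addresses reserved for routers,
static hosts and so on).

```python
def pool_size_for(clients, percent=120):
    """Addresses to provision for `clients` hosts at `percent` of the
    client count (120-150 is the range suggested above).
    Integer arithmetic keeps the rounding exact."""
    if clients < 0 or percent < 100:
        raise ValueError("clients must be >= 0 and percent >= 100")
    return (clients * percent + 99) // 100  # ceiling division


def smallest_prefix(addresses):
    """Longest IPv4 prefix length whose usable host count
    (2^(32 - prefix) - 2) is at least `addresses`."""
    for prefix in range(30, 0, -1):
        if 2 ** (32 - prefix) - 2 >= addresses:
            return prefix
    raise ValueError("no IPv4 prefix is large enough")


# Hypothetical example: 200 clients
low = pool_size_for(200, 120)    # 240 addresses
high = pool_size_for(200, 150)   # 300 addresses
print(low, smallest_prefix(low))     # 240 fits in a /24
print(high, smallest_prefix(high))   # 300 needs a /23
```

Note how the upper end of the range already pushes a 200-client network
out of a /24, which is why the starting subnet size matters.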

Yeah, I understand this too. It's simply a failing of the network I'm
working in, and it's something we can't just change easily.
[Though we are working to change it.] There is a lot of churn. Yeah,
it's ugly too.

However, I *still* think that the DHCP server fail-over/peer could
work in a way to handle this. I understand it probably requires some
structural changes and some heavy lifting, but I think it's generally
what people expect fail-over/peering to look like - not what we
currently have.

Yes, churn in this pool is very high - I don't like it and I'm trying
to resolve it. However, I don't think anything less than say 50-70%
free pool overhead will really fix/resolve the situation I fell into.
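
A back-of-the-envelope way to see why the overhead has to be that
large is sketched below in Python. It is a deliberately crude model,
not a description of what dhcpd does internally: it assumes the
surviving server can only hand out its own share of the free addresses
while the peer is unreachable (half, for an even split), that every new
client needs a fresh address, and that nothing is reclaimed during the
outage. The function name and the numbers are hypothetical.

```python
def hours_until_exhaustion(pool_size, active_leases,
                           new_clients_per_hour, local_share=0.5):
    """Estimate how long the surviving server can keep serving new
    clients from its own share of the free addresses.

    Assumptions (simplifications, labeled as such):
      - free addresses are divided between the peers by `local_share`
      - each new client consumes one previously free address
      - no leases expire back into the local free share meanwhile
    """
    if not 0.0 <= local_share <= 1.0:
        raise ValueError("local_share must be between 0 and 1")
    free = max(pool_size - active_leases, 0)
    local_free = int(free * local_share)
    if new_clients_per_hour <= 0:
        return float("inf")
    return local_free / new_clients_per_hour


# Hypothetical example: 250-address pool, 150 active leases,
# 25 never-before-seen clients arriving per hour.
print(hours_until_exhaustion(250, 150, 25))  # 2.0
```

Under those assumptions, even a pool with about two-thirds spare
capacity lasts only a couple of hours at that churn rate, since only
half of the spare addresses are available to the surviving server.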

Longer leases would probably also help, so that, if the peer's
"downage" is shorter than the lease, the server would simply renew the
already existing lease under MCLT.

But IMO, those are all *band-aids* for a fail-over protocol that has
some very significant problems. They get one *around* the problem, but
leave the basic failing in-place. [And if I were the benevolent
dictator-for-life, Hobbes and I would make some changes around here!
:) ]

[And while my tone about the protocol is "harsh" - please don't
construe that to be harsh about the "answer" or your help. I greatly
appreciate the help, but disagree with the selected "standard." :) ]

---
So any other thoughts on how I can live with the standard more easily
until it gets fixed - if ever?

Thanks much!
Greg