recover-wait period question

Wed Dec 13 14:43:43 UTC 2006

The example I gave was an actual real-world occurrence.  The secondary 
server was rebooted for maintenance and failed to boot.  It was going to 
be several hours before onsite personnel would be able to investigate, 
so we set the primary to partner-down by modifying the lease file (we 
haven't tackled OMAPI yet). 
Later that day, onsite personnel managed to boot the secondary server.  
At this point, the secondary entered recover and then quickly moved to 
recover-wait.  Unfortunately, we were not able to set the primary back 
to communications-interrupted before onsite personnel booted the 
secondary.  The primary entered a state I had never seen before, 
potential-conflict, and then flipped to another state I hadn't seen 
before:  shutdown (or something similar).  After that it would not hand 
out any addresses, and neither would the secondary.  I was able to get 
the primary to hand out addresses again, but I forget how.
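
For anyone wanting to see what "modifying the lease file" looks like, 
here is a rough sketch of the failover state record, assuming the usual 
ISC dhcpd leases-file format.  The peer name and timestamps are 
hypothetical, not copied from our servers.  As far as I know, the 
procedure is to stop dhcpd, edit the last failover state declaration in 
the file, and start dhcpd again.

```
# Last failover state declaration in dhcpd.leases (illustrative only).
# "dhcp-failover" is a placeholder peer name; the dates are made up.
failover peer "dhcp-failover" state {
  my state partner-down at 3 2006/12/13 09:00:00;
  partner state normal at 3 2006/12/13 08:00:00;
}
```

Only the "my state" line is changed by hand; the partner line is left 
as the server last recorded it.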

Later, the primary started saying "peer holds all free leases" even 
though there were plenty of free addresses left.  The secondary was 
still in the recover-wait period.  This was service-affecting, so I 
stopped the secondary, deleted the leases file, set MCLT to 5 and 
started the server.  This repaired the problem, although I'm sure it was 
a very bad idea. :) 

I have since set MCLT back to the same value as our min/default/max 
lease times (28800 seconds, or 8 hours, on that particular server 
group).  Perhaps the trouble would have been minimized had MCLT been set 
to 600 or so, although I suspect the same problems would still have 
occurred; they would just have resolved themselves, perhaps without us 
noticing.
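
For context, this is roughly where MCLT lives in dhcpd.conf.  It is a 
sketch assuming a standard ISC dhcpd failover declaration on the 
primary; the peer name, addresses, ports and the other tuning values are 
placeholders rather than our real configuration, and 600 is just the 
value speculated about above.

```
# Failover declaration on the primary (illustrative values only).
failover peer "dhcp-failover" {
  primary;
  address 192.0.2.1;            # placeholder address of this server
  port 647;
  peer address 192.0.2.2;       # placeholder address of the secondary
  peer port 647;
  max-response-delay 60;
  max-unacked-updates 10;
  mclt 600;                     # seconds; defined on the primary only
  split 128;                    # defined on the primary only
  load balance max seconds 3;
}
```

With mclt at 28800, a recovering peer waits on the order of 8 hours; 
with 600 the same wait would be about 10 minutes.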

It seems that the server that is NOT in recover-wait should be able to 
hand out the entire pool and merely notify the server that IS in 
recover-wait that it has done so.  Is that not the case?

David W. Hankins wrote:
> On Tue, Dec 12, 2006 at 11:14:51AM -0500, Darren wrote:
>   
>> however, respond to inform.  The primary will not hand out addresses 
>> that should be handed out by the secondary while the secondary is in 
>> recover-wait mode.  This means it is possible to run out of addresses 
>>     
>
> If the secondary is in recover-wait, hopefully your primary is in
> partner-down state (in the events you describe, anything else is
> either a bug or you're missing some events).
>
> In which case it will respond to all clients, and hand out free or
> backup leases alike (so long as STOS+MCLT has expired).
>
> So it's possible to run out of addresses, but only if all addresses
> are actively assigned, or if you run out of free addresses before
> STOS+MCLT expires (which may be before the secondary entered recover
> state, or may be approximately the same time).
>
> STOS is "Start Time Of Service" by the way.
>
>   
>> What is the purpose of this recover-wait period?
>>     
>
> There's a danger of duplicate allocation of any leases the secondary,
> in your example, handed out just prior to going down.  It's possible
> that neither the primary nor secondary would retain any information
> about these allocations.
>
> Since the surviving server (your primary) is the only one that has
> all the relevant state to avoid these duplicate allocations
> (STOS+MCLT), the secondary stays out of it.  Hence, recover-wait.
>
>