[Kea-users] Load-Balancing Network issue between Relay and Kea

Wed Jan 11 11:58:58 UTC 2023

On Tue, Jan 10, 2023, at 02:34, Marcin Siodelski wrote:
> For every client who sends a DHCPDISCOVER or DHCPREBIND to the partner 
> server and (finally) sets the "secs" field value greater than 
> "max-ack-delay", the other server bumps up its internal counter of 
> unacked clients. Again, it only does it when it has been unable to 
> communicate with the partner server over the control channel longer than 
> the configured "max-response-delay". A single successful heartbeat over 
> the control channel will clear the counters of the unacked clients and 
> make the server believe that the partner is healthy. It will also stop 
> looking at the "secs" and "Elapsed Time" values. The 
> "max-unacked-clients" no longer matters until the next communication 
> issue over the control channel.
>
> If the "max-unacked-clients" value is exceeded, the server can finally 
> transition to the partner-down state and handle both the traffic 
> directed to itself and to the inoperative partner. Since the state 
> transitions are only carried out after a heartbeat attempt, there may be a 
> slight delay between exceeding the "max-unacked-clients" value and 
> actually transitioning to the partner-down state.
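
To check my own understanding of the sequence described above, here is a rough Python model of it. This is only a sketch, not Kea's implementation: the class and method names are made up, counting distinct clients with a set is my assumption, and I assume the client's "secs" / "Elapsed Time" value has already been converted to milliseconds before it is compared with "max-ack-delay".

```python
from dataclasses import dataclass, field


@dataclass
class PartnerMonitor:
    """Hypothetical model of one server watching its HA partner."""
    max_response_delay_ms: int
    max_ack_delay_ms: int
    max_unacked_clients: int
    comm_interrupted: bool = False
    unacked: set = field(default_factory=set)
    state: str = "load-balancing"

    def on_partner_query(self, client_id: str, elapsed_ms: int) -> None:
        """Inspect a query that was directed to the partner server."""
        # "secs" / "Elapsed Time" is ignored while the control channel works.
        if not self.comm_interrupted:
            return
        # Assumption: each client is counted once, however often it retries.
        if elapsed_ms > self.max_ack_delay_ms:
            self.unacked.add(client_id)

    def on_heartbeat(self, ok: bool, ms_since_last_response: int) -> None:
        """Handle the result of a heartbeat attempt over the control channel."""
        if ok:
            # One successful heartbeat clears the unacked-client counters.
            self.comm_interrupted = False
            self.unacked.clear()
            return
        if ms_since_last_response > self.max_response_delay_ms:
            self.comm_interrupted = True
        # Transitions are evaluated only here, after a heartbeat attempt,
        # which is where the slight delay comes from.
        if self.comm_interrupted and len(self.unacked) > self.max_unacked_clients:
            self.state = "partner-down"
```

In this model a server that has already counted more than "max-unacked-clients" clients stays in its current state until the next failed heartbeat attempt, and a single successful heartbeat resets everything.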

This is *extremely* helpful and I want to express my sincere appreciation for you taking the time to compose and send it! It will probably be a good starting point for additional HA content in the ARM :-)

> I looked into our documentation and realized that although all of these 
> pieces are described there, it can be confusing because the ARM lacks a 
> sequence diagram or an example of how the failover process can look end 
> to end. That's something we should address.
>
> Going over the previous emails, I see that users can see different 
> failover strategies, depending on the types of failures they are likely 
> to experience in their setup. They are interesting cases, and we will 
> discuss them internally. We could consider some alternative failure 
> detection strategies, selectable with the HA configuration, but we 
> should be aware that there is no one-size-fits-all solution. There is always 
> a possibility that the true failure won't be detected or a false failure 
> will be detected, leading to a split-brain situation.

Totally understandable. For example, there could be an opt-in feature where the servers watch for unacked clients (incrementing the counter that is compared against max-unacked-clients) at all times, not just when there has been a communication failure between the peers. If that feature existed, it would address the situation for some of us in this thread, but it shouldn't be used by anyone whose servers may choose not to respond to requests for other valid reasons, since that would lead to a 'false failure' detection. Since the peers in an HA configuration should have identical configuration files, it's conceivable that a peer could track ack failures only for client requests that *it* would have answered if its companion were down; a companion's failure to respond to such requests would then almost certainly indicate an actual failure.
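
To make the idea a bit more concrete, the counting decision could look roughly like the following. This is purely a sketch of a feature that does not exist in Kea today; the function name and the "always_track" and "would_serve_locally" parameters are hypothetical.

```python
def should_count_unacked(elapsed_ms: int,
                         max_ack_delay_ms: int,
                         comm_interrupted: bool,
                         always_track: bool = False,
                         would_serve_locally: bool = True) -> bool:
    """Decide whether a query directed to the partner counts as unacked.

    always_track        -- hypothetical opt-in switch for the proposed feature
    would_serve_locally -- hypothetical: True if this server's own (identical)
                           configuration would answer the client if the
                           partner were down
    """
    # A client that has not been waiting long enough is never counted.
    if elapsed_ms <= max_ack_delay_ms:
        return False
    # Behavior as described earlier in the thread: count only after the
    # control channel has been interrupted.
    if comm_interrupted:
        return True
    # Proposed opt-in: also count while the control channel is healthy, but
    # only for clients this server would itself have answered.
    return always_track and would_serve_locally
```

With "always_track" left at False the function returns the same answers as the behavior described above, so the extra counting stays strictly opt-in.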

> It would be useful if you could please open tickets in Gitlab to 
> describe your failover scenarios and the desired behavior. Please 
> disregard it if you have already opened them.

I'll do that now and post another reply with the link.