[Kea-users] HA survivor not taking over after communication interrupted with partner

Mon Oct 9 17:52:53 UTC 2023

Hello Alex,

Your configuration sets the `max-unacked-clients` parameter to 5. This
means that when the communication between the two servers is interrupted,
the surviving server starts checking whether the other server still
responds to DHCP queries. It assumes that the other server is not
responding to a client when the `secs` field value that the client sets
in its discover or rebind messages exceeds the value of `max-ack-delay`
(5 seconds in your case). It must find at least 5 distinct clients
retransmitting with a bumped `secs` field value before it may assume that
the partner is dead and take over.
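
To make the counting rule above concrete, here is a minimal Python sketch
of it. This is not Kea's actual code: the class and method names are
hypothetical, and the "at least 5" boundary simply follows the wording
above (check the ARM for the exact comparison the server uses).

```python
from dataclasses import dataclass, field


@dataclass
class UnackedTracker:
    """Toy model of the check the surviving server performs (hypothetical names)."""

    max_ack_delay_ms: int = 5000   # `max-ack-delay`, in milliseconds
    max_unacked_clients: int = 5   # `max-unacked-clients`
    unacked: set = field(default_factory=set)

    def observe(self, client_id: str, secs: int) -> None:
        # `secs` is what the client puts in its discover/rebind message.
        # A client counts as "unacked" once secs exceeds max-ack-delay.
        if secs * 1000 > self.max_ack_delay_ms:
            self.unacked.add(client_id)  # a set, so each client counts once

    def partner_presumed_down(self) -> bool:
        # A value of 0 bypasses the mechanism entirely.
        if self.max_unacked_clients == 0:
            return True
        # Assumption: "at least N distinct clients", as described above.
        return len(self.unacked) >= self.max_unacked_clients


tracker = UnackedTracker()
for n in range(4):
    tracker.observe(f"client-{n}", secs=6)
tracker.observe("client-0", secs=12)   # same client again: not counted twice
print(tracker.partner_presumed_down())
tracker.observe("client-4", secs=6)    # fifth distinct client
print(tracker.partner_presumed_down())
```

With only a handful of test clients, or with clients that always send
`secs` set to 0, the threshold in this model is never reached.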

This process can take a while, depending on the lease lifetime (rebind
time), the number of new or rebinding clients, etc.

You can read more about the parameters controlling the failover process 
in the Kea ARM:

https://kea.readthedocs.io/en/kea-2.4.0/arm/hooks.html#load-balancing-configuration

In your failover scenario, please make sure that after the communication
failure the clients properly set the `secs` field value upon
retransmissions. If you want to bypass this mechanism, you can set
`max-unacked-clients` to 0.
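
If it helps, the change could be applied to a saved configuration with a
small script along these lines. This is only a sketch: the helper name is
hypothetical, the library path and the values in the example are
illustrative rather than taken from your files, and it assumes the
configuration is plain JSON without comments (as written out by the API).

```python
import json


def disable_unacked_check(config: dict) -> dict:
    """Set `max-unacked-clients` to 0 in every HA relationship (edits in place)."""
    for hook in config["Dhcp4"].get("hooks-libraries", []):
        params = hook.get("parameters") or {}
        # Only the HA hook library carries a "high-availability" list.
        for relationship in params.get("high-availability", []):
            relationship["max-unacked-clients"] = 0
    return config


# Illustrative fragment only; a real configuration has many more settings.
example = {
    "Dhcp4": {
        "hooks-libraries": [
            {
                "library": "/usr/lib/kea/hooks/libdhcp_ha.so",  # assumed path
                "parameters": {
                    "high-availability": [
                        {
                            "this-server-name": "server2",
                            "mode": "hot-standby",
                            "max-ack-delay": 5000,
                            "max-unacked-clients": 5,
                        }
                    ]
                },
            }
        ]
    }
}

print(json.dumps(disable_unacked_check(example), indent=2))
```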

Marcin Siodelski
Senior Software Engineer
ISC

On 9.10.2023 19:37, Alexandre Lessard wrote:
> Hello everyone,
> 
> I'm new here! I work for an ISP as a network administrator, and I
> have about 7 years of experience doing all sorts of IT work for this
> company. I've been using and configuring Kea DHCP for about 2 weeks
> now. Prior to that, I was using ISC DHCP, but since it's now
> deprecated, I'm preparing two new servers to migrate all customers
> to them.
> 
> The setup:
> The setup is DHCP relay with two Kea servers in HA hot-standby. There
> are three particularities that I want to mention right away.
> 
> First, because I couldn't find an out-of-the-box solution, I made a
> script that replicates the configuration through the API on both
> servers when they are restarted. I don't think it interferes with the
> service, as it runs prior to the service startup, but I don't want to
> overlook it either.
> 
> Second, they both have an IP configured on their loopback interface
> to be used somewhat like an anycast address. That being said, I don't
> use it for the HA; it's only used by the relay agents.
> 
> Third, they are Proxmox containers. I don't think that's a problem,
> but tell me if I'm wrong and I will make VMs for them.
> 
> My problem:
> When I simulate an outage by stopping server1, only 2 (test2 and
> test3) of the 4 subnets eventually recover. Even when they recover, it
> takes about 5 minutes. As far as I understand, the takeover is
> supposed to be configured at 1 minute. The other two subnets never
> recover.
> Why do some subnets never recover?
> Why do the 2 that recover take so long?
> 
> I observed that the state of server2 stays at "hot-standby" even when
> the communication with the remote server is interrupted.
> 
> I have been working on fixing this for more than 10 hours now, and I
> really don't know what to look for anymore.
> 
> The config:
> The Control Agents have an almost default configuration, except for
> the http-host, which is set to the IP address of the interface that
> receives the requests (eth0).
> 
> The Dhcp6 server is disabled.
> 
> As for the Dhcp4 config, it has been saved through the API, so it is
> massive! All the default settings have been written to the config
> file. For this reason, I won't post it here unless required, to avoid
> sending a wall of config. I've put it in a public repository on GitHub:
> https://github.com/AlexTargo/Kea-Dhcp
> 
> If I'm missing anything, let me know, and I'll share it as soon as possible.
> 
> I hope someone has good pointers for me.
> 
> Regards
> Alex