[Kea-users] Help diagnosing (and potentially addressing) a possible performance problem?

Klaus Steden klausfiend at gmail.com
Tue Oct 10 10:33:31 UTC 2017


Hello Rasmus,

After about a week or so of analysis, it turns out that it was a couple of
factors working in concert, for the most part:

1. Lease times were too short (1H), resulting in request storms as entire
racks' worth of leases would expire roughly simultaneously, swamping the
server with requests. I changed the default lease time to 12H, applied some
monitoring to keep track, and lease counts recovered within an hour or so
and stabilized (see the first config sketch after this list).

2. Some racks timed out when requesting because of the distance between the
client and the DHCP server, whether simple network distance or packets lost
to asymmetric routing. One of the affected areas is in another segment
entirely with different routing and will eventually be firewalled off, so
the fastest way to resolve the issue was simply to spin up another DHCP
server and point that segment's switches' IP helpers at it instead of the
original DHCP server. As the two servers aren't contending for the same
leases or reservations, they can get along sharing the same database without
causing conflicts (I believe this scenario was actually explicitly tested by
the Kea dev team and found to be stable); see the second sketch after this
list.

3. I think the ALLOC_FAIL messages were red herrings from a rack or racks
that haven't yet been assigned scopes, so no leases are available to be
granted yet (a hypothetical illustration follows this list).
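
For anyone chasing the same problem, the change in point 1 boils down to the
global timers in kea-dhcp4.conf. A minimal sketch (43200 seconds = 12H; the
renew/rebind values are illustrative, following the usual 50% / 87.5% rule
of thumb, and not necessarily what we deployed):

  "Dhcp4": {
      # 12 hours, expressed in seconds
      "valid-lifetime": 43200,
      # T1/T2 at roughly 50% and 87.5% of the lease time
      "renew-timer": 21600,
      "rebind-timer": 37800
  }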
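
For point 2, assuming "the same database" means both instances point their
lease-database at one shared backend, the relevant stanza looks roughly like
this on each server (the backend type, hostname and credentials below are
placeholders, not our actual setup):

  "Dhcp4": {
      "lease-database": {
          # Both servers point at the same backend;
          # MySQL is shown here only as an example.
          "type": "mysql",
          "name": "kea",
          "host": "db.example.net",
          "user": "kea",
          "password": "secret"
      }
  }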
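
And for point 3, one plausible way to end up with those messages is a subnet
that is declared but not yet populated with a pool, which leaves the
allocation engine nothing to hand out (hypothetical addressing; this is just
my reading of "not yet assigned a scope"):

  "subnet4": [
      {
          # Rack subnet declared, but its pool hasn't been assigned yet
          "subnet": "10.53.0.0/24",
          "pools": [ ]
      }
  ]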

Thank you for the feedback -- I was able to work through these issues using
insights from your comments, a bit of "rubber duck" debugging with one of
our network engineers, and some instrumentation built with InfluxDB. :-)

cheers,
Klaus

On Thu, Oct 5, 2017 at 2:02 AM, Rasmus Edgar <regj at arch-ed.dk> wrote:

> Hi Klaus,
>
> I have seen something very similar on VMware with another application
> receiving a lot of UDP traffic. Unfortunately we never found a solution for
> it and switched to bare metal as a workaround, which has irked me ever
> since, so I'm interested in finding root causes for these kinds of
> problems.
>
> As far as I understand, and according to the netstat man page, Recv-Q is
> the count of bytes not yet copied by the user program connected to the
> socket.
>
> Do you have any special rules, execute anything, or do DNS lookups when
> handling DHCP requests?
>
> Have you read the comments on ALLOC_ENGINE_V4_ALLOC_FAIL?
>
> "% ALLOC_ENGINE_V4_ALLOC_FAIL %1: failed to allocate an IPv4 address after
> %2 attempt(s)
> The DHCP allocation engine gave up trying to allocate an IPv4 address
> after the specified number of attempts.  This probably means that the
> address pool from which the allocation is being attempted is either
> empty, or very nearly empty.  As a result, the client will have been
> refused a lease. The first argument includes the client identification
> information.
>
> This message may indicate that your address pool is too small for the
> number of clients you are trying to service and should be expanded.
> Alternatively, if you know that the number of concurrently active
> clients is less than the addresses you have available, you may want to
> consider reducing the lease lifetime.  In this way, addresses allocated
> to clients that are no longer active on the network will become available
> sooner."
>
> Br,
>
> Rasmus
>
> Klaus Steden wrote on 2017-10-05 03:03:
>
>
> Hi everyone,
>
> We've been using Kea successfully for several months now as a key part of
> our provisioning process. However, it seems like the server we're running
> it on (a VM running under XenServer 6.5) isn't beefy enough, but I'm not
> 100% confident in that diagnosis.
>
> There are currently ~200 unique subnets defined, about two-thirds of which
> are used to provide a single lease during provisioning, at which point the
> host in question assigns itself a static IP. There are 77 subnets that are
> actively in use (for IPMI), with the following lease attributes:
>
>   "valid-lifetime": 4000,
>   "renew-timer": 1000,
>   "rebind-timer": 2000,
>
> From what I'm seeing in the output of tcpdump, there are a LOT more
> requests coming in than replies going out, and *netstat* seems to confirm
> that:
>
> # netstat -us
> ...
> Udp:
>     71774 packets received
>     100 packets to unknown port received.
>     565 packet receive errors
>     4911 packets sent
>
> If I monitor *netstat* continuously, I see spikes in the Recv-Q for Kea's
> socket that fluctuate wildly, anywhere between 0 and nearly 500K (and
> sometimes higher) from moment to moment.
>
> The log also reports a lot of ALLOC_ENGINE_V4_ALLOC_FAIL errors after
> typically 53 attempts (not sure why 53, but that number seems to be the
> typical upper limit before failure is confirmed).
>
> I've been experimenting over the last hour or so with tuning various
> kernel parameters (net.ipv4.udp_mem, net.core.rmem_default,
> net.core.netdev_max_backlog, etc.) but those don't appear to make any kind
> of difference, and the RecvQ remains high.
>
> Is there any way I can tune the daemon to handle this kind of backlog, or
> a list of kernel tunables I should look at modifying? Is there a clearer
> way to determine whether I've got a genuine performance limitation that
> we're just now running into?
>
> I've got a bare metal machine temporarily helping carry the burden and it
> doesn't have these issues, but then again, it's not carrying the full load;
> I'm loath to dedicate a whole physical server just to DHCP, but if the load
> is going to remain high like this, maybe that's just what I have to do.
>
> thanks,
> Klaus
>
> _______________________________________________
> Kea-users mailing list
> Kea-users at lists.isc.org
> https://lists.isc.org/mailman/listinfo/kea-users
>
>
>