High recursive client counts

Tue Mar 25 15:52:01 UTC 2014

Hi Jason,

I've experienced similar things in the past on 9.8.  Since then we've
moved to the latest 9.9, but I don't think this is at all version specific
(that said, you could obviously try upgrading).  I don't have an exact
solution for you, but here are some things to check and some personal
experiences that might help.

Are the servers in question VMs or bare metal?  Several years back we made
a big push to virtualize everything, and after migrating recursive DNS it
worked great for a while...but as sites grew we hit a tipping point where
VM-based resolvers seemed to introduce additional query latency.  These
servers were running far below BIND's capabilities, not taxing virtual
resources, optimized per all available BIND/OS/virtualization knobs, and
using enterprise (read: not just the latest free bits slapped together and
expected to work) network, server and hypervisor tech.  I spent several
months trying to improve the situation and find a real root cause, but on
a whim I set up an identical cluster on bare metal...no more problems.  I
didn't have time to dig further, so we avoid virtualization on busy
resolvers (for now at least).

As your client count has grown...are there any bottlenecks on your network
that might be unaccounted for?  Beyond bandwidth, I'm thinking of things
like resource-constrained firewalls (are the resolvers in a DMZ?) which
could cause queries to be dropped, timed out, or retried.  I've seen
issues where overworked NetOps teams got behind on capacity
planning/upgrades, and as clients/#DMZs grew the firewalls couldn't keep up
and created all sorts of issues not related to BIND itself.

When the recursive client count backs up, you know more queries than usual
are taking longer than expected to get answers...if this is not related to
BIND itself, your servers, or the network...a bit of spelunking is in
order.  Capture some packets with tcpdump, and take a look at the rndc
recursing output.  Look at the queries causing delays, dig them
manually from various locations, and try to find a common theme.  If there
is no common theme to the query destinations, then look even closer at
your network.  :-)
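
If it helps with the "common theme" hunt, here is a rough Python sketch
that tallies pending queries by parent domain.  It assumes you feed it the
file that rndc recursing writes (named.recursing by default, in named's
working directory) and that each in-flight query shows up in it as a quoted
'name/type/class' token; that pattern is my assumption, so check it against
your own dump first.  The function names are made up for illustration.

```python
import re
import sys
from collections import Counter

# Assumption: each in-flight query appears in the dump as a quoted
# 'name/type/class' token, e.g. 'www.example.com/A/IN'. Verify this
# against your own named.recursing file before relying on the pattern.
QUERY_RE = re.compile(r"'([^'/\s]+)/([A-Za-z0-9]+)/([A-Za-z0-9]+)'")


def parent_domain(name, labels=2):
    """Return the last `labels` labels of a DNS name, lowercased."""
    parts = [p for p in name.lower().rstrip(".").split(".") if p]
    return ".".join(parts[-labels:]) if parts else "."


def common_themes(lines, labels=2):
    """Count pending queries per parent domain across all given lines."""
    counts = Counter()
    for line in lines:
        for name, _qtype, _qclass in QUERY_RE.findall(line):
            counts[parent_domain(name, labels)] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python themes.py /path/to/named.recursing
    with open(sys.argv[1]) as dump:
        for domain, n in common_themes(dump).most_common(10):
            print(f"{n:6d}  {domain}")
```

If one or two domains dominate the tally, dig those names by hand from a
few vantage points; if the tally is flat across many unrelated domains,
that points back toward the network or the server itself.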

hth

-----Original Message-----
From: Jason Brandt <jbrandt at fsmail.bradley.edu>
Date: Tuesday, March 25, 2014 at 10:31 AM
To: "bind-users at lists.isc.org" <bind-users at lists.isc.org>
Subject: High recursive client counts

>We recently migrated to BIND for our internal resolvers, and since the
>migration, we are experiencing periods of high recursive client counts,
>which will at times cause the BIND server to quit responding.  As a
>workaround, I've been able to point the BIND server to a forwarder,
>bypassing the root hints, to restore stability, but this morning even
>with the forwarder, our count spiked.
>
>
>We are using Ubuntu 12.04 LTS, BIND version 9.8.1-P1.  The server is
>configured strictly as a resolver, and is not authoritative for any
>domains.
>
>
>We have approximately 15-20k client devices on campus.  Our average
>recursive client count is between 10 and 50.  When the spikes occur,
>counts will get upwards of 3-4k (this morning: recursive clients:
>2358/9900/10000). 
>
>
>What are possible causes of high recursive client count?  What can be
>done to prevent this or tune around it?  Obviously raising the max
>clients doesn't solve the problem, and the forwarder seemed to help, but
>apparently is still susceptible to the issue.
>
>
>Any suggestions would be greatly appreciated.
>
>
>-- 
>Jason K. Brandt
>Systems Administrator