Tuning for lots of SERVFAIL responses

John Miller johnmill at brandeis.edu
Thu Feb 18 20:29:24 UTC 2016


Thanks for the reply, Tony. With the recent glibc bug, I figured most
folks would be off putting out those fires!

On Thu, Feb 18, 2016 at 3:04 PM, Tony Finch <dot at dotat.at> wrote:
> John Miller <johnmill at brandeis.edu> wrote:
>
>> A couple of weeks ago, we experienced an outage on our external
>> Internet links.  Ideally, this shouldn't affect queries for internal
>> resources - we expect those queries to continue to be answered.
>
> We've had a few connectivity losses over the last year due to floods and
> DDoS attacks, so I have more experience with this than I would like.
>
> It's tricky. There are a surprising number of external dependencies on
> supposedly internal resources. For instance we have a web single-sign-on
> service which deliberately avoids using the Typekit font specified by our
> web designers, but it's still "slow" when we lose external connectivity
> because (I think) of attempts at TLS OCSP lookups :-(

We've run into similar issues in the past: people were hitting a
captive portal that blocked access to the CAs' OCSP responders, so
certificate verification stalled.  We essentially had to poke holes
in the captive portal to let that traffic through.
thing isn't so much an issue any longer, but I'd bet you some of our
service outages were due to OCSP lookups failing.  This is a case
where I really wish more things would use OCSP stapling - there's no
reason not to for internal TLS-protected resources.

>
>> It's my understanding that by default, BIND limits the number of
>> concurrent recursive queries to 1000, so during an outage like this
>> we need to raise the recursive-clients limit to keep answering.
>
> Our recursive servers are built using BIND 9.10's ./configure
> --with-tuning=large option, and I have bumped up recursive-clients to
> 12345 (a number that I guessed but which turned out to be about right). We
> normally deal with about 1500-2000 qps on each server; during outages I
> observed this increase by a factor of 3 or 4. However the number of
> active clients went up to nearly 10,000 (it's normally negligible). The
> other reason 12345 is about right is that the default socket limit is
> 20,000 and each client seems to need two sockets.
>

We're not quite there with regard to traffic volume: we're somewhere
around 150 qps per server (maybe 500-600 qps campus-wide), but as
happened to you, we saw the same 3-4x spike in volume.  Likewise, we
went from roughly 20 active clients per server (going off of UDP
socket stats from sar) to over 1000.  The servers themselves were
quietly twiddling their thumbs at 0.1 load: strictly a case of the
application doing the throttling.
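
For anyone following along at home, the knob in question lives in the
options block of named.conf.  A minimal sketch, using Tony's numbers
rather than a recommendation:

    options {
        // Stock default is 1000 concurrent recursive lookups.
        recursive-clients 12345;
    };

You can watch how close you are to the limit with rndc status, which
prints a "recursive clients: current/soft/hard" line - that beats
reverse-engineering it from sar socket counts.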

>> What I'm curious about is how BIND behaves when it can't finish
>> iterative queries: when someone queries for yahoo.com, and the root
>> (or .com, yahoo.com) nameservers aren't reachable, does BIND then
>> issue a SERVFAIL response (assuming yes)?
>> How long will BIND wait before returning SERVFAIL?
>> At what point does BIND assume a domain is down altogether?  What's
>> the behavior then?
>
> Good questions :-)
>
> Tony.
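
In case anyone else wants to poke at this empirically, timing the
failure from a client is easy enough.  A sketch (substitute your own
resolver address):

    # Query the recursive server for an external name while the
    # uplink is down; dig's "Query time" line shows how long named
    # took to give up and return SERVFAIL.
    dig @127.0.0.1 yahoo.com +tries=1 +time=30

My reading of the ARM is that resolver-query-timeout (10 seconds by
default in 9.10, capped at 30) bounds how long named retries upstream
servers before answering SERVFAIL, but I'd welcome confirmation from
someone who knows the internals.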

