intermittent SERVFAIL for high visible domains such as *.google.com

Tony Finch dot at dotat.at
Thu Jan 18 17:46:29 UTC 2018


Brian J. Murrell <brian at interlinx.bc.ca> wrote:
> On Thu, 2018-01-18 at 15:41 +0000, Tony Finch wrote:
> >
> > The default is 10 minutes - try reducing it and see if the outage
> > becomes shorter.
>
> If it does, what is that telling me?

My hypothesis here is that `named` has marked all the nameservers for the
failing domain as lame, which leaves it nowhere to send queries for that
domain, so it returns SERVFAIL.

The address database dump confirms this guess, so there isn't any need to
fiddle with the lame-ttl unless you want to double-check.
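
If you want to spot this pattern in a dump without reading it by eye, here
is a minimal Python sketch. It assumes the entries look like the lines
quoted further down in this message (a leading `;`, the name, then
bracketed flags); the exact dump format may differ between BIND versions,
and `failed_names` and `SAMPLE` are names made up for illustration.

```python
import re

# Matches lines shaped like the dump excerpt quoted in this thread, e.g.
#   ; ns3.google.com [v4 TTL 7] [v6 TTL 7] [v4 failure] [v6 failure]
# An optional leading "> " quoting prefix is tolerated.
# ASSUMPTION: real dumps may use a different layout in other BIND versions.
ENTRY = re.compile(r"^(?:>\s*)?;\s+(\S+)\s+(\[.*\])\s*$")


def failed_names(dump_text):
    """Return names whose entry carries both [v4 failure] and [v6 failure]."""
    failed = []
    for line in dump_text.splitlines():
        match = ENTRY.match(line.strip())
        if not match:
            continue  # headers, "..." and unrelated lines are skipped
        name, flags = match.groups()
        if "[v4 failure]" in flags and "[v6 failure]" in flags:
            failed.append(name)
    return failed


# Two of the lines from the quoted dump, used as sample input.
SAMPLE = """\
; Address database dump
; ns3.google.com [v4 TTL 7] [v6 TTL 7] [v4 failure] [v6 failure]
; ns2.google.com [v4 TTL 7] [v6 TTL 7] [v4 failure] [v6 failure]
"""

if __name__ == "__main__":
    for name in failed_names(SAMPLE):
        print(name)
```

If every nameserver of the failing domain shows up in that list, that is
consistent with the hypothesis above.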

> > When you have a failure, try `rndc flushtree` to more selectively drop
> > problematic state - you might have to find out the nameservers of the
> > broken domain and flush them. (The google.com nameservers are under
> > google.com; GitHub's are under dynect.net and a bunch of awsdns
> > domains.)
>
> rndc flushtree takes a domain name though doesn't it?  In what case
> would I need to find nameservers?

The idea is to flush the state needed to resolve queries for the domain,
so as well as flushing the domain itself, you also need to flush its
nameservers - easy for Google, harder for GitHub.
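
As a rough sketch of what that means in practice, the Python below builds
the list of `rndc flushtree` invocations for a domain plus any of its
nameservers that are not already underneath it. It only constructs the
argument lists; it does not look up the NS records or run `rndc`. The
helper names and the `example.*` nameservers are hypothetical.

```python
def is_under(name, ancestor):
    """True if name equals ancestor or is a subdomain of it."""
    name = name.rstrip(".").lower()
    ancestor = ancestor.rstrip(".").lower()
    return name == ancestor or name.endswith("." + ancestor)


def flush_commands(domain, nameservers):
    """Build rndc argv lists: the domain first, then each nameserver
    that is not already covered by an earlier flush target."""
    targets = [domain.rstrip(".").lower()]
    for ns in nameservers:
        ns = ns.rstrip(".").lower()
        if not any(is_under(ns, target) for target in targets):
            targets.append(ns)
    return [["rndc", "flushtree", target] for target in targets]


if __name__ == "__main__":
    # The easy case: the nameservers live under the domain itself.
    google_ns = ["ns1.google.com", "ns2.google.com",
                 "ns3.google.com", "ns4.google.com"]
    for argv in flush_commands("google.com", google_ns):
        print(" ".join(argv))

    # The harder case: out-of-zone nameservers (hypothetical names).
    other_ns = ["ns1.example.net", "ns2.example.org"]
    for argv in flush_commands("example.com", other_ns):
        print(" ".join(argv))
```

In the first case a single flush of the domain covers everything; in the
second, each out-of-zone nameserver needs its own flush.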

> So, when I do rndc reload am I flushing the cache?  :-(

No, a reload will (in almost all cases) retain the cache - though it might
clear other state (I have not checked exactly what). I'm a bit surprised
it fixes your problem; maybe the address database gets flushed on a reload.

> ; Address database dump
> ...
> ; ns3.google.com [v4 TTL 7] [v6 TTL 7] [v4 failure] [v6 failure]
> ; ns2.google.com [v4 TTL 7] [v6 TTL 7] [v4 failure] [v6 failure]
> ; ns1.google.com [v4 TTL 7] [v6 TTL 7] [v4 failure] [v6 failure]
> ; ns4.google.com [v4 TTL 7] [v6 TTL 7] [v4 failure] [v6 failure]

OK, here's a very smoky gun.

I think this suggests that you have some kind of connectivity problem
between your DNS server and the nameservers of Google and the other
affected domains. You should check that large fragmented EDNS responses
get through OK, that TCP works OK, and that you don't have path MTU
discovery (PMTUD) problems.
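
One way to run those checks from the resolver host is to send the same
query over UDP (advertising a large EDNS buffer) and over TCP, then
compare the results. The Python below is a minimal sketch using only the
standard library; the function names, the 4096-byte buffer size and the
3-second timeout are my assumptions, not anything `named` itself uses.
You would need to pick a name and type whose answer is large enough to
exercise fragmentation.

```python
import socket
import struct


def build_query(name, qtype=1, payload_size=4096, query_id=0x1234):
    """Build a non-recursive DNS query carrying an EDNS0 OPT record."""
    # Header: ID, flags (0 = standard query, RD off), QD=1, AN=0, NS=0, AR=1
    header = struct.pack("!HHHHHH", query_id, 0, 1, 0, 0, 1)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # QCLASS 1 = IN
    # OPT RR: root name, TYPE 41, CLASS = advertised UDP payload size,
    # TTL field with the DO bit set, RDLEN 0.
    opt = b"\x00" + struct.pack("!HHIH", 41, payload_size, 0x00008000, 0)
    return header + question + opt


def probe_udp(server, query, timeout=3.0):
    """Send the query over UDP; return (reply length, TC bit set?)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(query, (server, 53))
        reply, _ = sock.recvfrom(65535)
    flags = struct.unpack("!H", reply[2:4])[0]
    return len(reply), bool(flags & 0x0200)


def _read_exact(sock, count):
    """Read exactly count bytes from a stream socket."""
    data = b""
    while len(data) < count:
        chunk = sock.recv(count - len(data))
        if not chunk:
            raise ConnectionError("connection closed early")
        data += chunk
    return data


def probe_tcp(server, query, timeout=3.0):
    """Send the query over TCP (2-byte length prefix); return reply length."""
    with socket.create_connection((server, 53), timeout=timeout) as sock:
        sock.sendall(struct.pack("!H", len(query)) + query)
        (length,) = struct.unpack("!H", _read_exact(sock, 2))
        reply = _read_exact(sock, length)
    return len(reply)
```

Call `probe_udp(addr, build_query(name, qtype))` and
`probe_tcp(addr, build_query(name, qtype))` with the IPv4 address of one
of the failing nameservers. A UDP timeout where TCP succeeds, or the
reverse, points at the kind of path problem described above.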

> > and servfail cache.
>
> Non-existent section in my database dump.

Ah, the servfail cache is another 9.11 feature.

Tony.
-- 
f.anthony.n.finch  <dot at dotat.at>  http://dotat.at/  -  I xn--zr8h punycode
Hebrides, Bailey, Fair Isle, Faeroes: Cyclonic, mainly west, 5 to 7. Rough or
very rough, occasionally moderate in Fair Isle and Faeroes. Squally wintry
showers. Good occasionally poor.


More information about the bind-users mailing list