intermittent SERVFAIL for highly visible domains such as *.google.com

Tony Finch dot at dotat.at
Thu Jan 18 15:41:08 UTC 2018


Brian J. Murrell <brian at interlinx.bc.ca> wrote:
>
> In any case when this happens, it will last a few minutes until it
> resolves itself and/or I issue an "rndc reload".  That always seems to
> correct it if I don't care to wait it out.

Does the time to recovery correspond to the lame-ttl setting? The default
is 10 minutes - try reducing it and see if the outage becomes shorter.
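
For example, a minimal named.conf sketch (the 60 here is only an
illustration; the value is in seconds, the maximum is 1800, and 0
disables lame-server caching entirely):

    options {
        // how long named remembers that a server was lame for a
        // zone; the default is 600 seconds (10 minutes)
        lame-ttl 60;
    };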

When you have a failure, try `rndc flushtree` to drop the problematic
state more selectively - you might have to find out which nameservers
serve the broken domain and flush their domains too. (The google.com
nameservers are under google.com; GitHub's are under dynect.net and a
bunch of awsdns domains.)
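
Something like this - the exact nameserver domains to flush are
whatever `dig +short ns <domain>` reports:

    # google.com's nameservers are under google.com itself, so one
    # flush covers both the names and the nameserver state
    rndc flushtree google.com

    # for github.com you would also flush the domains its
    # nameservers live under, e.g. dynect.net and the awsdns domains
    rndc flushtree github.com
    rndc flushtree dynect.net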

> I have a db dump (rndc dumpdb) as well as some trace (rndc trace x10)
> while this is happening.  Is this enough?  If so, what should I look
> for as a cause of the SERVFAILs?

The trace might tell you where the SERVFAIL came from - you'll need to
read it carefully to work out how named handled the problem query. This
might or might not tell you the cause, depending on how clearly the gun is
smoking.
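
If you know roughly when the failure happened, grepping the debug
output for the query name is a quick way in (this assumes the default
debug log, named.run, written in named's working directory; adjust
for your logging configuration):

    # pull out every line mentioning the problem name, with line
    # numbers so you can read the surrounding context in the file
    grep -n 'google.com' named.run | less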

Look at the end of the dump - the address database, bad cache, and
servfail cache.
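
A rough way to jump to those sections (assuming the default dump
file, named_dump.db, in named's working directory; the exact section
headings vary a little between versions):

    rndc dumpdb -all
    # find the section markers near the end of the dump
    grep -nE 'Address database|Bad cache|SERVFAIL' named_dump.db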

> Do I need tracing enabled before the situation happens?

That will make it a lot easier, yes :-)

> What level (how many "rndc trace"s should I run)?

You can set the level directly with a number, like `rndc trace 11` -
level 11 is handy because it includes query and response packet dumps
(note that this is a 9.11 feature - in 9.9 you'll only get the
response packets).
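
For reference, the debug level is sticky until you reset it:

    rndc trace 11   # jump straight to debug level 11
    rndc trace      # with no number, bumps the level up by one
    rndc notrace    # set the level back to 0 when you are done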

Tony.
-- 
f.anthony.n.finch  <dot at dotat.at>  http://dotat.at/  -  I xn--zr8h punycode
Trafalgar: North or northeast 5 to 7. High, becoming very rough. Fair. Good.
