Odd failure - poisoning?

Kevin Darcy kcd at daimlerchrysler.com
Tue Jan 10 23:27:06 UTC 2006


Karl,
I can't say I've ever run into a problem like that. However, in terms of 
troubleshooting, you could try turning on debugging, dumping the cache, 
and/or issuing the new "recursing" command of rndc, which supposedly 
writes the state of all recursing queries to the file "named.recursing". 
Some combination of those three, or all three of them, should give you a 
pretty good handle on what the nameserver is doing internally.
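
For instance (a rough sketch only; the exact syntax depends on your 
BIND 9 version, and "recursing" exists only in recent releases):

    rndc trace 3          # raise the debug level; output goes to named.run by default
    rndc dumpdb -cache    # dump the cache to named_dump.db (-cache needs 9.3 or later)
    rndc recursing        # write in-progress recursive queries to named.recursing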

If it were me, though, I'd probably just fall back on my old habits of 
mimicking what the nameserver is doing by initiating queries from the 
nameserver itself all of the way down the delegation chain. It's 
tedious, and you need to be careful about 
recursion-versus-non-recursion, what interface(s) you issue your queries 
from (if that matters), etc. The +trace option to dig does more-or-less 
the same thing, although it doesn't check every nameserver at each level 
of the delegation tree. I suppose I could write a script...
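
By hand, the walk looks something like this (the name and servers below 
are just placeholders; substitute one of the names that was failing and 
whatever servers each referral hands back):

    # start at a root server, recursion off
    dig @a.root-servers.net www.example.com A +norecurse
    # the referral lists the TLD servers; ask one of those next
    dig @a.gtld-servers.net www.example.com A +norecurse
    # keep following referrals down to the zone's own servers
    dig @ns1.example.com www.example.com A +norecurse

To be thorough, repeat each step against every server listed in the 
referral, since that's exactly the part +trace skips.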

- Kevin


Karl Auer wrote:

>Hi there.
>
>We just had a most peculiar failure at ETHZ in Switzerland. For about 24 hours, many (VERY many) external addresses were not being resolved; we were seeing SERVFAIL (and some timeouts). It wasn't ALL addresses, but over the course of the period more and more addresses failed; to most users it felt like nothing was resolving, though we know it wasn't actually that complete a failure. Some big names, like Google, never failed to resolve.
>
>The thing is, there were no changes on the nameservers before or during the period, and no changes in our network (except as noted below). The whole thing resolved itself, leaving us with no clue as to why it happened or what fixed it. There was no significant load on our nameservers, no significant load on our network, and people outside our network didn't see any problem (so it wasn't a global problem, or even just a Swiss problem).
>
>Towards the end of the period, we moved the two nameservers from behind a firewall to in front of the firewall - no change, so we moved them back. The problem resolved itself within 30 minutes to an hour after that. If it had suddenly come good when we moved them, I'd suspect some firewall problem, but as it is I can't see that the firewall was involved.
>
>During the problem period, some names would fail to resolve up to twenty or thirty times in a row - then would suddenly resolve. Some never did. Some, as noted, never failed to resolve.
>
>Internal addresses resolved correctly at all times.
>
>So we are completely stumped. I thought it might be cache poisoning (things going bad slowly and coming good "on their own" is usually cache-related), but the range of names involved seems too large and too random.
>
>The root hints are present and correct. I checked connectivity (and port 53 connectivity) to all of the root servers; all were reachable. We checked fresh names, and many of them resolved, so it doesn't seem to have been a network problem as such. General network connectivity to the outside world was fine. We have received no reports of people being unable to resolve *us*, and no reports of connectivity problems to us (and we would expect such reports from people like the supercomputer users and the thousands of students using our services from outside). There were no interesting log messages. Clients of every stripe were affected (WinXX, Linux, Mac, etc.).
>
>Does anyone have any ideas as to what might have caused this behaviour? And what diagnostics we could have used during the period to give us some more clues?
>
>Regards, K.
>
>  
>



