DNS resolver problems when one nameserver is down

Tue Sep 30 20:19:00 UTC 2003

James Pearson wrote:

>I've recently had a major problem when one of my internal DNS servers
>went down and I'm trying to work out a way of improving the situation.
>
>I have a network of mainly RedHat 7.2 based machines that each have
>a /etc/resolv.conf like:
>
>domain my.domain
>nameserver 1.2.3.4
>nameserver 1.2.3.5
>options rotate
>
>The 2nd listed nameserver above crashed and _all_ my Linux clients had
>problems resolving hostnames - which has a massive knock-on effect,
>grinding everything to a halt.
>
>I'm now trying to get a better understanding of how the resolver works
>and how I can improve matters if this happens again.
>
>According to the resolv.conf man page, the 'options rotate' should
>spread the load amongst the nameservers - but in my subsequent tests,
>this doesn't happen - all it does is force the resolver to use the 2nd
>nameserver first for _every_ lookup - so when the 2nd nameserver
>crashed, every lookup times out after 5 seconds before using the 1st
>nameserver. It appears that if I hadn't used the rotate option, I
>would have been OK when the 2nd nameserver went down (but not if the
>1st did!).
>
>Should the rotate option work with RH7.2 (glibc 2.2.4)?
>
>I can improve matters if I reduce the timeout to 1 second, but it
>appears the resolver code is not intelligent enough to realize that it
>keeps timing out on the same nameserver with subsequent lookups.
>
>I guess I could use something like nscd - but that again still uses
>the same nameserver for subsequent lookups of hostnames that are not
>cached.
>
>Is there something analogous to the NIS 'ypbind' for DNS lookups? i.e.
>something like nscd that instead of caching hostnames, caches the good
>nameserver to use?
>
>Sorry if this is in a FAQ somewhere, but as it has always appeared to
>work OK, I've never really had to think about this before ...
>
If the unavailability of your *second*-listed nameserver caused 
problems, I think it's reasonable to assume that "rotate" is working on 
your platform -- without "rotate", the second-listed nameserver would 
only be consulted if the first-listed nameserver wasn't answering queries.
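
To make the ordering argument concrete, here is a rough Python model of the 
two cases. It is an idealised sketch, not the glibc resolver code: the 
functions query_order and lookup_delay are made up for illustration, rotation 
is modelled as a simple round-robin shift per lookup, and the 5-second 
timeout is the figure quoted in the original message.

```python
def query_order(nameservers, rotate, lookup_index=0):
    """Return the order in which a stub resolver would try the servers.

    Hypothetical model: without rotate the list is used as written;
    with rotate the starting point shifts by one server per lookup.
    """
    if not rotate or not nameservers:
        return list(nameservers)
    shift = lookup_index % len(nameservers)
    return list(nameservers[shift:]) + list(nameservers[:shift])


def lookup_delay(order, down, timeout=5):
    """Seconds spent waiting on dead servers before a live one is reached."""
    delay = 0
    for server in order:
        if server not in down:
            return delay
        delay += timeout  # one full timeout per unresponsive server
    return None  # every listed server is down


servers = ["1.2.3.4", "1.2.3.5"]
down = {"1.2.3.5"}  # the second-listed nameserver has crashed

# Without rotate, the dead second server is never reached.
print(lookup_delay(query_order(servers, rotate=False), down))

# With rotate, any lookup that starts at the second server pays the timeout.
print(lookup_delay(query_order(servers, rotate=True, lookup_index=1), down))
```

In this model the first call reports no delay and the second reports one 
full timeout, which matches the symptom described: lookups that start at the 
dead server stall before falling back to the live one.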

Sounds like your root problem is that you don't have enough nameserver 
resources to handle a single nameserver failure, given the way your 
clients are configured. Possible solutions:

1) Add another nameserver
2) Beef up your existing nameservers
3) Reconfigure your clients (you implied that your clients were 
Unix/Linux) with their own nameserver instances in order to reap the 
benefits of local caching.
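
For option 3, a client's resolv.conf would typically list the local instance 
first and keep the existing servers as fallbacks. The Python helper below is 
only a sketch of how such a file could be generated for a fleet of clients: 
build_resolv_conf is a hypothetical name, 127.0.0.1 assumes the local 
nameserver listens on loopback, and the 1-second timeout is the value 
mentioned in the original message. It also assumes the local instance is 
configured separately to resolve or forward queries for the internal domain.

```python
def build_resolv_conf(domain, fallbacks, local="127.0.0.1", timeout=1):
    """Build resolv.conf text with a local caching nameserver listed first.

    Hypothetical helper for illustration. "rotate" is deliberately left
    out so that the local instance is always asked first.
    """
    lines = [f"domain {domain}", f"nameserver {local}"]
    # The resolver honours at most three nameserver lines (MAXNS),
    # so keep the local instance plus two fallbacks.
    lines += [f"nameserver {ip}" for ip in fallbacks[:2]]
    lines.append(f"options timeout:{timeout}")
    return "\n".join(lines) + "\n"


print(build_resolv_conf("my.domain", ["1.2.3.4", "1.2.3.5"]), end="")
```

With this layout the remote servers are only consulted when the local 
instance fails to answer, so a single remote failure is absorbed by the 
local cache and its own retry logic rather than by every client lookup.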

                        - Kevin