DNS resolver problems when one nameserver is down

James Pearson j.pearson at ge.ucl.ac.uk
Fri Sep 26 16:40:58 UTC 2003


I've recently had a major problem when one of my internal DNS servers
went down and I'm trying to work out a way of improving the situation.

I have a network of mainly RedHat 7.2 based machines, each of which
has an /etc/resolv.conf like:

domain my.domain
nameserver 1.2.3.4
nameserver 1.2.3.5
options rotate

The 2nd listed nameserver above crashed and _all_ my Linux clients had
problems resolving hostnames - which had a massive knock-on effect,
grinding everything to a halt.

I'm now trying to get a better understanding of how the resolver works
and how I can improve matters if this happens again.

According to the resolv.conf man page, 'options rotate' should spread
the load amongst the nameservers - but in my subsequent tests, this
doesn't happen. All it does is force the resolver to try the 2nd
nameserver first for _every_ lookup - so while the 2nd nameserver was
down, every lookup timed out after 5 seconds before falling back to
the 1st nameserver. It appears that if I hadn't used the rotate
option, I would have been OK when the 2nd nameserver went down (but
not if the 1st did!).

Should the rotate option work with RH7.2 (glibc 2.2.4)?

I can improve matters if I reduce the timeout to 1 second, but it
appears the resolver code is not intelligent enough to remember that a
nameserver has just timed out: each subsequent lookup tries the same
dead nameserver first and times out again.
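
To make the arithmetic concrete, here is a rough Python model of what
I observed. It is only a sketch of the observed behaviour, not the
actual glibc resolver code; the function name seconds_lost and its
parameters are made up for illustration.

```python
def seconds_lost(servers, down, timeout=5, rotate_as_observed=False):
    """Seconds one lookup waits on dead servers before reaching a live one.

    Models the behaviour seen in my tests, not the real resolver:
    with rotate_as_observed=True the 2nd server is always tried first.
    Returns None if every listed server is down.
    """
    order = list(servers)
    if rotate_as_observed and len(order) > 1:
        # Observed effect of 'options rotate': start from the 2nd entry.
        order = order[1:] + order[:1]
    waited = 0
    for server in order:
        if server not in down:
            return waited
        waited += timeout  # one full timeout per dead server tried
    return None


servers = ["1.2.3.4", "1.2.3.5"]

# 2nd nameserver down, rotate on, default 5 second timeout
print(seconds_lost(servers, {"1.2.3.5"}, rotate_as_observed=True))             # 5
# 2nd nameserver down, no rotate
print(seconds_lost(servers, {"1.2.3.5"}))                                      # 0
# 1st nameserver down, no rotate
print(seconds_lost(servers, {"1.2.3.4"}))                                      # 5
# 2nd nameserver down, rotate on, timeout reduced to 1 second
print(seconds_lost(servers, {"1.2.3.5"}, timeout=1, rotate_as_observed=True))  # 1
```

The point is that the penalty is paid on every single lookup, since
nothing is remembered between lookups.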

I guess I could use something like nscd - but that would still send
lookups for hostnames that are not cached to the same nameserver.

Is there something analogous to the NIS 'ypbind' for DNS lookups? i.e.
something like nscd that instead of caching hostnames, caches the good
nameserver to use?
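
To illustrate the kind of thing I mean, here is a minimal Python
sketch that probes each nameserver with a single UDP query and returns
the first one that answers. It is an assumption about how such a
helper might work, not an existing tool: the names build_query,
is_alive and pick_nameserver are hypothetical, and it does not touch
resolv.conf or the resolver itself.

```python
import socket
import struct


def build_query(hostname, query_id=0x1234):
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (recursion desired), 1 question, no other records
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: length-prefixed labels, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN)
    return header + qname + struct.pack("!HH", 1, 1)


def is_alive(server, probe_name, timeout=1.0, port=53):
    """True if server returns any reply with a matching ID within timeout."""
    query = build_query(probe_name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable hosts
            return False
    # Only the 2-byte query ID is compared; the answer is not parsed.
    return reply[:2] == query[:2]


def pick_nameserver(servers, probe_name, timeout=1.0, port=53):
    """Return the first server in the list that answers, or None."""
    for server in servers:
        if is_alive(server, probe_name, timeout, port):
            return server
    return None


if __name__ == "__main__":
    # Addresses and domain are the placeholders from my resolv.conf above.
    print(pick_nameserver(["1.2.3.4", "1.2.3.5"], "my.domain"))
```

Something along these lines would have to run periodically and feed
its result back to the clients somehow, which is the part I don't know
how to do.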

Sorry if this is in a FAQ somewhere, but as it has always appeared to
work OK, I've never really had to think about this before ...

Thanks

James Pearson

