DNS resolver problems when one nameserver is down

Wed Oct 8 22:02:49 UTC 2003

Kevin Darcy <kcd at daimlerchrysler.com> wrote in message news:<blt8bf$i8i$1 at sf1.isc.org>...
> James Pearson wrote:
> 
> >Kevin Darcy <kcd at daimlerchrysler.com> wrote in message news:<blcont$2h3e$1 at sf1.isc.org>...
> >
> >>James Pearson wrote:
> >>
> >>>I've recently had a major problem when one of my internal DNS servers
> >>>went down and I'm trying to work out a way of improving the situation.
> >>>
> >>>I have a network of mainly RedHat 7.2 based machines that each have
> >>>a /etc/resolv.conf like:
> >>>
> >>>domain my.domain
> >>>nameserver 1.2.3.4
> >>>nameserver 1.2.3.5
> >>>options rotate
> >>>
> >>>The 2nd listed nameserver above crashed and _all_ my Linux clients had
> >>>problems resolving hostnames - which has a massive knock-on effect,
> >>>grinding everything to a halt.
> >>>
> >>>I'm now trying to get a better understanding of how the resolver works
> >>>and how I can improve matters if this happens again.
> >>>
> >>>According to the resolv.conf man page, the 'options rotate' should
> >>>spread the load amongst the nameservers - but in my subsequent tests,
> >>>this doesn't happen - all it does is force the resolver to use the 2nd
> >>>nameserver first for _every_ lookup - so when the 2nd nameserver
> >>>crashed, every lookup times out after 5 seconds before using the 1st
> >>>nameserver. It appears that if I hadn't used the rotate option, I
> >>>would have been OK when the 2nd nameserver went down (but not if the
> >>>1st did!).
> >>>
> >>>Should the rotate option work with RH7.2 (glibc 2.2.4)?
> >>>
> >>>I can improve matters if I reduce the timeout to 1 second, but it
> >>>appears the resolver code is not intelligent enough to realize that it
> >>>keeps timing out on the same nameserver with subsequent lookups.
> >>>
> >>>I guess I could use something like nscd - but that again still uses
> >>>the same nameserver for subsequent lookups of hostnames that are not
> >>>cached.
> >>>
> >>>Is there something analogous to the NIS 'ypbind' for DNS lookups? i.e.
> >>>something like nscd that instead of caching hostnames, caches the good
> >>>nameserver to use?
> >>>
> >>>Sorry if this is in a FAQ somewhere, but as it has always appeared to
> >>>work OK, I've never really had to think about this before ...
> >>>
> >>If the unavailability of your *second*-listed nameserver caused 
> >>problems, I think it's reasonable to assume that "rotate" is working on 
> >>your platform -- without "rotate", the second-listed nameserver would 
> >>only be consulted if the first-listed nameserver wasn't answering queries.
> >
> >But the rotate option _always_ (as far as I can tell) uses the second
> >nameserver. From what the man page states, this option "causes round
> >robin selection of nameservers from among those listed". If I give
> >'options rotate' the first nameserver is never used. - which, as I
> >read it, is not what the man page says.
> >
> Right, and it doesn't make any sense either. Why would "rotate" just 
> alter the nameserver list in some deterministic way? That would be no 
> different than just having the list in a different order. Sounds like 
> you have a bad implementation of a resolver library, or you're 
> misinterpreting the "rotate" results (for the latter, have you tried 
> comparing query logs between the two nameservers?)
>

The only test I did was to add a few printf's to the resolver source
to print out what it was doing - it _always_ tried the second
nameserver first with the rotate option.
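
One possible reading of that result, sketched in Python as a toy model
rather than the actual glibc code: assume the resolver rotates its
in-memory nameserver list by one position before each query, and that
every new process starts again from the order in resolv.conf. The
Resolver class and both helper functions are hypothetical names made up
for this illustration.

```python
from collections import Counter

SERVERS = ["1.2.3.4", "1.2.3.5"]  # order as listed in resolv.conf


class Resolver:
    """Toy model of one process's resolver state (not glibc code)."""

    def __init__(self, servers, rotate):
        self.servers = list(servers)  # fresh copy for each process
        self.rotate = rotate

    def first_server_for_next_query(self):
        # Assumption: with rotate set, the in-memory list is rotated
        # by one position before each query is sent.
        if self.rotate:
            self.servers = self.servers[1:] + self.servers[:1]
        return self.servers[0]


def short_lived_processes(n, rotate=True):
    """n separate processes, each doing a single lookup."""
    return Counter(
        Resolver(SERVERS, rotate).first_server_for_next_query()
        for _ in range(n)
    )


def long_lived_process(n, rotate=True):
    """One process doing n lookups with the same resolver state."""
    resolver = Resolver(SERVERS, rotate)
    return Counter(resolver.first_server_for_next_query() for _ in range(n))


if __name__ == "__main__":
    print(short_lived_processes(10))
    print(long_lived_process(10))
```

Under these assumptions, ten one-shot processes all try 1.2.3.5 first,
while a single process doing ten lookups splits them five and five. If
the real library behaves this way, a test that runs a new command for
each lookup would never see the first nameserver used.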

> >>Sounds like your root problem is that you don't have enough nameserver 
> >>resources to handle a single nameserver failure, given the way your 
> >>clients are configured. Possible solutions:
> >>
> >>1) Add another nameserver
> >>2) Beef up your existing nameservers
> >>3) Reconfigure your clients (you implied that your clients were 
> >>Unix/Linux) with their own nameserver instances in order to reap the 
> >>benefits of local caching.
> >
> >If I don't use 'options rotate', then lookups will use the first
> >listed nameserver, if that fails, it uses the second listed, if that
> >fails, the third...
> >
> >The problem is that if the first nameserver goes away, then _all_
> >lookups will time out after 5 seconds before trying the second
> >nameserver. It doesn't matter how many other nameservers you have.
> >
> >This five second timeout is what crippled us.
> >
> >I have thought of using a local caching nameserver - but I don't know
> >if that will suffer from the same time out problem - OK it will have
> >known hosts cached - but will the local nameserver have the same
> >timeout problem for unknown hosts if the first used upstream
> >nameserver has gone away?
> >
> I'm assuming by your use of the word "upstream" that you would configure 
> your boxes as forwarders, presumably because your "main" nameservers are 
> the only ones which are allowed to query the Internet for Internet 
> names. If that assumption is correct, then your question is about 
> forwarder failover, and the answer is: it depends on the version of BIND 
> you're running. In some versions, forwarders are selected according to 
> accumulated RTT (round-trip time) statistics, similar to the way caching 
> resolvers choose among NS records. In other versions of BIND, the 
> forwarders are always tried sequentially, which can result in resolution 
> delays if the 1st through nth forwarders are unavailable.
> 
> If, on the other hand, your individual client boxes have the ability to 
> query Internet names directly, then one option to consider is to 
> configure them all as *pure* caching resolvers (with nothing but a hints 
> zone and a master zone for the loopback address). NS failover is 
> *always* based on RTT, so recovery is fairly rapid. However, depending 
> on your routing/security paradigm, this configuration may not be an 
> option for you, and, even if it is, it may arguably be viewed as 
> anti-social, inasmuch as you'll have a whole bunch of boxes 
> independently deluging Internet nameservers with probably the same 
> queries over and over.
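
To make the RTT idea concrete, here is a rough Python sketch of
selecting a server by smoothed round-trip time. It is not BIND's
algorithm; the RttSelector class, the smoothing factor and the choice
to charge a timed-out query the full timeout are all assumptions made
for illustration.

```python
class RttSelector:
    """Pick the server with the lowest smoothed RTT (illustrative only)."""

    def __init__(self, servers, timeout=5.0, alpha=0.3):
        self.srtt = {s: 0.0 for s in servers}  # smoothed RTT in seconds
        self.timeout = timeout
        self.alpha = alpha  # weight given to the newest sample

    def pick(self):
        # Lowest smoothed RTT wins; ties go to the earlier-listed server.
        return min(self.srtt, key=self.srtt.get)

    def record(self, server, rtt=None):
        # rtt=None means the query timed out; charge the full timeout.
        sample = self.timeout if rtt is None else rtt
        old = self.srtt[server]
        self.srtt[server] = (1 - self.alpha) * old + self.alpha * sample


if __name__ == "__main__":
    sel = RttSelector(["1.2.3.4", "1.2.3.5"])
    sel.record(sel.pick())            # first-listed server times out
    print(sel.pick())                 # the other server is now preferred
```

In this sketch a single timeout is enough to push the dead server
behind the live one, so only one lookup pays the delay. A real
implementation would also need to decay the penalty over time so that
a recovered server gets tried again, which this sketch leaves out.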

I'm not (that) worried about resolving outside addresses quickly - it's
internal addresses that I need quickly. Thanks for your comments -
I'll do a bit more digging.

> >My current workaround is to hack the libresolv source to re-order the
> >list of nameservers to make sure the last successfully used nameserver
> >is used first on subsequent lookups. This appears to work OK.
> >
> Hacked libresolv? I'm glad it works for you, but I'd consider that a 
> last-resort workaround...
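
For reference, the reordering idea can be sketched in Python roughly
as follows. This is not the libresolv patch itself, just a model of
the behaviour: the LastGoodFirst class is a hypothetical name, and
is_up is a stand-in callable for sending a real query and waiting for
a reply or a timeout.

```python
TIMEOUT = 5.0  # seconds; the per-server timeout discussed in this thread


class LastGoodFirst:
    """Keep the most recently successful nameserver at the front."""

    def __init__(self, servers):
        self.servers = list(servers)

    def lookup(self, is_up):
        """Try servers in order; return (server, seconds spent waiting).

        is_up is a hypothetical callable standing in for a real query.
        """
        waited = 0.0
        for i, server in enumerate(self.servers):
            if is_up(server):
                # Move the server that answered to the front of the list.
                self.servers.insert(0, self.servers.pop(i))
                return server, waited
            waited += TIMEOUT  # this server did not answer in time
        return None, waited


if __name__ == "__main__":
    resolver = LastGoodFirst(["1.2.3.4", "1.2.3.5"])
    first_is_down = lambda s: s != "1.2.3.4"
    print(resolver.lookup(first_is_down))  # pays one timeout
    print(resolver.lookup(first_is_down))  # goes straight to 1.2.3.5
```

Only the first lookup after a failure waits for the timeout; later
lookups in the same process start with the server that last answered.
The state lives in the process, so this only helps long-running
processes (or something like nscd doing the lookups on their behalf).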

There is another very simple workaround - which I sort of mentioned in
my original post - I could just use NIS to do hostname lookups (and
use ypbind to worry about missing ypservers etc). I already generate a
hosts NIS map (for another reason) from the DNS files, so all I need
to do is swap the lookup order in nsswitch.conf ... however I would
prefer to use DNS to do this ...

> >
> >If I use this modified libresolv with nscd, then it appears I can
> >quite happily resolve hostnames etc. without problems - nscd gets an
> >initial 5 second time out on its first lookup, but all subsequent
> >lookups use the next working nameserver first.
> >
> >However, I'm not sure if doing this is likely to cause other
> >problems...
> >
> Frankly, I don't know either. But it wouldn't surprise me...
> 
> - Kevin

Thanks

James Pearson