Root zone timeout and workarounds?

Wed Feb 21 04:30:55 UTC 2001

At 11:02 PM 2/20/2001 -0500, you wrote:

>Okay, so you're talking about other nameservers on the Internet, not stub
>resolvers, timing out trying to resolve names in your domain when 4 out of
>your 5 registered nameservers are unavailable.

Right.

>I suppose this isn't that surprising; 4 out of 5 is a serious outage. Of
>course, if the remote nameservers are running BIND or something like it,
>they should quickly adapt to the outage. Have you tried *successive*
>queries to those remote nameservers? Do they eventually stop timing out?

Yes, but after way too many attempts ... I almost gave up trying myself, 
before I realized, "oh, it worked that time .. "
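
For anyone who wants to repeat the successive-query test, here is a rough
sketch in Python using only the standard library. It sends one A query per
name to a single remote nameserver and counts how many get no reply. The
helper names and the 5-second timeout are my own assumptions, and any reply
at all (including SERVFAIL) is counted as "not a timeout".

```python
import random
import socket
import struct


def build_query(name, qtype=1):
    """Build a minimal DNS query packet (RFC 1035); qtype 1 is an A record."""
    qid = random.randint(0, 0xFFFF)
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # Question trailer: QTYPE, then QCLASS 1 (IN).
    return qid, header + qname + struct.pack("!HH", qtype, 1)


def count_timeouts(server, names, timeout=5.0):
    """Query `server` once per name over UDP and count unanswered queries."""
    timeouts = 0
    for name in names:
        _, packet = build_query(name)
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            try:
                sock.sendto(packet, (server, 53))
                sock.recvfrom(512)
            except socket.timeout:
                timeouts += 1
    return timeouts


# Hypothetical usage (192.0.2.53 is a placeholder address, not a real server):
# count_timeouts("192.0.2.53", ["test%d.foo.com" % i for i in range(10)])
```

Using a different name for each query keeps the remote server's cache out
of the picture, same as in the manual test.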

>If this temporary effect is unacceptable, then you may be able to increase
>your availability by, paradoxically, reducing the number of registered
>nameservers. If, for example, you reduced down to 3 nameservers in 2
>different locations, then if the larger location goes down -- thus making 2
>of the nameservers unavailable -- convergence should be faster with 1/3 of
>your nameservers available than with only 1/5.
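
To put rough numbers on the 1/3 versus 1/5 point, here is a minimal sketch
in Python. It assumes a deliberately simplified model, which is not how BIND
really picks servers (BIND favors servers by measured round-trip time): the
remote nameserver tries the listed servers in a uniformly random order,
never repeats one, and pays one timeout per dead server it hits. The
function names are made up for illustration.

```python
from fractions import Fraction


def first_try_success(total, alive):
    """Chance that the very first server tried is a live one."""
    return Fraction(alive, total)


def expected_dead_tried(total, alive):
    """Expected number of dead servers tried before the first live one.

    In a random ordering, each dead server comes before every live server
    with probability 1 / (alive + 1), so the expectation is
    dead / (alive + 1).
    """
    dead = total - alive
    return Fraction(dead, alive + 1)


# 5 listed, 1 reachable (the current setup with the main site down)
print(first_try_success(5, 1), expected_dead_tried(5, 1))  # 1/5 2
# 3 listed, 1 reachable (the suggested setup with the larger site down)
print(first_try_success(3, 1), expected_dead_tried(3, 1))  # 1/3 1
```

Under that model a cold lookup eats two timeouts on average with 5 listed
servers, versus one with 3, before it reaches the server that is up.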

Tacky, I know .. but one of the reasons we have 4 nameservers on-site and
one off-site is that we want a majority of the requests to come to our
network. There will be a lot of them, and we don't want to soak the remote
link's bandwidth with DNS requests.

>Ultimately, of course, your best availability would be achieved by having
>*every* registered nameserver be in a different location and/or on a
>different network link. But that can be difficult to achieve economically
>and logistically.

Exactly, I wish we could pull it off economically .. but this project just 
doesn't merit it unfortunately.

Thanks again for your time on this ... any ideas where I should head from
here? Or any better way to weight requests with the root servers, so I can
have fewer NSs listed?

>- Kevin
>
>denon wrote:
>
> > At 09:00 PM 2/19/2001 -0500, you wrote:
> >
> > >When you say the "resolvers" are timing out, do you mean caching
> > >nameservers doing recursive lookups, or do you mean stub resolvers?
> >
> > Excuse my lack of terminology .. but here's what's happening, hopefully I'm
> > answering your question:
> >
> > say I have foo.com registered with NSI. I've also registered hosts ns, ns2,
> > ns3, ns4, ns5.foo.com.
> >
> > They're listed on foo.com, at NSI, in that order. NS5 being the off-site,
> > ns1-4 being the ones on our network.
> >
> > When I take ns1-4 down, I pick a random remote nameserver (say,
> > ns.yahoo.com), one that I know doesn't have it cached/etc. Then I try to
> > resolve SomeRandomArecord.foo.com off it. These resolves are what are
> > timing out. It doesn't matter what remote NS I pick, I have similar results
> > .. occasionally it'll resolve, usually it times out ..
> >
> > Am I making sense? I hope so ..
> >
> > >Perhaps you should consider putting
> > >the remote server second or third in the list to reduce the possibility of
> > >timeout.
> >
> > You're probably right, I guess I was under the impression that the root
> > servers picked the nameservers at random (random, weighted by uptime past
> > success, I guess).
> >
> > >In some versions of BIND 8 there was a "rotate" resolver option which
> > >would cause the stub resolver to rotate the nameserver list for each
> > >query. But that option appears to be gone as of BIND 9, so I wouldn't
> > >rely on it.
> >
> > Is this an issue with the root servers? Surely they're not running generic
> > bind8 .. :)
> >
> > Thanks for your ideas Kevin. I hope I've clarified things a little.
> >
> > >denon wrote:
> > >
> > > > I've been digging through the archives, usenet as well as a variety of
> > > > other tech docs in search of the answer for my question.  I haven't
> > > > come up with any results, but if this is a "frequently asked question",
> > > > please don't be afraid to throw me to a url.
> > > >
> > > > Here's the situation we've got: I need a relatively highly redundant
> > > > dns system (who doesn't? :). On an Internet domain, as a test, I've
> > > > listed 5 nameservers. One of the nameservers is at a remote location,
> > > > and the other 4 are at various places within our internal network.
> > > > Due to the fact that the internal network is all geographically in the
> > > > same area, there's a "good chance" all 4 here would go down at the same
> > > > time. We don't presently have the facilities for more than one
> > > > off-site, but I think it's safe to rely on just one.
> > > >
> > > > The problem is this: When I take down the 4 internal nameservers (when
> > > > I say take down, I mean ndc stop, not just drop the zone), the 5th
> > > > nameserver outside responds just fine. However, I think most resolvers
> > > > are timing out before it does. Shouldn't the root servers respond
> > > > faster than the resolver times out? While the 4 are down, if you
> > > > resolve something 10 times in a row, maybe 6 times it'll time out, and
> > > > 4 times it'll resolve. (assuming you resolve something different from
> > > > the same zone each time .. not caching/etc.).
> > > >
> > > > Is this a common problem? If all 4 of the internal nameservers go down,
> > > > will the 5th be of any use?
> > > >
> > > > I'd appreciate any insight you can give me, TIA.
> > > >
> > > > Best Regards.