Timeouts and retries on high speed LANs

Tue Sep 14 09:34:47 UTC 2010

I have been working on building out a couple of large data centres and
have been struggling with how to set up the systems so that we get a
highly resilient, highly responsive DNS service in the presence of failing
equipment.

The configuration we have adopted includes a layer of BIND 9.6.x servers
that act as pure name server caches. We have six of these servers in each
data centre paired to provide service on VIPs so that if one of the pair
fails the other cache takes over.

Our resolv.conf is of the following form.

search xxx.com yyy.com
nameserver 10.1.1.1
nameserver 10.1.2.1
nameserver 10.1.3.1
options timeout:1 attempts:15 no-check-names rotate

The name servers are thus on different networks within the DCs.

Our first problem arises because the timeouts seem to be taken serially on
each server, rather than the rotate applying between each name server
request. Is this what I should have expected, i.e. a 15 second timeout
before the next server is tried in sequence?
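
To put numbers on the difference, here is a small sketch of the two retry
orderings in question. It assumes a simplified model in which every query
runs to its full timeout; the function names are illustrative only, and
this is not a description of what any particular resolver library does.

```python
def delay_serial(server_index, timeout, attempts):
    """Seconds before server `server_index` (0-based) is first tried when
    all attempts are exhausted on one server before moving to the next."""
    return server_index * attempts * timeout


def delay_rotating(server_index, timeout):
    """Seconds before server `server_index` is first tried when each
    attempt moves on to the next server in the list."""
    return server_index * timeout


# Values from the resolv.conf above: timeout:1 attempts:15
print(delay_serial(1, timeout=1, attempts=15))   # 15
print(delay_rotating(1, timeout=1))              # 1
```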

The second problem we face is that even if we could get a one second
timeout, this is orders of magnitude too slow for names that should be
resolved within our local name space. In other words, for lookups within
the xxx.com and yyy.com domains I would like to see timeouts in the
microsecond range.

Thinking further about this problem I have been considering whether the
resolver should be multi-threaded or parallelised in some way so that it
tries all of the servers at once and accepts the first to respond. I have
come to the conclusion that this would be too difficult to make resilient
in the general use of the resolver code, but would make sense if the
lwresd layer is added to the equation.
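
As a rough illustration of the "ask everyone, take the first answer" idea,
the sketch below sends one hand-built UDP query to every server from a
single socket and returns whichever reply arrives first. It is a minimal
sketch, not resolver or lwresd code: build_query and query_all are
hypothetical names, servers are given as (IP address, port) pairs, and
there is no retry, TCP fallback or parsing of the answer.

```python
import os
import socket
import struct
import time


def build_query(name, qtype=1):
    """Build a minimal DNS query packet (QTYPE 1 = A record, class IN)."""
    txid = int.from_bytes(os.urandom(2), "big")
    # Flags 0x0100 = recursion desired; exactly one question follows.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return txid, header + qname + struct.pack("!HH", qtype, 1)


def query_all(name, servers, timeout=1.0):
    """Send the same query to every (ip, port) in `servers` and return
    (server, raw_reply) for the first matching reply, or None on timeout."""
    targets = [tuple(server) for server in servers]
    txid, packet = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for target in targets:
            sock.sendto(packet, target)
        deadline = time.monotonic() + timeout
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return None
            sock.settimeout(remaining)
            try:
                data, addr = sock.recvfrom(4096)
            except socket.timeout:
                return None
            # Accept only replies from a queried server with our query ID.
            if (addr in targets and len(data) >= 2
                    and struct.unpack("!H", data[:2])[0] == txid):
                return addr, data
```

Called as query_all("host.xxx.com", [("10.1.1.1", 53), ("10.1.2.1", 53),
("10.1.3.1", 53)]), a single dead cache costs nothing as long as one of
the others answers within the timeout. The price is that every lookup
multiplies the query load on the caches by the number of servers.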

Which brings me on to the use of lwresd. This would reduce the incidence
of problems with non-responsive servers in that it would detect and switch
to an alternative server on the first failed attempt. However, this still
means that if lwresd has not detected the down server then we get a stall
in response within the data centre.

So my questions are:

1. Does anybody have any experience in building such systems and
suggestions on how we should tune the clients and servers to make the
system less fragile in the presence of hardware, software and network
failures?

2. Is it possible with lwresd as it is written today to get the effect of
precognition, i.e. can I get lwresd to notice that a server has gone down
or has come back up without it needing to be triggered by a resolver
request?

3. Does anybody know if I can configure lwresd to expect particular zones
to be resolved within very small windows and use this to fail over to the
next server?
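
The behaviour asked for in question 2 might look something like the
following sketch, which probes every cache on a timer and keeps a list of
the ones currently believed to be up, so that a dead server is noticed
before a client trips over it. This only illustrates the idea and is not
something lwresd is known to provide; ServerHealth, the probe callable and
fail_threshold are all hypothetical.

```python
import time


class ServerHealth:
    """Tracks consecutive probe failures per server (hypothetical helper)."""

    def __init__(self, servers, probe, fail_threshold=2):
        self.probe = probe                    # callable(server) -> bool
        self.fail_threshold = fail_threshold  # failures before marking down
        self.failures = {server: 0 for server in servers}

    def check_once(self):
        """Probe every server once and update its failure count."""
        for server in list(self.failures):
            if self.probe(server):
                self.failures[server] = 0     # any success marks it up again
            else:
                self.failures[server] += 1

    def healthy(self):
        """Servers that have not yet reached the failure threshold."""
        return [s for s, n in self.failures.items() if n < self.fail_threshold]

    def run(self, interval, rounds):
        """Probe on a fixed interval, independent of any client lookups."""
        for _ in range(rounds):
            self.check_once()
            time.sleep(interval)
```

The probe itself could be as simple as a lookup of a well-known local name
with a short timeout. A resolver layer would then send requests only to
the servers returned by healthy().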

And for discussion, I wonder if there would be room to add to the resolver
code and/or lwresd additional options of the form

options zone-timeout: xxx.com:1usec

or something similar, whereby the resolver could be told that if the cache
does not respond within this time about that particular zone then it can
be assumed that the server is misbehaving.
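
To show how such an option might be consumed, here is a small sketch that
parses a zone-timeout value and picks the timeout for a given name by
longest matching zone. Everything in it is an assumption for the sake of
discussion: the option does not exist today, only the "usec" suffix
appears in the proposal above (the "msec" and "sec" units are invented),
and parse_zone_timeouts and timeout_for are hypothetical names.

```python
import re

# Unit suffixes; only "usec" appears in the proposal, the rest are assumed.
UNITS = {"usec": 1e-6, "msec": 1e-3, "sec": 1.0}


def parse_zone_timeouts(spec):
    """Parse e.g. 'xxx.com:1usec yyy.com:1msec' into {zone: seconds}."""
    timeouts = {}
    for item in spec.split():
        zone, _, value = item.rpartition(":")
        match = re.fullmatch(r"(\d+)(usec|msec|sec)", value)
        if not zone or not match:
            raise ValueError(f"bad zone-timeout entry: {item!r}")
        seconds = int(match.group(1)) * UNITS[match.group(2)]
        timeouts[zone.lower().rstrip(".")] = seconds
    return timeouts


def timeout_for(name, zone_timeouts, default=1.0):
    """Return the timeout of the longest zone that `name` falls under."""
    labels = name.lower().rstrip(".").split(".")
    for i in range(len(labels)):          # longest suffix first
        zone = ".".join(labels[i:])
        if zone in zone_timeouts:
            return zone_timeouts[zone]
    return default
```

Names outside the listed zones fall back to the ordinary timeout, so only
lookups that ought to be answered locally get the aggressive limit.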

Thank you for your attention.

Regards, Howard.