Failover

Sat Nov 20 01:17:02 UTC 2004

jesk wrote:

>Hello,
>
>I'm thinking about a failover setup for web services at different locations
>via DNS.
>I have some questions about what is possible:
>
>1. How are "IN NS" records cached and used by other BIND nameservers if one
>of the nameservers is down? For example, the TLD server has two "IN NS"
>records for my zone, and a nameserver looking up this zone gets both
>records. I think it first tries to resolve via the first nameserver in the
>reply order, but what happens if that one is down and unreachable? Will the
>resolving nameserver retry the query via the second one? And what happens if
>the first nameserver answers successfully and is cached by the resolving
>nameserver, but then goes down during the lifetime of the cached "IN NS"
>record? Is the second nameserver still in the cache, so that failover works
>in that case?
>
Nameservers will generally keep track of how fast other nameservers 
respond to queries and prefer faster nameservers over slower ones. If a 
nameserver stops responding, it'll get heavily penalized as being "slow" 
but eventually used again in case it has recovered. Overall this 
mechanism makes nameserver-to-nameserver traffic rather adaptive and 
resilient.
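
To make the idea concrete, here is a toy Python sketch of that kind of
bookkeeping. It is only an illustration of the principle, not BIND's
actual algorithm; the class name, the penalty value and the decay factor
are all made up for the example.

```python
class ServerSelector:
    """Toy model of RTT-based nameserver selection (hypothetical, not BIND code)."""

    def __init__(self, servers, timeout_penalty=2.0, decay=0.98, alpha=0.3):
        # Unknown servers start at 0.0 so that each one gets tried early.
        self.srtt = {server: 0.0 for server in servers}
        self.timeout_penalty = timeout_penalty  # assumed "very slow" score, in seconds
        self.decay = decay                      # how quickly unused servers are forgiven
        self.alpha = alpha                      # smoothing factor for new RTT samples

    def pick(self):
        # Prefer the server with the lowest smoothed round-trip time.
        return min(self.srtt, key=self.srtt.get)

    def report_success(self, server, rtt):
        # Blend the new sample into the running estimate.
        old = self.srtt[server]
        self.srtt[server] = (1 - self.alpha) * old + self.alpha * rtt
        self._decay_others(server)

    def report_timeout(self, server):
        # A server that did not answer is scored as if it were very slow.
        self.srtt[server] = max(self.srtt[server], self.timeout_penalty)
        self._decay_others(server)

    def _decay_others(self, used):
        # Servers we did not just use slowly drift back toward "fast", so a
        # penalized server is eventually retried in case it has recovered.
        for server in self.srtt:
            if server != used:
                self.srtt[server] *= self.decay
```

With two servers, a timeout on the first sends traffic to the second
until the penalty has decayed far enough for the first to be tried again.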

>
>2. Without routing protocols like BGP, is the only way to get global DNS
>failover to use two or more nameservers at different locations (different
>ASes or similar), each answering queries, e.g. for web servers, with its own
>specific A records? For example, if nameserver A is down because of a routing
>problem, a resolver will query nameserver B (located at a different
>provider), which answers a query for www.domain.tld with a specific A record
>that will be reachable because it is on the same physical network.
>
That'll give you very basic, crude failover capabilities, but it won't 
give you actual load-balancing (since the speed at which the nameserver 
responds may have nothing to do with the load on the webservers or 
whatever other servers you're trying to load-balance). In fact, if one 
nameserver happens to be significantly closer to the Internet backbone 
and/or the source of the majority of your potential clients, you may 
find that this approach causes severe load skewing.

Another drawback of this approach, of course, is that you have to
maintain two different versions of every A record you want to be
redundant, one on each nameserver.

>3. If "IN NS" failover is possible, what about caching nameservers that
>cache A records? Is failover possible for them too? If so, would it be
>possible to return the A records for the web servers at both locations, so
>that a client tries webserver A first and, if it is unreachable, webserver
>B (I think that is implementation-dependent and too risky)? Or is the only
>solution to create a zone with a TTL of zero?
>
Any DNS-based solution is going to require low TTL values, since
otherwise it won't be very dynamic. Lowering your TTL values like that
is rather anti-social, since it not only makes your nameservers work
harder handling more queries, but also makes everyone *else*'s
nameservers work harder querying yours. As for the approach of using
multiple A records, you can do that, but you'll get a certain amount of
randomness depending on what your TTL values are set to, since most
resolvers will "round-robin" their answers when replying from cache.
Also, be aware that some clients (in particular, some web browsers)
take a long time to do address failover, longer than may be acceptable
to your customers.
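
For what it's worth, client-side address failover amounts to something
like the Python sketch below. The function name and the 3-second default
are assumptions for illustration; real clients differ widely in how (and
whether) they do this. The `resolve` and `connect` parameters exist only
so the logic can be exercised without touching the network.

```python
import socket


def connect_with_failover(host, port, per_address_timeout=3.0,
                          resolve=socket.getaddrinfo,
                          connect=socket.create_connection):
    """Try each address returned for host in order; return the first connection."""
    last_error = None
    for _family, _type, _proto, _canon, sockaddr in resolve(
            host, port, type=socket.SOCK_STREAM):
        address = sockaddr[:2]  # (ip, port) for both IPv4 and IPv6 results
        try:
            return connect(address, timeout=per_address_timeout)
        except OSError as exc:
            # Dead address: remember the error and move on to the next one.
            last_error = exc
    raise last_error or OSError("no addresses returned for %s" % host)
```

Note that every unreachable address ahead of a working one can cost the
client up to `per_address_timeout` seconds, which is exactly the delay
your customers end up sitting through.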

>Thanks for any hints and explanations that help me fully understand this :)
>
The *right* way to use DNS for resource failover is to use SRV records. 
Unfortunately, very few client software authors have adopted SRV 
support. So in the meantime, most folks are implementing dedicated 
and/or hardware-based load-balancing solutions instead, which give 
load-balancing benefits as well as just failover. The better ones can be 
pretty pricey though.
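
To show what an SRV-aware client would do with the records, here is a
rough Python sketch in the spirit of RFC 2782: lower priority values are
tried first, and within one priority the targets are picked at random in
proportion to their weights. The function name and the tuple layout are
my own, and the weighted pick is simplified compared to the RFC text.

```python
import random


def order_srv_targets(records, rng=random):
    """Order (priority, weight, port, target) tuples the way a client might try them."""
    ordered = []
    for priority in sorted({rec[0] for rec in records}):
        # All targets of a lower priority value are tried before any higher one.
        group = [rec for rec in records if rec[0] == priority]
        while group:
            total = sum(rec[1] for rec in group)
            threshold = rng.uniform(0, total)
            running = 0
            for index, rec in enumerate(group):
                running += rec[1]
                if running >= threshold:
                    # Heavier weights cover more of the range, so they tend
                    # to be chosen earlier within their priority group.
                    ordered.append(group.pop(index))
                    break
    return ordered
```

Failover comes from the priority field (the backup only sees traffic
once every lower-numbered target has failed), and load distribution
comes from the weights.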

As you mention above, it's also possible to play stupid BGP tricks and 
the like to squeeze out some failover and/or load-balancing 
functionality. I don't have any direct experience with that, but I am 
led to believe that there are serious drawbacks to that approach in 
terms of reliability and convergence time.

                                                   - Kevin