CORRECTION: Re: Planning for Emergencies

Tue Mar 21 19:37:30 UTC 2000

Oops, I blew it! Most if not all commercial load-balancing platforms do
*not* actually flip A records around. Instead, they use an invariant address
and then some sort of routing chicanery to redirect packets based on load
levels and/or node failures. This is a third, non-DNS-based approach to
failover, in addition to the two DNS-based approaches I already described.
Sorry for the confusion.

- Kevin

Kevin Darcy wrote:

> wwebb at adni.net wrote:
>
> > Is there a way to configure a system so that, if a server fails, another
> > server at a different geographic location will automatically take over?
> > If so, how is that done?
> >
> > Thanks for any tips.
>
> There are two basic approaches to this problem:
>
> 1. Have the name resolve to *both* addresses and try to always give the
> addresses out in a fixed order to all clients (via "rrset-order" or
> non-BIND equivalent, if any). This approach relies on the clients being
> smart enough to try the second address. Unfortunately, many clients just
> try the first address and give up, so this is a partial solution at best.
> Also, this approach undoubtedly leads to a lot of "leakage" to the backup
> server, because of the round-robin behavior of slaves and intermediate
> caching servers. If you can configure rrset-order on all of the
> slaves, that takes care of that part of the problem, but the only way to
> really prevent intermediate caching servers not under your control from
> doing round-robin is to reduce the TTLs on the records, thus forcing the
> caching servers to time out the records constantly and to re-fetch the
> data from the authoritative servers. Unfortunately, this increases
> DNS traffic.
>
> 2. Have the A record change in the event of an outage. This can be
> accomplished by various commercial platforms (as Barry mentioned), or you
> can roll your own by creating a monitoring script that detects an outage
> and then changes the A record, using Dynamic Update or some other
> mechanism. Again, in order to reduce the effect of cached records, you
> may need to reduce TTL values. All of your slaves should also understand
> NOTIFY; otherwise the A record change may not propagate quickly
> enough.
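
A rough sketch of the decision half of such a roll-your-own monitoring script, in Python. Everything named here is an assumption made for illustration: the function names, the plain TCP connect used as the health check, and the placeholder addresses. Actually changing the A record (via Dynamic Update, for instance with nsupdate) is deliberately left out, since that depends on how your zone is set up.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds in time.

    Assumption: "the server answers on its service port" is a good
    enough definition of "up" for this sketch.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def choose_address(primary, backup, port, check=is_reachable):
    """Pick the address the A record should carry right now.

    Returns the primary while it passes the check, otherwise the backup.
    The check is a parameter so a different probe can be swapped in.
    """
    return primary if check(primary, port) else backup


if __name__ == "__main__":
    # Hypothetical addresses from the documentation ranges.
    wanted = choose_address("192.0.2.10", "198.51.100.10", 80)
    # A real script would compare this with the published A record and
    # send an update only when the two differ.
    print("A record should point at", wanted)
```

Run from cron or a loop, a script along these lines only decides what the record should be. A real one would probably also want to require a few consecutive failures before switching, so that one dropped probe does not flip the record.
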
>
> In the future, intelligent clients -- ones which understand SRV records,
> or at least which can work their way through a list of addresses -- may
> be the best solution to the problem. In the meantime, Cisco and others
> are making a mint on server-side solutions.
>
> - Kevin
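
To make "work their way through a list of addresses" concrete, here is a minimal Python sketch of a client that tries each address in turn instead of giving up after the first. The function names are made up for the example, and the timeout value is an arbitrary assumption. It is meant to show the idea, not how any particular client implements it.

```python
import socket


def resolve_all(host, port):
    """Return every (ip, port) pair the name resolves to, in resolver order."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # info[4] is the socket address; its first two fields are (ip, port)
    # for both IPv4 and IPv6 results.
    return [info[4][:2] for info in infos]


def connect_first_available(addresses, timeout=3.0):
    """Try each (ip, port) in order and return the first socket that connects.

    Raises OSError if none of the addresses accepts the connection.
    """
    last_error = None
    for address in addresses:
        try:
            return socket.create_connection(address, timeout=timeout)
        except OSError as exc:
            # Remember why this address failed and move on to the next one.
            last_error = exc
    raise OSError("all addresses failed: %r" % (last_error,))
```

A caller would do something like connect_first_available(resolve_all("www.example.com", 80)). Note that the timeout applies per address, so a dead primary that silently drops packets still costs the client a full timeout before the backup is tried.
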