anycasting, DNS client retry/failover
Gordon A. Lang
glang at goalex.com
Fri Mar 6 21:02:58 UTC 2009
I have just implemented DNS anycasting on our inside network using Cisco
content switches to monitor the health of the servers and to advertise an
OSPF route when the back-end services are alive. I have three CSS's
simultaneously advertising the same service address to the network, and
clients get routed to the nearest one. It works great.
Anyone else try this?
When I was testing, I sent 2000 queries per second from two sources
simultaneously on diverse parts of the network, and proceeded to start
disconnecting and reconnecting cables on the content switches to see how
well it all worked. No matter what I did, I could not seem to lose more
than 10 packets per link-state-change (which is very good in my mind). But
when I stopped the services on the actual servers, it took up to 5 seconds
before the content switch registered the fault (because the keepalives are
currently configured for every 5 seconds), and I lost thousands of queries
in those few seconds.
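
As a rough sanity check on that loss figure, the arithmetic can be sketched as below. This is only an upper bound under an assumption of mine: every query from one 2000-qps test source keeps reaching the dead service until the keepalive notices the fault. The function name is hypothetical.

```python
def queries_lost(rate_qps, detection_delay_s):
    """Upper bound on unanswered queries from one source while the
    content switch still advertises a dead back-end service."""
    return rate_qps * detection_delay_s


# One test source at 2000 queries/second, keepalive every 5 seconds
# (worst case: the service dies right after a successful keepalive).
print(queries_lost(2000, 5))

# The same source with the keepalive period reduced to 3 seconds.
print(queries_lost(2000, 3))
```

The loss scales linearly with the detection delay, so shortening the keepalive period cuts the worst case proportionally.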
I am considering reducing the keepalive period to improve this fault
response, but I'd like to get a better understanding of the DNS client
behavior when its queries go unanswered.
From what I recall, the typical DNS client will send a single query packet
to its first-configured dns resolver and wait 1 second for a response. If
no response comes, the DNS client sends a second query to the same dns
resolver and waits either 1 second or 2 seconds, depending on if the client
is progressive or not, for a response. If still no response comes, most DNS
clients will ask the same dns resolver one last time, and wait either 1 more
second or 4 seconds, depending on the client. And perhaps some
non-progressive DNS clients try a fourth time. If still no response comes,
then the DNS client starts from the beginning with the second-configured DNS
resolver.
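
To make the recalled behavior concrete, here is a small sketch that turns a list of per-attempt timeouts into a retry timeline. The timeout lists (1/1/1, 1/2/4, and 1/1/1/1 seconds) are taken from my recollection above, not from any particular resolver implementation, and the function name is hypothetical.

```python
def retry_schedule(timeouts):
    """Given the wait (in seconds) after each attempt to one resolver,
    return (send_times, failover_time): when each query is sent and
    when the client gives up and moves to the next resolver."""
    send_times = []
    elapsed = 0
    for wait in timeouts:
        send_times.append(elapsed)  # query goes out now
        elapsed += wait             # then the client waits for a reply
    return send_times, elapsed


# Non-progressive client, three attempts of 1 second each.
print(retry_schedule([1, 1, 1]))      # ([0, 1, 2], 3)

# Progressive client, timeouts doubling: 1, 2, 4 seconds.
print(retry_schedule([1, 2, 4]))      # ([0, 1, 3], 7)

# Non-progressive client that tries a fourth time.
print(retry_schedule([1, 1, 1, 1]))   # ([0, 1, 2, 3], 4)
```

Under these assumptions the client fails over to the second-configured resolver after 3, 7, or 4 seconds respectively.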
If this is true, then I would think a keepalive period of 3 seconds ought to
divert queries away from dead servers fast enough to satisfy the vast
majority of DNS client requests before failing over to the second-configured
dns resolver.
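
A rough way to test that reasoning is the sketch below. It rests on simplifying assumptions of mine: the service dies at time 0, the anycast route is withdrawn exactly one keepalive period later, a client's first query arrives at a uniformly random moment inside that window, and route convergence and reply latency are ignored. A lookup counts as "rescued" if at least one retry to the first-configured address is sent after the route has moved. The function name is hypothetical.

```python
def fraction_rescued(timeouts, detection_delay_s):
    """Fraction of lookups started during the outage window that get
    an answer from the anycast address before the client fails over."""
    if detection_delay_s <= 0:
        return 1.0
    # Offset of the last query sent to the first-configured resolver.
    last_send_offset = sum(timeouts[:-1])
    # A lookup starting at time t is rescued if t + last_send_offset
    # falls at or after the moment the route is withdrawn.
    return min(1.0, last_send_offset / detection_delay_s)


for keepalive in (5, 3):
    for name, timeouts in (("1/1/1", [1, 1, 1]), ("1/2/4", [1, 2, 4])):
        share = fraction_rescued(timeouts, keepalive)
        print(f"keepalive {keepalive}s, client {name}: {share:.0%} rescued")
```

In this model a 3-second keepalive rescues every lookup from a progressive (1/2/4) client but only about two thirds of those from a three-attempt non-progressive client, so the answer depends on which client behavior dominates.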
Any comments?
And despite what I have read about DNS clients over the years, what I have
experienced in real life has left me uncertain about what really happens.
Typically, prior to this anycast deployment, when our first-configured dns
resolver went down, users complained about waiting 60 to 90 seconds before
their web pages would come up. That does not make sense to me because I
thought the second-configured resolver would be used within a few seconds.
Can anyone suggest why real life doesn't reflect what is written?
Thanks.
--
Gordon A. Lang