anycasting, DNS client retry/failover

Fri Mar 6 21:02:58 UTC 2009

I have just implemented DNS anycasting on our inside network using Cisco 
content switches to monitor the health of the servers and to advertise an 
OSPF route when the back-end services are alive.  I have three CSSs 
simultaneously advertising the same service address to the network, and 
clients get routed to the nearest one.  It works great.

Has anyone else tried this?

When I was testing, I sent 2000 queries per second from two sources 
simultaneously on diverse parts of the network, and proceeded to start 
disconnecting and reconnecting cables on the content switches to see how 
well it all worked.  No matter what I did, I could not seem to lose more 
than 10 packets per link-state change (which is very good in my mind).  But 
when I stopped the services on the actual servers, it took up to 5 seconds 
before the content switch registered the fault (because the keepalives are 
currently configured for every 5 seconds), and I lost thousands of queries 
in those few seconds.
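
As a rough sanity check on the scale of that loss, here is a minimal Python 
sketch of the upper bound.  The function name is hypothetical, and the model 
assumes the service dies right after a successful keepalive, that detection 
takes one full keepalive interval, and that every query from the counted 
sources is routed to the failed server.  Client retries are not counted.

```python
def worst_case_lost_queries(qps_per_source, sources, detect_seconds):
    """Upper bound on queries sent to a dead server before the fault is seen.

    Assumes a constant query rate and that all counted sources are routed
    to the failed server for the whole detection window.
    """
    return qps_per_source * sources * detect_seconds


# Test load from the post: 2000 queries/second per source, 5 second keepalive.
print(worst_case_lost_queries(2000, 1, 5))  # 10000 (one source affected)
print(worst_case_lost_queries(2000, 2, 5))  # 20000 (both sources affected)
```

Under those assumptions the loss scales linearly with the detection window, 
so a shorter keepalive period reduces the bound proportionally.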

I am considering reducing the keepalive period to improve this fault 
response, but I'd like to get a better understanding of the DNS client 
behavior when its queries go unanswered.

From what I recall, the typical DNS client will send a single query packet 
to its first-configured DNS resolver and wait 1 second for a response.  If 
no response comes, the DNS client sends a second query to the same DNS 
resolver and waits either 1 second or 2 seconds, depending on whether the 
client is progressive or not, for a response.  If still no response comes, 
most DNS clients will ask the same DNS resolver one last time, and wait 
either 1 more second or 4 seconds, depending on the client.  And perhaps 
some non-progressive DNS clients try a fourth time.  If still no response 
comes, then the DNS client starts from the beginning with the 
second-configured DNS resolver.
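
The retry behavior described above can be sketched as a small timeline 
calculation.  This is only a model of the behavior as recalled here, not of 
any particular resolver library.  The function name and the two timeout 
lists (1, 1, 1 for a non-progressive client and 1, 2, 4 for a progressive 
one) are illustrative assumptions.

```python
from itertools import accumulate


def retry_timeline(timeouts):
    """Return (send_times, failover_time) for one resolver.

    timeouts: seconds the client waits after each query to the same
    resolver before giving up on that attempt.
    """
    # Each retry is sent once all earlier waits have expired.
    send_times = [0] + list(accumulate(timeouts))[:-1]
    # The client moves to the next resolver after the final wait expires.
    failover_time = sum(timeouts)
    return send_times, failover_time


print(retry_timeline([1, 1, 1]))  # ([0, 1, 2], 3)  non-progressive
print(retry_timeline([1, 2, 4]))  # ([0, 1, 3], 7)  progressive
```

In this model the non-progressive client sends its last query to the first 
resolver 2 seconds after the first one, and the progressive client 3 
seconds after.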

If this is true, then I would think a keepalive period of 3 seconds ought to 
divert queries away from dead servers fast enough to satisfy the vast 
majority of DNS client requests before failing over to the second-configured 
DNS resolver.
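
One rough way to test that reasoning is a toy model of how many lookups 
still get an answer via the first-configured address.  All of it is 
assumption: the function name is hypothetical, lookups are taken to start 
at a uniformly random moment during the outage, the outage is taken to last 
one full keepalive period, rerouting after detection is treated as instant, 
and the client timeouts are the 1, 1, 1 and 1, 2, 4 patterns recalled above.

```python
def fraction_answered_by_first_resolver(timeouts, outage_seconds):
    """Fraction of lookups begun during the outage that are still answered
    via the first-configured resolver address.

    A lookup is counted as answered if any of its queries is sent at or
    after the moment the outage ends.
    """
    if outage_seconds <= 0:
        return 1.0
    # Offset of the last query the client sends to the first resolver.
    last_send = sum(timeouts[:-1])
    # A lookup starting x seconds into the outage succeeds when
    # x + last_send >= outage_seconds; x is uniform on [0, outage_seconds).
    return min(1.0, last_send / outage_seconds)


for outage in (5, 3):
    fixed = fraction_answered_by_first_resolver([1, 1, 1], outage)
    progressive = fraction_answered_by_first_resolver([1, 2, 4], outage)
    print(outage, round(fixed, 2), round(progressive, 2))
# 5 0.4 0.6
# 3 0.67 1.0
```

Under these assumptions a 3 second outage is fully covered by the 
progressive client, while roughly a third of lookups from a 1, 1, 1 client 
would still fall through to the second-configured resolver.  Real clients 
and real reconvergence times may differ.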

Any comments?

And despite what I have read about DNS clients over the years, what I have 
experienced in real life has left me uncertain about what really happens. 
Typically, prior to this anycast deployment, when our first-configured DNS 
resolver went down, users complained about waiting 60 to 90 seconds before 
their web pages would come up.  That does not make sense to me because I 
thought the second-configured resolver would be used within a few seconds.

Can anyone suggest why real life doesn't reflect what is written?

Thanks.

--
Gordon A. Lang