DNS and ISP redundancy: why would I need to restart BIND after an ISP failover?

Tue Dec 20 01:06:20 UTC 2005

Tom V wrote:

>Hi,
>
>One of our customers has a firewall setup with ISP failover (meaning, when
>one link to the internet fails, we can switch to a standby link from another
>provider). Obviously, in this case our public IP address also changes.
>
>Normally, this should not have any influence on the applications.
>
>However, today we had to switch over to another provider, and we noticed
>that our internal DNS server wouldn't resolve any external addresses
>anymore. We always got a 'no servers could be reached' error whenever we
>tried to resolve a domain that wasn't local or in the cache.
>
>We solved the problem by simply restarting BIND (this is BIND 9 on Red Hat
>Enterprise Linux 3). So it wasn't an access list somewhere that caused the
>problem.
>
>Any ideas what could have caused this? I'm sorry that I don't have more
>information for you to work with.
>  
>
Yeah, more information would be helpful. If a restart of named hadn't
fixed the problem, I would simply have said that you're probably
configured to forward by default to the nameservers of ISP #1 (the ISP
you switched away from). In that case, while ISP #1 was unreachable,
every query for a name that was neither cached nor in a local zone would
have to time out against the forwarders before named fell back to
iterative resolution (assuming "forward first", which is the default
forwarding mode). However, the fact that restarting named cleared the
problem makes things a little more complicated. Are you sure there
weren't any non-DNS changes (e.g. firewall, router) at the same time as
the restart?
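
To make the forward-first timing concrete, here is a toy model in
Python. It is only a sketch: the function name, the 2-second
per-forwarder timeout and the 192.0.2.x addresses are made up for
illustration, and they are not BIND's real timers or your configuration.

```python
def resolve(forwarders, reachable, timeout_s=2.0, mode="first"):
    """Toy model of forwarding; returns (who answered, seconds spent waiting).

    forwarders: forwarder addresses, tried in order (hypothetical values)
    reachable:  set of forwarders that currently answer
    timeout_s:  assumed per-forwarder timeout, NOT BIND's actual timer
    mode:       "first" falls back to iterative resolution, "only" does not
    """
    waited = 0.0
    for fwd in forwarders:
        if fwd in reachable:
            return ("forwarder " + fwd, waited)
        waited += timeout_s          # this forwarder had to time out
    if mode == "first":
        return ("iterative", waited)  # fall back only after all timeouts
    return ("SERVFAIL", waited)       # "forward only" has no fallback


isp1 = ["192.0.2.1", "192.0.2.2"]     # documentation addresses, not real ones
normal = resolve(isp1, reachable={"192.0.2.1"})
failover = resolve(isp1, reachable=set())
failover_only = resolve(isp1, reachable=set(), mode="only")
```

In this model the failover case still gets an answer under "forward
first", but only after every forwarder has timed out, which can easily
look like "no servers could be reached" to an impatient client.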

One possibility is that you were using nslookup (you didn't say which
lookup tool you used). nslookup always tries to look up the name of the
nameserver it's using, so if you had cached referral/round-trip-time
(RTT) information pointing the relevant reverse zone(s) to ISP #1's
nameservers, and those were unreachable, every query made that way would
be significantly delayed, perhaps long enough for nslookup itself to
time out. Restarting named would clear out the old RTT information and
perhaps allow it to resolve the reverse domain from some other, reachable
server(s) in the zone's NS records.

If the reverse zone was the problem, you can mitigate it by making
yourself a slave for the relevant reverse zone(s). Note also that regular
apps doing regular forward lookups would not have been affected by the
reverse-lookup problem, if any, although apps that specifically do
reverse lookups, e.g. for logging/auditing or possibly (cringe)
source-address-based authentication, would obviously suffer.

Finally, if my cached-RTT-for-reverse-zone theory is correct, the
problem would eventually have sorted itself out: unresponsive servers
get their RTTs penalized, so the other servers in the same NS RRset
would eventually "rise to the top" and be used for resolving queries in
the zone. In other words, restarting named simply accelerated that
process.
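
The "rise to the top" effect can be illustrated with a small Python
simulation. This is a deliberately simplified sketch, not named's actual
server-selection algorithm: the hostnames, the starting RTTs and the
flat 200 ms penalty per timeout are all assumptions chosen for the
example.

```python
def pick_server(rtt_ms):
    """Pick the server with the lowest RTT estimate (first one wins ties)."""
    return min(rtt_ms, key=rtt_ms.get)


def simulate(rtt_ms, dead, penalty_ms=200.0, max_queries=20):
    """Query until a live server is chosen; penalize each server that times out.

    Returns (server that finally answered, or None; number of timeouts).
    penalty_ms is an assumed flat penalty, not BIND's real adjustment.
    """
    rtt = dict(rtt_ms)               # work on a copy of the estimates
    timeouts = 0
    for _ in range(max_queries):
        server = pick_server(rtt)
        if server not in dead:
            return server, timeouts
        rtt[server] += penalty_ms    # unresponsive server drifts down the list
        timeouts += 1
    return None, timeouts


# Hypothetical NS set for the reverse zone; ISP #1's servers look "fast"
# in the cache because they were close before the failover.
cached = {"ns1.isp1.example": 10.0,
          "ns2.isp1.example": 15.0,
          "ns.other.example": 400.0}
dead = {"ns1.isp1.example", "ns2.isp1.example"}

winner, timeouts = simulate(cached, dead)

# After a restart there is no RTT history, modeled here as all zeros.
fresh = {name: 0.0 for name in cached}
fresh_winner, fresh_timeouts = simulate(fresh, dead)
```

With these made-up numbers the stale cache costs four timeouts before
the reachable server is tried, against two with no history, which is the
sense in which a restart only speeds up what would have happened anyway.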

                                       - Kevin