NS failover as opposed to A record failover

Tue Feb 25 22:51:01 UTC 2020

I know this isn’t a question ABOUT BIND, per se, but I think it is still a question bind-users might have an answer to. I’ve seen various failover questions on the list, but nothing that talks specifically about NS records (at least nothing in the last decade), so I thought I’d inquire here.

I’m familiar with round-robin DNS and using multiple A records for the same name. I also understand that most clients, if the first server on the list doesn’t respond, will wait ~30 seconds before trying the next address on the list. This is pretty good, as far as automatic failover goes, but having X% of your users (X being down servers / all A records offered) wait an extra 30 seconds is not great. So I’m going to run a regular health check on my front-facing web servers from each BIND server and, if a server stops responding, change my zone file and reload, then reverse the change once the server starts responding again. That way X% of my users will only have to wait those 30 seconds until I fix the zone file. The TTL will be about the same as the health check interval, so the worst case is 2xTTL during which X% of users have to wait those extra 30 seconds. Overall I’m satisfied with this balance between complexity and resiliency, particularly considering I can do record manipulation in advance of planned maintenance, so this problem only becomes an issue during unexpected outages.
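To make the health-check idea concrete, here is a rough sketch in Python of what I have in mind. Everything in it is hypothetical: the function names, the plain TCP connect used as the check, the 192.0.2.x example addresses and the fall-back of publishing every address when all checks fail are my own assumptions, not anything BIND provides. It only builds the A record lines; writing them into the zone file, bumping the SOA serial and reloading (e.g. with rndc reload) are left out.

```python
import socket


def is_healthy(ip, port=80, timeout=3.0):
    """Return True if a TCP connection to ip:port succeeds within timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def render_a_records(name, ips, ttl, check=is_healthy):
    """Build zone-file A record lines for the servers that pass the check."""
    healthy = [ip for ip in ips if check(ip)]
    # Assumption: if every check fails, keep publishing the full set rather
    # than an empty one, since the fault may be in the checker itself.
    if not healthy:
        healthy = list(ips)
    return [f"{name} {ttl} IN A {ip}" for ip in healthy]


def worst_case_stale_seconds(check_interval, ttl):
    """Longest time a dead address can keep being handed out.

    The server may die right after a check (one full interval passes before
    it is noticed), and a resolver may have cached the old answer just before
    the zone was fixed (one full TTL passes before it expires).
    """
    return check_interval + ttl
```

With a 60 second check interval and a 60 second TTL, worst_case_stale_seconds(60, 60) gives 120, which is the 2xTTL figure mentioned above.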

This is all well and good until I think about failure or maintenance of the name servers themselves. I’ll need to give my registrar the NS IPs for my domain, but they will not be nearly as flexible regarding changes as I can be running my own name servers (TTL will probably be an hour, at the very least), which makes set-up and tear-down for maintenance work a MUCH longer process if I have to make NS record changes in coordination with my registrar. However, this made me wonder: is NS failure responded to in the same way as the failure of an A record? Various Internet randos have indicated that some DNS clients and resolvers will do parallel lookups and take the first response; others have indicated that the “try the next record” parameter for NS comms is 5 to 10 seconds rather than 30; and still others claim it’s the same as A record failover, at 30 seconds before trying the next candidate on the list. Is there a definitive answer to this or, because it’s client related, are the answers too widely varied to rely upon (which is why the answers on the Internet are all over the map)?

Failures aside, I’m worried about creating a bad user experience EVERY time I need to take a DNS server down for patching. I can’t be the first person to run into this problem. Is it just something people live with (and shuffle NS records around all the time), or is NS failover really smoother than A record failover, so that I should concentrate on keeping my A records current in case of failure OR planned maintenance?

Any feedback would be greatly appreciated.

Thanks,

Scott
