What would happen if one of two DNS servers was down?

Kevin Darcy kcd at chrysler.com
Wed Aug 13 03:37:45 UTC 2008


MontyRee wrote:
> Sorry for the non-plain-text previous e-mail; sending again.
>
>
>
> Thanks for the kind and concrete answers.
>
> My additional questions are...
>
>
> -. Others can use other resolvers, like Windows-based ones or other BIND versions.
>     Does failover work as you said, without exception?
>
>
> -. From the standpoint of high availability of service,
>    which is better: two authoritative DNS servers, or two master DNS servers behind an L4 switch?
>
>
>
> So thanks again.
>
>
> Regards.
>
>
>
>   
>> Subject: RE: What would happen if one of two DNS servers was down?
>> From: chris_cox at stercomm.com
>> To: bind-users at isc.org
>> Date: Tue, 12 Aug 2008 10:44:02 -0500
>>
>> On Tue, 2008-08-12 at 06:42 +0000, MontyRee wrote:
>>     
>>> Thanks for the kind answer.
>>>
>>>
>>> Additional questions below.
>>>
>>>
>>>       
>>>>> Hello, all.
>>>>>
>>>>>
>>>>> I have been operating two DNS servers (primary and secondary) for one domain, like below.
>>>>>
>>>>>
>>>>> example.com IN NS ns1.example.com
>>>>> example.com IN NS ns2.example.com
>>>>>
>>>>>
>>>>> There was an event where the ns1.example.com server went down.
>>>>> As I understand it, if ns1 is down, all requests go to ns2.example.com.
>>>>>           
>>>> Depending on what 'down' means, it could take some time before
>>>> the request is sent to ns2. So there will likely be a delay, even
>>>> if not much (it will feel like forever to some users).
>>>>         
>>> By 'down' I mean the system was down, so the server couldn't even be pinged.
>>>
>>>
>>>       
>>>>> But when ns1.example.com was down, some people actually couldn't look up the domain.
>>>>>           
>>>> Sounds like a configuration issue. However, realize there is a zone
>>>> expiry: if ns2 is slaving zones from ns1 (typical BIND master/slave
>>>> scenario) and the zone data expires, then ns2 will refuse to
>>>> trust the slaved zone it had... and thus nothing works.
>>>>         
>>> Sorry, I don't quite understand what you said.
>>> The master DNS server (system) was down for just an hour, and the slave DNS
>>> worked fine without any problem, but during that time some people could connect
>>> while others said they couldn't resolve the domain at all.
>>>       
>> The slave will answer queries for the zone until the zone TTL expires,
>> at which point, if it cannot contact the master, the zone will go effectively
>> dead.
>>
>> I think I used some bad "terms" in my explanation. Basically
>> there is an expiration ttl for which a slave will consider its
>> data to be good. After that, it will need to hit the master.
>>
>> (I trip up on using the right words)
>>
>> The value is often set to 2 weeks or more. But if the master is
>> down for a LONG time... you'll lose it all eventually (the slave
>> won't answer for that zone anymore).
>>
>> If you're seeing this problem after a short period of time, that's
>> likely NOT the cause unless somebody set the expiry in the SOA
>> to something really small.
>>
>> Caching in DNS is a wonderful thing, but can cause scenarios where
>> resolution is working for one and not for another (because of
>> the various Time To Live values and the time of last query/cache).
>>
>> Can you give us a feel for the amount of time between the failure
>> and the problem? Is it almost immediate? If so, then it's some
>> other kind of configuration issue (unless, as I said, the zone was
>> just totally misconfigured). Can you post the SOA for the zone?
>>
>>     
>>> Does that mean DNS failover doesn't work well,
>>> and that some resolvers or some BIND versions insist on querying the downed DNS server?
>>>       
>> Usually the client resolver is configured to query multiple nameservers; if
>> the first one is down, it moves on to the next, and so on. Failover works
>> fine in this style (normally). Of course, a client might NOT be aware
>> of more than one nameserver... in which case there is no failover (duh).
>>
>>
>> ...
>>     
>>> Thanks again for your help.
>>>       
>> Did I explain it better this time?
>>
>>
>>     
Let me try to explain this from a high level:

1) The NS records that are published for a zone are for the consumption 
of other nameservers (technically, "iterative resolvers"). If one of the 
nameservers listed as an NS for a zone becomes unavailable, failover to 
the other NS(es) is very quick, usually so quick as to be unnoticeable 
to ordinary users. Iterative resolvers also *remember* which nameservers 
are down, or slow, so they are very adaptive to failures.
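
You can observe this from the outside by querying each published 
nameserver directly; for example, with dig:

    dig @ns1.example.com example.com SOA +norecurse
    dig @ns2.example.com example.com SOA +norecurse

If one of those times out while the other answers, a healthy iterative 
resolver will quickly settle on the one that answers.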

2) The nameservers that are defined for a "stub resolver", like your 
typical end-user PC, are tried *in sequence*, so if the first one is 
down, there may be a delay before the second one is tried, and if that 
one is down, an even longer delay before the third one is tried, and so 
forth. The delay is often quite noticeable, and impatient applications 
may actually time out before a working nameserver is found. Stub 
resolvers typically don't *remember* that a particular nameserver is 
down, either, so in case of a failure, all queries are likely to be slow 
until the failure is corrected.
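
For example, on a typical Unix/Linux client, that ordered list lives in 
/etc/resolv.conf (addresses below are just placeholders):

    # tried first, for every query
    nameserver 192.0.2.53
    # tried only after the first one times out
    nameserver 198.51.100.53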

3) Between masters and slaves, there is a REFRESH interval defined for 
each replicated zone, which governs how often the slave checks the 
master for updates, and then an EXPIRE interval after which the slave 
considers the zone "bad" and will no longer give useful answers for 
names in the zone. As mentioned previously in the thread, while REFRESH 
is commonly an hour or so, EXPIRE is typically on the order of 
weeks, if not months. If a slave can't talk to the master for weeks, 
chances are it's a permanent condition and the right thing to do is 
"expire" the zone so that clients aren't given stale information. In 
enterprises with a large number of slave servers (like ours), for 
redundancy it is common to have multiple tiers of slaves, and for the slaves 
at a given tier to list multiple "masters" (i.e. sources of zone data) 
from higher tiers, so that even if a single intermediate "master" dies 
or becomes unavailable, changes still propagate out to the edges 
everywhere. Note that there is an inherent problem in having servers at 
the *same* tier list each other as "masters" reciprocally or in a 
circular fashion, because then slaved zones can become "immortal" (i.e. 
even if they're deleted from the primary master, the slaves in that 
particular tier keep refreshing them from each other indefinitely).
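
For reference, those timers live in the zone's SOA record. An 
illustrative (not prescriptive) example:

    example.com. IN SOA ns1.example.com. hostmaster.example.com. (
                 2008081301 ; serial
                 3600       ; refresh: slave checks the master hourly
                 900        ; retry: recheck interval after a failed refresh
                 1209600    ; expire: slave goes "dead" after 2 weeks
                 86400 )    ; negative-caching TTL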

So, to your questions:

a) "others can use other resolvers like windows based or other bind 
version."

Depends on what you mean by "resolver". If you mean the "resolver" part 
of a nameserver implementation like BIND, configured for iterative 
resolution (i.e. based on published NS records), then the failover is 
very fast.

If, on the other hand, you mean a "stub resolver", like a typical 
end-user PC client, then the failure of the first nameserver in the 
resolver list can cause noticeable delays for every query. Note that on 
some platforms it's possible to tune the delays (e.g. libresolv on some 
Unix/Linux platforms understands some /etc/resolv.conf options which 
govern timeouts and retries).
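
For instance, with libresolv the per-server timeout and number of 
attempts can often be shortened via resolv.conf (exact option support 
varies by platform):

    options timeout:2 attempts:2
    nameserver 192.0.2.53
    nameserver 198.51.100.53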

In the case of a "forwarding resolver", such as BIND configured with a 
"forwarders" statement, it depends on the exact implementation. 
Even in its forwarding mode, BIND, for instance, still maintains a 
cache, so on that basis alone it can be expected to perform reasonably 
well even in the case of failures (unless the TTLs of the records being 
looked up are very low). Modern versions of BIND also keep track of the 
up/down/slow status of their upstream forwarders, so they can adapt to 
failures in the same way they do when resolving iteratively (older 
versions of BIND are not as adaptive in forwarding mode, trying each 
forwarder in sequence, so they degenerate to the performance level of 
stub resolvers 
+ caching). Other packages/implementations of forwarding resolvers may 
cope well with failures, or not so well. It really depends.
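
As a rough sketch, a BIND forwarding configuration looks something like 
this (addresses hypothetical):

    options {
        forwarders { 192.0.2.53; 198.51.100.53; };
        forward first;   // fall back to iterative resolution if the
                         // forwarders don't answer
    };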

b) "in the point of high-availability of service, what it better two 
authorative dns servers or two master dns servers using L4 switch?"

I'm not 100% sure what you mean by "L4 switch". Do you mean a 
load-balancer? The Internet standards mandate at least 2 nameservers for 
each zone, so you don't technically have the option of putting both DNS 
servers behind a single, load-balanced VIP (the zone would then publish 
only one nameserver address). We have 2 VIPs defined for 
our Internet-facing DNS zones and then each VIP has multiple nameservers 
behind it. This conforms to standards, and not only gives us an 
acceptable level of availability in the face of unplanned outages, but 
also the flexibility to perform maintenance, upgrades, etc. 
transparently to Internet DNS clients.
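
In zone-file terms, that arrangement still looks like an ordinary pair 
of NS records; the addresses just happen to be VIPs (addresses 
hypothetical):

    example.com.     IN NS ns1.example.com.
    example.com.     IN NS ns2.example.com.
    ns1.example.com. IN A  192.0.2.10     ; VIP #1, several servers behind it
    ns2.example.com. IN A  198.51.100.10  ; VIP #2, likewise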

There's also the "anycast" approach, which is routing-layer-based, but 
since we don't use that here, and I haven't researched it at all, and in 
any case don't have a strong background in network routing, I'll defer 
to others to explain how that works.

What, by the way, do you mean by two "master" DNS servers? The term 
"master" is usually used in DNS in two different ways:
1) Relationally, when talking about replication (as I do above): the 
master is the provider of the zone data, and the slave is the consumer 
(in BIND terms, whatever a slave zone lists as its "masters"; see the 
sketch after this list). Within a multi-level replication hierarchy, a 
given server might be "master" with respect to some servers in the 
hierarchy, and "slave" to others.
2) When viewing the hierarchy as a whole, in the classic DNS replication 
model (i.e. based on point-to-point AXFR/IXFR transfers), there is 
really only 1 "master", i.e. the origin of the zone data, whether that 
be from a flat file, a database backend, or whatever. All other 
nameservers in the hierarchy are "slaves", in that they obtain the zone 
data from other nameservers, rather than a source external to DNS 
itself. Sometimes the term "primary master" is used for this kind of 
"master", to distinguish it from "master", as used in the relational 
sense in #1 above.
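
A minimal named.conf sketch of sense #1, assuming hypothetical addresses:

    zone "example.com" {
        type slave;
        // upstream sources of the zone data, tried in the order listed
        masters { 192.0.2.1; 192.0.2.2; };
        file "slaves/db.example.com";
    };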

In neither sense of the term "master" do I understand how one could have 
multiple "masters" behind a load-balancer, unless you're i) talking 
about putting load-balancers between servers in the replication 
hierarchy (in which case they're all "authoritative" anyway and there's 
no difference between the options you presented), ii) deviating from the 
classic DNS replication model (e.g. Microsoft's "multi-master" 
architecture for Active Directory integrated DNS, where the backend is a 
replicated LDAP database), or iii) simply using the term incorrectly.

- Kevin


