accelerated TTL decrement

Thu Apr 12 23:00:07 UTC 2001

I was browsing the mailing list archives today and ran across a
theory from Barry Margolin that possibly explains a problem I've
been having (pardon me if this has been discussed before, but I
couldn't come up with a good enough search to narrow it down):

Barry's theory:
--------
ns1.hank.org's A record was originally learned via a glue record from
the .ORG server.  Glue records (and other records learned in
Additional Info) aren't considered as reliable as records from a
server that's authoritative for the domain, so its TTL is decreased at
an accelerated rate (every time it's looked up the TTL is dropped by
5%, in addition to the normal time decrement).  So after a while the
remote server has the NS record pointing to ns1 in its cache, but it
no longer has that A record, so it needs to ask one of the other
hank.org servers for it.
--------
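
To put a rough number on that 5% figure, here is a minimal Python
sketch.  It assumes the decrement is a flat 5% of the remaining TTL
per lookup and ignores the normal time-based countdown; the function
name and the sample TTLs are my own, not anything taken from named.

```python
def lookups_until_expired(ttl_seconds, decay=0.05):
    """Count cache hits until the remaining TTL falls below one second.

    Assumption: each lookup multiplies the remaining TTL by (1 - decay).
    The ordinary wall-clock decrement is ignored, so this is an upper
    bound on how many lookups the record would survive.
    """
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must be in the range (0, 1]")
    remaining = float(ttl_seconds)
    lookups = 0
    while remaining >= 1.0:
        remaining *= 1.0 - decay
        lookups += 1
    return lookups


if __name__ == "__main__":
    # Sample TTLs: one hour, one day, two days.
    for ttl in (3600, 86400, 172800):
        print(ttl, lookups_until_expired(ttl))
```

Under that assumption the count grows only with the logarithm of the
TTL, so even a two-day TTL is gone after a few hundred cache hits.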

Now I have a quick question.  Is the primary purpose of a secondary
nameserver
1) Load balancing
or
2) Failure recovery in case a primary nameserver is inaccessible (due
to network outage, for example)
?

The reason I ask is because of the following scenario:

The root nameservers provide this information about domain.com:
domain.com NS ns1.domain.com
domain.com NS ns2.domain.com
and glue records:
ns1.domain.com address is 1.2.3.4
ns2.domain.com address is 5.6.7.8

The zone files on ns1 and ns2 have the same NS and A records.

ns1 is inaccessible due to a prolonged network outage.

Someone running named does a fresh lookup of www.domain.com.  named
asks the root nameservers for information about domain.com and gets
the NS records and the glue records... then it tries to ask ns1 about
www.domain.com.  ns1 doesn't answer.  named asks ns2... ns2 does
answer.  Everything is dandy.  

Now... that someone does a zillion lookups for the address of
ns2.domain.com.  It gets answers back because named has them cached.
But, according to Barry, since the address of ns2.domain.com was
originally returned as a glue record from the root nameservers, the
TTL is decreased by 5% during each lookup.  This means that the TTL
will head to zero very quickly.  Soon, the 'A' record for
ns2.domain.com will expire and be removed from the cache.  This leaves
named in the following situation:

It still has the 'NS' records for domain.com:
  NS ns1.domain.com
  NS ns2.domain.com
It still knows the address of (inaccessible) ns1.domain.com.  

Now the user does one final lookup for www.domain.com.  We'll assume
that the old entry for www.domain.com has expired and is not in the
cache anymore.  Now named needs to talk to one of the nameservers for
domain.com.  It knows who they are, but it only has the address for
ns1, so it tries to contact ns1, which doesn't answer.  It never
tries to contact ns2, which is happily waiting to provide information.

At this point named needs to be restarted in order to get information
for domain.com.
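
The cache state I'm describing can be sketched as a toy model in
Python.  This only illustrates the scenario above as I understand it;
it is not how named is implemented, and the dictionaries and the
usable_servers helper are made up for the example.

```python
def usable_servers(zone, ns_cache, a_cache, reachable_addrs):
    """Return the nameservers for zone that the cache can actually reach.

    A server is usable only if its address is still cached and that
    address answers queries.
    """
    usable = []
    for name in ns_cache.get(zone, []):
        addr = a_cache.get(name)
        if addr is not None and addr in reachable_addrs:
            usable.append(name)
    return usable


# NS records are still cached for both servers.
ns_cache = {"domain.com": ["ns1.domain.com", "ns2.domain.com"]}

# Only ns2's address answers; ns1 (1.2.3.4) is behind the outage.
reachable = {"5.6.7.8"}

# Right after the fresh lookup: both glue A records are cached.
fresh = {"ns1.domain.com": "1.2.3.4", "ns2.domain.com": "5.6.7.8"}

# After ns2's glue A record has decayed out of the cache.
decayed = {"ns1.domain.com": "1.2.3.4"}

if __name__ == "__main__":
    print(usable_servers("domain.com", ns_cache, fresh, reachable))
    print(usable_servers("domain.com", ns_cache, decayed, reachable))
```

With the fresh cache the model still finds ns2; with the decayed cache
the list is empty, which is the stuck state described above.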

This leads me back to my quick question above.  If the purpose of
secondary nameservers is to provide fault tolerance (which is what I
always felt they were mainly for), this behavior of named works
against that goal.