TTL for root NS and A records

Mon Nov 8 19:10:30 UTC 1999

Well, I didn't get much attention the first time around, so let me add some
detail on testing this in a lab with BIND 8.2.1 as a caching-only server.
The root config is now:

.                       1D IN NS        lab13.lab.chase.test.
lab13.lab.chase.test.   1M IN A         172.32.28.13

I'm running BIND 8.2.1 as a caching-only server on lab14.lab.chase.test, at
trace level 5.
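With these TTLs, the cached root NS record should outlive the A record it
depends on by almost a day. That window can be sketched with a toy cache model
in Python. It assumes only that each record is kept until load time plus TTL;
the names `CachedRR` and `root_server_addrs` are hypothetical and are not BIND's
data structures or lookup code.

```python
from dataclasses import dataclass

# Toy cache model (hypothetical, not BIND internals): every record is kept
# until an absolute expiry time, i.e. load time + TTL, and then ignored.

@dataclass
class CachedRR:
    name: str
    rtype: str
    data: str
    expires: int  # absolute time, seconds

def root_server_addrs(cache, now):
    """Return (root NS names, A addresses for those names) still valid at `now`."""
    live = [rr for rr in cache if rr.expires > now]
    ns_names = [rr.data for rr in live if rr.name == "." and rr.rtype == "NS"]
    addrs = [rr.data for rr in live if rr.rtype == "A" and rr.name in ns_names]
    return ns_names, addrs

loaded = 0  # time the root records were loaded
cache = [
    CachedRR(".", "NS", "lab13.lab.chase.test.", loaded + 86400),          # 1D
    CachedRR("lab13.lab.chase.test.", "A", "172.32.28.13", loaded + 60),   # 1M
]

for t in (30, 61, 3600, 86401):
    ns, addrs = root_server_addrs(cache, t)
    if ns and not addrs:
        print(f"t={t}s: no addrs found for root NS {ns}")
    else:
        print(f"t={t}s: NS={ns} A={addrs}")
```

In the model, any moment between 60 s and 86400 s after loading has the NS
record but no address for it. What happens once both records are gone is not
modeled.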

A minute after startup the A record expires, and from then on no query whose
answer isn't already cached can be answered. Below is the debug log of a ". NS"
query, which returned only the NS record and no A record.

08-Nov-1999 13:23:15.555 datagram from [172.32.28.1].60876, fd 22, len 17
08-Nov-1999 13:23:15.556 XX+/172.32.28.1/./NS
08-Nov-1999 13:23:15.556 req: nlookup() id 6 type=2 class=1
08-Nov-1999 13:23:15.556 req: found '' as '' (cname=0)
08-Nov-1999 13:23:15.556 wanted(0xdec80, IN NS) [IN NS]
08-Nov-1999 13:23:15.556 wantedtsig(0xdec80, IN NS) [IN NS]
08-Nov-1999 13:23:15.556 make_rr(, dec80, effff711, 483, 1) 21 zone 0 ttl 942171732
08-Nov-1999 13:23:15.557 finddata: added 1 class 1 type 2 RRs
08-Nov-1999 13:23:15.557 req: foundname=1, count=1, founddata=1, cname=0
08-Nov-1999 13:23:15.557 findns: np 0xd6c88 ''
08-Nov-1999 13:23:15.557 findns: 1 NS's added for ''
08-Nov-1999 13:23:15.557 free_nsp: lab13.lab.chase.test rcnt 1
08-Nov-1999 13:23:15.558 doaddinfo() addcount = 2
08-Nov-1999 13:23:15.558 do additional "lab13.lab.chase.test" (from "")
08-Nov-1999 13:23:15.558 found it
08-Nov-1999 13:23:15.558 stale: ttl 942085392 -3 (x2)
08-Nov-1999 13:23:15.558 delete_all(0xdae5c:"lab13" IN A)
08-Nov-1999 13:23:15.558 rm_datum(d0f94, d0f94, 0, 0) -> 0
08-Nov-1999 13:23:15.559 sysquery(lab13.lab.chase.test, 1, 1, 0, 0, 53)
08-Nov-1999 13:23:15.559 qnew(0xe2f34)
08-Nov-1999 13:23:15.559 find_zone(lab13.lab.chase.test, 1)
08-Nov-1999 13:23:15.559 find_zone: unknown zone
08-Nov-1999 13:23:15.559 find_zone(lab.chase.test, 1)
08-Nov-1999 13:23:15.559 find_zone: unknown zone
08-Nov-1999 13:23:15.560 find_zone(chase.test, 1)
08-Nov-1999 13:23:15.560 find_zone: unknown zone
08-Nov-1999 13:23:15.560 find_zone(test, 1)
08-Nov-1999 13:23:15.560 find_zone: unknown zone
08-Nov-1999 13:23:15.560 find_zone(., 1)
08-Nov-1999 13:23:15.560 find_zone: existing zone 1
08-Nov-1999 13:23:15.561 findns: np 0xdae5c 'lab13'
08-Nov-1999 13:23:15.561 findns: np 0xdae40 'lab'
08-Nov-1999 13:23:15.561 findns: np 0xdae24 'chase'
08-Nov-1999 13:23:15.561 findns: np 0xdae08 'test'
08-Nov-1999 13:23:15.561 findns: np 0xd6c88 ''
08-Nov-1999 13:23:15.561 findns: 1 NS's added for ''
08-Nov-1999 13:23:15.561 nslookup(nsp=0xefffec28, qp=0xe2f34, "lab13.lab.chase.test")
08-Nov-1999 13:23:15.562 nslookup: NS "lab13.lab.chase.test" c=1 t=2 (flags 0x2)
08-Nov-1999 13:23:15.562 nslookup: 0 ns addrs total
08-Nov-1999 13:23:15.562 sysquery: no addrs found for root NS (lab13.lab.chase.test)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
08-Nov-1999 13:23:15.562 free_nsp: lab13.lab.chase.test rcnt 1
08-Nov-1999 13:23:15.562 ns_freeqry(0xe2f34)
08-Nov-1999 13:23:15.563 do additional "" (from "")
08-Nov-1999 13:23:15.563 found it
08-Nov-1999 13:23:15.563 ns_req: answer -> [172.32.28.1].60876 fd=22 id=6 size=50 rc=0
08-Nov-1999 13:23:15.563 prime_cache: priming = 0
08-Nov-1999 13:23:15.563 sysquery(, 1, 2, 0, 0, 53)
08-Nov-1999 13:23:15.563 qnew(0xe2f34)
08-Nov-1999 13:23:15.564 find_zone(., 1)
08-Nov-1999 13:23:15.564 find_zone: existing zone 1
08-Nov-1999 13:23:15.564 findns: np 0xd6c40 ''
08-Nov-1999 13:23:15.564 findns: 1 NS's added for ''
08-Nov-1999 13:23:15.564 sysquery: duplicate
08-Nov-1999 13:23:15.564 free_nsp: lab13.lab.chase.test rcnt 2
08-Nov-1999 13:23:15.564 ns_freeqry(0xe2f34)
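The large "ttl" numbers in this log look like absolute expiry times in Unix
epoch seconds rather than TTLs. That is an interpretation, not something the
log states, but a quick Python check supports it, assuming the log's local
clock runs at UTC-5 (the message date above is given in UTC). `epoch_of` is a
hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the log's local clock is UTC-5 (US Eastern standard time).
LOCAL = timezone(timedelta(hours=-5))

def epoch_of(stamp):
    """Convert a log timestamp like '08-Nov-1999 13:23:15' to Unix seconds."""
    dt = datetime.strptime(stamp, "%d-%b-%Y %H:%M:%S")
    return int(dt.replace(tzinfo=LOCAL).timestamp())

now = epoch_of("08-Nov-1999 13:23:15")  # when the ". NS" query arrived
a_expiry = 942085392    # from "stale: ttl 942085392 -3"
ns_expiry = 942171732   # from "make_rr(...) 21 zone 0 ttl 942171732"

print("A expiry - now :", a_expiry - now)      # compare with the "-3"
print("NS expiry - now:", ns_expiry - now)     # just under one day
print("A loaded at    :", a_expiry - 60)       # expiry minus the 1M TTL
print("NS loaded at   :", ns_expiry - 86400)   # expiry minus the 1D TTL
```

Under that reading the A record went stale 3 s before the query, the NS record
still had 86337 s left, and both expiry times point back to the same load time,
942085332 (13:22:12 local), about a minute before the query.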

About a minute later the situation clears up: the A record is reloaded into the
cache. Looking at the log, I can see this actually happened in response to a
packet from the root server itself, and I'm not sure that packet was answering
a real query (there is no query log line, and the "fd" is different):

08-Nov-1999 13:24:42.025 datagram from [172.32.28.13].53, fd 4, len 66
08-Nov-1999 13:24:42.025 qfindid(19609) -> 0xe2c10
08-Nov-1999 13:24:42.025 Response (SYSTEM PRIMING -) nsid=19609 id=0
08-Nov-1999 13:24:42.025 stime 942085440/463741  now 942085482/25350 rtt 41562
08-Nov-1999 13:24:42.026 NS #0 addr [172.32.28.13].53 used, rtt 12475
08-Nov-1999 13:24:42.026 rrextract: dname  type 2 class 1 ttl 86400
08-Nov-1999 13:24:42.026 rrextract: dname lab13.lab.chase.test type 1 class 1 ttl 60
08-Nov-1999 13:24:42.026 find_zone(., 1)
08-Nov-1999 13:24:42.026 find_zone: existing zone 1
08-Nov-1999 13:24:42.026 rrsetcmp: rrsets matched
08-Nov-1999 13:24:42.027 rrsetcmp: rrsets matched
08-Nov-1999 13:24:42.027 rrsetupdate: .
08-Nov-1999 13:24:42.027 rrsetcmp: rrsets matched
08-Nov-1999 13:24:42.027 rrsetupdate: lab13.lab.chase.test
08-Nov-1999 13:24:42.027 rrsetcmp: no records in database
08-Nov-1999 13:24:42.027 db_set_update(lab13.lab.chase.test)
08-Nov-1999 13:24:42.028 rrsetupdate: lab13.lab.chase.test 0
08-Nov-1999 13:24:42.028 rrsetupdate: lab13.lab.chase.test 0
08-Nov-1999 13:24:42.028 db_set_update(<NULL>)
08-Nov-1999 13:24:42.028 db_update(lab13.lab.chase.test, 0xd0f94, 0xd0f94, 0, 051, 0xcecb0)
08-Nov-1999 13:24:42.028 db_update: hint 'lab13.lab.chase.test' 942085542
08-Nov-1999 13:24:42.029 db_update(lab13.lab.chase.test, 0xd0f70, 0xd0f70, 0, 071, 0xcecc0) hint
08-Nov-1999 13:24:42.029 db_update: flags = 0x39, sizes = 4, 4 (cmp 0)
08-Nov-1999 13:24:42.029 credibility for lab13.lab.chase.test is 0(0)(sec 1) from [172.32.28.13].53, is 4(1)(sec 0) in cache
08-Nov-1999 13:24:42.029 db_update: hint 0xd0f70 freed
08-Nov-1999 13:24:42.029 db_update: adding 0xd0f94
08-Nov-1999 13:24:42.029 rrset_free(lab13.lab.chase.test)
08-Nov-1999 13:24:42.029 1 root servers
08-Nov-1999 13:24:42.030 check_root: 1 root servers after query to root server < min
08-Nov-1999 13:24:42.030 retry(0xe2c10) id=0
08-Nov-1999 13:24:42.030 unsched(0xe2c10, 0)
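The "stime" and "now" fields look like seconds/microseconds pairs since the
epoch: the microsecond part of "now" (25350) lines up with the .025 in the log
timestamp, and their difference reproduces the "rtt 41562" figure. The Python
sketch below works through that arithmetic, assuming a UTC-5 local clock
(inferred, not stated in the post); `to_usec` is a hypothetical helper. Under
that reading the query this packet answers left the caching server at 13:24:00
local, about 41.6 s earlier, and the 942085542 in the db_update line is the
arrival time plus the 60-second TTL. That fits the packet being the reply to a
priming query the server had sent itself (matched by qfindid), but it is an
inference from the numbers only.

```python
from datetime import datetime, timedelta, timezone

LOCAL = timezone(timedelta(hours=-5))  # assumed log time zone (UTC-5)

def to_usec(field):
    """Turn a 'seconds/microseconds' pair from the log into integer microseconds."""
    sec, usec = field.split("/")
    return int(sec) * 1_000_000 + int(usec)

stime = to_usec("942085440/463741")  # "stime": when the query was sent
now = to_usec("942085482/25350")     # "now": when this response arrived

rtt_ms = (now - stime) / 1000
sent = datetime.fromtimestamp(stime / 1_000_000, LOCAL)
a_expiry = now // 1_000_000 + 60     # arrival time plus the "ttl 60" from rrextract

print(f"query sent at {sent:%H:%M:%S} local")
print(f"round trip   {rtt_ms:.0f} ms")   # compare with "rtt 41562"
print("A expiry    ", a_expiry)          # compare with 942085542 in db_update
```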

Hmm, does anyone have an idea what type of packet caused the "SYSTEM PRIMING"
entry? What else could cause this? How soon can I expect the root A record to be
re-read from the hint file under normal (high) load?

Honza

PS: At this point it really looks quite advisable for root A records to have a
larger TTL than the root NS records. But I'm more than curious how it actually
keeps working quite satisfactorily under the current setup, refusing queries
and logging the complaint message only occasionally.

--- fwd ---

From: Jan Jirousek on 11/04/99 09:08 AM

To:   bind-users at isc.org
Subject:  TTL for root NS and A records

Hi,

Can someone explain how BIND (and perhaps the various versions from 4.9.3 on)
handles TTL expiry of the root NS and A records?

I run an internal DNS "universe" with no ties to external DNS whatsoever, no
forwarding through firewalls etc. I have four root DNS servers and the "dig .
NS" output looks like the following:

---------------------------------------------------------
.                       1D IN NS        dns1.chase.com.
.                       1D IN NS        dns2.chase.com.
.                       1D IN NS        dns3.chase.com.
.                       1D IN NS        dns4.chase.com.
dns1.chase.com.          1H IN A         1.2.3.4
dns2.chase.com.          1H IN A         5.6.7.8
dns3.chase.com.          1H IN A         9.1.2.3
dns4.chase.com.          1H IN A         4.5.6.7
---------------------------------------------------------

All root servers give the same answer to the root NS query, and all root
nameservers are also authoritative for chase.com (where the A records for the
root nameservers live). Nameservers throughout the enterprise have something
like the above in their db.cache/root.hint files, typically with either no TTL
or a very high TTL listed for each record.

On some nameservers (only a few, typically caching-only BIND 4.9.4P1 servers) we
are getting repeated messages like the following. They do not appear only at
startup, and the servers seem to be answering queries normally.

----------------------------------------------------------------------
Oct 24 06:31:37 s0001 named[20006]: sysquery: no addrs found for root NS (dns1.chase.com)
Oct 24 06:31:37 s0001 named[20006]: sysquery: no addrs found for root NS (dns2.chase.com)
Oct 24 06:31:37 s0001 named[20006]: sysquery: no addrs found for root NS (dns3.chase.com)
Oct 24 06:31:37 s0001 named[20006]: sysquery: no addrs found for root NS (dns4.chase.com)
Oct 24 06:31:37 s0001 named[20006]: ns_req: no address for root server
----------------------------------------------------------------------

It has been going on like this for quite some time, and the system as a whole
appears to be working fine. But I suspect there is something wrong with the root
nameserver records, and that the system keeps going mostly because the four
root nameservers' A records keep getting refreshed on all nameservers all the
time (the same four nameservers serve many internal domains, so their A records
arrive in the additional data of many other responses).
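That hypothesis can be put as a toy model: each time a root server's A record
arrives in some response's additional data it is cached again for its TTL, and
only gaps between arrivals longer than that TTL leave the NS record without an
address. The Python function below is hypothetical and ignores BIND's
credibility rules for additional data; the arrival times in the example are
made up for illustration.

```python
def uncovered_gaps(arrivals, ttl, end):
    """Return the (start, stop) intervals after the first arrival, up to `end`,
    in which no A record is cached.

    arrivals: sorted times (seconds) at which the A record arrived and was cached
    ttl: TTL applied on each arrival, e.g. 3600 for a 1H record
    """
    gaps = []
    for cur, nxt in zip(arrivals, arrivals[1:] + [end]):
        expires = cur + ttl
        if expires < nxt:          # record lapsed before the next refresh
            gaps.append((expires, nxt))
    return gaps

# Made-up arrival times over four hours, for illustration only.
busy = [0, 1200, 2500, 9000, 9500]
print(uncovered_gaps(busy, 3600, 14400))
```

As long as the A records keep arriving less than an hour apart, the list stays
empty, which would be consistent with the messages showing up only now and then.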

I don't have good control over the nameservers giving the error messages, but I
sort of managed to replicate the problem in the lab, with a separate (single)
root server using an NS record TTL of 15 minutes and an A record TTL of 1 minute.
In the lab I run a caching-only nameserver with the Lucent QIP port of BIND 8.1,
with the lab root server in db.cache, and it gives me the same messages. It
seems unable to answer some queries after the A record expires, but only for a
fairly short period of time; then it refreshes. I will do more testing with it
later.

I noticed that the public Internet root server A records have a higher TTL than
the corresponding NS records, e.g.:

---------------------------------------------------------
.                   256071  NS  A.ROOT-SERVERS.NET.
A.ROOT-SERVERS.NET.     342471  A   198.41.0.4
---------------------------------------------------------
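To compare the two setups mechanically, the TTLs can be converted to seconds
and every root server whose A record TTL is shorter than its NS TTL flagged.
The Python sketch below is hypothetical tooling, not part of BIND or dig. It
assumes dig-style lines with the TTL in the second column, given either in
seconds or in BIND's unit notation (W, D, H, M, S, as in the "1D" and "1H"
above), and it reuses the two listings quoted in this message.

```python
import re

# BIND-style TTL units: weeks, days, hours, minutes, seconds.
UNITS = {"w": 604800, "d": 86400, "h": 3600, "m": 60, "s": 1}

def ttl_seconds(text):
    """Convert a TTL such as '1D', '1H30M' or '342471' to seconds."""
    if text.isdigit():
        return int(text)
    if not re.fullmatch(r"(\d+[wdhmsWDHMS])+", text):
        raise ValueError(f"unrecognised TTL: {text!r}")
    return sum(int(n) * UNITS[u.lower()] for n, u in re.findall(r"(\d+)([A-Za-z])", text))

def short_lived_root_addrs(listing):
    """List (server, ns_ttl, a_ttl) for root servers whose A TTL is below the NS TTL.

    Assumes dig-style lines: owner, TTL, [class], type, data.
    """
    ns_ttl, a_ttl = {}, {}
    for line in listing.strip().splitlines():
        fields = line.split()
        owner, ttl = fields[0].lower(), ttl_seconds(fields[1])
        rtype, data = fields[-2], fields[-1].lower()
        if owner == "." and rtype == "NS":
            ns_ttl[data] = ttl
        elif rtype == "A":
            a_ttl[owner] = ttl
    return [(srv, ttl, a_ttl[srv]) for srv, ttl in ns_ttl.items()
            if srv in a_ttl and a_ttl[srv] < ttl]

internal = """
.                       1D IN NS        dns1.chase.com.
.                       1D IN NS        dns2.chase.com.
.                       1D IN NS        dns3.chase.com.
.                       1D IN NS        dns4.chase.com.
dns1.chase.com.          1H IN A         1.2.3.4
dns2.chase.com.          1H IN A         5.6.7.8
dns3.chase.com.          1H IN A         9.1.2.3
dns4.chase.com.          1H IN A         4.5.6.7
"""
public = """
.                   256071  NS  A.ROOT-SERVERS.NET.
A.ROOT-SERVERS.NET.     342471  A   198.41.0.4
"""

for label, listing in (("internal", internal), ("public", public)):
    print(label, short_lived_root_addrs(listing))
```

All four internal root servers are flagged (1H A against 1D NS); the public
example is not, and there the A TTL exceeds the NS TTL by 342471 - 256071 =
86400 s, exactly one day.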

What happens when the A records for the root nameservers expire but the NS
records are still in the cache? At what point is the hint information used
again? Are there any timeouts involved? Is there any recommendation on root
nameserver record TTLs?

I searched the list archive for this, but while these errors came up several
times, they were always due to some firewall/connectivity issue or a
db.cache/root.hint misconfiguration. I think I know fairly well how the hint
information is handled at startup, and I know a little about the credibility of
cached records, so there is no need to reiterate that part.

Please cc me directly on replies; I'm only subscribed to the digest list.

Honza Jirousek