Bind 8.2.3, query-restart on expired NS and A record

Thu Oct 11 02:45:31 UTC 2001

[Mark_Andrews at isc.org: Thu, Oct 11, 2001 at 09:17:56AM +1000]
> 
> 	The problem is you have very small TTL on the NS record and
> 	as BIND 4 and BIND 8 don't have query restart the record
> 	is timing out before the second query comes in.
> 
> 	5 minutes would be stable.
> 

If the TTL of the NS and its corresponding A is 1 minute, it will
happen on the first query that occurs 1+ minute after they were
written to the cache.

If the TTL is 5 minutes, it will happen on the first query that
occurs 5+ minutes after they were written to the cache.

Same with 1 hour TTLs.

I've monitored all those cases.

It will happen any time the NS and the A record that it refers to are
simultaneously stale. It doesn't matter how many times they've been used.
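
To make the timing concrete, here is a minimal Python sketch of the
condition above. The function name is made up for illustration, and it
assumes the NS and its A were cached at the same moment with the same
TTL. It models only the expiry arithmetic, nothing of BIND itself.

```python
def first_failing_query(written_at, ttl, query_times):
    """Return the first query time at which the NS and its A are both stale.

    Hypothetical helper: assumes both records were written to the cache
    at `written_at` (seconds) with the same `ttl` (seconds).
    """
    expires = written_at + ttl
    for t in sorted(query_times):
        # Earlier queries do not matter; only the first one after expiry does.
        if t > expires:
            return t
    return None


if __name__ == "__main__":
    # Records cached at t=0 with a 1 minute TTL, queried at various times.
    print(first_failing_query(0, 60, [10, 30, 59, 61, 200]))
```

How often the records were used before they expired plays no part in
the result, which matches what I saw in all three TTL cases.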

It happens because, when the NS is updated, the accompanying additional
record doesn't get written to the cache (or used to respond to the
current request): the cached A matches it, nothing checks whether that
cached A is stale, and BIND doesn't refresh TTLs on existing records.
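
Here is a toy Python model of that behavior, as a sketch only. It is
not BIND's code: the class, the function and the rule "an existing entry
with the same data is left untouched" are my simplification of what the
trace further down shows, and the assumption that the stale NS gets
deleted on the lookup path is mine as well. The names, address and TTL
of 100 are the ones from the trace.

```python
class ToyCache:
    """Hypothetical cache: (name, type) -> (rdata, absolute expiry time)."""

    def __init__(self):
        self.entries = {}

    def lookup(self, name, rrtype, now):
        """Return rdata if fresh. A stale entry is deleted only when looked up."""
        entry = self.entries.get((name, rrtype))
        if entry is None:
            return None
        rdata, expires = entry
        if now > expires:
            del self.entries[(name, rrtype)]
            return None
        return rdata

    def update(self, name, rrtype, rdata, ttl, now):
        """Store a record from a response; return True if it was written.

        An entry with the same data is left alone and its TTL is not
        refreshed, even if that entry has already expired.
        """
        entry = self.entries.get((name, rrtype))
        if entry is not None and entry[0] == rdata:
            return False
        self.entries[(name, rrtype)] = (rdata, now + ttl)
        return True


def resolve_delegation(cache, now):
    """Follow the gluetest.limey.net delegation; return the NS address or None."""
    # Lookup path: a stale NS is found here and deleted.
    cache.lookup("gluetest.limey.net", "NS", now)
    # The parent's response carries the NS plus the glue A.
    cache.update("gluetest.limey.net", "NS", "px.limey.net", 100, now)
    cache.update("px.limey.net", "A", "204.168.16.17", 100, now)
    # Following the delegation needs a fresh address for px.limey.net.
    return cache.lookup("px.limey.net", "A", now)


if __name__ == "__main__":
    cache = ToyCache()
    print(resolve_delegation(cache, 0))    # clean cache: NS and A both written
    print(resolve_delegation(cache, 150))  # both stale: glue skipped, no address
    print(resolve_delegation(cache, 151))  # stale A now deleted: glue accepted
```

In this model the query at t=150 comes back without an address for
px.limey.net, so a separate query for it would be needed, while the very
next query succeeds because the stale A was deleted when it was looked up.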

I really just want to document it for the archives at this point: I
didn't think query restarts could be triggered on zones using all
"in-bailiwick" delegations and including the appropriate additional
records.

-P

> 	Mark
> 
> > 
> > [Mark.Andrews at isc.org: Wed, Oct 10, 2001 at 10:19:13AM +1000]
> > > 
> > > > does a bind 8.2.3 stub resolver ever overwrite existing cache entries
> > > > with records received from an additional section for the same record?
> > > 
> > > 	Well given that the stub resolver doesn't have a cache this
> > > 	does not make sense the way it was written.
> > 
> > you and kevin are right here: I meant a recursive server. sorry for
> > the confusion.
> > 
> > > 	The nameserver does not refresh TTL based on answers it
> > > 	receives (though earlier versions did creating server lock).
> > 
> > I think this is the source of my problem. BIND 8 does not refresh
> > TTLs even for expired records - expired records must first be
> > explicitly deleted before a record with a fresh TTL will take their
> > place. I think this can be problematic with respect to glue
> > records. I'll back it up with a bind trace in a second.
> > 
> > But the setting is this: the cache has a stale A record for
> > gluetest.limey.net, it also has a stale NS record for
> > gluetest.limey.net (that delegates to px.limey.net), and it also has a
> > stale A record for px.limey.net. It has 2 fresh NS records for
> > limey.net (delegating to sidehack.gweep.net and ayup.limey.net) as
> > well as fresh A records for sidehack and ayup.
> > 
> > [** First we find the A for gluetest.limey.net, figure out that it's
> >     stale and delete it. **]
> > req: found 'gluetest.limey.net' as 'gluetest.limey.net' (cname=0)
> > stale: ttl 1002571133 -7 (x2)
> > delete_all(0x80ef0e0:"gluetest" IN A)
> > 
> > [** The best we can do with what is fresh is to contact the limey.net
> >      nameservers. They reply with an NS record and a glue record **]
> > nslookup(nsp=0xbfbfeaf8, qp=0x810f000, "gluetest.limey.net")
> > nslookup: NS "AYUP.limey.net" c=1 t=2 (flags 0x2)
> > nslookup: NS "SIDEHACK.GWEEP.net" c=1 t=2 (flags 0x2)
> > nslookup: 2 ns addrs total
> > forw: forw -> [65.105.101.18].53 ds=4 nsid=1531 id=37409 18ms retry 4sec
> > Response (USER NORMAL -) nsid=1531 id=37409
> > gluetest.limey.net.     1m41s IN NS     px.limey.net.
> > px.limey.net.           4m1s IN A       204.168.16.17
> > rrextract: dname gluetest.limey.net type 2 class 1 ttl 100
> > rrextract: dname px.limey.net type 1 class 1 ttl 100
> > 
> > [** Now, as I understand it, we check the extracted records against
> >   the existing cache. The NS record doesn't match anything (we just
> >   deleted it), but px.limey.net matches an existing record. That
> >   record is stale, but we pay no heed **]
> > rrsetupdate: gluetest.limey.net
> > rrsetcmp: no records in database
> > rrsetupdate: gluetest.limey.net 0
> > rrsetupdate: px.limey.net
> > rrsetcmp: rrsets matched
> > 
> > [** we now write the NS to the cache.. but not the A. When I run this
> >     same trace on a clean cache it writes both the NS and the A here. **]
> > db_update(gluetest.limey.net, 0x810c1f8, 0x810c1f8, 0, 031, 0x80feca0)
> > db_update: adding 0x810c1f8
> > 
> > [** we now return to the business of following that delegation - but 
> >     the server can't find the px A record "wanted!" **]
> > resp: nlookup(gluetest.limey.net) qtype=1
> > resp: found 'gluetest.limey.net' as 'gluetest.limey.net' (cname=0)
> > wanted(0x810c1f8, IN A) [IN NS]
> > 
> > we now need to start a query for px.limey.net - that triggers the
> > query restart behavior and timeouts which started all of this in the
> > first place. The problem seems to be that for a record's TTL to be
> > updated, the record first has to be explicitly deleted - and deletions
> > only happen when a cache lookup has already found the data to be
> > stale. Because glue records are really hints about future queries,
> > they are not pre-emptively deleted.
> > 
> > -P
> > 
> --
> Mark Andrews, Internet Software Consortium
> 1 Seymour St., Dundas Valley, NSW 2117, Australia
> PHONE: +61 2 9871 4742                 INTERNET: Mark.Andrews at isc.org
>