Strange / Frustrating Caching Problems

Fri Jul 14 15:29:32 UTC 2006

Merton Campbell Crockett wrote:
> On 13 Jul 2006, at 11:43, Smith, William E. (Bill), Jr. wrote:
>
>   
>> -----Original Message-----
>> From: Mark_Andrews at isc.org [mailto:Mark_Andrews at isc.org]
>> Sent: Thursday, July 13, 2006 1:55 PM
>> To: Smith, William E. (Bill), Jr.
>> Cc: bind-users at isc.org
>> Subject: Re: Strange / Frustrating Caching Problems
>>
>>
>>     
>>> For the past few months, I have been trying (unsuccessfully to this
>>> point) to resolve a problem with a trio of caching-only name servers
>>> that we have in place.  The general nature of the problem is as
>>> follows.  A DHCP client originally gets an IP address on subnet A but
>>> at some point prior to lease expiration moves to subnet B, where it
>>> obtains a new IP address successfully.  The problem that I am seeing
>>> is that after the move to subnet B, one or more of our caching-only
>>> name servers are still returning the old IP address when a lookup of
>>> the hostname occurs.  This behavior seems reasonable at first glance,
>>> since caching-only servers should retain the information they have in
>>> cache until the TTL expires and/or the cache is flushed.  After
>>> digging into this further, I'm finding that the TTLs for the hosts
>>> whose forward lookups are returning the wrong IP are set to 604800
>>> seconds, or 168 hours.  I've determined this by dumping and viewing
>>> the cache.  In addition, I've discovered that the TTL for the reverse
>>> record for the same client is also set to this high value.  This
>>> behavior would seem reasonable if this high value were the TTL
>>> configured for the domain, which is not the case here.  We have the
>>> default TTL in our environment set to 10800 seconds, or 3 hours.
>>> Thus, I'm a little baffled as to why the TTLs for some of these DHCP
>>> clients are being set to such a high value when other clients have
>>> their TTLs set to the 10800 value configured at the domain level.
>>> I've checked the registration at the object level (in our IP
>>> management application) and the TTL field is blank, thus implying the
>>> default TTL is in place.
>>     
>>> Aside from the above details, I can also note that the problematic
>>> lookups seem to involve the same DHCP clients.  The only reason I
>>> know about these clients is that they are unable to SSH to some Unix
>>> boxes in a DMZ that restrict access to hosts that they can perform
>>> both forward and reverse lookups for.
>>     
>>> In this scenario, the forward lookup is failing since it's returning
>>> the old IP address of the client.  When this problem occurs, it tends
>>> to affect one or two of the caching servers but not all three.
>>> Furthermore, it is somewhat random as to which of the 3 servers are
>>> affected.
>>     
>>> The caching servers in question are all Solaris 9 running BIND 9.3.2
>>>
>>> If anyone can provide some insight here, it would be much
>>> appreciated.
>>>
>>> I can provide additional information and/or elaborate on something as
>>> needed.
>>     
>>> Bill Smith
>>> <mailto:bill.smith at jhuapl.edu>
>>> ISS Server Systems Group
>>> Johns Hopkins University Applied Physics Laboratory
>>> 11100 Johns Hopkins Road
>>> Laurel, MD 20723
>>> Phone:  443-778-5523
>>> Web:  http://www.jhuapl.edu
>>>       
>> 	Nameservers do what the dhcp servers tell them to do.  The TTL
>> 	is set by the DHCP server.  Try lowering the dhcp lease time as
>> 	that influences the DNS TTL.
>>     
>
>
> In an environment where people can wander with their laptops from  
> subnet to subnet, why do you have caching only name servers?
>
> These name servers should, at least, have the local zones defined as  
> forward or stub zones to minimize the amount of erroneous data being  
> returned in a volatile environment.
>   
Uh, how will that help? Caching still occurs -- and TTLs are honored -- 
even for names in "forward" or "stub" zones.

The only way I can think of to speed up this propagation, short of 
reducing the TTLs that are set by the DHCP server, or running a modified 
version of BIND (e.g. QIP's version, in which secondaries can receive 
Dynamic Updates), or an out-of-band replication mechanism, is to set up 
all of the servers as stealth slaves enumerated in the relevant 
also-notify(s), so that the changes should replicate fairly quickly.
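
As a rough, untested sketch of that last option, assuming the master is
a BIND 9 server: the zone name, file names and addresses below are
placeholders and are not taken from Bill's environment. The master lists
each of the three servers in also-notify and allows them to transfer the
zone, and each of those servers carries the zone as a slave without
appearing in the zone's NS records. A server answers from its own copy
of a zone it slaves rather than from cache, so a stale cached record
with a long TTL should no longer come into play for those names.

```
// On the master (placeholder address 192.0.2.1):
zone "example.com" {
    type master;
    file "db.example.com";
    // Stealth slaves are not in the NS RRset, so they only get a
    // NOTIFY if they are named explicitly here.
    also-notify { 192.0.2.11; 192.0.2.12; 192.0.2.13; };
    allow-transfer { 192.0.2.11; 192.0.2.12; 192.0.2.13; };
};

// On each of the three former caching-only servers:
zone "example.com" {
    type slave;
    file "slave/db.example.com";
    masters { 192.0.2.1; };
};
```

The reverse zone(s) covering the DHCP subnets would presumably need the
same treatment, since the SSH access check depends on both lookups.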

- Kevin