Caching only nameserver fails to resolve external zones periodically
Curtis Rempel
curtis at telus.net
Mon May 17 18:59:44 UTC 2004
On Mon, 17 May 2004 18:07:17 +0000, phn wrote:
> Curtis Rempel <curtis at telus.net> wrote:
>> Hi,
>
>> I've got a caching name server which also handles a zone (.lan) on an
>> internal 192.168.1.0/24 network. Both internal and external lookups work
>> fine as I have a forwarder entry defined in
>> /var/named/chroot/etc/named.conf
>
>> That is, until "something" happens which causes the external lookups to
>> fail. The internal zone resolution still works; however, as far as I can
>> tell, the forwarder does not respond, and named then starts crawling
>> through the root name servers and eventually gives up.
>
>> Here's some sample output (from Fedora Core 1 Linux and bind 9.2.2.P3-9):
>
>> When everything is working (i.e. immediately after a 'service named
>> restart' command), the following 'host' command works. However, when
>> things aren't working, I get the following output:
>
>> [root at vault root]# host www.telus.net
>> ;; connection timed out; no servers could be reached
>
>> This can be rectified by restarting the name server as above, but only for
>> a while (the interval varies), and then external lookups hang again. The
>> internal zone information can still be resolved.
>
>> When the system is not responding to external zone lookups, a tcpdump
>> looks like this with the above 'host' command:
>
>> 15:51:01.996338 vault.lan.33305 > ns7so.cg.shawcable.net.domain: 35946+ [1au] A? www.telus.net. (42) (DF)
>> 15:51:03.728476 vault.lan.33305 > f.root-servers.net.domain: 50741 PTR? 182.181.179.142.in-addr.arpa. (46) (DF)
>> 15:51:06.008121 vault.lan.33305 > 198.41.0.4.domain: 14024 [1au] A? www.telus.net. (42) (DF)
>> 15:51:07.747854 vault.lan.33305 > G.ROOT-SERVERS.NET.domain: 52631 PTR? 182.181.179.142.in-addr.arpa. (46) (DF)
>> 15:51:10.027489 vault.lan.33305 > 128.9.0.107.domain: 65124 [1au] A? www.telus.net. (42) (DF)
>> 15:51:11.767237 vault.lan.33305 > 128.63.2.53.domain: 65468 PTR? 182.181.179.142.in-addr.arpa. (46) (DF)
>> 15:51:14.046919 vault.lan.33305 > 192.33.4.12.domain: 65502 A? www.telus.net. (31) (DF)
>> 15:51:15.786573 vault.lan.33305 > 192.36.148.17.domain: 32751 PTR? 182.181.179.142.in-addr.arpa. (46) (DF)
>> 15:51:18.066210 vault.lan.33305 > d.root-servers.net.domain: 55260 A? www.telus.net. (31) (DF)
>> 15:51:19.038994 laser.lan.1024 > vault.lan.domain: 27316 A? fsa.cpsc.ucalgary.ca. (50)
>> 15:51:19.805969 vault.lan.33305 > k.root-servers.net.domain: 13778 PTR? 182.181.179.142.in-addr.arpa. (46) (DF)
>> 15:51:22.085587 vault.lan.33305 > E.ROOT-SERVERS.NET.domain: 3376 A? www.telus.net. (31) (DF)
>> 15:51:23.825310 vault.lan.33305 > 202.12.27.33.domain: 1688 PTR? 182.181.179.142.in-addr.arpa. (46) (DF)
>> 15:51:26.104947 vault.lan.33305 > f.root-servers.net.domain: 844 A? www.telus.net. (31) (DF)
>> 15:51:27.844754 vault.lan.33305 > j.root-servers.net.domain: 33190 PTR? 182.181.179.142.in-addr.arpa. (46) (DF)
>> 15:51:30.124317 vault.lan.33305 > G.ROOT-SERVERS.NET.domain: 49363 A? www.telus.net. (31) (DF)
>> 15:51:31.864043 vault.lan.33305 > l.root-servers.net.domain: 18756 PTR? 182.181.179.142.in-addr.arpa. (46) (DF)
>> 15:51:34.143694 vault.lan.33305 > 128.63.2.53.domain: 4724 A? www.telus.net. (31) (DF)
>> 15:51:35.883596 vault.lan.33305 > ns7so.cg.shawcable.net.domain: 2362+ PTR? 182.181.179.142.in-addr.arpa. (46) (DF)
>> 15:51:38.163051 vault.lan.33305 > 192.36.148.17.domain: 1181 A? www.telus.net. (31) (DF)
>> 15:51:40.902620 vault.lan.33305 > 198.41.0.4.domain: 24263 PTR? 182.181.179.142.in-addr.arpa. (46) (DF)
>> 15:51:42.182418 vault.lan.33305 > k.root-servers.net.domain: 22529 A? www.telus.net. (31) (DF)
>
>> The first entry above (15:51:01) indicates that the request is being
>> forwarded to the "forwarders" entry, which resolves to
>> ns7so.cg.shawcable.net
>
>> When external resolution is working, this is the last entry as
>> ns7so.cg.shawcable.net provides the answer.
>
>> In a "hung" lookup, the output is as above: the first stop is the
>> forwarder entry, then the root servers, and finally failure.
>
>> Does anybody have any idea why this external name resolution is
>> periodically failing like this? Any suggestions for debugging info?
>
>> It seems that external lookups can function fine for days and then quit,
>> sometimes only minutes and then quit.
>
>> Thanks!
>
>> curtis at telus dot net (which the smarter spambots can likely figure out
>> anyway...)
>
> I see three issues here:
>
> 1/ the zone "telus.net" is badly configured in several respects (a mismatch
> between the nameservers it is delegated to and the list of nameservers those
> servers report, very short TTLs on the NS records, etc.).
>
> 2/ you are running a beta version of bind. Why? 9.2.3 has been available for
> a long time.
>
> 3/ you state that you use forwarders. Why? Failure of the forwarders might
> give the behaviour you observe.

Thanks for your reply.

1/ - telus.net is only one zone I happened to use for the example. As it
turns out, any external zone lookup fails.

2/ - 9.2.2.P3-9 is what was "out of the box" on Fedora Core 1 and is the
latest according to yum and rpmfind.net (latest RPM that is). I am a
little hesitant to download/compile bind from source for the latest, I
would rather keep everything RPM if possible.

3/ - This is my prime suspicion - that the forwarder IP is failing.
However, here is some additional information I've since discovered: once
named gets into this failed state where the host command does not respond
correctly (i.e. it returns 'no servers could be reached'), I can specify
the IP of the forwarder on the 'host' command as follows and query the
forwarder directly and all works:
# host www.telus.net 64.59.135.133
Using domain server:
Name: 64.59.135.133
Address: 64.59.135.133#53
Aliases:
www.telus.net is an alias for cityweb.telus.net.
cityweb.telus.net has address 198.161.157.214

If I then use 'host telus.net' immediately after, it still fails. The
only way to get the caching name server working again is a 'service named
restart'.

So, perhaps, the forwarder is not faulty at all but maybe bind is?
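
For anyone wanting to repeat the comparison described above, here is a minimal shell sketch of it. It asks the local server and the forwarder for the same name and reports which side failed. The 127.0.0.1 address for the local server, the 5 second timeout, and the helper names classify and check_once are assumptions made for illustration; only 'host' and its -W option come from the standard tool.

```shell
#!/bin/sh
# Sketch: compare the local caching server with the forwarder queried directly.
# Assumptions: the default values below and the helper names classify and
# check_once are illustrative, not taken from any existing tool.

NAME=${NAME:-www.telus.net}
LOCAL=${LOCAL:-127.0.0.1}
FORWARDER=${FORWARDER:-64.59.135.133}

# classify LOCAL_STATUS FORWARDER_STATUS
# Turns the two exit statuses from host (0 = lookup succeeded) into a verdict.
classify() {
    if [ "$1" -eq 0 ] && [ "$2" -eq 0 ]; then
        echo ok
    elif [ "$1" -ne 0 ] && [ "$2" -eq 0 ]; then
        echo local-only-failure
    elif [ "$1" -eq 0 ]; then
        echo forwarder-only-failure
    else
        echo both-failing
    fi
}

# check_once: run both lookups and print one timestamped verdict line.
check_once() {
    host -W 5 "$NAME" "$LOCAL" >/dev/null 2>&1
    local_status=$?
    host -W 5 "$NAME" "$FORWARDER" >/dev/null 2>&1
    forwarder_status=$?
    echo "$(date '+%Y-%m-%d %H:%M:%S') $(classify "$local_status" "$forwarder_status")"
}

# To poll once a minute, uncomment the next line:
# while :; do check_once; sleep 60; done
```

A local success may simply be a cached answer, so a name with a short TTL makes the check more meaningful. A run of 'local-only-failure' lines would point at named rather than at the forwarder, and the timestamps can be matched against the named log.
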
If that suspicion is true, can you suggest some sort of logging I might
enable to see if in fact bind is falling over?

Thanks!