Queries fail first time round

Mon Jul 28 14:40:56 UTC 2003

Simon Waters wrote:

>> I have a longstanding issue with our BIND setup. If a domain has not
>>  been queried for some time, a query for a host in it often fails the
>>  first time, but then repeating the request results in a successful
>>  lookup. The lookup failures happen with a wide range of domains, so
>>  I'm pretty certain it's going to be something internal.
>
>Do give an example domain, it might help, as there are a lot of badly
>configured domains.

OK, one I had this morning was www.roxio.de - first attempt, no 
address found, hit reload and it loads fine. Then 
softwareupdates.roxio.com, same thing.

>I'd suggest trying to reproduce with "dig" at the server to eliminate
>client code.

I've just tried softwareupdates.roxio.com and dig came back after a
few seconds with "connection timed out; no servers could be reached".
I retried immediately and it gave me the answer. This was on the Red
Hat box running BIND 9.

I restarted BIND, ran "rndc querylog", then tried
"dig +time=60 softwareupdates.roxio.com". I see one query for the
address, and dig returns a result some 55-60 seconds later.
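The cold-versus-warm comparison above can also be scripted. What follows is only a minimal sketch, not a replacement for dig: the function names are made up for illustration, and socket.gethostbyname goes through the system resolver, whose own timeout and retry settings (not the 60 seconds given to dig) decide when the first attempt gives up.

```python
import socket
import time


def time_lookup(name, resolve=socket.gethostbyname, clock=time.monotonic):
    """Resolve `name` once and return (seconds_taken, address_or_None)."""
    start = clock()
    try:
        address = resolve(name)
    except OSError:
        # socket.gaierror is a subclass of OSError; any failure counts as "no answer".
        address = None
    return clock() - start, address


def compare_first_and_second(name, resolve=socket.gethostbyname, clock=time.monotonic):
    """Look `name` up twice in a row so a cold and a warm lookup can be compared."""
    return [time_lookup(name, resolve, clock) for _ in range(2)]
```

Run on the server itself, compare_first_and_second("softwareupdates.roxio.com") returns two (seconds, address) pairs. A slow or failed first pair followed by a fast second pair would match the behaviour described here.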

I turned on some logging (see below), and this is the only relevant 
output I found (obviously there was a lot more related to other 
queries going through this server) :

queries: info: client 127.0.0.1#4557: query: softwareupdates.roxio.com IN A
resolver: debug 1: createfetch: softwareupdates.roxio.com A

database: debug 1: no_references: delete from rbt: 0x8139608 ns1.fp.sandpiper.net
database: debug 1: no_references: delete from rbt: 0x8139608 ns2.fp.sandpiper.net
database: debug 1: no_references: delete from rbt: 0x8139608 ns3.fp.sandpiper.net
database: debug 1: no_references: delete from rbt: 0x8139608 ns4.fp.sandpiper.net
database: debug 1: no_references: delete from rbt: 0x8139608 ns5.fp.sandpiper.net
database: debug 1: no_references: delete from rbt: 0x8139608 ns6.fp.sandpiper.net
database: debug 1: no_references: delete from rbt: 0x8139608 ns7.fp.sandpiper.net
database: debug 1: no_references: delete from rbt: 0x8139608 ns8.fp.sandpiper.net
database: debug 1: no_references: delete from rbt: 0x8139608 ns9.fp.sandpiper.net

I also tried with www.apple.com, a reply came back in 47s and this 
was all that I saw logged :

queries: info: client 127.0.0.1#4557: query: www.apple.com IN A
resolver: debug 1: createfetch: www.apple.com A
resolver: debug 1: createfetch: www.apple.com.akadns.net A
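Since the interesting entries are buried among those for other clients' queries, a small filter can pull out just the fetches the resolver started. This is a rough sketch that assumes the "category: severity: message" layout seen in the excerpts here (print-category and print-severity on, no timestamps) and one entry per line, so entries wrapped by a mail client would need rejoining first. The function names are hypothetical.

```python
import re

# category, then severity ("info" or "debug N"), then the rest of the entry
LINE = re.compile(r"^(?P<category>[\w-]+): (?P<severity>debug \d+|\w+): (?P<message>.*)$")


def parse(line):
    """Split one log entry into category, severity and message, or return None."""
    match = LINE.match(line.strip())
    return match.groupdict() if match else None


def fetched_names(lines):
    """Return, in order, the names the resolver logged a createfetch for."""
    names = []
    for line in lines:
        record = parse(line)
        if (
            record
            and record["category"] == "resolver"
            and record["message"].startswith("createfetch: ")
        ):
            names.append(record["message"].split()[1])
    return names
```

Fed the three www.apple.com entries above, fetched_names would list www.apple.com followed by www.apple.com.akadns.net, which makes the second fetch for the akadns.net name easy to spot.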

>> Does anyone have any clues how to track this one down?
>
>Apple used to do some odd resolver stuff, do you know precisely what the
>client is doing?

Usually it's web browsing. I've been using Netscape 7, and now Safari
1, my OS is Mac OS X 10.2.6, but I did see the problem before moving
to OS X. Thinking about it, I tend to use a command line FTP client
and don't recall seeing the problem with that.

But since I seem to be able to reproduce the problem from the server, 
it suggests that it is not a client issue.

>Also you can get the situation where the client times out before the
>server, if it takes a long time to resolve a query. This is often packet
>loss or badly configured servers causing undue delay in resolving, but
>leads to the circumstance you describe.
>
>I have at least one web browser that routinely gives up after a few
>seconds for any DNS request, most of the time reload works fine as the
>server has finished completing the request in the mean time.

That is EXACTLY the effect I see.
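The effect can be modelled in a few lines. This is a toy sketch under stated assumptions: the server starts resolving on the first query, keeps working after the client has given up, and caches the answer once it has it. The function name and the numbers in the usage note are illustrative, not measured.

```python
def attempt_outcomes(resolve_seconds, client_timeout, attempt_starts):
    """Return "answer" or "timeout" for each client attempt.

    resolve_seconds: time the server needs to get the answer, counted
                     from the first attempt.
    client_timeout:  how long the client waits on one attempt.
    attempt_starts:  times (in seconds) at which the client asks.
    """
    if not attempt_starts:
        return []
    # The server only has to do the slow work once; later attempts hit its cache.
    answer_ready_at = attempt_starts[0] + resolve_seconds
    outcomes = []
    for start in attempt_starts:
        gives_up_at = start + client_timeout
        outcomes.append("answer" if answer_ready_at <= gives_up_at else "timeout")
    return outcomes
```

With a server that needs 8 seconds and a client that waits 5, attempt_outcomes(8, 5, [0, 10]) gives a timeout followed by an answer: the first request fails, and the reload succeeds because the answer arrived in between.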

>Similarly, an overloaded BIND server might also cause the client to time
>out.
>
>And yes upgrade from BIND 8.2, why not BIND 9 as the ISC recommend?

Well, there's an element of 'that is what comes pre-built from
SCO', but yes, I will look at upgrading. But since I'm able to
reproduce the issue on the box running BIND 9.2.2, I don't think that
is likely to be my problem.

>Before packet tracing I usually switch on BIND's own debugging
>facilities, which are easier to read than some packet tracers' output.

With help from "DNS and BIND", I've added this to my config:

logging {
   channel gen_log {
     file "general.log" ;
     severity dynamic ;
     print-category yes ;
     print-severity yes ;
   } ;

   category general {gen_log; } ;
   category client {gen_log; } ;
   category database {gen_log; } ;
   category lame-servers {gen_log; } ;
   category network {gen_log; } ;
   category queries {gen_log; } ;
   category resolver {gen_log; } ;
} ;

None of the other categories sound like they should be involved in 
any way with these lookups. Debug level is 2 according to rndc status.

I do know that at the moment, our internet connection is reasonably
busy. In fact, ping reports an average round trip time of 7 1/2
seconds to www.apple.com and over 6 seconds to www.demon.co.uk (our
ISP) - that's worse than I usually see :-(
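As a back-of-envelope check, the lookup times reported earlier can be divided by that round trip time to see how many sequential exchanges they would correspond to. This is a crude sketch: it assumes the delay to the remote name servers is similar to the ping time to www.apple.com and that the lookup time is made up purely of back-to-back round trips, ignoring retransmissions and resolver timeouts.

```python
def implied_round_trips(lookup_seconds, rtt_seconds):
    """How many back-to-back round trips would add up to the lookup time."""
    return lookup_seconds / rtt_seconds


RTT = 7.5  # seconds, the ping average to www.apple.com reported above

for label, seconds in [
    ("www.apple.com", 47),
    ("softwareupdates.roxio.com (low)", 55),
    ("softwareupdates.roxio.com (high)", 60),
]:
    print(f"{label}: about {implied_round_trips(seconds, RTT):.1f} round trips")
```

That works out to roughly 6 to 8 sequential exchanges. A cold lookup has to walk from the root down through several delegations, and here also follow at least one extra name (the akadns.net fetch in the log), so the link delay alone could plausibly account for lookups of this length.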

Simon

-- 

NOTE: This is a throw-away email address which will reach me for as 
long as it stays spam-free, remove date for real address.

Simon Hobson, Technology Specialist
Colony Gift Corporation Limited
Lindal in Furness, Ulverston, Cumbria, LA12 0LD
Tel 01229 461100, Fax 01229 461101

Registered in England No. 1499611
Regd. Office : 100 New Bridge Street, London, EC4V 6JA.