bind-9.7.2-P3 linux how to debug/troubleshoot query failures?

Thu Feb 3 20:44:07 UTC 2011

Hey all,

I'm reaching out because I'm at a loss. I have a distributed DNS
architecture with two bind-9.7.2-P3 servers behind an F5 load balancer,
and another two behind a second F5 at another location.

My app servers are configured with their resolv.conf looking like:
(please ignore the domain and networks, they have been altered)

search gc.domain.net
domain gc.domain.net
nameserver 1.1.1.15
nameserver 1.1.2.56
options timeout:1

What I'm finding is that a ton of requests are going to the 1.1.2.56
address. The servers at 1.1.1.15 (again, behind the F5) look healthy: no
retransmissions, no excessive load, nothing that tells me they are
having issues. Yet my app servers seem to fail to get answers from them
and fail over to the secondary DNS servers, and I can't figure out why.

If I run a script that does a dig, I can't get it to fail over to the
secondary DNS. But anything in our code that uses gethostbyname, or the
host command, seems to hit a failed lookup and then fails over to the
secondary nodes, which are across the internet.
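
Since dig does not go through the same lookup path as the application,
a test that calls the libc resolver directly may reproduce the problem
more faithfully. The following is only a minimal sketch of that idea:
the hostname app1.gc.domain.net is a made-up placeholder, and the 0.9
second threshold is an assumption chosen to sit just under the
timeout:1 setting, so a flagged lookup probably waited out the first
nameserver.

```python
import socket
import time


def time_lookup(name, resolve=socket.getaddrinfo):
    """Resolve name via the system resolver; return (elapsed_seconds, result_or_error)."""
    start = time.monotonic()
    try:
        result = resolve(name, None)
    except socket.gaierror as exc:
        result = exc  # keep the error so hard failures are visible too
    return time.monotonic() - start, result


def suspicious(samples, threshold=0.9):
    """Keep samples that failed outright or took at least `threshold` seconds."""
    return [(t, r) for t, r in samples
            if isinstance(r, socket.gaierror) or t >= threshold]


if __name__ == "__main__":
    name = "app1.gc.domain.net"  # hypothetical hostname, substitute a real one
    samples = [time_lookup(name) for _ in range(200)]
    for elapsed, result in suspicious(samples):
        status = "FAILED" if isinstance(result, socket.gaierror) else "slow"
        print(f"{status} after {elapsed:.3f}s: {result!r}")
```

If a local cache such as nscd is running, repeated lookups of one name
may never reach the DNS servers, so the results are only meaningful
with caching out of the picture.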

Is there a documented method to troubleshoot or debug why a system
believes it was unable to get an acceptable result from the primary DNS
server?
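
One way to narrow this down might be to query each nameserver address
directly with the same 1 second timeout and record both the latency
and the response code. As far as I know, the stub resolver can move to
the next server on responses such as SERVFAIL or REFUSED, not only on
a timeout, so the rcode matters as well. This is a rough sketch using
only the standard library: the hostname is a placeholder, the helper
names build_query and probe are my own, and it sends a plain A query
over UDP with recursion requested.

```python
import random
import socket
import struct
import time


def build_query(name, qid, qtype=1):
    """Build a minimal DNS query packet (class IN, recursion desired)."""
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def probe(server, name, timeout=1.0):
    """Send one UDP query; return (elapsed, rcode), or (None, None) on timeout."""
    qid = random.randint(0, 0xFFFF)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(build_query(name, qid), (server, 53))
            data, _ = sock.recvfrom(4096)
        except socket.timeout:
            return None, None
        elapsed = time.monotonic() - start
    if len(data) < 12:
        return elapsed, None  # too short to be a DNS message
    reply_id, flags = struct.unpack("!HH", data[:4])
    if reply_id != qid:
        return elapsed, None  # reply does not match our query
    return elapsed, flags & 0x000F  # low 4 bits of the flags are the rcode


if __name__ == "__main__":
    name = "app1.gc.domain.net"  # hypothetical hostname
    for server in ("1.1.1.15", "1.1.2.56"):  # addresses from resolv.conf above
        for _ in range(100):
            elapsed, rcode = probe(server, name)
            if elapsed is None:
                print(f"{server}: no reply within 1s")
            elif rcode != 0:
                print(f"{server}: rcode={rcode} after {elapsed:.3f}s")
```

Running this from an affected app server, rather than from the DNS
servers themselves, keeps the F5 and the network path in the test.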

There don't appear to be any health-related issues, so I'm at a loss. I
feel the DNS infrastructure is healthy, but at this point I need some
assistance proving that it's not, and therefore fixing it!

I added the 1-second timeout because I was seeing 5-second delays in
our application, which again were due to the resolver waiting for the
primary server to respond before it could fail over. Now after a second
it just goes to the secondary DNS and seems to be happy (most of the
time; I'm getting some hard failures that I'm trying to troubleshoot as
well).

Thanks
Tory