bind 8 slow when resolving new domains!

Fri May 7 01:38:39 UTC 2004

On Thu, 06 May 2004 21:59:53 +0100, Simon Waters
<Simon at wretched.demon.co.uk> wrote:

>dap99 at i-55.com wrote:
>> I am having a big problem with slow internal DNS (named 8.3.7-REL on
>> FreeBSD 4.9).
>What no BIND 9?

We are currently using bind 8. No, no bind 9 at this time. :)

>> Also, we are using their two DNS servers as forwarders.
>
>"Red alert, captain".

Someone else mentioned that. I have since removed the forwarding
options in options {}.

>> The colo promises it's not them, but frankly I can't see how it's us.
>> 
>> # tcpdump -n host ns2 and \( icmp or udp \)
>> 10:07:37.832611 192.168.42.78.53 > isp-dns1.53:  4240+ [1au] A? www.altavista.com. (46)
>> 10:07:51.013213 192.168.42.78.53 > isp-dns2.53:  4240+ [1au] A? www.altavista.com. (46)
>> 10:07:51.074160 isp-dns2.53 > 192.168.42.78.53:  4240 2/9/10 CNAME[|domain] (DF)
>> 10:07:51.074476 192.168.42.78.53 > isp-dns1.53:  17509+ [1au] A? avatw.search.yahoo2.akadns.net. (59)
>> 10:07:51.131568 isp-dns1.53 > 192.168.42.78.53:  17509 1/9/10 (393) (DF)
>
...
>with queries from thousands of clients (or even one or two busy email
>servers), you may save a few tenths of a second per query by using them,
>but at the cost of slow responses if and when things go wrong (and more
>to go wrong).

Good to know. Thanks!

>
>>         forward only; // added while troubleshooting
>>         forward first; // added while troubleshooting
>
>One of these only.... forward-first, if allowed by the firewall, always
>seemed the smarter option to me.

Removed entirely anyway, per your and others' recommendations.

>> ns2# nslookup www.looser.com
>
>Is "dig" broken on BSD ;)

No. Well, yes.

dig +trace doesn't show anything extra with the stock FreeBSD dig (DiG
8.3). Man, I could really use +trace right now! I installed bind9 on a
test system and didn't see any change. (dig comes with bind9 as far as
I know.)

>> Any ideas? Also, why so many FormErr (am I sending out bunk DNS
>> queries?). 
>
>EDNS0 is my first guess - although you can double check the tcpdump
>after reading the docs.

EDNS0 is "Extension Mechanisms for DNS"
(http://www.dns.pl/dnssec/rfc2671.txt). In the RFC I see this note:

5.3. Responders who do not understand these protocol extensions are
     expected to send a response with RCODE NOTIMPL, FORMERR, or
     SERVFAIL.  Therefore use of extensions should be "probed" such
     that a responder who isn't known to support them be allowed a
     retry with no extensions if it responds with such an RCODE.  If
     a responder's capability level is cached by a requestor, a new
     probe should be sent periodically to test for changes to
     responder capability.
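
The probing described in 5.3 can be sketched roughly as below. This is
a minimal Python illustration of the idea only, not how named actually
implements it. The `send` callback, the function name and the cache
dictionary are hypothetical names made up for the example, and the
periodic re-probe mentioned in the RFC is left out.

```python
# RCODE values from RFC 1035.
NOERROR, FORMERR, SERVFAIL, NOTIMPL = 0, 1, 2, 4
FALLBACK_RCODES = {FORMERR, SERVFAIL, NOTIMPL}


def query_with_edns_probe(send, name, server, capability_cache):
    """Try a query with EDNS0 first, then retry without it if needed.

    `send(server, name, use_edns)` is a hypothetical transport callback
    that returns the RCODE of the response.
    `capability_cache` maps server -> bool (True if EDNS0 seems to work).
    """
    use_edns = capability_cache.get(server, True)
    rcode = send(server, name, use_edns)
    if use_edns and rcode in FALLBACK_RCODES:
        # The server may not understand the extension: retry plain.
        rcode = send(server, name, False)
        if rcode not in FALLBACK_RCODES:
            # The plain retry worked, so remember to skip EDNS0 next time.
            capability_cache[server] = False
    return rcode
```

With a server that answers FORMERR to every EDNS0 query, the first
lookup costs two round trips and later lookups only one, because the
cached capability level suppresses the extension.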

But I'm using a stock bind8 with a routine enough options {} section.
Why would I be sending out unsupported query types!?
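
One way to double check the tcpdump against the EDNS0 guess is to work
out how large the queries should be. The sketch below builds a plain
DNS query in Python, optionally with an empty OPT pseudo-record in the
additional section, and compares lengths with the sizes in the traces.
The helper names and the 4096 advertised UDP size are assumptions made
for the example; only the lengths matter here. The "[1au]" in the
tcpdump lines indicates one additional record in the query, which is
presumably the OPT record.

```python
import struct


def encode_name(name):
    """Encode a domain name as length-prefixed labels plus a root byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        if label:
            out += bytes([len(label)]) + label.encode("ascii")
    return out + b"\x00"


def build_query(name, qid=0, edns=False, udp_size=4096):
    """Build an A/IN query; with edns=True append an empty OPT record."""
    arcount = 1 if edns else 0
    # Header: id, flags (RD set), QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT.
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, arcount)
    question = encode_name(name) + struct.pack("!HH", 1, 1)
    opt = b""
    if edns:
        # Root name, TYPE=41 (OPT), CLASS=UDP size, TTL=0, RDLEN=0.
        opt = b"\x00" + struct.pack("!HHIH", 41, udp_size, 0, 0)
    return header + question + opt


print(len(build_query("www.altavista.com", edns=True)))   # 46
print(len(build_query("www.altavista.com", edns=False)))  # 35
print(len(build_query("www.help.com", edns=False)))       # 30
```

The 46 and 59 byte queries in the tcpdump output match the sizes with
an 11 byte OPT record attached, while the 30 bytes that dig reports
sending for www.help.com match a query without one.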

>> I would be happy to show selected output from named -d 3.
>
>"{r}ndc querylog" is friendlier and easier to understand than "-d 3" or
>"tcpdump", even if it does eat disk space on busy servers.

That's not producing much debug output (unless I'm missing something).
Here is a query:

# dig @ns2 www.help.com

; <<>> DiG 8.3 <<>> @ns2 www.help.com
; (1 server found)
;; res options: init recurs defnam dnsrch
;; res_nsend: Operation timed out

And my log:

May  6 20:35:04 ns2 named[61979]: XX+/192.168.42.70/help.com/A/IN

And then another attempt a few seconds later:

# dig @ns2 www.help.com

; <<>> DiG 8.3 <<>> @ns2 www.help.com
; (1 server found)
;; res options: init recurs defnam dnsrch
;; got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50738
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 2, ADDITIONAL: 2
;; QUERY SECTION:
;;      www.help.com, type = A, class = IN

;; ANSWER SECTION:
www.help.com.           4m53s IN CNAME  abv-sfo1-x-redirect-vip.cnet.com.
abv-sfo1-x-redirect-vip.cnet.com.  5M IN CNAME  abv-sfo1-x-redirect-rr.cnet.com.
abv-sfo1-x-redirect-rr.cnet.com.  5M IN A  206.16.0.29
abv-sfo1-x-redirect-rr.cnet.com.  5M IN A  206.16.0.28

;; AUTHORITY SECTION:
cnet.com.               1d23h59m53s IN NS  ns.cnet.com.
cnet.com.               1d23h59m53s IN NS  ns2.cnet.com.

;; ADDITIONAL SECTION:
ns.cnet.com.            1d23h59m1s IN A  216.239.126.10
ns2.cnet.com.           1d23h59m1s IN A  206.16.0.71

;; Total query time: 5067 msec
;; FROM: server-box to SERVER: 192.168.42.78
;; WHEN: Thu May  6 20:35:42 2004
;; MSG SIZE  sent: 30  rcvd: 221

And the log:

May  6 20:35:22 ns2 named[61979]: XX+/192.168.42.70/www.help.com/A/IN
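
For picking these querylog lines apart when there are many of them, a
small Python sketch like the one below may help. It assumes the layout
seen in the two log lines above (flags/client/name/type/class after the
syslog prefix) and that a trailing "+" in the flags field means
recursion was requested, which is my understanding of the BIND 8 format
rather than something confirmed here. The function name is made up for
the example.

```python
def parse_querylog(line):
    """Split one BIND 8 querylog syslog line into its fields.

    Assumes the form '... named[pid]: XX+/client/qname/qtype/qclass'.
    """
    entry = line.split("]: ", 1)[1].strip()
    flags, client, qname, qtype, qclass = entry.split("/")
    return {
        # Assumption: '+' after 'XX' marks a recursion-desired query.
        "recursion_desired": flags.endswith("+"),
        "client": client,
        "qname": qname,
        "qtype": qtype,
        "qclass": qclass,
    }


sample = "May  6 20:35:22 ns2 named[61979]: XX+/192.168.42.70/www.help.com/A/IN"
print(parse_querylog(sample))
```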

I just don't see what's going on. Why is it timing out then working?
I'm thinking that it's taking a long time for my bind to get a
response, and my resolver times out first. The second try (after 5
seconds) must have been lucky as the domain was resolved just in time.

What I can't determine is *why* this is happening.