Resolver timeouts, EDNS and networking

Thu Sep 27 21:55:42 UTC 2007

Hello there,

    For quite a while now, a number of users on our internal network
have been seeing DNS queries time out, and I've spent the past week
investigating to try and find out why. The symptom is simply that, at
times:

    % host bugs.launchpad.net
    Nameserver not responding
    bugs.launchpad.net A record not found, try again

I enabled debugging and looked at my name server logs. What I could see
was (this is simplified; I only left in messages at important timestamp
boundaries, and the full log is at http://async.com.br/~kiko/bugs-dns1.log):

    27-Sep-2007 16:22:41.870
        database: debug 1: no_references: delete from rbt:
        0xb4930e 40 bugs.launchpad.net

So the entry got evicted from the cache. Then I see a query coming in
about five minutes later:

    27-Sep-2007 16:27:36.078
        queries: info: client 192.168.99.4#43950:
        query: bugs.launchpad.net IN A +
    27-Sep-2007 16:27:36.078
        resolver: debug 1: createfetch: bugs.launchpad.net A
    27-Sep-2007 16:27:36.078
        resolver: debug 3: fctx 0xb49971e0(bugs.launchpad.net/A'): create

Then, 5 seconds later, I see this:

    27-Sep-2007 16:27:41.090 
        resolver: debug 3: fctx 0xb49971e0(bugs.launchpad.net/A'): timeout

After about 3 iterations, we finally get this:

    27-Sep-2007 16:27:56.099
         resolver: debug 3: fctx 0xb49971e0(bugs.launchpad.net/A'):
         too many timeouts, disabling EDNS0

I think at this point the client got an error.
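
To double-check the timing in the excerpts above, a minimal Python sketch
like the following could compute how long after the fetch was created each
resolver event happened. It assumes the real log keeps the timestamp and the
message on a single line (the excerpts above are wrapped for readability) and
that month names are in English; SAMPLE, parse and elapsed_since_first are
names made up for this illustration.

```python
from datetime import datetime

# Lines from the excerpts above, rejoined onto single lines (assumption:
# this is how they appear in the unwrapped log file).
SAMPLE = """\
27-Sep-2007 16:27:36.078 resolver: debug 3: fctx 0xb49971e0(bugs.launchpad.net/A'): create
27-Sep-2007 16:27:41.090 resolver: debug 3: fctx 0xb49971e0(bugs.launchpad.net/A'): timeout
27-Sep-2007 16:27:56.099 resolver: debug 3: fctx 0xb49971e0(bugs.launchpad.net/A'): too many timeouts, disabling EDNS0
"""


def parse(line):
    """Split one log line into (timestamp, message)."""
    date, clock, message = line.split(" ", 2)
    stamp = datetime.strptime(date + " " + clock, "%d-%b-%Y %H:%M:%S.%f")
    return stamp, message


def elapsed_since_first(text):
    """Return (seconds since the first line, trailing event text) per line."""
    events = [parse(line) for line in text.splitlines() if line.strip()]
    start = events[0][0]
    return [
        ((stamp - start).total_seconds(), message.rsplit(": ", 1)[-1])
        for stamp, message in events
    ]


if __name__ == "__main__":
    for seconds, event in elapsed_since_first(SAMPLE):
        print(f"+{seconds:7.3f}s  {event}")
```

On the three lines above this puts the first timeout about 5 seconds after
the fetch was created and the "disabling EDNS0" message about 20 seconds
after it.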

The funny thing is that bugs.launchpad.net /does/ resolve fine using
other name servers, including my forwarders directly (though there is of
course a timing difference). And if I insist, it eventually resolves
fine on my server too:

    http://async.com.br/~kiko/bugs-dns3.log

Of course, once it's cached, it's fine and everyone enjoys the entry:

    http://async.com.br/~kiko/bugs-dns3.log

I was originally using 3 ISP-provided forwarders, since they are
supposed to improve performance; however, I decided to try disabling
them to see if that improved things. And if I profile the timeout
counts, disabling them does indeed reduce the timeouts, though I still
get server failures on a few sites, which suggests this isn't the entire
problem -- but maybe those sites are just broken or taking too long to
answer.
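
A profile of timeout counts like the one mentioned above could be sketched
roughly as follows. This is only an illustration: it assumes the resolver's
debug 3 lines end in "fctx 0x...(name/type'): timeout" exactly as in the
excerpts earlier in this message, and FCTX_TIMEOUT and count_timeouts are
hypothetical names, not part of BIND.

```python
import re
import sys
from collections import Counter

# Matches resolver lines such as:
#   ... fctx 0xb49971e0(bugs.launchpad.net/A'): timeout
# It deliberately does not match the "too many timeouts" line.
FCTX_TIMEOUT = re.compile(
    r"fctx 0x[0-9a-f]+\((?P<name>[^/]+)/(?P<rtype>\w+)'\): timeout$"
)


def count_timeouts(lines):
    """Count resolver timeouts per (query name, record type)."""
    counts = Counter()
    for line in lines:
        match = FCTX_TIMEOUT.search(line.strip())
        if match:
            counts[(match.group("name"), match.group("rtype"))] += 1
    return counts


if __name__ == "__main__":
    # Usage: python count_timeouts.py < named-debug.log
    for (name, rtype), hits in count_timeouts(sys.stdin).most_common():
        print(f"{hits:5d}  {name} {rtype}")
```

Running it once over a log taken with forwarders enabled and once with them
disabled would give two lists that can be compared name by name.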

Has anyone seen this before? Is the EDNS0 issue a red herring, or is
what I'm seeing indicative of EDNS being broken at a few sites,
including my forwarders? I can issue manual EDNS queries (using dig
+bufsize=500) just fine, so I would think not.
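
For reference, what a manual EDNS query like the dig one above puts on the
wire can be sketched in a few lines of Python using only the standard
library. This is a rough illustration, not a replacement for dig: it builds
an A query with an OPT pseudo-record advertising a 500-byte UDP buffer, and
the function names (build_edns_query, send_query, is_truncated) and the fixed
query ID are assumptions made up for the example.

```python
import socket
import struct


def build_edns_query(name, bufsize=500, query_id=0x1234):
    """Build a DNS A query for `name` carrying an EDNS0 OPT record."""
    # Header: ID, flags (RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=1
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 1)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(p)]) + p.encode("ascii") for p in labels) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    # OPT pseudo-RR: root name, TYPE=41, CLASS carries the UDP payload size,
    # TTL field (extended RCODE, version 0, flags) = 0, RDLENGTH = 0
    opt = b"\x00" + struct.pack("!HHIH", 41, bufsize, 0, 0)
    return header + question + opt


def is_truncated(reply):
    """True if the TC bit is set in a DNS reply header."""
    flags = struct.unpack("!H", reply[2:4])[0]
    return bool(flags & 0x0200)


def send_query(server, name, bufsize=500, timeout=5.0):
    """Send the query over UDP; raises socket.timeout if nothing comes back."""
    packet = build_edns_query(name, bufsize)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(packet, (server, 53))
        reply, _ = sock.recvfrom(4096)
    return reply
```

Calling send_query() against each forwarder in turn, with and without a
larger bufsize, would show whether a given server answers, times out, or
sets the truncation bit when EDNS is in use.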

I'm behind a Linux firewall, but I have complete control over it, and I
doubt there is any UDP issue there (is there some way I can test that?).

If this just means that the network is acting flaky (high-latency or
otherwise) and it's going to happen, why does it improve when I disable
forwarders, and is there a way to mitigate the problem, perhaps using
caching, a different protocol, traffic shaping or an act of god?

Thanks!
-- 
Christian Robottom Reis | http://async.com.br/~kiko/ | [+55 16] 3376 0125