Resolver timeouts, EDNS and networking

Thu Sep 27 23:27:10 UTC 2007

Christian Robottom Reis wrote:
> Hello there,
>
>     For quite a while now, I've been having a number of users on our
> internal network having queries time out, and I've been investigating
> over the past week to try and find out why. The symptom is simply that
> at times:
>
>     % host bugs.launchpad.net
>     Nameserver not responding
>     bugs.launchpad.net A record not found, try again
>
> I enabled debugging and looked at my name server logs. What I could see
> was (this is simplified; I only left in messages at important timestamp
> boundaries, and the full log is at http://async.com.br/~kiko/bugs-dns1.log):
>
>     27-Sep-2007 16:22:41.870
>         database: debug 1: no_references: delete from rbt:
>         0xb4930e 40 bugs.launchpad.net
>
> So the entry got evicted from the cache. Then I see a query coming in
> about five minutes later:
>
>     27-Sep-2007 16:27:36.078
>         queries: info: client 192.168.99.4#43950:
>         query: bugs.launchpad.net IN A +
>     27-Sep-2007 16:27:36.078
>         resolver: debug 1: createfetch: bugs.launchpad.net A
>     27-Sep-2007 16:27:36.078
>         resolver: debug 3: fctx 0xb49971e0(bugs.launchpad.net/A'): create
>
> Then, 5 seconds later, I see this:
>
>     27-Sep-2007 16:27:41.090 
>         resolver: debug 3: fctx 0xb49971e0(bugs.launchpad.net/A'): timeout
>
> After about 3 iterations, we finally get this:
>
>     27-Sep-2007 16:27:56.099
>          resolver: debug 3: fctx 0xb49971e0(bugs.launchpad.net/A'):
>          too many timeouts, disabling EDNS0
>
> I think at this point the client got an error.
>
> The funny thing is that bugs.launchpad.net /does/ resolve fine using
> other name servers, including my forwarders directly (though there is of
> course a timing difference). And if I insist, it eventually resolves
> fine on my server too:
>
>     http://async.com.br/~kiko/bugs-dns3.log
>
> Of course, once it's cached, it's fine and everyone enjoys the entry:
>
>     http://async.com.br/~kiko/bugs-dns3.log
>
> I was originally using 3 ISP-provided forwarders since they are intended
> to improve performance; however I decided to try disabling them to see
> if it improved things. And, if I do a profile of the timeout counts,
> indeed it does reduce the timeouts, though I do still get a few server
> failures on a few sites, which suggests this isn't the entire problem --
> but maybe the sites are just broken or taking too long to answer.
>
> Has anyone seen this before? Is the EDNS0 issue a red herring, or is
> what I'm seeing indicative of EDNS being broken at a few sites,
> including my forwarders? I can issue manual EDNS queries (using dig
> +bufsize=500) just fine, so I would think not.
>   
Hmmm... a bufsize of 500 is rather silly, since that's _below_ the
512-byte limit that plain (non-EDNS0) DNS over UDP already allows. I'd
set it to something higher. In fact, I'd probably do a packet trace of
the forwarded queries and then try to replicate them *exactly* with
"dig", including EDNS0 buffer size, source address, even source port.
In the unlikely event that you're TSIG-signing your queries, I'd mimic
that behavior as well. Assuming that you're still getting timeouts on
precisely mimicked queries, I'd then start changing things one at a
time to see what makes it work better. A DNS query packet has only a
finite number of attributes -- it should be possible to home in on the
attribute or combination of attributes that is giving rise to the
problem.
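
If "dig" can't reproduce some attribute of the forwarded query, a
hand-built packet is another way to vary one thing at a time. What
follows is only a rough sketch using Python's standard library, not
what named actually sends: the function names, the 4096 default and the
5-second timeout are my own assumptions, and it ignores TSIG entirely.

```python
import random
import socket
import struct

OPT_RRTYPE = 41  # EDNS0 OPT pseudo-record type


def encode_name(name):
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        if label:
            raw = label.encode("ascii")
            out += struct.pack("!B", len(raw)) + raw
    return out + b"\x00"


def build_query(name, qtype=1, bufsize=4096, recurse=True, qid=None):
    """Build a DNS query; bufsize=None leaves out the OPT record (no EDNS0)."""
    if qid is None:
        qid = random.randint(0, 0xFFFF)
    flags = 0x0100 if recurse else 0x0000  # only the RD bit is ever set
    arcount = 0 if bufsize is None else 1
    header = struct.pack("!HHHHHH", qid, flags, 1, 0, 0, arcount)
    question = encode_name(name) + struct.pack("!HH", qtype, 1)  # class IN
    opt = b""
    if bufsize is not None:
        # OPT RR: root name, type 41, CLASS carries the UDP payload size,
        # TTL (extended rcode/flags) = 0, RDLENGTH = 0
        opt = b"\x00" + struct.pack("!HHIH", OPT_RRTYPE, bufsize, 0, 0)
    return header + question + opt


def send_query(packet, server, source_addr="", source_port=0, timeout=5.0):
    """Send one UDP query from the given source; return the reply or None."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((source_addr, source_port))  # port 0 = let the OS choose
        sock.settimeout(timeout)
        sock.sendto(packet, (server, 53))
        try:
            reply, _ = sock.recvfrom(65535)
        except socket.timeout:
            return None
        return reply
```

Comparing send_query(build_query(name, bufsize=4096), server) against
the same call with bufsize=None, towards the same forwarder, would show
whether the OPT record alone is enough to trigger the timeout. Binding
to a fixed source port fails if that port is in use or privileged.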

Note that the 10-minute TTL on bugs.launchpad.net is going to incur a 
fairly high fetch rate, and if there is some sort of connectivity 
problem between your ISP's nameservers and the launchpad.net 
nameservers, you could very well get timeouts. Is it possible that all 
the times you looked up via dig, the answer was, coincidentally, cached 
by your ISP's nameservers? You can try doing a non-recursive query
(e.g. "dig +norecurse") before the recursive one, to see whether the
answer is already cached or not.
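
The cache check can also be scripted if you want to run it before every
test lookup. This is a minimal sketch under my own assumptions (the
helper names are made up, it only asks for A records, and it treats "no
error plus at least one answer" as "cached"); some servers refuse or
ignore non-recursive queries, so take the result as a hint only.

```python
import socket
import struct


def norecurse_query(name, qid=0):
    """Build an A/IN query with the RD (recursion desired) bit cleared."""
    labels = b"".join(
        struct.pack("!B", len(part)) + part.encode("ascii")
        for part in name.rstrip(".").split(".") if part
    )
    header = struct.pack("!HHHHHH", qid, 0x0000, 1, 0, 0, 0)
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def summarize(reply):
    """Return (rcode, answer_count) from the 12-byte DNS header of a reply."""
    _qid, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", reply[:12])
    return flags & 0x000F, ancount


def is_cached(name, server, timeout=5.0):
    """True if server answers a non-recursive query with at least one record.

    Raises socket.timeout if the server does not reply in time.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(norecurse_query(name), (server, 53))
        reply, _ = sock.recvfrom(512)
    rcode, ancount = summarize(reply)
    return rcode == 0 and ancount > 0
```

If is_cached() is true for a forwarder just before a recursive lookup
succeeds, that lookup tells you nothing about how the forwarder behaves
when it has to go out and fetch the record itself.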

> I'm behind a Linux firewall, but I have complete control over it, and I
> doubt there is any UDP issue there (can I test somehow?)
>
> If this just means that the network is acting flaky (high-latency or
> otherwise) and it's going to happen, why does it improve when I disable
> forwarders, and is there a way to mitigate the problem, perhaps using
> caching, a different protocol, traffic shaping or an act of god?
>   
If using forwarders is causing you problems, then the glib response is 
to stop using forwarders. Problem fixed, right?

I'd be curious what's causing the problem, though, so I'd want to get to 
the bottom of it. If there's something funky (firewall, intrusion 
detection, transparent proxy, traffic shaper or whatever) going on 
between you and your ISP that's affecting DNS in this way, conceivably 
it might affect other protocols as well.

Another, somewhat non-scalable, high-maintenance "middle ground" option
would be to keep your forwarding configuration, but define
"launchpad.net" as a "type stub" zone. The high-maintenance part comes
in because you'd need to update the "masters" clause of that zone
definition whenever they migrate to a new server or set of servers. Not
a responsibility I'd want to take on...

- Kevin