Resolver timeouts, EDNS and networking

Sat Sep 29 09:04:00 UTC 2007

Here are a few more things to check:
1. Are you running BIND 9.2.x on the forwarders? If it is BIND 9.2,
   then the forwarders are not handled intelligently: if the first
   forwarder is unresponsive (or slow for some reason), all subsequent
   queries still go blindly to the first forwarder before the next
   ones are attempted.
2. Your resolver configuration: are there too many domains in the
   search list?
3. Also try querying from the command line with nslookup -d2 -swtrace.
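
To illustrate point 2, here is a rough sketch of how a stub resolver
typically expands a name against its search list. It assumes the usual
resolv.conf "search" and "ndots" semantics; the function name and the
example domains are made up for illustration and are not taken from your
setup.

```python
def candidate_names(name, search, ndots=1):
    """Return the fully qualified names a stub resolver would try, in order.

    Assumes conventional resolv.conf semantics: a name with a trailing dot
    is absolute; a name with at least `ndots` dots is tried as-is first;
    otherwise the search domains are tried first and the bare name last.
    """
    if name.endswith("."):
        return [name]
    searched = [f"{name}.{domain.rstrip('.')}." for domain in search]
    as_is = name + "."
    if name.count(".") >= ndots:
        return [as_is] + searched
    return searched + [as_is]


if __name__ == "__main__":
    # Hypothetical search list with three domains.
    search_list = ["corp.example", "lab.example", "example"]
    for fqdn in candidate_names("api.del.icio.us", search_list):
        print(fqdn)
```

With N search domains, a lookup that ends in a timeout or NXDOMAIN can
cost up to N + 1 queries, so a long search list multiplies the load on a
server that is already slow.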

My 2 cents
Blr

On Sep 28, 9:41 am, Christian Robottom Reis <k... at async.com.br> wrote:
> On Thu, Sep 27, 2007 at 07:27:10PM -0400, Kevin Darcy wrote:
> > > Has anyone seen this before? Is the EDNS0 issue a red herring, or is
> > > what I'm seeing indicative of EDNS being broken at a few sites,
> > > including my forwarders? I can issue manual EDNS queries (using dig
> > > +bufsize=500) just fine, so I would think not.
>
> > Hmmm... bufsize of 500 is rather silly, since that's _below_ the default
> > buffer size (512). I'd set it to something higher. In fact, I'd probably
> > do a packet trace of the forwarded queries and then try to replicate
> > them *exactly* with "dig", including EDNS0 buffer size, source address,
> > even source port. In the unlikely event that you're TSIG-signing your
> > queries, I'd mimic that behavior as well. Assuming that you're still
> > getting timeouts on precisely-mimic'ed queries, then I'd start changing
> > things to see what makes it work better. A DNS query packet has only a
> > finite number of attributes -- it should be possible to home in on the
> > attribute or combination of attributes that is giving rise to the problem.
>
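
If scripting this is easier than repeating dig by hand, a query with a
chosen EDNS0 buffer size can be built with nothing but the Python
standard library. This is only a sketch: the function names, the query
ID and the default buffer size are my own choices, and it does not
attempt TSIG or a fixed source port.

```python
import socket
import struct


def build_edns_query(name, qtype=1, bufsize=4096, query_id=0x1234):
    """Build a DNS query with one question and an EDNS0 OPT record."""
    # Header: ID, flags (RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0,
    # ARCOUNT=1 (the OPT pseudo-record).
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 1)
    # QNAME: length-prefixed labels terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # QTYPE, QCLASS=IN
    # OPT RR: root name, TYPE=41, CLASS carries the UDP payload size,
    # TTL=0 (extended RCODE, version 0, no flags), RDLENGTH=0.
    opt = b"\x00" + struct.pack("!HHIH", 41, bufsize, 0, 0)
    return header + question + opt


def send_query(packet, server, timeout=5.0):
    """Send the packet over UDP to port 53 and return the raw reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(packet, (server, 53))
        reply, _ = sock.recvfrom(65535)
        return reply
```

Varying bufsize (or dropping the OPT record altogether) while keeping
everything else fixed is one way to test whether EDNS0 is the attribute
that triggers the timeouts.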
> So after a day and night of this, ISTM that the resolvers appear to be
> red herrings. I disabled the resolvers last night, but since that was
> outside office peak hours the timeouts lessened anyway; today, as soon
> as the office got busy, I am seeing timeouts peak at 87 in a single
> minute (just counting the "too many timeouts" string in the debug log).
>
> A few are for servers I would not expect to time out:
>
>       6 0x8288158(api.del.icio.us/A'):
>       4 0xb489e018(ns-2.nipcable.com.br/A'):
>       3 0xb452f9d0(f3.yahoofs.com/A'):
>       3 0x8296f68(unisys.com/A'):
>       2 0xb489b080(zd.akadns.org/A'):
>       2 0xb455c250(row.bc.yahoo.com/A'):
>
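
The per-minute and per-name counts can be produced in one pass over the
debug log. The sketch below assumes each line starts with a BIND-style
timestamp such as "29-Sep-2007 09:04:00.123" and contains the
"(name/TYPE'): too many timeouts" text shown above; adjust the pattern
if your logging channel prints something different. The function name is
only illustrative.

```python
import re
from collections import Counter

# Assumed layout: "<dd-Mon-yyyy hh:mm:ss...> ... (name/TYPE'): too many timeouts"
LINE_RE = re.compile(
    r"^(?P<minute>\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}):\d{2}"
    r".*?\((?P<query>[^()]+?)'?\): too many timeouts"
)


def count_timeouts(lines):
    """Count 'too many timeouts' events per minute and per query name/type."""
    per_minute = Counter()
    per_query = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            per_minute[match.group("minute")] += 1
            per_query[match.group("query")] += 1
    return per_minute, per_query


if __name__ == "__main__":
    import sys

    # Usage: python count_timeouts.py < named-debug.log
    minutes, queries = count_timeouts(sys.stdin)
    for minute, count in sorted(minutes.items()):
        print(f"{count:6d} {minute}")
    for query, count in queries.most_common(10):
        print(f"{count:6d} {query}")
```

Comparing the busiest minutes against latency and packet-loss figures
for the same period should show whether the timeouts follow network
conditions or server load.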
> Am I right in assuming that when the server logs a "too many timeouts"
> it's likely that the client resolver library will have given up and
> reported an error upstream?
>
> The fact that the problems are really intermittent and that I am unable
> to reproduce any EDNS-related failures (just following the hint I picked
> up at http://lists.debian.org/debian-user/2005/10/msg03334.html)
> suggests to me that either the network latency rises too high (it's
> around 40ms to my upstream hop, and I can see some packet loss, though
> not more than 5%) or the server is overloaded doing reverse-DNS
> queries for apache and DNSBL-related queries for sendmail.
>
> > Note that the 10-minute TTL on bugs.launchpad.net is going to incur a
> > fairly high fetch rate, and if there is some sort of connectivity
> > problem between your ISP's nameservers and the launchpad.net
> > nameservers, you could very well get timeouts. Is it possible that all
>
> Yeah, I raised that with the sysadmins and have requested they increase
> that, but I am still left with my general problem.
>
> > Another, somewhat non-scalable, high-maintenance "middle ground" option
> > would be to keep your forwarding configuration, but define
> > "launchpad.net" as a "type stub" zone. The high-maintenance part comes
>
> The problem is that I'm not really restricted to launchpad.net -- we get
> timeouts for assorted queries. I just picked launchpad.net because I
> care about it, and because it was easy to find in the logs!
> --
> Christian Robottom Reis | http://async.com.br/~kiko/ | [+55 16] 3376 0125