Problem with BIND 9.10.1-P1 recursion limits

Evan Hunt each at isc.org
Tue Dec 9 19:41:29 UTC 2014


On Tue, Dec 09, 2014 at 05:51:58PM +0000, Evan Hunt wrote:
> That's unexpected. I'll see if I can reproduce it.

Okay, I can.

Part of the problem is the somewhat crazypants DNS configuration
of www.ibm.com:

  $ dig +noall +answer www.ibm.com
  www.ibm.com.            3600    IN      CNAME   www.ibm.com.cs186.net.
  www.ibm.com.cs186.net.  60      IN      CNAME   china-cdn.san.ibm.com.edgekey.net.
  china-cdn.san.ibm.com.edgekey.net. 21600 IN CNAME china-cdn.san.ibm.com.edgekey.net.globalredir.akadns.net.
  china-cdn.san.ibm.com.edgekey.net.globalredir.akadns.net. 900 IN CNAME e7826.x.akamaiedge.net.
  e7826.x.akamaiedge.net. 20      IN      A       23.59.201.136

... like, *wow*.  A chain of four CNAMEs with TTLs ranging from 20
seconds to 6 hours, passing through five different zones (ibm.com,
cs186.net, edgekey.net, akadns.net, akamaiedge.net), hosted by
servers in three *more* zones (ihost.com, akam.net, and akadns.org,
in addition to akadns.net and akamaiedge.net).  I had to almost
double max-recursion-queries, to 99, to get this to resolve from
an empty cache.  Yikes.

Almost any non-empty cache will dodge the bullet. Preceding the
lookup of www.ibm.com with "dig @::1 ns com" causes the query to
succeed.  Also, as previously noted, on 9.9 it will succeed without
a five-minute delay if you just issue the query a second time.

So, possible workarounds if this issue is causing problems for you:

  - Ensure that the first query sent to a newly-primed recursive
    resolver isn't quite as spectacular as this one;
  - Add "max-recursion-queries 100;" to your options statement;
  - Run 9.9.6-P1 instead of 9.10.1-P1.
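
For the second workaround, the setting goes inside the options
statement in named.conf. A minimal fragment, assuming you merge it
into your existing options statement rather than adding a second one:

```
options {
        max-recursion-queries 100;
};
```

The running server then needs to re-read its configuration, for
example with "rndc reconfig".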

The five-minute delay is still a bit of a puzzle. It happens because
of this code in adb.c:

        /* XXXMLG Don't pound on bad servers. */
        if (address_type == DNS_ADBFIND_INET) {
                name->expire_v4 = ISC_MIN(name->expire_v4, now + 300);
                name->fetch_err = FIND_ERR_FAILURE;
                inc_stats(adb, dns_resstatscounter_gluefetchv4fail);
        } else {
                name->expire_v6 = ISC_MIN(name->expire_v6, now + 300);
                name->fetch6_err = FIND_ERR_FAILURE;
                inc_stats(adb, dns_resstatscounter_gluefetchv6fail);
        }

The "now + 300" bit is where the five minutes comes from.  That's code
that's been around for years, and it is in 9.9, but apparently it's
reached more easily in 9.10.  I'm looking into the reasons for this.
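
To make the effect of that clamp concrete, here is a minimal
stand-alone sketch. It is not code from adb.c: SKETCH_MIN stands in
for ISC_MIN, time_t stands in for BIND's own time type, and
expire_after_failure() is a hypothetical helper invented for
illustration.

```c
#include <time.h>

/* Stand-in for BIND's ISC_MIN macro; same semantics, local name. */
#define SKETCH_MIN(a, b) (((a) < (b)) ? (a) : (b))

/* Failure hold-down in seconds: the literal 300 from the adb.c snippet. */
#define FAIL_HOLD_SECS 300

/*
 * Hypothetical helper (not a function in adb.c): given the current
 * expiry of a name's address data and the current time, return the
 * expiry after a failed fetch.  The result is never later than
 * now + FAIL_HOLD_SECS, but an earlier expiry is left alone.
 */
static time_t
expire_after_failure(time_t expire, time_t now) {
        return (SKETCH_MIN(expire, now + FAIL_HOLD_SECS));
}
```

With now = 1000, an expiry of 5000 is pulled in to 1300, while an
expiry of 1100 stays at 1100. So a failure recorded against a name
lingers for at most five minutes, which matches the delay seen here.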

The problem should be addressed in 9.10.2, which is likely to be
released next month.

-- 
Evan Hunt -- each at isc.org
Internet Systems Consortium, Inc.