Recursion ceases for 5-10 minutes at random intervals throughout the day

Wed Feb 20 23:57:58 UTC 2008

At Fri, 15 Feb 2008 14:48:24 -0500,
Bill Springall <springall at fuse.net> wrote:
> Correct, the requests themselves were answered but just with, "Server 
> Failure", messages.   (always seemed to respond quickly)  When it has 
> happened to me, I was unable to get anything but the error message, 
> although the graphs indicate ~100qps getting success (perhaps cache?)
> 
> (Graph: http://home.fuse.net/springall/dns-3.png - 5 min poll)

Okay (the graph helps).

> The server itself has been relatively flat when it comes to memory 
> usage.  It sits at about 750M.   I can set up a process memory graph if 
> needed.

Yes, a memory graph would also help.

> The CPU does jump up to 25% CPU load from 10%, during the last spike I 
> checked.
> 
> Unfortunately, I haven't tried Bind without thread support.  We have had 
> good luck with threads in testing and prod (especially with 2xdual 
> Opterons), so I haven't tried it.

Okay, some additional questions:
- One common internal cause of SERVFAIL is a memory allocation
  failure.  Are you sure that named does not hit any (possibly
  implicit) limit on memory usage?  For example, FreeBSD (at least
  some older versions) has a relatively small upper limit on
  datasize.  When this occurs, you should normally see log messages
  like this:
  error: could not mark server as lame: out of memory
  (and you don't have to raise the log level to see them, because
  they are generally logged at a fairly high severity).

- Memory-related trouble with a BIND 9 caching server is often due to
  the overhead of cache cleaning.  Can you tell whether cleaning is
  in progress when you see the problem?  To check this, you may want
  to apply the patch attached to this message and add (something
  like) the following to the logging statement of named.conf:

        channel dblog {
                file "db.log" versions 5 size 10M;
                severity debug 1;
                print-severity yes;
                print-time yes;
        };
        category database { dblog; };

  Then you'll see something like this in the "db.log" file (under the
  appropriate directory):

  20-Feb-2008 15:54:46.145 info: begin cache cleaning, mem inuse 33347457
  20-Feb-2008 15:54:46.607 info: end cache cleaning, mem inuse 33380881

  The attached patch raises the required log level for the
  cleaning-related log messages in order to keep the overall log
  output reasonably quiet.

  Frankly, however, I don't think the cleaning overhead is the main
  reason for the SERVFAIL, since that overhead normally doesn't
  result in this error; it would rather cause queries to be dropped.

- It would also be helpful if you could periodically record the
  number of recursive clients by executing 'rndc status', and
  summarize the result in a graph.  Failure of recursion due to the
  recursive-clients quota would cause SERVFAIL errors, although I
  doubt this is the case for you, as you seem to specify a pretty
  high value for this option.
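
For the 'rndc status' tracking, a minimal sketch of a poller (in
Python) might look like the following.  It assumes the status output
contains a line of the form "recursive clients: N/MAX" or
"recursive clients: N/SOFT/MAX"; the exact wording can differ between
BIND versions, so please check it against your own output.  The
function names and the 60-second interval are only illustrative.

```python
import re
import subprocess
import time

# Assumed line format: "recursive clients: 12/1000" or
# "recursive clients: 12/900/1000".  Only the first number (the
# current count) is captured.
_CLIENTS_RE = re.compile(r"recursive clients:\s*(\d+)(?:/\d+)*")


def parse_recursive_clients(status_text):
    """Return the current number of recursive clients, or None if absent."""
    m = _CLIENTS_RE.search(status_text)
    return int(m.group(1)) if m else None


def poll(interval=60, rndc="rndc"):
    """Print 'epoch_seconds count' once per interval, forever."""
    while True:
        out = subprocess.run(
            [rndc, "status"], capture_output=True, text=True
        ).stdout
        print(int(time.time()), parse_recursive_clients(out), flush=True)
        time.sleep(interval)
```

Calling poll() with its output redirected to a file gives one sample
per line, which most graphing tools can plot directly.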

If none of the above provides any useful hint, I'd like to identify
the detailed cause of the SERVFAIL by applying a simple patch (if
your operational environment allows that).
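
Regarding the cache cleaning check: to compare the cleaning periods
with the SERVFAIL spikes, the begin/end pairs in "db.log" can be
turned into durations.  The following is only a rough sketch in
Python, based on the two sample lines shown above; the function name
is made up, and I'm assuming "mem inuse" is a byte count.

```python
import re
from datetime import datetime

# Based on the sample lines; the severity label before the message
# (e.g. "info:") is skipped, so other labels are tolerated too.
_LINE_RE = re.compile(
    r"^\s*(\S+ \S+) .*?(begin|end) cache cleaning, mem inuse (\d+)"
)


def cleaning_runs(lines):
    """Yield (start_time, seconds, mem_delta) for each begin/end pair."""
    begin = None
    for line in lines:
        m = _LINE_RE.match(line)
        if not m:
            continue
        # %b expects English month abbreviations (C or English locale).
        when = datetime.strptime(m.group(1), "%d-%b-%Y %H:%M:%S.%f")
        mem = int(m.group(3))
        if m.group(2) == "begin":
            begin = (when, mem)
        elif begin is not None:
            start, start_mem = begin
            yield start, (when - start).total_seconds(), mem - start_mem
            begin = None
```

Feeding it the open log file, as in
"for run in cleaning_runs(open('db.log')): print(run)", lists each
cleaning run with its start time, duration in seconds, and change in
memory in use.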

Thanks,

---
JINMEI, Tatuya
Internet Systems Consortium, Inc.