Recursion ceases for 5-10 minutes at random intervals throughout the day

Wed Feb 27 23:17:32 UTC 2008

At Sat, 23 Feb 2008 13:56:09 -0500,
Bill Springall <springall at fuse.net> wrote:
>  > Yes, a memory graph would also help.
>  > Okay, some additional questions:
>  > - One common reason for SERVFAIL caused internally is memory
>  >   allocation failure.  Are you sure that named does not hit any
>  >   (possibly implicit) limitation of memory usage?  For example, (at
>  >   least some older versions of) FreeBSD has a relatively small upper
>  >   limit of datasize.  When this occurs, you should normally see log
>  >   messages like this:
>  >   error: could not mark server as lame: out of memory
>  >   (and you don't have to raise the log level to see them because these
>  >   are generally categorized as a pretty high-level error).
> 
> I checked our logging server and I haven't seen any references to
> "memory" on any of the machines.
> I have added a graph per machine to monitor the memory usage of bind
> over time.  It has almost a day of soak time.  I have put it up with a
> few other graphs at:
> 
> http://home.fuse.net/springall/bind-022108-022208.html

According to the graph, the memory footprint is pretty stable, so if
the problem happened during this period (did it?), it's probably not a
memory-related issue.

>  > - It would also be helpful if you can periodically keep track of the
>  >   number of recursive clients by executing 'rndc status', and
>  >   summarize the result in a graph.  Failure of recursion due to
> 
> I have added a client connection graph for all hosts (recur and tcp
> clients) and have added it to the web page above.  So far they are
> hovering anywhere from 130 to ~500 across all 6 hosts.

Hmm, this doesn't look abnormal either.
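
For reference, the periodic 'rndc status' tracking could be scripted
roughly as below.  This is only a minimal sketch: it assumes the status
output contains lines of the form "recursive clients: N/..." and
"tcp clients: N/...", and the exact set of slash-separated fields may
differ between BIND versions.  The function names and the sample text
are hypothetical, made up for illustration.

```python
import re
import subprocess

# Assumed line shape: "<label>: <n>/<n>[/<n>]"; verify against your BIND version.
CLIENT_LINE = re.compile(r"\s*(recursive clients|tcp clients):\s*(\d+(?:/\d+)*)\s*$")


def parse_client_counts(status_text):
    """Return {"recursive clients": [..], "tcp clients": [..]} from status text."""
    counts = {}
    for line in status_text.splitlines():
        m = CLIENT_LINE.match(line)
        if m:
            # Keep every slash-separated field; the first is the current count.
            counts[m.group(1)] = [int(n) for n in m.group(2).split("/")]
    return counts


def sample_rndc_status():
    """Run 'rndc status' once and parse it (needs rndc access on the host)."""
    out = subprocess.run(["rndc", "status"], capture_output=True,
                         text=True, check=True).stdout
    return parse_client_counts(out)


# Illustrative input only, not captured from a real server.
sample = """\
recursive clients: 130/1000
tcp clients: 2/100
server is up and running
"""
print(parse_client_counts(sample))
```

Calling sample_rndc_status() from cron every minute or so and appending
the first field of each counter to a file, along with a timestamp, would
give the data for the graph.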

>  > If none of the above provides any useful hint, I'd like to identify
>  > detailed cause of SERVFAIL by applying a simple patch (if your
>  > operational environment allows that).
> 
> That would be great.  Let me know if the graphs provide any idea and, if
> not, I would be more than willing to introduce this patch to find the
> exact cause.   In the meantime, I will work on getting the patch you
> sent into a running machine during an upcoming maintenance window.

At the moment, I still cannot think of a specific possible reason.
Please try the attached patch, which breaks the server failures down
into separate counters.  With it, named.stats will look like this:

+++ Statistics Dump +++ (1204154136)
success 189635
referral 0
nxrrset 48338
nxdomain 109146
recursion 84185
failure 0
duplicate 0
dropped 520
failure1 0
failure2 0
failure3 0
[...]
failure16 0
failure17 0
failure18 0
failure19 5698
failure20 0
[...]
failure32 0
--- Statistics Dump --- (1204154136)

A new graph containing all the failureXX counters will hopefully
identify the central cause of the trouble.
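
To feed such a graph, the dump could be parsed with something like the
following.  It is a rough sketch that relies only on the layout shown
above ("name value" lines between the "+++" and "---" markers); the
function names are hypothetical, and lines such as "[...]" that are not
a name/value pair are simply skipped.

```python
import re

DUMP_START = re.compile(r"\+\+\+ Statistics Dump \+\+\+ \((\d+)\)$")


def parse_stats_dumps(text):
    """Return a list of (timestamp, {counter: value}) for each dump in text."""
    dumps = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        start = DUMP_START.match(line)
        if start:
            current = (int(start.group(1)), {})
            continue
        if line.startswith("--- Statistics Dump ---"):
            if current is not None:
                dumps.append(current)
            current = None
            continue
        parts = line.split()
        # Only accept "name value" pairs that sit inside a dump.
        if current is not None and len(parts) == 2 and parts[1].isdigit():
            current[1][parts[0]] = int(parts[1])
    return dumps


def nonzero_failures(counters):
    """Keep only the failureNN counters that are greater than zero."""
    return {name: value for name, value in counters.items()
            if re.fullmatch(r"failure\d+", name) and value > 0}


# Abbreviated copy of the example dump above.
sample = """\
+++ Statistics Dump +++ (1204154136)
success 189635
dropped 520
failure1 0
[...]
failure19 5698
failure20 0
--- Statistics Dump --- (1204154136)
"""
timestamp, counters = parse_stats_dumps(sample)[0]
print(timestamp, nonzero_failures(counters))
```

Since named appends a new dump to the file on each 'rndc stats', plotting
nonzero_failures() per timestamp should show which failureXX counter
grows while recursion is stalled.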

Thanks,

---
JINMEI, Tatuya
Internet Systems Consortium, Inc.