Recursion ceases for 5-10 minutes at random intervals throughout the day
JINMEI Tatuya / 神明達哉
Jinmei_Tatuya at isc.org
Wed Feb 27 23:17:32 UTC 2008
At Sat, 23 Feb 2008 13:56:09 -0500,
Bill Springall <springall at fuse.net> wrote:
> > Yes, a memory graph would also help.
> > Okay, some additional questions:
> > - One common reason for SERVFAIL caused internally is memory
> > allocation failure. Are you sure that named does not hit any
> > (possibly implicit) limitation of memory usage? For example, (at
> > least some older versions of) FreeBSD has a relatively small upper
> > limit of datasize. When this occurs, you should normally see log
> > messages like this:
> > error: could not mark server as lame: out of memory
> > (and you don't have to raise the log level to see them because these
> > are generally categorized as a pretty high-level error).
>
> I checked our logging server and I haven't seen any references to
> "memory" on any of the machines.
> I have added a graph per machine to monitor the memory usage of bind
> over time. It has almost a day of soak time. I have put it up with a
> few other graphs at:
>
> http://home.fuse.net/springall/bind-022108-022208.html
According to the graph, the memory footprint is pretty stable, so if
the problem happened during this period (did it?), it's probably not a
memory-related issue.
> > - It would also be helpful if you can periodically keep track of the
> > number of recursive clients by executing 'rndc status', and
> > summarize the result in a graph. Failure of recursion due to
>
> I have added a client connection graph for all hosts (recur and tcp
> clients) and have added it to the web page above. So far they are
> hovering anywhere from 130 to ~500 across all 6.
Hmm, this doesn't look abnormal either.
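
For anyone who wants to collect the same numbers, here is a minimal
sketch of pulling the client counts out of 'rndc status' output. It
assumes the output contains lines of the form "recursive clients: N/..."
and "tcp clients: N/...", and it reads only the first number on each
line; the exact format can differ between BIND versions. The function
names and the sample text are hypothetical, not part of BIND.

```python
import re
import subprocess

# Lines we expect in 'rndc status' output (assumed format, see above).
CLIENT_LINE = re.compile(r"\s*(recursive clients|tcp clients):\s*(\d+)")


def parse_client_counts(status_text):
    """Return {'recursive clients': N, 'tcp clients': N} for lines found."""
    counts = {}
    for line in status_text.splitlines():
        match = CLIENT_LINE.match(line)
        if match:
            # Only the first number (the current count) is kept.
            counts[match.group(1)] = int(match.group(2))
    return counts


def read_rndc_status():
    """Run 'rndc status' and return its standard output as text."""
    result = subprocess.run(
        ["rndc", "status"], capture_output=True, text=True, check=True
    )
    return result.stdout


# Illustrative text only, not captured from a real server.
SAMPLE_STATUS = """\
recursive clients: 130/1000
tcp clients: 2/100
server is up and running
"""
```

Calling parse_client_counts(read_rndc_status()) from cron every minute
or so and appending the result to a file with a timestamp would give
the data points for such a graph.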
> > If none of the above provides any useful hint, I'd like to identify
> > the detailed cause of SERVFAIL by applying a simple patch (if your
> > operational environment allows that).
>
> That would be great. Let me know if the graphs provide any idea and, if
> not, I would be more than willing to introduce this patch to find the
> exact cause. In the meantime, I will work on getting the patch you
> sent onto a running machine during an upcoming maintenance window.
At the moment, I still cannot think of a specific possible reason.
Please try the attached patch, which will detail the server failure
cases. named.stats will look like this:
+++ Statistics Dump +++ (1204154136)
success 189635
referral 0
nxrrset 48338
nxdomain 109146
recursion 84185
failure 0
duplicate 0
dropped 520
failure1 0
failure2 0
failure3 0
[...]
failure16 0
failure17 0
failure18 0
failure19 5698
failure20 0
[...]
failure32 0
--- Statistics Dump --- (1204154136)
A new graph containing all of the failureXX counters will hopefully
identify the central cause of the trouble.
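
To feed such a graph, the dump can be reduced to just the nonzero
failureXX counters. The following is a rough sketch, assuming the
"name value" line layout shown above; the function names are
hypothetical, and the sample is a shortened copy of the dump in this
message.

```python
import re

COUNTER_LINE = re.compile(r"(\w+)\s+(\d+)")
FAILURE_NAME = re.compile(r"failure\d+")


def parse_stats_dump(text):
    """Return {counter: value} for the last dump found in named.stats text."""
    counters = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("+++ Statistics Dump +++"):
            counters = {}  # a new dump begins; discard the previous one
            continue
        match = COUNTER_LINE.fullmatch(line)
        if match:  # skips the "---" trailer and any other non-counter line
            counters[match.group(1)] = int(match.group(2))
    return counters


def nonzero_failures(counters):
    """Keep only the numbered failureNN counters with a nonzero count."""
    return {
        name: value
        for name, value in counters.items()
        if FAILURE_NAME.fullmatch(name) and value > 0
    }


SAMPLE_DUMP = """\
+++ Statistics Dump +++ (1204154136)
success 189635
recursion 84185
failure 0
dropped 520
failure18 0
failure19 5698
failure20 0
--- Statistics Dump --- (1204154136)
"""
```

Note that the plain "failure" counter is deliberately excluded, so only
the numbered cases added by the patch are reported.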
Thanks,
---
JINMEI, Tatuya
Internet Systems Consortium, Inc.