Recursion ceases for 5-10 minutes at random intervals throughout the day
Bill Springall
springall at fuse.net
Sat Feb 23 18:56:09 UTC 2008
> Okay (the graph helps).
>
>> The server itself has been relatively flat when it comes to memory
>> usage. It sits at about 750M. I can set up a process memory graph if
>> needed.
>
> Yes, a memory graph would also help.
> Okay, some additional questions:
> - One common reason for SERVFAIL caused internally is memory
> allocation failure. Are you sure that named does not hit any
> (possibly implicit) limitation of memory usage? For example, (at
> least some older versions of) FreeBSD has a relatively small upper
> limit of datasize. When this occurs, you should normally see log
> messages like this:
> error: could not mark server as lame: out of memory
> (and you don't have to raise the log level to see them because these
> are generally categorized as a pretty high-level error).
I checked our logging server and haven't seen any references to
"memory" on any of the machines.
I have added a graph per machine to monitor the memory usage of bind
over time. It has almost a day of soak time. I have put it up with a
few other graphs at:
http://home.fuse.net/springall/bind-022108-022208.html
> - Memory related troubles of BIND9 caching server are often due to
> overhead of cache cleaning. Can you identify whether cleaning is
> performed while you see the problem? To see this, you may want to
> apply the patch attached to this message and add (something like)
> the following to the logging statement of named.conf:
> <snip>
It could be, and I wonder if it is choking on something it is cleaning.
I'm not sure how to determine when the servers are cleaning their caches.
The server I posted uses the default cleaning-interval but has
max-cache-ttl and max-ncache-ttl set to 60 seconds. Three servers have
this configuration, one has a cleaning-interval of 30 minutes and no TTL
limits, and one has no cache limiting or cleaning settings (all defaults).
The problem seems to move from one group of primary/secondary servers
to another, with varying frequency, which is very strange. After staring
at these graphs for weeks, I suspect that a specific record or packet,
or a non-standard upstream response or query, is making it hiccup.
I do want to apply the patch you sent. I will work it into an
upcoming night maintenance on these servers to see what we can find.
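
Once the patch and the "dblog" channel are in place, the cleaning windows
could be pulled out of db.log and lined up against the SERVFAIL spikes.
The following is only a minimal sketch: it assumes the log lines look
exactly like the two sample lines in the quoted message (print-time and
print-severity enabled), and the function name cleaning_windows is made up
for illustration.

```python
import re
from datetime import datetime

# Assumed line format, copied from the sample in the quoted message:
#   20-Feb-2008 15:54:46.145 info: begin cache cleaning, mem inuse 33347457
CLEAN_LINE = re.compile(
    r"^(\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3}) \w+: "
    r"(begin|end) cache cleaning, mem inuse (\d+)$"
)


def cleaning_windows(lines):
    """Pair each 'begin cache cleaning' line with the next 'end' line."""
    windows = []
    start = None  # (timestamp, mem inuse) of the pending 'begin' line
    for line in lines:
        match = CLEAN_LINE.match(line.strip())
        if not match:
            continue  # unrelated log line
        stamp = datetime.strptime(match.group(1), "%d-%b-%Y %H:%M:%S.%f")
        mem = int(match.group(3))
        if match.group(2) == "begin":
            start = (stamp, mem)
        elif start is not None:
            windows.append({
                "begin": start[0],
                "end": stamp,
                "seconds": (stamp - start[0]).total_seconds(),
                "mem_delta": mem - start[1],
            })
            start = None
    return windows
```

Feeding it open("db.log") yields one entry per cleaning pass; any pass whose
begin/end interval overlaps a SERVFAIL spike on the graphs would be worth a
closer look.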
> - It would also be helpful if you can periodically keep track of the
> number of recursive clients by executing 'rndc status', and
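> summarize the result in a graph. Failure of recursion due to
> recursive-clients quota would cause SERVFAIL errors, although I
> doubt this is the case for you as you seem to specify a pretty high
> value for this variable.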
I have added a client connection graph for all hosts (recursive and TCP
clients) to the web page above. So far the counts are hovering between
130 and ~500 across all 6.
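
For anyone wanting to collect the same numbers, here is a rough sketch of
sampling the client counts from 'rndc status'. It assumes the output
contains lines of the form "recursive clients: 130/1000" and
"tcp clients: 4/100" (some releases print three slash-separated fields
for recursive clients); the function names are hypothetical, and this is
not necessarily how the graphs above are produced.

```python
import re
import subprocess

# Assumed output lines: "recursive clients: 130/1000" or, with a soft
# limit reported as well, "recursive clients: 130/900/1000".
CLIENT_LINE = re.compile(
    r"^(recursive|tcp) clients:\s*(\d+(?:/\d+)+)\s*$", re.MULTILINE
)


def parse_client_counts(status_text):
    """Return {'recursive': {...}, 'tcp': {...}} from 'rndc status' text."""
    counts = {}
    for kind, fields in CLIENT_LINE.findall(status_text):
        numbers = [int(n) for n in fields.split("/")]
        # First field is the current count, last field is the hard limit.
        counts[kind] = {"current": numbers[0], "limit": numbers[-1]}
    return counts


def read_status():
    """Run 'rndc status' and return its output (needs rndc access)."""
    result = subprocess.run(
        ["rndc", "status"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling parse_client_counts(read_status()) from a poller every few minutes
gives a current/limit pair per client type, which makes it easy to see how
close the recursive count gets to the recursive-clients quota during a
spike.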
> If none of the above provides any useful hint, I'd like to identify
> detailed cause of SERVFAIL by applying a simple patch (if your
> operational environment allows that).
That would be great. Let me know if the graphs provide any ideas and, if
not, I would be more than willing to apply this patch to find the
exact cause. In the meantime, I will work on getting the patch you
sent onto a running machine during an upcoming maintenance window.
Thanks for your help!
- Bill
--
Bill Springall
Systems Engineer/UNIX Administrator
Email: springall at fuse.net
JINMEI Tatuya / 神明達哉 wrote:
> At Fri, 15 Feb 2008 14:48:24 -0500,
> Bill Springall <springall at fuse.net> wrote:
>
>> Correct, the requests themselves were answered but just with, "Server
>> Failure", messages. (always seemed to respond quickly) When it has
>> happened to me, I was unable to get anything but the error message,
>> although the graphs indicate ~100qps getting success (perhaps cache?)
>>
>> (Graph: http://home.fuse.net/springall/dns-3.png - 5 min poll)
>
> Okay (the graph helps).
>
>> The server itself has been relatively flat when it comes to memory
>> usage. It sits at about 750M. I can set up a process memory graph if
>> needed.
>
> Yes, a memory graph would also help.
>
>> The CPU does jump up to 25% CPU load from 10%, during the last spike I
>> checked.
>>
>> Unfortunately, I haven't tried Bind without thread support. We have had
>> good luck with threads in testing and prod (especially with 2xdual
>> Opterons), so I haven't tried it.
>
> Okay, some additional questions:
> - One common reason for SERVFAIL caused internally is memory
> allocation failure. Are you sure that named does not hit any
> (possibly implicit) limitation of memory usage? For example, (at
> least some older versions of) FreeBSD has a relatively small upper
> limit of datasize. When this occurs, you should normally see log
> messages like this:
> error: could not mark server as lame: out of memory
> (and you don't have to raise the log level to see them because these
> are generally categorized as a pretty high-level error).
>
> - Memory related troubles of BIND9 caching server are often due to
> overhead of cache cleaning. Can you identify whether cleaning is
> performed while you see the problem? To see this, you may want to
> apply the patch attached to this message and add (something like)
> the following to the logging statement of named.conf:
>
> channel dblog {
> file "db.log" versions 5 size 10M;
> severity debug 1;
> print-severity yes;
> print-time yes;
> };
> category database { dblog; };
>
> Then you'll see something like this in the "db.log" file (under the
> appropriate directory):
>
> 20-Feb-2008 15:54:46.145 info: begin cache cleaning, mem inuse 33347457
> 20-Feb-2008 15:54:46.607 info: end cache cleaning, mem inuse 33380881
>
> The attached patch raises the required log level for the cleaning
> related log messages in order to keep the entire log output
> reasonably quiet.
>
> Frankly, however, I don't think the cleaning overhead is the main
> reason for the SERVFAIL since the overhead normally doesn't result
> in the error; it would rather cause query drop.
>
> - It would also be helpful if you can periodically keep track of the
> number of recursive clients by executing 'rndc status', and
> summarize the result in a graph. Failure of recursion due to
> recursive-clients quota would cause SERVFAIL errors, although I
> doubt this is the case for you as you seem to specify a pretty high
> value for this variable.
>
> If none of the above provides any useful hint, I'd like to identify
> detailed cause of SERVFAIL by applying a simple patch (if your
> operational environment allows that).
>
> Thanks,
>
> ---
> JINMEI, Tatuya
> Internet Systems Consortium, Inc.