CPU/Load issues (FreeBSD, BIND 9.4.1)

Meron Brandeis mbrandeis at 013barak.net.il
Wed Jul 4 15:33:09 UTC 2007


Hello there,

I've run into the situation you described over a dozen times.  Before
I go any further, please note that I'm not a professional bind user, nor
am I qualified to provide you with any explanation. I'm writing this
based solely on my personal experience and research.

The situation is repetitive. Bind loads, bind responds, bind comes under
a heavy load of recursive queries.

After a few hours (up to a few days) bind reaches 99% cpu usage,
consumes a certain amount of memory (NOTE: I have _not_ seen bind
consume all available memory and start to eat up swap) and starts
crawling, until, eventually, it stops responding.

I've seen this happen on the following platforms: Linux (Red Hat) and
Solaris (9).

It will not terminate upon receiving any signal other than SIGKILL (kill
-9). 

After being started again, things are back to usual. 

I've been looking up information about this problem for quite a while
now, in several places, including this list.

I've seen lots and lots of complaints about this behavior when searching
Google, and even on this very list about a year ago.

Never have I come across any response that actually worked.  Some people
have been able to gain some more time (i.e. making this problem reoccur
after a longer period of time) by limiting bind to a _very_ small cache
size. I was told in such cases bind will still hit the 99% cpu usage but
will not stop responding to queries for some time. 

Other guesses I've heard were about limiting the number of concurrent
recursive clients, buying more memory, playing with the cache size, and
also the FreeBSD issue of memlimit (512) for each process by default. To
the best of my knowledge, none of them really solved the problem. (This
is my opinion only: I had gigabytes of available memory that bind didn't
use, I tried limiting the concurrent recursive clients, and I played
with the cache size.)
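
For the per-process memory limit mentioned above, a quick way to see
what limit a process would inherit is to query it from the shell
environment that starts the daemon. The following is a minimal sketch
using Python's standard resource module. It reports the limit of the
Python process itself, not of a running named, and the helper name
describe_limit is made up for this example.

```python
import resource

def describe_limit(value):
    """Render an rlimit value in MiB, or 'unlimited' (hypothetical helper)."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // (1024 * 1024)} MiB"

# RLIMIT_DATA is the maximum size of the process's data segment (heap).
soft, hard = resource.getrlimit(resource.RLIMIT_DATA)
print(f"data segment limit: soft={describe_limit(soft)}, "
      f"hard={describe_limit(hard)}")
```

If the soft limit printed here is far below the installed RAM, the
daemon may be unable to use the memory that appears to be free.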

This has _only_ happened to me with Bind 9.x.

I've traced the system calls made by the daemon (strace on Linux, truss
on Solaris), and tried to analyze what I saw there.
Interestingly enough, when the daemon hangs and hits the 99% I noticed a
_huge_ number of time() calls. Just millions of these calls and not much
else. I did _not_ spot that behavior on the same daemon when it was
still operating properly.
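
A trace like this is easier to compare between a healthy and a hung
daemon once the calls are tallied by name. The sketch below is one way
to do that in Python. It assumes plain strace-style lines of the form
"name(args) = result", optionally prefixed with "[pid N]", and no
timestamps; the sample lines are invented for illustration and are not
taken from a real trace.

```python
import re
from collections import Counter

# Syscall name at the start of a line, after an optional "[pid N]" prefix.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?([a-z_][a-z0-9_]*)\(")

def count_syscalls(lines):
    """Count how often each system call name appears in trace output."""
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:  # signal notices and other non-call lines are skipped
            counts[match.group(1)] += 1
    return counts

# Invented sample input, shaped like strace output.
sample = [
    "time(NULL) = 1183563189",
    "time(NULL) = 1183563189",
    "[pid  4242] time(NULL) = 1183563189",
    "recvmsg(20, {msg_name(16)=...}, 0) = 45",
    "--- SIGHUP (Hangup) @ 0 (0) ---",
]
counts = count_syscalls(sample)
for name, n in counts.most_common(3):
    print(f"{n:8d}  {name}")
```

To use it on a real capture, pass an open file instead of the sample
list, e.g. count_syscalls(open("named.trace")). strace's own -c option
produces a similar per-call summary without any post-processing.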

Now here's my best guess. I have no idea how close (if at all) it is to
what happens in reality, that's just the best guess I've come up with:

Bind starts. Recursive queries come in. bind fetches the response, and
caches it. Slowly, the cache builds up. Each cached entry has a ttl. So,
when bind comes across a cached entry it needs to determine whether the
entry should be discarded. Hence, the time() call. My guess is that
after the cache reaches a certain size, the daemon is overloaded by
these time() calls and can't do anything else.
This is a dubious explanation, and I don't like it either. It's just the
best guess I've had so far.
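
To make the guess above concrete, here is a toy model in Python. It is
NOT how bind's cache is implemented; TtlCache, CountingClock and both
sweep methods are invented for this illustration. It only shows that
checking expiry with one clock read per entry costs as many time()
calls as there are cached entries, while reading the clock once per
pass costs a single call.

```python
import time

class CountingClock:
    """Fake clock that counts how many times it is read (illustration only)."""
    def __init__(self, now=0.0):
        self.now = now
        self.calls = 0

    def __call__(self):
        self.calls += 1
        return self.now

class TtlCache:
    """Toy TTL cache; a hypothetical model, not bind's data structure."""
    def __init__(self, clock=time.time):
        self._clock = clock
        self._entries = {}  # name -> (value, expiry timestamp)

    def put(self, name, value, ttl):
        self._entries[name] = (value, self._clock() + ttl)

    def sweep_per_entry(self):
        """Drop expired entries, reading the clock once per entry."""
        for name in list(self._entries):
            if self._clock() >= self._entries[name][1]:
                del self._entries[name]

    def sweep_single_read(self):
        """Drop expired entries, reading the clock once for the whole pass."""
        now = self._clock()
        expired = [n for n, (_, exp) in self._entries.items() if now >= exp]
        for name in expired:
            del self._entries[name]

    def __len__(self):
        return len(self._entries)

clock = CountingClock()
cache = TtlCache(clock)
for i in range(1000):
    cache.put(f"host{i}.example", "192.0.2.1", ttl=60)

clock.calls = 0
cache.sweep_per_entry()
per_entry_calls = clock.calls   # one clock read per cached entry

clock.calls = 0
cache.sweep_single_read()
single_read_calls = clock.calls  # one clock read for the whole pass

clock.now = 61.0                 # move past every entry's ttl
cache.sweep_single_read()
remaining = len(cache)
print(per_entry_calls, single_read_calls, remaining)
```

Under this model the number of clock reads grows linearly with the
cache size, which is at least consistent with the trace, though it
does not prove that this is what the daemon is doing.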

Lacking other options, I rolled back to bind 8.4.7 which works perfectly
fine (same machines, same resources).

I hope this helps. If you come across any better explanation (or a
solution) I'll be more than happy to hear it.

Regards,
Meron



-----Original Message-----
From: bind-users-bounce at isc.org [mailto:bind-users-bounce at isc.org] On
Behalf Of Gushi
Sent: Monday, July 02, 2007 10:29 PM
To: comp-protocols-dns-bind at isc.org
Subject: CPU/Load issues (FreeBSD, BIND 9.4.1)

This is now the third posting I've made, and it's getting frustrating.

Assume the following:

I have a server which answers authoritatively for about 20 zones, and is
caching DNS for about 1000 recursive clients.  This server process
occasionally eats 90 percent of the CPU and stops responding to {rndc,
kill -HUP, kill}.  Both our DNS servers do this occasionally, but one
FAR more than the other.

What logging level would be useful in finding out what, exactly, the
server is DOING to cause it to do that?  This level of activity can take
days or weeks to build up.  Sometimes the activity level drops down to
normal, sometimes it sticks at 95+ percent cpu.  I would LOVE to be able
to give more information on this, without having to muck my way through
a debugger.  How do I go about it?

If the answer is "Ask my OS mailing list", please tell me.  If it's "buy
a BIND support contract", please tell me.  If it's "screw it, the latest
FreeBSD, the latest BIND, and 2g of ram on a Dell 600SC is inadequate to
run solely a DNS server and sshd", tell me.

Thanks for any and all help,

Dan Mahoney
Frustrated in Long Island







