Bind9 Crazy-high CPU on Linux

Matthew Schlosser mschlosser at eschelon.com
Mon Jan 22 19:00:03 UTC 2007


Thanks for the information.

"From the above figures, it looks like your client base is
making lots of queries which fail to resolve."

After some research it appears one or two customers account for this
traffic, querying broken zones over and over.

This name server acts as a resolver for a cluster of mail servers as well as
customer equipment, and I'm certain some of those customers run mail servers
of their own.
RBL checks are passed to the BIND server which forwards them to an internal
RBL service.

Along these lines, where do queries sent to "forwarders" fall in the stats
list?  Are they counted as recursive?

The service has been running non-stop since the last time I sent out a
status.  Here it is now:

success 94713973
referral 8285349
nxrrset 20447734
nxdomain 56751210
recursion 122660495
failure 54669574
duplicate 3398529
dropped 438693
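
A minimal sketch of turning the counters above into percentages. It
assumes, as in the earlier message quoted below, that success, referral,
nxrrset, nxdomain and failure together make up the total of answered
queries; that reading is not confirmed anywhere in this thread. The helper
names (parse_counters, percent_of_outcomes) are hypothetical and not part
of BIND.

```python
# Counters copied from the named.stats figures in this message.
STATS_TEXT = """\
success 94713973
referral 8285349
nxrrset 20447734
nxdomain 56751210
recursion 122660495
failure 54669574
duplicate 3398529
dropped 438693
"""

# Assumption: these five counters are mutually exclusive query outcomes,
# so their sum approximates the total number of answered queries.
OUTCOME_KEYS = ("success", "referral", "nxrrset", "nxdomain", "failure")


def parse_counters(text):
    """Parse 'name value' lines into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank or malformed lines rather than failing on them.
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def percent_of_outcomes(counters):
    """Return every counter as a percentage of the summed outcome counters."""
    total = sum(counters[key] for key in OUTCOME_KEYS)
    return {name: 100.0 * value / total for name, value in counters.items()}


if __name__ == "__main__":
    counters = parse_counters(STATS_TEXT)
    for name, pct in percent_of_outcomes(counters).items():
        print(f"{name:<10} {pct:6.2f}%")
```

Under that assumption, failure comes to roughly 23% and nxdomain to roughly
24% of answered queries. Where duplicate and dropped belong in the total is
left open here, so they are only expressed relative to the same sum.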

> -----Original Message-----
> From: bind-users-bounce at isc.org 
> [mailto:bind-users-bounce at isc.org] On Behalf Of Mark Andrews
> Sent: Thursday, January 18, 2007 7:58 PM
> To: Matthew Schlosser
> Cc: 'Stefan Puiu'; bind-users at isc.org
> Subject: Re: Bind9 Crazy-high CPU on Linux 
> 
> 
> > Thank you for all the references and help.
> > 
> > I upgraded to 9.4rc1 with the following results:
> > 
> > Massive jump in memory usage (about double).  The named process now
> > shows a memory footprint of close to 900MB where before 500MB would
> > kill it.
> > 
> > CPU stays between 20-25% and spikes to 30-35% during a cleaning interval
> > which lasts only a minute or two.
> > 
> > Previously with 9.3.2, rndc status showed 2,000 or more recursive
> > clients.  Now it shows fewer than 500 at any given time and the output
> > format has changed:
> > 
> > recursive clients: 463/3900/4000
> > 
> > The server has been up a little over 36 hours.
> > 
> > I also noted three new items in the named.stats file.  "Duplicate" and
> > "dropped" are new values.  Does anyone know how to fit them into the
> > greater scheme?  For example recursion can be subtracted from a combined
> > total of success, referral, nxrrset, nxdomain and failure to generate a
> > percentage.  Where do the new values fit?
> > 
> > success 30428464
> > referral 2099872
> > nxrrset 6270659
> > nxdomain 16121686
> > recursion 29924813
> > failure 8892309
> > duplicate 1101621
> 
> 	Duplicate queries are ones where an identical query was
> 	received (source address and port, qname, qtype, qclass)
> 	while an existing query was being resolved.
> 
> > dropped 159814
> 
> 	Excessive recursive queries for <qname, qtype, qclass> other
> 	than duplicate queries.  "Excessive" is self-adjusting within
> 	10:100 queries.  Successful resolution of a query for which
> 	there were drops raises the number of simultaneous recursive
> 	clients for a given <qname, qtype, qclass> tuple (provided
> 	another successful query has not already raised the threshold).
> 	This is then decayed with a timer.
> 
> 	From the above figures, it looks like your client base is
> 	making lots of queries which fail to resolve.
>  
> > -Matt
> >  
> > 
> > > -----Original Message-----
> > > From: bind-users-bounce at isc.org 
> > > [mailto:bind-users-bounce at isc.org] On Behalf Of Stefan Puiu
> > > Sent: Tuesday, January 16, 2007 8:13 AM
> > > To: Schlosser, Matt D.
> > > Cc: bind-users at isc.org
> > > Subject: Re: Bind9 Crazy-high CPU on Linux
> > > 
> > > Hi,
> > > 
> > > On 1/15/07, Schlosser, Matt D. <mschlosser at eschelon.com> wrote:
> > > > The machines run between 800 and 1,000 queries/second for both
> > > > authoritative and recursive zones.  After 12-24 hours, the CPU will
> > > > spike to 100% and sit there while the machine times out any more
> > > > queries.  The only resolution is to restart bind.
> > > 
> > > I haven't personally experienced this, but I've seen it reported quite
> > > a few times on this list. IIRC, it's been reported that the cache
> > > cleaning can be quite heavy sometimes, so you might want to adjust the
> > > cleaning interval.
> > > 
> > > Also, recompiling BIND with internal malloc support was reported to
> > > help (this requires editing a header file IIRC). That part seems to be
> > > detailed here:
> > > 
> > > http://groups.google.com/group/comp.protocols.dns.bind/browse_thread/thread/c830e65e2247c630/bfe3178894e98351?lnk=gst&q=jinmei+internal+malloc&rnum=1#bfe3178894e98351
> > > 
> > > No idea why running on Windows would make a difference.
> > > 
> > > Look in the archives, I believe it's been quite well covered.
> > > 
> > > Stefan.
> > > 
> > > 
> > 
> > 
> -- 
> Mark Andrews, ISC
> 1 Seymour St., Dundas Valley, NSW 2117, Australia
> PHONE: +61 2 9871 4742                 INTERNET: Mark_Andrews at isc.org
> 
> 


