Bind9 Crazy-high CPU on Linux

Mon Jan 22 22:14:15 UTC 2007

> Thanks for the information.
> 
> "From the above figures, it looks like your client base is
> making lots of queries which fail to resolve."
> 
> After some research it appears one or two customers account for this
> traffic, querying broken zones over and over.
> 
> This name server acts as a resolver for a cluster of mail servers as well as
> customer equipment and I'm certain some of them have mail servers as well.
> RBL checks are passed to the BIND server which forwards them to an internal
> RBL service.
> 
> Along these lines, where do "forwarders" fall in the stats list?  Are they
> recursive?

	Everything that goes to a forwarder is captured by recursion.
 
> The service has been running non-stop since the last time I sent out a
> status.  Here it is now:
> 
> success 94713973
> referral 8285349
> nxrrset 20447734
> nxdomain 56751210
> recursion 122660495
> failure 54669574
> duplicate 3398529
> dropped 438693
> 
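For anyone wanting to turn the counters quoted above into percentages, here is a minimal Python sketch. It follows the poster's own suggestion and assumes that success, referral, nxrrset, nxdomain and failure together make up the total number of answered queries; that is an assumption, not something confirmed in this thread. The names parse_stats and shares are made up for illustration. Duplicate and dropped are only expressed relative to that same total, since the thread does not establish how they fit into it.

```python
# Counters copied verbatim from the named.stats figures quoted above.
STATS_TEXT = """\
success 94713973
referral 8285349
nxrrset 20447734
nxdomain 56751210
recursion 122660495
failure 54669574
duplicate 3398529
dropped 438693
"""

# Assumption (from the poster's description, not confirmed by ISC):
# these five outcome counters sum to the total of answered queries.
OUTCOMES = ("success", "referral", "nxrrset", "nxdomain", "failure")


def parse_stats(text):
    """Parse 'name value' lines into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def shares(counters):
    """Return (total, percentages), each counter relative to the total."""
    total = sum(counters[name] for name in OUTCOMES)
    pct = {name: 100.0 * value / total for name, value in counters.items()}
    return total, pct


if __name__ == "__main__":
    total, pct = shares(parse_stats(STATS_TEXT))
    print("total answered queries:", total)
    for name, value in pct.items():
        print(f"{name:10s} {value:6.2f}%")
```

With these figures, nxdomain and failure together come to a little under half of the assumed total, which is consistent with the remark that many client queries fail to resolve.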
> > -----Original Message-----
> > From: bind-users-bounce at isc.org 
> > [mailto:bind-users-bounce at isc.org] On Behalf Of Mark Andrews
> > Sent: Thursday, January 18, 2007 7:58 PM
> > To: Matthew Schlosser
> > Cc: 'Stefan Puiu'; bind-users at isc.org
> > Subject: Re: Bind9 Crazy-high CPU on Linux 
> > 
> > 
> > > Thank you for all the references and help.
> > > 
> > > I upgraded to 9.4rc1 with the following results:
> > > 
> > > Massive jump in memory usage (about double).  The named process now
> > > shows a memory footprint of close to 900MB where before 500MB would
> > > kill it.
> > > 
> > > CPU stays between 20-25% and spikes to 30-35% during a cleaning
> > > interval which lasts only a minute or two.
> > > 
> > > Previously with 9.3.2, rndc status showed upwards of 2,000 recursive
> > > clients.  Now it shows fewer than 500 at any given time and the
> > > output format has changed:
> > > 
> > > recursive clients: 463/3900/4000
> > > 
> > > The server has been up a little over 36 hours.
> > > 
> > > I also noted three new items in the named.stats file.  "Duplicate"
> > > and "dropped" are new values.  Does anyone know how to fit them in to
> > > the greater scheme?  For example recursion can be subtracted from a
> > > combined total of success, referral, nxrrset, nxdomain and failure to
> > > generate a percentage.  Where do the new values fit?
> > > 
> > > success 30428464
> > > referral 2099872
> > > nxrrset 6270659
> > > nxdomain 16121686
> > > recursion 29924813
> > > failure 8892309
> > > duplicate 1101621
> > 
> > 	Duplicate queries are ones where an identical query was
> > 	received (source address and port, qname, qtype, qclass)
> > 	while an existing query was being resolved.
> > 
> > > dropped 159814
> > 
> > 	Excessive recursive queries for <qname, qtype, qclass> other
> > 	than duplicate queries.  Excessive is self-adjusting within
> > 	10:100 queries.  Successful resolution of a query for which
> > 	there were drops raises the number of simultaneous recursive
> > 	clients for a given <qname, qtype, qclass> tuple (provided
> > 	another successful query has not already raised the threshold).
> > 	This is then decayed with a timer.
> > 
> > 	From the above figures, it looks like your client base is
> > 	making lots of queries which fail to resolve.
> >  
> > > -Matt
> > >  
> > > 
> > > > -----Original Message-----
> > > > From: bind-users-bounce at isc.org 
> > > > [mailto:bind-users-bounce at isc.org] On Behalf Of Stefan Puiu
> > > > Sent: Tuesday, January 16, 2007 8:13 AM
> > > > To: Schlosser, Matt D.
> > > > Cc: bind-users at isc.org
> > > > Subject: Re: Bind9 Crazy-high CPU on Linux
> > > > 
> > > > Hi,
> > > > 
> > > > On 1/15/07, Schlosser, Matt D. <mschlosser at eschelon.com> wrote:
> > > > > The machines run between 800 and 1,000 queries/second for both
> > > > > authoritative and recursive zones.  After 12-24 hours, the CPU will
> > > > > spike to 100% and sit there while the machine times out any more
> > > > > queries.  The only resolution is to restart bind.
> > > > 
> > > > I haven't personally experienced this, but I've seen it reported
> > > > quite a few times on this list. IIRC, it's been reported that the
> > > > cache cleaning can be quite heavy sometimes, so you might want to
> > > > adjust the cleaning interval.
> > > > 
> > > > Also, recompiling BIND with internal malloc support was reported to
> > > > help (this requires editing a header file IIRC). That part seems to
> > > > be detailed here:
> > > > 
> > > > http://groups.google.com/group/comp.protocols.dns.bind/browse_thread/thread/c830e65e2247c630/bfe3178894e98351?lnk=gst&q=jinmei+internal+malloc&rnum=1#bfe3178894e98351
> > > > 
> > > > No idea why running on Windows would make a difference.
> > > > 
> > > > Look in the archives, I believe it's been quite well covered.
> > > > 
> > > > Stefan.
> > -- 
> > Mark Andrews, ISC
> > 1 Seymour St., Dundas Valley, NSW 2117, Australia
> > PHONE: +61 2 9871 4742                 INTERNET: Mark_Andrews at isc.org
> > 
> > 
> 
-- 
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742                 INTERNET: Mark_Andrews at isc.org