Urgent problem - BIND 9.2.x on Solaris 8 going "brain-dead"

Wed Jul 13 04:40:57 UTC 2005

The baseline facts:

- Solaris 8 on SPARC, 64-bit kernel, current patch set
- Sun Forte Studio 8 compiler
- BIND 9.2.x (have experienced this with 9.2.2, 9.2.3, and now with 9.2.5)
  Compiled as a 32-bit application, moderate optimization, default configure
  options except for directories.
  Running chrooted into a heavily secured private subdirectory,
  as a non-privileged user (user named, group named)
- 2 publicly exposed but firewalled systems
  Both running BIND plus Apache, qmail, and WuFTPd.

The symptoms:

BIND 9.2.3 had been running for about 6 weeks on both of these systems since
a patch cluster and reboot.  Today, both systems started returning SERVFAIL
to *all* queries, even ones that should have been refused and ones for
which local zones are defined.  Restarting the named processes cures the
problem, but only for about an hour before the SERVFAILs return.

I downloaded and built 9.2.5, and installed it on one of the two systems.
Under 9.2.5, the same problem still occurs, and still within about an hour
of a restart.  I have not rebooted the systems, and I don't have that option
at this time.

Running truss on the named process while it's in this "brain-dead" state
shows it making repeated calls to brk(), all of which fail with ENOMEM.  A
pmap shows that the heap is approximately 20 MB in size.

I am currently watching both systems after a restart of named, continuously
polling them with pmap and tracking the heap size.  I can already see the
heap increasing sporadically from about 2 MB at startup.  I expect that when
the heap hits 20 MB again, the named in question will go back to issuing
SERVFAILs and will need to be restarted again.
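
For anyone wanting to reproduce this kind of heap tracking, the polling
could be sketched roughly as below.  This is an illustration, not the
script actually in use here.  It assumes a modern Python 3, that pmap is on
the PATH, and that heap mappings show up as lines ending in "[ heap ]" with
the size in kilobytes in the second column; the function names heap_kbytes
and poll_heap are made up for this example.

```python
import re
import subprocess
import time

# Assumption: a heap mapping looks like
#   "00042000     40K read/write/exec     [ heap ]"
# i.e. hex address, size in KB with a trailing "K", then "[ heap ]" at the end.
HEAP_LINE = re.compile(r"^\s*[0-9A-Fa-f]+\s+(\d+)K\b.*\[\s*heap\s*\]\s*$")


def heap_kbytes(pmap_output):
    """Sum the sizes (in KB) of all heap mappings found in pmap output text."""
    total = 0
    for line in pmap_output.splitlines():
        match = HEAP_LINE.match(line)
        if match:
            total += int(match.group(1))
    return total


def poll_heap(pid, interval_seconds=60):
    """Print a timestamped heap size for pid until pmap reports an error."""
    while True:
        result = subprocess.run(
            ["pmap", str(pid)], capture_output=True, text=True
        )
        if result.returncode != 0:
            break  # process gone, or pmap could not inspect it
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(stamp, heap_kbytes(result.stdout), "KB")
        time.sleep(interval_seconds)
```

Calling poll_heap() with the pid of named and redirecting the output to a
file gives a growth-over-time log that can be lined up against the query
log to see what traffic coincides with each jump in heap size.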

These systems experienced this same problem earlier this year while running
BIND 9.2.2, but after an install of 9.2.3 in early February, the problem
appeared to go away - until today.  Two other similarly configured systems
with the exact same binaries installed on them, that we use for our internal
name servers, have never demonstrated these symptoms over the course of a
year or more.  They are currently on the same 9.2.3 binary that was
installed in February, and have been running continuously for 4 and 5 weeks
respectively.

The question(s):

Obviously, something in named is leaking heap.
How do I figure out what is causing the leak?
Why did it all of a sudden start today after six solid weeks of continuous
running with no problems?
Why does a restart of the named process not get me six more weeks of clean
running, but only an hour or so?
Why have the two internal systems NEVER experienced this issue?
Is this a directed attack against our public name servers?
Why have three different versions of BIND all behaved this way?
Am I doing something wrong with my configure and build?
Is it possible to figure out why the heap won't grow past 20 MB and
perhaps increase that limit?
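
On that last question, one thing that could be checked (a guess at a
candidate, not a diagnosis) is the data-segment resource limit, since brk()
is documented to fail with ENOMEM when a request would exceed it.  The
sketch below only reports the limits of the process that runs it, which are
inherited from the invoking shell; it assumes a modern Python 3, and the
limits actually in force for named may differ depending on how named is
started.  The function names are made up for this example.

```python
import resource


def format_limit(value):
    """Render an rlimit value as text, treating RLIM_INFINITY as unlimited."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return "%.1f MB" % (value / (1024.0 * 1024.0))


def data_segment_limits():
    """Return the (soft, hard) data-segment limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_DATA)


if __name__ == "__main__":
    soft, hard = data_segment_limits()
    print("RLIMIT_DATA soft:", format_limit(soft))
    print("RLIMIT_DATA hard:", format_limit(hard))
```

If the soft limit reported from the environment that launches named comes
out near 20 MB, that would be consistent with the ceiling seen in pmap,
though it would not explain why the heap is growing in the first place.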

I'd really like to get this solved as soon as possible.  I'll basically get
no more than about an hour of sleep until it's solved, or the problem goes
away on its own.  I'm willing to listen to any reasonable suggestions for
steps to take.  I'll also be happy to provide any additional information
requested.  I'm not afraid to recompile and re-install onto these systems
while they're live, but rebooting them MUST be considered only as a last
resort.

Thanks in advance for any help.
-- 
James Noyes
(jnoyes-bind at retrogeeks.com)