bind-9.4.0b2 throws SERVFAIL

JINMEI Tatuya / 神明達哉 jinmei at isl.rdc.toshiba.co.jp
Fri Oct 6 08:35:14 UTC 2006


>>>>> On Fri, 06 Oct 2006 09:01:45 +0200, 
>>>>> Marco Schumann <schumann at strato-rz.de> said:

>   This behaviour seems to be fixed in 9.4 as mentioned in the Release
> notes as we haven't seen this in the short period we used this version
> (9.4.0b2) on that hardware. Nevertheless, we still have seen a
> significant amount of UDP drops.
>  Now we are running bind-9.4.0b2 with threading enabled and a
> max-cache-size of 3072M (4G physical memory) (on Dual Core AMD
> Opteron(tm) Processor 185). There are ~3000..4000 q/s and no more UDP
> drops, each processor core averages 50% utilization, we have 16 worker
> threads enabled, and the cleaning-interval is 15m. When the cache size
> hits 2.6...2.8G, bind stops recursing and throws SERVFAIL instead. In
> the resolver logs we find entries "resolver: error: could not mark
> server as lame: out of memory". The problem disappears when named is
> restarted.
>  We are using views, no datalimit is set. Is max-cache-size the size per
> view or a global setting for all views? Or where does the "out of
> memory" come from?

max-cache-size can be applied on a per-view basis, but by default it is
a global parameter for all views (and it indirectly helps control the
amount of memory used for lame-server information).  In any event, a
memory shortage can occur if memory is consumed faster than it is
cleaned up.
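For illustration, max-cache-size is accepted in both scopes; a minimal
named.conf sketch (the view name and sizes here are made up for the
example, not a recommendation):

```
options {
	// Global default: inherited by every view that does not
	// set its own max-cache-size.
	max-cache-size 3072M;
};

view "internal" {
	// Per-view override: this view's cache is capped
	// independently of the global setting.
	max-cache-size 1024M;
};
```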

So, I'd be interested in how much memory the server uses in total
(i.e., not only for the cache).  Was the 2.8G of memory just for the
cache, or the total memory footprint?  Also (although I suspect it's
not the direct reason for this), does your server act as an
authoritative nameserver for a large zone?  If so, that would
effectively reduce the memory available for lame info and might lead
to a weird situation like the one you saw.  In that sense, it would
help if you could show us your named.conf.

BTW, I'm afraid running 16 worker threads on a dual-core machine
doesn't really make sense (and can even be harmful): BIND9's worker
threads normally perform non-blocking tasks, so using more threads
than there are available processors/cores simply increases control
overhead without any benefit.
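For example, the thread count can be matched to the core count at
startup with named's -n option (the config-file path here is only an
example):

```
# Start named with one worker thread per core on a
# dual-core machine, instead of the configured 16.
named -n 2 -c /etc/named.conf
```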

					JINMEI, Tatuya
					Communication Platform Lab.
					Corporate R&D Center, Toshiba Corp.
					jinmei at isl.rdc.toshiba.co.jp



More information about the bind-users mailing list