BIND freezing up randomly under "real" load

Thu Aug 4 18:44:58 UTC 2011

I am (was) prepping to deploy BIND 9.7.3-P3 (the version that ships with 
RHEL 6.1) on RHEL 6.1, sitting on top of OSPF anycast.  We currently run 
BIND 9.5.0-P2 (with Novell patches) on SLES 11 (also with OSPF anycast) 
just fine in production, but I'm running into a strange problem on the new 
system that did not show up during testing.

The SLES systems run OK, and the RHEL install tested OK under test load.  
After moving into production, named runs fine with no customers (OSPF 
off).  I can (and have) queried it (dig @localhost) all day.  When I turn 
on OSPF and let the real world query it, it works fine for a couple of 
minutes and then hangs completely:  rndc won't connect, no logs are 
updated, and named will not respond at all, but it is still running (a 
telnet to port 53 or 953 at least connects, but I'm not familiar with 
protocol-level commands).  named stop fails (hangs), and I have to pkill 
it to be able to restart it.
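
Since a bare telnet only shows that the port accepts connections, a small 
probe that sends a real DNS query with a timeout can show whether named is 
still answering.  The following is a minimal sketch, not a tested tool: 
the function names, the 2-second timeout, and the query name used in the 
usage note are all assumptions made for illustration.

```python
import socket
import struct


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query packet (wire format) for `name`."""
    # Header: ID, flags (only RD set), QDCOUNT=1, AN/NS/ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
        if label
    ) + b"\x00"
    # QTYPE (default 1 = A) and QCLASS (1 = IN).
    return header + qname + struct.pack("!HH", qtype, 1)


def probe_udp(server, name, timeout=2.0, port=53):
    """Return True if `server` answers a UDP query within `timeout`."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(4096)
        except OSError:  # includes socket.timeout
            return False
    # A reply has at least a 12-byte header and echoes the query ID.
    return len(reply) >= 12 and reply[:2] == query[:2]


def probe_tcp(server, name, timeout=2.0, port=53):
    """Return True if `server` answers a TCP query within `timeout`."""
    query = build_query(name)
    try:
        with socket.create_connection((server, port), timeout=timeout) as sock:
            # DNS over TCP prefixes each message with a 2-byte length.
            sock.sendall(struct.pack("!H", len(query)) + query)
            reply = sock.recv(4096)
    except OSError:  # refused, reset, or timed out
        return False
    # Skip the 2-byte length prefix, then compare the echoed query ID.
    return len(reply) >= 4 and reply[2:4] == query[:2]
```

Calling probe_udp("127.0.0.1", "example.com") and probe_tcp("127.0.0.1", 
"example.com") in a loop, and logging the time of the first False, would 
show whether UDP and TCP stop answering at the same moment.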

A trace (level 99) showed nothing in the logs before it hung.  I was 
running rndc commands at the time; I see those logged and then nothing.  
The only errors logged before that were a few DNS format errors from 
external sources.

The network guys and I note that OSPF looks fine (I'm digging against 
localhost anyway, and that eventually fails too).  lsof doesn't show 
anything particularly weird.  System load, memory, etc. seem fine 
(comparable to the SLES systems).  It just seems to give up the ghost...
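
One more thing that could be collected while named is hung is the state 
of each of its threads, read from /proc on Linux.  The sketch below is an 
assumption-laden illustration (the function names are made up here, and it 
ignores threads that exit mid-scan); it only reads the per-thread stat and 
wchan files.

```python
import os


def parse_stat_line(line):
    """Return (comm, state) from one /proc/<pid>/task/<tid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST ')' rather than on whitespace.
    start = line.index("(")
    end = line.rindex(")")
    comm = line[start + 1:end]
    state = line[end + 1:].split()[0]
    return comm, state


def thread_states(pid):
    """List (tid, comm, state, wchan) for every thread of `pid` (Linux only)."""
    rows = []
    task_dir = "/proc/%d/task" % pid
    for tid in sorted(os.listdir(task_dir), key=int):
        with open(os.path.join(task_dir, tid, "stat")) as fh:
            comm, state = parse_stat_line(fh.read())
        try:
            # wchan names the kernel function the thread is sleeping in.
            with open(os.path.join(task_dir, tid, "wchan")) as fh:
                wchan = fh.read().strip() or "-"
        except OSError:
            wchan = "?"
        rows.append((int(tid), comm, state, wchan))
    return rows
```

Run against the pid of the hung named, a list where every thread is in 
state S (sleeping) would look different from one with threads stuck in 
state D (uninterruptible sleep), which may help narrow down where it is 
blocked.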

Any ideas?  Any additional info that would be helpful?

cheers and thanks,
________________________________________________________________________
Ian Veach             NSHE System Computing Services
ian_veach at nshe.nevada.edu        Senior Systems Engineer
________________________________________________________________________
