URGENT, PLEASE READ: 9.5.0-P1 now available
Chris Thompson
cet1 at hermes.cam.ac.uk
Thu Jul 10 11:44:29 UTC 2008
On Jul 10 2008, JINMEI Tatuya / 神明達哉 wrote:
>For anyone experiencing this problem, I'd like to get the following
>information:
This is for the BIND 9.4.2-P1 nameservers running under Solaris 10_x86
(SunOS 5.10 Generic_127112-10, in non-global zones, on a Sun X4100 M2,
if any of that matters) which I mentioned before - we have had a couple
more incidents overnight.
>- checks whether the server constantly opens such a large number of
> sockets, e.g., by using lsof
Looking at /proc/[pid]/fd shows that most of the time the high-water
mark is around 60-70. They are nearly all sockets, as expected.
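
A rough sketch of how that high-water mark could be sampled, assuming a /proc filesystem that exposes one entry per open descriptor under /proc/[pid]/fd. The function names and the proc_root parameter are made up for illustration. Polling can miss short spikes, so the result is only a lower bound on the true peak.

```python
import os
import time


def count_open_fds(pid, proc_root="/proc"):
    """Return the number of entries under <proc_root>/<pid>/fd."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def sample_high_water(pid, samples=60, interval=1.0, proc_root="/proc"):
    """Poll the fd count `samples` times and return the largest value seen."""
    peak = 0
    for i in range(samples):
        peak = max(peak, count_open_fds(pid, proc_root))
        # No need to sleep after the final sample.
        if i + 1 < samples:
            time.sleep(interval)
    return peak
```

Calling sample_high_water(pid_of_named) as a user allowed to read that process's /proc entry would give a figure comparable to the 60-70 quoted above.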
>- checks how many clients the server is normally handling, by
> executing 'rndc status' several times. (note: you may have to
> specify a smaller value for the recursive-clients option so that
> there's at least one TCP socket available for rndc)
Outstanding queries at any one time are mostly below 50, but we know
from past experience that this is *extremely* variable. Some host
starts unloading lots of slowly unanswerable queries on us when their
primary resolver doesn't respond fast enough - these servers are much
used as backup resolvers in resolv.conf's-or-equivalent.
I wouldn't be at all surprised to find that the incidents are caused
by attempts at retrospective Apache (or other) log analysis, a common
cause of query rate spikes.
>- checks query rate, cache hit rate, number of queries sent from the
> server per some time unit. you can get these numbers by executing
> 'rndc stats' periodically and several times (note: some of the
> numbers are only available for 9.5)
A bit of query logging shows one of them running at 225 queries/sec and
the other at 85 queries/sec. But of course as above, this is a highly
variable thing. Other stats not yet to hand.
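
For illustration only, one way such figures might be derived, assuming the arrival time of each logged query has already been extracted as a number of seconds (the log parsing itself is left out, and both function names are hypothetical). Because the load is so variable, the sketch reports a per-second peak as well as the average.

```python
from collections import Counter


def queries_per_second(timestamps):
    """Average rate over the span covered by `timestamps` (in seconds)."""
    if len(timestamps) < 2:
        return 0.0
    span = max(timestamps) - min(timestamps)
    if span <= 0:
        return 0.0
    # n timestamps delimit n - 1 intervals across the span.
    return (len(timestamps) - 1) / span


def peak_per_second(timestamps):
    """Largest number of queries falling into any one whole-second bucket."""
    buckets = Counter(int(t) for t in timestamps)
    return max(buckets.values()) if buckets else 0
```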
>Also, for those who can try beta versions in the operational
>environment, I'd like you to try it to see whether the problem still
>happens with them.
It would help if you could go into more detail about the differences
in socket handling in 9.4.2-P1 vs 9.4.3b2. I see that there has already
been one report of the same or similar problem against 9.4.3b2 on a
Linux system.
I have been wondering whether the problem is in fact an effective 256
file descriptor limit, despite the larger resource limit settings. The
named binary is a 32-bit executable, not a 64-bit one (default BIND make
on these Opteron processors). Has anyone tried a 64-bit one?
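
A minimal sketch of the comparison in question, assuming the 256 figure above is only a suspicion and not a confirmed limit. It reads the descriptor resource limit of the process running the script (not of named itself) and works out which ceiling would bite first. The helper names are hypothetical.

```python
import resource


def nofile_limits():
    """Return (soft, hard) RLIMIT_NOFILE for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def lowest_ceiling(soft_limit, suspected=256):
    """Return whichever is lower: the soft limit or the suspected ceiling."""
    if soft_limit == resource.RLIM_INFINITY:
        return suspected
    return min(soft_limit, suspected)


def headroom(peak_fds, ceiling):
    """Descriptors left between an observed peak and a ceiling."""
    return ceiling - peak_fds
```

With the usual high-water mark of about 70, headroom(70, 256) leaves 186 descriptors, so a spike in outstanding queries would have to add roughly that many sockets before a 256 ceiling came into play.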
--
Chris Thompson
Email: cet1 at cam.ac.uk