bind9 is taking little breaks for some reason.

Tue May 15 15:42:40 UTC 2007

	This is a mystery that has no good suspects yet.  We
have bind9.3.4 running on a Dell server with a peak DNS query
load that sometimes tops a million queries per hour but
usually stays in the 600,000-per-hour range.  The system load
average is around 0.05 to 0.11 or so with no signs of stress.

	The problem is that, at random times, attempts to update
the DNS from clients on the same network, even ones on ports in
the same switch, fall on deaf ears, so to speak.

	I discovered it when I noticed that the DHCP server,
which is in the same subnet on the same switch, was giving
occasional "timed out" errors in clusters that sometimes span
over a minute but are usually limited to a couple of seconds.

	If I look at the logs on the DNS, I see no squawks about
anything bad.  There is simply a little gap in the normal tempo
of messages about Microsoft clients trying to update us and us
refusing, etc.  The gap is wide enough to be certain that
something is wrong because it will be significantly longer than
the normal period of time between messages.
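
	A minimal sketch of how such gaps could be picked out of
a log automatically, assuming traditional syslog timestamps such
as "May 15 15:42:40" at the start of each line.  The name
find_gaps, the 10-second threshold, and the year argument are
illustrative assumptions, not part of the setup described here.

```python
from datetime import datetime


def find_gaps(lines, threshold_seconds=10.0, year=2007):
    """Return (previous_time, next_time, gap_seconds) for each pause
    between consecutive log lines that exceeds the threshold."""
    gaps = []
    previous = None
    for line in lines:
        # Classic syslog lines carry no year, so one is supplied here.
        # The timestamp occupies the first 15 characters of the line.
        try:
            stamp = datetime.strptime(f"{year} {line[:15]}",
                                      "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if previous is not None:
            gap = (stamp - previous).total_seconds()
            if gap > threshold_seconds:
                gaps.append((previous, stamp, gap))
        previous = stamp
    return gaps
```

	The threshold would need to be tuned to the normal tempo
of messages on the server in question, and a log that crosses a
year boundary would need extra handling.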

	It also does not seem to be related to traffic levels, as
we see it in the wee hours of the morning as well as mid-day on
a Wednesday, which is one of our heaviest usage days.

	Our slave is on a brand new Dell 2650 running FreeBSD6.2
and apparently exhibits the same behavior: a client recently
had a query time out, then repeated it and it worked.

	Some other background follows:

	The master DNS is presently on FreeBSD4.11.  We began
running bind9.3.4 on March 21 after upgrading from bind9.3.2 or
similar.  A check of logs over the last 6 months shows one
little holiday in a whole 24-hour period in September.  The
number of naps had increased to 6 or 8 or so per day by March
15 (older bind).  After March 21 (bind9.3.4), the problem did
not get any better or worse.

	I did ping our master DNS from yet another device on
that same network while we were having one of those cat-naps,
and never lost a packet.

	Does this sound familiar?  Again, bind does not appear
to be having any trouble.  I wrote an expect script to run on
our dhcp server to initiate a rndc status call to our master at
the first "timed out" message in syslog.  Each time, bind
responds with a clean bill of health.  In fact, the number of
recursive clients is always extremely low.  Our usual count is
30 to 50 out of 1000.  When we do the status check after a
time-out, it may be as low as 2.
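
	The check described above was done with an expect script
that is not reproduced here.  As a rough equivalent, the sketch
below scans log lines for the first "timed out" message, runs
"rndc -s <server> status", and pulls out the recursive clients
figure.  It assumes the status output contains a line of the
form "recursive clients: 2/1000" (newer versions of bind print
three numbers, which the parser also accepts).  The function
names and the injectable run argument are hypothetical.

```python
import re
import subprocess


def recursive_clients(status_text):
    """Return (current, limit) from rndc status output, or None."""
    match = re.search(r"recursive clients:\s*([\d/]+)", status_text)
    if match is None:
        return None
    parts = [p for p in match.group(1).split("/") if p]
    if not parts:
        return None
    # First number is the current count, last number is the limit.
    return int(parts[0]), int(parts[-1])


def check_on_timeout(log_lines, server, run=subprocess.run):
    """On the first 'timed out' line, query the server's status.

    Returns the (current, limit) recursive client counts, or None if
    no such line was seen or the status output could not be parsed.
    """
    for line in log_lines:
        if "timed out" in line:
            result = run(["rndc", "-s", server, "status"],
                         capture_output=True, text=True)
            return recursive_clients(result.stdout)
    return None
```

	Passing run as an argument only makes the function easy
to try out without a live server; in real use the default
subprocess.run would be left in place.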

	To be clear, this affects queries as well as updates.
Anything that is port 53 and udp seems to be affected.
The switch being used is a Cisco 3750.
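
	Since ping only exercises ICMP, a probe that sends a real
DNS query over udp port 53 may be a better way to catch one of
these episodes in the act.  What follows is a minimal sketch
using only the Python standard library; the function names, the
fixed query ID, and the 2-second timeout are assumptions made
for illustration.

```python
import socket
import struct
import time


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query packet for the A record of name."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Each label is encoded as a length byte followed by its text.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    # Root label terminator, then QTYPE=A (1) and QCLASS=IN (1).
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def probe(server, name, timeout=2.0):
    """Return the round-trip time in seconds, or None on no valid reply."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes socket timeouts
            return None
        if reply[:2] != query[:2]:
            return None  # reply ID does not match the query
        return time.monotonic() - start
```

	Calling probe() once a second from a machine on the same
switch and logging every None result with a timestamp would show
whether the udp silences line up with the gaps in the named log.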

	Any ideas, advice, etc. are greatly appreciated.

Martin McCormick WB5AGZ  Stillwater, OK 
Systems Engineer
OSU Information Technology Department Network Operations Group