Strange Symptom from Bind-9.1.3 server

Mon Mar 18 17:20:09 UTC 2002

	We recently had quite a bit of router/network trouble
over a period of days.  After things got back to normal, both our
master and slave domain name servers running bind-9.1.3 appeared
to be doing well, but apparently there was a problem.

	We began receiving reports that the domain seedgenes.org
could not be resolved on campus but could resolve with no trouble
for customers outside our network.

	I verified that I could even log in to the system hosting
our master dns, point dig directly at the seedgenes.org name
servers, and get a proper reply.  Normal queries to our own dns
requesting information about seedgenes.org, however, always
failed.
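
	For anyone who wants to script the same comparison, here
is a rough Python sketch.  It is not what I ran (I used dig); it
is only an illustration.  It sends one plain UDP query for a name
and reports the response code and answer count, so the same name
can be asked of two different servers.  The function names
(build_query, parse_header, ask) are made up for this example.

```python
import socket
import struct

def build_query(name, qtype=1, txid=0x1234, recurse=True):
    """Build a minimal DNS query packet (qtype 1 = A record, class IN)."""
    flags = 0x0100 if recurse else 0x0000  # RD bit asks the server to recurse
    header = struct.pack("!HHHHHH", txid, flags, 1, 0, 0, 0)
    # Encode the name as length-prefixed labels, ended by the root label.
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)

def parse_header(packet):
    """Return (rcode, answer_count) from the 12-byte DNS response header."""
    _txid, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", packet[:12])
    return flags & 0x000F, ancount  # rcode 0 = NOERROR, 2 = SERVFAIL

def ask(server, name, recurse=True, timeout=3.0):
    """Send one UDP query to server port 53.

    Returns (rcode, answer_count), or None if no reply arrives in time.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name, recurse=recurse), (server, 53))
        try:
            packet, _addr = sock.recvfrom(512)
        except socket.timeout:
            return None
    return parse_header(packet)
```

	One would call ask() once with the address of the local
caching server and once with the address of a name server for the
domain (passing recurse=False for the latter), then compare the
two results.  A SERVFAIL or a timeout from the local server
alongside a clean answer from the remote one would match the
symptom described here.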

	Our slave dns could correctly resolve this domain, and I
verified that the root zone cache file loaded by both systems
was identical.

	I finally fixed the problem by accident when I attempted
to send our master dns a kill -USR1 to turn on debugging output.
The plan had been to enable debugging, make another attempt to
resolve seedgenes.org, and then turn debugging off again since
there is so much output.

	Somehow, I accidentally completely killed named and had
to restart it from a cold start.

	As soon as it came up, seedgenes.org began to resolve.

	What may have happened?  The named process sometimes gets
close to 400,000 hits per hour and most of them are successful
or our phones would have been ringing off the wall for days.

	There appears to have been nothing unusual or wrong about
seedgenes.org except that this one box of ours would not resolve it
until I accidentally gave named a fresh start.

	I had tried rndc earlier and had had no luck.

	I wouldn't be surprised if other domains were failing to
resolve and we simply hadn't discovered that yet.

	We are extremely happy with bind, here, and this is not a
gripe.  Very few UNIX systems I have seen handle network
instability gracefully.  Weird things seem to start happening and
processes fail in ways one wouldn't even expect except that the
failures coincide with the network instability.  My theory is
that it has to do with large numbers of file descriptors being
opened as sockets are created and then, due to net problems,
maybe never closed properly.

	I would like to know if there is any kind of self-test
one can do on the named process to see if it is still as sane as
it appears to be.  If I hadn't completely killed and restarted
named, I might still be trying everything else I could think of
to solve the problem.
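
	To make the question concrete, the kind of check I have
in mind might look like the Python sketch below.  It asks one
server for a list of names that are known to resolve and reports
which ones fail.  This is only an assumption about what a
self-test could be, not a feature of named; the helper names
(dig_lookup, failed_canaries) are invented here, and it relies
on dig being installed.  It would only catch failures for the
names one thinks to list.

```python
import subprocess

def dig_lookup(server, timeout_s=2):
    """Return a function that asks `server` for a name using dig.

    The returned function gives True only if dig exits cleanly and
    prints at least one answer line.
    """
    def lookup(name):
        result = subprocess.run(
            ["dig", "@" + server, name, "+short",
             "+time=%d" % timeout_s, "+tries=1"],
            capture_output=True, text=True,
        )
        return result.returncode == 0 and result.stdout.strip() != ""
    return lookup

def failed_canaries(names, lookup):
    """Return the names for which lookup(name) is false or raises OSError."""
    failed = []
    for name in names:
        try:
            ok = lookup(name)
        except OSError:  # e.g. dig is not installed
            ok = False
        if not ok:
            failed.append(name)
    return failed
```

	Run from cron against the local server with a handful of
outside domains, failed_canaries() returning a non-empty list
would at least be a hint that named has stopped resolving
something it should.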

Martin McCormick WB5AGZ  Stillwater, OK 
OSU Center for Computing and Information Services Network Operations Group