2 simultaneous hung Bind boxes

Wed Oct 28 05:30:09 UTC 2009

I got a call from a remote tech earlier this evening.  He was at home on 
our service and couldn't get on the Internet.  His IP connectivity was 
fine and he could reach my NOC website by IP address.  DNS, however, was 
hosed. 
About the time I got in a position to check the bind logs and sniff his 
traffic the problem went away.  We chalked it up to a local problem 
until a few minutes later, when I too experienced the same thing across 
the SP network.  My DNS requests simply timed out.  I turned on querylog on 
our boxes and could see what appeared to be successful hits and replies. 
Even so, the boxes were just not replying to queries.  Traffic on our main 
upstream dropped by about 90% within a few short minutes (users' DNS 
stopped and outbound usage basically ground to a halt).  Not knowing 
what else to try, I restarted bind on both NSs.  That fixed it.
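
A probe along these lines could flag a named that is still running but 
no longer answering.  This is only a minimal sketch, assuming Python 3 
and nothing beyond the standard library: it sends one A query over UDP 
and reports whether a reply with a matching ID comes back before a 
timeout.  The function names, the server addresses (documentation 
range) and the query name are all placeholders for illustration, not 
anything from the boxes described above.

```python
import socket
import struct


def build_query(name, qid=0x1234):
    """Build a minimal DNS query for an A record with recursion desired."""
    # Header: ID, flags (RD=1), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=A (1), QCLASS=IN (1)
    return header + qname + struct.pack("!HH", 1, 1)


def server_answers(server, name="example.com", timeout=2.0):
    """Return True if the server sends a reply with a matching ID in time."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:
            # Covers timeouts as well as unreachable-host errors
            return False
    return len(reply) >= 2 and reply[:2] == query[:2]


if __name__ == "__main__":
    # Placeholder addresses; substitute the real name servers
    for ns in ("192.0.2.1", "192.0.2.2"):
        print(ns, "ok" if server_answers(ns) else "NO REPLY")
```

Run from a host inside a customer prefix, this exercises the recursive 
path; it only checks that some reply arrives, not that the answer is 
correct.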

The boxes are running fairly old Bind code, 9.5.1b2.  Tomorrow I will 
upgrade to 9.6.1rc1 (unless people believe 9.7.0b1 is ready for use). 
My question: are there any known ways for a crafted query or crafted 
reply to cause what I've described on that old release of Bind?  I 
recall hearing about assorted things over the past couple of years, 
though I thought they were things that would cause actual crashing, 
not the kind of hang my boxes appeared to suffer this evening.  Does 
anything else come to mind?  The views on the servers only permit 
recursive lookups internally from our customer prefixes.  Externally you 
can only get responses for things that we have authority over.  Thoughts?

Thanks
  Justin