recursive queries fail with high load?

Mon Feb 26 15:00:23 UTC 2007

Chris Michels wrote:
>  
> I have 3 DNS servers running bind 9.3.2.  Two of them are failing to resolve
> recursive queries.   Both of these servers have a higher load because they
> are used by our spam filtering software.  I have increased the
> recursive-clients option on both servers.  It seems like recursive queries
> are just taking a long time and timing out.   What is going on here or where
> should I be looking for what is wrong?
> 
> A dig of a random name returns:
> 
> [root@ruby named]# dig www.websudoku.com @ns2.nau.edu
> 
> ; <<>> DiG 9.2.4 <<>> www.websudoku.com @ns2.nau.edu
> ; (1 server found)
> ;; global options:  printcmd
> ;; connection timed out; no servers could be reached
> 
> But if I set the timeout high it returns:
> 
> [root@ruby named]# dig +time=240 www.websudoku.com @ns2.nau.edu
> 

Unfortunately, we seem to be facing the same problem with bind 9.3.3. After 
2-3 days of uptime, for no apparent reason, all answers take too long 
and usually time out.
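
A minimal sketch of how this symptom could be caught from a periodic probe, 
assuming the text output of dig is captured by some external job. The 
function name and the 2000 ms "slow" threshold are hypothetical; the two 
strings it looks for are the timeout message quoted above and dig's usual 
";; Query time: N msec" line.

```python
import re

def classify_dig_output(text, slow_ms=2000):
    """Classify captured dig output as 'timeout', 'slow', 'ok' or 'unknown'.

    slow_ms is an arbitrary illustrative threshold, not a BIND or dig default.
    """
    # dig prints this when no server answered within its timeout.
    if "connection timed out" in text:
        return "timeout"
    # A successful answer normally carries a ";; Query time: N msec" line.
    match = re.search(r";; Query time: (\d+) msec", text)
    if match is None:
        return "unknown"
    return "slow" if int(match.group(1)) >= slow_ms else "ok"
```

Logging the result every minute or so would give a timestamp for when 
answers start to slow down, which can then be lined up with the graphs.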

When this happens, we notice a drop in successful queries in 
named.stats, the machine's load average jumps above 1 (normally around 
0.50), and the named process starts consuming 100% of CPU (normally under 
30%), while memory usage stays the same.

You can see relevant graphs at 
http://users.forthnet.gr/kat/mitsos/debug-ns1/athns03.html

The problem started on Sunday at 12:00, and I restarted the server on 
Monday at 16:30. See the load graph and all the bind9-related graphs in 
the "Other" category.

A tail -f of the query log shows queries still being processed 
successfully, but those are probably the ones from clients with a long 
timeout value. We use two views, and this happens in both of them.

I tried increasing the debug level to 99, but have found nothing useful so far.

# rndc status
number of zones: 20
debug level: 99
xfers running: 0
xfers deferred: 0
soa queries in progress: 0
query logging is ON
recursive clients: 25/10000
tcp clients: 0/100
server is up and running
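
Note that the recursive-clients quota is nowhere near exhausted here (25 of 
10000). A minimal sketch for tracking these counters over time, assuming 
the output of rndc status is captured as text in the format shown above; 
the function names and the 0.8 threshold are hypothetical.

```python
import re

def parse_rndc_status(text):
    """Return {name: (current, limit)} for each 'name: cur/max' status line."""
    usage = {}
    for line in text.splitlines():
        match = re.match(r"^\s*([a-z ]+):\s*(\d+)/(\d+)\s*$", line)
        if match:
            name = match.group(1).strip()
            usage[name] = (int(match.group(2)), int(match.group(3)))
    return usage

def near_limit(usage, threshold=0.8):
    """List the counters whose current value is at or above threshold * limit."""
    return [name for name, (cur, lim) in usage.items()
            if lim and cur / lim >= threshold]

# Status lines copied from the output above.
SAMPLE = """number of zones: 20
debug level: 99
recursive clients: 25/10000
tcp clients: 0/100
server is up and running
"""

usage = parse_rndc_status(SAMPLE)
print(usage)
print(near_limit(usage))
```

For the sample above, near_limit returns an empty list, which fits the 
observation that raising recursive-clients does not help.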

This is the top output at the time of the problem:
   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
16380 named     23   0  870m 838m 2040 S  100 41.4   2502:39 named

The only solution so far is to restart bind.
Thoughts or suggestions on how to debug this further are more than welcome.

Sotiris.