Recursion ceases for 5-10 minutes at random intervals throughout the day

Bill Springall springall at fuse.net
Wed Feb 13 22:32:41 UTC 2008


Hello all,
    I was wondering if anyone has run into a recent problem that has me
scratching my head.  We just replaced 6 existing non-authoritative Bind
servers with new hardware and a Bind upgrade.  They are all in an OSPF
load-balanced setup using Zebra (existing).  All of the servers lose
resolution at random times of the day for approximately 5 to 10
minutes.  The times are seemingly random and do not necessarily match
another server's failures.  For example, one has had 18 dropouts in
the last 24 hours, while another hasn't had any; it may be the opposite
tomorrow.  The symptoms are a total loss of resolution, with an almost
equal number of failures reported in the bind stats.  All users receive
"Server Failure".
     Each server handles anywhere between 500 and 1500 qps throughout
the day under normal load.  The problem occurs at all loads.
     I've tried port monitoring, tcpdumping the traffic, and sifting
through the requests, and nothing seems out of the ordinary.  Numerous
tweaks of the OS have not helped (state table within limits and then
disabled, firewall deactivated/activated, eth stats good).  When the
problem happens I can get onto the machine and it is OK (network
upstream good, routing table hasn't inherited anything new, server
calm).  When I turn logging up to a level that can help, named can't
keep up.
     We now have a troubleshooting process in the works that involves
different hardware and 9.4.2, environment re-architecture, as well as,
<shiver>, other caching DNS software.
     Is there a known problem, one that I haven't been able to find,
that could be causing this?  As I understand it, the "Server Failure"
message is a general one; could someone help point me to the next thing
to try?  Any help would be appreciated!

- Bill
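
One way to pin down exactly when each dropout starts and ends, without
raising named's log level, might be a small external probe.  The sketch
below is only an illustration and not part of the original setup: it
sends one recursive A query per interval over UDP and prints the
response code with a timestamp.  The server address is taken from the
listen-on list in the details below; the query name, the 5-second
interval, and all function names are assumptions.

```python
import socket
import struct
import time

# Response codes from the low 4 bits of the DNS header flags (RFC 1035).
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name, qid, qtype=1):
    """Build a minimal recursive query (QTYPE A by default, class IN)."""
    # Header: ID, flags with only RD set (0x0100), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    return header + qname + b"\x00" + struct.pack("!HH", qtype, 1)


def parse_rcode(response):
    """Return (query id, rcode name) read from the 12-byte DNS header."""
    if len(response) < 12:
        raise ValueError("short DNS response")
    qid, flags = struct.unpack("!HH", response[:4])
    rcode = flags & 0x000F
    return qid, RCODES.get(rcode, str(rcode))


def probe(server, name, qid, timeout=2.0):
    """Send one query; return 'TIMEOUT', 'BADID', or the rcode name."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name, qid), (server, 53))
        try:
            data, _ = sock.recvfrom(4096)
        except socket.timeout:
            return "TIMEOUT"
    rid, rcode = parse_rcode(data)
    return rcode if rid == qid else "BADID"


if __name__ == "__main__":
    SERVER = "10.0.0.1"    # one of the listen-on addresses from the post
    NAME = "example.com"   # hypothetical query name
    qid = 0
    while True:
        qid = (qid + 1) % 65536
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(stamp, SERVER, NAME, probe(SERVER, NAME, qid), flush=True)
        time.sleep(5)
```

The probe's source address would need to fall inside the "allowed" ACL.
A name that is already cached may be answered without any upstream
lookup, so probing a mix of cached and rarely requested names could
help show whether only the recursive path is failing.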


Pertinent details:
6 servers - all with 2 dual-core AMD Opterons - 4 GB RAM
CentOS 5 - Linux dnshost1 2.6.18-53.1.6.el5 #1 SMP Wed Jan 23 11:28:47
EST 2008 x86_64 x86_64 x86_64 GNU/Linux

ISC Bind - 9.4.1-P1
     ./configure --prefix=/usr/local/bind-9.4.1-P1 --enable-threads
--with-openssl=/usr/local/ssl/
     (Built without issue or complaint)
GNU Zebra - 0.95a
     ./configure --disable-ipv6 --enable-nssa
     (also built without issue or complaint)

options {
         directory "/usr/local/named";
         allow-transfer {"allowed";};
         allow-recursion {"allowed";};
         version "(We'll leave this out...)";
         query-source 10.0.0.4 port 32700;
         recursive-clients 10000;
         listen-on { 10.0.0.1; 10.0.0.2; 10.0.0.3; 192.168.0.1; };
};

(real world IPs substituted to protect the innocent)
10.0.0.4 - bound to eth for upstream lookup
10.0.0.1, 10.0.0.2, 10.0.0.3 are real world IPs bound to loopback for
ospfd.
192.168.0.1 is management
"allowed" is an ACL built to include customer netblocks.





