Recursion ceases for 5-10 minutes at random intervals throughout the day
Bill Springall
springall at fuse.net
Wed Feb 13 22:32:41 UTC 2008
Hello all,
I was wondering if anyone has run into a recent problem that has me
scratching my head. We just replaced 6 existing non-authoritative Bind
servers with new hardware and a Bind upgrade. They are all in an OSPF
load-balance solution using Zebra (existing). All of the servers suffer
from a lack of resolution at random times of the day for approximately 5
to 10 minutes. The times are seemingly random and do not necessarily
match another server's failure. For example, one has had 18 dropouts in
the last 24 hours, while another hasn't had any; it may be the opposite
tomorrow. The symptom is a total loss of resolution, with an almost
equal number of failures reported in the BIND stats. All users receive
"Server Failure".
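A minimal sketch of how the failure counter could be compared between two statistics dumps, to put a number on a dropout. The dump layout shown in the comments (a "+++ Statistics Dump +++ (epoch)" header followed by "name value" counter lines) is an assumption about the 9.4-era statistics file, and the function names are hypothetical. The counters summed as "outcomes" are also an assumption and should be checked against the actual file.

```python
import re

# Assumed layout of one dump in the statistics file (not verified against
# the servers described here):
#   +++ Statistics Dump +++ (1202941961)
#   success 1234
#   failure 56
#   --- Statistics Dump --- (1202941961)
HEADER = re.compile(r"^\+\+\+ Statistics Dump \+\+\+ \((\d+)\)$")
COUNTER = re.compile(r"^(\w+) (\d+)$")

# Counters treated as mutually exclusive query outcomes (assumption).
OUTCOMES = ("success", "referral", "nxrrset", "nxdomain", "failure")


def parse_dumps(text):
    """Return a list of (epoch, {counter: value}), one entry per dump."""
    dumps = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        m = HEADER.match(line)
        if m:
            current = (int(m.group(1)), {})
            dumps.append(current)
            continue
        m = COUNTER.match(line)
        if m and current is not None:
            current[1][m.group(1)] = int(m.group(2))
    return dumps


def failure_share(older, newer):
    """Fraction of outcomes between two dumps that were failures."""
    delta = {k: newer[1].get(k, 0) - older[1].get(k, 0) for k in OUTCOMES}
    total = sum(delta.values())
    return delta["failure"] / total if total else 0.0
```

Taking a dump once a minute and printing failure_share() for each consecutive pair would show whether the share jumps toward 1.0 during a dropout.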
Each server handles anywhere between 500-1500 qps throughout the
day, under normal load. Problem occurs at all loads.
I've tried port monitoring, tcpdumping the traffic, and sifting
through the requests, and nothing seems out of the ordinary. Numerous
tweaks of the OS have not helped (state table within limits and then
disabled, firewall deactivated/activated, eth stats good). When the
problem happens I can get onto the machine and it is OK (network
upstream good, routing table hasn't inherited anything new, server calm).
When I turn logging up to a level that can help, named can't keep up.
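Since verbose logging is too heavy, one lighter option is to probe each server from outside and record the response code. The sketch below only handles the bookkeeping: it groups consecutive failed probes into dropout windows and checks whether windows from two servers overlap. The input format (epoch seconds plus an rcode string) and the function names are hypothetical; how the probes are collected is left out.

```python
def outage_windows(samples):
    """Group consecutive SERVFAIL probes into (start, end, count) windows.

    samples: iterable of (epoch_seconds, rcode) pairs, sorted by time.
    """
    windows = []
    start = last = None
    count = 0
    for ts, rcode in samples:
        if rcode == "SERVFAIL":
            if start is None:
                # First failed probe opens a new window.
                start, count = ts, 0
            last = ts
            count += 1
        elif start is not None:
            # First good answer after failures closes the window.
            windows.append((start, last, count))
            start = None
    if start is not None:
        # Input ended while still failing.
        windows.append((start, last, count))
    return windows


def overlaps(a, b):
    """True if two (start, end, count) windows share any point in time."""
    return a[0] <= b[1] and b[0] <= a[1]
```

Comparing the windows from each of the 6 servers with overlaps() would confirm or rule out the impression that the dropouts do not line up between servers.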
We now have a troubleshooting process in the works that
involves different hardware and 9.4.2, environment re-architecture, as
well as, <shiver>, other caching DNS software.
Is there a known problem that I haven't been able to find that
could be causing this? As I understand it, the "Server Failure" message
is a general one; could someone help point me to the next thing
to try? Any help would be appreciated!
- Bill
Pertinent details:
6 servers - all with 2 dual-core AMD Opterons - 4 GB RAM
CentOS 5 - Linux dnshost1 2.6.18-53.1.6.el5 #1 SMP Wed Jan 23 11:28:47
EST 2008 x86_64 x86_64 x86_64 GNU/Linux
ISC Bind - 9.4.1-P1
./configure --prefix=/usr/local/bind-9.4.1-P1 --enable-threads
--with-openssl=/usr/local/ssl/
(Built without issue or complaint)
GNU Zebra - 0.95a
./configure --disable-ipv6 --enable-nssa
(also built without issue or complaint)
options {
    directory "/usr/local/named";
    allow-transfer { "allowed"; };
    allow-recursion { "allowed"; };
    version "(We'll leave this out...)";
    query-source 10.0.0.4 port 32700;
    recursive-clients 10000;
    listen-on { 10.0.0.1; 10.0.0.2; 10.0.0.3; 192.168.0.1; };
};
(real world IPs substituted to protect the innocent)
10.0.0.4 - bound to eth for upstream lookup
10.0.0.1, 10.0.0.2, 10.0.0.3 are real world IPs bound to loopback for
ospfd.
192.168.0.1 is management
"allowed" is an ACL built to include customer netblocks.