Strange BIND9 issue

Wed Jan 12 01:42:15 UTC 2005

I'm having issues with one of our caching-only nameservers. A dig to
127.0.0.1 always responds quickly, even while the problem is
happening:

radon: 04:56pm# while true ; do dig yahoo.com @127.0.0.1 | grep Query ; done
;; Query time: 45 msec
;; Query time: 9 msec
;; Query time: 33 msec
;; Query time: 2 msec
;; Query time: 44 msec
;; Query time: 2 msec
;; Query time: 20 msec
;; Query time: 2 msec
;; Query time: 42 msec

However, queries from external clients, or queries to the virtual IP
address, become extremely sluggish after the nameserver has been running
for a while (about a day or so). "rndc flush" doesn't help, but
shutting down the nameserver and restarting it does. named is running as
user "bind".

This seems /really/ strange to me.

radon: 04:56pm# while true ; do dig yahoo.com @66.33.216.127 | grep Query  ; done
;; Query time: 790 msec
;; Query time: 868 msec
;; Query time: 753 msec
;; Query time: 798 msec
;; Query time: 982 msec
;; Query time: 1178 msec
;; Query time: 1284 msec
;; Query time: 1291 msec
;; Query time: 1208 msec
;; Query time: 738 msec
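
To compare the two addresses side by side rather than eyeballing the
loops above, the timings can be reduced to a count, minimum, average and
maximum. This is only a sketch: the function names (query_time,
summarise, sample) are made up for illustration, and it assumes dig
prints its usual ";; Query time: N msec" line.

```shell
#!/bin/sh
# Sketch: compare dig query times for two listen addresses.

# Pull the msec value out of dig's ";; Query time: N msec" line (stdin).
query_time() {
    awk '/^;; Query time:/ { print $4 }'
}

# Reduce msec values (one per line on stdin) to count/min/avg/max.
summarise() {
    awk '{ v = $1 + 0; sum += v
           if (NR == 1 || v < min) min = v
           if (NR == 1 || v > max) max = v }
         END { if (NR) printf "n=%d min=%d avg=%.1f max=%d\n", NR, min, sum / NR, max }'
}

# Query one server address N times (default 10) and summarise the timings.
# The query name is hard-coded to match the loops above.
sample() {
    addr=$1
    n=${2:-10}
    i=0
    while [ "$i" -lt "$n" ]; do
        dig yahoo.com "@$addr" | query_time
        i=$((i + 1))
    done | summarise
}
```

Running "sample 127.0.0.1" and then "sample 66.33.216.127" gives one
summary line per address, which is easier to paste into a follow-up than
the raw loops.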

If I restart BIND, queries start responding quickly again. I don't see
any errors on the interface, and pings (from outside or to the machine
itself) don't show any packet loss, so I don't think this is a
networking problem.

I've seen this with BIND 9.2.3, 9.2.4 and 9.3.X. This is on Debian
Linux, with BIND built pretty vanilla from source with no SSL, currently
on a 2.4.24 kernel. The machine has about 2 GB of memory, and not much
else is running on it.

The config, some comments removed for brevity:

options {
        directory "/var/named";
        pid-file "/var/named/named.pid";
        auth-nxdomain no;    # conform to RFC1035
        listen-on {
                127.0.0.1;              // loopback
                66.33.216.127;          // ns-cache01.sd.dreamhost.com
        };

        transfer-source 66.33.216.127;
        notify-source 66.33.216.127;
        recursive-clients 6000;
        tcp-clients 1500;
        max-cache-size 150000000;

        /* only allow queries from internal networks */
        allow-query { dh_known_networks; 127.0.0.0/8; };
};

[ logging, control, standard zones and ACL stuff omitted ]

zone "cbl.abuseat.org" {
        type forward;
        forwarders { 127.0.0.1 port 54; 66.33.216.129 port 54; };
};

zone "socks.dnsbl.sorbs.net" {
        type forward;
        forwarders { 127.0.0.1 port 54; 66.33.216.129 port 54; };
};

[ some other DNSBLs, also forwarding to rbldnsd running on port 54 ]

Any thoughts?

The recursive-clients and tcp-clients settings should be more than
enough.
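
One way to check that claim while the slowdown is happening is to look
at the client counters in "rndc status". The sketch below assumes the
installed rndc prints lines such as "recursive clients: 12/6000" and
"tcp clients: 3/1500"; the exact format differs between BIND releases
(some print three slash-separated fields), and older releases may not
print these lines at all. The function name client_usage is
hypothetical.

```shell
#!/bin/sh
# Sketch: show how close named is to its client limits.
# Reads "rndc status" output on stdin and prints used/limit pairs.
# Assumption: lines look like "recursive clients: 12/6000". If a release
# prints more than two slash-separated fields, the last one is taken as
# the limit.
client_usage() {
    awk -F': *' '/^(recursive|tcp) clients:/ {
        n = split($2, f, "/")
        printf "%s used=%d limit=%d\n", $1, f[1], f[n]
    }'
}
```

Usage would be "rndc status | client_usage". If the used figure sits
near 6000 or 1500 when queries are slow, the limits are the bottleneck
after all; if it stays low, they can be ruled out.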

The box is getting about 213 queries per second right now.

I'll be happy to provide any other information that might be helpful.
I'll try to get the output of "free", a few lines of "vmstat 5" and any
other useful information next time the problem pops up.
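
A small wrapper makes it easier to grab all of that in one go when the
problem reappears. This is a sketch under assumptions: the function name
snapshot, the output file name and the default interval and count are
placeholders, and "rndc status" is included only on the assumption that
rndc is configured on the box.

```shell
#!/bin/sh
# Sketch: save a timestamped snapshot of system state while the slowdown
# is in progress. Arguments: output directory, vmstat interval, vmstat
# count (all optional).
snapshot() {
    dir=${1:-.}
    interval=${2:-5}
    count=${3:-3}
    out="$dir/named-slow.$(date -u +%Y%m%dT%H%M%SZ).txt"
    {
        echo "== date"
        date -u
        echo "== free"
        free 2>&1 || true          # keep going if a tool is missing
        echo "== vmstat $interval $count"
        vmstat "$interval" "$count" 2>&1 || true
        echo "== rndc status"
        rndc status 2>&1 || true
    } > "$out"
    # Print the file name so the caller knows where the data went.
    echo "$out"
}
```

Calling "snapshot /var/tmp" during a slow period and again right after a
restart would give two files to compare.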

w