9.4.3 oddities
Imri Zvik
imriz at inter.net.il
Wed Jan 6 06:37:34 UTC 2010
Hi,
We recently upgraded our caching servers to 9.4.3-P4/P3 (two running
9.4.3-P4 and two running 9.4.3-P3). A few days ago I noticed something
strange: when a server is loaded, some queries randomly fail (SERVFAIL).
It seems that only queries whose answer is NOT cached are affected.
I've verified with host/dig and tcpdump that there is no network issue (no
unanswered packets). Digging deeper, I found that the problem appears when
the number of sockets used by named approaches roughly 1024 (checked with
netstat/lsof). The weirdest part is that if I run "rndc reconfig", named is
suddenly able to use more than 1024 sockets (I've seen it using roughly
4000-5000), and the problem goes away for about an hour.
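For anyone who wants to watch the socket count without netstat/lsof, here is
a minimal sketch. It assumes a Linux host with /proc mounted and enough
privileges (normally root) to read the target process's fd directory; the
function name count_sockets is just an illustrative choice, and this is not
how I collected the numbers above.

```shell
#!/bin/sh
# Count the file descriptors of a process that refer to sockets.
# Assumption: Linux /proc layout, where each entry in /proc/<pid>/fd is a
# symlink and socket descriptors resolve to "socket:[inode]".
count_sockets() {
    pid=$1
    n=0
    for fd in /proc/"$pid"/fd/*; do
        # readlink fails quietly if the fd vanished or is not readable.
        case $(readlink "$fd" 2>/dev/null) in
            socket:*) n=$((n + 1)) ;;
        esac
    done
    echo "$n"
}
```

Something like "count_sockets $(pidof named)" run periodically should show
whether the count stalls near 1024 before the SERVFAILs start.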
If I downgrade to 9.4.2-P2, the problem goes away.
I used the following command to reproduce the problem:
for i in {1..100000}; do dig mx www.cnn.com @localhost |grep status |grep -v NOERROR; done
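To get a failure rate rather than a stream of error lines, the dig output can
be tallied by status. This is only a sketch: tally_status is a hypothetical
helper name, and it assumes dig's usual header line of the form
";; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: ...".

```shell
#!/bin/sh
# Read dig output on stdin, pull the value of the "status:" field out of
# each header line, and print how many times each status occurred.
tally_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | sort | uniq -c
}
```

Usage, with a smaller iteration count picked arbitrarily:

for i in $(seq 1 1000); do dig mx www.cnn.com @localhost; done | tally_status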
My servers are running RHEL 5.4 (2.6.18-164.9.1.el5) and FreeBSD 7.0 (the
problem is seen on both), and they are split across two unrelated networks
at two separate physical locations.
I've compiled bind from the vanilla ISC sources using the following configure
command:
./configure --enable-threads --enable-largefile --prefix=/usr/local
I've also tried the following (I've also raised the OS limits, of course):
STD_CDEFINES="-DISC_SOCKET_FDSETSIZE=1048576" ./configure --enable-threads --enable-largefile --prefix=/usr/local
I tried this because I was seeing the "general: error: socket: file
descriptor exceeds limit (4096/4096)" error a couple of days ago.
My best guess is that the problem is related to the recent move to epoll...
Any ideas on how I should proceed from here?