9.4.3 oddities

Imri Zvik imriz at inter.net.il
Wed Jan 6 06:37:34 UTC 2010


Hi,

We've recently upgraded our caching servers to 9.4.3-P4/P3 (2 of them running 
9.4.3-P4 and 2 running 9.4.3-P3). A few days ago I noticed something 
strange: when the server is loaded, some queries randomly fail (SERVFAIL). 
It seems that only queries for which the answer is NOT cached are affected.
I've verified with host/dig and tcpdump that there is no network issue (no 
unanswered packets). Digging deeper, I found that the problem appears when 
the number of sockets used by named approaches ~1024 (checked with 
netstat/lsof). The weirdest part is that if I run "rndc reconfig", named is 
suddenly able to use more than 1024 sockets (I've seen it using ~4000-5000 
sockets), and the problem goes away for about an hour.
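
A minimal sketch of how the descriptor count could be sampled over time, to see whether failures line up with the ~1024 mark. It assumes a Linux host with /proc mounted (on FreeBSD, lsof -p would be the closer equivalent), and the helper name count_fds is made up for illustration. It counts all open descriptors of the process, not only sockets.

```shell
#!/bin/sh
# Count the open file descriptors of a process via /proc (Linux only).
# Note: this includes regular files and pipes, not just sockets.
count_fds() {
    # $1 is the PID; each entry in /proc/<pid>/fd is one open descriptor.
    ls "/proc/$1/fd" 2>/dev/null | wc -l | tr -d ' '
}

# Example: sample named every 10 seconds (pidof and the interval are assumptions).
# while :; do echo "$(date +%T) $(count_fds "$(pidof named)")"; sleep 10; done
```

Running the sampling loop alongside the query load should show whether the count plateaus near 1024 and whether it jumps after "rndc reconfig".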

If I downgrade to 9.4.2-P2, the problem goes away.

I used the following command to reproduce the problem:
for i in {1..100000}; do dig mx www.cnn.com @localhost | grep status | grep -v NOERROR; done
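
The loop above prints only the failing responses. A variant that tallies every status can make the failure rate easier to compare between versions. This is a sketch only: the helper name tally_status is hypothetical, and it assumes the usual dig header line containing "status: <RCODE>,".

```shell
#!/bin/sh
# Read dig output on stdin, extract the status field from each header line,
# and print one "<STATUS> <count>" line per distinct status.
tally_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | sort | uniq -c | awk '{print $2, $1}'
}

# Example use (query name and server taken from the command above; the
# iteration count of 1000 is an arbitrary choice):
# i=0; while [ "$i" -lt 1000 ]; do dig mx www.cnn.com @localhost; i=$((i+1)); done | tally_status
```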

My servers are running RHEL 5.4 (2.6.18-164.9.1.el5) and FreeBSD 7.0 (the 
problem is seen on both), and they are split across two unrelated 
networks in two separate physical locations.

I've compiled bind from the vanilla ISC sources using the following configure 
command:

./configure --enable-threads --enable-largefile --prefix=/usr/local

I've also tried the following (I've also raised the OS limits, of course):
STD_CDEFINES="-DISC_SOCKET_FDSETSIZE=1048576" ./configure --enable-threads --enable-largefile --prefix=/usr/local

I tried this because I was seeing the "general: error: socket: file descriptor 
exceeds limit (4096/4096)" error a couple of days ago.
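
To confirm that the raised OS limits actually apply to the running named process (and not only to the login shell), something like the following could be used. It is a sketch assuming a Linux kernel that exposes /proc/<pid>/limits, which older kernels may not; the helper name fd_limits is made up for illustration.

```shell
#!/bin/sh
# Print the soft and hard "Max open files" limits of a running process.
fd_limits() {
    # $1 is the PID; fields 4 and 5 of that line are the soft and hard limits.
    awk '/^Max open files/ {print $4, $5}' "/proc/$1/limits"
}

# Example (pidof is an assumption about the host):
# fd_limits "$(pidof named)"
```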

My best guess is that the problem is related to the recent move to epoll...

Any ideas on how I should proceed from here? 


