Sporadic but noticeable SERVFAILs in specific nodes of an anycast resolving farm running BIND
Kostas Zorbadelos
kzorba at otenet.gr
Wed Mar 5 13:16:12 UTC 2014
Greetings to all,
we operate an anycast caching resolving farm for our customer base,
based on CentOS (6.4 or 6.5), BIND (9.9.2, 9.9.5 or the stock CentOS
package BIND 9.8.2rc1-RedHat-9.8.2-0.23.rc1.el6_5.1) and quagga (the
stock CentOS package).
The problem is that we have noticed sporadic but noticeable SERVFAILs on
3 out of 10 total machines. Cacti measurements obtained via the BIND XML
interface show traffic from 1.5K queries/sec (lowest loaded machines) to
15K queries/sec (highest). On 3 specific machines in one geolocation,
some time after a BIND restart (anywhere between half an hour and
several hours) we start seeing SERVFAILs in resolutions. The 3 machines
do not have the highest load in the farm (6-8K q/sec). The resolution
problems are noticeable to the customers ending up on these machines
but do not show up as high numbers in the BIND XML Resolver statistics
(ServFail number).
We reproduce the problem by querying for a specific domain name using
a loop of the form
while [ 1 ]; do clear; rndc flushname www.linux-tutorial.info; sleep 1;
dig www.linux-tutorial.info @localhost; sleep 2; done | grep SERVFAIL
Of course, www.linux-tutorial.info is not the only domain experiencing
resolution problems. The above loop can even run for hours without
issues during low-traffic hours (at night, after a clean BIND restart),
but during the day it shows quite a few SERVFAILs, which affect other
domains as well.
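A minimal sketch of the same loop that records when each SERVFAIL
happens, so failures can later be lined up with packet captures. The
function names dig_status and probe are hypothetical helpers, not part
of BIND, and the dig options +tries=1 and +time=5 are an assumption
chosen so that each attempt maps to a single client query.

```shell
#!/bin/sh
# Extract the response code (NOERROR, SERVFAIL, ...) from the
# ";; ->>HEADER<<- ... status: X, id: N" line of dig output on stdin.
dig_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p'
}

# Hypothetical helper: flush NAME from the cache and query it COUNT times,
# printing one "UTC-time name status" line per attempt.
probe() {
    name=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        rndc flushname "$name"
        sleep 1
        status=$(dig +tries=1 +time=5 "$name" @localhost | dig_status)
        # NOREPLY marks attempts where dig printed no header at all.
        printf '%s %s %s\n' "$(date -u +%H:%M:%S)" "$name" "${status:-NOREPLY}"
        sleep 2
        i=$((i + 1))
    done
}
```

For example, "probe www.linux-tutorial.info 1000 | grep -v NOERROR"
leaves only the failed attempts, each with its timestamp.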
While the problem occurs, tcpdump shows that when a SERVFAIL is
produced, no query packet leaves the server for resolution. We have
noticed nothing in the BIND logs (we even tried raising debugging levels
and logging all relevant categories). An example capture running the
above loop:
# tcpdump -nnn -i any -p dst port 53 or src port 53 | grep 'linux-tutorial'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
14:33:03.590908 IP6 ::1.53059 > ::1.53: 15773+ A? www.linux-tutorial.info. (41)
14:33:03.591292 IP 83.235.72.238.45157 > 213.133.105.6.53: 19156% [1au] A? www.linux-tutorial.info. (52)
^^^^ Success
14:33:06.664411 IP6 ::1.45090 > ::1.53: 48526+ A? www.linux-tutorial.info. (41)
14:33:06.664719 IP6 2a02:587:50da:b::1.23404 > 2a00:1158:4::add:a3.53: 30244% [1au] A? www.linux-tutorial.info. (52)
^^^^ Success
14:33:31.434209 IP6 ::1.43397 > ::1.53: 26607+ A? www.linux-tutorial.info. (41)
^^^^ SERVFAIL
14:33:43.672405 IP6 ::1.58282 > ::1.53: 27125+ A? www.linux-tutorial.info. (41)
^^^^ SERVFAIL
14:33:49.706645 IP6 ::1.54936 > ::1.53: 40435+ A? www.linux-tutorial.info. (41)
14:33:49.706976 IP6 2a02:587:50da:b::1.48961 > 2a00:1158:4::add:a3.53: 4287% [1au] A? www.linux-tutorial.info. (52)
^^^^ Success
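The manual Success/SERVFAIL marking above can be approximated with a
small filter over the tcpdump text output. This is only a sketch under
stated assumptions: client queries are those addressed to ::1.53 or
127.0.0.1.53, any other packet to port 53 counts as an upstream query,
the capture is filtered to a single name that is flushed before each
query, and queries are issued one at a time. The function name
find_unanswered is hypothetical.

```shell
#!/bin/sh
# Read tcpdump text output on stdin and print the timestamp of every
# client query that was not followed by an upstream query.
find_unanswered() {
    awk '
        $4 != ">" { next }    # skip annotations and other non-packet lines
        $5 == "::1.53:" || $5 == "127.0.0.1.53:" {
            # A new client query arrived while the previous one was still
            # waiting for an upstream query: report the previous one.
            if (pending != "") print pending, "no upstream query seen"
            pending = $1
            next
        }
        $5 ~ /\.53:$/ { pending = "" }    # upstream query left the server
        END { if (pending != "") print pending, "no upstream query seen" }
    '
}
```

Fed with the capture above, this flags the queries at 14:33:31.434209
and 14:33:43.672405, matching the two lines marked SERVFAIL.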
The main actions we have taken on the problem machines are:
- change the BIND version (we initially used a custom compiled 9.9.2, we
moved to 9.9.5 and finally switched over to the CentOS stock package
9.8.2rc1). We noticed the problem in all versions.
- disable iptables (we use a ruleset with connection tracking on all of
our machines, with no problems on the other machines in the
farm). Again, no solution.
- introduce query-source-v6 address in named.conf (we already had
query-source). Each machine has a single physical interface and 3
loopbacks with the anycast IPs, announced via Quagga ospfd to the rest
of the network. No solution.
The main difference between these 3 machines and the rest is IPv6
operation. Those machines are dual stack, having a /30 (v4) and a /127
(v6) on the physical interface. Needless to say, the next thing to try
is removing the relevant IPv6 configuration.
I understand that there are many parameters to the problem; we have
been trying to debug the issue for several days now. Any suggestion,
suspicion or hint is highly welcome. I can provide all sorts of traces
from the machines (I already have pcap files captured at the moment of
the problem, plus pstack, rndc status, OS process limits, rndc
recursing and rndc dumpdb -all output, as suggested in
https://kb.isc.org/article/AA-00341/0/What-to-do-with-a-misbehaving-BIND-server.html).
Thanks in advance,
Kostas
--
Kostas Zorbadelos
twitter:@kzorbadelos http://gr.linkedin.com/in/kzorba
----------------------------------------------------------------------------
() www.asciiribbon.org - against HTML e-mail & proprietary attachments
/\