Sporadic but noticeable SERVFAILs on specific nodes of an anycast resolving farm running BIND

Klaus Darilion klaus.mailinglists at pernau.at
Wed Mar 5 15:15:15 UTC 2014


Does it only happen for IPv6 DNS requests? Maybe it is related to this:
https://open.nlnetlabs.nl/pipermail/nsd-users/2014-January/001783.html

klaus

On 05.03.2014 14:16, Kostas Zorbadelos wrote:
>
> Greetings to all,
>
> we operate an anycast caching resolving farm for our customer base,
> based on CentOS (6.4 or 6.5), BIND (9.9.2, 9.9.5 or the stock CentOS
> package BIND 9.8.2rc1-RedHat-9.8.2-0.23.rc1.el6_5.1) and quagga (the
> stock CentOS package).
>
> The problem is that we have noticed sporadic but noticeable SERVFAILs
> on 3 out of 10 machines. Cacti measurements obtained via the BIND XML
> interface show traffic ranging from 1.5K queries/sec (lowest-loaded
> machines) to 15K queries/sec (highest). On 3 specific machines in one
> geolocation, SERVFAILs in resolutions appear some time after a BIND
> restart, anywhere between half an hour and several hours later. These
> 3 machines do not carry the highest load in the farm (6-8K q/sec). The
> resolution problems are noticeable to the customers who end up on these
> machines, but do not show up as high numbers in the BIND XML resolver
> statistics (ServFail counter).
>
> We can reproduce the problem by querying a specific domain name in
> a loop of the form
>
> while true; do rndc flushname www.linux-tutorial.info; sleep 1;
> dig www.linux-tutorial.info @localhost; sleep 2; done | grep SERVFAIL
>
> www.linux-tutorial.info is not the only domain experiencing
> resolution problems, of course. The above loop can run for hours
> without a single failure during low-traffic hours (at night, after a
> clean BIND restart), but during the day it shows quite a few SERVFAILs,
> which affect other domains as well.
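A variant of the reproduction loop above can log only the failures, with timestamps, which makes them easier to correlate with packet captures. This is a sketch; `rcode_of` and `check_once` are hypothetical helper names, not BIND tools, and it assumes `rndc` and `dig` are in PATH:

```shell
# rcode_of: extract the RCODE (NOERROR, SERVFAIL, ...) from dig output on stdin.
# The RCODE is the 6th field of the ";; ->>HEADER<<-" line, with a trailing comma.
rcode_of() { awk '/status:/ {print $6}' | tr -d ','; }

# check_once: flush one name from the cache, query it once, and print a
# timestamped line when the resolver answers SERVFAIL.
check_once() {
    name=$1; resolver=${2:-localhost}
    rndc flushname "$name"
    sleep 1
    rcode=$(dig +time=5 +tries=1 "$name" @"$resolver" | rcode_of)
    [ "$rcode" = "SERVFAIL" ] && echo "$(date '+%H:%M:%S') SERVFAIL for $name"
}

# Usage: while true; do check_once www.linux-tutorial.info; sleep 2; done
```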
>
> Using tcpdump while the problem occurs, we notice that when a SERVFAIL
> is produced, no query packet leaves the server for resolution. We have
> seen nothing in the BIND logs (we even tried raising the debug levels
> and logging all relevant categories). An example capture while running
> the above loop:
>
> # tcpdump -nnn -i any -p dst port 53 or src port 53 | grep 'linux-tutorial'
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
>
> 14:33:03.590908 IP6 ::1.53059 > ::1.53: 15773+ A? www.linux-tutorial.info. (41)
> 14:33:03.591292 IP 83.235.72.238.45157 > 213.133.105.6.53: 19156% [1au] A? www.linux-tutorial.info. (52)
> ^^^^ Success
>
> 14:33:06.664411 IP6 ::1.45090 > ::1.53: 48526+ A? www.linux-tutorial.info. (41)
> 14:33:06.664719 IP6 2a02:587:50da:b::1.23404 > 2a00:1158:4::add:a3.53: 30244% [1au] A? www.linux-tutorial.info. (52)
> ^^^^ Success
>
> 14:33:31.434209 IP6 ::1.43397 > ::1.53: 26607+ A? www.linux-tutorial.info. (41)
> ^^^^ SERVFAIL
>
> 14:33:43.672405 IP6 ::1.58282 > ::1.53: 27125+ A? www.linux-tutorial.info. (41)
> ^^^^ SERVFAIL
>
> 14:33:49.706645 IP6 ::1.54936 > ::1.53: 40435+ A? www.linux-tutorial.info. (41)
> 14:33:49.706976 IP6 2a02:587:50da:b::1.48961 > 2a00:1158:4::add:a3.53: 4287% [1au] A? www.linux-tutorial.info. (52)
> ^^^^ Success
>
> The main actions we have taken on the problem machines are
>
> - change the BIND version (we initially used a custom-compiled 9.9.2,
>    moved to 9.9.5 and finally switched to the stock CentOS package
>    9.8.2rc1). We saw the problem in all versions
>
> - disable iptables (we use a ruleset with connection tracking on all
>    of our machines, with no problems on the other machines in the
>    farm). Again, no solution
>
> - introduce a query-source-v6 address in named.conf (we already had
>    query-source). Each machine has a single physical interface plus 3
>    loopbacks carrying the anycast IPs, announced via Quagga ospfd to
>    the rest of the network. No solution.
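For reference, pinning both source addresses looks like this in named.conf (a sketch; the addresses are placeholders, not the poster's actual ones):

```
options {
    // Pin the source address for outgoing IPv4 queries.
    query-source address 192.0.2.1;
    // ...and likewise for IPv6 (the option added during debugging).
    query-source-v6 address 2001:db8::1;
};
```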
>
> The main difference between the 3 machines and the rest is IPv6
> operation. Those machines are dual-stack, with a /30 (v4) and a /127
> (v6) on the physical interface. Needless to say, the next thing to try
> is removing the relevant IPv6 configuration.
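A possibly less disruptive intermediate step than removing the addresses: start named with the -4 flag, which restricts it to IPv4 transport for its own queries. On CentOS this can be set via /etc/sysconfig/named (a sketch; verify against the init script on the affected hosts):

```
# /etc/sysconfig/named
# Diagnostic only: make named use IPv4 transport exclusively,
# to test whether the SERVFAILs are tied to IPv6 resolution.
OPTIONS="-4"
```

If the SERVFAILs disappear with -4, that would point strongly at the IPv6 path (and the NSD thread linked above) rather than at BIND itself.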
>
> I understand that there are many parameters to this problem; we have
> been trying to debug the issue for several days now. Any suggestion,
> suspicion or hint is highly welcome. I can provide all sorts of traces
> from the machines (I already have pcap files from the moment of the
> problem, plus pstack, rndc status, OS process limits, rndc recursing
> and rndc dumpdb -all output, following
>
> https://kb.isc.org/article/AA-00341/0/What-to-do-with-a-misbehaving-BIND-server.html)
>
> Thanks in advance,
>
> Kostas
>
>


More information about the bind-users mailing list