Sporadic but noticeable SERVFAILs in specific nodes of an anycast resolving farm running BIND

Klaus Darilion klaus.mailinglists at pernau.at
Fri Mar 7 11:54:27 UTC 2014


Answering myself: this bug is probably not your problem, since BIND has 
received the DNS query; otherwise it would not have answered with SERVFAIL.

regards
Klaus

On 05.03.2014 16:15, Klaus Darilion wrote:
> Does it only happen for IPv6 DNS requests? Maybe it is related to this:
> https://open.nlnetlabs.nl/pipermail/nsd-users/2014-January/001783.html
>
> klaus
>
> On 05.03.2014 14:16, Kostas Zorbadelos wrote:
>>
>> Greetings to all,
>>
>> we operate an anycast caching resolving farm for our customer base,
>> based on CentOS (6.4 or 6.5), BIND (9.9.2, 9.9.5 or the stock CentOS
>> package BIND 9.8.2rc1-RedHat-9.8.2-0.23.rc1.el6_5.1) and quagga (the
>> stock CentOS package).
>>
>> We have noticed sporadic but noticeable SERVFAILs on 3 out of 10
>> machines. Cacti measurements obtained via the BIND XML statistics
>> interface show traffic ranging from 1.5K queries/sec (lowest-loaded
>> machines) to 15K queries/sec (highest). On the 3 affected machines,
>> all in one geolocation, SERVFAILs start to appear somewhere between
>> half an hour and several hours after a BIND restart. These 3 machines
>> do not carry the highest load in the farm (6-8K q/sec). The
>> resolution problems are noticeable to the customers landing on these
>> machines, but they do not show up as high numbers in the BIND XML
>> resolver statistics (the ServFail counter).
>>
>> We reproduce the problem by querying for a specific domain name in a
>> loop of the form
>>
>> while true; do rndc flushname www.linux-tutorial.info; sleep 1;
>> dig www.linux-tutorial.info @localhost; sleep 2; done | grep SERVFAIL
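>> The dig loop above can also be run as a small standalone probe. The
>> following is a rough sketch using only the Python standard library;
>> the resolver address (127.0.0.1) and the query name simply mirror the
>> loop and are placeholders for whatever you want to test:

```python
import socket
import struct

def build_query(name, qid=0x1234):
    """Build a minimal DNS A query: header with RD set, one question."""
    # Header: ID, flags (RD=1), QDCOUNT=1, ANCOUNT/NSCOUNT/ARCOUNT=0
    header = struct.pack(">HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(bytes([len(label)]) + label.encode("ascii")
                     for label in name.split(".")) + b"\x00"
    return header + qname + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN

def rcode(reply):
    """RCODE is the low 4 bits of the second flags byte (SERVFAIL is 2)."""
    return reply[3] & 0x0F

def probe(name, resolver="127.0.0.1", timeout=3.0):
    """Send one UDP query and return the reply's RCODE (2 == SERVFAIL)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (resolver, 53))
        reply, _ = sock.recvfrom(4096)
        return rcode(reply)

# Example (requires a resolver listening on 127.0.0.1):
#   if probe("www.linux-tutorial.info") == 2:
#       print("SERVFAIL")
```

>> Unlike the dig loop, this also lets you timestamp and count the
>> failures over a long run.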
>>
>> www.linux-tutorial.info is of course not the only domain experiencing
>> resolution problems. The above loop can run for hours without a single
>> failure during low-traffic hours (at night, after a clean BIND
>> restart), but during the day it shows quite a few SERVFAILs, which
>> affect other domains as well.
>>
>> Using tcpdump we noticed that when a SERVFAIL is produced, no query
>> packet leaves the server toward the authoritative servers. We have
>> seen nothing in the BIND logs (we even raised the debugging level and
>> logged all relevant categories). An example capture while running the
>> above loop:
>>
>> # tcpdump -nnn -i any -p dst port 53 or src port 53 | grep 'linux-tutorial'
>> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
>> listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
>>
>> 14:33:03.590908 IP6 ::1.53059 > ::1.53: 15773+ A?
>> www.linux-tutorial.info. (41)
>> 14:33:03.591292 IP 83.235.72.238.45157 > 213.133.105.6.53: 19156%
>> [1au] A? www.linux-tutorial.info. (52)
>> ^^^^ Success
>>
>> 14:33:06.664411 IP6 ::1.45090 > ::1.53: 48526+ A?
>> www.linux-tutorial.info. (41)
>> 14:33:06.664719 IP6 2a02:587:50da:b::1.23404 > 2a00:1158:4::add:a3.53:
>> 30244% [1au] A? www.linux-tutorial.info. (52)
>> ^^^^ Success
>>
>> 14:33:31.434209 IP6 ::1.43397 > ::1.53: 26607+ A?
>> www.linux-tutorial.info. (41)
>> ^^^^ SERVFAIL
>>
>> 14:33:43.672405 IP6 ::1.58282 > ::1.53: 27125+ A?
>> www.linux-tutorial.info. (41)
>> ^^^^ SERVFAIL
>>
>> 14:33:49.706645 IP6 ::1.54936 > ::1.53: 40435+ A?
>> www.linux-tutorial.info. (41)
>> 14:33:49.706976 IP6 2a02:587:50da:b::1.48961 > 2a00:1158:4::add:a3.53:
>> 4287% [1au] A? www.linux-tutorial.info. (52)
>> ^^^^ Success
>>
>> The main actions we have taken on the problem machines are:
>>
>> - changing the BIND version (we initially ran a custom-compiled 9.9.2,
>>    moved to 9.9.5, and finally switched over to the stock CentOS
>>    package 9.8.2rc1). The problem appeared in all versions.
>>
>> - disabling iptables (we use a ruleset with connection tracking on all
>>    of our machines, with no problems on the other machines in the
>>    farm). Again, no solution.
>>
>> - adding a query-source-v6 address in named.conf (we already had
>>    query-source). Each machine has a single physical interface and 3
>>    loopbacks holding the anycast IPs, announced via Quagga ospfd to
>>    the rest of the network. No solution.
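>> For reference, the query-source pinning mentioned above can be
>> expressed in named.conf roughly as follows (the addresses are
>> documentation placeholders, not our production values):

```
options {
    // pin the source address of outgoing IPv4 and IPv6 queries
    query-source address 192.0.2.1;
    query-source-v6 address 2001:db8::1;
};
```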
>>
>> The main difference between the 3 machines and the rest is IPv6
>> operation. Those machines are dual-stack, with a /30 (v4) and a /127
>> (v6) on the physical interface. Needless to say, the next thing we
>> will try is removing the relevant IPv6 configuration.
>>
>> I understand that there are many parameters to this problem; we have
>> been trying to debug it for several days now. Any suggestion,
>> suspicion or hint is highly welcome. I can provide all sorts of traces
>> from the machines (I already have pcap files from the moment of the
>> problem, plus pstack output, rndc status, OS process limits, rndc
>> recursing, and rndc dumpdb -all, collected according to
>>
>> https://kb.isc.org/article/AA-00341/0/What-to-do-with-a-misbehaving-BIND-server.html)
>>
>>
>> Thanks in advance,
>>
>> Kostas
>>
>>
> bind-users mailing list
> bind-users at lists.isc.org
> https://lists.isc.org/mailman/listinfo/bind-users
