Seemingly random ServFail issues on a caching server

Florian CROUZAT gentoo at floriancrouzat.net
Wed Aug 31 15:05:33 UTC 2011


Lyle Giese wrote on 2011-08-31:

> On 8/31/2011 8:40 AM, Florian CROUZAT wrote:
>> Florian CROUZAT wrote on 2011-08-25:
>>
>>> Hi list,
>>>
>>> On a few domains (we'll consider only one for this example) I
>>> sometimes encounter seemingly random ServFails while resolving
>>> domain names. A client (192.168.147.2) asks my caching server
>>> (192.168.151.100) to resolve a target (www.leclercdrive.fr).
>>>
>>> Here are the relevant logs:
>>>
>>> Aug 24 17:14:19 ns named[24929]: 24-Aug-2011 17:14:19.377 queries: info: client 192.168.147.2#34502: view internal: query: www.leclercdrive.fr IN A +
>>> Aug 24 17:14:19 ns named[24929]: 24-Aug-2011 17:14:19.380 queries: info: client 192.168.147.2#34502: view internal: query: www.leclercdrive.fr IN A +
>>> Aug 24 17:14:19 ns named[24929]: 24-Aug-2011 17:14:19.382 queries: info: client 192.168.147.2#34502: view internal: query: www.leclercdrive.fr IN A +
>>>
>>>
>>> A tcpdump on the local side of the NS server shows the A request and
>>> the instant ServFail. A tcpdump on the external side of the NS server
>>> shows no traffic at all in this case, meaning it fails internally and
>>> doesn't even try to forward the A request to the Internet.
>>>
>>> 17:14:19.377608 IP 192.168.147.2.34502 > 192.168.151.100.53: 26340+ A? www.leclercdrive.fr. (37)
>>> 17:14:19.378845 IP 192.168.151.100.53 > 192.168.147.2.34502: 26340 ServFail 0/0/0 (37)
>>> 17:14:19.380607 IP 192.168.147.2.34502 > 192.168.151.100.53: 52628+ A? www.leclercdrive.fr. (37)
>>> 17:14:19.381383 IP 192.168.151.100.53 > 192.168.147.2.34502: 52628 ServFail 0/0/0 (37)
>>> 17:14:19.382605 IP 192.168.147.2.34502 > 192.168.151.100.53: 58933+ A? www.leclercdrive.fr. (37)
>>> 17:14:19.383406 IP 192.168.151.100.53 > 192.168.147.2.34502: 58933 ServFail 0/0/0 (37)
>>>
>>> A few minutes before, or later, it worked just fine, see:
>>>
>>> 17:15:58.736177 IP 192.168.147.2.34502 > 192.168.151.100.53: 49610+ A? www.leclercdrive.fr. (37)
>>> 17:15:58.784470 IP 192.168.151.100.53 > 192.168.147.2.34502: 49610 3/3/6 CNAME[|domain]
>>>
>>> The TTL of the www.leclercdrive.fr entry is 300 seconds, which seems
>>> short to me. Maybe the ServFail happens when a request is processed at
>>> the exact moment the TTL reaches zero and the cache entry is being
>>> flushed? I tried flushing the cache using rndc, but the first request
>>> after that worked just fine (of course...)
>>>
>>> Any ideas/hints are welcome.
>>>
>>> The DNS server runs 1:9.5.1.dfsg.P3-1+lenny1
>>> cat /etc/debian_version =>  5.0.4
>>> (I have no control over the version of the tools)
>>
>>
>>
>> I found a few other domains in my logfiles where the ServFails happen;
>> their respective TTLs are all different, from 300 to 86400 seconds. I
>> still have no idea how to resolve this issue and, as far as I have
>> investigated, I haven't been able to identify a pattern in those
>> ServFails. I'm not even sure the TTL is involved, since I saw two
>> ServFails separated in time by less than the TTL value of the entry...
>>
>> Florian
>>
>
> The authoritative name servers for leclercdrive.fr are a.dns.gandi.net,
> b.dns.gandi.net and c.dns.gandi.net.  I don't know how big gandi.net is,
> but traceroutes to those servers end up going through Level3 in
> Baltimore, MD from here.  They did have a hurricane go through there and
> I would not be surprised if traffic levels have been a bit high for the
> last few days.
>
> Lyle

Well, it's a French registrar, my servers are in France and my clients are
French too, so from here the traceroute is pretty clean.
Anyway, my problem is apparently not Gandi related, or even
www.leclercdrive.fr related, since the ServFails happen internally and
instantly in my BIND, which doesn't even try to forward the A request.
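
To put numbers on how quickly BIND answers, the delay between each query
and its reply can be computed from the tcpdump text output quoted above.
What follows is only a minimal Python sketch, not a tool used in this
thread: the names LINE_RE, SAMPLE and reply_latencies are made up for
illustration, it assumes one packet per line in tcpdump's default text
format, it assumes the capture does not cross midnight, and it pairs a
reply with a query by client address and DNS query ID only.

```python
import re
from datetime import datetime

# One tcpdump text line, e.g.
#   17:14:19.377608 IP 192.168.147.2.34502 > 192.168.151.100.53: 26340+ A? name. (37)
# The whitespace around ">" is optional because mail clients tend to mangle it.
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d{6}) IP "
    r"(?P<src>\S+?)\s*>\s*(?P<dst>\S+): "
    r"(?P<qid>\d+)(?P<flags>[+*\-|$%]*) (?P<rest>.*)$"
)

# Packets copied from the captures quoted in this thread.
SAMPLE = """\
17:14:19.377608 IP 192.168.147.2.34502 > 192.168.151.100.53: 26340+ A? www.leclercdrive.fr. (37)
17:14:19.378845 IP 192.168.151.100.53 > 192.168.147.2.34502: 26340 ServFail 0/0/0 (37)
17:14:19.380607 IP 192.168.147.2.34502 > 192.168.151.100.53: 52628+ A? www.leclercdrive.fr. (37)
17:14:19.381383 IP 192.168.151.100.53 > 192.168.147.2.34502: 52628 ServFail 0/0/0 (37)
17:14:19.382605 IP 192.168.147.2.34502 > 192.168.151.100.53: 58933+ A? www.leclercdrive.fr. (37)
17:14:19.383406 IP 192.168.151.100.53 > 192.168.147.2.34502: 58933 ServFail 0/0/0 (37)
17:15:58.736177 IP 192.168.147.2.34502 > 192.168.151.100.53: 49610+ A? www.leclercdrive.fr. (37)
17:15:58.784470 IP 192.168.151.100.53 > 192.168.147.2.34502: 49610 3/3/6 CNAME[|domain]
"""


def reply_latencies(lines):
    """Return a list of (query id, reply summary, latency in milliseconds)."""
    pending = {}  # (client address.port, query id) -> time the query was seen
    results = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a line this sketch understands
        ts = datetime.strptime(m["ts"], "%H:%M:%S.%f")
        qid = m["qid"]
        if m["dst"].endswith(".53"):
            # Packet sent to port 53: treat it as a query.
            pending[(m["src"], qid)] = ts
        elif m["src"].endswith(".53") and (m["dst"], qid) in pending:
            # Packet sent from port 53 with a matching ID: treat it as the reply.
            delta = ts - pending.pop((m["dst"], qid))
            results.append((qid, m["rest"], delta.total_seconds() * 1000))
    return results


if __name__ == "__main__":
    for qid, summary, ms in reply_latencies(SAMPLE.splitlines()):
        print(f"{qid:>6} {ms:8.3f} ms  {summary}")
```

On the packets above, the three ServFail replies come back in roughly 1 ms
each, while the successful answer takes about 48 ms. That is consistent
with the ServFails being produced locally rather than after a round trip
to an upstream server.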


Florian







