how to log all recursive query responses?

Fri Aug 8 22:44:14 UTC 2008

David Sparks wrote:
> Does the above log the responses or just the queries?
>
> I'm trying to debug why two 1000qps BIND servers side by side are giving out 
> different (cached?) results (one SERVFAIL, one correct answer) from a close 
> (one Internet hop but in the same data centre) rbldnsd server.  The SERVFAIL 
> is incorrect and I can't figure out how named got things wrong in the first place.
>
> The incorrect SERVFAIL also seems to be cached but I can't see anything about 
> the query from rndc dumpdb output.
>
> rndc dumpdb -cache shows that the server with the correct answer has cached 
> values.  What I don't understand is why the named that doesn't have a cached 
> answer doesn't resolve the query, instead it returns SERVFAIL immediately?
>
> This only happens after named has been running hard for several days.  I've 
> pasted an example below, ns1 gets SERVFAIL and ns2 gets the proper answer.
>
> daves at sentinel ~ $ host -v -t a X.X.X.213.fur.ca1.sophosxl.com. ns1
> Trying "X.X.X.213.fur.ca1.sophosxl.com"
> Received 49 bytes from 10.99.159.11#53 in 89 ms
> Trying "X.X.X.213.fur.ca1.sophosxl.com"
> Using domain server:
> Name: ns1
> Address: 10.99.159.11#53
> Aliases:
>
> Host X.X.X.213.fur.ca1.sophosxl.com not found: 2(SERVFAIL)
> Received 49 bytes from 10.99.159.11#53 in 88 ms
>
>
> daves at sentinel ~ $ host -v -t a X.X.X.213.fur.ca1.sophosxl.com. ns2
> Trying "X.X.X.213.fur.ca1.sophosxl.com"
> Using domain server:
> Name: ns2
> Address: 10.99.159.12#53
> Aliases:
>
> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 36177
> ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 1
>
> ;; QUESTION SECTION:
> ;X.X.X.213.fur.ca1.sophosxl.com. IN    A
>
> ;; ANSWER SECTION:
> X.X.X.213.fur.ca1.sophosxl.com. 2100 IN A      127.0.4.2
>
> ;; AUTHORITY SECTION:
> fur.ca1.sophosxl.com.   352     IN      NS      ca1.sophosxl.com.
>
> ;; ADDITIONAL SECTION:
> ca1.sophosxl.com.       569     IN      A       209.17.179.166
>
> Received 95 bytes from 10.99.159.12#53 in 26 ms
>
>   
>> If you want to capture the contents of the actual *packets* that named 
>> is generating, I'd recommend a packet capture utility such as "tcpdump". 
>> It's not too hard to restrict the captures to responses only, where the 
>> RD flag in the header is set to 1 (indicating that the original query 
>> was recursive). For the PC platform, there's also Wireshark, but to be 
>> honest, I haven't played much with its filtering capabilities.
>>     
>
> I'm not sure how to filter on the RD flag?  Will this filter be sufficient or 
> do I also need the query packet to figure out what happened?:
>
> tcpdump -s 1024 src port 53 and not src host ns1
>
>   
Ugh, this is a bit of a difficult problem, especially if you're at the 
1000qps level (lots of data to wade through; eyeballing is not really an 
option).

Looking at the data between the client and the BIND server is probably 
not going to be very useful: you'll just see a question come in and, at 
some point, a SERVFAIL response going back.

To get to the root cause, you'll probably want to look at the data 
passing back and forth between the BIND boxes and rbldnsd, to pinpoint 
why BIND caches a SERVFAIL in the first place. Is it a timeout? Is it a 
SERVFAIL response from rbldnsd? Something else?

If there is a *specific* name you want to focus on, it's possible to do 
that with tcpdump, but it's rather painful, e.g.

tcpdump -v -x udp and port 53 and 'udp[20] == 3' and 'udp[21] == 102' \
    and 'udp[22] == 111' and 'udp[23] == 111'

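would limit the capture to packets whose Question Section has a first 
label of "foo". Offsets are counted from the start of the UDP header, so 
udp[20] is the first byte after the 8-byte UDP header and the 12-byte DNS 
header, i.e. the start of the Question Section (3 is the label size, 102 
is the ASCII code for "f", 111 is the ASCII code for "o"). The Question 
Section is copied from the original query to the response, so this 
should catch responses too. Note that the match is on exact bytes, so a 
query for "FOO" or "Foo" would not be caught.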
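
Writing those byte comparisons by hand gets tedious for longer labels. 
As a rough sketch only, a few lines of Python can generate the expression 
for any first label. The function name first_label_filter is made up for 
this illustration, and it assumes plain DNS over UDP on IPv4 with no IP 
options handling needed beyond what udp[] already does.

```python
def first_label_filter(label: str) -> str:
    """Build a tcpdump expression matching DNS-over-UDP packets whose
    question name starts with the given label (exact-case match)."""
    raw = label.encode("ascii")
    if not 1 <= len(raw) <= 63:
        raise ValueError("a DNS label is 1 to 63 bytes long")
    # 8 bytes of UDP header + 12 bytes of DNS header = offset 20,
    # which is the length byte of the first label in the question.
    offset = 20
    terms = [f"udp[{offset}] == {len(raw)}"]
    for i, byte in enumerate(raw, start=1):
        terms.append(f"udp[{offset + i}] == {byte}")
    return "udp and port 53 and " + " and ".join(terms)


if __name__ == "__main__":
    # Prints the same comparisons as the hand-written "foo" filter.
    print(first_label_filter("foo"))
```

The output is meant to be passed to tcpdump as a single quoted argument, 
so that the shell leaves the square brackets alone.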

If, on the other hand, you're trying to answer the question "why do I 
get a SERVFAIL, some of the time, for some names, seemingly at random?", 
then I don't know that a targeted tcpdump is going to help. You might 
have to capture *everything*, detect the error, and then wade through 
the data later.
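
For the "wade through the data later" part, something along these lines 
might help separate "rbldnsd said SERVFAIL" from "rbldnsd never answered". 
This is only a sketch: it assumes you have already extracted the raw DNS 
payloads (the bytes after the UDP header) from the capture in order, which 
is not shown here, and the names parse_header and classify are invented 
for the example. Matching on the 16-bit ID alone is a simplification, since 
IDs get reused quickly at 1000qps; a real tool would also key on the source 
port and the question name.

```python
import struct

SERVFAIL = 2  # RCODE 2 is "server failure" in RFC 1035


def parse_header(payload: bytes):
    """Return (id, is_response, rcode) from the fixed 12-byte DNS header."""
    if len(payload) < 12:
        raise ValueError("truncated DNS header")
    ident, flags = struct.unpack("!HH", payload[:4])
    is_response = bool(flags & 0x8000)  # QR bit
    rcode = flags & 0x000F              # low four bits of the flags word
    return ident, is_response, rcode


def classify(messages):
    """messages: DNS payloads (bytes) in capture order for the traffic
    between one BIND box and rbldnsd. Returns {query id: outcome}."""
    outcome = {}
    for payload in messages:
        ident, is_response, rcode = parse_header(payload)
        if not is_response:
            # A query; stays in this state unless a reply shows up later.
            outcome[ident] = "no reply seen"
        elif rcode == SERVFAIL:
            outcome[ident] = "SERVFAIL from upstream"
        else:
            outcome[ident] = f"answered (rcode {rcode})"
    return outcome
```

Any ID left at "no reply seen" at the end of the capture points towards a 
timeout rather than an explicit SERVFAIL from rbldnsd.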

            - Kevin