Strange recursor response time pattern

Thu Sep 7 11:17:25 UTC 2017

It would be difficult, and possibly impossible, to continue to process
queries and format a report on queries simultaneously without losing
information in the report.  To have a separate thread creating the report,
it might have to stop query processing, take a snapshot of the data at that
point in time, save it somewhere, restart query processing, and then format
the report from the saved data.  In this case, there would be a brief
interval when named could not handle queries.  One might have to write a
prototype to determine how long that interruption would take.
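
A rough sketch of the snapshot idea described above, in Python rather than
BIND's own C, and not based on how named actually stores its statistics.
The class QueryStats and the function format_report are hypothetical names
made up for illustration.  The point is only that the lock is held for the
copy, not for the formatting:

```python
import threading


class QueryStats:
    """Hypothetical counter store shared by query-processing threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = {}

    def record(self, qtype):
        # Called by query threads; the lock is held for one increment only.
        with self._lock:
            self._counters[qtype] = self._counters.get(qtype, 0) + 1

    def snapshot(self):
        # The brief interruption: copy the counters while holding the lock.
        with self._lock:
            return dict(self._counters)


def format_report(snap):
    # Works on the saved copy, so query threads are not blocked while
    # the (possibly slow) formatting runs.
    return "\n".join(f"{name} {count}" for name, count in sorted(snap.items()))
```

How long the interruption lasts then depends on the cost of the copy, which
is what a prototype would have to measure.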

Charles Elliott

-----Original Message-----
From: bind-users [mailto:bind-users-bounces at lists.isc.org] On Behalf Of
Havard Eidnes
Sent: Wednesday, September 6, 2017 8:40 AM
To: matt at conundrum.com
Cc: bind-users at isc.org
Subject: Re: Strange recursor response time pattern

>> Is that pulling the old-style stats file, or the HTTP-based stats
>> channel?

As should be evident from my other message, this is using the HTTP-based
stats channel.

> If the latter... the zone list (and by extension the root
> document) seems to take a long time to process, and involves some sort 
> of locking that blocks all query processing while the list is being 
> generated.  We encountered this on a 3+ million zone instance.. BIND 
> would stop answering queries for several minutes if anyone requested 
> the root stats document or the zone list.

Since this name server is approximately a pure recursive resolver, the list
of authoritative zones is short: only 3 configured zones ("localhost",
"127.in-addr.arpa" and the corresponding zone for the IPv6 loopback), plus
the "automatic" zones.  Even so, the halting of query processing while the
list of zones is processed should not be an issue here.

That said, I'm also rather baffled that BIND would have to stop processing
all queries while traversing the zone instances; that certainly seems to
have an excessive effect on normal operations.

> As Ray says, you may be better off individually querying each of the 
> other documents and processing those rather than polling the root doc 
> to get them all in one shot.

It's not "me" who is doing the querying, it's the collectd software.  In the
syscall trace, I see indeed that it is asking for the root document:

GET / HTTP/1.1
Host: localhost:8053
User-Agent: collectd/5.7.2
Accept: */*

However, your advice to query the separate documents in individual requests
would:

 * require a rewrite of the BIND module in collectd
 * still not entirely get rid of the problem that some queries
   are put on hold while the stats channel data is processed and
   sent
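
For illustration only, a minimal sketch of what polling the separate
documents could look like, in Python rather than as a collectd plugin.  The
host and port are taken from the trace above.  The sub-document paths under
/xml/v3 (server, net, mem) are assumptions that depend on the BIND version
and should be checked against the ARM for the release in use.  The helper
names stats_urls and fetch_all are made up here.

```python
import urllib.request

# Assumed sub-document names; verify against the BIND release in use.
SECTIONS = ("server", "net", "mem")


def stats_urls(base="http://localhost:8053", sections=SECTIONS, prefix="/xml/v3"):
    # Build one URL per sub-document instead of requesting the root "/".
    return [f"{base.rstrip('/')}{prefix}/{name}" for name in sections]


def fetch_all(urls, timeout=5.0):
    # Fetch each document in its own request and return the raw bodies.
    docs = {}
    for url in urls:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            docs[url] = resp.read()
    return docs
```

Leaving the zone list out of SECTIONS avoids asking for it at all, but each
request is still handled by the same thread inside named.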

Looking at the system call trace shows me that other BIND threads do process
DNS queries while this single thread which does the HTTP handling does not.
Hence my suggestion to instead use a dedicated thread for the stats / HTTP
handling.

Oh, BTW, it also seems that BIND in my case wastes 15ms doing needless
getsockname() syscalls on invalid FDs as part of the early stages of stats
processing:

  5645     17 named    1504698577.991440645 CALL  getsockname(0xffffffff,0x7f7fef1f06e0,0x7f7fef1f069c)
  5645     17 named    1504698577.991446511 RET   getsockname -1 errno 9 Bad file descriptor

(repeated lots of times).

Regards,

- Håvard