Recursive bind becomes unresponsive with high load

Sat Apr 2 00:02:38 UTC 2016

A few thoughts:

* You can check for dropped packets on the receive path with # netstat -u -s
High numbers on "packet receive errors" can indicate an overflow in the receive buffer. This is fixable by network stack tuning, as Mike Mitchell suggests.

* You can check for dropped packets on the send path by looking for "error sending response: unset" in the named logs
...similarly fixable with sysctl tuning.
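
If you would rather poll the counters from a script than read netstat output by eye, here is a minimal sketch in Python. It assumes a Linux host where /proc/net/snmp exposes a "Udp:" header line followed by a "Udp:" value line. As far as I know, netstat's "packet receive errors" corresponds to InErrors and "receive buffer errors" to RcvbufErrors. SndbufErrors is a kernel counter for the send side, not the named log message mentioned above. The function names are my own, purely for illustration.

```python
"""Read UDP error counters from /proc/net/snmp (Linux only)."""


def parse_udp_counters(snmp_text):
    """Return the Udp counters from /proc/net/snmp text as a dict of ints.

    The file holds pairs of lines sharing a prefix: a header line naming
    the fields, then a line with the matching values.
    """
    udp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Udp:")]  # does not match "UdpLite:"
    if len(udp_lines) < 2:
        raise ValueError("no Udp: header/value pair found")
    names, values = udp_lines[0][1:], udp_lines[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def read_udp_counters(path="/proc/net/snmp"):
    """Read and parse the counters from the given file."""
    with open(path) as handle:
        return parse_udp_counters(handle.read())


if __name__ == "__main__":
    try:
        counters = read_udp_counters()
    except (OSError, ValueError) as exc:
        print("could not read UDP counters:", exc)
    else:
        # Counters are cumulative since boot; compare two runs to see growth.
        for name in ("InErrors", "RcvbufErrors", "SndbufErrors"):
            print(name, counters.get(name, "n/a"))
```

The counters are cumulative, so what matters is whether they grow while named is unresponsive, not their absolute value.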

We changed the following sysctl settings:
net.core.rmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_max = 16777216
net.core.wmem_default = 16777216

net.core.netdev_max_backlog = 5000
net.unix.max_dgram_qlen = 100
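
To see how a host currently differs from the values listed above, something like the following Python sketch could help. It only reads the files under /proc/sys and changes nothing. The mapping from dotted key to path is the usual one for these keys. The helper names and the "root" parameter (handy for testing against a scratch directory) are my own, not part of any tool.

```python
import os

# Values copied from the settings listed above.
TARGETS = {
    "net.core.rmem_max": 16777216,
    "net.core.rmem_default": 16777216,
    "net.core.wmem_max": 16777216,
    "net.core.wmem_default": 16777216,
    "net.core.netdev_max_backlog": 5000,
    "net.unix.max_dgram_qlen": 100,
}


def sysctl_path(key, root="/proc/sys"):
    """Map a dotted sysctl key to its file under /proc/sys."""
    return os.path.join(root, *key.split("."))


def compare_sysctls(targets, root="/proc/sys"):
    """Return {key: (current, wanted)} for keys whose value differs.

    current is None when the file cannot be read or parsed.
    """
    differences = {}
    for key, wanted in targets.items():
        try:
            with open(sysctl_path(key, root)) as handle:
                current = int(handle.read().split()[0])
        except (OSError, ValueError, IndexError):
            current = None
        if current != wanted:
            differences[key] = (current, wanted)
    return differences


if __name__ == "__main__":
    for key, (current, wanted) in sorted(compare_sysctls(TARGETS).items()):
        print(f"{key}: current={current} wanted={wanted}")
```

Values that suit one resolver may not suit another, so treat the targets as a starting point rather than a recommendation.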

* Try watching your incoming UDP receive queues at tight intervals while running top alongside:

# watch -n 0.1 'cat /proc/net/udp | grep ":0035 00000000:0000 "'
OR
# watch -n 0.1 'cat /proc/net/udp | grep -v "00000000:00000000 00:00000000 00000000"'

# top -d 0.1 -p $PID_OF_NAMED # where $PID_OF_NAMED is the named pid

Does the named unresponsiveness coincide with the UDP rx_queue filling up and named dropping to 0% CPU usage?
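
If the hex output of /proc/net/udp is hard to read at that refresh rate, here is a rough Python sketch that decodes the queue columns for sockets bound to port 53. It assumes the usual column layout (local_address in field 1, tx_queue:rx_queue in field 4, drops as the last of 13 fields). The function name and the dictionary keys are my own choices.

```python
def parse_udp_sockets(proc_text, port=53):
    """Return queue stats for UDP sockets bound to the given local port.

    proc_text is the content of /proc/net/udp; ports and queue sizes
    are written in hexadecimal there.
    """
    rows = []
    for line in proc_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 5:
            continue
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port != port:
            continue
        tx_hex, rx_hex = fields[4].split(":")
        rows.append({
            "local": fields[1],
            "tx_queue": int(tx_hex, 16),
            "rx_queue": int(rx_hex, 16),
            # drops is the last column; treat it as optional just in case.
            "drops": int(fields[12]) if len(fields) > 12 else None,
        })
    return rows


if __name__ == "__main__":
    try:
        with open("/proc/net/udp") as handle:
            text = handle.read()
    except OSError as exc:
        print("could not read /proc/net/udp:", exc)
    else:
        for row in parse_udp_sockets(text):
            print(row)
```

Run it in a loop (or under watch) next to top. A growing rx_queue or drops value while named sits idle would point at the same condition the question above asks about. Sockets listening on IPv6 show up in /proc/net/udp6 instead.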

Mathew Eis
Northern Arizona University
Information Technology Services
mathew.eis at nau.edu
(928) 523-2960

-----Original Message-----
From: Michael Brunnbauer <brunni at netestate.de>
Date: Friday, April 1, 2016 at 9:29 AM
To: Mathew Eis <Mathew.Eis at nau.edu>
Cc: "bind-users at lists.isc.org" <bind-users at lists.isc.org>, <dot at dotat.at>
Subject: Re: Recursive bind becomes unresponsive with high load

>
>Hello Mathew,
>
>On Fri, Apr 01, 2016 at 04:01:04PM +0000, Mathew Ian Eis wrote:
>> What OS are you running your BIND server on? Is it virtualized?
>
>Linux Kernel 3.4.111 with glibc 2.22, 32bit, not virtualized. No distribution -
>everything was compiled by hand.
>
>> Is it fully unresponsive, or could it be simply taking longer to respond than your client timeout?
>
>Assuming that bind would report dropped queries, I guess it is the latter.
>
>Regarding the suggestion made by Tony Finch about too many TCP connections
>in the TIME_WAIT status: That would have been a good explanation. But I do not
>see more than 200 TCP connections in TIME_WAIT status when the problem occurs
>and not more than 5000 TCP/UDP connections with port 53. 
>
>cu,
>brunni
>
>-- 
>++  Michael Brunnbauer
>++  netEstate GmbH
>++  Geisenhausener Straße 11a
>++  81379 München
>++  Tel +49 89 32 19 77 80
>++  Fax +49 89 32 19 77 89 
>++  E-Mail brunni at netestate.de
>++  http://www.netestate.de/
>++
>++  Sitz: München, HRB Nr.142452 (Handelsregister B München)
>++  USt-IdNr. DE221033342
>++  Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
>++  Prokurist: Dipl. Kfm. (Univ.) Markus Hendel