Bind9 stops responding for some clients

Fri May 31 02:39:35 UTC 2019

Ugh. Not wanting to packet capture. :)
[Yeah, not that hard, but it always seems to suck up so much time - it's like the black hole for time, I think.]
But, yeah, absent some other smoking gun, that's probably where we're headed. 

As for rate limiting - "rndc recursing" didn't show anything being rate limited. [No output. I assume that means there's nothing there. And ISC claims that rate limiting isn't turned on by default - I certainly haven't enabled anything related to rate limits.]

There's no firewalling/filtering between any of the affected clients and the DNS servers.
Without going into too much detail, the network's pretty flat. 
There are (4) /24 subnets, but they're all passing through a L3 switch for "routing." 
Essentially all filtering occurs at the border only, so, SPI or other stuff shouldn't be in the mix here.

Sigh. 
As my colleague said...
"Heh. How, um, fun?"

-Greg

You mentioned 150 seats and 'no firewalls', but you didn't describe the network topology. In particular, is traffic passing through a commercial firewall/router (hardware or virtualized) to get to the DNS server? If it is, it may be worth checking what packet inspection is turned on for DNS traffic (Cisco, Juniper and Check Point have been known to have buggy inspection routines in the past).

It might also be worthwhile to see what your open filehandles are like and whether there's any rate limiting configured in the distributed BIND configuration.
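
For the filehandle check, a minimal Python sketch along these lines might help. It assumes a Linux host with /proc mounted and enough privilege (typically root) to read another process's /proc entries. The helper names are made up for illustration and the PID of named has to be supplied by hand.

```python
import os


def count_open_fds(pid):
    """Count entries in /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def parse_open_file_limits(limits_text):
    """Return the (soft, hard) 'Max open files' values from the text of
    /proc/<pid>/limits, as strings because a value may be 'unlimited'."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            return fields[3], fields[4]
    return None


# Hypothetical usage, with 1234 standing in for the PID of named:
#   with open("/proc/1234/limits") as f:
#       print(count_open_fds(1234), parse_open_file_limits(f.read()))
```

If the open count sits close to the soft limit during an outage, that would be worth following up; if it is nowhere near, filehandles are probably not the issue.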

Stuart

From: bind-users [mailto:bind-users-bounces at lists.isc.org] On Behalf Of John W. Blue
Sent: Friday, 31 May 2019 11:47 AM
To: bind-users at lists.isc.org
Subject: Re: Bind9 stops responding for some clients

Good job on the amount of troubleshooting work done so far.

Next steps should be to run tcpdump on the interface for port 53 to see what is happening when an outage is in progress.  What you will be looking for specifically is the query packet in and the response packet out.

Use the following command:

tcpdump -n -i eth0 port domain and host 172.24.67.32

Swap out eth0 for whatever you have configured and the host IP address for a host that is having problems.
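
If the capture gets long, a rough Python sketch like the following could pair queries with responses and list the queries that never got an answer. It assumes the usual one-line-per-packet text output of tcpdump -n for UDP DNS (timestamp, "IP", source.port > dest.port:, then the query ID); that layout can differ between tcpdump versions, so treat the regular expression as an assumption to check against your own output. The function name and the sample addresses in the comments are illustrative only.

```python
import re

# Assumed line shape, e.g.:
#   02:39:35.000001 IP 172.24.67.32.40000 > 172.24.67.1.53: 1111+ A? example.com. (29)
LINE_RE = re.compile(
    r"^(?P<ts>\S+) IP6? (?P<src>\S+)\.(?P<sport>\d+) > "
    r"(?P<dst>\S+)\.(?P<dport>\d+): (?P<qid>\d+)"
)


def unanswered_queries(lines, dns_port=53):
    """Return captured query lines for which no matching response was seen.

    A query and its response are matched on (client address, client port,
    DNS query ID). Lines that do not look like UDP DNS are ignored.
    """
    pending = {}
    for raw in lines:
        line = raw.strip()
        m = LINE_RE.match(line)
        if not m:
            continue
        sport, dport = int(m["sport"]), int(m["dport"])
        qid = int(m["qid"])
        if dport == dns_port and sport != dns_port:
            # Client -> server: remember the query until a reply shows up.
            pending[(m["src"], sport, qid)] = line
        elif sport == dns_port:
            # Server -> client: this answers the matching pending query.
            pending.pop((m["dst"], dport, qid), None)
    return list(pending.values())
```

Saving the tcpdump output to a file and passing its lines to unanswered_queries() would show whether the server saw the query and stayed silent, or never saw it at all.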

John


From: Gregory Sloop <gregs at sloop.net>
Sent: Thursday, May 30, 2019 7:11 PM
To: bind-users at lists.isc.org
Subject: Bind9 stops responding for some clients

So, this is a very odd situation and I'm kind of grasping at straws here.
So, I've come to see if any of you have any good straws!

The setup.
---
Ubuntu 18.04 LTS is the distro we're running on. 
All software is packaged [from the distro] - not compiled from sources.
Bind9 acting as a recursive resolver for a smallish network. 150 seats.
They're also handling DHCP and Chrony/NTP requests.
[I actually have a pair of these handling DNS/DHCP/NTP; this is the master.]

They are running on a Xen/XCP VM.

The one I'm having problems with is the master for several internal zones - the one that's working fine is the slave for those same zones. None of the zones are large.

Intermittently, Bind9 simply stops handling queries from *some* hosts. 
Meaning, it simply times out for responses for those hosts.
Yet BIND *is* working fine for lots of other machines on the same networks. It's working fine doing dig queries locally on the server, and handles dns queries fine for lots of other machines. Yet, again, some machines simply get time-outs. I can't find any pattern to which machines get timeouts and which don't.

I've checked - no firewalls, fail2ban or the like that might be causing this. 
No SELinux/AppArmor.
Hosts that can't do dns queries can ping the dns server fine. 
[So, there's at least some network pathway to the DNS machine.]
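
Since ping only proves ICMP reachability, a small Python probe run from an affected client could check whether DNS itself gets through over UDP and over TCP. This is a sketch using only the standard library; the function names, the example.com query name and the timeout are illustrative choices, and the server address has to be filled in.

```python
import socket
import struct


def build_query(name, qid=0x1234):
    """Build a minimal DNS query (type A, class IN, recursion desired)."""
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)


def probe(server, name="example.com", timeout=3.0):
    """Send the same query over UDP and TCP; report which one got a reply."""
    query = build_query(name)
    results = {}
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(query, (server, 53))
            reply, _ = s.recvfrom(4096)
            # A reply carries the same 16-bit ID as the query.
            results["udp"] = reply[:2] == query[:2]
        except OSError:
            results["udp"] = False
    try:
        with socket.create_connection((server, 53), timeout=timeout) as s:
            # DNS over TCP prefixes each message with a 2-byte length.
            s.sendall(struct.pack("!H", len(query)) + query)
            results["tcp"] = len(s.recv(4096)) > 2
    except OSError:
        results["tcp"] = False
    return results
```

Calling probe() with the DNS server's address during an outage would return something like a dict with "udp" and "tcp" keys; a client where one transport works and the other doesn't would narrow things down a fair bit.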

Review of the logs for BIND doesn't show anything that looks like a problem to me.
[But I'm not sure what keywords I ought to be looking for, in an effort to find symptoms/problems.]

Finally, the two bind/dhcp/ntp servers are currently running on the same Xen host, so if it's somehow host related, I'd expect both to have problems, but they don't.

Top doesn't show any CPU distress.
Processes look fine
Memory in use is far below what's allocated to the machine. [1G allocated, like <400M used.]
Restart of BIND doesn't do anything, at least in the cases I've seen - which aren't all that many yet.
A restart of the whole VM does appear to fix the issue immediately.
These appear to occur every 3-5 days.
Oh, and if you simply wait, it eventually starts handling queries for all hosts again - but it might be a couple+ hours.

Any suggestions on things I might hunt for in the logs in an attempt to figure out what's happening?
Other suggestions for things to look for/consider?

TIA
-Greg

-- 
Gregory Sloop, Principal: Sloop Network & Computer Consulting
Voice: 503.251.0452 x82
EMail: gregs at sloop.net
http://www.sloop.net
---