Bind9 stops responding for some clients

Fri May 31 03:08:42 UTC 2019

On Thu, May 30, 2019 at 8:10 PM Gregory Sloop <gregs at sloop.net> wrote:
>
> So, this is a very odd situation and I'm kind of grasping at straws here.
> So, I've come to see if any of you have any good straws!
>
> The setup.
> ---
> Ubuntu 18.04 LTS is the distro we're running on.
> All software is packaged [from the distro] - not compiled from sources.
> Bind9 acting as a recursive resolver for a smallish network. 150 seats.
> They're also handling DHCP and Chrony/NTP requests.
> [I actually have a pair of these handling DNS/DHCP/NTP; this is the master.]
>
> They are running on a Xen/XCP VM.
>
> The one I'm having problems with is the master for several internal zones - the one that's working fine is the slave for those same zones. None of the zones are large.
>
> Intermittently, Bind9 simply stops handling queries from *some* hosts.
> Meaning, it simply times out for responses for those hosts.
> Yet BIND *is* working fine for lots of other machines on the same networks. It's working fine doing dig queries locally on the server, and handles dns queries fine for lots of other machines. Yet, again, some machines simply get time-outs. I can't find any pattern to which machines get timeouts and which don't.

This is probably a really long shot, but is it possible that the
machines which don't work are trying to use TCP to query the server
(e.g. because of weird MTU issues, or similar)?
I recently ran into sporadic issues where BIND would simply stop
listening on TCP -- there would be nothing in the logs, but netstat
would confirm that there was suddenly nothing listening on TCP 53.

I created a Prometheus rule to monitor for this:
  - name: DNS TCP
    rules:
      - alert: DNS Port 53 down on ron.
        expr: probe_success{instance="{{server}}",job="dns_tcp_port"} == 0 or up{job="dns_tcp_port"} == 0
        for: 5m
        labels:
          severity: page
        annotations:
          identifier: '{{ $labels.instance }}'
          summary: "DNS Port 53 down on Ron {{ $labels.instance }}"
          description: "{{ $labels.instance }} probe_success returned {{ $value }}"

and it fired twice -- and then I upgraded to BIND 9.12.4-P1 and the
problem hasn't happened since...
The obvious questions:
1: what was I running on this machine before? I think 9.12.<something>
-- will have to check git for more detail
2: why didn't I file a bug report / take a dump / something? I kept
meaning to, but it always broke at inopportune times, so I'd just
restart and plan to do a better job next time...
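
To test the TCP theory from one of the affected clients, something like
the rough sketch below sends the same A-record query over UDP and over
TCP and reports which transport gets an answer. It assumes Python 3 is
available on the client; the function names are made up for illustration
and the server address and query name in the usage note are placeholders.
If dig is installed on the client, "dig +tcp" and "dig +notcp" do the
same job.

```python
import socket
import struct


def build_query(name, qid=0x1234):
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (only RD set), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def query_udp(server, name, port=53, timeout=3.0):
    """Return True if anything that looks like a DNS reply comes back over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(build_query(name), (server, port))
            data, _ = s.recvfrom(4096)
            return len(data) >= 12  # at least a full DNS header
        except OSError:  # includes timeouts
            return False


def query_tcp(server, name, port=53, timeout=3.0):
    """Return True if the server accepts the connection and starts a reply over TCP."""
    msg = build_query(name)
    try:
        with socket.create_connection((server, port), timeout=timeout) as s:
            # DNS over TCP prefixes each message with a 2-byte length.
            s.sendall(struct.pack("!H", len(msg)) + msg)
            return len(s.recv(2)) == 2
    except OSError:  # refused, unreachable, or timed out
        return False
```

For example, query_udp("192.0.2.53", "example.com") returning True while
query_tcp("192.0.2.53", "example.com") returns False would point at the
TCP listener; both returning False points somewhere else.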

W

>
> I've checked - no firewalls, fail2ban or the like that might be causing this.
> No SELinux/AppArmor.
> Hosts that can't do dns queries can ping the dns server fine.
> [So, there's at least some network pathway to the DNS machine.]
>
> A review of the BIND logs doesn't show anything that looks like a problem to me.
> [But I'm not sure what keywords I ought to be looking for, in an effort to find symptoms/problems.]
>
> Finally, the two bind/dhcp/ntp servers are currently running on the same Xen host, so if it's somehow host related, I'd expect both to have problems, but they don't.
>
> Top doesn't show any CPU distress.
> Processes look fine
> Memory in use is far below what's allocated to the machine. [1G allocated, like <400M used.]
> Restart of BIND doesn't do anything, at least in the cases I've seen - which aren't all that many yet.
> A restart of the whole VM does appear to fix the issue immediately.
> These appear to occur every 3-5 days.
> Oh, and if you simply wait, it eventually starts handling queries for all hosts again - but it might be a couple+ hours.
>
> Any suggestions on things I might hunt for in the logs in an attempt to figure out what's happening?
> Other suggestions for things to look for/consider?
>
> TIA
> -Greg

-- 
I don't think the execution is relevant when it was obviously a bad
idea in the first place.
This is like putting rabid weasels in your pants, and later expressing
regret at having chosen those particular rabid weasels and that pair
of pants.
   ---maf