Bind9 resolver diagnostics for very small number of dropped requests

Wed Aug 31 18:56:21 UTC 2022

Hello,

We have a cluster of Bind9 resolvers behind load balancers (for historical reasons, mainly that we can't force people to list multiple resolver IP addresses in their static configurations, and everything still has to work).

The load balancers run health checks to determine whether the hosts are responding to queries, and based on the result of those checks the individual hosts are rotated in and out of operation.

We noticed that some of these health checks are failing (seemingly at random) and hosts are flapping in and out of the SLB pool, but we cannot actually figure out why those queries are failing.

43/1656 queries resulted in DNS mesg recv: no answ section
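
To make that failure mode concrete, here is a minimal sketch of a standalone UDP probe that separates "no answer section" from timeouts, truncation and error rcodes. It is not the load balancer's actual check, which we can't see inside; the function names (build_query, parse_header, classify, probe) are hypothetical, and it assumes a plain A query over UDP with recursion desired.

```python
import random
import socket
import struct


def build_query(name, qtype=1, txid=None):
    """Build a minimal DNS query: RD flag set, one question, class IN."""
    if txid is None:
        txid = random.randint(0, 0xFFFF)
    # Header: id, flags (0x0100 = RD), qdcount=1, ancount, nscount, arcount
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return txid, header + qname + struct.pack("!HH", qtype, 1)


def parse_header(packet):
    """Decode the fixed 12-byte DNS header."""
    if len(packet) < 12:
        raise ValueError("short DNS packet")
    txid, flags, _qd, an, _ns, _ar = struct.unpack("!HHHHHH", packet[:12])
    return {
        "id": txid,
        "rcode": flags & 0x000F,      # low 4 bits of the flags word
        "tc": bool(flags & 0x0200),   # truncation bit
        "ancount": an,
    }


def classify(packet, expected_id):
    """Turn a raw response into a short status string."""
    h = parse_header(packet)
    if h["id"] != expected_id:
        return "id-mismatch"
    if h["tc"]:
        return "truncated"
    if h["rcode"] != 0:
        return "rcode-%d" % h["rcode"]
    if h["ancount"] == 0:
        return "no-answer-section"
    return "ok"


def probe(server, name, timeout=2.0, port=53):
    """Send one UDP query to `server` and classify the outcome."""
    txid, query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(query, (server, port))
            data, _ = s.recvfrom(4096)
        except socket.timeout:
            return "timeout"
        except OSError:
            return "socket-error"
    return classify(data, txid)
```

Calling something like probe("192.0.2.53", "example.com") in a loop against each resolver's real address rather than the VIP (both arguments here are placeholders), and logging every non-"ok" result with a timestamp, would give something to line up against the named logs.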

Our environment is EL7 running BIND 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.9

Checking the standard logging channels, the only real error we see from named is this:

"named[5821]: dispatch 0x7f70e400fad0: shutting down due to TCP receive error: (seemingly random IP address) connection reset" but the source IPs that the health checks come from don't appear anywhere in the logs.

We read through this document, https://kb.isc.org/docs/monitoring-recommendations-for-bind-9, which gave us some good ideas on things to look at, but sadly nothing stands out as a real cause.

If anyone has any thoughts on this I would be really grateful.

Thanks,
-Drew
