file descriptor exceeds limit

Mike Hoskins (michoski) michoski at cisco.com
Fri Jun 19 16:44:40 UTC 2015


On 6/19/15, 5:07 AM, "bind-users-bounces at lists.isc.org on behalf of Matus
UHLAR - fantomas" <bind-users-bounces at lists.isc.org on behalf of
uhlar at fantomas.sk> wrote:


>>On 6/18/15, 7:09 PM, "Stuart Browne" <Stuart.Browne at bomboratech.com.au>
>>wrote:
>>>Just wondering.  You mention you're using RHEL6; are you also getting
>>>messages in 'dmesg' about connection tracking tables being full?  You
>>>may
>>>need some 'NOTRACK' rules in your iptables.
>
>On 18.06.15 23:11, Mike Hoskins (michoski) wrote:
>>Just following along, for the record...  On our side, iptables is
>>completely disabled.  We do that sort of thing upstream on dedicated
>>firewalls.  Just now getting time to reply to Cathy...more detail on that
>>there.
>
>aren't those firewalls overloaded?


Originally we found an older set that was, and replaced those...  but
currently no combination of metrics, logs, packet traces, etc. implies this
is the case for the current network infra components I have access to.  I'm
being completely transparent here because it's something everyone should
carefully consider...but it's certainly not always the culprit.
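
For anyone who does run netfilter on the caches themselves, Stuart's
conntrack suggestion is cheap to rule out before going further.  A rough
check on RHEL6 (paths assume the nf_conntrack module is loaded):

    # how close the conntrack table is to its ceiling
    cat /proc/sys/net/netfilter/nf_conntrack_count
    cat /proc/sys/net/netfilter/nf_conntrack_max
    # the kernel logs "nf_conntrack: table full, dropping packet" on overflow
    dmesg | grep -i conntrack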

More than overloading, the larger issue I've worked through (repeatedly)
over the years is the various "protocol fixups", "ALGs" and the like which
try to "secure" you but really break standard things like EDNS.  After
back and forth with our network team I've reached a state of nirvana where
all that stuff is disabled and external tests like OARC's are happy.
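
If anyone wants to reproduce that kind of external check, the OARC test I
have in mind is the DNS Reply Size Test Server; roughly like this (the
@cache-ip is just a placeholder for your own resolver):

    # run from a client that resolves through the caches
    dig +short rs.dns-oarc.net txt
    # quick check that large EDNS answers survive the firewall path
    dig +dnssec +bufsize=4096 org SOA @cache-ip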

I suppose the only way to avoid any "intermediate" firewalls would be to
place everything you run on a LAN segment hanging directly off your
router/Internet drop, protected only by host-based firewalls.  I've used
iptables, pf, etc. a lot over the years but have always considered
host-based firewalls an add-on (layers of security) rather than a
replacement for other types of filtering...even if I placed the caches in
such a segment, I'd still have clients talking through various firewalls
(quite a few of them), so it's not easy to avoid in any sort of large org
-- particularly those with various business units acquired and bolted on
over time.
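
For completeness, if someone did go that host-firewall route on RHEL6, the
NOTRACK rules Stuart mentioned would look roughly like this (untested here,
since we keep iptables off on the caches; UDP shown, TCP is analogous):

    # raw table rules so DNS traffic never enters the conntrack table
    # client queries in, answers back out
    iptables -t raw -A PREROUTING -p udp --dport 53 -j NOTRACK
    iptables -t raw -A OUTPUT     -p udp --sport 53 -j NOTRACK
    # our own recursion out, upstream answers back in
    iptables -t raw -A OUTPUT     -p udp --dport 53 -j NOTRACK
    iptables -t raw -A PREROUTING -p udp --sport 53 -j NOTRACK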

The original post asked if this was some sort of limit on BIND's
capability...almost certainly not, and the way to validate that is lab
testing.  I've done that using resperf and Nominum's query file.  It would
be great to have two query files, one with known-responsive zones and one
with known-aberrant zones.  That would be difficult to maintain of course...
but from what I've seen with the default query file (a mix of good and bad,
from what I verified), you can push BIND much further than the qps reported
earlier in this thread or in our production environments.  In the real
world vs the lab, there are obviously a lot more variables.  Some of these
we can eliminate (like the overloaded firewall or broken fixups), others we
can tune (our own named.conf), but some we must live with...  I'm just
trying to get more confidence that what's observed is really the last
case.  :-)
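
For reference, the lab runs were nothing fancier than resperf from the
dnsperf package pointed at a test cache, roughly like the following (flags
from memory, check the resperf man page; the server address is a
placeholder for our lab setup):

    # ramp the query rate up against a lab cache using the stock query file
    resperf -s 10.0.0.53 -d queryfile-example-current -m 200000
    # -s  cache under test, -d  input query file, -m  maximum qps to ramp toward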

I'm most likely being too OCD here, because after all the tuning we've got
SERVFAILs down to a fraction of a percent over any given time interval.
I've been distracted by other things recently, but I need to dig into the
logs and see whether these are really just unresponsive or broken upstream
servers.
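
When I do get to it, the plan is roughly to log query failures separately
and see whether the same upstream zones keep showing up.  Assuming a 9.x
build with the query-errors logging category (file path is just an example),
something like:

    logging {
        channel qerrors {
            file "/var/log/named/query-errors.log" versions 5 size 20m;
            severity info;
            print-time yes;
        };
        category query-errors { qerrors; };
    };

then grep that file for SERVFAIL and tally which names keep repeating.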


