file descriptor exceeds limit

Thu Jun 18 13:22:02 UTC 2015

On 18/06/2015 12:00, Matus UHLAR - fantomas wrote:
> On 17.06.15 22:39, Shawn Zhou wrote:
>> BIND on my resolvers reaches the max open file limit and I am getting
>> lots of SERVFAILs:
>> http://pastebin.com/SxRsHLff
> 
>> After I increased max-socks to 8192 (-S 8192), I no longer saw the file
>> limit error in the log; however, I am still seeing many SERVFAILs.
> 
> no other errors?
> 
>> Our resolvers were doing about 15k queries per second when this was
>> happening, and that was legitimate traffic.  I am aware that I am setting
>> recursive clients to a very high number.  Those resolvers are running on
>> 12-core CPU and 24G RAM hardware.  CPU utilization was at about 20%, with
>> plenty of RAM left.
> 
>> I am wondering if I've reached the limit of BIND for the amount of
>> recursive queries it can serve.  Any other tunings I should try?
> 
> maybe changing the number of recursive-clients, max-clients-per-query.
> 
> Does EDNS work for you? EDNS problems often result in an increased number
> of TCP queries, which slows down resolution ...
> 
>> By the way, the resolvers are running RHEL 6.x.
> 
> precise BIND version would help a bit more... seems RH6.6 contains 9.8.2,
> but that may be different for older RH6 versions.
> 
> 

Unless you're running a build with --with-tuning=large (for which there
are a number of caveats around the capacity of the machine, etc.), you
don't really want to have a backlog of recursive clients that exceeds
3000-3500.  If you're getting that many in your backlog, then as
already highlighted to you, there is Something Wrong going on.

You're probably running into other resource limits, and those are what
is causing the SERVFAIL responses you're still seeing despite increasing
the maximum number of sockets that named can use.  I would tune the
recursive-clients limit down to 3000 and allow named to drop the oldest
outstanding client queries when new ones need to be processed.
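
As a rough sketch of that change (3000 is the figure suggested above;
leave whatever else is already in your options block as it is):

```
options {
    // Cap on simultaneous recursive client contexts.  Once named gets
    // close to this limit it starts dropping the oldest pending
    // clients to make room for new ones.
    recursive-clients 3000;
};
```

Treat the number as a starting point to tune from rather than a fixed
recommendation for your hardware.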

There is another logging category you can use (query-errors) that can
tell you more, but it's probably not worth it in this instance.
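
If you did want to try it, a minimal sketch would look something like
the following.  The channel name and file path here are made up for
illustration; adjust them to your own layout, and make sure named can
write to the directory:

```
logging {
    // Hypothetical channel name and path, for illustration only.
    channel query_errors_log {
        file "/var/log/named/query-errors.log" versions 3 size 20m;
        // The detailed query-errors output is only emitted at debug level.
        severity debug 2;
        print-time yes;
    };
    category query-errors { query_errors_log; };
};
```

On a resolver doing 15k queries per second this can be very noisy, so it
is best left on only briefly.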

And I have another suggestion for what might be causing your backlog
(apart from problems in the network path between your servers and the
Internet authoritative servers), for which we have some
soon-to-be-released new mitigation features (in 9.10.3):

https://kb.isc.org/article/AA-01178

(this will be updated to reflect the features we will actually include
in the upcoming release - but they're essentially going to be
fetches-per-server and fetches-per-zone, along with improved
logging/stats for both of those)
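
Purely as an illustration of the shape this is expected to take, and
assuming the options ship under the names given above, the
configuration would be along these lines.  The values are placeholders,
not recommendations:

```
options {
    // Placeholder values for illustration only.
    // Upper bound on simultaneous fetches sent to any one
    // authoritative server.
    fetches-per-server 200;
    // Upper bound on simultaneous fetches for names within any one zone.
    fetches-per-zone 200;
};
```

Check the release notes for the final syntax and defaults before relying
on this.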

There's going to be a webinar about both the problem and the mitigations
on July 8th:

https://www.facebook.com/events/100311766979499/

http://goo.gl/Z8idQf

Hoping that this is useful?

Cathy