Recursive bind becomes unresponsive with high load

Mike Hoskins (michoski) michoski at cisco.com
Thu Mar 31 16:05:39 UTC 2016


If you are crawling lots of new names, the cache size won't have much
impact: each new name has to be resolved by recursion rather than answered
from the cache.  Try "rndc recursing" and look at which queries are sitting
around waiting for answers.  Hopefully that provides some clues.  A
recursive client backlog can have all sorts of causes, such as unresponsive
auth servers, network issues, or firewalls munging EDNS.
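
With thousands of pending clients the "rndc recursing" dump is too long to
read by eye, so a small script that groups the waiting queries can help.
The following is only a rough sketch.  It assumes the dump (by default a
file called named.recursing in named's working directory) lists each
waiting query as a quoted 'name/type/class' token; check that against your
own dump first.  The function names are made up for illustration.

```python
import re
import sys
from collections import Counter

# Assumption: every pending client line contains the query as a quoted
# 'name/type/class' token, e.g. 'www.example.com/A/IN'. Lines without
# such a token (headers, comments) are skipped.
QUERY_RE = re.compile(r"'([^'/\s]+)/([A-Za-z0-9-]+)/([A-Za-z0-9-]+)'")


def parent_domain(name, labels=2):
    """Return the last `labels` labels of a DNS name, lower-cased."""
    parts = [p for p in name.rstrip(".").lower().split(".") if p]
    return ".".join(parts[-labels:]) if parts else "."


def summarize_recursing(lines, labels=2):
    """Count pending queries per parent domain and per query type."""
    by_domain, by_type = Counter(), Counter()
    for line in lines:
        match = QUERY_RE.search(line)
        if not match:
            continue
        name, qtype, _qclass = match.groups()
        by_domain[parent_domain(name, labels)] += 1
        by_type[qtype.upper()] += 1
    return by_domain, by_type


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python summarize.py /path/to/named.recursing
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        domains, types = summarize_recursing(fh)
    for domain, count in domains.most_common(10):
        print(f"{count:8d} {domain}")
    print(dict(types))
```

If a handful of domains account for most of the backlog, their auth
servers are the first place to look.  If the backlog is spread evenly over
many unrelated domains, something closer to the resolver (network,
firewall, sockets) is the more likely suspect.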


On 3/31/16, 11:57 AM, "bind-users-bounces at lists.isc.org on behalf of
Michael Brunnbauer" <brunni at netestate.de> wrote:

>
>hi all,
>
>I am using bind on a server that does massive crawling with a
>multithreaded Java app. This server occasionally has to do lookups for
>hosts in our local zone netestate.de - for which it is not authoritative -
>and those lookups tend to fail when the load is high (e.g. >1000 recursing
>clients). This suggests some kind of congestion.
>
>I have verified that the authoritative name servers for our local zone
>are not hammered with requests from the bind instance in question
>(adding . to every hostname is important :-) I also have verified that
>lookups from the crawlers for the local zone on the lo interface are not
>excessive. The problem occurs even before max-cache-size is reached.
>
>Here is my setup:
>
>max-cache-size 1610612736;
>recursive-clients 6000;
>minimal-responses yes;
>
>Mar 31 14:04:51 bardolino named[1506]: starting BIND 9.10.3-P2 <id:f9be8b2> -t /etc/namedroot -u named
>Mar 31 14:04:51 bardolino named[1506]: built with '--prefix=/usr/local/bind' '--with-openssl=/usr/lib/ssl' '--enable-threads' '--with-tuning=large'
>Mar 31 14:04:51 bardolino named[1506]: ----------------------------------------------------
>Mar 31 14:04:51 bardolino named[1506]: BIND 9 is maintained by Internet Systems Consortium,
>Mar 31 14:04:51 bardolino named[1506]: Inc. (ISC), a non-profit 501(c)(3) public-benefit
>Mar 31 14:04:51 bardolino named[1506]: corporation.  Support and training for BIND 9 are
>Mar 31 14:04:51 bardolino named[1506]: available at https://www.isc.org/support
>Mar 31 14:04:51 bardolino named[1506]: ----------------------------------------------------
>Mar 31 14:04:51 bardolino named[1506]: adjusted limit on open files from 65536 to 1048576
>Mar 31 14:04:51 bardolino named[1506]: found 4 CPUs, using 4 worker threads
>Mar 31 14:04:51 bardolino named[1506]: using 2 UDP listeners per interface
>Mar 31 14:04:51 bardolino named[1506]: using up to 21000 sockets
>
>/etc/resolv.conf:
>
> domain netestate.de
> nameserver 127.0.0.1
> options timeout:10 attempts:1
>
>The problem also occurs with unchanged options (timeout:5 attempts:2).
>
>I can control the number of DNS-threads of my crawling app and have
>tested it with up to ca. 3500 recursing clients which results in a number
>of queries/s of the same magnitude. With that setup, lookup errors for the
>local zone occur very often (the TTL for the local zone is 10 minutes).
>
>I would be grateful for advice on where to search or what to adjust.
>
>Here is a statistics dump while running with ca. 1000 recursing clients. A
>high number of failing queries may be natural - we have a high number of
>Chinese link farms in our database.
>
>+++ Statistics Dump +++ (1459439461)
>++ Incoming Requests ++
>             7329332 QUERY
>++ Incoming Queries ++
>             7261964 A
>                1357 NS
>                   4 CNAME
>                 635 PTR
>                   7 MX
>               65365 AAAA
>++ Outgoing Queries ++
>[View: default]
>            15552970 A
>                2022 NS
>                  78 CNAME
>                  30 PTR
>                   7 MX
>               28796 AAAA
>[View: _bind]
>++ Name Server Statistics ++
>             7329332 IPv4 requests received
>              192360 requests with EDNS(0) received
>                   4 TCP requests received
>                 605 auth queries rejected
>                   1 recursive queries rejected
>             7327981 responses sent
>                   5 truncated responses sent
>              192358 responses with EDNS(0) sent
>             6063138 queries resulted in successful answer
>             6386951 queries resulted in non authoritative answer
>              115630 queries resulted in nxrrset
>              940424 queries resulted in SERVFAIL
>              208183 queries resulted in NXDOMAIN
>             6756330 queries caused recursion
>                   3 duplicate queries received
>                 348 queries dropped
>                 606 other query failures
>                1000 recursing clients
>             7328722 UDP queries received
>                   4 TCP queries received
>++ Zone Maintenance Statistics ++
>++ Resolver Statistics ++
>[Common]
>                  33 mismatch responses received
>                 999 UDP queries in progress
>                   1 TCP queries in progress
>[View: default]
>            15583903 IPv4 queries sent
>             6182728 IPv4 responses received
>              201626 NXDOMAIN received
>               14456 SERVFAIL received
>                  46 FORMERR received
>                 138 EDNS(0) query failures
>               19648 truncated responses received
>                 379 lame delegations received
>             8550865 query retries
>             9401889 query timeouts
>               15859 IPv4 NS address fetches
>                 581 IPv4 NS address fetch failed
>              242332 queries with RTT < 10ms
>              307416 queries with RTT 10-100ms
>             5575709 queries with RTT 100-500ms
>               46819 queries with RTT 500-800ms
>                1560 queries with RTT 800-1600ms
>                8729 queries with RTT > 1600ms
>                1000 active fetches
>                 523 bucket size
>               39623 REFUSED received
>[View: _bind]
>                 523 bucket size
>++ Cache Statistics ++
>[View: default]
>             7790005 cache hits
>                  14 cache misses
>              660920 cache hits (from query)
>             6766805 cache misses (from query)
>                   0 cache records deleted due to memory exhaustion
>             4272624 cache records deleted due to TTL expiration
>             1665005 cache database nodes
>             1064959 cache database hash buckets
>           446003517 cache tree memory total
>           365453290 cache tree memory in use
>           365453458 cache tree highest memory in use
>           360239104 cache heap memory total
>             6836800 cache heap memory in use
>             7262784 cache heap highest memory in use
>[View: _bind (Cache: _bind)]
>                   0 cache hits
>                   0 cache misses
>                   0 cache hits (from query)
>                   0 cache misses (from query)
>                   0 cache records deleted due to memory exhaustion
>                   0 cache records deleted due to TTL expiration
>                   0 cache database nodes
>                  64 cache database hash buckets
>              284496 cache tree memory total
>               25096 cache tree memory in use
>               25096 cache tree highest memory in use
>              262144 cache heap memory total
>                 576 cache heap memory in use
>                 576 cache heap highest memory in use
>++ Cache DB RRsets ++
>[View: default]
>             1370430 A
>              119502 NS
>               12152 CNAME
>                3024 AAAA
>                 347 DS
>               25980 RRSIG
>               24925 NSEC
>               28986 !A
>                5444 !AAAA
>              110806 NXDOMAIN
>[View: _bind (Cache: _bind)]
>++ ADB stats ++
>[View: default]
>                1021 Address hash table size
>                6266 Addresses in hash table
>                1021 Name hash table size
>                7209 Names in hash table
>[View: _bind]
>                1021 Address hash table size
>                1021 Name hash table size
>++ Socket I/O Statistics ++
>            16276052 UDP/IPv4 sockets opened
>               19652 TCP/IPv4 sockets opened
>                   1 Raw sockets opened
>            16275047 UDP/IPv4 sockets closed
>               26470 TCP/IPv4 sockets closed
>              711791 UDP/IPv4 socket bind failures
>            15564255 UDP/IPv4 connections established
>               12444 TCP/IPv4 connections established
>                6824 TCP/IPv4 connections accepted
>                 163 UDP/IPv4 recv errors
>                1005 UDP/IPv4 sockets active
>                6829 TCP/IPv4 sockets active
>                   1 Raw sockets active
>++ Per Zone Query Statistics ++
>--- Statistics Dump --- (1459439461)
>
>Regards,
>
>Michael Brunnbauer
>
>-- 
>++  Michael Brunnbauer
>++  netEstate GmbH
>++  Geisenhausener Straße 11a
>++  81379 München
>++  Tel +49 89 32 19 77 80
>++  Fax +49 89 32 19 77 89
>++  E-Mail brunni at netestate.de
>++  http://www.netestate.de/
>++
>++  Sitz: München, HRB Nr.142452 (Handelsregister B München)
>++  USt-IdNr. DE221033342
>++  Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
>++  Prokurist: Dipl. Kfm. (Univ.) Markus Hendel
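
A few ratios from the statistics dump above are easier to judge than the
raw counters.  The sketch below pulls "count label" lines out of a dump
(with or without the ">" quote prefix) and divides some of them.  The
choice of ratios and the function names are my own, not anything BIND
reports itself.  For the dump quoted here it gives about 60% query timeouts
per query sent, about 4% UDP socket bind failures per socket opened, and
about 13% SERVFAIL per request received.

```python
import re

# A counter line is an optional ">" quote marker, a number, then a label.
COUNTER_RE = re.compile(r"^\s*>?\s*(\d+)\s+(.+?)\s*$")


def parse_counters(lines):
    """Map label -> count, keeping the first occurrence of each label.

    Labels such as "cache hits" repeat once per view; the first one seen
    belongs to the default view in the dump quoted above.
    """
    stats = {}
    for line in lines:
        match = COUNTER_RE.match(line)
        if match:
            stats.setdefault(match.group(2), int(match.group(1)))
    return stats


def ratio(stats, numerator, denominator):
    """Return numerator/denominator, or None if a counter is missing or zero."""
    top, bottom = stats.get(numerator), stats.get(denominator)
    if top is None or not bottom:
        return None
    return top / bottom


def health_ratios(stats):
    """Illustrative ratios only; thresholds are left to the reader."""
    return {
        "timeouts per query sent": ratio(
            stats, "query timeouts", "IPv4 queries sent"),
        "responses per query sent": ratio(
            stats, "IPv4 responses received", "IPv4 queries sent"),
        "UDP bind failures per socket opened": ratio(
            stats, "UDP/IPv4 socket bind failures", "UDP/IPv4 sockets opened"),
        "SERVFAIL per request": ratio(
            stats, "queries resulted in SERVFAIL", "IPv4 requests received"),
    }
```

A timeout ratio that high fits a backlog caused by slow or dead auth
servers, while the bind failures may be worth a separate look at the
socket and port settings.  Neither number proves a cause on its own.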


