max tcp sockets for bind 9.4.2-P1

chenchangemail at gmail.com
Wed Sep 3 21:56:52 UTC 2008


On Jul 17, 6:09 am, "Jason Bratton" <jbrat... at rackspace.com> wrote:
> Hello all,
>
> Like many of you, I recently upgraded all of our caching nameservers.
> Since we were already running BIND 9.4.2, I chose to upgrade to 9.4.2-P1.
> After the upgrade, I started receiving complaints that DNS queries which
> were truncated and retried over TCP were failing.
>
> It appears that BIND is limiting the number of open TCP connections to ~
> 100 per IP address it listens on.  For example, on one of our caching
> nameservers:
>
> cachens-4:~# netstat -an | grep tcp | grep 72.3.128.240 | wc -l
> 99
> cachens-4:~# netstat -an | grep tcp | grep 72.3.128.241 | wc -l
> 105
>
> From an rndc status:
>
> tcp clients: 0/1000
>
> Almost all (~99%) of the TCP connections in the above netstat are in the
> SYN_RECV state.  My guess would be customer servers that have bad firewall
> rules, but in any case, it's really not relevant to this particular
> problem because nothing has changed except for the upgrade from 9.4.2 to
> 9.4.2-P1.  I didn't change the named.conf or anything, and as you can see,
> tcp-clients is set to 1000.
>
> Did something change in the source code that would cause this?  I'm
> thinking a listen() call with the backlog set to 100 that wasn't set up
> that way previously.  Something interesting to me is that the ARM specifies
> the default for tcp-clients to be 100, but maybe that is a coincidence.
>
> FWIW, SOMAXCONN is set to 128 on my servers.  Prior to this patch, I was
> using a Debian packaged version of 9.4.2, so maybe they had it set higher?
>  I looked all through the source and changes made by Debian to 9.4.2 and
> couldn't find anything to indicate this is the case.
>
> I'm open to suggestions!  This is a Debian Etch box running kernel 2.6.18
> on an x86_64 architecture.  Thanks.
>
> -- Jason
>
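A rough way to check the listen() backlog theory from the quoted post, on a
Linux box like the one described above, is to compare the backlog actually
applied to the port 53 TCP listener with SOMAXCONN and look at the kernel's
accept-queue counters.  These commands are only illustrative (they are not
taken from the original post) and assume reasonably recent ss/netstat tools:

  # backlog in effect on the TCP listeners for port 53
  # (on most Linux kernels, ss shows a LISTEN socket's backlog in Send-Q)
  ss -lnt | grep ':53 '

  # kernel counters for accept-queue trouble; a growing "listen queue of a
  # socket overflowed" count means connections arrive faster than named accepts
  netstat -s | grep -i listen

  # hard ceiling the kernel silently applies to any listen() backlog
  cat /proc/sys/net/core/somaxconn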

I am experiencing a similar issue with a vendor-supplied BIND that includes
the 9.4.2-P1 fixes:

QDDNS 4.1 Build 6 - Lucent DNS Server (BIND 9.4.1-P1), Copyright (c)
2008 Alcatel-Lucent
 + Includes security fixes from BIND 9.4.2-P1

It all started with a complaint that a query was failing on one of our
15 internal DNS servers.  All 15 servers were recently deployed and
were identical in configuration.  When I looked into the issue, I
noticed that the query generated a response which was truncated and
then reattempted using TCP.  I then tested queries against the
problematic server using "dig +tcp" and discovered that all DNS
queries using TCP were failing on this server.  netstat showed lots of
connections in SYN_RECV.  We had seen the same symptoms before when our
firewall team misconfigured rules, so I checked whether that was the cause.
I logged on to the problematic server and issued TCP queries to it from
itself, and noticed something very strange.  A "dig +tcp
somehost.domain.com @127.0.0.1" would succeed with no issues, while a
"dig +tcp somehost.domain.com @ip.of.the.server" would result in:

; <<>> DiG 9.4.1-P1 <<>> +tcp xxxx.xxxx.xxxx @xxx.xxx.xxx.xxx
; (1 server found)
;; global options:  printcmd
;; connection timed out; no servers could be reached

I am still waiting for the vendor to accept that this is not a firewall
issue, since I can reproduce it by querying the server from itself.
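
For reference, the loopback-versus-service-address comparison above can be
run as a quick check from the server itself.  The hostname and address below
are placeholders, just as in the dig output above:

  # TCP query via loopback, then via the address named listens on
  dig +tcp somehost.domain.com @127.0.0.1
  dig +tcp somehost.domain.com @ip.of.the.server

  # count half-open connections parked on the service address
  # (replace ip.of.the.server with the server's listening address); a large
  # SYN_RECV count here, while loopback still answers, points at the
  # listener's accept queue rather than at a firewall
  netstat -ant | awk '$6 == "SYN_RECV" && $4 == "ip.of.the.server:53"' | wc -l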

