BIND and UDP tuning

Thu Sep 27 14:53:25 UTC 2018

Hi,

> > I reported a few weeks ago that I was experiencing a really high
> > number of "SERVFAIL" messages in my bind-9.11.4-P1 system running on
> > fedora28, and I haven't yet found a solution. This is all now running
> > on a 165/35 cable system.
> >
> > I found a program named dropwatch which is showing a significant
> > number of dropped UDP packets, particularly when there are bursts of
> > email traffic:
> >
> > 12 drops at skb_queue_purge+13 (0xffffffff9f79a0c3)
> > 1 drops at __udp4_lib_rcv+1e6 (0xffffffff9f83bdf6)
> > 4 drops at __udp4_lib_rcv+1e6 (0xffffffff9f83bdf6)
> > 5 drops at nf_hook_slow+a7 (0xffffffff9f7faff7)
> > 3 drops at sk_stream_kill_queues+48 (0xffffffff9f7a1158)
> > 3 drops at __udp4_lib_rcv+1e6 (0xffffffff9f83bdf6)
> > ...
> >
> > # netstat -us
> > ...
> > Udp:
> >     23449482 packets received
> >     1724269 packets to unknown port received
> >     8248 packet receive errors
> >     31394909 packets sent
> >     8243 receive buffer errors
> >     0 send buffer errors
> >     InCsumErrors: 5
> >     IgnoredMulti: 43247
> >
> > The SERVFAIL messages don't necessarily correspond to the UDP packet
> > errors shown by netstat, but the dropwatch output is continuous. The
> > netstat packet receive errors also don't seem to correspond to
> > "SERVFAIL" or "Name service" errors:
> >
> > 26-Sep-2018 12:42:49.743 query-errors: info: client @0x7fb3c41634d0
> > 127.0.0.1#44104 (46.36.47.104.wl.mailspike.net): query failed
> > (SERVFAIL) for 46.36.47.104.wl.mailspike.net/IN/A at
> > ../../../bin/named/query.c:8580
> >
> > Sep 26 12:47:11 mail03 postfix/dnsblog[22821]: warning: dnsblog_query:
> > lookup error for DNS query 196.91.107.80.bl.spameatingmonkey.net: Host
> > or domain name not found. Name service error for
> > name=196.91.107.80.bl.spameatingmonkey.net type=A: Host not found, try
> > again
> >
> > I've been following this thread from some time ago, but nothing I've
> > done has made a difference. I really don't know what the buffer sizes
> > should be.
> > http://bind-users-forum.2342410.n4.nabble.com/Tuning-suggestions-for-high-core-count-Linux-servers-td3899.html
> >
> > Are there specific bind tunables you might recommend? edns-udp-size,
> > perhaps?
> >
> > Any ideas on other tunables such as net.core.*mem_default etc?
>
> *chuckles to self*
>
> I was just referring back to that thread myself to try to remember what I did.
>
> I ended up tuning the following items:
>
>   - name: SYSCTL system tuning, basics
>     sysctl:
>       name: "{{ item.name }}"
>       value: "{{ item.value }}"
>       sysctl_set: yes
>       state: present
>     with_items:
>       - { name: 'vm.swappiness', value: 0 }
>       - { name: 'net.core.netdev_max_backlog', value: 32768 }
>       - { name: 'net.core.netdev_budget', value: 2700 }
>       - { name: 'net.ipv4.tcp_sack', value: 0 }
>       - { name: 'net.core.somaxconn', value: 2048 }
>       - { name: 'net.core.rmem_default', value: 16777216 }
>       - { name: 'net.core.rmem_max', value: 16777216 }
>       - { name: 'net.core.wmem_default', value: 16777216 }
>       - { name: 'net.core.wmem_max', value: 16777216 }
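
A quick way to confirm that values like these are actually live in the
running kernel (and were not silently reset by a reboot or another tool)
is to compare them against /proc/sys. The following is only a minimal
sketch: the EXPECTED table copies the network-related entries from the
list above, and the helper names sysctl_path and find_mismatches are
hypothetical, not part of any existing tool.

```python
from pathlib import Path

# Network-related values copied from the quoted list above.
EXPECTED = {
    "net.core.netdev_max_backlog": 32768,
    "net.core.netdev_budget": 2700,
    "net.core.rmem_default": 16777216,
    "net.core.rmem_max": 16777216,
    "net.core.wmem_default": 16777216,
    "net.core.wmem_max": 16777216,
}


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path(root).joinpath(*name.split("."))


def find_mismatches(expected, root="/proc/sys"):
    """Return {name: (wanted, actual)} for every key whose live value differs.

    actual is None when the file is missing or cannot be parsed.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            actual = int(sysctl_path(name, root).read_text().split()[0])
        except (OSError, ValueError, IndexError):
            actual = None
        if actual != wanted:
            mismatches[name] = (wanted, actual)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, actual) in find_mismatches(EXPECTED).items():
        print(f"{name}: wanted {wanted}, kernel reports {actual}")
```

No output means every listed key already holds the expected value.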

Were you troubleshooting the same problems as I'm experiencing?

I've already tweaked many of these values, and it has had no effect on
my SERVFAIL issues :-(
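
To judge whether a given tweak changes anything, it may help to look at
how fast the UDP error counters grow over an interval rather than at
the totals since boot. The sketch below parses the "Udp:" lines of
/proc/net/snmp, where (as far as I know) RcvbufErrors is the counter
that netstat -us prints as "receive buffer errors". The function names
are hypothetical and the code is an illustration, not a tested
monitoring tool.

```python
def parse_udp_counters(snmp_text):
    """Parse the 'Udp:' header line and value line from /proc/net/snmp text."""
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Udp:")]
    if len(rows) < 2:
        raise ValueError("no Udp: header/value pair found")
    header, values = rows[0][1:], rows[1][1:]
    return dict(zip(header, (int(v) for v in values)))


def counter_delta(before, after):
    """Return the per-counter increase between two parsed snapshots."""
    return {name: after[name] - before[name] for name in after if name in before}


def read_snmp(path="/proc/net/snmp"):
    """Read the raw counter text; the path is the usual Linux location."""
    with open(path) as f:
        return f.read()
```

Taking one snapshot with parse_udp_counters(read_snmp()), waiting a
minute, taking a second one and passing both to counter_delta shows
whether RcvbufErrors is still climbing after a change. For scale, the
totals quoted earlier (8243 receive buffer errors against 23449482
packets received) work out to roughly 0.035% of received datagrams.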

I've also been following the performance tuning variables in this RH document:
https://access.redhat.com/sites/default/files/attachments/20150325_network_performance_tuning.pdf

These errors appear to occur in spurts: there are typically ten or
more in a row, then a gap of anywhere from seconds to minutes before
the next burst.
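
To put numbers on those bursts, the SERVFAIL entries in the
query-errors log can be grouped by timestamp. This is a rough sketch
under a few assumptions: each log entry is a single line beginning
with a timestamp in the format shown earlier (for example
"26-Sep-2018 12:42:49.743"), month names are parsed in an English or C
locale, and the 2-second gap that separates one burst from the next is
an arbitrary choice. The function names are hypothetical.

```python
from datetime import datetime, timedelta


def servfail_times(log_lines):
    """Yield a datetime for each line that records a SERVFAIL query failure."""
    for line in log_lines:
        if "query failed (SERVFAIL)" not in line:
            continue
        stamp = " ".join(line.split()[:2])  # e.g. "26-Sep-2018 12:42:49.743"
        try:
            yield datetime.strptime(stamp, "%d-%b-%Y %H:%M:%S.%f")
        except ValueError:
            continue  # skip lines whose prefix is not a timestamp


def group_bursts(times, max_gap=timedelta(seconds=2)):
    """Group timestamps into bursts, returned as (start, end, count) tuples.

    A gap longer than max_gap between consecutive entries starts a new burst.
    """
    bursts = []
    for t in sorted(times):
        if bursts and t - bursts[-1][1] <= max_gap:
            start, _, count = bursts[-1]
            bursts[-1] = (start, t, count + 1)
        else:
            bursts.append((t, t, 1))
    return bursts
```

Feeding the open log file to group_bursts(servfail_times(f)) gives the
start, end and size of each burst, which can then be lined up against
mail traffic or the UDP drop counters for the same time window.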

It looks like the load peaks at as many as 500 queries per second,
although the usual rate is closer to 200 per second.

I don't believe this is a bind configuration problem, as the "Name
service error" errors from postfix also occur when testing with
unbound.

This also happens only on the two identical systems connected to the
165/35 Mbit/s cable modem. I've verified with Oponline, and they've
emphatically asserted there are no problems with the circuit. The
systems are 8-core Xeon E31240 with 16GB RAM. I've also tried other
systems, including a 12-core i7 with 32GB.

We have several other systems connected to a 10 Mbit/s DIA Ethernet
circuit where these errors don't generally occur. They are also
similarly configured fedora systems with the same version of bind.

I'm really at a loss as to what the problem(s) are, but feel like it's
really impacting our ability to query RBLs for processing mail.

> Whilst it was only mentioned in passing on that thread, I also poked
> around with TOE, pause, adaptive coalescing and ring size settings
> (look at ethtool -K, ethtool -A, ethtool -C and ethtool -G), but sadly
> I have lost the specific commands.

I've also tried configuring the NIC with ethtool, following the
settings described in the RH document listed above, without success.

This really is just a stock system. I can't believe these problems
would be so elusive or uncommon. Could it have to do with some
characteristic of the cable circuit itself?

I've also experimented with QoS, using tc to prioritize interactive
traffic, including tcp and udp port 53, with plenty of bandwidth.

I really hope there is someone with some additional ideas.
Thanks,
Alex