Secondaries sometimes don't respond to notify

John Wobus jw354 at cornell.edu
Mon Mar 14 17:03:33 UTC 2005


This is a follow-up to my own queries about our problem with
secondaries randomly missing notifies.  Adjusting the Solaris
parameter udp_recv_hiwat on the secondaries solved the problem.

No one responded to my query on this list for experiences adjusting
udp_recv_hiwat, so I gave it a shot with no experience to go on.
Online information about other UDP-based applications advises raising
that parameter when incoming UDP packets are being dropped.

I raised the value from 8192 to 32768 on one of our secondaries, and 
restarted the daemon.  Afterward, instead of failing to respond to 
50-100 notifies out of 300, it was missing only 1-5.  Since then, I 
have used the value 65536 on both our secondaries and I am seeing no 
problems.  I have seen no other effects of the changes, adverse or 
otherwise.
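
For anyone curious to see the mechanism in miniature, below is a
small, self-contained C sketch (my own illustration, nothing from
BIND's source; the port number is arbitrary and most error checks
are omitted for brevity).  It sends datagrams at a UDP socket that
isn't reading, with a deliberately small SO_RCVBUF; whatever doesn't
fit in the buffer is silently dropped, which is exactly how our
missed notifies behaved:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void)
    {
        int rx = socket(AF_INET, SOCK_DGRAM, 0);
        int tx = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in addr;
        int rcvbuf = 8192;            /* our old Solaris default */
        char pkt[512];                /* roughly notify-sized datagrams */
        int sent, buffered = 0;

        memset(&addr, 0, sizeof(addr));
        memset(pkt, 0, sizeof(pkt));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        addr.sin_port = htons(53530); /* arbitrary test port */

        /* Deliberately small receive buffer on the receiving socket. */
        setsockopt(rx, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));
        bind(rx, (struct sockaddr *)&addr, sizeof(addr));

        /* Send 300 datagrams while the receiver isn't reading;
         * anything that doesn't fit in its buffer is silently lost. */
        for (sent = 0; sent < 300; sent++)
            sendto(tx, pkt, sizeof(pkt), 0,
                   (struct sockaddr *)&addr, sizeof(addr));

        /* Drain whatever the buffer actually held. */
        fcntl(rx, F_SETFL, O_NONBLOCK);
        while (recv(rx, pkt, sizeof(pkt), 0) > 0)
            buffered++;

        printf("sent %d, buffered %d, dropped %d\n",
               sent, buffered, sent - buffered);
        close(rx);
        close(tx);
        return 0;
    }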

We will move to faster servers soon, and this problem and its
solution suggest that it is past time we did so.  Even if the faster
servers handle notifies without this UDP adjustment, I will likely
make it anyway, on the theory that it gives more headroom to a
resource limit that the application has proven capable of hitting.
Since a notify request has much in common with other DNS requests, I
reason that other aspects of our DNS service were likely degraded as
well.

One could imagine a BIND configuration parameter that causes it to
set SO_RCVBUF on its UDP sockets, thus allowing this issue to be
addressed by BIND itself.  However, I do not know how consistently
that socket option behaves across *NIX OSes.
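
As a rough sketch of what such a knob might amount to (again my own
illustration, not BIND code), the per-socket change is just a
setsockopt call.  Reading the value back with getsockopt matters,
because setsockopt can "succeed" while the kernel quietly clamps the
request to a system-wide ceiling (udp_max_buf on Solaris, if I
recall correctly):

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    /* Request a larger receive buffer on one UDP socket, then read
     * back what the kernel actually granted. */
    static int raise_rcvbuf(int sock, int want)
    {
        int got = 0;
        socklen_t len = sizeof(got);

        if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF,
                       &want, sizeof(want)) < 0) {
            perror("setsockopt(SO_RCVBUF)");
            return -1;
        }
        if (getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &got, &len) < 0) {
            perror("getsockopt(SO_RCVBUF)");
            return -1;
        }
        printf("SO_RCVBUF: requested %d, granted %d\n", want, got);
        return 0;
    }

    int main(void)
    {
        int sock = socket(AF_INET, SOCK_DGRAM, 0);

        if (sock < 0) {
            perror("socket");
            return 1;
        }
        raise_rcvbuf(sock, 65536);
        close(sock);
        return 0;
    }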

John Wobus
Cornell (University) Information Technologies

On Mar 3, 2005, at 4:51 PM, John Wobus wrote:

> Re our problem with occasional dropped notifies:
>
> I'm thinking of experimenting with Solaris's setting, "udp_recv_hiwat".
> Can anyone share experience with raising udp_recv_hiwat
> on loaded query servers? Did raising it visibly help any problems
> or hurt anything?
>
> Sockets have a setting, SO_RCVBUF, controllable by setsockopt,
> that determines the size of the buffer through which the socket's
> incoming UDP packets pass, but as far as I can see, BIND 9.3
> doesn't touch the setting.  Our Solaris uses 8192 by default, and
> has a system-wide setting controllable by ndd (udp_recv_hiwat)
> which serves as the default for SO_RCVBUF.  I'm thinking of raising
> udp_recv_hiwat to some moderately increased value, e.g.
> doubling or quadrupling it, then restarting named, to see what
> difference it makes.
>
> John Wobus
> Cornell CIT
>
> I wrote:
>
>> I'm not solving this, so I'll give an update and see if anyone has
>> good ideas to offer me.
>>
>> The problem is two secondaries that randomly "drop" notifies from
>> the primary (BIND 9.3; more details in a previous message).
>>
>> The servers are on Solaris 8, and I used snoop (similar to tcpdump)
>> to verify that the packets do indeed cross the network to the
>> secondary.  Sometimes the notify then works as advertised,
>> but at random times one of two kinds of failure occurs.  Sometimes
>> named on the secondary never logs that it received the notify;
>> in far fewer instances, named does log it, but snoop never
>> shows it sending the notify response.  I've tried looking at
>> truss output, but didn't make much progress fitting together
>> more of the picture.
>>
>> The ideas I still have left to try are (1) crank up BIND's debug
>> logging and pore through it, or (2) put the secondary on a bigger
>> server and see if the problem disappears.  I can believe that
>> the secondary is simply too busy, but in that case I would expect
>> some sort of logging, or confirmation that other sites have seen
>> this before.  As supporting evidence, I do see the problem
>> occurring more often during busier times.
>>
>> Any ideas/inspirations appreciated.
>>
>> John Wobus
>>
>> On Jan 24, 2005, at 5:23 PM, John Wobus wrote:
>>
>>> Notifies sometimes get lost between our BIND 9.3 servers.  What
>>> can I look for as a cause?
>>>
>>> Two secondary servers are showing the problem with the same primary
>>> server.  When the failure occurs, the primary server logs that it
>>> sent notifies, then logs 'notify retries exceeded' for the secondary
>>> in question.  The secondary's log shows nothing.  Zones and
>>> secondaries affected at any particular instance are random: failure
>>> occurs for only 10-40% of the notifications.  When one secondary
>>> fails for a particular zone, the other one often succeeds in loading
>>> it.  The new zone files have updated SOA serial numbers.  The
>>> failing secondary later transfers the zone successfully, when the
>>> refresh interval expires.  None of the servers have firewall
>>> software.  The servers serve fewer than 300 zones.
>>>
>>> I've checked the network and the BIND config file options (which
>>> are generally the defaults), looked for other problems in the logs,
>>> and searched my BIND books/manuals and online resources, and I have
>>> run out of ideas.
>>>
>>> John Wobus
>>> Cornell CIT


