Recursion ceases for 5-10 minutes at random intervals throughout the day

Bill Springall springall at fuse.net
Fri Feb 15 19:48:24 UTC 2008


Thanks for your reply

Correct, the requests themselves were answered, but only with "Server 
Failure" messages (the responses always seemed to come back quickly).  When 
it has happened to me, I was unable to get anything but the error message, 
although the graphs indicate ~100 qps still succeeding (perhaps from cache?).

(Graph: http://home.fuse.net/springall/dns-3.png - 5 min poll)
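
To tell "answered with Server Failure" apart from "dropped" outside of the 5 minute graph polls, a small probe can send a query and read the RCODE from the reply header. The following is only a minimal sketch using the Python standard library: the function names (build_query, parse_rcode, probe) are made up for illustration, it does not check that the reply ID matches the query, and the server address and name passed to probe() are whatever you choose to test with.

```python
import socket
import struct
import time

# RCODE values from the DNS header (RFC 1035); 2 is "Server Failure".
RCODE_NAMES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name, query_id, qtype=1):
    """Build a minimal recursive DNS query (qtype 1 = A, class IN)."""
    # Header: id, flags (0x0100 = recursion desired), 1 question, 0 others.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label is prefixed by its length, ending in a 0 byte.
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)


def parse_rcode(response):
    """Return the RCODE (low 4 bits of the flags word) of a DNS response."""
    _query_id, flags = struct.unpack("!HH", response[:4])
    return flags & 0x000F


def probe(server, name, timeout=2.0):
    """Send one UDP query; return the RCODE name, or 'TIMEOUT' if no reply."""
    query_id = int(time.time() * 1000) & 0xFFFF
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name, query_id), (server, 53))
        try:
            response, _addr = sock.recvfrom(512)
        except socket.timeout:
            return "TIMEOUT"
    rcode = parse_rcode(response)
    return RCODE_NAMES.get(rcode, str(rcode))
```

Calling probe() every few seconds against the caching server, once with a name that is likely cached and once with a name that is not, and logging the result with a timestamp would show whether SERVFAIL is limited to uncached names during an episode.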

The server itself has been relatively flat when it comes to memory 
usage.  It sits at about 750M.   I can set up a process memory graph if 
needed.
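
A per-process memory graph could be fed by sampling the resident set size of named. This is only a sketch and it assumes a Linux-style /proc filesystem, which the thread does not confirm; the helper names (parse_vmrss_kb, read_rss_kb) are made up for illustration, and the pid of named has to be supplied by the caller.

```python
import re


def parse_vmrss_kb(status_text):
    """Return VmRSS in kB from the text of /proc/<pid>/status, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kb(pid):
    """Read the current resident set size of a process (Linux /proc assumed)."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_vmrss_kb(fh.read())
```

Logging read_rss_kb() for the named pid once a minute alongside the query graph would show whether memory moves at all when recursion stops.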

CPU load does jump from 10% to 25%, at least during the last spike I 
checked.

Unfortunately, I haven't tried BIND without thread support.  We have had 
good luck with threads in testing and production (especially with 2x dual 
Opterons), so I haven't had a reason to disable them.

Thanks again!
- Bill


JINMEI Tatuya wrote:
> At Wed, 13 Feb 2008 17:32:41 -0500,
> Bill Springall <springall at fuse.net> wrote:
> 
>>      Each server handles anywhere between 500-1500 qps throughout the
>> day, under normal load.  Problem occurs at all loads.
>>      I've tried port "monitoring", tcpdumping the traffic, and sifting 
>> through the requests, and nothing seems out of the ordinary.   Numerous 
>> tweaks of the OS have not helped (state table within limits and then 
>> disabled, firewall deactivated/activated, eth stats good).  When the 
>> problem happens I can get onto the machine and it is ok (network 
>> upstream good, routing table hasn't inherited anything new, server calm). 
>>   When I turn logging up to a level that can help, named can't keep up.
>>      We now have a troubleshooting process in the works that 
>> involves different hardware and 9.4.2, environment re-architecture, as 
>> well as, <shiver>, other caching dns software.
>>      Is there a known problem, that I haven't been able to find, that 
>> could be causing this?   As I understand it, the "Server Failure" message 
>> is a general message; could someone help to point me to the next thing 
>> to try?   Any help would be appreciated!
> 
> I cannot think of a reason, but please let me ask something first.
> 
> - according to your description, the queries were not dropped, but
>   were simply answered with server failure, right?
> - how much memory does named use when this occurs?
> - how busy (in terms of CPU utilization) is named when this occurs?
> - does this change if you disable threads?
> 
> Thanks,
> 
> ---
> JINMEI, Tatuya
> Internet Systems Consortium, Inc.
> 

-- 
Bill Springall
Systems Engineer/UNIX Administrator
Cincinnati Bell/ZoomTown.com/Fuse.net
Email: springall at fuse.net
Desk: 513.565.9787
______________________________________________________________

  Learn avidly.
  Question repeatedly what you have learned.
  Analyze it carefully.
  Then put what you have learned into practice intelligently.
                                - Confucius
_______________________________________________________________



