BIND 9.4.2-P2-W1 stops responding

Sat Sep 6 01:22:06 UTC 2008

Vinny Abello wrote:
>> -----Original Message-----
>> From: bind-users-bounce at isc.org [mailto:bind-users-bounce at isc.org] On
>> Behalf Of Vinny Abello
>> Sent: Friday, September 05, 2008 5:20 PM
>> To: mayer at gis.net
>> Cc: bind-users at isc.org
>> Subject: RE: BIND 9.4.2-P2-W1 stops responding
>>
>>> -----Original Message-----
>>> From: Danny Mayer [mailto:mayer at gis.net]
>>> Sent: Friday, September 05, 2008 4:18 PM
>>> To: Vinny Abello
>>> Cc: bind-users at isc.org
>>> Subject: Re: BIND 9.4.2-P2-W1 stops responding
>>>
>>> Vinny Abello wrote:
>>>> OK, this happened again. This time I noticed that BIND was not
>>>> responding on the primary IP bound to the server, which it
>>>> previously responded on. It kept answering queries on a secondary
>>>> IP bound to the NIC, however. Again, nothing in the logs indicating
>>>> any type of problem that I can see. Perhaps this is related to
>>>> having multiple IPs bound to the machine. I restarted the service
>>>> and it started working again on both IP addresses.
>>>> Any ideas?
>>>>
>>>>> -----Original Message-----
>>>>> From: bind-users-bounce at isc.org [mailto:bind-users-bounce at isc.org]
>>>>> On Behalf Of Vinny Abello
>>>>> Sent: Friday, September 05, 2008 1:33 PM
>>>>> To: bind-users at isc.org
>>>>> Subject: BIND 9.4.2-P2-W1 stops responding
>>>>>
>>>>> I just upgraded from BIND 9.4.2 to BIND 9.4.2-P2-W1 on Windows
>>>>> Server 2003. The service no longer crashes like it did in P1 and
>>>>> P2; however, after about 12 hours of load, named just stops
>>>>> responding to queries completely. The service appears to still be
>>>>> running but will not respond to any type of query. I've restarted
>>>>> it and it came back to life again. I'm going to watch it more
>>>>> carefully to look for any other types of symptoms. I checked the
>>>>> log files and nothing out of the ordinary was in the logs. In
>>>>> fact, according to the logs, it appears that zone transfers were
>>>>> still happily taking place while it was not responding to queries.
>>>>>
>>>>> I don't know if these have anything to do with the issue, but
>>>>> there are a few odd errors I noted after starting it back up that
>>>>> are appearing in the logs. They are:
>>>>>
>>>>> 05-Sep-2008 13:19:26.827 dispatch: dispatch 03E25098: shutting down
>>>>> due to TCP receive error: <unknown address, family 48830>: network
>>>>> unreachable
>>>>>
>>>>> 05-Sep-2008 13:20:38.171 general: .\socket.c:2340: unexpected error:
>>>>> 05-Sep-2008 13:20:38.171 general: unable to convert errno to
>>>>> isc_result: 121: The semaphore timeout period has expired.
>>>>>
>>>>> 05-Sep-2008 13:21:14.733 dispatch: dispatch 03E288B0: shutting down
>>>>> due to TCP receive error: <unknown address, family 48830>: network
>>>>> unreachable
>>>>>
>>>>> 05-Sep-2008 13:21:44.122 general: .\socket.c:2340: unexpected error:
>>>>> 05-Sep-2008 13:21:44.122 general: unable to convert errno to
>>>>> isc_result: 121: The semaphore timeout period has expired.
>>>>>
>>>>> 05-Sep-2008 13:23:35.351 general: .\socket.c:2340: unexpected error:
>>>>> 05-Sep-2008 13:23:35.351 general: unable to convert errno to
>>>>> isc_result: 121: The semaphore timeout period has expired.
>>>>>
>>>>> 05-Sep-2008 13:24:41.300 general: .\socket.c:2340: unexpected error:
>>>>> 05-Sep-2008 13:24:41.300 general: unable to convert errno to
>>>>> isc_result: 121: The semaphore timeout period has expired.
>>>>>
>>>>>
>>>>> There are other normal messages in between those errors. I just
>>>>> picked them out.
>>>>>
>>>>> Some information about this server's configuration that might
>>>>> help: the server has multiple IPv4 addresses bound to the same
>>>>> network and same NIC. There is no IPv6 stack installed on the
>>>>> server. The server currently does recursion and also hosts some
>>>>> secondary zones.
>>>>>
>>>>>
>>>>> -Vinny
>>> Try setting max-cache-size and see if that helps with the queries.
>>> Don't worry about those other error messages. They're harmless.
>>>
>>> Danny
>> OK, I've added avoid-v4-udp-ports to my named.conf with all the UDP
>> ports I could identify as being used by other applications, including
>> my RADIUS service. I've restarted and I'll see if this helps at all.
> 
> Well, that had no effect. Still seems to die pretty frequently. I
> can't easily catch and restart BIND every 30 minutes so I'm going to
> have to replace this server with a different one running an operating
> system that behaves better with BIND. I already did this on my other
> two name servers. If you have any other ideas or reasons I shouldn't
> abandon BIND on Windows, let me know while I can still test it.
> 
> -Vinny
>

It's very hard to tell. The code has been heavily stress tested to
ensure it does not die. Are any of your addresses supplied by DHCP, or
are they all fixed IP addresses? That could potentially have an effect,
but I haven't seen it be a problem in the latest code. There will also
be a 9.5.0-P2-W1 soon which has additional changes that may help, but
it's hard to know without seeing details. Can you file a bug report on
this to bind9-bugs at isc.org with details of your named.conf file, what
you are seeing, netstat -an output to see what sockets are open, VM
used, handles and threads, CPU information, O/S, and anything else that
might be useful? A dump file might also be helpful if you have one and
can find it.
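
To record exactly when each bound address stops answering, a small
probe along the following lines might help. It is only a sketch, not
part of BIND: the function names, the test name "example.com", and the
3-second timeout are all assumptions chosen for illustration. It sends
one plain UDP query for an A record and reports whether a reply with
the same query ID comes back.

```python
import socket
import struct


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query: one question, type A, class IN."""
    # Header: ID, flags (0x0100 = recursion desired), QDCOUNT=1, others 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = [label for label in name.rstrip(".").split(".") if label]
    # QNAME is a sequence of length-prefixed labels ending in a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii") for label in labels
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def is_matching_response(packet: bytes, query_id: int) -> bool:
    """True if the packet is a DNS response carrying the expected ID."""
    if len(packet) < 12:
        return False
    resp_id, flags = struct.unpack("!HH", packet[:4])
    return resp_id == query_id and bool(flags & 0x8000)  # QR bit set


def probe(server_ip: str, name: str = "example.com",
          timeout: float = 3.0, port: int = 53) -> bool:
    """Send one UDP query to server_ip; True if a matching reply arrives."""
    query_id = 0x1234  # fixed ID, adequate for a manual check only
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name, query_id), (server_ip, port))
            packet, _ = sock.recvfrom(512)
        except OSError:
            # Covers timeouts and ICMP-triggered errors alike.
            return False
    return is_matching_response(packet, query_id)
```

Calling probe() once for each address bound to the NIC, from a
scheduled task, and logging the results with a timestamp would show
whether the primary address goes silent before the secondary ones.
That timing could be included in the bug report.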

Danny