BIND 9.4.2-P2-W1 stops responding

Fri Sep 5 21:14:09 UTC 2008

> -----Original Message-----
> From: Danny Mayer [mailto:mayer at gis.net]
> Sent: Friday, September 05, 2008 4:18 PM
> To: Vinny Abello
> Cc: bind-users at isc.org
> Subject: Re: BIND 9.4.2-P2-W1 stops responding
>
> Vinny Abello wrote:
> > OK, this happened again. This time I noticed that BIND was not
> > responding on the primary IP bound to the server that it usually would
> > previously respond on. It kept answering queries on a secondary IP
> > bound to the NIC however. Again, nothing in the logs indicating any
> > type of problem that I can see. Perhaps this is related to having
> > multiple IP's bound to the machine. I restarted the service and it
> > started working again on both IP addresses.
> >
> > Any ideas?
> >
> >> -----Original Message-----
> >> From: bind-users-bounce at isc.org [mailto:bind-users-bounce at isc.org] On
> >> Behalf Of Vinny Abello
> >> Sent: Friday, September 05, 2008 1:33 PM
> >> To: bind-users at isc.org
> >> Subject: BIND 9.4.2-P2-W1 stops responding
> >>
> >> I just upgraded from BIND 9.4.2 to BIND 9.4.2-P2-W1 on Windows Server
> >> 2003. The service no longer crashes like it did in P1 and P2, however
> >> after about 12 hours of load, named just stops responding to queries
> >> completely. The service appears that it is still running but will not
> >> respond to any type of query. I've restarted it and it came back to
> >> life again. I'm going to watch it more carefully to look for any other
> >> types of symptoms. I checked the log files and nothing out of the
> >> ordinary was in the logs. In fact, according to the logs, it appears
> >> that zone transfers were still happily taking place while it was not
> >> responding to queries.
> >>
> >> I don't know if these have anything to do with the issue, but there are
> >> a few odd errors I noted after starting it back up that are appearing
> >> in the logs. They are:
> >>
> >> 05-Sep-2008 13:19:26.827 dispatch: dispatch 03E25098: shutting down due to TCP receive error: <unknown address, family 48830>: network unreachable
> >>
> >> 05-Sep-2008 13:20:38.171 general: .\socket.c:2340: unexpected error:
> >> 05-Sep-2008 13:20:38.171 general: unable to convert errno to isc_result: 121: The semaphore timeout period has expired.
> >>
> >> 05-Sep-2008 13:21:14.733 dispatch: dispatch 03E288B0: shutting down due to TCP receive error: <unknown address, family 48830>: network unreachable
> >>
> >> 05-Sep-2008 13:21:44.122 general: .\socket.c:2340: unexpected error:
> >> 05-Sep-2008 13:21:44.122 general: unable to convert errno to isc_result: 121: The semaphore timeout period has expired.
> >>
> >> 05-Sep-2008 13:23:35.351 general: .\socket.c:2340: unexpected error:
> >> 05-Sep-2008 13:23:35.351 general: unable to convert errno to isc_result: 121: The semaphore timeout period has expired.
> >>
> >> 05-Sep-2008 13:24:41.300 general: .\socket.c:2340: unexpected error:
> >> 05-Sep-2008 13:24:41.300 general: unable to convert errno to isc_result: 121: The semaphore timeout period has expired.
> >>
> >>
> >> There are other normal messages in between those errors. I just picked
> >> them out.
> >>
> >> Some possible information that might help with this server's
> >> configuration. This server has multiple IPv4 IP addresses bound to the
> >> same network and same NIC. There is no IPv6 stack installed on the
> >> server. This server currently does recursion and also hosts some
> >> secondary zones as well.
> >>
> >>
> >> -Vinny
>
> Try setting max-cache and see if that helps with the queries. Don't
> worry about those other error messages. They're harmless.

I don't think that's the problem. It has happened again, and after more investigation I can see that only UDP queries stop working. If I use dig to send queries over TCP, BIND answers without a problem, for both authoritative and non-authoritative data. The same queries over UDP get no response. A query that requires recursion works over TCP; the same query over UDP then gets no response; when I repeat it over TCP, the answer is still in the cache and its TTL has been counting down. Restarting the service makes UDP queries work again.
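
The check I have been doing by hand with dig could be scripted roughly as below, so the UDP/TCP difference can be watched on each address between restarts. This is only a sketch: the function names are made up for illustration, the addresses are placeholders from the documentation range, and it treats any error or timeout as "no response" without looking at the reply contents.

```python
import socket
import struct


def build_query(name, qid=0x1234, qtype=1):
    """Build a minimal DNS query packet (qtype 1 = A record, class IN)."""
    # Header: ID, flags (0x0100 = recursion desired), QDCOUNT=1, other counts 0
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.split(".")
        if label
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def recv_exact(sock, n):
    """Read exactly n bytes from a stream socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("connection closed early")
        buf += chunk
    return buf


def query_udp(server, packet, timeout=3.0, port=53):
    """Send the query over UDP; return the raw reply, or None on timeout/error."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(packet, (server, port))
            data, _ = s.recvfrom(4096)
            return data
        except OSError:  # includes socket.timeout
            return None


def query_tcp(server, packet, timeout=3.0, port=53):
    """Send the query over TCP (2-byte length prefix); return reply or None."""
    try:
        with socket.create_connection((server, port), timeout=timeout) as s:
            s.sendall(struct.pack("!H", len(packet)) + packet)
            (length,) = struct.unpack("!H", recv_exact(s, 2))
            return recv_exact(s, length)
    except OSError:
        return None


if __name__ == "__main__":
    # Placeholder addresses; substitute the server's primary and secondary IPs.
    for server in ("192.0.2.1", "192.0.2.2"):
        packet = build_query("example.com")
        for label, fn in (("UDP", query_udp), ("TCP", query_tcp)):
            reply = fn(server, packet)
            print(server, label, "answered" if reply else "NO RESPONSE")
```

Run periodically, a line showing TCP answering while UDP does not on the same address would match what I am seeing with dig.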

I also have RADIUS running on this server, listening on UDP ports 1645, 1646, 1812, and 1813. Should I tell BIND not to use these ports as source ports for its queries? Could that be the issue?
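
If excluding those ports is the right approach, I assume it would look something like this in named.conf. This is an untested sketch: I believe `avoid-v4-udp-ports` is the relevant option, but I have not confirmed that it has any bearing on this problem.

```
options {
    // Assumption: keep named from choosing the RADIUS ports as UDP source ports
    avoid-v4-udp-ports { 1645; 1646; 1812; 1813; };
};
```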