rndc stop hangs, named stuck at FUTEX WAIT
Chuck Anderson
cra at WPI.EDU
Sat Dec 13 16:05:52 UTC 2014
For the second time (at least), an automatic BIND update on Scientific
Linux 6 (RHEL 6 clone) failed to restart the named process. The RPM
package runs this to restart:
postuninstall scriptlet (using /bin/sh):
/sbin/ldconfig
if [ "$1" -ge 1 ]; then
    /sbin/service named try-restart >/dev/null 2>&1 || :;
fi;
which boils down to:
echo -n $"Stopping named: "
check_pidfile
[ -x /usr/sbin/rndc ] && /usr/sbin/rndc stop >/dev/null 2>&1;
RETVAL=$?
# was rndc successful?
[ "$RETVAL" -eq 0 ] || \
    killproc -p "$ROOTDIR$PIDFILE" "$named" -TERM >/dev/null 2>&1
timeout=0
RETVAL=0
while pidofnamed &>/dev/null; do
    if [ $timeout -ge $NAMED_SHUTDOWN_TIMEOUT ]; then
        RETVAL=1
        break
    else
        sleep 2 && echo -n "."
        timeout=$((timeout+2))
    fi;
done
I believe that "rndc stop" is hanging or timing out. The restart then
fails, and the old named is left running in a state where it no longer
answers DNS queries, which causes a DNS service outage.
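In the excerpt above, the wait loop gives up with RETVAL=1 once NAMED_SHUTDOWN_TIMEOUT is reached and never escalates beyond SIGTERM. One possible workaround is sketched below, assuming it is acceptable to force-kill a wedged named. The function name stop_or_kill and its arguments are hypothetical and are not part of the init script:

```shell
# Hypothetical helper, not part of the named init script.
# Usage: stop_or_kill PID [GRACE_SECONDS]
# Returns 0 if the process exited after SIGTERM (or was already gone),
# 1 if it had to be sent SIGKILL after the grace period.
stop_or_kill() {
    pid=$1
    grace=${2:-10}
    waited=0

    # Ask politely first; if the signal cannot be delivered,
    # treat the process as already gone.
    kill -TERM "$pid" 2>/dev/null || return 0

    # Poll with signal 0, which only checks that the PID still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$grace" ]; then
            # Grace period expired: force the process out.
            kill -KILL "$pid" 2>/dev/null
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

Called as, for example, stop_or_kill "$(cat "$ROOTDIR$PIDFILE")" 30, this would at least guarantee that a following start does not hit "already running". It does nothing to explain why named hung in the first place.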
This time I caught it in that state. Running "strace" on the PID of
named showed it stuck in FUTEX_WAIT:
named 22813 6.4 5.2 678320 422788 ? Ssl Sep22 7684:06 /usr/sbin/named -u named
# service named restart
Stopping named: ............. [FAILED]
Starting named: named: already running [ OK ]
# killall named
# ps auxw | grep named
named 22813 6.4 5.2 678320 422788 ? Ssl Sep22 7684:06 /usr/sbin/named -u named
# strace -p 22813
Process 22813 attached - interrupt to quit
futex(0x7fb6212f89d0, FUTEX_WAIT, 22814, NULL^C <unfinished ...>
Process 22813 detached
# kill -9 22813
# kill -9 22813
22813: No such process
# service named restart
Stopping named: [ OK ]
Starting named: [ OK ]
# ps auxw | grep named
named 13241 33.0 1.4 290968 120488 ? Ssl 04:10 0:02 /usr/sbin/named -u named
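Note that strace -p on the main PID attaches only to that one thread, so the transcript above says nothing about what the other threads were doing. Before reaching for kill -9 next time, it may be worth recording the state of every thread. A rough sketch, assuming a Linux /proc filesystem; dump_threads is a hypothetical helper name, and wchan may be empty or unreadable on some kernels:

```shell
# Hypothetical diagnostic helper: print one line per thread of PID
# with its thread ID, scheduler state, and kernel wait channel.
# Usage: dump_threads PID
dump_threads() {
    pid=$1
    if [ ! -d "/proc/$pid/task" ]; then
        echo "no such process: $pid" >&2
        return 1
    fi
    for t in /proc/"$pid"/task/*; do
        tid=${t##*/}
        # "State:" in status looks like "S (sleeping)".
        state=$(awk '/^State:/ {print $2, $3}' "$t/status" 2>/dev/null)
        # wchan names the kernel function the thread is blocked in, if any.
        wchan=$(cat "$t/wchan" 2>/dev/null)
        printf '%s\t%s\t%s\n' "$tid" "${state:-?}" "${wchan:-?}"
    done
}
```

Saving the output of dump_threads 22813 (or of strace -f -p 22813, which follows all threads) alongside a bug report would give more to go on than the single futex line.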
Has anyone else experienced this problem? As I said, this is the
second or third time this has happened on my systems.
I'll file a bug in Red Hat's Bugzilla, but since I don't have a support
contract, it might get ignored for three years.