rndc stop hangs, named stuck at FUTEX WAIT

Chuck Anderson cra at WPI.EDU
Sat Dec 13 16:05:52 UTC 2014


For the second time (at least), an automatic BIND update on Scientific
Linux 6 (RHEL 6 clone) failed to restart the named process.  The RPM
package runs this to restart:

postuninstall scriptlet (using /bin/sh):
/sbin/ldconfig
if [ "$1" -ge 1 ]; then
  /sbin/service named try-restart >/dev/null 2>&1 || :;
fi;

which boils down to:

  echo -n $"Stopping named: "
  check_pidfile
  [ -x /usr/sbin/rndc ] && /usr/sbin/rndc stop >/dev/null 2>&1;
  RETVAL=$?
  # was rndc successful?
  [ "$RETVAL" -eq 0 ] || \
    killproc -p "$ROOTDIR$PIDFILE" "$named" -TERM >/dev/null 2>&1

  timeout=0
  RETVAL=0
  while pidofnamed &>/dev/null; do
    if [ $timeout -ge $NAMED_SHUTDOWN_TIMEOUT ]; then
      RETVAL=1
      break
    else
      sleep 2 && echo -n "."
      timeout=$((timeout+2))
    fi;
  done

I believe "rndc stop" is hanging or timing out.  The restart then
fails, and the old named process is left running in a state where it
no longer answers DNS queries, which causes a DNS service outage.
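
As a stopgap, the stop step could escalate to SIGKILL when the process
outlives the timeout, which is what I ended up doing by hand below.
This is only a minimal sketch and not what the init script actually
does: the function names is_running and stop_with_escalation are
hypothetical, it assumes a Linux /proc, and SIGKILL gives named no
chance to shut down cleanly.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the named init script).

# is_running PID: true if the process exists and is not a zombie.
is_running() {
  [ -r "/proc/$1/status" ] || return 1
  state=$(awk '/^State:/ { print $2; exit }' "/proc/$1/status" 2>/dev/null)
  [ -n "$state" ] && [ "$state" != "Z" ]
}

# stop_with_escalation PID TIMEOUT_SECONDS
# Send SIGTERM, wait up to TIMEOUT_SECONDS, then fall back to SIGKILL.
# Returns 0 if the process is gone afterwards, non-zero otherwise.
stop_with_escalation() {
  pid=$1
  limit=$2
  waited=0
  kill -TERM "$pid" 2>/dev/null
  while is_running "$pid" && [ "$waited" -lt "$limit" ]; do
    sleep 1
    waited=$((waited + 1))
  done
  if is_running "$pid"; then
    # Last resort: the process ignored or could not act on SIGTERM.
    kill -KILL "$pid" 2>/dev/null
    sleep 1
  fi
  ! is_running "$pid"
}
```

For the stuck process shown below, this would have been called as
"stop_with_escalation 22813 30", with the timeout chosen arbitrarily.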

When I caught it in this state this time, I ran "strace" on the PID
of named and found it stuck in FUTEX_WAIT:

named    22813  6.4  5.2 678320 422788 ?       Ssl  Sep22 7684:06 /usr/sbin/named -u named

# service named restart
Stopping named: .............                              [FAILED]
Starting named: named: already running                     [  OK  ]
# killall named
# ps auxw | grep named
named    22813  6.4  5.2 678320 422788 ?       Ssl  Sep22 7684:06 /usr/sbin/named -u named
# strace -p 22813
Process 22813 attached - interrupt to quit
futex(0x7fb6212f89d0, FUTEX_WAIT, 22814, NULL^C <unfinished ...>
Process 22813 detached
# kill -9 22813
# kill -9 22813
22813: No such process
# service named restart
Stopping named:                                            [  OK  ]
Starting named:                                            [  OK  ]
# ps auxw | grep named
named    13241 33.0  1.4 290968 120488 ?       Ssl  04:10   0:02 /usr/sbin/named -u named
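
The strace above attached to only one thread of the process (TID
22813), and the value it was waiting on, 22814, is one above the PID,
which suggests (an assumption, nothing in the trace confirms it) that
the main thread was waiting on another thread.  Before resorting to
kill -9 it may help to record what every thread is doing.  A minimal
sketch, assuming a Linux /proc; list_thread_states is a hypothetical
helper name, and wchan may be unreadable or empty on some kernels:

```shell
#!/bin/sh
# list_thread_states PID
# Print "TID STATE WCHAN" for each thread of the process, read from /proc.
list_thread_states() {
  pid=$1
  for task in /proc/"$pid"/task/[0-9]*; do
    # Skip if the glob did not match or the thread has already exited.
    [ -r "$task/status" ] || continue
    tid=${task##*/}
    state=$(awk '/^State:/ { print $2; exit }' "$task/status")
    # wchan names the kernel function the thread is sleeping in, if any.
    wchan=$(cat "$task/wchan" 2>/dev/null)
    echo "$tid $state ${wchan:-?}"
  done
}
```

Running "list_thread_states 22813" while named is stuck would show
whether the other threads are also sleeping or whether one is still
running.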

Has anyone else experienced this problem?  As I said, this is the
2nd or 3rd time it has happened on my systems.

I'll file a bug in Red Hat's Bugzilla, but since I don't have a
support contract it might get ignored for 3 years.

