BIND 9.5.0 Crashes Solaris 9

bsfinkel at anl.gov
Tue Jul 8 18:41:35 UTC 2008


We are running BIND 9.5.0 on Solaris 9.  Every two hours we have a cron
job that does

     mv /var/log/named.query.log ...
     returncode=$?
     /export/home/named/rndc reconfig

to rename the current query.log and run "reconfig" to start a new 
query log.  In /var/adm/messages I see
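For reference, the rotation step can be sketched as a small Bourne-shell function. This is a minimal sketch, not the actual cron job: the function name `rotate_query_log` and the timestamped destination name are hypothetical (the real `mv` target is elided above), and it assumes the captured return code is meant to gate the `reconfig`, which the snippet above does not show.

```shell
#!/bin/sh
# Hypothetical helper; the timestamp suffix is an illustrative choice,
# not the destination used by the actual cron job.
rotate_query_log() {
    log=$1    # e.g. /var/log/named.query.log
    rndc=$2   # e.g. /export/home/named/rndc
    stamp=`date +%Y%m%d%H%M%S`
    if mv "$log" "$log.$stamp"; then
        # named's open descriptor still refers to the renamed file;
        # per the post, "rndc reconfig" is what starts a new query log.
        "$rndc" reconfig
    else
        echo "rotate_query_log: mv of $log failed; skipping reconfig" >&2
        return 1
    fi
}
```

From cron this would be called as `rotate_query_log /var/log/named.query.log /export/home/named/rndc`.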

     Jul  8 11:58:15 oberon named[18737]:
       [ID 873579 daemon.info] too many timeouts resolving
       '5.224.241.207.sbl.spamhaus.org/TXT' (in 'sbl.spamhaus.org'?):
       disabling EDNS

     Jul  8 11:58:16 oberon last message repeated 2 times

     Jul  8 11:59:01 oberon named[18737]:
       [ID 873579 daemon.info] received control channel command 'reconfig'

     Jul  8 11:59:01 oberon named[18737]:
       [ID 873579 daemon.info] loading configuration from
       '/export/home/named.oberon/named.conf.oberon'

     Jul  8 11:59:01 oberon named[18737]:
       [ID 873579 daemon.info] default max-cache-size (33554432) applies

     Jul  8 11:59:01 oberon named[18737]:
       [ID 873579 daemon.info] default max-cache-size (33554432) applies:
       view _bind

     Jul  8 11:59:01 oberon named[18737]:
       [ID 873579 daemon.info] reloading configuration succeeded

     Jul  8 11:59:01 oberon named[18737]:
       [ID 873579 daemon.info] any newly configured zones are now loaded

It appears that at 11:59:01 the "reconfig" has completed successfully.

We have a monitor script that runs continuously.  It checks (via "ps -ef")
whether BIND is running.  If it sees that BIND is not running, it
starts it.  The monitor script sleeps for 60 seconds before its next
check.  This monitor script is not run via cron, so I do not know
exactly at what point in the minute it does its checks.  A few times
now the monitor script has discovered that BIND was not running;
each time this has occurred just after a "reconfig" to rename the
query log and start a new one.  Here is what I see in /var/adm/messages
after the Jul  8 11:59:01 "... zones are now loaded" message:
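The monitor described above can be sketched roughly as follows. This is an illustration only: the function names, the "loop" argument, and the start command (taken from the command line in the core file and the "starting BIND" message) are assumptions about how the real script works. Printing a timestamp on each restart would make it easier to line restarts up against the "reconfig" entries in /var/adm/messages.

```shell
#!/bin/sh
# Sketch of a ps-based monitor (Bourne-shell syntax).
# Function names, the "loop" argument and the start command are hypothetical.

# Succeed if some process command line contains the pattern; the
# "grep -v grep" drops the grep processes themselves from the listing.
is_running() {
    ps -ef | grep "$1" | grep -v grep >/dev/null
}

# One check: start the daemon only if no matching process is found, and
# print a timestamp so the restart can be matched with /var/adm/messages.
monitor_once() {
    pattern=$1
    start_cmd=$2
    is_running "$pattern" && return 0
    echo "`date`: $pattern not running, starting it" >&2
    $start_cmd    # deliberately unquoted so the command splits into words
}

if [ "$1" = loop ]; then
    while :; do
        monitor_once bind/sbin/named \
            "/export/home/named.oberon/bind/sbin/named -c /export/home/named.oberon/named.conf.oberon"
        sleep 60
    done
fi
```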

     Jul  8 11:59:57 oberon.it.anl.gov named[24233]:
       [ID 873579 daemon.notice] starting BIND 9.5.0
       -c /export/home/named.oberon/named.conf.oberon

There are no messages in /var/adm/messages that BIND 9.5.0 has crashed.

I looked at the core file  (160146520 Jul  8 11:59 core) with
gdb, and it tells me (there are some long lines with hex characters):
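The "---Type <return> to continue" prompts in an interactive session come from gdb's pager. A minimal sketch for capturing the same commands non-interactively, assuming gdb's `-batch` and `-x` options and the `set height 0` command; the function name `capture_backtrace`, the temporary command file, and the output path are hypothetical.

```shell
#!/bin/sh
# Run "where" and "thread apply all bt full" in batch mode and save the output.
capture_backtrace() {
    gdb=$1      # e.g. /usr/afsws/local/bin/gdb
    binary=$2   # e.g. bind/sbin/named
    core=$3     # e.g. core
    out=$4      # hypothetical output file
    cmds=/tmp/gdbcmds.$$
    # "set height 0" turns off paging, so no "---Type <return>" prompts.
    cat > "$cmds" <<'EOF'
set height 0
where
thread apply all bt full
EOF
    "$gdb" -batch -x "$cmds" "$binary" "$core" >"$out" 2>&1
    status=$?
    rm -f "$cmds"
    return $status
}
```

For example: `capture_backtrace /usr/afsws/local/bin/gdb bind/sbin/named core /tmp/named-bt.txt`.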

oberon.it.anl.gov# /usr/afsws/local/bin/gdb bind/sbin/named core
GNU gdb 6.7.1
Copyright (C) 2007 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "sparc-sun-solaris2.9"...

warning: Can't read pathname for load map: I/O error.
Reading symbols from /usr/lib/libnsl.so.1...done.
Loaded symbols for /usr/lib/libnsl.so.1
Reading symbols from /usr/lib/libsocket.so.1...done.
Loaded symbols for /usr/lib/libsocket.so.1
Reading symbols from /usr/lib/libpthread.so.1...done.
Loaded symbols for /usr/lib/libpthread.so.1
Reading symbols from /usr/lib/libthread.so.1...done.
Loaded symbols for /usr/lib/libthread.so.1
Reading symbols from /usr/lib/libc.so.1...done.
Loaded symbols for /usr/lib/libc.so.1
Reading symbols from /usr/lib/libdl.so.1...done.
Loaded symbols for /usr/lib/libdl.so.1
Reading symbols from /usr/lib/libmp.so.2...done.
Loaded symbols for /usr/lib/libmp.so.2
Reading symbols from /usr/platform/SUNW,Sun-Fire-V240/lib/libc_psr.so.1...done.
Loaded symbols for /usr/platform/SUNW,Sun-Fire-V240/lib/libc_psr.so.1

warning: Can't read pathname for load map: I/O error.

warning: Can't read pathname for load map: I/O error.
Core was generated by `/export/home/named.oberon/bind/sbin/named -c /export/home/named.oberon/named.co'.
Program terminated with signal 11, Segmentation fault.
#0  0x0007d4d4 in dns_dispatch_detach (dispp=0x1c) at dispatch.c:1931
1931            REQUIRE(dispp != NULL && VALID_DISPATCH(*dispp));
(gdb) where
#0  0x0007d4d4 in dns_dispatch_detach (dispp=0x1c) at dispatch.c:1931
#1  0x00104300 in disppooltimer_update (task=0x900dfd0, event=0x0)
    at resolver.c:7548
#2  0x0016ec84 in dispatch (manager=0x1eacb8) at task.c:862
#3  0x0016ee30 in run (uap=0x1eacb8) at task.c:1005
#4  0xff355378 in _lwp_start () from /usr/lib/libthread.so.1
#5  0xff355378 in _lwp_start () from /usr/lib/libthread.so.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) thread apply all bt full

Thread 5 (process 149809    ):
#0  0xff351bb0 in pthread_mutex_lock () from /usr/lib/libthread.so.1
No symbol table info available.
#1  0x000815b8 in dns_iptable_detach (tabp=0x17332e0) at iptable.c:152
        tab = (dns_iptable_t *) 0x105cd18
        refs = 1773520
#2  0x00069638 in destroy (dacl=0x17332b8) at acl.c:458
        i = 0
#3  0x0006977c in dns_acl_detach (aclp=0x205a58) at acl.c:471
        acl = (dns_acl_t *) 0x17332b8
        refs = 0
#4  0x0011e010 in destroy (view=0x205930) at view.c:295
        name = (dns_name_t *) 0x2059e8
        i = 256
#5  0x0011b5d8 in req_shutdown (task=0x205970, event=0x0) at view.c:542
        view = (dns_view_t *) 0x205930
        done = isc_boolean_true
#6  0x0016ec84 in dispatch (manager=0x1eacb8) at task.c:862
        dispatch_count = 2
        done = isc_boolean_false
        requeue = isc_boolean_false
        finished = isc_boolean_false
#7  0x0016ee30 in run (uap=0x1eacb8) at task.c:1005
No locals.
#8  0xff355378 in _lwp_start () from /usr/lib/libthread.so.1
No symbol table info available.
#9  0xff355378 in _lwp_start () from /usr/lib/libthread.so.1
No symbol table info available.
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 4 (process 84273    ):
#0  0xff21cadc in _libc_sigtimedwait () from /usr/lib/libc.so.1
No symbol table info available.
#1  0xff34e49c in sigwait () from /usr/lib/libthread.so.1
No symbol table info available.
#2  0xff2177c0 in __posix_sigwait () from /usr/lib/libc.so.1
No symbol table info available.
#3  0x00171d50 in isc_app_run () at app.c:503
        result = -4195472
        event = (isc_event_t *) 0x0
        next_event = (isc_event_t *) 0xffbffb70
        task = (isc_task_t *) 0x0
        sset = {__sigbits = {16387, 0, 0, 0}}
        strbuf = "\000\000\b<\000\000\b<\000\000\b<\000\000\000\000ÿÿqqÿÿqqÿÿqqÿÿqq\000\000\000\000\035Íe\000ÿ¿û\030\000\003Qð\000\035\214\000\000\030S¨\000\037\034¸\000\035\214\000\000\030`\000\000\035\216ð", '\0' <repeats 17 times>, "\035\214\000\000\035\217\030\000\000\000\000\000\000\000\005\000\000\000n\000\000\000lÿ¿û\220\000\003V8\000\000\000\000\000\035¶p\000\000\000\000ÿ\024â@"
        sig = 1936384
#4  0x0003564c in main (argc=1593344, argv=0x185000) at main.c:879
        result = 0

Thread 3 (process 346417    ):
#0  0xff21e23c in _poll () from /usr/lib/libc.so.1
No symbol table info available.
#1  0xff1d24d8 in _select () from /usr/lib/libc.so.1
No symbol table info available.
#2  0xff34e1cc in select () from /usr/lib/libthread.so.1
No symbol table info available.
#3  0x0017cedc in watcher (uap=0x226cc0) at socket.c:2527
        done = isc_boolean_false
        ctlfd = 8488
        cc = 1825792
        readfds = {fds_bits = {938475552, -255050557, 98304, 
    0 <repeats 29 times>}}
        writefds = {fds_bits = {0 <repeats 32 times>}}
        msg = -2
        fd = -1
        maxfd = 116
        strbuf = '\0' <repeats 127 times>
#4  0xff355378 in _lwp_start () from /usr/lib/libthread.so.1
No symbol table info available.
#5  0xff355378 in _lwp_start () from /usr/lib/libthread.so.1
No symbol table info available.
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 2 (process 280881    ):
#0  0xff3554b4 in __lwp_park () from /usr/lib/libthread.so.1
No symbol table info available.
#1  0xff3526c0 in cond_wait_queue () from /usr/lib/libthread.so.1
No symbol table info available.
#2  0xff352c38 in cond_wait_common () from /usr/lib/libthread.so.1
No symbol table info available.
#3  0xff3530c8 in _ti_cond_timedwait () from /usr/lib/libthread.so.1
No symbol table info available.
#4  0xff3530fc in cond_timedwait () from /usr/lib/libthread.so.1
No symbol table info available.
#5  0xff35313c in pthread_cond_timedwait () from /usr/lib/libthread.so.1
No symbol table info available.
#6  0x00181cfc in isc_condition_waituntil (c=0x1eccf0, m=0x1eccc0, t=0x1ecce8)
    at condition.c:59
        presult = 77
        result = 2018544
        ts = {tv_sec = 1215536341, tv_nsec = 278829000}
        strbuf = "\000\216µí\000\000\000\000Hs Y\004²M\020ÿ\fþÈÿ5.°\000\000\000\000\000\000\000\000\000í5\210\000\000\000\001\005;F\210\000\001\000\000\000\000\000\000\000\000\000\001\000\033´\000\000\035¤\000\000\036̸ÿ\fÿ\210\000\033¸`\000\033ºà\000\000\000\000\000\000\000\000ÿ\fÿ(\000\027\016\210\000\036Ìè", '\0' <repeats 12 times>, "ÿ\fÿ(\000\027\0170\000\000\000\000\000\000\000"
#7  0x00170eb0 in run (uap=0x1eccb8) at timer.c:719
        now = {seconds = 1215536341, nanoseconds = 78794000}
        result = 77
#8  0xff355378 in _lwp_start () from /usr/lib/libthread.so.1
No symbol table info available.
#9  0xff355378 in _lwp_start () from /usr/lib/libthread.so.1
No symbol table info available.
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 1 (process 215345    ):
#0  0x0007d4d4 in dns_dispatch_detach (dispp=0x1c) at dispatch.c:1931
        killit = 28
#1  0x00104300 in disppooltimer_update (task=0x900dfd0, event=0x0)
    at resolver.c:7548
        res = (dns_resolver_t *) 0x7aa440
        addr4 = {type = {sa = {sa_family = 2, 
      sa_data = '\0' <repeats 13 times>}, sin = {sin_family = 2, sin_port = 0, 
      sin_addr = {S_un = {S_un_b = {s_b1 = 0 '\0', s_b2 = 0 '\0', 
            s_b3 = 0 '\0', s_b4 = 0 '\0'}, S_un_w = {s_w1 = 0, s_w2 = 0}, 
          S_addr = 0}}, sin_zero = "\000\000\000\000\000\000\000"}, sin6 = {
      sin6_family = 2, sin6_port = 0, sin6_flowinfo = 0, sin6_addr = {
        _S6_un = {_S6_u8 = '\0' <repeats 15 times>, _S6_u32 = {0, 0, 0, 0}, 
          __S6_align = 0}}, sin6_scope_id = 0, __sin6_src_id = 0}, sunix = {
      sun_family = 2, sun_path = '\0' <repeats 107 times>}}, length = 16, 
  link = {prev = 0xffffffff, next = 0xffffffff}}
        addr6 = {type = {sa = {sa_family = 26, 
      sa_data = '\0' <repeats 13 times>}, sin = {sin_family = 26, 
      sin_port = 0, sin_addr = {S_un = {S_un_b = {s_b1 = 0 '\0', 
            s_b2 = 0 '\0', s_b3 = 0 '\0', s_b4 = 0 '\0'}, S_un_w = {s_w1 = 0, 
            s_w2 = 0}, S_addr = 0}}, 
      sin_zero = "\000\000\000\000\000\000\000"}, sin6 = {sin6_family = 26, 
      sin6_port = 0, sin6_flowinfo = 0, sin6_addr = {_S6_un = {
          _S6_u8 = '\0' <repeats 15 times>, _S6_u32 = {0, 0, 0, 0}, 
          __S6_align = 0}}, sin6_scope_id = 0, __sin6_src_id = 0}, sunix = {
      sun_family = 26, sun_path = '\0' <repeats 107 times>}}, length = 32, 
  link = {prev = 0xffffffff, next = 0xffffffff}}
        disp4 = (dns_dispatch_t *) 0x72cbe58
        disp6 = (dns_dispatch_t *) 0x7aeb80
        result = 1933312
        nxt = 7
        attrs = 1933312
        attrmask = 28
#2  0x0016ec84 in dispatch (manager=0x1eacb8) at task.c:862
        dispatch_count = 0
        done = isc_boolean_false
        requeue = isc_boolean_false
        finished = isc_boolean_false
#3  0x0016ee30 in run (uap=0x1eacb8) at task.c:1005
No locals.
#4  0xff355378 in _lwp_start () from /usr/lib/libthread.so.1
No symbol table info available.
#5  0xff355378 in _lwp_start () from /usr/lib/libthread.so.1
No symbol table info available.
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) quit

Is there any other information that I can provide to debug this?
Thanks.
----------------------------------------------------------------------
Barry S. Finkel
Computing and Information Systems Division
Argonne National Laboratory          Phone:    +1 (630) 252-7277
9700 South Cass Avenue               Facsimile:+1 (630) 252-4601
Building 222, Room D209              Internet: BSFinkel at anl.gov
Argonne, IL   60439-4828             IBMMAIL:  I1004994

