x86 Linux bind9: futex takes much system CPU time and returns errors

Tue Jan 18 05:21:34 UTC 2005

Hi list,

I am using a 4-CPU x86 machine running Linux for bind9 testing.
During a bind9 performance test I monitored one of the named threads
with oprofile and strace for a while and got the output below:

CPU: P4 / Xeon with 2 hyper-threads, speed 2995.29 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
samples  %        app name                 symbol name
6228485  39.6674  libdns.so.16.0.0         (no symbols)
2569301  16.3631  tg3                      (no symbols)
1851630  11.7925  libisc.so.7.1.5          (no symbols)
1554621   9.9009  libpthread-2.3.4.so      pthread_mutex_lock
1481064   9.4325  named                    (no symbols)
902274    5.7463  libpthread-2.3.4.so      pthread_mutex_unlock
281170    1.7907  ld-2.3.4.so              anonymous symbol from section .text
188602    1.2012  oprofiled                (no symbols)
148667    0.9468  libc-2.3.4.so            memcpy
89062     0.5672  oprofile                 (no symbols)
88651     0.5646  libpthread-2.3.4.so      __lll_mutex_lock_wait
47170     0.3004  libpthread-2.3.4.so      __pthread_disable_asynccancel
43055     0.2742  libpthread-2.3.4.so      __pthread_enable_asynccancel
30374     0.1934  libpthread-2.3.4.so      __errno_location
27777     0.1769  libpthread-2.3.4.so      sendmsg
25916     0.1651  libpthread-2.3.4.so      __i686.get_pc_thunk.bx
19277     0.1228  libc-2.3.4.so            gettimeofday

# strace -p 1797 -c
Process 1797 attached - interrupt to quit
Process 1797 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 69.11    5.681319          13    442646    137491 futex
 19.30    1.586145          13    118467           sendmsg
 10.86    0.892565           8    118579        72 recvmsg
  0.72    0.059597           3     23725           gettimeofday
  0.01    0.000738          15        48           write
------ ----------- ----------- --------- --------- ----------------
100.00    8.220364                703465    137563 total

I traced the futex errors. There are many calls like this one:
'futex(0x818f1bc, FUTEX_WAIT, 585151, NULL) = -1 EAGAIN (Resource temporarily unavailable)'.
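
To put the strace summary in proportion, here is a small sketch that derives a few ratios from the figures in that table. The variable names are illustrative only, and treating each sendmsg call as one answered query is an assumption, not something the trace itself confirms.

```python
# Figures copied from the `strace -p 1797 -c` summary above.
futex_calls = 442646
futex_errors = 137491
futex_seconds = 5.681319
sendmsg_calls = 118467

# Share of futex calls that returned an error (EAGAIN in the traced samples).
failed_share = futex_errors / futex_calls

# futex calls per sendmsg; assumes one sendmsg per answered query.
futex_per_reply = futex_calls / sendmsg_calls

# Average time per futex call in microseconds (strace rounds this to 13).
usec_per_futex = futex_seconds / futex_calls * 1e6

print(f"futex calls that returned an error: {failed_share:.1%}")
print(f"futex calls per sendmsg:            {futex_per_reply:.2f}")
print(f"average time per futex call:        {usec_per_futex:.1f} usec")
```

By these numbers, roughly a third of the futex calls in this thread failed, and the thread made between three and four futex calls for every reply it sent.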

At this time the CPU utilization is:
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total   39.1%    0.0%   39.1%   1.0%    15.9%    0.0%    4.7%
           cpu00   30.5%    0.0%   20.3%   4.2%    44.9%    0.0%    0.0%
           cpu01   39.8%    0.0%   48.3%   0.0%     5.9%    0.0%    5.9%
           cpu02   45.1%    0.0%   41.7%   0.0%     5.9%    0.0%    7.2%
           cpu03   41.1%    0.0%   46.1%   0.0%     6.7%    0.0%    5.9%
Mem:  7997692k av,  298108k used, 7699584k free,       0k shrd,   11956k buff
        49004k active,             190420k inactive
Swap: 2096472k av,       0k used, 2096472k free                  227280k cached

I cannot figure out what the named threads are competing for. One
possibility is the data cache, but my queryperf input file has 100k
records and the domain data file has another 100k records, so that
does not seem to make sense: for read operations like these, I don't
believe one named thread would lock the whole data cache. Another
concern is the network. named uses UDP for communication, and while
the test was running netstat showed the following:
Proto Recv-Q Send-Q Local Address               Foreign Address
udp     4736    296 10.101.3.103:domain         *:*
udp     3552    296 10.101.2.103:domain         *:*
udp     4440      0 10.101.1.103:domain         *:*
udp     4440      0 10.100.0.19:domain          *:*

I've tried both Broadcom and Intel 1000 Mbit cards, with no difference.
I've also tried bind 9.2.4 and RHEL3, RHEL4, SLES9 and SLES8, again
with no difference. Do you have any suggestions?

thx.