UDP packet loss

Pavel Urban urbanp at mlp.cz
Tue Jun 13 08:09:42 UTC 2006


>>>A quick question: did you enable threads?
> 
> 
>>As far as I know, RedHat's packages are compiled with threads enabled. 
>>This server is quite busy, about 4MBit of DNS traffic for hours...
> 
> 
> Okay, if it did not enable threads, and if you mean 'about every hour'
> or something by 'from time to time', then the described symptom would
> probably be query packet loss in the socket receive buffer during
> periodic cache cleaning.  In this case, enlarging the socket buffer
> may not help, depending on the query arrival rate and the workload of
> cleaning.
> 
> Some known workarounds are:
> 
> - decreasing DNS_CACHE_CLEANERINCREMENT in lib/dns/cache.c
>   (see http://marc.theaimsgroup.com/?l=bind-users&m=112643028426663&w=2)
> - enabling ISC_MEM_USE_INTERNAL_MALLOC (see also the above URL).  It
>   will reduce the workload of cleaning, and may implicitly remedy the
>   packet loss issue.
> 
> As for the first tuning, it is better to try BIND 9.4 because it
> automatically adjusts the corresponding parameter at run time so that
> queries won't be dropped.
> 
> But if the server enables threads, it may be a different problem and
> the above may not help because at least one thread can keep processing
> queries during cache cleaning.
> 
> 					JINMEI, Tatuya
> 					Communication Platform Lab.
> 					Corporate R&D Center, Toshiba Corp.
> 					jinmei at isl.rdc.toshiba.co.jp
> 
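
On the quoted point that enlarging the socket buffer may not help: a rough way to see why is to estimate how long a receive buffer can absorb queries while named is not reading the socket. The sketch below is only back-of-the-envelope arithmetic. The 128 KiB buffer size is a hypothetical value, the whole 4 Mbit/s is assumed to be inbound, and kernel per-packet overhead is ignored, so real capacity would be lower.

```python
def seconds_until_full(buffer_bytes, inbound_bits_per_second):
    """Time a socket receive buffer can absorb traffic with no reader.

    Ignores kernel per-packet bookkeeping overhead, so this is an
    optimistic upper bound, not a measurement.
    """
    return buffer_bytes * 8 / inbound_bits_per_second


# Hypothetical 128 KiB buffer; 4 Mbit/s taken from the thread and
# assumed (pessimistically) to be entirely inbound queries.
stall_budget = seconds_until_full(128 * 1024, 4_000_000)
print(f"buffer absorbs about {stall_budget:.2f} s of stalled reads")
```

Under these assumptions, any pause in reading that lasts longer than this budget drops packets, and doubling the buffer only doubles the budget.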

First, I'd like to thank you all for your suggestions; I really 
appreciate them. I still don't know how to solve my problem, though. I 
will test several configurations on our new server, which arrives next 
week, including kernel and BIND recompilation and the 9.4.0 
prereleases. The problem is that RedHat (our OS supplier) recommends 
using their packages and cannot support custom builds...

I'll try to give you as much information as possible.

RedHat is using these configuration options in their packages:
%configure --with-libtool --localstatedir=/var \
         --enable-threads \
         --enable-ipv6 \
         --with-pic \
         --with-openssl=/usr \
         --enable-libbind
, so I assume threads are enabled. I don't know how to confirm that on a 
running system, though. I remember some versions reported it at startup, 
but I don't see such a message anymore.
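
One possible way to check this on a running Linux system is to read the thread count the kernel reports for the named process. The sketch below assumes the kernel exposes a "Threads:" field in /proc/<pid>/status, which may not hold on every kernel. The function names are illustrative only. A count of 1 would suggest the process is effectively single-threaded.

```python
def thread_count(status_text):
    """Extract the 'Threads:' value from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key == "Threads":
            return int(value.strip())
    raise ValueError("no Threads field; this kernel may not report it")


def named_thread_count(pid):
    """Read the thread count of a live process (assumes Linux /proc)."""
    with open(f"/proc/{pid}/status") as status_file:
        return thread_count(status_file.read())
```

For example, named_thread_count(1495) would query the named process shown in the 'top' output further down.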

These problems don't appear regularly; they are followed by higher CPU 
load (about 15% idle, according to 'top'). They last for hours, but 
disappear after a named restart. When I watch lost UDP packets during 
this time, the count increases rapidly (watch 'netstat -s 
|grep "packet receive error"')
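
To turn the growing counter into a loss rate, the same figure can be sampled directly. The sketch below assumes Linux and assumes that the "packet receive errors" line of 'netstat -s' corresponds to the Udp InErrors column of /proc/net/snmp. The function names and the one-second interval are illustrative.

```python
import time


def udp_counters(snmp_text):
    """Parse the Udp header/value line pair from /proc/net/snmp text."""
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Udp:")]
    if len(rows) < 2:
        raise ValueError("no Udp header/value pair found")
    names, values = rows[0][1:], rows[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def watch(samples=10, interval=1.0, path="/proc/net/snmp"):
    """Print the UDP receive-error rate between consecutive samples."""
    previous = None
    for _ in range(samples):
        with open(path) as snmp_file:
            current = udp_counters(snmp_file.read())["InErrors"]
        if previous is not None:
            rate = (current - previous) / interval
            print(f"{rate:.1f} receive errors/s")
        previous = current
        time.sleep(interval)
```

Calling watch() while the problem is occurring would show whether packets are dropped steadily or in bursts, which may help tell periodic cache cleaning apart from sustained overload.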

I'm using this setting:

options {
         directory "/var/named";
         dump-file "/var/named/data/cache_dump.db";
         statistics-file "/var/named/data/named-stats.log";
         recursing-file "/var/named/data/named.recursing";
          // IOL-specific settings...
          allow-recursion { 127/8; 194.228/16; 192.168/16; 172.16/12;
                            10/8; 80.188/16; 160.218/16;
                            212.65.210.202/32; 193.109.182.62/32;
                            83.208/16; 85.70/15; 88.100/14; };
          blackhole { bogon_list; divna_sit; };
          notify no;
          recursion yes;
          cleaning-interval 60;
          interface-interval 0;
          recursive-clients 150000;
          serial-query-rate 5;
          datasize 4G;
};

named is running in a chroot at /var/named/chroot.

'top' output while UDP packet loss is increasing:

top - 09:54:53 up 48 days, 17:39,  1 user,  load average: 2.49, 2.28, 2.19
Tasks:  54 total,   2 running,  52 sleeping,   0 stopped,   0 zombie
Cpu0  : 84.3% us,  1.3% sy,  0.0% ni, 14.0% id,  0.0% wa,  0.3% hi,  0.0% si
Cpu1  : 70.3% us,  0.7% sy,  0.0% ni, 29.0% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:   6105240k total,  2123896k used,  3981344k free,   112476k buffers
Swap:  6289436k total,        0k used,  6289436k free,   561964k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  1495 named     24   0 1404m 1.3g 2476 S 99.9 22.7   4852:23 named
23699 root      16   0  2736  908  740 R  0.3  0.0   0:00.01 top
     1 root      16   0  2856  540  464 S  0.0  0.0   0:05.41 init
     2 root      RT   0     0    0    0 S  0.0  0.0   0:00.30 migration/0
     3 root      34  19     0    0    0 S  0.0  0.0   3:44.07 ksoftirqd/0
     4 root      RT   0     0    0    0 S  0.0  0.0   0:00.24 migration/1
     5 root      34  19     0    0    0 S  0.0  0.0   1:29.47 ksoftirqd/1
     6 root       5 -10     0    0    0 S  0.0  0.0   0:00.09 events/0
     7 root       5 -10     0    0    0 S  0.0  0.0   0:00.14 events/1
     8 root       7 -10     0    0    0 S  0.0  0.0   0:00.00 khelper
     9 root      15 -10     0    0    0 S  0.0  0.0   0:00.00 kacpid
    46 root       5 -10     0    0    0 S  0.0  0.0   0:00.00 kblockd/0
    47 root       5 -10     0    0    0 S  0.0  0.0   0:00.00 kblockd/1
    57 root      20   0     0    0    0 S  0.0  0.0   0:00.00 pdflush
    58 root      15   0     0    0    0 S  0.0  0.0   0:10.45 pdflush

System Information
                 Manufacturer: Sun Microsystems
                 Product Name: Sun Fire X4100 Server
         Processor Information
                 Type: Central Processor
                 Family: Opteron
                 Manufacturer: AMD
                 Version: AMD Opteron(tm) Processor 252
                 Current Speed: 2600 MHz
Linux dns3.iol.cz 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:54:53 EST 2006 i686 athlon i386 GNU/Linux


-- 
***********************************************************************
Pavel Urban (pavel.urban at ct.cz)
IOL system disaster
Internet OnLine, www.iol.cz (owned by Czech Telecom, www.ct.cz)
***********************************************************************
    Vegetables should not operate electronic equipment.
           Computer Stupidities, http://rinkworks.com/stupid/
***********************************************************************


