BIND 9.2.4rc8 Multithreading on Win32

Mon Sep 13 01:46:28 UTC 2004

At 12:21 PM 9/12/2004, Danny Mayer wrote:
>At 10:42 AM 9/11/2004, Vinny Abello wrote:
>>At 11:29 PM 9/10/2004, you wrote:
>>>>OK... The unfortunate part is any RBL that I serve secondary for works 
>>>>in either one of two ways. The first is AXFR. I don't have control over this.
>>>
>>>You can't get them to run BIND 9 and use IXFR?
>>
>>It's a free service (sbl.spamhaus.org). I never even contacted them. They 
>>allow anyone to do zone transfers and they are not IXFR (based on the 
>>information I see in the logs anyway) so I don't believe I have any say 
>>in it, unfortunately. I could try to reach out to them to find out.
>
>The IXFR is initiated by the client, not the server. If the server is
>unable to handle IXFR the client falls back to AXFR. If they're running
>BIND 9 there should be no problem.

If the server isn't lying, they're running the same version that I am now: 
9.3.0rc4. I was assuming it was AXFR because I didn't see any mention of IXFR 
in the logs... but now that I'm looking at it, I don't see either. Is there 
an easy way to tell from the slave side what's happening? I don't see much 
reference to what the zone transfer types are in the logs unless they're 
from my server to a slave.

>>>>  The second way is rsync where it's reloaded in a script after being 
>>>> transferred. Both these methods result in BIND not responding to 
>>>> queries for a period of time. I've seen that this is an issue that's 
>>>> gone back as far as RBL zones have existed and people have been trying 
>>>> to use them in BIND. It seems a lot of people use alternate programs 
>>>> that handle this better. I just can't understand after all these years 
>>>> why BIND is unable to both load a large file and continue to respond 
>>>> to queries. MSDNS handles this just fine as does other DNS software, 
>>>> but I prefer BIND of course. :)
>>>
>>>Multithreading and multiple CPUs largely solve this. I don't know what you
>>>are seeing so it's hard to answer. I had set it up to have one more worker
>>>thread than CPUs (n+1) to allow for situations like this.
>>
>>It doesn't seem to work like that at all unless you have more than one 
>>CPU and in certain situations only.
>
>Just to clarify, what I said about worker threads only applies to the I/O, not
>the tasks that need to be managed. I did check the code and multithreading
>is enabled on Windows so the task manager should be using more than
>one thread to handle the zone transfer and handle queries. This should
>be okay on a multi-CPU system at least. There may be a bug in the task
>code but that's much harder to figure out.

Yes, it seems fine on multi-CPU systems: normal large zone transfers 
don't seem to cause any interruption at all on my multi-processor system, 
just on my two single-processor ones. Yet you said it was n+1. Oddly enough, 
the 2 (4 logical) processor system doesn't display this problem, and 
there it would seem there are 5 worker threads for I/O. On the 
single-processor ones it would be 2, if I understand correctly. Despite the 
two on the single-processor system (the default, if I'm understanding right) 
it still stops responding to queries. As I think you are trying to explain, this may 
have nothing to do with worker threads for I/O, but what else would be 
different between the multiprocessor system and the single processor 
systems as far as BIND goes that might affect this?
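
To keep the thread counts straight, here is a tiny Python sketch of the n+1 
rule as described above. The function name is made up for illustration and 
nothing here is taken from the BIND source; it only restates the arithmetic 
for a single-processor box and the 4-logical-processor box.

```python
def io_worker_threads(cpus):
    """I/O worker threads under the n+1 rule described in this thread.

    Assumption: this mirrors the description only, not actual BIND code.
    """
    if cpus < 1:
        raise ValueError("need at least one CPU")
    return cpus + 1


# One single-processor machine and one machine with 4 logical processors.
for cpus in (1, 4):
    print(cpus, "CPU(s) ->", io_worker_threads(cpus), "I/O worker threads")
```

Since even the single-processor case gets two I/O threads, the thread count 
alone would not seem to explain the stall.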

>>As far as the rsync RBL, what I am seeing is if I do a "rndc reload 
>>zonename" on the server after the rsync is done, my server stops 
>>responding to queries for a while and the CPU usage rockets on a single 
>>CPU (actually it bounces around from one to another over the span of time 
>>this happens). This is even on a machine with two hyperthreaded 
>>processors. (Windows 2003).
>>
>>The zone is around 31MB in size. Even though BIND detects "found 4 CPUs, 
>>using 4 worker threads", whenever that zone is reloaded, I can query the 
>>server all I like and it does not reply even for zones it is master for. 
>>As long as I see the one CPU pegged, it will not respond (even though 
>>there are three other "processors" doing nothing). This is also on BIND 
>>9.3.0rc4 on Windows which I currently have upgraded to (I like some of 
>>the additional logging information and check-names and am reading up on 
>>other new features).
>>
>>On the other machines, which have a single processor, I noted that when an 
>>AXFR zone transfer occurs they also stop responding to queries for a brief 
>>time, despite your n+1 worker thread design based on the number of CPUs. That 
>>zone is a lot smaller (around 5MB) and I've never detected a problem on 
>>the machine with the two hyperthreaded processors having this issue when 
>>doing a zone transfer, only the ones with a single CPU, so that is kind 
>>of interesting.
>
>As I said above the n+1 is for worker threads to handle the I/O. Anything else
>is related to the way tasks are multithreaded and I don't know exactly what
>goes on there.

Sure, I understand that now. :)

>>My synopsis is that when doing large AXFR zone transfers, multiple CPUs 
>>(or worker threads) help in keeping BIND responding to queries. However, 
>>if a reload or reconfig is done via rndc that causes BIND to load a large 
>>zone, this does not apply and it will still stop responding to queries. 
>>That is basically what I have observed, again, even with multiple worker 
>>threads. Is there a reason for this or is this a flaw/bug? And why does 
>>this happen even with zone transfers on a single CPU server when it's 
>>supposed to be doing n+1 worker threads?
>
>It's possible that there is a bug, but not in those worker threads, which
>don't deal with file I/O.

It would sound that way... I was just curious whether it was limited to Win32 
platforms or not, seeing as I'm sure there are many BIND servers (mostly 
running on some *nix variant) that load large zones. I thought I'd heard 
that this problem existed within BIND itself, based on research I've done 
and combing through the list. That's why I was wondering how the GTLD 
servers reload such huge zones without downtime in answering queries... 
seeing that in my experience (although Windows and *nix are very different) 
BIND stops responding to queries when reloading a very large zone from a 
file it is master for.
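
For anyone who wants to measure the stall rather than eyeball it, below is a 
rough Python sketch that sends a plain UDP query at a fixed interval and 
counts the ones that get no reply. It uses only the standard library and 
builds the query packet by hand in the standard DNS wire format. The function 
names, the 1 second timeout and the 0.5 second interval are my own choices 
for illustration, not anything BIND provides.

```python
import socket
import struct
import time


def build_query(name, qid):
    """Build a minimal DNS query for the A record of `name`."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def probe(server, name, timeout=1.0, port=53):
    """Send one UDP query; return round-trip seconds, or None if unanswered."""
    qid = int(time.time() * 1000) & 0xFFFF
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(build_query(name, qid), (server, port))
            while True:
                data, _ = sock.recvfrom(512)
                # Accept only a reply carrying our query ID.
                if len(data) >= 2 and struct.unpack("!H", data[:2])[0] == qid:
                    return time.monotonic() - start
        except OSError:
            # Covers timeouts and, on Windows, a reset from a closed port.
            return None


def watch(server, name, seconds=60, interval=0.5):
    """Probe repeatedly; print a line per unanswered query, return the count."""
    missed = 0
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        if probe(server, name) is None:
            missed += 1
            print(time.strftime("%H:%M:%S"), "no reply")
        time.sleep(interval)
    return missed
```

Starting something like watch("127.0.0.1", "example.com") just before the 
"rndc reload zonename" (with a name the server is master for in place of 
example.com) should show how long the gap lasts on each machine, which would 
make the single-CPU and multi-CPU cases easier to compare.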

By the way, thanks for all your detailed information and time in responding 
to these questions, Danny! :)

Vinny Abello
Network Engineer
Server Management
vinny at tellurian.com
(973)300-9211 x 125
(973)940-6125 (Direct)
PGP Key Fingerprint: 3BC5 9A48 FC78 03D3 82E0  E935 5325 FBCB 0100 977A

Tellurian Networks - The Ultimate Internet Connection
http://www.tellurian.com (888)TELLURIAN

There are 10 kinds of people in the world. Those who understand binary and 
those that don't.