BIND 9.2.4rc8 Multithreading on Win32
Vinny Abello
vinny at tellurian.com
Mon Sep 13 01:46:28 UTC 2004
At 12:21 PM 9/12/2004, Danny Mayer wrote:
>At 10:42 AM 9/11/2004, Vinny Abello wrote:
>>At 11:29 PM 9/10/2004, you wrote:
>>>>OK... The unfortunate part is any RBL that I serve secondary for works
>>>>in either one of two ways. The first is AXFR. I don't have control over this.
>>>
>>>You can't get them to run BIND 9 and use IXFR?
>>
>>It's a free service (sbl.spamhaus.org). I never even contacted them. They
>>allow anyone to do zone transfers and they are not IXFR (based on the
>>information I see in the logs anyway) so I don't believe I have any say
>>in it, unfortunately. I could try to reach out to them to find out.
>
>The IXFR is initiated by the client, not the server. If the server is
>unable to handle IXFR the client falls back to AXFR. If they're running
>BIND 9 there should be no problem.
If the server isn't lying, they're running the same version that I am now,
9.3.0rc4. I was assuming it was AXFR because I didn't see mention of IXFR
in the logs... but now that I'm looking at it, I don't see either. Is there
an easy way to tell from the slave side what's happening? I don't see much
reference to what the zone transfer types are in the logs unless they're
from my server to a slave.
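For reference, the fallback Danny describes (the client asks for IXFR first and drops back to AXFR if the server can't handle it) can be sketched roughly as below. This is only an illustration of the control flow: `transfer_zone`, `IxfrUnsupported`, and the two request callables are hypothetical names made up for this sketch, not anything from BIND's code.

```python
# Illustrative only: hypothetical names, not BIND's actual code or API.

class IxfrUnsupported(Exception):
    """Raised by the hypothetical transport when the master cannot serve IXFR."""


def transfer_zone(zone, current_serial, request_ixfr, request_axfr):
    """Try an incremental transfer first, then fall back to a full one.

    request_ixfr and request_axfr are caller-supplied callables standing in
    for the real network exchange. Returns a (method, records) tuple so the
    caller can log which transfer type was actually used.
    """
    try:
        return "IXFR", request_ixfr(zone, current_serial)
    except IxfrUnsupported:
        return "AXFR", request_axfr(zone)
```

Returning the method name alongside the records is a choice made for this sketch, so that the slave side has something to log about which transfer type took place.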
>>>> The second way is rsync where it's reloaded in a script after being
>>>> transferred. Both these methods result in BIND not responding to
>>>> queries for a period of time. I've seen that this is an issue that's
>>>> gone back as far as RBL zones have existed and people have been trying
>>>> to use them in BIND. It seems a lot of people use alternate programs
>>>> that handle this better. I just can't understand after all these years
>>>> why BIND is unable to both load a large file and continue to respond
>>>> to queries. MSDNS handles this just fine as does other DNS software,
>>>> but I prefer BIND of course. :)
>>>
>>>Multithreading and multiple CPUs largely solve this. I don't know what you
>>>are seeing so it's hard to answer. I had set it up to have one more worker
>>>thread than CPUs (n+1) to allow for situations like this.
>>
>>It doesn't seem to work like that at all unless you have more than one
>>CPU and in certain situations only.
>
>Just to clarify, what I said about worker threads applies only to the I/O, not
>the tasks that need to be managed. I did check the code and multithreading
>is enabled on Windows so the task manager should be using more than
>one thread to handle the zone transfer and handle queries. This should
>be okay on a multi-CPU system at least. There may be a bug in the task
>code but that's much harder to figure out.
Yes, it seems fine on multi-CPU systems as normal large zone transfers
don't seem to cause any interruption at all on my multi-processor system.
Just my two single processor ones. Yet you said it was n+1. Oddly enough, on
the 2 (4 logical) processor system, it doesn't display this problem, so
there it would seem there would be 5 worker threads for I/O. On the single
processor ones it would be 2 if I understand correctly. Despite the two on the
single processor system (the default if I'm understanding right) it still
stops responding to queries. As I think you are trying to explain, this may
have nothing to do with worker threads for I/O, but what else would be
different between the multiprocessor system and the single processor
systems as far as BIND goes that might affect this?
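To spell out the arithmetic I'm assuming for the n+1 rule, here is a tiny sketch. The function name `io_worker_threads` is hypothetical, and this only restates the rule as described in this thread; it is not taken from the BIND source.

```python
def io_worker_threads(cpu_count):
    """Apply the n+1 rule as described in this thread:
    one more I/O worker thread than the number of CPUs detected."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return cpu_count + 1
```

With 4 logical CPUs this gives 5, and with a single CPU it gives 2, which are the numbers I used above.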
>>As far as the rsync RBL, what I am seeing is if I do a "rndc reload
>>zonename" on the server after the rsync is done, my server stops
>>responding to queries for a while and the CPU usage rockets on a single
>>CPU (actually it bounces around from one to another over the span of time
>>this happens). This is even on a machine with two hyperthreaded
>>processors. (Windows 2003).
>>
>>The zone is around 31MB in size. Even though BIND detects "found 4 CPUs,
>>using 4 worker threads", whenever that zone is reloaded, I can query the
>>server all I like and it does not reply even for zones it is master for.
>>As long as I see the one CPU pegged, it will not respond (even though
>>there are three other "processors" doing nothing). This is also on BIND
>>9.3.0rc4 on Windows which I currently have upgraded to (I like some of
>>the additional logging information and check-names and am reading up on
>>other new features).
>>
>>The other machines with a single processor I noted that when an AXFR zone
>>transfer occurs, they also stop responding to queries for a brief amount
>>of time, despite your n+1 worker thread design based on # of CPUs. That
>>zone is a lot smaller (around 5MB) and I've never detected a problem on
>>the machine with the two hyperthreaded processors having this issue when
>>doing a zone transfer, only the ones with a single CPU, so that is kind
>>of interesting.
>
>As I said above the n+1 is for worker threads to handle the I/O. Anything else
>is related to the way tasks are multithreaded and I don't know exactly what
>goes on there.
Sure, I understand that now. :)
>>My synopsis is that when doing large AXFR zone transfers, multiple CPUs
>>(or worker threads) help in keeping BIND responding to queries. However,
>>if a reload or reconfig is done via rndc that causes BIND to load a large
>>zone, this does not apply and it will still stop responding to queries.
>>That is basically what I have observed, again, even with multiple worker
>>threads. Is there a reason for this or is this a flaw/bug? And why does
>>this happen even with zone transfers on a single CPU server when it's
>>supposed to be doing n+1 worker threads?
>
>It's possible that there is a bug but not in those worker threads which don't
>deal with file I/O.
It would sound that way... I was just curious if it was limited to Win32
platforms or not, seeing as I'm sure there are many BIND servers (mostly
running on some *nix variant) that load large zones. I thought I'd heard
this problem existed just within BIND itself based on research I've done
and combing through the list. That's why I was wondering how the GTLD
servers reload such huge zones without downtime in answering queries...
seeing that in my experience (although Windows and *nix are very different)
BIND stops responding to queries when reloading a very large zone from a
file it is master for.
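To illustrate the behavior I was expecting, here is a minimal sketch of a server that keeps answering from the old copy of a zone while a new copy is loaded on another thread, and only swaps the reference once loading is complete. This is purely an assumption about how such a design could look; `ZoneServer`, `query`, and `reload_in_background` are hypothetical names, and none of it reflects how BIND actually manages tasks or zone loading.

```python
import threading


class ZoneServer:
    """Toy model: answers lookups from an in-memory dict of records."""

    def __init__(self, records):
        self._records = records

    def query(self, name):
        # Always reads whichever complete record set is currently installed.
        return self._records.get(name)

    def reload_in_background(self, loader):
        """Run the slow loader on a separate thread, then swap in the result.

        loader is a caller-supplied callable returning the new record dict.
        Returns the thread so the caller can join() it if needed.
        """
        def work():
            new_records = loader()       # slow part happens off the query path
            self._records = new_records  # single reference swap when done

        thread = threading.Thread(target=work)
        thread.start()
        return thread
```

Queries issued while the loader is still running see the old records, and queries issued after the thread finishes see the new ones. Whether this helps on a single CPU depends on how the loading thread and the query thread get scheduled, which is really the heart of my question.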
By the way, thanks for all your detailed information and time in responding
to these questions, Danny! :)
Vinny Abello
Network Engineer
Server Management
vinny at tellurian.com
(973)300-9211 x 125
(973)940-6125 (Direct)
PGP Key Fingerprint: 3BC5 9A48 FC78 03D3 82E0 E935 5325 FBCB 0100 977A
Tellurian Networks - The Ultimate Internet Connection
http://www.tellurian.com (888)TELLURIAN
There are 10 kinds of people in the world. Those who understand binary and
those that don't.