BIND 9.x caching Performance under heavy loads

Wed Mar 2 22:33:45 UTC 2005

On Wed, 02 Mar 2005 10:26:12 +0000, Jim Reid <jim at rfc1035.com> wrote:
> >>>>> "Srini" == Srini Avirneni <avirsri at gmail.com> writes:
> 
>    Srini> We are looking to hear people's experiences with heavily
>    Srini> loaded caching servers (> 2500 queries/Sec) running on
>    Srini> linux (RH ES 3.0 Kernel 2.4.x). We have used various builds
>    Srini> of BIND 9.x (9.2.3, 9.2.4, 9.3.0, 9.3.1rc1) and also 8.4.5.
> 
>    Srini> Our results show in all cases that BIND falls on its face
>    Srini> with CPU time increasing from 20% at startup to 80+ % over
>    Srini> 24 hour period.  Cache Size typically ranges from 800MB to
>    Srini> 1GB. Experimenting with lower cache sizes showed no
>    Srini> improvement.
> 
> If you restrict the size of the name server's cache, the name server
> has to do much more work. That should be obvious. First of all, the
> server has to maintain the cache at the appointed (self-imposed) limit.
> Secondly, it could well be cache-thrashing: discarding perfectly good
> data that it's been forced to throw away only to have to retrieve it
> again. Finally, the name server is almost certainly having to waste
> resources -- bandwidth, CPU, internal memory state for queries under
> resolution, etc, etc -- because it's having to resolve more queries
> that it might have been able to answer out of cache if it had been big
> enough. BTW the size of most caches tends to stabilise in a couple of
> days after start-up: TTL values are usually no more than a day or so,
> perhaps 1 week at the outside.

In our observations, setting a cache size limit did appear to make the
servers fall over more quickly (I will define this better later in the
response).

Cache sizes stabilize in the 800MB-1.1GB range within a day.

> 
> Please be more specific. You seem to be saying that "bad things happen
> when your name servers are under heavy load". This shouldn't exactly
> be a surprise. What actual problem are you experiencing and/or
> trying to solve?

Yes, it is a surprise. It's a surprise to see BIND, as a process, peak
at 25% CPU on day 1. Then, on day 2, the peak climbs to over 50%. Load
has not changed.

By day 3, CPU will exceed 75% and BIND will no longer respond in
a meaningful way (< 500 queries/sec). This is on very fast
hardware.

This was with peak query rates of 2500/sec ranging down to 1200/sec.

As load increased, CPU usage ran away more quickly. One would expect
a BIND server that runs at 25% CPU during peak to stay within a few
points of that figure continually, and to scale somewhat linearly.

The query loads we see are fairly consistent. 
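
To put the figures above in perspective, here is a back-of-the-envelope
sketch in Python. It assumes CPU cost should be roughly proportional to
query rate, and it treats the day 3 numbers (75% CPU, 500 queries/sec)
as bounds, so the result is a lower limit. The function name is made up
for illustration.

```python
def cpu_per_query(cpu_percent, queries_per_sec):
    """CPU percentage consumed per query/sec of load (hypothetical helper)."""
    return cpu_percent / queries_per_sec

# Day 1: about 25% CPU at the 2500 queries/sec peak.
day1 = cpu_per_query(25.0, 2500.0)

# Day 3: over 75% CPU while answering fewer than 500 queries/sec.
# Using exactly 75% and 500 qps understates the real cost.
day3 = cpu_per_query(75.0, 500.0)

# If cost per query stayed constant, this ratio would be close to 1.
ratio = day3 / day1
print("CPU cost per query grew by a factor of at least %.0f" % ratio)
```

With linear scaling the ratio would stay near 1; these numbers put it
at roughly 15 or more.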

> 
> Query rates of 2500 qps are rather high. Is this a real, operational
> load or something you've cooked up in a testbed? If it's the former,

It's real load. High is a relative term, which was one reason
for submitting this question. These are not small servers, but
dual PIVs with plenty of RAM. Notice I said servers, not
server. :)

This is the crux of the question: have others running BIND 9.x under
substantial load noticed any similar behavior?

I appreciate the replies,

- s