BIND 9.2.1 - Unexpected cache behavior

Ladislav Vobr lvobr at ies.etisalat.ae
Wed Nov 19 05:32:56 UTC 2003


...just some hints

Did you try using dig to check the TTL values of the cached
records? On the caching server the TTL should decrease every
second, and if the record is fetched again it will come back with
a fresh, full TTL.
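
For example, querying the caching server itself twice in a row
(the name is from the thread below, but the address and TTLs here
are only illustrative):

    $ dig @localhost dns1.rr.com A +noall +answer
    dns1.rr.com.            86400   IN      A       192.0.2.1
    $ dig @localhost dns1.rr.com A +noall +answer
    dns1.rr.com.            86391   IN      A       192.0.2.1

The second answer should show a smaller TTL than the first.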

Also check the traffic distribution: it depends on which server
your clients have configured as their primary DNS server. Is the
load balanced equally among the servers?
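
Since your config below already sets statistics-file, one rough
way to compare load between servers is to dump and read the
counters on each (paths as in your config):

    $ rndc stats
    $ tail log/server_stats

If the clients are spread evenly, the "success" and "recursion"
counters should grow at roughly the same rate everywhere.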


Ladislav

Todd Herr wrote:
> Greetings.
> 
> Apologies for the re-post.  I first posted the below on 27
> October, but never saw any responses from the list and never saw
> it even hit comp.protocols.dns.bind.  I've also got some more data
> points to add to this, in the hopes that I might get some clues to
> explain what I'm seeing.
> 
> We've got a fleet of Sun V100 servers deployed as caching-only
> DNS servers.  These servers all run BIND 9.2.1 on Solaris 8.  The
> servers have 1GB of RAM, and one CPU.
> 
> The servers experience wide fluctuations in load, based on their
> location.  That is to say, servers in location A consistently
> receive on average 875 queries per second, while at location B
> they may only receive 200 queries per second.
> 
> What we're seeing at our busy sites is behavior that says to me
> that TTL values are not being honored in the cache on these busy
> servers.  For instance, an A record with a TTL of 24 hours
> (dns1.rr.com) shows up in our logs as a createfetch entry 58
> unique times in a recent five-hour period.  I infer from
> this that the record expired from cache 57 times in that
> five-hour period.
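
A quick way to double-check a count like that, assuming the
resolver debug messages use the usual "createfetch: <name> <type>"
form and "resolver.log" stands in for wherever the resolver
category is written:

    $ grep 'createfetch: dns1.rr.com' resolver.log | wc -l

With a 24-hour TTL honored, that should print 1, or at most 2,
for any five-hour window, not 58.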
> 
> This behavior is consistent with max-cache-size being reached, I
> would guess, but max-cache-size does not appear in the named.conf
> file, so by default it would be unlimited, yes?  The server
> indicates, through top, that the named process has a size of
> 538MB, and that the server itself has 157MB of memory free.  top
> also indicates that the named process is consuming 85-90% of
> CPU resources, all in user and kernel space.
> 
> The only options explicitly specified in named.conf are:
> 
>    options {
>         directory "/";
>         dump-file "log/name_dump.db";
>         allow-query {
>           localhost;
>           ...
>         };
>         auth-nxdomain yes;
>         recursion yes;
>         statistics-file "log/server_stats";
>         allow-recursion {
>           localhost;
>           ...
>         };
>         recursive-clients 15000;
>    };
> 
> (Note: ... here substitutes for some address-match-lists that we
>  have defined.)
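
One experiment worth trying is to cap the cache explicitly and
see whether the createfetch churn changes. A minimal sketch; the
256M figure is only an example, not a tuning recommendation:

    options {
        ...
        max-cache-size 256M;    // default in 9.2 is unlimited
    };

That at least turns the cache ceiling into a known quantity
instead of leaving it to available memory.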
> 
> For logging, I'm only logging the resolver category by default;
> query logging is configured into my logging options, but is
> turned off during normal operation.
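
For reference, a typical arrangement for that looks something
like the following (file name, size and versions are
placeholders; as far as I recall, the createfetch messages appear
at resolver debug level 1):

    logging {
        channel resolver_log {
            file "log/resolver.log" versions 3 size 20m;
            print-time yes;
            severity debug 1;
        };
        category resolver { resolver_log; };
    };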
> 
> The questions I have are:
> 
> 1. Can anyone hazard a guess as to the reason I'm apparently
>    seeing entries prematurely expire from cache?  Alternatively,
>    if I'm not seeing entries prematurely expire from cache, can
>    anyone tell me what it is that I'm seeing?
> 
> 2. Are there Solaris kernel tuning options/strategies available
>    to me to make these boxes perform better?
> 
> 3. The BIND 9 ARM says that each recursive client uses on the
>    order of 20 kilobytes of RAM; does named reserve that much
>    per recursive client at startup and does that mean that 300MB
>    of my 538MB of process size is not cache but just space for
>    recursive clients?
> 
> 4. These servers are expected to be able to handle upwards of
>    1500 to 2000 queries per second.  What would be a good number
>    for recursive-clients, if not 15000?
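
On question 4: a rough back-of-the-envelope estimate is query
rate times average recursion time (the 0.2 s figure below is an
assumption, not a measurement):

    2000 queries/sec * 0.2 sec/lookup = ~400 concurrent clients

So even at your target peak, a few thousand recursive-clients
should be plenty, with 15000 leaving a very large margin for slow
upstream servers.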
> 
> Other data points, some likely not at all relevant to the problem:
> 
> A. Software was compiled 64 bit using gcc; specifically, per the
>    pkginfo statement, "gcc 3.2 with -O3 and -m64 and /dev/urandom
>    support".  I know the README says "Building with gcc is not
>    supported, unless gcc is the vendor's usual compiler", but I
>    didn't build it, and given that Sun ships gcc on the Software
>    Companion CD, this statement becomes a bit ambiguous.
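
An easy way to confirm what actually got built (binary path is a
placeholder):

    $ file /usr/local/sbin/named
    /usr/local/sbin/named: ELF 64-bit MSB executable SPARCV9 ...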
> 
> B. named is running in a chroot jail, using the -u and -t options
>    to named.
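
That is, an invocation along these lines, with the user and jail
path as placeholders:

    # named -u named -t /chroot/named

where -t chroots the process and -u drops root privileges.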
> 
> C. The option exists, and is being actively pursued, to upgrade
>    to 9.2.3.  That will take some time, however, unless compelling
>    evidence can be provided that said upgrade will solve the
>    problems we're seeing.
> 
> D. For the particular A record that I cite (dns1.rr.com) a
>    cache dump shows the A record with an expected TTL value
>    shortly after it's been fetched, and yet our logs still show
>    it being fetched again scant minutes later.
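
You can watch that from the cache side as well, using the
dump-file path from your config:

    $ rndc dumpdb
    $ grep 'dns1.rr.com' log/name_dump.db

If the dumped TTL looks right but createfetch fires again minutes
later, that points at eviction from the cache rather than normal
TTL expiry.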
> 
> The next two data points are to reflect what in my mind are odd
> system configuration choices, but ones which are likely to be
> unrelated to anything having to do with the operation of named on
> these servers.  I present them in the interests of full
> disclosure.
> 
> E. Permissions on /var/tmp are 1755 (drwxr-xr-t), owned by root;
>    this impacts the ability of normal users to edit files, but
>    I don't know if it impacts named in the slightest.
> 
> F. /usr/sbin/inetd has been rendered non-executable (permissions
>    are 400) as a "security measure".  inetd is not used to start
>    named, obviously, and I doubt that there's any interaction
>    between the two, anyway.
> 
> I mention both E and F here because I've so far been unable to
> duplicate this behavior on a host with /var/tmp set to 1777 and
> /usr/sbin/inetd set to 555; however, this host is also at kernel
> patch rev -23, and the ldd command indicated that the executable
> was running in 32-bit, not 64-bit, mode.
> 
> My suspicions are focused on a combination of the 64-bit
> compilation (with gcc) and the out-of-date kernel patch level, but
> I've not found anything in my googling to support my theories to
> date.
> 
> Thanks for any clues that you folks can provide.
> 


