LRU fail after switch 9.4.1 -> 9.5.0p1 ?

Fri Sep 12 16:31:27 UTC 2008

On Fri, 12 Sep 2008 14:41:28 +0200, Stefan Schmidt <stefan.schmidt at freenet.ag> said:

> Our postmasters recently switched the recursive nameds on their SMTP
> receivers from 9.4.1 to 9.5.0p1, and I noticed a five-fold increase in
> queries against two of our authoritative servers. It turned out that
> the mail servers' nameds were querying repeatedly for the zone they
> reside in, meaning the cache apparently was either not used at all for
> that zone or was no longer LRU.
> Raising the max cache size from 30 to 100 MB and restarting the process
> seems to have helped.

> My question is: Are there high watermark situations where the cache in
> 9.5 does not do LRU anymore or is it more likely that this was just a
> 'glitch' in the cache (on 16 independent machines, btw)?

This sounds a lot like something that I've been seeing with the new
caching system in 9.5.  I think it works like this.  There is a high
water mark and a low water mark.  When the high water mark is crossed,
the cache aggressively deletes RRsets from the end of the hash buckets
used for LRU expiry.  It does this whenever a new RRset is placed into
the cache.  The RRset to be expunged is chosen from the same hash
bucket into which the new RRset is installed.  Hashing is done on the
name only.  Once the cache size drops below the low water mark, normal
LRU purging is resumed.
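
The behaviour described above can be sketched as a toy Python model.
This is only an illustration of the description in this message, not
BIND's actual code: the class and method names are made up, the cache
size is counted in entries rather than bytes, and normal LRU purging
outside the overmem phase is not modelled at all.

```python
from collections import OrderedDict
import zlib


class ToyCache:
    """Hypothetical model of the overmem phase; not BIND's implementation."""

    def __init__(self, nbuckets, high_water, low_water):
        # Each bucket is an LRU list: oldest entry first, newest last.
        self.buckets = [OrderedDict() for _ in range(nbuckets)]
        self.high_water = high_water
        self.low_water = low_water
        self.size = 0          # assumption: size counted in entries
        self.overmem = False

    def _bucket(self, name):
        # Hashing is done on the owner name only, so every RRset of a
        # name (A, AAAA, ...) lands in the same bucket.
        return self.buckets[zlib.crc32(name.encode()) % len(self.buckets)]

    def add(self, name, rrtype):
        """Insert an RRset; return the key purged to make room, or None."""
        name = name.lower()
        bucket = self._bucket(name)
        key = (name, rrtype)
        if key in bucket:
            bucket.move_to_end(key)    # already cached: just refresh it
            return None
        # Hysteresis: enter overmem at the high water mark and leave it
        # only once the size has dropped below the low water mark.
        if self.size >= self.high_water:
            self.overmem = True
        elif self.size < self.low_water:
            self.overmem = False
        purged = None
        if self.overmem and bucket:
            # The victim is the oldest entry of the *same* bucket.
            purged, _ = bucket.popitem(last=False)
            self.size -= 1
        bucket[key] = True
        self.size += 1
        return purged

    def expire(self, name, rrtype):
        """Stand-in for ordinary expiry, so the size can shrink."""
        name = name.lower()
        if self._bucket(name).pop((name, rrtype), None) is not None:
            self.size -= 1
```

With a single bucket and marks of 3 and 2 entries, the fourth insert
evicts the oldest entry, and the overmem flag clears again only after
enough entries have expired for the size to fall below the low mark.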

I have observed that during this "overmem" phase, some hash buckets
become almost empty and can enter a state in which different RRsets
for the same name are added and deleted in quick succession.  For
example, if a name server for a zone has A and AAAA RRsets, then they
will keep replacing each other during glue-fetching for this zone.  As
a result, you will see these records getting deleted and re-fetched a
lot depending on other queries for names in that zone.  I have even
seen intermittent SERVFAIL conditions while this was going on.
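
The A/AAAA ping-pong can be reproduced with a very small Python sketch.
It assumes a bucket that holds nothing but the RRsets of one name
server (the name ns.example and the function are hypothetical) and
counts how often the resolver would have to go back to the
authoritative server.

```python
from collections import deque


def glue_fetch_count(rounds, overmem=True):
    """Count cache misses while A and AAAA are looked up alternately.

    Hypothetical model: one hash bucket, LRU order with the oldest
    entry on the left, and one purge per insert while overmem.
    """
    bucket = deque()
    fetches = 0
    for _ in range(rounds):
        for rrtype in ("A", "AAAA"):
            key = ("ns.example", rrtype)
            if key in bucket:
                continue            # cache hit, nothing to fetch
            fetches += 1            # miss: query the authoritative server
            if overmem and bucket:
                bucket.popleft()    # purge from the same bucket
            bucket.append(key)
    return fetches
```

In this model every lookup misses while overmem is set, because each
insert evicts the sibling RRset; without purging, only the first two
lookups miss.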

There is a ticket open for this problem (17628) but it has been
stalled by the anti-cache-poisoning activities.  Jinmei has
acknowledged that this is a problem and should be fixed, though.  I
wrote a crude patch (no purging is performed if the owner name of the
to-be-deleted RRset is the same as that of the new RRset) which has
been working fine for a couple of weeks.
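
The idea behind the patch, expressed as a Python sketch rather than
the actual C change: the function name and the data layout (one
OrderedDict per bucket, oldest entry first) are assumptions made for
illustration only.

```python
from collections import OrderedDict


def add_with_guard(bucket, name, rrtype):
    """Insert (name, rrtype) into an overmem bucket.

    The oldest entry is purged unless it has the same owner name as
    the new RRset. Returns the purged key, or None if nothing was
    purged. Hypothetical helper, not BIND's code.
    """
    name = name.lower()             # DNS names compare case-insensitively
    purged = None
    if bucket:
        victim = next(iter(bucket))     # oldest entry in the bucket
        if victim[0] != name:           # the guard: spare sibling RRsets
            del bucket[victim]
            purged = victim
    bucket[(name, rrtype)] = True
    bucket.move_to_end((name, rrtype))  # newest entry goes last
    return purged
```

With the guard in place, inserting the AAAA RRset for a name no longer
evicts its A RRset; an RRset for a different name in the same bucket
still triggers a purge as before.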

-- 
Alex