bind 9.6.2 with threads hangs

Mon Mar 22 13:17:12 UTC 2010

2010/3/22 Cathy Almond <cathya at isc.org>

> Fabien Seisen wrote:
>
> > yes, max-cache-size 512M but named process takes ~900MB
>
> The extra memory is for keeping track of recursive clients (i.e.
> in-progress client queries).
>

ok

> This doesn't sound like a hugely loaded server,

Exactly. In my own tests (with "real life" queries), the server can handle
~70000 queries/s with a response time of ~1ms at 70% CPU and no
packet loss.

> else it's somewhat throttled (not particularly large cache and probably
> default limit on recursive clients).  What kind of query rates do you
> have?  Do you get any logging that suggests resource problems?  If so,
> you might need to increase some of the limits.
>

We have a pool of several more or less identical servers with a
load balancer in front.

On average, each server gets 1800 queries/s and 4000 at peak.

The problem occurs every few weeks and never on all servers at a time.

The recursive clients configuration has not been modified (rndc status:
recursive clients: 188/2900/3000), and we see:
- on average: 200 recursive clients
- at peak: 600
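
For context, the limits discussed above map to options statements along
these lines. This is only a sketch: the values mirror figures mentioned in
this thread (the 512M cache limit and the 3000 shown as the last field of
the rndc status output) and assume the limit is set explicitly; it is not
copied from our actual named.conf.

```
options {
    // cache limit quoted earlier in this thread
    max-cache-size 512M;
    // assumed explicit hard limit; rndc status reports current/soft/hard
    recursive-clients 3000;
};
```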

> It's intriguing that you're seeing the same issues on two bind versions
> and two OS (and that other people's experience is different from yours)
>

only Solaris 10
- Solaris 10 U6 with bind 9.5.1-P3 with threads compiled with SUNSpro 12
- Solaris 10 U6 with bind 9.6.2      with threads compiled with gcc

> - it suggests to me that it's specific to your configuration or client
> base/queries or your environment.
>

We get real-life queries from customers (evil?).

A simple "rndc flush" revives named.

Perhaps a badly formatted packet freezes named or creates a cache deadlock.

Can something go wrong in the cache?

I am not fluent with core files, but I have one on hand.

> For troubleshooting I'd start by looking at the logging output - if
> you've got any categories going to null, un-suppress them temporarily;
> and add query-errors (see 9.6.2 ARM).  Then perhaps do some sampling of
> network traffic (perhaps there's a UDP message size/fragmentation issue)
>  to see what's happening (or not).
>

All categories go to non-null channels, and we do not use any
9.6.2-specific configuration. I did not notice any weird log messages
(besides the regular "shutting down due to TCP receive error:
202.96.209.6#53: connection reset").

here is our log config:
    category client { client.log; };
    category config { config.log; default_syslog; };
    category database { database.log; default_syslog; };
    category default { default.log; default_syslog; };
    category delegation-only { delegation-only.log; };
    category dispatch { dispatch.log; };
    category general { default.log; };
    category lame-servers { lamers.log; };
    category network { network.log; };
    category notify { notify.log; default_syslog; };
    category queries { queries.log; };
    category resolver { resolver.log; };
    category security { security; };
    category unmatched { unmatched.log; };
    category update { update.log; };
    category xfer-in { xfer-in.log; default_syslog; };
    category xfer-out { xfer-out.log; default_syslog; };
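
The query-errors category suggested above is not in this list yet. A
minimal sketch of how it could be added follows; the channel name, file
name, rotation values, and debug level are assumptions, not part of our
current configuration. As far as I understand the ARM, the more detailed
query-errors messages are logged at debug levels, so they should only
appear while named is in debugging mode (for example after "rndc trace").

```
logging {
    // hypothetical channel, named after the convention used above
    channel query-errors.log {
        file "query-errors.log" versions 5 size 20m;
        severity debug 2;   // only effective while debugging is enabled
        print-time yes;
    };
    category query-errors { query-errors.log; };
};
```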

-- 
Fabien