bind 9.6.2 with threads hangs
Fabien Seisen
seisen at gmail.com
Mon Mar 22 13:17:12 UTC 2010
2010/3/22 Cathy Almond <cathya at isc.org>
> Fabien Seisen wrote:
>
> > yes, max-cache-size 512M but named process takes ~900MB
>
> The extra memory is for keeping track of recursive clients (i.e.
> in-progress client queries).
>
ok
> This doesn't sound like a hugely loaded server,
Exactly. In my own tests (with "real life" queries), the server can handle
~70000 queries/s with a response time of ~1 ms at 70% CPU and no
packet loss.
> else it's somewhat throttled (not particularly large cache and probably
> default limit on recursive clients). What kind of query rates do you have?
> Do you get any logging that suggests resource problems? If so, you might
> need to increase some of the limits.
>
We have a pool of several more or less identical servers with a
load balancer in front.
On average, each server gets 1800 queries/s, and 4000 at peak.
The problem occurs every few weeks and never on all servers at the same time.
The recursive clients config is not modified (rndc status: recursive clients:
188/2900/3000) and we have:
- on average: 200 recursive clients
- at peak: 600 recursive clients
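
To track these numbers over time instead of reading them by hand, the "recursive clients" line from rndc status can be parsed with a few lines of Python. This is only a minimal sketch: the function and class names are hypothetical, and it assumes the three fields mean current / soft limit / hard limit, as in the 188/2900/3000 line above.

```python
import re
from typing import NamedTuple, Optional


class RecursiveClients(NamedTuple):
    """The three counters from the 'recursive clients' status line.

    Assumption: the fields are current / soft limit / hard limit.
    """
    current: int
    soft_limit: int
    hard_limit: int


# Matches e.g. "recursive clients: 188/2900/3000", tolerating extra spaces.
_PATTERN = re.compile(r"recursive clients:\s*(\d+)/(\d+)/(\d+)")


def parse_recursive_clients(status_text: str) -> Optional[RecursiveClients]:
    """Return the counters found in status_text, or None if the line is absent."""
    match = _PATTERN.search(status_text)
    if match is None:
        return None
    return RecursiveClients(*(int(group) for group in match.groups()))


def usage_ratio(rc: RecursiveClients) -> float:
    """Fraction of the hard limit currently in use (0.0 if the limit is 0)."""
    return rc.current / rc.hard_limit if rc.hard_limit else 0.0
```

The status text would typically come from running rndc status periodically (for example from cron) and logging the ratio, so that a hang can be correlated with the number of in-progress recursive queries just before it.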
> It's intriguing that you're seeing the same issues on two bind versions
> and two OS (and that other people's experience is different from yours)
>
only Solaris 10
- Solaris 10 U6 with bind 9.5.1-P3 with threads compiled with SUNSpro 12
- Solaris 10 U6 with bind 9.6.2 with threads compiled with gcc
> - it suggests to me that it's specific to your configuration or client
> base/queries or your environment.
>
We get real-life queries from customers (evil?).
A simple "rndc flush" revives named.
Perhaps a badly formatted packet freezes named or creates a cache deadlock.
Can something go wrong in the cache?
I am not fluent with core files, but I have got one in my pocket.
> For troubleshooting I'd start by looking at the logging output - if
> you've got any categories going to null, un-suppress them temporarily;
> and add query-errors (see 9.6.2 ARM). Then perhaps do some sampling of
> network traffic (perhaps there's a UDP message size/fragmentation issue)
> to see what's happening (or not).
>
All categories go to non-null channels, and we do not use any
9.6.2-specific configuration.
I did not notice any weird log messages (besides the regular: shutting down
due to TCP receive error: 202.96.209.6#53: connection reset).
Here is our log config:
category client { client.log; };
category config { config.log; default_syslog; };
category database { database.log; default_syslog; };
category default { default.log; default_syslog; };
category delegation-only { delegation-only.log; };
category dispatch { dispatch.log; };
category general { default.log; };
category lame-servers { lamers.log; };
category network { network.log; };
category notify { notify.log; default_syslog; };
category queries { queries.log; };
category resolver { resolver.log; };
category security { security; };
category unmatched { unmatched.log; };
category update { update.log; };
category xfer-in { xfer-in.log; default_syslog; };
category xfer-out { xfer-out.log; default_syslog; };
--
Fabien