Strange Problem: Caching nameservers stopped working properly

Mark Andrews Mark_Andrews at isc.org
Wed Aug 15 02:46:13 UTC 2007


> I have a really strange problem. I have several servers with bind
> 9.2.x and bind 9.4.x running as caching nameservers, and about a week
> ago many of them stopped working properly. They stopped being able to
> perform certain lookups, and became really slow to respond to certain
> queries.
> 
> Bind is identically configured across all the machines, but they are
> spread out over four different subnets. All of the servers on two of
> the subnets still work fine, whereas all of the servers on the other
> two subnets are having problems.
> 
> Here's the actual output of two queries, first run on one of the
> servers without any problem, then on a server with the problem. I
> included one query type that works on the broken server just to prove
> that bind is indeed running:
> 
> ============ Working Server ============
> 
> $ host -t ns w3.org localhost
> Using domain server:
> Name: localhost
> Address: 127.0.0.1#53
> Aliases:
> 
> w3.org name server ns3.w3.org.
> w3.org name server ns1.w3.org.
> w3.org name server ns2.w3.org.
> 
> 
> $ host -t ns zen.spamhaus.org localhost
> Using domain server:
> Name: localhost
> Address: 127.0.0.1#53
> Aliases:
> 
> zen.spamhaus.org name server o.ns.spamhaus.org.
> zen.spamhaus.org name server q.ns.spamhaus.org.
> ...
> zen.spamhaus.org name server n.ns.spamhaus.org.
> 
> 
> ============ Broken Server =============
> 
> $ host -t ns w3.org localhost
> Using domain server:
> Name: localhost
> Address: 127.0.0.1#53
> Aliases:
> 
> w3.org name server ns3.w3.org.
> w3.org name server ns1.w3.org.
> w3.org name server ns2.w3.org.
> 
> 
> $ host -t ns zen.spamhaus.org localhost
> ;; connection timed out; no servers could be reached
> 
> ==============================
> 
> You can see that on the "broken" server, the second query for the
> zen.spamhaus.org nameservers timed out. This is very consistent;
> lookups that work always work, and lookups that are broken are always
> broken. The problem is -- I cannot figure out any pattern between the
> queries that work and the ones that don't.
> 
> All the servers are identically configured, and the problem started at
> the same time across all the servers, so that seems to rule out a
> software or hardware issue.

	No.  It actually rules in a hardware problem in the routers
	and firewalls.
 
> The key point seems to be that all the servers that are failing are on
> two certain subnets, and all the servers that are working are on two
> different subnets.

	If named doesn't get answers you will get timeouts.  I'd
	look at packet traces between the problem boxes and the
	servers for the problem zones.
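
	Alongside a packet trace, a rough first check is to send a
	non-recursive query straight from a problem box to one of
	the zone's servers, bypassing named.  The Python sketch below
	is only an illustration: build_query and probe are made-up
	names, and the server address has to be looked up from one of
	the working machines.  A None result means no reply came back
	within the timeout.

```python
import random
import socket
import struct


def build_query(name, qtype=2, qid=None):
    """Build a minimal DNS query packet (qtype 2 = NS, class IN, RD clear)."""
    if qid is None:
        qid = random.randint(0, 0xFFFF)
    # Header: ID, flags (0 = standard query), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0
    header = struct.pack("!HHHHHH", qid, 0, 1, 0, 0, 0)
    # QNAME: each label is prefixed with its length, terminated by a zero byte
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.split(".")
        if label
    )
    question = labels + b"\x00" + struct.pack("!HH", qtype, 1)
    return header + question


def probe(server_ip, name, qtype=2, timeout=3.0):
    """Send one UDP query to server_ip; return the RCODE, or None if no usable reply."""
    packet = build_query(name, qtype)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(packet, (server_ip, 53))
            reply, _ = sock.recvfrom(4096)
        except OSError:
            # Covers timeouts as well as unreachable-network errors
            return None
    # Ignore replies that are too short or do not match our query ID
    if len(reply) < 12 or reply[:2] != packet[:2]:
        return None
    flags = struct.unpack("!H", reply[2:4])[0]
    return flags & 0x000F
```

	For example, probe(addr, "zen.spamhaus.org") with addr set
	to the address of one of the spamhaus.org servers.  If that
	returns 0 on a working box and None on a broken one, the
	replies are being lost somewhere on the path rather than
	inside named.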
 
> I've run an strace on the named process while it's failing and it
> gives the following output:
> 
> ==========  strace -fp <PID> =============
> 
> recvmsg(24, {msg_name(16)={sa_family=AF_INET, sin_port=htons(53),
> sin_addr=inet_addr("192.35.51.32")},
> msg_iov(1)=[{"T\35\204\0\0\1\0\1\0\10\0\n\5henna\4ARIN\3NET\0\0\1\0\1"...,
> 4096}], msg_controllen=20, msg_control=0x81c95e8, , msg_flags=0}, 0) =
> 376
> brk(0x8201000)                          = 0x8201000
> recvmsg(24, 0xbffff850, 0)              = -1 EAGAIN (Resource
> temporarily unavailable)
> gettimeofday({1187143611, 172447}, NULL) = 0
> gettimeofday({1187143611, 172478}, NULL) = 0
> gettimeofday({1187143611, 172508}, NULL) = 0
> gettimeofday({1187143611, 172717}, NULL) = 0
> gettimeofday({1187143611, 172759}, NULL) = 0
> gettimeofday({1187143611, 172789}, NULL) = 0
> gettimeofday({1187143611, 172823}, NULL) = 0
> gettimeofday({1187143611, 172853}, NULL) = 0
> gettimeofday({1187143611, 172910}, NULL) = 0
> gettimeofday({1187143611, 172942}, NULL) = 0
> gettimeofday({1187143611, 172973}, NULL) = 0
> gettimeofday({1187143611, 173144}, NULL) = 0
> gettimeofday({1187143611, 173185}, NULL) = 0
> gettimeofday({1187143611, 173215}, NULL) = 0
> gettimeofday({1187143611, 173243}, NULL) = 0
> gettimeofday({1187143611, 173273}, NULL) = 0
> gettimeofday({1187143611, 173329}, NULL) = 0
> gettimeofday({1187143611, 173387}, NULL) = 0
> gettimeofday({1187143611, 173443}, NULL) = 0
> gettimeofday({1187143611, 173573}, NULL) = 0
> gettimeofday({1187143611, 173663}, NULL) = 0
> gettimeofday({1187143611, 173751}, NULL) = 0
> select(25, [20 21 22 23 24], [], NULL, {0, 0}) = 1 (in [24], left {0, 0})
> gettimeofday({1187143611, 173990}, NULL) = 0
> recvmsg(24, {msg_name(16)={sa_family=AF_INET, sin_port=htons(53),
> sin_addr=inet_addr("192.35.51.32")},
> msg_iov(1)=[{"\327\217\204\0\0\1\0\1\0\10\0\n\6indigo\4ARIN\3NET\0\0"...,
> 4096}], msg_controllen=20, msg_control=0x81c95e8, , msg_flags=0}, 0) =
> 377
> recvmsg(24, 0xbffff850, 0)              = -1 EAGAIN (Resource
> temporarily unavailable)
> 
> =====================================
> 
> I'm not sure what to make of this strace output; hopefully someone
> more familiar with bind can glean useful information from it.
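
	The payload bytes strace prints are the start of each DNS
	message, so the 12-byte header can be decoded by hand.  A
	small Python sketch (decode_header is an illustrative name)
	applied to the two recvmsg payloads quoted above:

```python
import struct


def decode_header(data):
    """Decode the fixed 12-byte DNS header into a dict of its main fields."""
    ident, flags, qd, an, ns, ar = struct.unpack("!HHHHHH", data[:12])
    return {
        "id": ident,
        "is_response": bool(flags & 0x8000),    # QR bit
        "authoritative": bool(flags & 0x0400),  # AA bit
        "truncated": bool(flags & 0x0200),      # TC bit
        "rcode": flags & 0x000F,
        "questions": qd,
        "answers": an,
        "authority": ns,
        "additional": ar,
    }


# First 12 bytes of each payload, copied from the strace output;
# strace's octal escapes are also valid in a Python bytes literal.
first = b"T\35\204\0\0\1\0\1\0\10\0\n"
second = b"\327\217\204\0\0\1\0\1\0\10\0\n"

for payload in (first, second):
    print(decode_header(payload))
```

	Both decode as authoritative responses with rcode 0, one
	question, one answer, eight authority records and ten
	additional records.  So in this window named is receiving
	answers from 192.35.51.32 for the ARIN.NET names; the trace
	shows nothing about the queries that time out.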
> 
> I can provide any other information if necessary, run diagnostics, etc
> -- I just hope someone can help me figure this out. I've had to turn
> to 3rd party nameservers in the meantime.
> 
> Thanks,
> Mike
> 
> 
-- 
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742                 INTERNET: Mark_Andrews at isc.org


