NXDOMAIN returned on while updating

Sat Dec 16 01:00:57 UTC 2006

Some more info from level 1 debug while sending 10 requests per second
to an already busy DNS server:

Zone transfer:
16-Dec-2006 01:10:59.887 soa_query: zone cern.ch/IN: enter
16-Dec-2006 01:11:00.003 refresh_callback: zone cern.ch/IN: enter
16-Dec-2006 01:11:00.003 refresh_callback: zone cern.ch/IN: serial: new
2006121605, old 2006121604
16-Dec-2006 01:11:00.003 queue_xfrin: zone cern.ch/IN: enter
16-Dec-2006 01:11:00.003 zone cern.ch/IN: Transfer started.
16-Dec-2006 01:11:00.003 zone cern.ch/IN: requesting IXFR from
137.138.28.176#53
16-Dec-2006 01:11:00.004 transfer of 'cern.ch/IN' from
137.138.28.176#53: connected using 137.138.17.9#53370
16-Dec-2006 01:11:03.533 zone cern.ch/IN: zone transfer finished:
success
16-Dec-2006 01:11:03.533 zone cern.ch/IN: transferred serial
2006121605
16-Dec-2006 01:11:03.534 zone_timer: zone cern.ch/IN: enter
16-Dec-2006 01:11:03.534 zone_maintenance: zone cern.ch/IN: enter
16-Dec-2006 01:11:03.534 zone cern.ch/IN: sending notifies (serial
2006121605)

Then for my test query I see this:

16-Dec-2006 01:12:49.813 client 137.138.28.176#56971: query:
cfmgr.cern.ch IN A +       ************** NORMAL ********************
16-Dec-2006 01:12:49.923 client 137.138.28.176#56971: query:
cfmgr.cern.ch IN A +
16-Dec-2006 01:12:50.033 client 137.138.28.176#56972: query:
cfmgr.cern.ch IN A +
16-Dec-2006 01:12:56.828 client 137.138.28.176#56972: query:
cfmgr.cern.ch IN A +
16-Dec-2006 01:12:56.836 client 137.138.28.176#56977: query: cfmgr IN A
+               **************NOT NORMAL******************
16-Dec-2006 01:12:56.836 createfetch: cfmgr A
**************NOT NORMAL******************
16-Dec-2006 01:12:57.433 client 137.138.28.176#56978: query:
cfmgr.cern.ch IN A +
16-Dec-2006 01:12:57.543 client 137.138.28.176#56978: query:
cfmgr.cern.ch IN A +
16-Dec-2006 01:12:57.653 client 137.138.28.176#56978: query:
cfmgr.cern.ch IN A +
16-Dec-2006 01:12:57.763 client 137.138.28.176#56978: query:
cfmgr.cern.ch IN A +

While my script gives this:

Sat Dec 16 01:12:57 2006: Error: NXDOMAIN
Sat Dec 16 01:12:57 2006: FAILED

It seems the DNS server is trying to do recursion on the lookup
(which is local): the failing query is logged as plain "cfmgr", without
the cern.ch suffix, and is followed by a createfetch.  This was done on
a name server running at 110 qps, of which 40 qps were recursive and
the rest were for local domains. CERN, like many other sites, is
overloaded with spam, mainly detected by reverse lookups - the
recursion leads to many NXDOMAINs (48 qps recursion / 43 qps NXDOMAIN).
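
A rough way to pick entries like the "cfmgr" one out of a longer query
log, sketched in Python. It assumes each log entry sits on a single line
(the entries above are wrapped only by the mailer) and that the entry
looks like the "client ...: query: NAME IN TYPE" lines shown here. The
function name is made up for illustration.

```python
import re

# Matches query-log entries of the form seen above, e.g.
# "... client 137.138.28.176#56977: query: cfmgr IN A +"
QUERY_RE = re.compile(r"client ([^#\s]+)#(\d+): query: (\S+) IN (\S+)")


def single_label_queries(lines):
    """Return (client, port, name) for queries whose name has no dot.

    A single-label name such as "cfmgr" cannot be inside a zone like
    cern.ch, so the server has to treat it as a non-local lookup.
    """
    hits = []
    for line in lines:
        match = QUERY_RE.search(line)
        if match is None:
            continue  # not a query entry (e.g. createfetch, zone messages)
        client, port, name, _qtype = match.groups()
        if "." not in name.rstrip("."):
            hits.append((client, port, name))
    return hits
```

Run over the whole debug log, this would show whether the single-label
queries cluster around the zone transfer times.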

----------

The above seems to happen most frequently after a zone transfer, but
that is not always the case.

More strangely, if I send 100qps to a non-loaded slave there is
absolutely no problem whatsoever :-(

Thus I conclude the problem is caused by a combination of zone transfers
and traffic profile.

If no-one can shed any light on this I'll report it to ISC as a bug.

_Nick

-----Original Message-----
From: bind-users-bounce at isc.org [mailto:bind-users-bounce at isc.org] On
Behalf Of Nick Garfield
Sent: Friday, December 15, 2006 11:46 AM
To: bind-users at isc.org
Subject: RE: NXDOMAIN returned on while updating

Hi Kevin, Many thanks for your posting.

Some comments below, to get a picture of your system.
> I've never seen the behavior you described, even though we have a 
> similar environment, i.e. many Dynamically-updated zones, a few big 
> ones that take a long time to transfer (e.g. an 87,000-record zone 
> that we transfer over the Atlantic).
I presume you mean, like CERN, the large zones are not DDNS, and
transfer by AXFR (not IXFR) - is that correct?

> I think we would have noticed this problem a long time ago, since, as
> you point out, most apps will simply *fail*
> when an erroneous NXDOMAIN is given for a name. Admittedly, as a 
> general rule, we don't have ordinary end-user clients querying our 
> master nameserver (it's pretty much dedicated to handling Dynamic 
> Updates and doing zone transfers)
Exactly, same setup as we are using.  Our clients query the slaves - it
is the slaves that are showing the symptoms I described in the first
email.

Normal end-user applications don't seem to be too concerned, although
SMTP lookups can fail, leading to undeliverable emails.

There are some CERN-specific applications which suffer the worst -
unfortunately these apps query the DNS 30 times per second (please don't
comment on this, because there is nothing I can do except ask them to
install a local caching server).

However, you have given me an idea - see if the same behaviour is seen
on the master :-)

>, but we do have various clients and processes querying that box and
> I'm sure we would have noticed spurious NXDOMAINs by now...
I had to write a script using Perl's Net::DNS to find it, because that
avoids the complexity of the local resolver.
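
For anyone wanting to try the same kind of check without Net::DNS, here
is a minimal sketch in Python using only the standard library. It is not
the script referred to above: the function names are invented, and it
only sends one A query over UDP straight to a given server and reports
the response code, bypassing the local resolver and its search list.

```python
import random
import socket
import struct

RCODE_NAMES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name, qid, qtype=1, recursion_desired=True):
    """Build a DNS query packet (RFC 1035) for `name`; qtype 1 is A."""
    flags = 0x0100 if recursion_desired else 0x0000  # RD bit only
    header = struct.pack("!HHHHHH", qid, flags, 1, 0, 0, 0)
    # Encode the name as length-prefixed labels ending in a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class IN


def parse_rcode(response):
    """Return (query id, rcode) taken from the DNS header."""
    qid, flags = struct.unpack("!HH", response[:4])
    return qid, flags & 0x000F


def query_rcode(server, name, timeout=2.0):
    """Send one A query for `name` to `server` and return the rcode name."""
    qid = random.randint(0, 0xFFFF)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name, qid), (server, 53))
        data, _ = sock.recvfrom(512)
    rid, rcode = parse_rcode(data)
    if rid != qid:
        raise ValueError("response ID does not match query ID")
    return RCODE_NAMES.get(rcode, str(rcode))
```

Calling query_rcode() in a loop against one slave (pass that slave's
address as `server`) with a fully qualified name, and logging every
result that is not NOERROR, gives roughly the test described above.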

A further question:  What operating system/file system are you using?

Cheers,

Nick