BIND 9.7 Serial Number Decrease Problem

Barry Finkel bsfinkel at anl.gov
Fri Jun 3 15:57:16 UTC 2011


I have a problem with BIND 9.7.x on Ubuntu.
I have two servers that are running 9.7.3.
They slave 332 zones, and they also master 213,750
malware/spyware zones that we have defined to reroute these
domains to a local machine.

When I was upgrading BIND to 9.7.3-P1 yesterday, an

      ./rndc stop

command ran for over 8 minutes, and named did not stop.
A "kill" command did not work; I had to resort to a
"kill -9" command.  What was BIND doing?  Gracefully
closing all of the zones?  BIND 9.7.3-P1 came up fine, but
there are two things that concern me:

1) After BIND began responding to queries, it was using
    100% of the CPU for about three minutes.  I am not sure what
    BIND was doing.  This is not major because BIND was handling
    customer queries, and after the three minutes the CPU usage
    dropped to a normal 1%.

2) Two zones reported serial number decreases.  This is bad.
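
For reference, whether one SOA serial is "lower" than another is defined
by RFC 1982 serial number arithmetic rather than plain integer
comparison.  A minimal sketch in Python follows; the function names are
illustrative only and are not taken from BIND's code.

```python
SERIAL_BITS = 32
HALF = 1 << (SERIAL_BITS - 1)   # 2^31
MOD = 1 << SERIAL_BITS          # 2^32


def serial_compare(a, b):
    """Compare two SOA serials using RFC 1982 arithmetic.

    Returns -1 if a is older than b, 1 if a is newer than b,
    0 if they are equal, and None when the ordering is undefined
    (the two serials are exactly 2^31 apart).
    """
    if a == b:
        return 0
    diff = (b - a) % MOD
    if diff == HALF:
        return None
    return -1 if diff < HALF else 1


def master_went_backwards(master_serial, slave_serial):
    """True when the master's serial is older than the slave's copy."""
    return serial_compare(master_serial, slave_serial) == -1
```

With the values from these zones, master_went_backwards(1238, 1239) is
true: a slave holding 1239 treats a master at 1238 as older and will not
transfer the zone on refresh.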

I did some research on the two zones - both Microsoft
Active Directory zones (one _tcp and one _udp) that are mastered
on a Windows Domain Controller and slaved on my BIND boxes.
I have around 44 AD zones I slave, and only these two reported
problems - on my two internal Ubuntu slaves and my two Solaris 10
slaves.  The two Solaris 10 slaves do not run the spyware zones,
so I had no problem with "./rndc stop".  I therefore am not sure
that the serial number problems are due to the "kill -9".

I looked at the serial number issue on these two zones in detail;
I capture the serial numbers on all the AD zones each morning at
6:10.  Here is information for the _tcp zone:

      Date        Zone  Mast Slav Slav
      20 Oct 2010 _tcp. 1233 1233 1233
      21 Oct 2010 _tcp. 1239 1239 1239 The master incremented the serial.
      ...
      09 Nov 2010 _tcp. 1239 1239 1239
      10 Nov 2010 _tcp. 1238 1239 1239 Master decreased due to MS patch
      11 Nov 2010 _tcp. 1238 1238 1238
      ...
      03 Dec 2010 _tcp. 1238 1238 1238
      04 Dec 2010 _tcp. 1238 1238 1239 ??
      05 Dec 2010 _tcp. 1238 1239 1238 ??
      06 Dec 2010 _tcp. 1238 1238 1238
      ...
      09 Dec 2010 _tcp. 1238 1238 1238
      10 Dec 2010 _tcp. 1238 1238 1239 ??
      11 Dec 2010 _tcp. 1238 1239 1238 ??
      12 Dec 2010 _tcp. 1238 1238 1238
      ...
      05 Jan 2011 _tcp. 1238 1238 1238
      06 Jan 2011 _tcp. 1238 1239 1239 ??
      07 Jan 2011 _tcp. 1238 1238 1238
      ...
      02 Mar 2011 _tcp. 1238 1238 1238 Upgrade 9.7.2-P3 to 9.7.3
      03 Mar 2011 _tcp. 1238 1239 1239
      04 Mar 2011 _tcp. 1238 1238 1238
      ...
      16 Apr 2011 _tcp. 1238 1238 1238
      17 Apr 2011 _tcp. 1238 1238 1238 1238 1238 Two Sol10 slaves added.
      ...
      02 Jun 2011 _tcp. 1238 1238 1238 1238 1238 Upgrade 9.7.3 to 9.7.3-P1
      03 Jun 2011 _tcp. 1238 1239 1239 1239 1239
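
The days of interest in the report above are those on which a slave's
serial differs from the master's.  A minimal Python sketch for picking
out such rows is below.  It assumes the row layout shown above (date,
zone, master serial, slave serials, optional comment) and that a comment
never starts with a number; the function names are illustrative only.

```python
def parse_row(line):
    """Parse one report row into (date, zone, master, [slaves]).

    Returns None for the header, "..." lines, and anything else
    that does not carry at least one serial number.
    """
    fields = line.split()
    if len(fields) < 5:
        return None
    date = " ".join(fields[:3])
    zone = fields[3]
    serials = []
    for tok in fields[4:]:
        if not tok.isdigit():
            break               # start of the trailing comment
        serials.append(int(tok))
    if not serials:
        return None
    return date, zone, serials[0], serials[1:]


def mismatched_rows(lines):
    """Return the parsed rows where any slave serial differs from the master."""
    out = []
    for line in lines:
        row = parse_row(line)
        if row is None:
            continue
        date, zone, master, slaves = row
        if any(s != master for s in slaves):
            out.append(row)
    return out
```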

Both Ubuntu slaves have been up for 149 days (reboot around Jan 15).
The zone serial was 1239 until an MS patch run on the Domain
Controller decreased the serial by one on the evening of Nov 9.
I did nothing to correct the problem; I waited for the two zones
to expire, and then new zones were transferred from the Windows
master server.  The serial number was 1238 on the master and
slaves.  On a few days, the serial on the slaves increased
by one, and I am not sure what happened on those days.

On Mar 02 I upgraded BIND from 9.7.2-P3 to 9.7.3, and the
serial numbers on the two upgraded BIND slaves reverted to the
higher 1239 serial.  Again, I did no fixup, and on Mar 04
the serials were the same at the lower value.  I think that the
serial number decrease was temporary during the patch run.
On Apr 17 I added the two Solaris 10 slaves to my morning report, and
all five serials were constant at 1238 until I upgraded BIND Tuesday (on
the Solaris 10 boxes) and yesterday (on the Ubuntu boxes).  Immediately
after the upgrade BIND reported the serial number problem on these two
zones.  The other AD zones have had no serial number problems.

I have no idea why BIND would remember the increased 1239
serial number, when the serial number for the zone has been constant
at 1238 since Mar 04.  I have to assume that between Mar 04 and
Jun 03 BIND would have written the zone to disk, either in the
base zone file or a .jnl file.

-- 
----------------------------------------------------------------------
Barry S. Finkel
Computing and Information Systems Division
Argonne National Laboratory          Phone:    +1 (630) 252-7277
9700 South Cass Avenue               Facsimile:+1 (630) 252-4601
Building 240, Room 5.B.8             Internet: BSFinkel at anl.gov
Argonne, IL   60439-4828             IBMMAIL:  I1004994


