Caching question BIND 9 help me please

Kevin Darcy kcd at daimlerchrysler.com
Thu Aug 7 00:11:07 UTC 2003


Vo wrote:

> Here's my situation.  I'm running BIND 9 on Red Hat 9 (latest version).  I
> have two servers in-house, a primary and a secondary, and I'm using my ISP
> as a tertiary.
>
> Last week my primary suffered a hard drive failure.  I thought "no big
> deal, the secondary and tertiaries will take care of servicing DNS
> requests". The primary was down for about 28 hours while I got a new
> hard drive in place.
>
> During this timeframe the secondary constantly reported that it couldn't
> update since it couldn't reach the primary. I expected that.
>
> At about 24 hours of elapsed time, we started disappearing off the
> internet.  Using my secondary to query some names from my domain, I got
> "not founds."  I did that with our ISP tertiary DNS server as well, and
> it spewed back the root servers list.  Not good. (I quickly got the
> primary back online and everything straightened out right away.)
>
> My first thought is that I have the caching info set incorrectly for my
> domain.  Here's an obfuscated version of the zone file for my domain:
>
> $TTL    86400
> @       IN      SOA     dns1.me.com. postmaster.me.com. (
>                 2003042101      ; serial
>                 1H      ; refresh (1 hour)
>                 30M     ; retry (30 minutes)
>                 7D      ; expire (7 days)
>                 1H      ) ; minimum (1 hour)
>
>                 NS      dns1.me.com.
>                 NS      dns2.me.com.
>                 NS      ns1.myisp.net.
>                 NS      ns2.myisp.net.
>                 NS      ns3.myisp.net.
>
>  { address records and cname records here  }
>
> Is my $TTL setting what killed me?  Everything else seems to be set for
> a 7-day expire, and I'm probably missing something insanely simple.
> What are good settings for a stable network where no significant
> changes are being made?

$TTL only sets the default TTL for records in the zone. Slaves never see
the $TTL directive itself, only the resulting TTL value on each record,
and they don't expire records based on those TTL values anyway. The only
relevant "expiration" parameter between masters and slaves is the SOA
EXPIRE setting, and your 7 days is, of course, much longer than the
28-hour outage. So it's a bit of a mystery.
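
One quick sanity check (a sketch using the obfuscated names from your
post; substitute your real zone and servers) is to compare the SOA serial
that each listed nameserver is currently serving:

    dig @dns1.me.com me.com. soa +short
    dig @dns2.me.com me.com. soa +short
    dig @ns1.myisp.net me.com. soa +short

If any of them hold an older serial than the master, or don't answer at
all, that server wasn't transferring the zone properly even before the
outage.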

Is it possible that you've had replication failures for a while and just
never noticed? Are your primary and all of the delegated slaves currently
answering authoritatively for names in the zone?
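
An easy check (same obfuscated names, with www.me.com standing in for any
real name in your zone) is a non-recursive query against each slave,
looking for the "aa" (authoritative answer) flag in the response header:

    dig @dns2.me.com www.me.com. a +norecurse

A healthy slave should return status NOERROR with "aa" set; a slave whose
copy of the zone has expired will typically return SERVFAIL instead.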

When you say "not founds", do you mean NXDOMAIN? Were you using nslookup
to do the lookups? If nslookup gets a SERVFAIL for the name as typed, it
will sometimes go on to apply the searchlist algorithm, and if it then
gets an NXDOMAIN for a searchlisted name (quite likely), it *misreports*
NXDOMAIN for the whole lookup. This is one of many reasons why nslookup
sucks and "dig" is the preferred DNS troubleshooting tool. At the very
least, always turn on "debug" with nslookup to see what the hell it's
doing behind the scenes.
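
For example, to watch the difference (www.me.com again standing in for a
real name in your zone):

    nslookup -debug www.me.com dns2.me.com
    dig @dns2.me.com www.me.com. a

With -debug, nslookup prints every query it actually sends, so you can
see whether it's quietly appending your search domains (e.g. retrying
www.me.com.me.com) after the first lookup fails.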

As for your ISP's "tertiary" server, it should have at least given you a
SERVFAIL or timed out trying to resolve the name. Sounds like they turned
off or restricted recursion and never bothered to tell you. Is it working
now?
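
You can test that with a query for some name they're not authoritative
for (www.isc.org here is just an arbitrary outside name):

    dig @ns1.myisp.net www.isc.org. a

If recursion is off, you'll typically get REFUSED, or a response without
the "ra" (recursion available) flag containing nothing but a referral to
the root servers, which would explain the root-server list it spewed
back at you.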


- Kevin




