Everybody Resolves this Domain but Us.

Sun Jul 21 18:56:26 UTC 2002

On Sat, Jul 20, 2002 at 09:04:21PM -0400, Chris Davis wrote:
> 
> Back to the zone not loading, then.  That leaves it with the administrator
> of the broken zone.
> 
> Or would a zone load but with a big ugly log entry be so bad?
> 
> We know people are doing this.  There's always someone new coming along to
> do it after the current administrator fixes his problem.  Why not save
> the hassle via a sanity check at startup?

You've never really delineated how named would validate NS records,
Chris, and I don't think you've considered the failure modes you could
introduce.

Validating an NS record is decidedly non-trivial. Take this example:

@       IN      NS crazy.akbars.uucpshack.net.
        IN      NS ns1.secondary.com.
        IN      NS ns2.secondary.com.

What happens when one or more of the root or gtld servers return
NXDOMAIN for com.?  (Before you say 'that will never happen', brush up
on your history. It's happened twice in the last four years.) If the
server for this zone restarts during one of those outages, and is unable
to verify the secondary.com records as 'valid', should it refuse to load
the zone? Why?
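
To make that failure mode concrete, here is a minimal sketch of a strict
"validate before load" policy. Everything in it is hypothetical: the names
naive_validate, should_load, healthy, and com_outage are invented for
illustration, the resolver is a stub that never touches the network, and
none of it reflects how named actually works.

```python
# Hypothetical sketch: none of these names come from named/BIND.
from typing import Callable, Dict, List

# Possible outcomes of a lookup, as a stub resolver might report them.
OK, NXDOMAIN, TIMEOUT = "ok", "nxdomain", "timeout"

NS_TARGETS = [
    "crazy.akbars.uucpshack.net.",
    "ns1.secondary.com.",
    "ns2.secondary.com.",
]


def naive_validate(ns_targets: List[str],
                   resolve: Callable[[str], str]) -> Dict[str, str]:
    """Return the lookup outcome for each NS target."""
    return {name: resolve(name) for name in ns_targets}


def should_load(results: Dict[str, str]) -> bool:
    """Strict policy: load the zone only if every NS target resolved."""
    return all(status == OK for status in results.values())


def healthy(name: str) -> str:
    # Stub resolver for a normal day: every name resolves.
    return OK


def com_outage(name: str) -> str:
    # Stub resolver simulating upstream servers wrongly answering
    # NXDOMAIN for everything under com.
    return NXDOMAIN if name.endswith(".com.") else OK


normal = should_load(naive_validate(NS_TARGETS, healthy))
during_outage = should_load(naive_validate(NS_TARGETS, com_outage))
print(normal, during_outage)
```

The zone data is identical in both runs. Only the state of somebody else's
servers changed, and the strict policy cannot tell the difference.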

Suppose there is a registrar foulup, and uucpshack.net is accidentally
deleted from the net. zone. Should the zone be rejected if a server
tries to load it during the outage? Why? Should the server continue to
check for the existence of crazy.akbars and load the zone only when it
is able to verify it? How will it do that?

Assume that the above records apply to danber.uucpshack.net., there is
no host record for crazy.akbars, the current server also serves
uucpshack.net, and akbars.uucpshack.net has not yet been transferred to
its secondaries. How do you bootstrap that scenario?

Suppose I make a typo when modifying the host record for
crazy.akbars.uucpshack.net., and its address becomes 192.168.113.212.
Should servers run by friends who use crazy as a secondary refuse to
load their zones until that gets fixed? Why? How will the server know
that I didn't *intend* to put crazy into 1918 space? 
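
A checker can certainly detect that the address falls in RFC 1918 space.
The sketch below does that with Python's standard ipaddress module; the
variable names are my own, and it assumes the check would only ever look
at the three RFC 1918 blocks. What it cannot do is tell a typo from a
deliberate choice.

```python
import ipaddress

# The three private blocks defined by RFC 1918.
RFC1918_NETS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def in_rfc1918(address: str) -> bool:
    """True if the address lies inside one of the RFC 1918 blocks."""
    addr = ipaddress.ip_address(address)
    return any(addr in net for net in RFC1918_NETS)


typo_result = in_rfc1918("192.168.113.212")
public_result = in_rfc1918("64.173.36.1")
print(typo_result, public_result)
```

The function answers "is this private?" but has no way to answer "did the
administrator mean it to be?", which is the question that matters here.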

Suppose that crazy is a secondary for one of my zones, and my upstream
sends 64.173.36.0/24 to null0 because they're using a BGP black hole
that a third party creates and administers. Should my server refuse to
load the zone, even though most of the rest of the internet is fully
able to reach all my name servers? Why?

Assume that I host a zone whose NS records are in zones that are
only served by my two name servers. I have a script that runs
once every two hours and builds named.conf and zone files out of an SQL
database that my customer service people update with a web form.
There has been a power outage at 60 Hudson, where one of my servers is 
hosted. It is now impossible for my surviving server to validate those 
perfectly valid NS records on a restart. Should we refuse to load the 
zone? Why?

For each of the above scenarios, modulate the existence of host records
for every name server to produce the worst case, and justify your
decision to load or not load the zone.

Supposing that you decide just to log a message, instead of refusing to
load the zone, consider the case of a small name server that hosts 100
zones. Each zone contains eight NS records, three A records, and two MX
records. Each of the NS records is glueless. Current implementations of
bind will load those 100 zones and begin serving records in well under
10 seconds. Calculate the theoretical maximum time it will take to
validate all 800 NS records, assuming there is glue for their parent
servers, and a query may take up to 30 seconds before it times out.
Now calculate that time assuming there are 650 parent servers
in that set that are glueless. Now explain to your CEO why your name
servers periodically go silent for that amount of time, and counter his
argument that you have to migrate to the Microsoft DNS server, because
it doesn't have this nasty tendency to keep your customers' perfectly
valid zones off line for hours on end. Turn off the logging, and explain
to your CEO that yes, there is an option that could have caught the 
mistake you just made with your biggest customer's zone, but you're not
using that option. Estimate your ability to survive those meetings with
your job intact. Estimate your ability to survive the evening after
without suffering alcohol poisoning.
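
For anyone who wants the arithmetic spelled out, here is a rough sketch.
It assumes the worst case the paragraph describes: lookups happen one
after another, each name costs exactly one query, every query runs to the
full 30-second timeout, and there are no retries. A real resolver would
behave differently on every one of those points, so treat the result as
an upper bound under those assumptions only.

```python
ZONES = 100
NS_PER_ZONE = 8
TIMEOUT_S = 30            # worst-case seconds per query
GLUELESS_PARENTS = 650    # extra lookups for glueless parent servers


def hms(seconds: int) -> str:
    """Format a number of seconds as hours, minutes, and seconds."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h{minutes:02d}m{secs:02d}s"


ns_records = ZONES * NS_PER_ZONE

# Case 1: parents have glue, so one timed-out query per NS record.
worst_case_s = ns_records * TIMEOUT_S

# Case 2: add one timed-out query for each glueless parent server.
worst_with_parents_s = (ns_records + GLUELESS_PARENTS) * TIMEOUT_S

print(ns_records, "NS records")
print("with glue for parents:", worst_case_s, "s =", hms(worst_case_s))
print("with glueless parents:", worst_with_parents_s, "s =",
      hms(worst_with_parents_s))
```

Under those assumptions the first case comes to 24000 seconds (6h40m) and
the second to 43500 seconds (12h05m), against a current load time of well
under 10 seconds.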

Those are just off the top of my head. The DNS is enormously resilient.
One has to be precise as to how one will preserve that resiliency when
proposing to introduce new modes of failure.

-Pete