2 problems: "temporary name lookup failures" & updating TLD servers

Linda W. bind at tlinx.org
Sat Jul 3 21:36:12 UTC 2004


I've been running my own bind server (version 4 when I started; now running 9)
for several years, serving myself and any housemates.  It runs as master for
the internal domain, used to run as a slave to the TLDs for EDU and a few
others that allowed it (so it wouldn't constantly make individual requests),
and is now limited to running either as a forward-and-caching-only server or
as a stub/caching resolver for the TLDs.

Two problems come up -- the first rarely, but the second more frequently, and
the second is the more bothersome of the two.

Problem #1)

Sometimes the TLD servers change.  The old addresses may work for a while,
but eventually the old TLD IPs are decommissioned and I stop being able to
resolve (the last one to fail was .org, which moved to some new DNS servers).
Maybe I don't have my TLD files set up correctly -- for the bigger TLDs (COM,
EDU, GOV, NET and ORG) I have the IP addresses of their TLD servers set up in
a "tops" config file.

Should I have a separate file for each TLD, so that it would be updated
automatically when new TLD servers came out?  When I first set up the files
it was years before 9/11 and the TLD IPs changed rarely, but I've noticed
most of the TLDs no longer allow slaving, and some won't even allow use as a
stub server.  Right now the upper-level config files aren't writeable by the
bind(9) process, since I presumed at the time that they weren't changed that
often.
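For reference, what I have in mind per-TLD is a stub zone along these lines
(the file path and master address below are purely illustrative, not my
actual config):

	// Hypothetical per-TLD stub zone.  With "type stub", bind
	// refreshes the NS set for the TLD from the listed master and
	// stores it in the local file -- which is exactly the file
	// that would need to be bind-writeable.
	zone "org" {
		type stub;
		masters { 192.0.2.1; };   // placeholder (RFC 3330 test address)
		file "stub/org.db";
	};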

Is it common/acceptable practice to make everything but the root servers
bind-writeable?  And what about the list of root-server IPs?  A few of those
IPs have changed over the years; I update them manually when I do system
upgrades/maintenance, but I'm wondering if this, too, is something I should
let bind maintain.  I've been slow to allow automatic changing of these
upper-level files on the premise that they used to be very stable and would
be better protected by manual intervention, but for the TLD files this system
no longer works: a couple of times now I've gotten bitten by an expired zone.
It usually takes a day or so for me to notice, and I'd rather that "downtime"
be closer to zero.
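For comparison, the root hints are the one piece I do keep static; that part
of the config is just the standard hint zone (file name illustrative):

	// Root hints: bind only uses these to bootstrap, then asks the
	// roots themselves for the current NS set, so staleness here is
	// less fatal than in the TLD files.  The file still needs an
	// occasional manual refresh, e.g. from
	// ftp.internic.net/domain/named.root.
	zone "." {
		type hint;
		file "named.root";
	};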


Problem #2)

I've been noticing that I get the error "Temporary failure in name
resolution".  These "temporary" failures can persist for hours at a time,
which is why they are annoying.

For example, I tried to browse to a NASA site but got a "site name couldn't
be resolved" message out of my browser (which goes through a squid proxy).

A local dig turns up zero answers, and a +trace shows:
 > dig +trace nepp.nasa.gov

; <<>> DiG 9.2.2 <<>> +trace nepp.nasa.gov
;; global options:  printcmd
.                       245750  IN      NS      A.ROOT-SERVERS.NET.
...deleted list....
.                       245750  IN      NS      M.ROOT-SERVERS.NET.
;; Received 372 bytes from 127.0.0.1#53(127.0.0.1) in 2 ms

gov.                    172800  IN      NS      G.GOV.ZONEEDIT.COM.
...deleted list...
gov.                    172800  IN      NS      A.GOV.ZONEEDIT.COM.
;; Received 271 bytes from 198.41.0.4#53(A.ROOT-SERVERS.NET) in 86 ms

nasa.gov.               259200  IN      NS      NASANS1.nasa.gov.
nasa.gov.               259200  IN      NS      NASANS3.nasa.gov.
nasa.gov.               259200  IN      NS      NASANS4.nasa.gov.
;; Received 145 bytes from 66.135.32.100#53(G.GOV.ZONEEDIT.COM) in 74 ms

dig: Couldn't find server 'NASANS1.nasa.gov': Temporary failure in name resolution
=================

If I use my ISP's server, it seems to have the answer:
 > dig @sfo.speakeasy.net nepp.nasa.gov

; <<>> DiG 9.2.2 <<>> @sfo.speakeasy.net nepp.nasa.gov
...
;; ANSWER SECTION:
nepp.nasa.gov.          600     IN      CNAME   nepp-562.gsfc.nasa.gov.
nepp-562.gsfc.nasa.gov. 86400   IN      A       128.183.52.249

;; Query time: 138 msec
;; SERVER: 64.81.79.2#53(sfo.speakeasy.net)
;; WHEN: Sat Jul  3 11:26:02 2004
;; MSG SIZE  rcvd: 75
=================
But hours later I'm still not able to resolve the name, still getting a
"temporary failure".

This slows down not only web browsing but also email, since sender addresses
are verified before messages are relayed to my inbox.

This "temporary failure" seems to be a fairly recent phenomenon as far as I
can tell, and it has also happened with addresses in other TLDs: the trace
will trickle down to the end-point servers and just keep coming back empty.

Is this the result of increased traffic hammering these end servers, or some
new policy/feature being implemented to prevent individuals from running
their own bind cache for small in-house/in-company networks?

Is there some way to configure bind to fail over to doing a lookup via my ISP
rather than returning a failure, with my bind server caching the answer from
my ISP as a non-authoritative answer for whatever the expiration period is?

Most of my computers are on an isolated subnet that couldn't query my ISP
directly anyway, but I'm not sufficiently versed in bind's feature set to
tell it to act as a primary resolver if it can and, if that fails, to fall
back to acting as a caching-only name server against one or more of my ISP's
servers.
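The nearest knob I can find in the docs is "forward first", though it's the
reverse of the order I described -- bind would try the forwarders first and
only fall back to its own recursion if they fail.  A sketch, using the
speakeasy address from the dig output above as the forwarder:

	options {
		// "first": ask the forwarders, then fall back to full
		// recursion on our own if they don't answer.  ("only"
		// would never recurse independently.)
		forward first;
		forwarders { 64.81.79.2; };   // sfo.speakeasy.net
	};

Whether that actually helps with the "temporary failure" case, I don't know.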

Ideas?  Suggestions?  I'm quite able to read documentation given pointers,
and I have the DNS & BIND manual covering bind 9, but it's not readily
apparent how I would do this in any straightforward way, even though it might
be possible to rig up some ugly kludge that would likely break at the next
software upgrade.

Thanks in advance for any pointers/suggestions/ideas....
Linda W.





-- 
    In the marketplace of "Real goods", capitalism is limited by safety
    regulations, consumer protection laws, and product liability.  In
    the computer industry, what protects consumers (other than vendor
    good will that seems to diminish inversely to their size)?




More information about the bind-users mailing list