1 hour subdomain failures

Mon Aug 23 18:14:39 UTC 1999

>>>>> "John" == John Studarus <studarus at one.net> writes:

    John> 	I've been tracking down an intermittent name server
    John> problem from a single caching DNS server.  This caching DNS
    John> server will oscillate between being able to answer queries
    John> and not being able to answer the queries for hostnames in
    John> the subdomain.  The oscillations are exactly two hours in
    John> total length (one hour it works, for the next hour it is
    John> broken).

Didn't you ask this last week? [BTW, please separate the paragraphs in
your postings by at least 1 blank line.]

    John> 	When I say it is broken I mean that when we send a
    John> query we never get a packet in reply.  When I perform the
    John> query via tcp the socket closes right after the query.

Why are you using TCP? Not that it really matters. If the remote name
server isn't going to come up with an answer in time, it won't make
any difference whether you talk to it with UDP or TCP. All this has
done is suggest that the server's reply isn't getting dropped by the
network (assuming the server ever sent one). And when the TCP socket
closes, who's closing it, your end or the server's? Does that close
generate an error or do you get EOF?
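
One way to answer the "error or EOF" question is to send the query over
TCP by hand and watch how the connection ends. What follows is only a
minimal sketch in Python, not a tested diagnostic tool. The function
names (build_query, read_reply, query_tcp) are made up for illustration,
and the server address and hostname are for you to supply. It relies on
the standard DNS-over-TCP framing, where each message is preceded by a
2-byte length.

```python
import socket
import struct

def build_query(name, qid=0x1234):
    """Build a minimal DNS query for an A record with the RD bit set."""
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN

def recv_exact(sock, n):
    """Read up to n bytes; a short result means the peer closed early."""
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            break
        data += chunk
    return data

def read_reply(sock):
    """Classify how the exchange ended: reply, eof, timeout or reset."""
    try:
        prefix = recv_exact(sock, 2)
        if len(prefix) < 2:
            return ("eof", b"")        # orderly close before any reply
        (length,) = struct.unpack("!H", prefix)
        body = recv_exact(sock, length)
        if len(body) < length:
            return ("eof", body)       # closed in the middle of a reply
        return ("reply", body)
    except socket.timeout:
        return ("timeout", b"")        # nobody closed, nothing arrived
    except ConnectionResetError:
        return ("reset", b"")          # the other end sent a RST

def query_tcp(server, name, timeout=5.0):
    """Send one A query over TCP to port 53 and report the outcome."""
    msg = build_query(name)
    with socket.create_connection((server, 53), timeout=timeout) as sock:
        sock.sendall(struct.pack("!H", len(msg)) + msg)
        return read_reply(sock)
```

Calling query_tcp() with the caching server's address and one of the
failing hostnames during a "broken" hour should show which case you are
in: "eof" means the server closed the connection cleanly, "reset" means
it (or something in between) aborted it, and "timeout" means your end
gave up waiting.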

    John> 	Some more details...  The ttl for the NS record for
    John> this subdomain is 1 hour.  The ttl for hosts in this
    John> subdomain is 6 minutes.

Shame you forgot to give the names and addresses of the RRs and
domain(s) in question as well as the details of the suspect name
server.

    John> 	Could it be that when the NS record expires (after 1
    John> hour) the caching server waits for an hour before it
    John> contacts the authoritative server again?

I doubt it. You probably need to analyse a dump of that server's cache
to find out what's going on. Maybe there's a subtle forwarding
intrigue? Say half the time this caching server forwards queries to
a big fast server and half the time they go to a slow server that
tends to drop packets. Or else the caching server somehow gets bogus
negative answers for the domain's NS or A records and these have a 1
hour TTL.
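
To see how either guess could give an exact two-hour cycle, here is a
toy model in Python. It is purely illustrative and assumes things the
original report does not confirm: that the cache refetches only when
its entry expires, that the bogus negative answer is also cached for
3600 seconds, and that successive refetches alternate between a good
and a bad upstream. The simulate() function and the "ok"/"negative"
labels are invented for this sketch.

```python
from itertools import cycle

TTL = 3600  # seconds: the 1 hour NS TTL, assumed for the negative answer too

def simulate(duration, step=600):
    """Return (time, status) samples for a toy cache that refetches on expiry."""
    # Hypothetical upstream behaviour: answers alternate good, bad, good, ...
    upstream = cycle(["ok", "negative"])
    cached, expires = None, 0
    samples = []
    for now in range(0, duration, step):
        if now >= expires:           # entry expired, so ask upstream again
            cached = next(upstream)
            expires = now + TTL
        samples.append((now, cached))
    return samples

# Sample the toy cache every 30 minutes over four hours.
for t, status in simulate(4 * 3600, step=1800):
    print(f"{t // 60:4d} min  {status}")
```

Under those assumptions the model gives one hour of "ok" followed by
one hour of "negative", repeating, which matches the pattern described.
That shows the guess is consistent with the symptoms, nothing more. A
cache dump taken during a broken hour would confirm or rule it out.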

A simpler solution would be to avoid this caching-only server. Why
depend on a name server that doesn't seem to give reliable answers?
And if that suspect name server belongs to your ISP, why don't they
investigate?