2010 is shaping up to be a banner year in at least two areas: major steps toward the deployment of DNSSEC, and discoveries of operational snags affecting the deployment of DNSSEC.
An example of the former took place on March 25, when it was announced that the ARPA TLD had been signed. ARPA contains the sub-zones in-addr.arpa and ip6.arpa, which are used for reverse DNS: converting IP addresses to DNS names. It is an essential piece of the DNS infrastructure, and the signing of ARPA makes it possible for reverse lookups to be cryptographically authenticated via DNSSEC.
Unfortunately, an example of the latter took place a short time later. The public key for ARPA was placed in IANA’s Interim Trust Anchor Repository (ITAR), then detected and published in ISC’s DNSSEC Lookaside Validation (DLV) zone, dlv.isc.org. Suddenly, and for several hours afterward, recursive resolvers that relied on ISC DLV for DNSSEC validation were unable to answer reverse DNS queries at all.
The problem was caused by obsolete data persisting in resolver caches, and presents a good opportunity for a discussion of things that can go wrong with DNSSEC at transitional moments, such as the initial signing of a zone.
Caching trust chains
To check the validity of DNSSEC signatures, a resolver must fetch a copy of the zone’s public DNSSEC key (DNSKEY), and then it must prove that the DNSKEY it fetched is also valid… which may involve checking it against yet another DNSKEY, and so on, until the validator reaches something that it unambiguously knows to be valid. That last thing is called a “trust anchor”, and is configured into named.conf using the “trusted-keys” or “managed-keys” statement. For DNSSEC validation to work, a resolver must be configured with at least one trust anchor.
A DNSKEY is considered valid if one of the following conditions is seen:
- The resolver has a trust anchor that exactly matches that DNSKEY, or
- The resolver is configured to use DNSSEC Lookaside Validation (DLV) and has a trust anchor for a DLV zone (such as dlv.isc.org), and the DLV zone contains a record matching the DNSKEY, or
- The zone’s parent contains a delegation signer (DS) record matching the DNSKEY, and that DS record can, in turn, be validated using the parent zone’s DNSKEY.
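The three conditions above can be sketched as a small recursive check. This is only an illustration of the decision logic, not how BIND 9 implements it: the `ValidatorState` class, its fields, and the use of short strings such as "K1" in place of real keys are all assumptions made for the example, and real validation also verifies cryptographic signatures, which the sketch skips.

```python
from dataclasses import dataclass, field


@dataclass
class ValidatorState:
    """Hypothetical, simplified view of what a validating resolver knows."""
    dnskeys: dict = field(default_factory=dict)        # zone -> key the zone publishes
    trust_anchors: dict = field(default_factory=dict)  # zone -> set of configured keys
    dlv_records: dict = field(default_factory=dict)    # zone -> set of keys listed in the DLV zone
    ds_records: dict = field(default_factory=dict)     # zone -> set of keys with a DS in the parent
    dlv_enabled: bool = False  # True means DLV is on and a DLV trust anchor is configured


def parent_of(zone):
    """Return the parent zone: 'in-addr.arpa' -> 'arpa', 'arpa' -> '.', '.' -> None."""
    if zone == ".":
        return None
    _, _, rest = zone.partition(".")
    return rest or "."


def dnskey_is_valid(zone, state):
    """Apply the three conditions from the list above to the zone's DNSKEY."""
    key = state.dnskeys.get(zone)
    if key is None:
        return False
    # Condition 1: a trust anchor exactly matches the DNSKEY.
    if key in state.trust_anchors.get(zone, set()):
        return True
    # Condition 2: DLV is in use and the DLV zone lists a matching record.
    if state.dlv_enabled and key in state.dlv_records.get(zone, set()):
        return True
    # Condition 3: the parent holds a matching DS record, and the parent's
    # own DNSKEY can be validated in turn.
    parent = parent_of(zone)
    if parent is not None and key in state.ds_records.get(zone, set()):
        return dnskey_is_valid(parent, state)
    return False
```

For instance, with DLV enabled, a DLV record for "arpa" and a DS record for "in-addr.arpa" in its parent, `dnskey_is_valid("in-addr.arpa", state)` follows the chain up to "arpa" and succeeds there via condition 2.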
When the DNSKEY, DS, or DLV records are fetched, they are cached by the resolver. (When they don’t exist, the fact of their nonexistence is cached instead.) After a time, cached information expires and is removed, and is refreshed by new queries. But the different records can expire at different times, leading to inconsistency in the cache if, for example, a new DLV record is found that is inconsistent with an old DNSKEY still in the cache. Such inconsistencies can cause validation failures, which will continue until the last obsolete record has expired from the cache.
ARPA: What went wrong
In the case of last month’s signing of ARPA, here is what happened:
- The ARPA zone was signed and a DNSKEY was inserted at the zone apex. DNSSEC-aware resolvers began receiving signed answers.
- Attempting to validate, a resolver using DLV would fetch a copy of the ARPA zone DNSKEY record, then look for a matching DLV record at arpa.dlv.isc.org. It didn’t find one, so the zone was deemed not to be secure.
- The DNSKEY was stored in the cache, with a trust level indicating that DNSSEC validation had not taken place.
At this point, everything in the resolver cache was consistent. The zone didn’t validate, but that’s okay: it wasn’t supposed to. But then:
- The new DLV record was inserted into dlv.isc.org.
- The “negative cache” record indicating that arpa.dlv.isc.org did not exist expired, and was removed from the cache.
- The resolver received an answer from the ARPA zone, and found a cached DNSKEY record, but no information about the DLV record. So it looked up the DLV record, and this time it found one.
- Now the resolver had a valid DLV record indicating that ARPA was secure… but a DNSKEY record in its cache which had never been validated. It therefore incorrectly assumed that the cached DNSKEY had failed validation, and so it returned SERVFAIL.
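The sequence above can be replayed with a toy cache to see why the two records fall out of step. This is a deliberately simplified model, not BIND 9 code: the `ToyResolver` class, the TTL values, and the rule that a DNSKEY is only validated at the moment it is fetched are assumptions chosen to reproduce the behavior described here.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    value: object    # None models a cached "does not exist" answer
    validated: bool
    expires_at: int  # seconds on a hypothetical clock


class ToyResolver:
    def __init__(self):
        self.cache = {}
        self.dlv_zone = {}  # records currently published in the DLV zone

    def _get(self, key, now):
        entry = self.cache.get(key)
        if entry is not None and entry.expires_at <= now:
            del self.cache[key]  # expired: drop it so it gets re-fetched
            return None
        return entry

    def flushname(self, zone):
        # Similar in spirit to "rndc flushname <zone>": drop the cached DNSKEY.
        self.cache.pop(("DNSKEY", zone), None)

    def resolve(self, zone, zone_key, now, key_ttl=86400, neg_ttl=3600):
        dnskey = self._get(("DNSKEY", zone), now)
        fresh = dnskey is None
        if fresh:
            dnskey = CacheEntry(zone_key, False, now + key_ttl)
            self.cache[("DNSKEY", zone)] = dnskey
        dlv = self._get(("DLV", zone), now)
        if dlv is None:
            published = self.dlv_zone.get(zone)
            ttl = neg_ttl if published is None else key_ttl
            dlv = CacheEntry(published, True, now + ttl)
            self.cache[("DLV", zone)] = dlv
        if dlv.value is None:
            return "INSECURE"  # no DLV record: zone treated as unsigned
        # Modeled flaw: only a freshly fetched DNSKEY is ever validated.
        if fresh and dlv.value == dnskey.value:
            dnskey.validated = True
        return "SECURE" if dnskey.validated else "SERVFAIL"


def demo():
    r = ToyResolver()
    results = [r.resolve("arpa", "K1", now=0)]        # zone signed, no DLV record yet
    r.dlv_zone["arpa"] = "K1"                          # DLV record published
    results.append(r.resolve("arpa", "K1", now=1800))  # negative entry still cached
    results.append(r.resolve("arpa", "K1", now=4000))  # negative entry expired
    r.flushname("arpa")                                # the workaround
    results.append(r.resolve("arpa", "K1", now=4100))
    return results
```

In this model, `demo()` returns `["INSECURE", "INSECURE", "SERVFAIL", "SECURE"]`: the failure appears only once the negative DLV entry expires while the unvalidated DNSKEY is still cached.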
The good news is that a workaround was available to resolver operators: Remove the old unvalidated DNSKEY by flushing cached data for the ARPA zone apex. The command to do this is “rndc flushname arpa”, and it forces the resolver to fetch a new copy of the DNSKEY that can now be fully validated. The bad news is, not every resolver operator was in a position to know about this.
Other transition issues
It’s not only DLV users who’ll have difficulties of this sort. An identical problem can arise if a new DS record is inserted into the parent zone, or if a trust anchor is configured into the resolver, without the cache being cleared. Similar problems can also happen further up the DNS, e.g., if your zone already has a DS record in the parent, but then a trust anchor is created for the parent zone.
Problems like this can happen when any zone is signed, but they are more likely to occur if the zone is a popular one, such as a well-known search engine or a top-level domain. These are more likely to be in a resolver’s cache when the transition occurs.
ISC is currently working on fixes to BIND 9 to minimize or eliminate all disruptions of this type. We’re taking our time on this one, in hopes of ensuring that we cover all the possible failure modes. We don’t want to just fix the specific ARPA problem and miss some other bug that’s waiting to bite us next month. The fixes are in progress and will be available in future versions of BIND 9.
What you can do
In the meantime, we can offer a few tips, for both authoritative and recursive name server operators, that should help with transitions.
As an example, let us suppose a TLD is being newly signed: WX, controlled by the beautiful and exotic island nation of West Xylophone. The WX registry operator generates keys, signs and publishes the zone, then places a public key for “wx” in the IANA ITAR and ISC DLV (or, assuming this takes place after the signing of the root zone later in 2010, submits DS records into root). How can she ensure minimal DNS disruption?
- Reduce TTL values
Every record in a zone has an associated TTL (time to live) value, which indicates how long it should be stored in a resolver’s cache before being discarded and refreshed. Every zone also has a “negative cache” TTL value set in its SOA record, which indicates how long a resolver should remember the fact that a record does not exist before looking for it again.
Longer TTL values are often a good thing: they reduce the load on an authoritative server by ensuring that repeat queries from resolvers come in less frequently. But longer TTL values also lengthen the possible disruptions when things change.
So before placing DS records for your zone in the parent zone or submitting DLV records into dlv.isc.org, consider temporarily reducing your TTLs. If your authoritative servers can handle the additional load, reduce the negative cache TTL, DNSKEY TTL, and SOA TTL values to five minutes (300 seconds). Then wait for at least as long as the longest of the previous TTL values; this ensures that all the old records have been purged from caches and only records with the new TTL values remain. Now you can publish the DS or DLV record; there may still be validation failures for some resolvers, but they will last at most five minutes. Ten minutes after publishing, it’s safe to restore the TTLs to their original settings.
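To make the timing concrete, here is a rough sketch that turns the procedure into a schedule. The function name, the dictionary keys, and the example TTLs (one hour and one day) are hypothetical; only the 300-second reduced TTL and the ten-minute settling period come from the advice above.

```python
def ttl_rollout_schedule(old_ttls, reduced_ttl=300, settle_seconds=600, start=0):
    """Return when to perform each step, in seconds after `start`.

    old_ttls: the TTL values in effect before the reduction, e.g.
              {"negative_cache": 3600, "dnskey": 86400, "soa": 86400}
    """
    # Old records can linger in caches for up to the longest previous TTL,
    # so that is the minimum wait before publishing the DS or DLV record.
    wait = max(old_ttls.values())
    publish_at = start + wait
    return {
        "lower_ttls_at": start,
        "publish_ds_or_dlv_at": publish_at,
        # After publishing, a stale record can survive at most one reduced TTL.
        "worst_case_failure_seconds": reduced_ttl,
        "restore_ttls_at": publish_at + settle_seconds,
    }


# Hypothetical starting values for the WX zone.
schedule = ttl_rollout_schedule(
    {"negative_cache": 3600, "dnskey": 86400, "soa": 86400}
)
```

With these assumed values, the operator would lower the TTLs, wait a full day (86400 seconds) before publishing, and restore the original TTLs ten minutes after that.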
Meanwhile, resolver operators concerned about the ability to validate responses from the WX zone after it is signed and trusted can take steps as well:
- Flush the cache when adding a trust anchor
Some operators, instead of using ISC DLV, prefer to configure trust anchors themselves; they will track changes in the IANA ITAR or other trust anchor repositories, and update their resolver configuration whenever a new key is published. If you do this, make sure that the old key is flushed out of the cache.
The simplest way to do this is to kill and restart the resolver. But that involves some downtime and some increase in latency, so you may prefer to keep your resolver running. To do this, add the newly published “wx” key to your “trusted-keys” or “managed-keys” statement in named.conf, run “rndc reconfig” to load the new configuration, and finally run “rndc flushname wx” to remove the cached DNSKEY record, forcing it to be re-fetched and validated against the trust anchor.
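If you add trust anchors regularly, the two rndc steps could be wrapped in a small helper so that the flush is never forgotten. The sketch below is one possible approach, with hypothetical function names; it assumes named.conf has already been edited by hand and that rndc is on the PATH, and it defaults to a dry run that only prints the commands.

```python
import subprocess


def trust_anchor_refresh_commands(zone):
    """Commands to run after adding a trust anchor for `zone` to named.conf."""
    return [
        ["rndc", "reconfig"],         # load the new configuration
        ["rndc", "flushname", zone],  # drop the cached, unvalidated DNSKEY
    ]


def run_commands(commands, dry_run=True):
    """Print the commands, or execute them in order when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first failing command.
            subprocess.run(cmd, check=True)
```

Calling `run_commands(trust_anchor_refresh_commands("wx"))` prints the two commands in order; passing `dry_run=False` would execute them instead.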
- Stay informed of new DNSSEC deployments
Additions of top-level domains and other critical zones to DLV are announced on the “dlv-announce” mailing list. This can provide some forewarning for DLV users so that they can run “rndc flushname” quickly if it turns out to be necessary.
- Update resolvers
When the fixes to these problems are complete, installing the latest versions of BIND 9 will make the hoop-jumping much less necessary.