You may have heard recently that Response Rate Limiting (RRL) has re-opened the door to cache poisoning attacks.
ISC acknowledges that RRL can increase the effectiveness of cache poisoning attacks and appreciates the detailed research that uncovered it. This is, however, only one piece in the larger context of competing security concerns, and each operator will need to find their own balance of protection.
For those unfamiliar with it, RRL is designed to reduce the effectiveness of reflected denial of service (DoS) attacks which leverage DNS servers to amplify the attack. DNS servers are frequently used as amplifying reflectors in DoS attacks because attackers can send a small UDP query with a forged source address to the DNS server and get it to respond with a much larger answer to the target of the DoS. RRL reduces the effectiveness of these attacks by detecting when a large amount of similar traffic is being sent to a single target and suppressing responses.
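To make the detection step concrete, here is a minimal sketch of the general idea: count similar responses per target within a one-second window and suppress anything over a limit. It is an illustration only, not BIND's implementation. The class name, the `responses_per_second` parameter, and the fixed one-second bucket are assumptions made for this example, and a real implementation would also need to expire old counters.

```python
from collections import defaultdict


class ResponseRateLimiter:
    """Toy per-target limiter; hypothetical, not the BIND RRL algorithm."""

    def __init__(self, responses_per_second):
        self.limit = responses_per_second
        # (target, response_key, whole second) -> responses seen so far
        self.counts = defaultdict(int)

    def allow(self, target, response_key, now):
        """Return True if the response may be sent, False if it is suppressed."""
        bucket = (target, response_key, int(now))
        self.counts[bucket] += 1
        return self.counts[bucket] <= self.limit


limiter = ResponseRateLimiter(responses_per_second=5)
# Eight identical responses to the same (possibly forged) address within one second.
decisions = [
    limiter.allow("192.0.2.1", "example.com/A", now=100.0 + i * 0.01)
    for i in range(8)
]
print(decisions)  # the first five are allowed, the last three are suppressed
```

Note that the limiter cannot tell whether the address really sent the queries or was forged, so it suppresses responses to that address either way.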
This suppression could itself be used to create a different kind of DoS attack: an attacker who forges the target's address and asks the same kinds of questions that the target is likely to ask can cause the server to suppress responses to the target's legitimate queries. If this were done against all of the servers authoritative for a zone, the attacker could prevent the target from getting any answers at all for the zone.
In order to combat this risk, RRL was designed with a concept called “slip”. Slip comes into play after RRL starts suppressing responses, and works by allowing a specified fraction of responses to “slip” through the suppression. These responses contain none of the actual answer data, but do have the truncation (TC) bit set in the header of the response in order to tell the client to retry over TCP. This enables legitimate clients to get answers via TCP, which has a lot more overhead than UDP but is not vulnerable to source address forgery.
ISC’s RRL implementation will debut in our upcoming BIND 9.9.4 release. As in the redbarn.org patches that preceded our implementation, we have chosen a default slip value of “2”, meaning that TC answers will be sent to the client/target one time in two. The other half of the time the queries will go unanswered.
It is these unanswered queries that create the increased opportunity for cache poisoning by giving an attacker a larger time window in which to get a forged reply with poison data to the victim. The data that we’ve seen indicate that, for a reasonably configured resolver, it takes on average more than sixteen hours of forged answers at 100 Mbps to get the resolver to accept a poisoned answer. During this time, the resolver is being flooded with traffic, which is usually a very visible event. We have not seen any analysis of the expected time for a “stealthy” cache poisoning attack, but we expect it to be significantly longer.
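For a sense of scale, the short calculation below converts sixteen hours at 100 Mbps into total traffic. The 80-byte forged packet size is an assumption chosen only for illustration; the figures above do not specify one.

```python
RATE_BITS_PER_SECOND = 100_000_000   # 100 Mbps, from the figure quoted above
DURATION_SECONDS = 16 * 3600         # sixteen hours
ASSUMED_PACKET_BYTES = 80            # hypothetical forged-answer size, for illustration

total_bytes = RATE_BITS_PER_SECOND * DURATION_SECONDS // 8
total_packets = total_bytes // ASSUMED_PACKET_BYTES

print(total_bytes)    # 720000000000 bytes, i.e. 720 GB
print(total_packets)  # 9000000000 packets at the assumed size
```

Under that assumption the resolver would see on the order of nine billion forged packets, which is why such an attack is hard to miss.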
The researchers who discovered this have recommended a slip value of “1”, sending a TC answer for every response that RRL decides needs to be suppressed. This reduces the effectiveness of cache poisoning attacks while increasing the effectiveness of using DNS to amplify DoS attacks.
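The difference between the two slip values can be sketched as follows. The function models slip as a simple counter over the responses RRL has already decided to suppress, with every n-th one sent as a truncated (TC) reply and the rest dropped. The function name and the counter-based selection are assumptions made for this example; an actual implementation may choose which responses slip differently.

```python
def rrl_actions(num_suppressed, slip):
    """Return the action for each response RRL has decided to suppress.

    Every slip-th suppressed response is sent as a truncated (TC) reply;
    the others are dropped. Hypothetical model, for illustration only.
    """
    if slip < 1:
        raise ValueError("this sketch only models slip values of 1 or more")
    actions = []
    for i in range(1, num_suppressed + 1):
        actions.append("TC" if i % slip == 0 else "drop")
    return actions


default_slip = rrl_actions(10, slip=2)
slip_of_one = rrl_actions(10, slip=1)

print(default_slip.count("TC"), default_slip.count("drop"))  # 5 5
print(slip_of_one.count("TC"), slip_of_one.count("drop"))    # 10 0
```

With a slip of 2, half of the suppressed queries get no reply at all, which is the window a forger can exploit. With a slip of 1, every suppressed query still produces a packet sent to the (possibly forged) source address.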
Note that this analysis applies only to queries and responses that are affected by RRL. Anything that causes legitimate packets to be dropped, even simple traffic congestion, will benefit someone attempting cache poisoning. Modifications to the behavior of RRL can therefore have, at best, a limited effect in defending against cache poisoning. The best defense is for authoritative server operators to sign their zones with DNSSEC, and for resolver operators to validate responses.
The bottom line is that there is no clear “right” answer here. Both concerns are valid and the mitigation for one increases the risk of the other.
We believe, based on what we know of the current state of the Internet, that a slip value of “2” is closer to the theoretical “sweet spot” in addressing both risks than a slip value of “1” is, which is why we are keeping “2” as our default. Since RRL is not enabled by default even when it is compiled into BIND, and the slip value is a configurable option, we believe that this provides the most useful default value while giving individual operators the freedom to choose the risk balance that they are comfortable with.
Finding the right risk balance also includes considering the effect that other features (e.g. DNSSEC) have on amplification potential and resistance to cache poisoning.
Those who are interested in learning more about how we got into this situation and where we ought to go from here may want to check out Paul Vixie’s blog post “On the Time Value of Security Features in DNS”.
Edit (1 October, 2013): This article previously stated, incorrectly, that during a cache poisoning attack the server would be unable to resolve names for the domain under attack.