Resolver DDOS Mitigation

Early in 2014, a couple of our BIND support customers told us about intermittent periods of very heavy query activity that swamped their resolvers, and asked us for help. It emerged that these were just the first signs of a long series of similar DDoS (Distributed Denial of Service) attacks that are continuing today around the Internet. After a year of experimentation and revision, we are ready to release some BIND features that have been proven to help with this kind of attack. This article explains how we got here. It is a long narrative: if you just want to know about the new features, skip to the end and look at the references listed.

The Attack

Although the DDoS has evolved a bit over the past year and a half, the broad outlines have remained the same. In fact, this type of attack was first described publicly at a DNS-OARC meeting in Beijing in 2009. The attacker enlists a network of compromised endpoints, such as home gateways, to issue queries for non-existent sub-domains of real domains; the names queried might be querty213.example.com and 3409uaf.example.com. This is not an asymmetric amplification attack, but a straightforward flood of queries. Because the sub-domains being queried don't exist, no answers are cached in the network, and every query generates a request to the authority for the domain. The authoritative server becomes overwhelmed and unresponsive, either because of the unusually high load or because it has deployed rate-limiting techniques against perceived attackers (which are likely to include some large ISP recursive servers), and the resolvers holding open queries become bogged down waiting for answers and re-querying. Other researchers have reported that the attackers have switched from using large networks of home gateways to using fewer query generators. They also suspect that in some cases the attackers are now purposely targeting the resolvers (so-called "sandwich" attacks).

Early Experiments

It didn’t take long to figure out what was happening, so we started looking for a good mitigation solution.

We needed to balance multiple objectives:

  1. limit the resolver resources expended on handling abuse traffic
  2. avoid further adding to the load on the authoritative server that was the victim of the attack
  3. continue to handle valid queries from legitimate users, dropping or delaying as few of them as possible

First, we tried a simple solution. We implemented a hold-down timer, triggered when a server failed to respond X (configurable) times, and gave it to operators under attack to see how it worked. This wasn't sensitive enough to cope with authoritative servers that were responding intermittently, because each reply reset the 'failed to respond' counter to zero. The intermittent responses might have occurred because the authoritative servers were not completely overloaded, or we may have been seeing the effect of their administrators deploying Response Rate Limiting algorithms.

Messing around with quotas

After we decided that the hold-down timer would not be an adequate mitigation, we started looking at why the code that already existed to handle the backlog of recursive clients wasn't managing the situation as well as we expected. The client limit defaults to 1000, but is configurable using the option recursive-clients. A user-configured value for recursive-clients provides a hard limit (as configured) and a soft limit (100 lower than the hard limit). When a new client query is received and the hard limit has already been reached, BIND simply drops the query. If instead the soft limit has been reached, named accepts the new query, but at the same time finds the oldest outstanding query already in the backlog and sends back a SERVFAIL to that one instead. Why was this system of backlog management not helping?
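For reference, the limit is set in the options block of named.conf. A minimal sketch (the value 10000 is purely illustrative; with an explicit setting, named derives the soft limit automatically, as described above):

```
options {
    // Hard limit on concurrent recursive client queries.
    // An explicitly configured value also gives a soft limit
    // 100 below it, at which the oldest backlogged query is
    // answered with SERVFAIL to make room for the new one.
    recursive-clients 10000;
};
```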
We realized three things:
  1. The default limit of 1000 is purely a hard limit, so admins who had not increased the limit had no soft quota at all, and new client queries were simply dropped when the backlog was full.
  2. Configuring limits larger than ~3500 would cause named to accept more client queries into the backlog, but named often then exhausted other resources that hadn't scaled up correspondingly, resulting in SERVFAILs to clients and dropped responses when the query-handling failed internally.
  3. Configuring limits of ~3500 and lower was still insufficient protection for the busiest servers (those with an already-high query rate). This is because the 'bad' query rate could be high enough that the backlog turnover was faster than the response time of some 'good' authoritative servers, so 'good' client queries weren't surviving long enough in the backlog before being aborted in favour of new queries.
We experimented with a configurable soft-client-quota setting, but reverted to an automated value when we realized it was not actually possible to have a large number of 'early discards' in progress. We updated the default recursive-clients setting and went looking for a better solution.

Rate-limiting by zone

Next we considered max-clients-per-query. When many clients simultaneously query for the same name and type, they are all attached to the same fetch, up to the max-clients-per-query limit, and only one iterative query is sent. Limiting clients per query wasn't going to help with the traffic we were seeing in this DDoS attack, because all the queries are unique. The queries *were* being sent for the same domain(s), however, so we decided to rate-limit by zone. This was configured using a new option, fetches-per-zone, which defines the maximum number of simultaneous iterative queries to any one domain that the server will permit. We released fetches-per-zone in an experimental BIND build that we offered to support customers and to anyone who agreed to give us feedback, and waited for results.
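In named.conf, the option looks roughly like this (the value 200 is only an illustration, not a recommendation; the option is disabled, i.e. 0, by default — check the ARM for your BIND version):

```
options {
    // Cap on simultaneous iterative queries ("fetches") that this
    // resolver will have outstanding to any single domain.
    // 0 disables the limit.
    fetches-per-zone 200;
};
```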

Fetches-per-zone seemed to work for some users, but not for others. It was less effective for hosting providers and others with many zones per server: even with a per-zone fetch limit, a server with multiple zones under attack could still be overwhelmed. So that feature did not adequately protect the victim in a shared-server environment. We thought we probably needed a fetches-per-server limit, which would be harder to implement in BIND.

Looking for Better Ideas

When we first learned about the upswing in these attacks, it seemed as if the impact on resolvers was unintentional. We were reluctant to give prospective attackers useful information about the opportunity for DDOSing resolver operators, so we tried to keep quiet about the impact of these attacks on resolvers. This of course made it much harder to get data and get feedback on our experimental features.  We didn’t have enough service providers testing the software and providing feedback.  We had already spent months working on this problem, and we still didn’t have a really effective solution.

We decided to hold a forum at the regular IETF meeting in July of 2014 to ask for advice. We were still concerned about encouraging attackers, but we couldn't afford to spend more time on trial-and-error approaches. We invited BIND users and developers, as well as users of other DNS systems. Even though it was organized at the last minute, a lot of people came and contributed. Some of them were already dealing with this DDoS problem in their own networks; others were hearing about it for the first time.

Consulting the Community

By the time of the IETF meeting (July 2014) we had implemented the hold down timer, soft quotas and fetches-per-zone. All of these had limitations.

At the meeting, we talked about how to rate-limit the traffic, either per zone or per server. There was general support for the rate-limiting approach. Wouter Wijngaards of NLnet Labs shared what he was implementing in Unbound (rate limiting queries per server) and said we should definitely look at making that adaptive, so that the number of simultaneous fetches per server adjusted based on the authority's responsiveness. We also discussed how best to drop traffic once the resolver's limits were reached. Our soft-quota feature dropped the oldest waiting query in favor of the newest; we thought that a random early drop policy might work better. We had a few other ideas as well. A whitelist for fetches-per-zone could protect popular zones from rate-limiting. A reusable single socket for all queries to a particular authority would protect the resolver and the authority, but would do nothing to preserve the backlog of valid queries. We left the IETF meeting thinking that a new fetches-per-server option would give the best results, and that we needed a feature that would automatically detect when a server was overloaded and adapt, rather than enforcing a static limit.

Another outcome of that meeting, for me at least, was the happy realization that smart, informed people in the community were perfectly willing to contribute their expertise. Nobody criticized us for not having all the answers. ‘Competing’ developers were facing the same problem and were willing to share ideas. This meeting probably indirectly led us to the agreement, later in the Fall of 2014, between ISC and NLNET Labs to offer technical support for the NLNET Labs DNS servers.

 

Testing Resolver Rate-limiting in Experimental Releases

The fetches-per-server feature, implemented in late summer 2014, monitors the ratio of timeouts to successful responses from each authoritative server; when the ratio rises above a specified threshold, the number of simultaneous fetches that will be sent to that server is automatically tuned downward. When the server recovers and the timeout ratio drops below another threshold, the fetch limit is automatically tuned back up. The parameters for this tuning can be configured using fetch-quota-params.
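In named.conf these two options look roughly like this. The fetches-per-server value of 500 is only an illustration; the fetch-quota-params values shown are the documented defaults (recalculate every 100 queries, low/high timeout-ratio thresholds of 0.1 and 0.3, and a moving-average discount rate of 0.7) — verify against the ARM for your BIND version:

```
options {
    // Starting cap on simultaneous fetches to any one authoritative
    // server; tuned down or back up automatically as the measured
    // timeout ratio crosses the thresholds below. 0 disables it.
    fetches-per-server 500;

    // <recalculation interval (queries)> <low threshold>
    //     <high threshold> <moving-average discount rate>
    fetch-quota-params 100 0.1 0.3 0.7;
};
```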

Because BIND has a large installed base, spanning from the very small enterprise operator to large ISP and carrier users, we can’t release experimental features widely until we are sure of their efficacy and have verified that they cause minimal collateral damage.  We had a dozen or so service providers try out our new features and several provided detailed feedback. Jazztel, a Spanish ISP, gave us permission to use some charts from their production network in public presentations. Fetches-per-zone was very effective in their environment.  That encouraged us to go out and speak at conferences about the relative success we were seeing, and to invite other operators to request the experimental version of BIND.

Responding to Rate-limited Clients

At the conferences, we got some great questions that helped us refine our approach further. At one meeting, Geoff Huston, Chief Scientist at APNIC, asked whether we were sure we were sending the right response to the clients we were rate-limiting. That led us to add a knob to specify whether to quietly drop the rate-limited traffic or reply with SERVFAIL. (SERVFAIL is the DNS response code indicating a server failure; it will typically prompt the client to retry.) Either would be OK for ordinary clients, but failing to respond to another server that is forwarding through you would be bad, because effectively you'd be passing the impact of the DDoS along to it. SERVFAIL is the 'correct' response per the DNS protocol, but some administrators may prefer 'drop': the client behavior is going to be similar (it will likely retry the query) anyway, so why send a response that will just be ignored and that uses up your network bandwidth?
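The knob is expressed as an optional drop/fail argument to the rate-limiting options. A sketch, following the syntax documented in recent BIND ARMs (the numeric values are illustrative only):

```
options {
    // "fail" answers rate-limited clients with SERVFAIL -- the
    // protocol-correct choice, and the safer one if other servers
    // forward their queries through this resolver.
    fetches-per-zone 200 fail;

    // "drop" silently discards rate-limited queries, saving the
    // bandwidth of responses that would likely be ignored anyway.
    fetches-per-server 500 drop;
};
```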

We also heard updates from other vendors about what they were doing to address the continuing DDOS problem.   Nominum, for example, was committed to a blacklisting approach, derived from mining the data they collected from some of their customers. That wasn’t really an option for ISC because we didn’t have the data to create the blacklist.

BIND 9.10.3

We have decided which new features to keep and release into the general distribution, and which to remove. We will be releasing fetches-per-server, fetches-per-zone, and the drop/SERVFAIL policy knob in regular BIND releases beginning with BIND 9.10.3 (off by default, of course). We have deprecated the soft client quota and defaulted SERVFAIL caching to off. Since these features will now be in general production use, we have added more counters and statistics on rate-limited traffic. We have also put quite a bit of work into maintaining RPZ, which is used for implementing blacklists.

Longer Term

We are planning further refinements to both per-server and per-zone query rate limiting in BIND 9.11. We are hoping to make DDoS mitigation more effective by leveraging cookies, once there is wider support for multiple EDNS options. We think aggressive negative caching, a new IETF proposal (https://datatracker.ietf.org/doc/draft-fujiwara-dnsop-nsec-aggressiveuse/?include_text=1), has a lot of promise. We are interested in exploring other ideas for mitigating DNS DDoS at the network level, rather than the server level, and for enlisting help from connected providers in combating DDoS upstream from the victim.

The effort spent on developing, testing, and releasing DDoS mitigation features over the past year has come at the expense of many other things we wanted to do for BIND users. Still, we have mitigated some of the damage and helped operators stay up to fight abuse another day. The DDoS-ers are not really deterred, but for now they are somewhat less successful. We have gained some incremental controls in BIND, and a renewed sense of shared fate with everyone working to keep the Internet safe for legitimate traffic. We are very grateful to the community of BIND users and network operators who shared data on the attack and who worked with us to test countermeasures.

Your next steps are:

  • If you want to learn more about this, look at the slides from the recorded webinar on the topic, given by Cathy Almond, ISC’s Sr. Support Engineer and Support Team Lead, https://www.isc.org/mission/webinars/ or view the recording, https://youtu.be/x52OAye0sXg.
  • If you are not already participating on the bind-users@lists.isc.org mailing list, consider joining, or at least checking the archives occasionally.
  • Consider becoming an active beta tester for BIND and contributing your operational experience to improve support for DDOS mitigation and other features. BIND 9.10.3 should be available for beta testing in August 2015.

 

References

Documentation on using the experimental features in ISC’s KB

Recent IETF draft Aggressive use of NSEC/NSEC3 that could substantially short-circuit this DDOS.

 

Presentations

November 13, 2014 – LISA '14, DNS Response Rate Limiting, a Mini-tutorial – Eddy Winstead, Sales Engineer

February 3, 2015 – NANOG 63, Pseudo-random Domain DDOS – Eddy Winstead, Sales Engineer

March 4, 2015 – Apricot 2015, Random DNS Query Attack and Mitigation Approaches – Eddy Winstead, Sales Engineer

March 10, 2015 – Netnod Spring Meeting, Tales of the Unexpected: Handling Unusual Client Query Behaviour – Cathy Almond, Sr. Support Engineer

May 9, 2015 – OARC 2015 Spring Workshop, Update on Experimental BIND Features to Rate-limit Recursive Queries – Cathy Almond, Sr. Support Engineer

July 8, 2015 – ISC Webinar, Random Sub-domain Resolver DDOS Mitigation – Cathy Almond, Sr. Support Engineer

 

Last modified: July 15, 2015 at 5:35 pm