RIPE 86 - Rotterdam Trip Notes
Five of us from ISC attended the RIPE 86 meeting in Rotterdam, May 22 - 26, 2023.Read post
This article focuses on benchmarking resolver performance, using a new methodology that aims to provide near-real-world performance results for resolvers.1
Resolvers don’t know any DNS answers by themselves. They have to contact authoritative servers to obtain individual bits of information and then use them to assemble the final answer. Resolvers are built around the concept of DNS caching. The cache stores DNS records previously retrieved from authoritative servers. Individual records are stored in a cache up to the time limit specified by the authoritative server (Time To Live, or TTL). Caching greatly improves scalability.
Any DNS query which can be fully answered from cache (a so-called “cache hit”) is answered blazingly fast from the DNS resolver’s memory. On the other hand, any DNS query which requires a round-trip to authoritative servers (a “cache miss”) is bound to be orders of magnitude slower. Moreover, cache miss queries consume more resources because the resolver has to keep the intermediate query state in its memory until all information arrives.
This very principle of the DNS resolver has significant implications for benchmarking: in theoretical terms, each DNS query potentially changes the state of the DNS resolver cache, depending on its timing. In other words, queries are not independent of each other. Any change to how (and when) we query the resolver can impact measurement results.
In more practical terms, this implies a list of variables that we have to replicate:
The traditional approach implemented, e.g., in ISC’s Perflab or using the venerable resperf tool, cannot provide realistic results because it ignores most of these variables.
The second implication is that even the traditional QPS metric (queries answered per second) alone is too limited when evaluating resolver performance: it does not express the type of queries, answer sizes and TTLs, query timing, etc.
Other performance-relevant variables include:
But these are not fundamentally different from benchmarking authoritative servers, so we will not delve into details.
The long list of variables above makes it clear that preparing an isolated laboratory with a realistic test setup is very hard. In fact, ISC and other DNS vendors have learned that it’s impossible; realistic resolver benchmarking must be done on the live Internet.
Developers from CZ.NIC Labs wrote a test tool called DNS Shotgun for this purpose. It replays DNS queries from traffic captures and simulates individual DNS clients, including their original query timing. The resolver under test then processes queries as usual, i.e., contacts authoritative servers on the Internet and sends answers back to the simulated clients. DNS Shotgun then receives and analyzes the answers.
Obviously, benchmarking on a live network cannot provide us with perfectly stable results. To counter that, we repeat each test several times and always take fresh measurements instead of using historical data. E.g., a comparative test of BIND versions 9.16.10 and 9.16.18 (which were released half a year apart) requires us to measure both versions again. This process ensures that half a year of changes on the Internet and our test system do not skew our comparison.
For each test run, we start with a new resolver instance with an empty cache. This way, we simulate the worst case of regular operation: it is as if the resolver was restarted and now has to rebuild its cache from ground zero.
Let’s have a look at the variables we measure and how to interpret them.
The QPS metric alone is not particularly meaningful in the context of regular DNS resolver operation. Instead, we measure indications that resolver clients are getting useful answers.
a) Response rate - Does the resolver answer within a time limit?
This metric serves as a sanity check: a resolver has to answer the vast majority of queries within the client’s time limit because an answer one millisecond after the client times out is useless.
The fact that a resolver does not answer typically indicates significant overload. Still, it can also happen naturally right after resolver startup: the resolver has an empty cache, and all queries cause cache misses, require orders of magnitude more processing, and thus lead to much lower throughput. In a steady state, most queries cause a cache hit, leading to higher throughput.
Except for this startup phase, we generally expect a resolver to answer all the queries, except for packets malformed beyond recognition. The proportion of malformed queries which naturally occur in traffic depends on client behavior and changes over time.
b) Response code (RCODE) - How many failures do we observe?
Another sanity check is the proportion of RCODEs in received answers. Immediate answers are useless if all of them are SERVFAIL (or other error codes), so we need to check the proportion of RCODEs in answers we receive. Usually, the vast majority of traffic should be NOERROR and NXDOMAIN answers, but SERVFAIL, FORMERR, and REFUSED also occur naturally.
Also, the proportion of RCODEs depends on client behavior and changes over time.
c) Response latency - How quickly does the resolver respond?
Finally, we arrive at the most useful but also the most convoluted metric: response latency, which directly affects user experience. Unfortunately, DNS latency is wildly non-linear: most answers will arrive within a split-millisecond range for all cache hits. Latency increases to a range of tens to hundreds of milliseconds for normal cache misses and reaches its maximum, in the range of seconds, for cache misses which force communication with very slow or broken authoritative servers.
This inherent nonlinearity also implies that the simplest tools from descriptive statistics do not provide informative results.
To deal with this complexity, the fine people from PowerDNS developed a logarithmic percentile histogram which visualizes response latency. It allows us to see things such as:
and so on.
Even more importantly, a logarithmic percentile histogram allows us to compare the latency of various resolver setups visually.
Finally, we are finished with the theoretical introduction and can start discussing our results.
For realistic results, we need a realistic query data set. This article presents results measured using traffic captures (of course anonymized!) provided by one European telecommunications operator.
These traffic captures contain one hour of traffic directed to 10 independent DNS resolvers, all of them with roughly the same influx of queries. In practice, we have 10 PCAP files: the first with queries originally directed for resolver #1, the second with queries directed to resolver #2, etc.
These traffic captures define the basic “load unit” we use throughout this article: traffic directed to one server = load factor 1x. To simulate higher load on the resolver, we simultaneously replay traffic originally directed to N resolvers to our single resolver instance under test, thus increasing load N times. E.g., if we are testing a resolver under load factor 3x, we simultaneously replay traffic originally directed to resolvers #1, #2, and #3.
This definition of load factor allows us to avoid theoretical metrics like QPS and simulate realistic scenarios. For example, it allows us to test this scenario: “What performance will we get if nine out of 10 resolvers have an outage and the last resolver has to handle all the traffic?"2
Here is the basic testbed setup we used to compare the BIND 9.16 series of releases to equivalent BIND 9.11 versions. We intentionally are not providing the exact hardware specifications to prevent readers from an undue generalization of results.
max-cache-sizeset to 30 gigabytes. Practically, all other values are left at default settings: the resolver is doing full recursion and DNSSEC validation. Also, the resolver has both IPv4 and IPv6 connectivity.
There is one point I cannot stress enough:
Individual test results like response rate, answer latency, maximum QPS, etc., are generally valid only for the specific combination of all test parameters, the input data set, and the specific point in time.
In other words, results obtained using this method are helpful ONLY for relative comparison between versions, configurations, etc., measured on the exact same setup with precisely the same data and time.
For example, a test indicates that a residential ISP setup with a resolver on a 16-core machine can handle 160 k QPS. It’s not correct to generalize this to another scenario and say, “a resolver on the same machine will handle a population of IoT devices with 160 k QPS on average” because it very much depends on the behavior of the clients. If all of our hypothetical IoT devices query every second for
api.vendor.example.com AAAA, the resolver will surely handle the traffic because all queries cause a cache hit. On the other hand, if each device queries for a unique name every second, all queries will cause a cache miss and the throughput will be much lower. Even historical results for the very same setup are not necessarily comparable because “something” might have changed on the Internet.
Please allow me to repeat myself:
This test was designed to compare BIND 9.11 to BIND 9.16, handling a specific set of client queries at a specific point in time. Depending on the test parameters and your client population, your results could be completely different, which is why we recommend you test yourself if you can.
To establish a baseline, we replay 120 seconds of traffic from one randomly selected resolver in our data set to BIND v9.11.34. Let’s inspect the resolver performance in detail:
a) Response rate - Does the resolver answer within a time limit?
First, we plot the percentage of responses received within the 2-second time limit over time.
At first glance, the resolver is able to answer the vast majority of queries starting from the third second of the test, which is good. At the same time, we can see tiny drops distributed seemingly randomly across the time axis. Possible explanations include:
To get more data, we repeat the same test nine times - and we can see drops at the precisely same places, with very similar amplitude. Let’s zoom in on one such drop:
After nine test runs, we can see the drops are reliably reproducible, which practically rules out noise caused by the test environment or the resolver itself. Also, we are using a battle-tested version of BIND from the 9.11 series, which makes it unlikely BIND itself would be terribly broken and cause these drops.
A remaining hypothesis is that something in the data set is causing this. To verify it, we re-ran the test using data captured from other telco resolvers. We confirmed that the distribution of drops changes for each resolver and stays stable across multiple test runs.
In other words, we have confirmed that our data set (consisting of “normal” telco traffic) contains weird queries, which is something we have to live with: we are testing with real-world data!
b) Response code (RCODE) - How many failures do we observe?
We have established that the resolver answers a reasonable proportion of queries. Now we have to check if the resolver answers “sensibly,” i.e., that response codes SERVFAIL, REFUSED, and FORMERR are only a small fraction of the answers. To do this, we can take the measurement results we already have and plot each RCODE as a separate line:
We can see NOERROR answers usually represent 90-95 % of all answers, and NXDOMAIN oscillates roughly around 4 %. SERVFAIL, REFUSED, and FORMERR are also present, and their proportion randomly goes up and down, most likely depending on what weird queries clients send and how many broken authoritative servers the resolver has to contact. Also, we can see that after 100 seconds, a client sends a high volume of queries that generate REFUSED answers.
Again, we verify this is the property of our data set by inspecting test results for other telco resolvers we have data from. Only two out of ten traffic captures contain spikes in REFUSED answers, which confirms our hypothesis that the error codes we observe result from suspicious client behavior.
Again, we have confirmed that DNS traffic is the wild west, and any resolver must deal with it.
c) Response latency - How quickly does the resolver respond?
Measuring latency right after resolver startup would be misleading because the cache is empty, leading to an unrepresentative cache hit ratio. To counter this problem, we visualize latency data only from the second half of the test, which represents what users see during normal operation.
The following chart is an enhanced version of a logarithmic percentile histogram. Each test was repeated nine times, and the line shows average latency. The shaded area around the solid line denotes minimum and maximum values across all runs. The results are bi-modal, with answers served from cache shown on the lower right section of the chart, and the lines in the center and upper left sections showing the longer tail of latency for queries requiring recursion.
The Y-axis shows latency, while the X-axis is the percentile rank of the slowest queries. Translated to words:
The shading shows the minimum and maximum latency from nine test runs, which gives us an idea about result stability:
We have established a baseline, using BIND 9.11.34 and traffic from a single telco resolver, i.e., load factor 1x. The next question is: How can we usefully compare the maximum performance of a resolver running BIND 9.11.34 to one running 9.16.19?
Ideally, the resolver will answer the same percentage of queries as it did under the baseline load as the load factor increases. When the resolver starts losing queries, it is overloaded. This value is visible in the latency chart in the upper left corner as the percentile rank on the X-axis, where the line touches the timeout limit on the Y-axis. For BIND 9.11.34 under load “1x our telco resolver,” the normal percentage of unanswered queries is around 0.5 % (which consists of either severely malformed queries or queries pointing to domains that require more than 2000 ms to resolve).
The second and more sensitive criterion is overall latency. Suppose we overload the resolver only a bit. In that case, it will still manage to answer almost all the queries, but latency will increase. Latency is an area where operators can set arbitrary limits. This article uses the (admittedly vague) criterion “latency is acceptable if it does not significantly exceed the latency observed under the baseline load.” In other words, it’s bad if the latency plot for higher loads lies in the “up and right” direction from the original baseline on the latency histogram.
Here we can see that concentrating traffic from seven originally independent telco resolvers on a single machine running BIND 9.11.34 actually improves latency! The main reason is an improved cache hit rate, which happens naturally when more traffic concentrates on a single resolver. The cache also helps with getting answers from half-broken domains: even if the first query for a broken domain times out on the client side, BIND will continue resolving it and eventually cache the answer3. With more clients sending traffic to the same resolver, chances are higher that another client will send a query for the same broken domain. Then the client will get an answer from the cache, leading to a lower overall ratio of client timeouts.
Let’s try to increase load even more by sending traffic from eight resolvers to one:
This time, increasing the load to 8x the baseline did not significantly improve the ratio of answers with latency smaller than 100 ms. It somewhat increased latency for very slow answers. Even more importantly, the shaded background in the top-left quadrant indicates the resolver is working hard. We are on the verge of increasing the ratio of queries that time out.
We can push a bit harder and try to load a factor of 9x:
This chart shows that a load factor of 9x is too much for BIND 9.11.34 to handle. The proportion of queries that timed out is a bit higher. The shaded backgrounds between load factor 8x and 9x do not overlap, which indicates this relatively small difference is not a result of random noise. Also, the proportion of answers with “problematic” answers with latency higher than 100 ms is a bit higher, which indicates the resolver is working really hard but not keeping up.
Based on this data, we can conclude that load factors of 7x to 8x are about the maximum load the resolver can handle without leading to a degraded user experience. In other words, we can safely direct traffic from seven to eight “original” resolvers to a single instance, with load factor 7x being more on the safe side.
We have now found the performance limits of BIND 9.11.34, and finally, we can compare it with its successor: the BIND 9.16 series.
We use the same resolver configuration and traffic to test both versions. Let’s jump straight to tests with load factor 7x, which is about the maximum BIND 9.11.34 can safely handle, and compare it with BIND 9.16.19:
From this chart, we can see that version 9.16.19:
The higher latency for 5 % of queries was pretty disappointing for our engineering team. Users will not notice a difference between answers arriving in 5 or 6 ms, but our engineers could not get it out of their minds. This was a matter of principle! Eventually an investigation led to the removal of four lines of code which fixes this issue. This fix is scheduled for release in August 2021.
We have established that the resolver running BIND 9.16.19 is at least as performant as BIND 9.11. Let’s see what happens if we push harder and double the load on BIND 9.16.19:
Currently, we can see the resolver still works fine and answers more queries than version 9.11.34 would answer under even half the load. Doubling the load increased latency of 15 % of queries by (at most) 2.5 ms, which is very good.
Let’s see what happens under load 15x:
We can clearly see that load factor 15x is too much for BIND 9.16.19. Even though the resolver still answers queries as it should, the wide shaded background area indicates that the latency of answers is wildly unstable. Also, on average, the latency is worse than it was in all previous experiments.
We have extensively tested BIND 9.16.19 resolver performance using traffic captures from a telecommunications operator. We conclude that this new version outperforms the resolver in BIND 9.11.34. A minor glitch, which incurs about 1-2 ms latency for a small percentage of answers, is already fixed and will be released in August 2021.
We embarked on this benchmarking project because we had multiple anecdotal reports from users of performance regressions in the BIND 9.16 resolver. Using the test method described above, we were able to confirm this regression in versions of BIND 9.16 prior to 9.16.19 and identify multiple issues introduced by the refactoring in that branch. By repeating the test over several months as we modified the BIND code, we were able to eliminate the problems and confirm that 9.16.19 now performs as well as or better than the 9.11 series.
Last year, we published measurement results comparing the performance of BIND versions. The older article was primarily focused on the authoritative DNS server use-case. It also included one test for DNS resolver performance, but we have learned that the test was not realistic enough to predict real-world performance. ↩︎
To simulate higher load factors, we slice and replay the traffic using the method described in this video presentation about DNS Shotgun around time 7:20. Most importantly, this method retains the original query timing and realistically simulates N-times more load. This method works under the assumption that the additional traffic we simulate behaves the same way as the traffic we already have. I.e., if you have 100,000 clients already, the assumption is that the next 100,000 will behave similarly. This assumption allows us to re-use slices of the original traffic capture from 10 resolvers to simulate the load on 20 resolvers. ↩︎
The DNS Shotgun timeout of 2 s was selected to reflect a typical timeout on the client side. BIND uses an internal timeout of 10 s to resolve queries; the resolver continues resolving the query even after the client has given up. This extra time allows the resolver to find answers even with very broken authoritative setups and cache them. These answers are then available when the clients ask again. ↩︎
What's New from ISC