This article focuses on benchmarking resolver performance, using a methodology that aims to provide near-real-world performance results for resolvers. Our methodology has not changed significantly since our 2021 blog post, in which we compared BIND 9.16 performance to BIND 9.11. In this post, we will show that BIND 9.18 compares favorably with 9.16, using far less memory and significantly fewer CPU cycles to handle the same query loads.
Resolvers don’t know any DNS answers by themselves. They have to contact authoritative servers to obtain individual bits of information and then use them to assemble the final answer. Resolvers are built around the concept of DNS caching. The cache stores DNS records previously retrieved from authoritative servers. Individual records are stored in a cache up to the time limit specified by the authoritative server (Time To Live, or TTL). Caching greatly improves scalability.
Any DNS query which can be fully answered from cache (a so-called “cache hit”) is answered blazingly fast from the DNS resolver’s memory. On the other hand, any DNS query which requires a round-trip to authoritative servers (a “cache miss”) is bound to be orders of magnitude slower. Moreover, cache miss queries consume more resources because the resolver has to keep the intermediate query state in its memory until all information arrives.
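The hit/miss distinction can be illustrated with a toy cache. This is only a sketch under stated assumptions, not how BIND implements its cache: the `TtlCache` class, the `upstream` callback (standing in for the round-trip to authoritative servers), and the injectable clock are all hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Dict, Tuple

# Hypothetical callback type: (name, rtype) -> (answer, ttl_seconds).
Upstream = Callable[[str, str], Tuple[str, float]]


class TtlCache:
    """Toy DNS cache mapping (name, rtype) to (answer, expiry time)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._store: Dict[Tuple[str, str], Tuple[str, float]] = {}
        self.hits = 0
        self.misses = 0

    def resolve(self, name: str, rtype: str, upstream: Upstream) -> str:
        key = (name.lower(), rtype)  # DNS names are case-insensitive
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            # Cache hit: answered from memory, no network round-trip.
            self.hits += 1
            return entry[0]
        # Cache miss (or expired record): ask the slow upstream and
        # remember the answer until its TTL runs out.
        self.misses += 1
        answer, ttl = upstream(name, rtype)
        self._store[key] = (answer, now + ttl)
        return answer
```

Because every call may insert or expire a record, the outcome of one query depends on which queries came before it and when, which is the property that makes resolver benchmarking sensitive to query order and timing.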
This very principle of the DNS resolver has significant implications for benchmarking: in theoretical terms, each DNS query potentially changes the state of the DNS resolver cache, depending on its timing. In other words, queries are not independent of each other. Any change to how (and when) we query the resolver can impact measurement results.
In more practical terms, this implies a list of variables that we have to replicate:
The traditional approach implemented, e.g., in ISC’s Perflab or using the venerable resperf tool, cannot provide realistic results because it ignores most of these variables.
The second implication is that even the traditional QPS metric (queries answered per second) alone is too limited when evaluating resolver performance: it does not express the type of queries, answer sizes and TTLs, query timing, etc.
Other performance-relevant variables include:
But these are not fundamentally different from benchmarking authoritative servers, so we will not delve into details.
The long list of variables above makes it clear that preparing an isolated laboratory with a realistic test setup is very hard. In fact, ISC and other DNS vendors have learned that it’s impossible; realistic resolver benchmarking must be done on the live Internet.
Developers from CZ.NIC Labs wrote a test tool called DNS Shotgun for this purpose. It replays DNS queries from traffic captures and simulates individual DNS clients, including their original query timing. The resolver under test then processes queries as usual, i.e., contacts authoritative servers on the Internet and sends answers back to the simulated clients. DNS Shotgun then receives and analyzes the answers.
Obviously, benchmarking on a live network cannot provide us with perfectly stable results. To counter that, we repeat each test several times and always take fresh measurements instead of using historical data. E.g., a comparative test of BIND versions 9.16.10 and 9.16.18 (which were released half a year apart) requires us to measure both versions again. This process ensures that half a year of changes on the Internet and our test system do not skew our comparison.
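One way to combine repeated runs is to summarize each sampling point across runs. The sketch below is an assumption about how such a summary could be computed, not a description of the tooling actually used for this article; the function name `summarize_runs` and the choice of minimum, median, and maximum are illustrative.

```python
from statistics import median
from typing import Dict, List, Sequence


def summarize_runs(runs: Sequence[Sequence[float]]) -> List[Dict[str, float]]:
    """Collapse repeated runs into per-sample min, median, and max.

    Each run is a series of measurements taken at the same sampling
    interval. Runs are truncated to the length of the shortest one.
    """
    if not runs:
        raise ValueError("at least one run is required")
    length = min(len(run) for run in runs)
    summary = []
    for i in range(length):
        values = [run[i] for run in runs]
        summary.append(
            {"min": min(values), "median": median(values), "max": max(values)}
        )
    return summary
```

A narrow gap between the minimum and maximum at a given sample means the repeated runs agree with each other; a wide gap means the result at that point is less stable.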
For each test run, we start with a new resolver instance with an empty cache. This way, we simulate the worst case of regular operation: it is as if the resolver was restarted and now has to rebuild its cache from ground zero.
Let’s have a look at the variables we measure and how to interpret them.
The QPS metric alone is not particularly meaningful in the context of regular DNS resolver operation. Instead, we measure indications that resolver clients are getting timely answers, and resource consumption on the server.
a) CPU Utilization
We monitor the time BIND processes spend using the CPU, as reported by the Linux kernel Control Group version 2 metric usage_usec, and then normalize the value so that 100 % utilization = 1 fully utilized CPU. Our test machine has 16 cores, so its theoretical maximum is 1600 %. CPU usage is a cumulative metric and we plot a new data point every 0.1 seconds.
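The normalization can be sketched as follows. In cgroup v2, usage_usec is a field of the `cpu.stat` file and counts cumulative CPU time in microseconds, so utilization over an interval is the difference between two readings divided by the interval length. The function names below are illustrative, not part of our tooling.

```python
def parse_usage_usec(cpu_stat_text: str) -> int:
    """Extract the cumulative usage_usec counter from cpu.stat contents."""
    for line in cpu_stat_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "usage_usec":
            return int(value)
    raise ValueError("usage_usec not found in cpu.stat contents")


def cpu_percent(prev_usec: int, curr_usec: int, interval_s: float) -> float:
    """Convert two cumulative readings into percent of one CPU.

    100.0 means one fully utilized CPU; a 16-core machine tops out
    at 1600.0.
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return (curr_usec - prev_usec) / (interval_s * 1_000_000) * 100.0
```

For example, if the counter grows by 100,000 microseconds between two readings taken 0.1 seconds apart, the process group kept exactly one CPU busy for that interval.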
b) Memory Usage
Similarly to CPU usage, we use the Linux kernel Control Group version 2 metric memory.current to monitor BIND 9’s memory consumption. It is documented as “the total amount of memory currently being used” and thus includes memory used by the kernel itself to support the named process, as well as network buffers used by BIND. Resolution of the resource monitoring data is 0.1 seconds, but the memory consumption metric is a point-in-time value, so hypothetical memory usage spikes shorter than 0.1 seconds would not show on our plots.
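Reading this metric is a matter of parsing a single integer (a byte count) from the cgroup's memory.current file. A minimal sketch, assuming the cgroup directory path is passed in by the caller; the helper names are hypothetical and the actual cgroup path depends on how the resolver is started:

```python
from pathlib import Path


def read_memory_current(cgroup_dir: str) -> int:
    """Return the cgroup's point-in-time memory usage in bytes."""
    return int(Path(cgroup_dir, "memory.current").read_text().strip())


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to gibibytes for plotting."""
    return n_bytes / 2**30
```

Unlike the cumulative CPU counter, each reading stands alone, so anything that happens between two reads is invisible to the monitor.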
c) Response latency - How quickly does the resolver respond?
Finally, we arrive at the most useful but also the most convoluted metric: response latency, which directly affects user experience. Unfortunately, DNS latency is wildly non-linear: most answers will arrive within a sub-millisecond range for all cache hits. Latency increases to a range of tens to hundreds of milliseconds for normal cache misses and reaches its maximum, in the range of seconds, for cache misses which force communication with very slow or broken authoritative servers.
This inherent nonlinearity also implies that the simplest tools from descriptive statistics do not provide informative results.
To deal with this complexity, the fine people from PowerDNS developed a logarithmic percentile histogram which visualizes response latency. It allows us to see things such as:
and so on.
Even more importantly, a logarithmic percentile histogram allows us to compare the latency of various resolver setups visually.
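The data behind such a chart can be sketched as follows. This is an illustration under stated assumptions, not the PowerDNS or DNS Shotgun implementation: the function name, the default percentiles, and the input format (one latency in milliseconds per query, `None` for an unanswered query) are hypothetical, and the 2000 ms cap mirrors the DNS Shotgun client timeout of 2 s.

```python
import math
from typing import List, Optional, Sequence, Tuple


def slowest_percentile_curve(
    latencies_ms: Sequence[Optional[float]],
    slowest_percents: Sequence[float] = (50.0, 10.0, 1.0, 0.1),
    timeout_ms: float = 2000.0,
) -> List[Tuple[float, float]]:
    """Return (p, latency) pairs for a logarithmic percentile histogram.

    For each p, the latency is the fastest response among the slowest
    p % of queries, i.e. p % of queries took at least that long.
    Unanswered queries (None) count as having hit the client timeout.
    """
    if not latencies_ms:
        raise ValueError("no latencies given")
    ordered = sorted(
        (timeout_ms if lat is None else min(lat, timeout_ms)
         for lat in latencies_ms),
        reverse=True,  # slowest first
    )
    n = len(ordered)
    curve = []
    for p in slowest_percents:
        # Number of queries in the slowest p %, at least one.
        # round() guards against floating-point noise before ceil().
        k = max(1, math.ceil(round(n * p / 100.0, 9)))
        curve.append((p, ordered[min(k, n) - 1]))
    return curve
```

Plotting the percentages on a logarithmic axis gives the rare slow queries as much horizontal space as the common fast ones, which a plain average or median hides.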
For realistic results, we need a realistic query data set. This article presents results measured using traffic captures (of course anonymized!) provided by one European telecommunications operator. We would really love any samples other operators could provide, as diversity in our sample data would make our testing more representative.
These traffic captures contain one hour of traffic directed to 10 independent DNS resolvers, all of them with roughly the same influx of queries. In practice, we have 10 PCAP files: the first with queries originally directed for resolver #1, the second with queries directed to resolver #2, etc.
These traffic captures define the basic “load unit” we use throughout this article: traffic directed to one server = load factor 1x. To simulate higher load on the resolver, we simultaneously replay traffic originally directed to N resolvers to our single resolver instance under test, thus increasing load N times. E.g., if we are testing a resolver under load factor 3x, we simultaneously replay traffic originally directed to resolvers #1, #2, and #3.
This definition of load factor allows us to avoid theoretical metrics like QPS and simulate realistic scenarios. For example, it allows us to test this scenario: “What performance will we get if nine out of 10 resolvers have an outage and the last resolver has to handle all the traffic?”1
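For load factors up to the number of captured resolvers, choosing the input is simply a matter of taking the first N capture files. A minimal sketch; the function name and file names are hypothetical, and load factors beyond the number of captures need the slicing method described in the footnote, which this sketch does not cover:

```python
from typing import List, Sequence


def captures_for_load_factor(
    capture_files: Sequence[str], load_factor: int
) -> List[str]:
    """Pick the capture files to replay simultaneously for load factor N.

    Load factor 1x replays the traffic of resolver #1 only; 3x replays
    resolvers #1, #2, and #3 together against the resolver under test.
    """
    if not 1 <= load_factor <= len(capture_files):
        raise ValueError(
            "load factor must be between 1 and the number of captures"
        )
    return list(capture_files[:load_factor])


# Hypothetical file names for the 10 captures.
pcaps = [f"resolver{i:02d}.pcap" for i in range(1, 11)]
selected = captures_for_load_factor(pcaps, 3)
```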
Here is the basic testbed setup we used to compare the BIND 9.18 series of releases to equivalent BIND 9.16 versions. We intentionally are not providing the exact hardware specifications to prevent readers from an undue generalization of results.
The resolver is configured with max-cache-size set to 30 gigabytes. Practically all other values are left at default settings: the resolver is doing full recursion and DNSSEC validation. Also, the resolver has both IPv4 and IPv6 connectivity.
There is one point I cannot stress enough:
Individual test results like response rate, answer latency, maximum QPS, etc., are generally valid only for the specific combination of all test parameters, the input data set, and the specific point in time.
In other words, results obtained using this method are helpful ONLY for relative comparison between versions, configurations, etc., measured on the exact same setup with precisely the same data and time.
For example, a test indicates that a residential ISP setup with a resolver on a 16-core machine can handle 160 k QPS. It’s not correct to generalize this to another scenario and say, “a resolver on the same machine will handle a population of IoT devices with 160 k QPS on average” because it very much depends on the behavior of the clients. If all of our hypothetical IoT devices query every second for api.vendor.example.com AAAA, the resolver will surely handle the traffic because all queries cause a cache hit. On the other hand, if each device queries for a unique name every second, all queries will cause a cache miss and the throughput will be much lower. Even historical results for the very same setup are not necessarily comparable because “something” might have changed on the Internet.
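The two extremes in this example can be put into numbers with a small simulation. It is a sketch only: the `cache_hit_ratio` function, the 60-second TTL, the device count, and the generated unique names are all assumptions made for illustration, and the model ignores everything except TTL expiry.

```python
from typing import Dict, Iterable, Tuple


def cache_hit_ratio(queries: Iterable[Tuple[float, str]], ttl_s: float) -> float:
    """Fraction of queries answered from cache, given one fixed TTL.

    `queries` is a time-ordered sequence of (timestamp_s, name) pairs.
    """
    expires: Dict[str, float] = {}
    hits = 0
    total = 0
    for ts, name in queries:
        total += 1
        if expires.get(name, float("-inf")) > ts:
            hits += 1  # record still valid: cache hit
        else:
            expires[name] = ts + ttl_s  # cache miss: fetch and store
    if total == 0:
        raise ValueError("no queries given")
    return hits / total


# 100 hypothetical devices, each sending one query per second for 10 seconds.
same_name = [(t, "api.vendor.example.com") for t in range(10) for _ in range(100)]
unique_names = [(t, f"dev{d}-t{t}.vendor.example.com")
                for t in range(10) for d in range(100)]

ratio_same = cache_hit_ratio(same_name, ttl_s=60.0)
ratio_unique = cache_hit_ratio(unique_names, ttl_s=60.0)
```

Both streams have the same QPS, yet in the first one only the very first query is a cache miss, while in the second every query is a miss. A single QPS figure cannot tell these workloads apart.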
Please allow me to repeat myself:
This test was designed to compare BIND 9.16 to BIND 9.18, handling a specific set of client queries at a specific point in time. Depending on the test parameters and your client population, your results could be completely different, which is why we recommend you test yourself if you can.
We use the same resolver configuration and traffic to test both versions. We ran the test four times, increasing the load factor from our base load factor (1x) to 5x, 10x, and finally 15x, to show how the resolver performs under heavy load. For each test, we measure CPU utilization, memory usage, latency with a cold cache, and latency with a warm cache.
From these charts, we can see that version 9.18.19:
We have established that the resolver running BIND 9.18.19 is at least as performant as BIND 9.16 under ordinary loads. Let’s see what happens if we push harder and increase the load by a factor of 5:
Here, we can see that BIND 9.18 uses half the memory and consumes fewer CPU resources than 9.16 in the 5x load test. Latency in the warm cache test is about the same, slightly better for 9.18, but in the cold cache test, we see significantly fewer failed queries in 9.18 vs 9.16. (The intersection of the line with the top left quadrant of the chart shows queries that remain unanswered after any reasonable client has timed out.)
Let’s see what happens under load 10x:
BIND 9.18 spikes to higher CPU utilization right at startup, when it works hard to catch up with the flood of traffic, doing more work in parallel on cache-miss queries than 9.16 was able to. After that initial spike, it quickly settles down to lower utilization than 9.16. Memory usage is also initially higher in 9.18, but after about 8 seconds it drops to substantially less than 9.16 for the duration of the test. Notice how much narrower the shaded portion of the chart is around the 9.18 line than the 9.16 line on the memory usage chart. The narrower shading indicates far less variability in memory utilization in 9.18, an indicator of performance stability. At 10x load, 9.18 performs almost the same as 9.16 in terms of latency, with fewer queries timing out in the cold cache scenario with 9.18 - this is the payoff of higher CPU utilization at the very beginning of the test.
In the final test scenario, with a 15x load factor, we expect to see more variability (wider shading around the lines) as BIND grows unstable under heavy load.
With the 15x load factor, the resolver is practically overloaded. CPU utilization is pretty high, although after the first 5 seconds, it is again significantly lower with BIND 9.18. We see slightly wider variance bars on the memory utilization chart, with 9.18 again using less memory than BIND 9.16, after an initial spike. BIND 9.18 has slightly better results from cache hits in the warm cache scenario, but probably not significantly different. From looking at the intersection of the lines at the top left of the chart, fewer queries are timing out in the cold cache scenario with BIND 9.18. The 9.18 resolver is able to cope better with this overload situation, and 3 % fewer queries time out than with 9.16.
We have extensively tested BIND 9.18.19 resolver performance using traffic captures from a telecommunications operator. We conclude that this new version outperforms the resolver in BIND 9.16.44. In steady state, BIND 9.18 uses much less memory, somewhat less CPU, and delivers answers to clients with lower latency. At the same time, BIND 9.18.19 has better parallelization and is able to cope better with overload.
We embarked on this benchmarking project because we had multiple anecdotal reports from users of performance regressions in the initial versions of the BIND 9.16 resolver. Using the test method described above, we were able to confirm this regression in versions of BIND 9.16 prior to 9.16.19 and identify multiple issues introduced by the refactoring in that branch. By repeating the test between BIND 9.16.44 and BIND 9.18.19, we ensure that the same mistake will not affect our users who upgrade from the BIND 9.16 branch to 9.18.
Note that we are hard at work on creating BIND 9.20 now, and when BIND 9.20 is released, sometime in Q1 2024, users will have a quarter to update from BIND 9.16 before that branch goes EOL.
To simulate higher load factors, we slice and replay the traffic using the method described in this video presentation about DNS Shotgun around time 7:20. Most importantly, this method retains the original query timing and realistically simulates N-times more load. This method works under the assumption that the additional traffic we simulate behaves the same way as the traffic we already have. I.e., if you have 100,000 clients already, the assumption is that the next 100,000 will behave similarly. This assumption allows us to re-use slices of the original traffic capture from 10 resolvers to simulate the load on 20 resolvers. ↩︎
The DNS Shotgun timeout of 2 s was selected to reflect a typical timeout on the client side. BIND uses an internal timeout of 10 s to resolve queries; the resolver continues resolving the query even after the client has given up. This extra time allows the resolver to find answers even with very broken authoritative setups and cache them. These answers are then available when the clients ask again. ↩︎