Over the last several months, we’ve been working on improving (reducing) memory footprint of the BIND 10′s authoritative DNS server, b10-auth. The result is pretty satisfactory so far. According to my recent experiments, actual memory footprint after loading various types of zones is generally just a half of what would be needed in BIND 9. And, even if the BIND 10 version is still in the middle of tuning the response performance, it’s already faster than BIND 9 in my experiments. In this article I’ll explain some technical details of specifically how we achieved that with some experimental results.
We began with examining BIND 9′s internal data structure for in-memory zone data, called rbtdb. It represents a zone’s RRsets using a tree-like structure and associated data as shown in the following diagram:
More accurately, the “tree-like” structure is a tree of balanced trees (in this implementation red black trees). Each “level” of the tree of trees corresponds to a specific portion in the hierarchy of the domain name space for the zone: the top level corresponds to some highest level domain of the zone (in the example diagram, it consists of a single domain, which is the zone origin). The second level corresponds to a subdomain of one of the nodes (domain) at the top level. The third level corresponds to a subdomain of one of the nodes (domain) at the second level, and so on. Each node of the tree of trees represents some labels of the corresponding level, and all labels concatenated from a node toward the top level produce a complete domain name in the zone (e.g., ns1.child.example.com. or www.example.com.).
When it’s used for storing a zone, the owner name of an RRset of the zone must correspond to one node of the tree of trees. Such a node has data, which is a linked list of “rdataset” structures. Each rdataset represents the rest of the RRset parameters such as the type, TTL, and actual RDATAs. If the zone is signed, RRSIGs are also represented as their own rdatasets.
In terms of memory efficiency, this looked pretty reasonable. The nested tree structure helps avoid having redundant copies of common parts of the owner name, such as “example.com.”. By maintaining other parameters via the notion of rdataset, we can avoid having duplicate copy of common RR parameters such as the TTL.
As such, in BIND 10 we used this representation as a base, and made it as efficient as possible. In what follows, I’ll explain specific methods we took for this goal.
Simplify Overall Data Structures
The BIND 9′s rbtdb supports both authoritative and recursive (caching) server usage in the same single data structure. As a result, it contains many member variables in the structures that are only used for recursive servers. In BIND 10, we generally separate authoritative and recursive implementations, and it’s expected we use a completely different data structure for the cache used in the recursive server. We therefore removed all recursive-only members.
The tree of trees in BIND 9 internally maintains a hash table to improve lookup performance. We did not adopt this idea in BIND 10. It increases both memory footprint (each node has fields for the hash support) and implementation complexity, and yet it did not seem to be very effective in terms of overall response performance. We believe we can achieve reasonably good performance via other performance considerations (which seems to be supported in experimental results shown below), and if we still need to avoid the lookup bottleneck we’ll need even more drastic changes to the structure (we have specific ideas for this, but right now we are not sure if we need this level of optimization in practice).
Recent versions of BIND 9 supports memory-mapped form of zone data for faster loading. But, its implementation requires yet other members of the node and rdata structures, requiring more memory. In BIND 10, we plan to use memory-mapped (mmap) and/or shared-memory zone data, too, but we designed the data structures taking it into account from the beginning. Specifically, we use an offset-based representation for any pointer values used in the data structures of zone data. Since it doesn’t require any additional field, we can support mmap or shared-memory data without increasing memory footprint.
Unify RRSIG and Covered Data
Next, we combined the rdataset for an RRSIG and the rdataset that the RRSIG covers. This is advantageous in several points: We can avoid having a redundant copy for the TTL, because the TTL of the RRSIG and the TTL of the covered RRset must match (identical) according to RFC4034. Also, we don’t have to have a separate field for the covered type in rdataset or figure out the covered type from the RRSIG RDATA. The former helps reduce memory footprint, and the latter will be advantageous in terms of response performance. We can also save the memory for other members of the rdataset structure by avoiding to have two structures. It at least includes a pointer to construct a link.
Furthermore, we noticed we can maintain the number of RDATAs of the covered RRset and its RRSIG in a memory efficient way as a result of combining these two. One key observation is that we should be able to assume that no RRset contains more than 6553 RRs in practice; this is because any RR in the wire format in a DNS message must at least contain 10 bytes of data (for RR type, class, TTL and RDATA length) and a valid DNS message must not be larger than 65535 bytes. So we only need 13 bits (2^13 – 1 = 8191) to represent any possible number of RRs of a single RRset. Also, in the vast majority cases there should be only a few RRSIGs for RRset (in many cases up to 2, and even in rare cases only a few more). So, if we use a 3-bit field to represent the number of RRSIGs up to 6 and reserve the value of 7 (all 1) to indicate the unlikely case of more than 6 RRSIGs, we can pack the entire information in a single 16-bit field (in most cases).
We adopted another optimization in the representation of RDATA. In the original BIND 9 implementation, each RDATA is accompanied with a 2-byte length field. However, this is often redundant, as many types of RDATA, most notably the A and AAAA RRs, have fixed lengths. In BIND 10, we maintain the length of such fixed size fields within the implementation, not in the data, thereby saving the space for the redundant length fields.
The New rdataset Structure
As a result of above, the (header part of the) rdataset structure is
pretty compact in BIND 10. Conceptually it looks like this:
RdataSetPtr next; // size of pointers (normally 4 or 8 bytes)
RRType type; // 2 bytes
uint16_t rdata_and_rrsig_count; // 2bytes
uint32_t ttl; // 4 bytes
where RdataSetPtr is an offset pointer, but its size is exactly the same as that of normal pointers. The ordering of the member variables are carefully chosen so no intermediate padding is necessary due to alignment requirements, whether it’s a 32-bit or 64-bit architecture. The entire structure requires only 16 bytes in the 64-bit architecture and 12 bytes in the 32-bit architecture.
In the BIND 9 version, the corresponding structure consumes 120 bytes and it doesn’t even contain the information of the number of RDATAs. Also, two of the 120-byte structure are needed to represent a signed RRset: one for the covered RRset and one for its RRSIG. In BIND 10, a single RdataSet structure is used both the covered RRset and the RRSIG.
The following diagram shows a comparison between BIND 9.9.2 and BIND 10 regarding the total size of data structures to store an A RR for a name “www.example.com.” (the actual run time memory footprint will be larger than this due to fragmentation).
The first two are the case where it’s not DNSSEC signed. The green box shows the size for a node in the tree of trees with label data (assuming it only stores the label “www”). The size reduction on this part mainly comes from the simplification: we removed caching only members, internal hash related members, and members related to the mmap support (which is “built-in” in the BIND 10 version). The light blue box shows the size for the fixed header part of rdataset. As discussed above, the BIND 10 version is much more compact than the original BIND 9 version. Finally, the red box shows the size of the actual RDATA. In the BIND 10 version, it’s 4 bytes, the exact size of the RDATA itself. In the BIND 9 version it needs 4 additional bytes, 2 for the number of RDATA (which is encoded in the header part in the BIND 10 version), and 2 for the size of the RDATA (which is fixed in the case of A RR, so can be skipped in the BIND 10 version).
The lower two are the case where the same A RR has an RRSIG whose signature size is 120 bytes. In this case, the size of RDATA is not so different, but the BIND 10 version doesn’t need to have a separate rdataset header for the RRSIG, and since the size of the header part in BIND 9 is pretty large (120 bytes), the BIND 10 version is still much more memory-efficient in the total size.
The next example shows the same set of data for an NS RRset containing 4 NS RRs. In this case, the RDATA part needs more memory in BIND 10. This is because domain name fields of RDATA in the BIND 10 version have some additional information to improve response performance. Still, thanks to the optimization for other parts, the BIND 10 version is much more memory-efficient in this case, too.
Below I’ll show some measurement results that might suggest BIND 10 is generally better than BIND 9. Although it’s not necessarily wrong in some points, I must note that the comparisons are not entirely fair. Right now, BIND 9 supports many more features than BIND 10, some of which should be reasons for the difference. So, in general, the compared results of BIND 9 should be interpreted as a benchmark point to see the maturity of BIND 10 in terms of production level quality, rather than for claiming which one is better.
I’ve measured run time memory footprint after loading several different types of large zones:
- An old snapshot of .net zone (before it was signed), containing about 8 million records. It largely consists of NS records for delegation and corresponding A and/or AAAA glue records.
- A signed zone using the old .net zone data (using a faked key; I don’t know the private key for the .net zone:-). This is far from the actually signed .net zone; there is no secure delegation, and it’s signed with NSEC. So, in addition to the original records, it consists of many NSEC RRs and their RRSIGs. It contains about 16 million records.
- An artificially generated large zone containing 5 million A RRs (each of them has a different owner name) in addition to SOA and 2 NS RRs.
- An artificially generated large signed zone containing 1 million A RRs (each of them has a different owner name) in addition to SOA and 2 NS RRs. It contains about 4 million RRs in total.
The following graph shows the memory footprint of BIND 9.9.2 named (built disabling threads) and BIND 10 b10-auth (git repository version at commit 4e98f9c8aec4fbe707bf4f5aa8de9e3ce4737f79) just after loading the zone. This is a 64-bit machine.
According to these results, BIND 10 requires about a half of memory compared to BIND 9 in all cases.
I also measured impact of the new in-memory implementation on response performance in terms of max possible queries the server can handle. I used a root server setup with some recent snapshot of the actual root zone data and real query sample to one of root server instances taken several years ago (so the sample is old, but should still be realistic to some extent). I used an original tool similar to BIND 9′s queryperf, and tested two specific cases: Disabling EDNS and enabling DNSSEC (the former also means disabling DNSSEC, and the latter also means enabling EDNS). For comparison I used BIND 9.9.2 named (disabling threads), and with or without enabling the “additional cache” (acache) feature, which is generally effective in such a delegation centric zone. BIND 10 runs only one instance of the b10-auth process.
The following graph shows the experimental results:
In this particular setup, BIND 10 works at least as good as BIND 9, and in some cases much better. In the case of enabling DNSSEC, more performance optimizations for BIND 10 are planned, so we believe this will become much better than the current result in near future.
I’d also note that without enabling acache, BIND 10 is notably faster than BIND 9 in both cases. Since the effectiveness of acache highly depends on the characteristics of the zone and query patterns (it’s generally advantageous for top level domain servers including root servers), in the general case comparison results would be quite likely to be closer to the cases without acache.
While the initial results look pretty promising, we still need to do some more work to make it really effective.
First, we plan to improve response performance further. In practice, we believe it’s more than sufficient if it can run faster than BIND 9 in the case of authoritative servers. But since there are some planned optimizations that should be relatively easy to implement and which we believe will make it substantially faster in some cases, we would like to make these changes.
Secondly, we need to support incremental (partial) updates to existing data, e.g., after updating a zone via dynamic updates or IXFR. There is no essential reason we cannot do this using the current data structure, but there can be some implementation difficulty or performance impact that we’ve not seen yet. It may result in changing the structure in a way increasing the memory footprint. We will need to measure the result when we support it.
Also, we want to support shared memory or mmap based in-memory storage not far from now. As briefly discussed above, the internal data structure is mostly ready for such architecture, so its implementation would not require a fundamental design change. In fact, we even have a workable prototype implementation as we reported in the BIND 10 developers’ list. And, it’s not just a nice additional feature; it’s also crucial for scalability with multi-core/CPU machines. BIND 10 adopts a multi-process architecture for the authoritative service, and in the current version each server process has its own copy of the entire data, which does not scale well if the data size is large. Besides, fully mappable storage is attractive in itself; it leads to possibility of achieving nearly zero start up time, as under some, not so uncommon, circumstances and conditions, the server process only has to call mmap(2) to load the entire data.