How Did the Internet Domain Survey Work?
The Original Survey
The Internet Domain Survey was taken twice a year beginning in 1981. The original survey methodology counted hosts by “walking” the domain name tree and performing zone transfers of domain data to discover hosts and further subdomains; it is described more fully in RFC 1296. In short, it looked at each domain, found the addresses that domain was using, and counted them. It was called a “domain survey” because it used domain data to find the addresses that were in use.
By July 1997 the Domain Survey was unable to count a significant portion of the hosts in the domain system, because many organizations had begun restricting download access to their domain data. The blocking of these downloads (or “zone transfers”, as they are called) had increased to the point that for the July 1997 survey we could download address listings for only 75% of the domains we discovered. We decided to try a new survey technique before the old one became useless.
The Next-generation Survey
In January 1998, we ran the first “new” Internet Domain Survey. Its methodology is the reverse of the original: instead of using domain data to find addresses, it counts the number of IP addresses that have been assigned a name. The distinction is subtle, but it means the new survey counts something different from the old survey. This difference made it problematic to compare numbers from before and after the change, so we ran the two in parallel for a while.
How It Works
The new survey algorithm works by querying DNS for the name assigned to every possible IP address. However, if we had to send a query for each of the potential 4.3 billion (2^32) IP addresses that can exist, it would take much too long. Instead, we start with a list of all network numbers that have been delegated within the IN-ADDR.ARPA domain. The IN-ADDR.ARPA domain is a special part of the domain name space used to convert IP addresses into names.
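To make the IN-ADDR.ARPA mapping concrete, here is a minimal Python sketch of how an IP address turns into the name that a PTR query asks about. The function name reverse_name and the example address are illustrative only and are not part of the survey software.

```python
import ipaddress


def reverse_name(ip: str) -> str:
    """Return the IN-ADDR.ARPA name whose PTR record holds the host name for ip.

    The octets are written in reverse order and the suffix in-addr.arpa is
    appended, which is what lets address blocks be delegated like domains.
    """
    return ipaddress.ip_address(ip).reverse_pointer


if __name__ == "__main__":
    # 192.0.2.25 is a documentation address used purely as an example.
    print(reverse_name("192.0.2.25"))  # 25.2.0.192.in-addr.arpa
```

In this example the enclosing 3-octet delegation would be 2.0.192.in-addr.arpa, and the leading label (25) is the host part that the survey varies from 1 to 254.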
For each IN-ADDR.ARPA network number delegation, we query for further sub-delegations at each network octet boundary below that point. This process takes a few days; when it ends we have a list of all 3-octet network number delegations that exist and the names of the authoritative domain servers that handle those queries. This process reduces the number of queries we need to do from 4.3 billion to the number of possible hosts per delegation (254) times the number of delegations found. In the January 1998 survey, there were 879,212 delegations, or just 223,319,848 possible hosts.
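The arithmetic behind that reduction can be checked in a few lines of Python. The constant and function names below are ours; the figures are the ones quoted above.

```python
ADDRESS_SPACE = 2 ** 32          # every possible IPv4 address
HOSTS_PER_DELEGATION = 254       # host octets 1 through 254
DELEGATIONS_JAN_1998 = 879_212   # 3-octet delegations found in January 1998


def probes_needed(delegations: int, hosts_per_delegation: int = HOSTS_PER_DELEGATION) -> int:
    """Number of PTR queries needed to cover every host in every delegation."""
    return delegations * hosts_per_delegation


if __name__ == "__main__":
    total = probes_needed(DELEGATIONS_JAN_1998)
    print(total)                             # 223319848
    print(f"{total / ADDRESS_SPACE:.1%}")    # 5.2%
```

In other words, the delegation walk cut the query load to roughly one twentieth of the full address space.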
With the list of 3-octet delegations in hand, the next phase of the survey sends an ordinary UDP-based PTR query for each possible host address from 1 to 254 in each delegation. To avoid flooding any particular server, network, or router with packets, the query order is pseudo-randomized to spread the queries evenly across the Internet. For example, a domain server that handles a single 3-octet IN-ADDR.ARPA delegation would see only one or two queries per hour. Depending on the time of day, we transmit between 600 and 1200 queries per second. The queries are streamed out asynchronously, and we handle replies as they return. This phase takes about one day per 25 million probes; the January 1998 probes took 8 days.
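One simple way to picture the pseudo-randomized ordering is a seeded shuffle over every (delegation, host) pair. This is only a sketch: the text does not say which randomization scheme the survey uses, and the function name probe_order, the seed parameter, and the prefixes are assumptions for illustration. Holding the full probe list in memory, as done here, is also a simplification that would be costly at the scale of the real survey.

```python
import random


def probe_order(prefixes, seed=0):
    """Return every address prefix.1 .. prefix.254 in a shuffled order.

    prefixes: 3-octet network numbers such as "192.0.2" (hypothetical input).
    seed:     fixes the pseudo-random order so that a run can be repeated.
    """
    probes = [f"{prefix}.{host}" for prefix in prefixes for host in range(1, 255)]
    random.Random(seed).shuffle(probes)
    return probes


if __name__ == "__main__":
    order = probe_order(["192.0.2", "198.51.100", "203.0.113"], seed=1998)
    # Consecutive probes now tend to land in different delegations.
    print(order[:5])
```

Because the probes for any one delegation are scattered through the whole run, a server responsible for a single delegation sees its 254 queries spread over days rather than arriving in a burst.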
Because the old and current surveys differ, their host counts cannot be compared directly. However, we have tried to adjust the old domain survey host counts to allow some comparison. We assumed that if the old survey missed a certain percentage of domains, its final host count would be approximately that same percentage lower than the actual value. So if we took the old host counts and raised them by the percentage of domains we couldn’t survey, we could arrive at an “adjusted host count” that gives us something to compare with the new survey.
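The adjustment can be written out as a short calculation. The text gives no explicit formula, so the sketch below is one reading of the stated assumption: if the observed count is lower than the true count by the missed fraction, then the true count is the observed count divided by one minus that fraction. The function name and the host count in the example are hypothetical; only the 25% missed fraction (75% of domains downloadable in July 1997) comes from the text.

```python
def adjusted_host_count(observed_hosts: float, missed_fraction: float) -> float:
    """Scale an old-survey host count up to compensate for unsurveyed domains.

    Assumes observed = actual * (1 - missed_fraction), which is our reading
    of the adjustment described in the text, not a published formula.
    """
    if not 0 <= missed_fraction < 1:
        raise ValueError("missed_fraction must be in the range [0, 1)")
    return observed_hosts / (1 - missed_fraction)


if __name__ == "__main__":
    # Hypothetical observed count; 0.25 corresponds to the July 1997 survey.
    print(adjusted_host_count(750_000, 0.25))  # 1000000.0
```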
With the new survey we now publish five figures per top-level domain (on our distribution by TLD charts). The first three are the total number of hosts found (which equals the number of PTR records found), the number of duplicate host names found (which usually indicates a host with many addresses), and the final host count, which is the total minus the duplicates.
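As a small illustration of how those three figures relate, the following Python sketch derives them from a list of PTR results. It assumes that a “duplicate” means every repeat occurrence of a name after the first and that names are compared case-insensitively; the function name host_counts and the sample names are made up for the example.

```python
def host_counts(ptr_names):
    """Return (total, duplicates, final) for a list of PTR host names.

    total:      number of PTR records found
    duplicates: repeat occurrences of a name already seen (our assumption)
    final:      total minus duplicates, i.e. the number of distinct names
    """
    normalized = [name.lower().rstrip(".") for name in ptr_names]
    total = len(normalized)
    duplicates = total - len(set(normalized))
    return total, duplicates, total - duplicates


if __name__ == "__main__":
    sample = ["gw.example.com", "gw.example.com", "mail.example.com"]
    print(host_counts(sample))  # (3, 1, 2)
```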
We also publish two new numbers: a count of hosts under the 2nd- and 3rd-level domain names for each TLD. These counts have different meanings depending on how the particular TLD is organized. For example, for the .COM domain, the number of 2nd-level names equals the number of hosts in organizations using names registered under .COM, and the number of 3rd-level names is potentially meaningless. However, some TLDs, like .UK and .AU, have a few fixed subdomains at the 2nd-level (like .CO.UK), so the 3rd-level count shows the number of hosts within the organizations at the 3rd level.
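The 2nd- and 3rd-level grouping can be sketched as taking the last two or three labels of each host name and counting hosts per group. The helper names below are hypothetical, and the sample host names are invented to show why the 3rd level is informative under .UK but close to meaningless under .COM.

```python
from collections import Counter


def domain_at_level(hostname, level):
    """Return the last `level` labels of hostname, or None if it has fewer."""
    labels = hostname.lower().rstrip(".").split(".")
    if len(labels) < level:
        return None
    return ".".join(labels[-level:])


def hosts_per_domain(hostnames, level):
    """Count hosts under each domain name at the given level."""
    domains = (domain_at_level(name, level) for name in hostnames)
    return Counter(d for d in domains if d is not None)


if __name__ == "__main__":
    hosts = ["www.example.com", "mail.example.com", "www.example.co.uk", "ftp.example.co.uk"]
    print(hosts_per_domain(hosts, 2))  # example.com: 2, co.uk: 2
    print(hosts_per_domain(hosts, 3))  # example.co.uk: 2, plus one entry per .com host
```

At level 2 the .UK hosts all collapse into the fixed subdomain co.uk, while at level 3 they group by organization; for .COM the level-3 names are just the individual hosts.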
Because the current survey technique uses ordinary DNS queries, and because these types of queries are used by many standard Internet applications, it is rare to find them blocked. This allows us to gather all the data we need without the blocking problems the old survey had. It also demonstrates that organizations blocking zone transfers in order to hide their host data have a false sense of security.
We decided not to verify the PTR entries we collected by looking up each returned name and checking that its address matched the PTR record. One reason is that this process would take far longer than the PTR lookups themselves. Another is that many PTR entries are wrong even though the host actually exists: we found cases where an IP address was pingable and had a PTR entry, but a lookup on the hostname did not return an address.
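For readers curious what such a check would involve, here is a minimal Python sketch of the forward-confirmation step the survey chose to skip. The function name forward_confirms and its resolve parameter are our own; by default it falls back to the standard library call socket.gethostbyname_ex, and a stand-in resolver can be passed in for testing without network access.

```python
import socket


def forward_confirms(ip, ptr_name, resolve=None):
    """Return True if looking up ptr_name yields ip among its addresses.

    resolve: optional callable mapping a host name to a list of IPv4
             addresses; defaults to a real DNS lookup via the socket module.
    """
    if resolve is None:
        resolve = lambda name: socket.gethostbyname_ex(name)[2]
    try:
        addresses = resolve(ptr_name)
    except OSError:
        # The name did not resolve at all, as in the cases described above.
        return False
    return ip in addresses


if __name__ == "__main__":
    # Hypothetical forward data standing in for DNS.
    forward = {"gw.example.com": ["192.0.2.1"]}

    def fake_resolve(name):
        if name not in forward:
            raise OSError("name not found")
        return forward[name]

    print(forward_confirms("192.0.2.1", "gw.example.com", fake_resolve))   # True
    print(forward_confirms("192.0.2.9", "old.example.com", fake_resolve))  # False
```

Each confirmation is a second lookup on top of the PTR query, which is part of why the extra pass would have lengthened the survey.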
In our distribution by TLD charts, we show an entry called “ARPA” and one called “UNKNOWN”. The count for ARPA shows you the number of administrators that tried to set up a PTR entry for a host but left off the trailing dot in their zone files, so the zone’s own IN-ADDR.ARPA name was appended to the host name. These are hosts that probably exist, but have an invalid host name. The UNKNOWN count shows you the number of PTR entries that did not have any valid TLD name. These are sometimes typos, and other times entries for unused addresses (for example, a domain administrator might put in the hostname “unassigned” for any unused address).
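The bucketing described here can be sketched as a check on the last label of each PTR target. In a zone file, a name written without a trailing dot is treated as relative and has the zone name appended, which is how a host name ends up under ARPA. The KNOWN_TLDS set below is a tiny illustrative subset, not the list the survey used, and the function name tld_bucket is hypothetical.

```python
# Illustrative subset only; a real run would need the full list of valid TLDs.
KNOWN_TLDS = {"com", "net", "org", "edu", "gov", "mil", "uk", "au"}


def tld_bucket(ptr_target):
    """Return the chart bucket for a PTR target: a TLD, "ARPA", or "UNKNOWN"."""
    last_label = ptr_target.lower().rstrip(".").rsplit(".", 1)[-1]
    if last_label == "arpa":
        return "ARPA"       # trailing dot was omitted in the zone file
    if last_label in KNOWN_TLDS:
        return last_label.upper()
    return "UNKNOWN"        # typo or placeholder such as "unassigned"


if __name__ == "__main__":
    print(tld_bucket("www.example.com."))                        # COM
    print(tld_bucket("www.example.com.2.0.192.in-addr.arpa."))   # ARPA
    print(tld_bucket("unassigned."))                             # UNKNOWN
```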
Note that the new survey algorithm has the same potential problem as the old: just because a hostname is assigned an IP address, or an IP address is assigned a hostname, does not mean the host actually exists. To find out how many hosts actually exist at a given time, we ping a 1% sample of all the hosts found and apply the result to the total host count to obtain an estimate of the total number of pingable hosts. There are other potential survey problems, many of which are discussed in RFC 1296.
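The sampling estimate works by scaling the response rate of the sample up to the full host count. The Python sketch below assumes a uniform random sample and takes the ping test as a caller-supplied function, since the text does not describe how the sample is drawn or how pings are sent; estimate_pingable and its parameters are hypothetical names.

```python
import random


def estimate_pingable(hosts, is_pingable, sample_fraction=0.01, seed=0):
    """Estimate how many hosts respond, based on pinging a random sample.

    hosts:           list of addresses found by the survey
    is_pingable:     callable returning True if a host answers (assumed)
    sample_fraction: share of hosts to test; 0.01 matches the 1% in the text
    """
    if not hosts:
        return 0
    sample_size = max(1, round(len(hosts) * sample_fraction))
    sample = random.Random(seed).sample(hosts, sample_size)
    responding = sum(1 for host in sample if is_pingable(host))
    # Apply the sample's response rate to the total host count.
    return round(len(hosts) * responding / sample_size)


if __name__ == "__main__":
    found = [f"192.0.2.{i}" for i in range(1, 201)]
    # Stand-in for a real ping: pretend only even host octets answer.
    answers = lambda host: int(host.rsplit(".", 1)[1]) % 2 == 0
    print(estimate_pingable(found, answers))
```

With a 1% sample the estimate carries sampling error, so it is best read as an approximate figure rather than an exact count.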
While comparing host counts per country code between the current survey and the final “old” survey, we found that a very small number of countries lost a significant number of hosts. We have not analyzed the data to find out exactly why this occurred, but it may be due to a number of reasons. We may just have had very bad network connectivity or packet loss to certain foreign countries that interfered with the data collection process. Another possibility is that in certain places it is not common for providers to place entries in the IN-ADDR.ARPA tables.
Another item some may notice is that our count of hostnames (or firstnames, as we call them) shows some interesting changes. For example, the number of hosts named “www” dropped between the old survey and the new survey. The reason is that in the old firstname count, if a host had two names, for example “example.com” and “www.example.com”, that were both assigned the same IP address, the names “example” and “www” would each be counted as a firstname for the same host. In the current survey, a PTR record can only return a single official hostname for a particular IP address. In the example above, the new survey would count either “example” or “www”, depending on which name the administrator set up to be the official name. Since the “www” count dropped between surveys, it appears that the “www” prefix is used heavily as an “alias” for official host names.
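The effect on the firstname count can be reproduced with a few lines of Python. The function name firstname_counts is ours, and the two input lists are invented to mirror the example above: the old survey sees both names for the host, while the new survey sees only the one name returned by the PTR record.

```python
from collections import Counter


def firstname_counts(hostnames):
    """Count the first (leftmost) label of each host name."""
    return Counter(name.lower().split(".", 1)[0] for name in hostnames)


if __name__ == "__main__":
    # Old survey: both names assigned to the same address are counted.
    old_view = ["example.com", "www.example.com"]
    # New survey: only the official name from the PTR record is counted.
    new_view = ["example.com"]

    print(firstname_counts(old_view))  # example: 1, www: 1
    print(firstname_counts(new_view))  # example: 1
```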