ASN Collisions and Human Error

There is nothing more sensational than the unexpected, and when the NANOG (North American Network Operators Group) community was recently informed that an ASN collision had occurred, a lot of people sat up and took notice. The event is also interesting because researching it takes us back to a time before ARIN and RIPE existed, which adds a historical twist.

One of the groups to take notice was Renesys, an “Internet Intelligence” company, which had one of the prime data sets for researching this particular problem. As part of their business they collect BGP data from many sources, and they already have many tools for analyzing it. After crunching the data, they concluded there were two more ASNs of interest. One of these ASNs was in use for the ISC node in Fiji, which is one of our F-Root local nodes. This added a new twist: the problem now seemed to affect one of the root server operators, which seemed to elevate it to a much higher level.

Renesys contacted ISC directly, which prompted an internal investigation. Initially it looked like a similar problem: based on the dates, it appeared that ISC had been issued an ASN that was already in use by another party. In order to report this to the RIR, we located the initial e-mail assigning the resource to ISC. This would provide the original ticket number and should help speed our query to the RIR. After this e-mail was forwarded to the team and several sets of eyes took a fresh look, we realized an important error had occurred.

ISC had been issued ASN 38568 by APNIC for our Fiji node. When the ASN was entered into internal databases, it was entered as ASN 35868: the 8 and 5 in the second and third positions were transposed. Once the data had been entered incorrectly, it spread to other internal systems.
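The kind of slip described here, two neighbouring digits swapped, is easy to test for mechanically. The following is a minimal sketch, not something ISC is known to have used; the function name `is_adjacent_transposition` is made up for illustration, and the two ASNs are the ones given in the text.

```python
def is_adjacent_transposition(a: int, b: int) -> bool:
    """Return True if b equals a with exactly one pair of adjacent digits swapped."""
    s, t = str(a), str(b)
    if len(s) != len(t) or s == t:
        return False
    # Positions at which the two digit strings differ.
    diff = [i for i, (x, y) in enumerate(zip(s, t)) if x != y]
    return (
        len(diff) == 2
        and diff[1] == diff[0] + 1       # the differing positions are neighbours
        and s[diff[0]] == t[diff[1]]     # and the digits are swapped,
        and s[diff[1]] == t[diff[0]]     # not merely different
    )


assigned = 38568  # ASN issued by APNIC, per the text
recorded = 35868  # ASN as it was entered into the internal database
print(is_adjacent_transposition(assigned, recorded))
```

A check like this only says that two numbers look like a typo of each other; it cannot say which of the two is the correct one. That still requires going back to the RIR's assignment record.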

Fast forward to a few months ago. ISC wanted to keep its routing registry objects better up to date and started a project to generate routing registry updates via script. These scripts generated objects from the internal data stores, which contained the transposed ASN entry. It is this routing registry object that Renesys found in the RIPE database. Note that the object has since been removed, as it was in error.

This allows us to answer some of the questions asked by Renesys in the blog entry.
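One possible safeguard for a generator of this kind is to compare each internal record against an independently maintained list of ASNs actually assigned to the organization before emitting anything. The sketch below illustrates that idea only; it is not ISC's script. The record layout, the `generate` function, the prefix `192.0.2.0/24`, the maintainer `EXAMPLE-MNT` and the source `EXAMPLE` are all hypothetical placeholders. The attribute names (`route`, `descr`, `origin`, `mnt-by`, `source`) are standard RPSL route object attributes.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass
class NodeRecord:
    """One row from a hypothetical internal data store."""
    name: str
    prefix: str
    asn: int


def render_route_object(rec: NodeRecord, maintainer: str, source: str) -> str:
    """Render an RPSL-style route object as text."""
    return "\n".join([
        f"route:   {rec.prefix}",
        f"descr:   {rec.name}",
        f"origin:  AS{rec.asn}",
        f"mnt-by:  {maintainer}",
        f"source:  {source}",
    ])


def generate(records: Iterable[NodeRecord], assigned_asns: Set[int],
             maintainer: str, source: str) -> Tuple[List[str], List[NodeRecord]]:
    """Emit objects only for records whose ASN is on the assigned list."""
    objects, rejected = [], []
    for rec in records:
        if rec.asn in assigned_asns:
            objects.append(render_route_object(rec, maintainer, source))
        else:
            rejected.append(rec)  # hold back for a human to review
    return objects, rejected


# The assigned list would come from the RIR's records, not from the same
# internal database that the generator reads.
assigned_asns = {38568}
records = [NodeRecord("Fiji node (placeholder prefix)", "192.0.2.0/24", 35868)]

objects, rejected = generate(records, assigned_asns, "EXAMPLE-MNT", "EXAMPLE")
print(len(objects), [r.asn for r in rejected])
```

The important design point is that the assigned list has a different origin from the data being checked. Validating the internal database against itself would have passed the transposed value without complaint.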

“Despite the fact that verification services are readily available, neither the RIRs, the companies who received the duplicated ASNs, nor their providers seems to have checked if the ASN was assigned before making and accepting the ASN assignment.”

Based on the timelines involved, ASN 35868 was assigned to “Logix3” several years before ISC asked APNIC for an ASN. As a result, there would have been no duplicate entry for Logix3 to find when they received that ASN from ARIN. ISC was later assigned 38568 by APNIC. The RIR properly assigned the ASN, and it’s entirely likely (although there are no direct records) that an ISC engineer looked up that ASN in the APNIC database and saw the proper entry. Renesys’s question seems predicated on the idea that a duplicate was assigned by the RIR, which did not happen in this particular instance.

Not asked, but perhaps an even better question, is: why didn’t either party notice a routing issue?

There are actually many cases on the Internet where duplicate ASNs are used on purpose. Networks may use a single ASN in multiple locations for a number of reasons, and the pitfalls are well documented. The primary problem is BGP’s loop detection: each ASN island throws away the routes from the other ASN islands, because its own ASN already appears in their AS path. In short, when ISC used Logix3’s ASN by mistake it created a situation where Logix3 couldn’t see the route originated by the Fiji node, and the Fiji node couldn’t see the routes originated by Logix3. Surely someone would notice?
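The loop-detection rule can be illustrated in a few lines. This is a simplified model rather than a BGP implementation: a router refuses any route whose AS path already contains its own ASN. The function name `accepts_route` is made up for illustration, and the upstream ASNs 64500, 64501 and 64502 are taken from the range reserved for documentation.

```python
from typing import Sequence


def accepts_route(local_asn: int, as_path: Sequence[int]) -> bool:
    """Simplified BGP loop detection: reject a path that contains our own ASN."""
    return local_asn not in as_path


SHARED_ASN = 35868   # used by two unrelated networks ("islands")
UPSTREAM_A = 64500   # hypothetical transit provider of the originating island
UPSTREAM_B = 64501   # hypothetical transit provider of the receiving island

# A route originated by one island, as it arrives at the other island.
# The most recent hop is first and the origin is last.
path_at_other_island = [UPSTREAM_B, UPSTREAM_A, SHARED_ASN]

# The other island sees its own ASN in the path and drops the route.
print(accepts_route(SHARED_ASN, path_at_other_island))

# Any third party with a different ASN accepts the same route.
print(accepts_route(64502, path_at_other_island))
```

This is why the damage stays confined to the two networks sharing the number: everyone else treats the path as perfectly ordinary.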

Well, it turns out probably not. The ISC node in Fiji is configured with a default route, which sends traffic to its upstream ISP no matter what, so it is able to reach all of Logix3. Logix3 may or may not be configured the same way; ISC has no way to know. However, the way we route F-Root prevents a problem. First, Fiji is one of more than 50 local instances of F-Root, so it’s extremely likely Logix3 would prefer one of our other instances. Second, ISC announces the F-Root prefix in two parts. Our local nodes, like Fiji, announce the more specific prefix. Even if this route was rejected, ISC also announces a covering aggregate, only from our global nodes. This route would have been passed on and accepted by Logix3 (assuming they receive full routes from their upstream). Thus, as far as ISC can tell, there is no situation where this mistake would have led to a loss of connectivity for anyone involved.
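The fallback from a rejected specific route to a covering aggregate is ordinary longest-prefix matching, which the sketch below models with Python's standard `ipaddress` module. The prefixes are documentation-style placeholders chosen for illustration (`192.0.2.0/24` as the specific, `192.0.2.0/23` as the aggregate) and are not the real F-Root prefixes, which this text does not state. The function name `best_route` is also hypothetical.

```python
import ipaddress
from typing import Iterable, Optional


def best_route(destination: str, table: Iterable[str]) -> Optional[str]:
    """Return the longest prefix in the table that covers the destination."""
    addr = ipaddress.ip_address(destination)
    networks = [ipaddress.ip_network(p) for p in table]
    matches = [n for n in networks if addr in n]
    if not matches:
        return None  # a real router would fall back to a default route, if any
    return str(max(matches, key=lambda n: n.prefixlen))


SPECIFIC = "192.0.2.0/24"   # placeholder for the route announced by local nodes
AGGREGATE = "192.0.2.0/23"  # placeholder for the covering route from global nodes
DEST = "192.0.2.1"          # placeholder destination inside both prefixes

# A network that accepted both routes prefers the more specific one.
print(best_route(DEST, [SPECIFIC, AGGREGATE]))

# A network that dropped the specific (for example due to loop detection)
# still reaches the destination through the aggregate.
print(best_route(DEST, [AGGREGATE]))
```

In this model the network that lost the specific route keeps a working path, just one that leads to a global node instead of the local one.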

ISC quickly removed the incorrect records from the RIPE routing database to help remove confusion. The process of renumbering the Fiji node was not quite as quick, but was completed after a couple of weeks. ISC considered shutting off the node immediately, but because we did not believe the situation was causing any issue to either party, we decided to leave it in place until an orderly transition could be arranged.

This was an embarrassing situation for ISC. We wanted to come clean with the full details to help everyone understand what happened here, so that proper corrective action can be taken going forward. We are glad that Renesys and other smart folks are looking at the data and trying to find these sorts of problems, and also glad that these sorts of mistakes appear to be few and far between.


Last modified: June 17, 2013 at 6:35 pm