Last weekend (a bank holiday weekend in the UK) saw a rather significant BGP-related disruption on the Internet. Fortunately it didn’t affect the mainstream router vendors, but it did cause service interruptions for anyone dependent on certain versions of the Quagga routing protocol suite (an open-source collection of routing protocol implementations with a configuration management interface that closely resembles that of mainstream Cisco routers). In a lot of cases, this was restricted to informational route server platforms that provide looking glass capabilities into backbone networks, but several alternative vendors also produce router appliances based upon the Quagga code base.
I spoke at length with the Interoute on-call engineer for the weekend after several customers reported that their BGP routers had crashed for some unknown reason. He’d investigated the situation and, after discussion, it became apparent that other Internet users were reporting similar problems and that the routers involved were all based upon the Quagga routing protocol suite. The problem was linked to a software defect that manifests itself upon exposure to AS numbers of more than 5 decimal digits, i.e. above 99999.
Once the nature of this defect was understood, we were able to identify the specific BGP updates that were causing the customers’ routers to crash and filter them from further advertisement so that connectivity was restored to those affected customers.
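To illustrate the idea only: the filtering amounts to dropping any update whose AS path contains an AS number above 99999. The sketch below is not the router policy we actually applied; the function names, the dictionary layout and the sample prefixes and ASNs are all made up for the example.

```python
# Hypothetical sketch: drop BGP updates whose AS path contains an ASN
# with more than 5 decimal digits (i.e. above 99999).
LONGEST_5_DIGIT_ASN = 99999


def has_long_asn(as_path, limit=LONGEST_5_DIGIT_ASN):
    """Return True if any ASN in the path is larger than `limit`."""
    return any(asn > limit for asn in as_path)


def filter_updates(updates, limit=LONGEST_5_DIGIT_ASN):
    """Keep only the updates that unpatched routers could safely receive."""
    return [u for u in updates if not has_long_asn(u["as_path"], limit)]


# Made-up sample data: documentation prefixes and illustrative ASNs.
updates = [
    {"prefix": "192.0.2.0/24", "as_path": [64496, 64497]},
    {"prefix": "198.51.100.0/24", "as_path": [64496, 4200000001]},
]

kept = filter_updates(updates)
print([u["prefix"] for u in kept])
```

Only the first sample update survives the filter, because the second one carries a 10-digit ASN in its path.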
Further examination then revealed that the Quagga BGP daemon seemed to be trying to render an AS number into a string buffer dimensioned for only 5 characters. While this would have been perfectly sufficient for today’s ASNs in the range 0-65535, clearly larger-numbered ASNs could not be handled this way.
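The arithmetic behind that buffer size is easy to check. The sketch below does not reproduce any Quagga code; it simply counts the decimal digits needed for the largest 2-octet and 4-octet AS numbers, using helper names invented for the example.

```python
# Hypothetical sketch: how many decimal digits does an ASN need?
MAX_2_OCTET_ASN = 2**16 - 1   # 65535
MAX_4_OCTET_ASN = 2**32 - 1   # 4294967295


def decimal_digits(asn):
    """Number of characters needed to print `asn` in plain decimal."""
    return len(str(asn))


def fits_in_field(asn, width=5):
    """True if `asn` can be rendered in a field of `width` characters."""
    return decimal_digits(asn) <= width


print(decimal_digits(MAX_2_OCTET_ASN))   # 5
print(decimal_digits(MAX_4_OCTET_ASN))   # 10
print(fits_in_field(100000))             # False
```

A field sized for 5 characters therefore covers every 2-octet ASN, while a 4-octet ASN written in plain decimal can need up to 10 characters (plus a terminator in a C string).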
Larger-numbered ASNs require support for the recent IETF RFC 4893 proposed standard, which extends the BGP AS number space from 2 octets to 4 octets, in order to be represented in standard BGP attributes such as the AS Path. Quagga claims to fully implement this, but it seems that some sections of the code did not fully consider the ramifications of dealing with 4-octet ASNs.
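As a rough picture of what the wider number space means on the wire, the sketch below packs a list of ASNs as 4-octet and as 2-octet values. It is heavily simplified: it ignores the AS Path segment headers and the separate attribute RFC 4893 uses to carry the true path past older routers, and the function names are invented. The one protocol detail it relies on is AS_TRANS (23456), the reserved 2-octet placeholder that RFC 4893 defines to stand in for ASNs that cannot be expressed in 2 octets.

```python
import struct

AS_TRANS = 23456  # reserved 2-octet placeholder defined by RFC 4893


def encode_asns_4octet(as_path):
    """Pack each ASN as a 4-octet unsigned integer in network byte order."""
    return b"".join(struct.pack("!I", asn) for asn in as_path)


def encode_asns_2octet(as_path):
    """Pack each ASN as 2 octets, substituting AS_TRANS for ASNs that
    do not fit (simplified; segment headers are omitted)."""
    return b"".join(
        struct.pack("!H", asn if asn <= 0xFFFF else AS_TRANS)
        for asn in as_path
    )


path = [64496, 4200000001]  # made-up ASNs for illustration
print(len(encode_asns_4octet(path)))                    # 8
print(struct.unpack("!HH", encode_asns_2octet(path)))   # (64496, 23456)
```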
Rather frustratingly, the Quagga development team had already identified and corrected the problem in February, when it was first reported, but vendors offering router appliances based on the software suite had not yet had a chance to redistribute the updated software.
Since then, the problem had become significant to production routers because a new network using a freshly assigned ASN above 100000 had attached to the Internet, causing unpatched Quagga routing daemons around the world to crash every time they encountered an AS path containing the longer AS number!
The full irony of the situation emerged slightly later on the NANOG mailing list, when it turned out that the new network using the problematic ASN was actually a test network designed to demonstrate the production-readiness of 4-octet AS numbers to service providers!
So in summary, it seems that this was another small but painful step on the way to getting what is a complicated, but essential, upgrade to the global BGP routing system accepted for mainstream use by service providers and customers alike.
Further information:
Patched version of the Quagga routing protocol suite
Geoff Huston’s insightful analysis of AS number resource consumption from 2005