Without doubt, there can be few tools that parallel the usefulness and diagnostic ability of traceroute and ping. These tools offer invaluable insight into the operation and performance of the network elements that we take for granted in today’s Internet. Whether it’s a case of trying to discover why your laptop wifi or home broadband has stopped working, or why the office laser printer won’t print, tech-savvy users and network engineers alike have long been acquainted with these tools as a means of pinpointing problems.
The same goes for network trouble-shooting in large-scale ISP networks. The principles are the same, even if the interface bandwidths are slightly different. Recently, though, I was offered a stark reminder of just how dependent today’s large-scale ISP networks are on link aggregation technology, and how a technology that makes simple promises can be complicated underneath.
By way of background, link aggregation technology allows network devices to seamlessly “bond together” multiple interfaces to make a larger-capacity interface. It provides a stop-gap extension between today’s ubiquitous 10 Gigabit Ethernet technology – a staple of most ISPs – and upcoming 100 Gigabit Ethernet technology, which to most is still lab-bound and expensive. Link aggregation is widely used by large-scale ISPs, who routinely bundle 16 or more 10 Gigabit Ethernet interfaces together to make 160 Gbps trunks, while patiently waiting for equipment vendors to make 100GE more practical and cost-effective.
To be useful though, link aggregation technology needs to be augmented with an effective load-balancing algorithm: there has to be a good way of dividing up the traffic demand amongst all of the available bearers. You can’t simply dish out packets to different interfaces in a round-robin fashion. Doing so creates packet reordering issues on individual user sessions which can hurt network session efficiency immensely.
Instead, the established way to address this problem is to create a “session-aware” hash of a packet and associate it with an individual bearer link that way. This ensures that conversations between Internet endpoints “stick” to an individual 10GE bearer in an ISP’s network, while still allowing the ISP, on aggregate, to carry traffic in excess of a single bearer link.
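To make the idea concrete, here is a minimal sketch in shell of how a session-aware hash might map a flow onto a member link. It is purely illustrative: real routers compute vendor-specific hashes in hardware, and the cksum-based function, the example addresses and the NUM_LINKS value are all assumptions made for the sake of the example.

#! /bin/sh
# Conceptual sketch only: derive a member-link index from a flow's 5-tuple.
# Real hardware uses vendor-specific hash functions; cksum is just a stand-in.
NUM_LINKS=4

pick_link() {
    # arguments: src-ip dst-ip protocol src-port dst-port
    hash=$(printf '%s %s %s %s %s' "$1" "$2" "$3" "$4" "$5" | cksum | cut -d' ' -f1)
    echo $(( hash % NUM_LINKS ))
}

# Every packet of this conversation hashes to the same member link,
# so the session "sticks" to one 10GE bearer.
pick_link 192.0.2.10 198.51.100.20 tcp 51234 80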
Most network equipment vendors that offer link aggregation support this idea of session-aware load-sharing, and at Interoute we make great use of Juniper’s MX-960 MPLS/IP router platform, which boasts a generally well-performing implementation of layer-3/layer-4-aware load-balancing. In addition to considering IP address endpoints in the hash decision, it can also include application port numbers. This removes the chance that “chatty” end-stations, or proxy servers with NAT or similar features masquerading many end-users, can hog bandwidth on a single bearer.
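On the MX platform this behaviour is driven by configuration rather than being automatic. Something along the following lines is typical, though this is only a hedged sketch: the exact stanzas vary with Junos release and line-card generation, and the policy name here is arbitrary. (Despite its name, the load-balance per-packet action results in per-flow distribution on these platforms.)

set forwarding-options hash-key family inet layer-3
set forwarding-options hash-key family inet layer-4
set policy-options policy-statement PFE-LOAD-BALANCE then load-balance per-packet
set routing-options forwarding-table export PFE-LOAD-BALANCE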
The algorithm is so good at achieving a fair balance across all bearer links within a bundle that in most cases the multiple bearer links can simply be considered to be one link of a much larger size. To illustrate, see the attached graphs, which show a group of bearer links within the same bundle; note the almost equal balance of traffic across all links.
In this particular incident, however, the load-sharing algorithm used on the MX-960 – and more specifically the software that underpinned it – let us down, and I was reminded of how important it is to understand what’s really going on behind the scenes, rather than accepting a conveniently abstracted view of the technology.
We’d performed a routine software upgrade of one of our Juniper MX-960 routers in Frankfurt during a maintenance window. The upgrade was designed to rid us of instrumentation management problems and address some perceived security vulnerabilities. It was one of several recent upgrades in a rather long and tedious quest to find a suitable, stable, secure software release that supports the latest 16-port cards that the MX boasts.
Several hours after the low-tide maintenance window closed – approaching daylight in the CET timezone – Interoute’s Prague NOC began to receive the first of several complaints from customers regarding network performance degradation. Network-savvy customers almost always include the outputs of ping and traceroute in any fault-finding evidence they produce, but in this case there was little such evidence to go on. They couldn’t quite pinpoint the fault, but they knew things weren’t working correctly. The event of closest correlation was the Frankfurt node software upgrade, and experience fosters a healthy scepticism of coincidence. Sure enough, when we were able to re-route customers around this device, their problems seemingly went away.
Now it’s always a difficult decision when one has to choose between action that will likely satisfy a customer’s need for a short-term fix and prolonging a situation in order to garner more evidence to help nail the issue long-term. We had Juniper engaged, but so far none of the instrumentation data we were able to glean presented a smoking gun as to exactly what was misbehaving. Likewise, we were uncomfortable leaving customers on an artificially re-routed path while a suspicious, and as yet not completely identified, problem remained on one of our Frankfurt nodes with no progress being made. We needed to study the problem happening live. There was no option but to try to find another similarly-configured customer on the Frankfurt node suffering the same problems who could tolerate them for long enough for us to gain more insight.
We didn’t have to look very far. As business hours dawned in the UK, the monitoring system associated with an internal VPN customer started to report spurious polls in the SNMP activity underpinning its management system, specifically when connected to Frankfurt. We scrutinised the symptoms, identified the endpoints affected and drew up a list of the network elements carrying their traffic. General ping testing was fine, but occasionally SNMP polls would fail and SSH sessions would fail to connect, which was extremely puzzling.
It was at this point that our attention focused on the different handling of the traffic types within the Juniper core, and we were reminded of our design decision to enable the features of the load-sharing algorithm that consider TCP and UDP port numbers in the load-sharing decisions. While this results in a very smooth distribution of traffic across bearer links within a bundle, in our case it meant that the network experience of successive sessions could vary, as they consumed different network links. For example, a user downloading a file via HTTP might find his download traversing the first bearer link in a bundle, but if he were to press Stop/Refresh, he’d see it move to a different bearer link in the bundle (as the sketch below illustrates). This happens because the source TCP port of his client PC changes between the two download attempts.
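Using the same toy cksum hash as the earlier sketch (illustrative only, and nothing like the router’s real algorithm), keeping the client and server addresses fixed but varying only the client’s ephemeral source port can be enough to move the flow to a different member link:

#! /bin/sh
# Same endpoints, two different client source ports: the flows may well
# land on different member links of a 4-link bundle.
for sport in 51234 51235; do
    h=$(printf '192.0.2.10 198.51.100.20 tcp %s 80' "$sport" | cksum | cut -d' ' -f1)
    echo "source port $sport -> member link $(( h % 4 ))"
done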
What seemed to be happening was that, depending upon how the traffic was hashed within the load-sharing algorithm, there was a chance that it could get lost or discarded in transmission through the Frankfurt node. Ironically, network failures like this, which are really a function of the link aggregation features, manage to completely defeat TCP’s usual reliable transfer mechanisms. If traffic for a session is hashed incorrectly, or hashed to a faulty bearer, that traffic will always suffer until something causes the hash to change. Most client/server protocols, while using a well-known server port, make use of a pseudo-random ephemeral port number for the client. As a result, the symptoms appear as some connections failing, while others are more successful.
We soon realised that, faced with such complexity, we needed a variation on the usual ping program. In our situation, the ICMP probes produced by ping would always hash the same way, being dependent only upon source and destination IP addresses. As a result, ICMP ping tests wouldn’t adequately exercise the multiple bearer links in an aggregated bundle.
We found our answer in one of the application protocols that had first alerted us to the situation. SNMP uses UDP datagrams to communicate between manager and agent. Without the complexity of TCP retransmissions it was much more predictable, and it was possible to create a small shell script that could send repeated SNMP queries to a target, originating each query from an independent UDP client port.
The results were damning and gave us a solid control check against any actions that we were performing on the network to determine if we were making things better or worse.
We persevered, re-engaged Juniper, and managed to pin the problem down to a specific card configuration on our Frankfurt node – we were spanning a link aggregation bundle across two different types of line card – and this, in conjunction with the software upgrade, appeared to be the most likely cause of our problem.
Under carefully controlled conditions, we were able to disable bearer links on one card and re-enable them on another, thereby removing the difference in card type. Our new ping tool was able to confirm the results of our endeavours instantly, and we could breathe easy again.
We were able to make note of the incompatibility and audit the rest of the network for repetition, making corrective plans as required. But we’d re-learned some important lessons during the exercise:
- The best network monitoring relies on generating and observing real network traffic, rather than measuring network performance with instrumentation alone.
- The most useful and productive technologies often abstract away detail and complexity in order to enable more sophisticated solutions. But we forget the fundamentals at our peril.
For posterity, our rather simple SNMP/UDP ping shell script wrapper, requiring a version of Net-SNMP, is reproduced here. No warranties!
#! /bin/sh
# Send repeated SNMP UDP datagrams to target host
# Report the response. Ensure SNMP client doesn't retry
# which would mask a failure.

OID=sysUpTime.0
COMMUNITY=public

[ $# -ge 1 ] && HOST=$1
[ $# -ge 2 ] && COMMUNITY=$2
[ $# -ge 3 ] && OID=$3

if [ $# -eq 0 ]; then
    echo "Usage: $0 host [community] [snmp-oid]"
    exit 0
fi

echo "PING $COMMUNITY@$HOST $OID"

while true; do
    snmpget -r 0 -c $COMMUNITY $HOST $OID >/dev/null 2>&1
    if [ $? -eq 0 ]; then
        printf \!
    else
        printf .
    fi
    sleep 1
done
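Saved as, say, snmp-ping.sh (the file name is just an example) and pointed at a device that answers SNMP, the script prints a ‘!’ for each successful poll and a ‘.’ for each failed one, so a faulty bearer in the path shows up as a sprinkling of dots. The host name and output pattern below are purely illustrative:

$ sh snmp-ping.sh frankfurt-rtr.example.net public
PING public@frankfurt-rtr.example.net sysUpTime.0
!!!!.!!!.!!!!!.!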