What happened?
Now that the Internet has become critical to our day-to-day lives, we really feel the impact when it goes down. While the origin of the Internet, the so-called "ARPANET", was built for resiliency and redundancy, there are hidden vulnerabilities that can undermine its stability. One key issue is that it is built on an element of trust: the routing tables that guide users to their end destinations are assumed to be accurate and are freely shared with other network elements. Think of them as the departures board at the airport: you count on its accuracy to reach your gate, your flight, and your ultimate destination. Imagine what would happen if that board were suddenly scrambled or failed to update. This is essentially what happened to the Internet a few weeks ago because of a vulnerability in one critical protocol, but with the right analytics and automation, these issues can be mitigated more quickly.
On Monday, June 24th, 2019, a series of events occurred in the Border Gateway Protocol (BGP) network. The issues began at 10:34:26 UTC and resulted in a service disruption lasting almost two hours. While multiple events magnified the problem leading to the outage, the root cause was distilled to a route leak caused by erroneous prefix advertisements (using the earlier analogy, it is as if bad flight details were shared). IETF RFC 7908 defines a BGP route leak as "the propagation of routing announcement(s) beyond their intended scope. That is, an announcement from an Autonomous System (AS) of a learned BGP route to another AS is in violation of the intended policies of the receiver, the sender, and/or one of the ASes along the preceding AS path." BGP is a "learn and distribute" protocol without robust safeguards to prevent the distribution of inaccurate information, so it is prone to quickly disseminating incorrect information. In practice, each Autonomous System advertises only the IP addresses it owns or is authorized to advertise. However, BGP itself does not have built-in safeguards to verify this. Operators have to configure filters to prevent erroneous advertisements from being accepted and then distributed to others.
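To make the filtering idea concrete, below is a minimal Python sketch, not a real router configuration, of the check a prefix filter performs: a route is accepted only if it falls inside the address space the peer is authorized to announce. The peer prefix list and announcements shown are illustrative examples.

```python
# A minimal sketch (not a real router configuration) of the check a prefix
# filter performs: accept a route only if it falls inside the address space
# the peer is authorized to announce. Prefixes here are illustrative.
import ipaddress

# Stand-in for a configured prefix-list: what this peer may announce to us
ALLOWED_FROM_PEER = [ipaddress.ip_network("198.51.100.0/22")]

def accept_announcement(prefix: str) -> bool:
    """Return True only if the announced prefix is within the authorized space."""
    announced = ipaddress.ip_network(prefix)
    return any(announced.subnet_of(allowed) for allowed in ALLOWED_FROM_PEER)

print(accept_announcement("198.51.100.0/24"))  # True: inside the authorized block
print(accept_announcement("104.20.0.0/21"))    # False: a leaked route would be dropped
```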
Why is this significant?
The root cause did not appear to be malicious but rather the result of operator error. The genesis of this route leak was a BGP optimizer. BGP optimizers have a feature that splits IP prefixes into smaller blocks; in this case, 104.20.0.0/20 was turned into 104.20.0.0/21 and 104.20.8.0/21. The purpose of breaking IP blocks into smaller parts is to provide a mechanism to steer traffic internally within a network; these split prefixes should never have been announced externally. A small transit provider, DQE, was identified as the source of the route leak: DQE incorrectly announced routes from its internal network to one of its customers. The leak was propagated through another ISP to still more networks, and the cascading propagation caused outages at yet other ISPs. Multiple ISP networks were suddenly flooded with Internet traffic they should not have received and were not prepared to handle. Not only did this overwhelm their networks, but traffic destined for many other networks and services was sent down the wrong path into a dead end. The route leak affected an estimated 2,400 networks globally, including many household names such as Amazon and Facebook.
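The mechanics of the leak can be reproduced with Python's standard ipaddress module: splitting the /20 into its two /21s shows why the leaked routes attracted traffic, since IP forwarding and BGP route selection prefer the longest matching prefix. The destination address below is an illustrative example.

```python
# Reproducing the mechanics with Python's standard ipaddress module: split the
# /20 into two /21s, then show that longest-prefix matching prefers the leaked
# more-specific route over the legitimate /20. Destination address is illustrative.
import ipaddress

original = ipaddress.ip_network("104.20.0.0/20")
more_specifics = list(original.subnets(prefixlen_diff=1))
print(more_specifics)  # [IPv4Network('104.20.0.0/21'), IPv4Network('104.20.8.0/21')]

destination = ipaddress.ip_address("104.20.3.7")
candidates = [original] + more_specifics
matches = [net for net in candidates if destination in net]
best = max(matches, key=lambda net: net.prefixlen)
print(best)  # 104.20.0.0/21 -- the more-specific (leaked) route wins
```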
The impact of BGP events can be difficult to quantify, but they are very expensive from a financial and customer experience standpoint. With this route leak, the impacted Tier 1/2 service providers and their experienced network operations teams took a considerable amount of time and resources to identify the root cause of the issue. The complexity of distilling the root cause resulted in an event that lasted just under two hours. Had these teams used automation to identify the route leak, assess the impact, and resolve the incorrect route announcements, the event may have lasted only minutes.
Beyond route leaks, network operators face many other causes of network performance degradation, outages, and misdirection of traffic. In addition to the erroneous prefix advertisements witnessed in this incident, other common causes include faulty ASN advertisements attempting to spoof networks and announcements from origin ASNs that are not authorized according to RPKI databases.
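As a rough illustration of how RPKI origin validation works, the sketch below compares an announcement's origin ASN and prefix length against a Route Origin Authorization (ROA); the ROA entry and ASNs are made-up stand-ins rather than real registry data.

```python
# Hedged illustration of RPKI-style origin validation: compare an announcement's
# origin ASN and prefix length against a Route Origin Authorization (ROA).
# The ROA entry and ASNs below are illustrative, not real registry data.
import ipaddress

# prefix -> (authorized origin ASN, maxLength) as published in ROAs
ROAS = {ipaddress.ip_network("104.20.0.0/20"): (64500, 20)}

def validate_origin(prefix: str, origin_asn: int) -> str:
    announced = ipaddress.ip_network(prefix)
    for roa_prefix, (asn, max_len) in ROAS.items():
        if announced.subnet_of(roa_prefix):
            if origin_asn == asn and announced.prefixlen <= max_len:
                return "valid"
            return "invalid"   # covered by a ROA, but wrong origin or too specific
    return "not-found"         # no ROA covers this prefix

print(validate_origin("104.20.0.0/20", 64500))  # valid
print(validate_origin("104.20.0.0/21", 64500))  # invalid: exceeds maxLength
print(validate_origin("104.20.0.0/20", 64512))  # invalid: wrong origin ASN
```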
Is this a growing trend?
BGP routing events that impact network operators are increasing in frequency. According to the Internet Society, in 2017 there were 14,000 routing outages, leaks, and hijacks that disrupted the Internet. The BGP Incidents Report notes that the causes of these events are as varied as the many BGP attributes that can be configured. One of the core challenges (along with benefits) of BGP is the flexibility it provides. The protocol's ability to adapt and evolve to business requirements over 25 years also brings with it a need for constant learning and subject-matter expertise. As with many protocols, even experienced network operators cannot foresee the impact their changes will have on their own devices and on those outside their control. Due to the complexities of BGP, measures to prevent malicious activity are either not implemented or overlooked. Given both malicious efforts and the complexity of BGP management, visibility and compliance are key requirements for a network owner to maintain a secure and functional BGP environment. Equally important are tools to address Mean-Time-To-Know (MTTK) and Mean-Time-To-Repair (MTTR).
What can a network operator do?
Why is it that large, well-funded network operators still struggle with these issues on a regular basis? In short, they lack the tools and visibility to determine what is happening to BGP routing inside and outside of their networks. While negative-impact BGP events cannot be completely prevented, an operator can reduce MTTK and MTTR from hours to minutes with tools such as the Cisco Crosswork Network Automation suite, as shown below.
Challenge: Which routes are causing issues?
A foundation of the Cisco Crosswork Network Automation suite is Cisco Crosswork Network Insights. Crosswork Network Insights provides a forensic view of all BGP routing data and gives network operators a quick global BGP "looking glass" across hundreds of source routers. Just as importantly, a user can look at individual BGP route update records over a three-month window. This reduces the time and complexity required to understand the scope and impact of a BGP security event. BGP security events take many different forms, and this visibility decreases the time needed to understand the problem. Crosswork Network Insights reduces MTTK; as an example, the SaaS service identified 2,646 separate BGP security events during the route leak event window, and from these events we observed that 20,297 individual BGP routes were impacted by several different BGP security conditions.
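Independent of any particular product, one simple detection heuristic is to compare newly observed announcements against a baseline routing table and flag unexpected more-specific prefixes. The sketch below uses made-up data and is only a simplified illustration, not a description of how Crosswork Network Insights works internally.

```python
# Simplified illustration of one detection heuristic, with made-up data; this is
# not a description of how Crosswork Network Insights is implemented. Compare
# newly observed announcements against a baseline table and flag unexpected
# more-specific prefixes appearing inside a baseline block.
import ipaddress

baseline = {ipaddress.ip_network("104.20.0.0/20")}            # expected routes
observed = [ipaddress.ip_network("104.20.0.0/21"),
            ipaddress.ip_network("104.20.8.0/21"),
            ipaddress.ip_network("198.51.100.0/22")]          # newly seen announcements

for route in observed:
    covering = [b for b in baseline if route != b and route.subnet_of(b)]
    if covering:
        print(f"possible leak: {route} is a more-specific of baseline {covering[0]}")
```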
Challenge: Which problems are real?
As a network operator, knowing about a BGP incident in a network you operate is a good start, but the goal is closed-loop automation so detected incidents can be remediated. Within the Cisco Crosswork portfolio, Cisco Crosswork Situation Manager is designed to help IT and network operations teams quickly understand the availability and performance of the systems for which they are responsible. Using algorithmic and machine-learning technologies, Crosswork Situation Manager ingests performance-affecting incidents from Crosswork Network Insights in real time while simultaneously enabling smart workflows across technological and organizational boundaries to resolve the issue.
Challenge: How can you codify your operational workflow?
In concert with Situation Manager, Cisco Crosswork Change Automation automates the process of remediating problems in the BGP network by allowing an operator to match an alarm passed up from Crosswork Network Insights and Crosswork Situation Manager to pre-defined remediation tasks. The result is a significant reduction in MTTR, along with savings from both a financial and a customer experience perspective.
What can you do next?
For more information on how to improve network operations and gain insight into what is happening in your network, contact your Cisco account team or authorized partner.