You were only six (and you may not remember this …) when you intended to be an astronaut. Intent is a wonderful thing, but for an astronaut, it ain’t enough to survive outer space. Imagine yourself entering Venus’s atmosphere when you suddenly hear a beep. Your spacesuit’s heat shield – meant to protect you from the constantly rising temperature – is reaching a critical temperature! But then, over your comms device, you hear the comforting voice of your command center: “No problemo, hombre, that shield will hold. We’ve got your back.” Now, that’s assurance!
When we manage large, complex systems, we wish we had intent assurance – the promise that the system is doing what you intended it to do. But, as Roland Acra mentioned in his blog, Introducing the Cisco Network Assurance Engine, almost everything we do in our data center today is reactive. Change management (oops, my new security policy prevented me from reaching my database), day-to-day operations (the primary mail server is unreachable, let’s call the network operator … at 2 am), compliance (What? You connected a LAN cable directly to a WAN port bypassing the firewall? … true story), and the list goes on.
What is Assurance?
Assurance is the guarantee that your network is working (and will keep working) as intended. But intent encompasses pretty much everything, from managing networking underlays and overlays (routes, VLANs, subnets, BGP …) to tenant- and application-level policy and state (security, QoS, VM migrations, etc.).
Data center intent is also specified in heterogeneous devices such as load balancers, firewalls, and VM orchestrators, and can even extend to business intent such as HIPAA and PCI compliance. In fact, pretty much everything we do in a data center has an intent behind it.
Wouldn’t it be nice if we could somehow magically assure this intent?
Not so fast. Data centers are complex. We program tens of thousands of configurations. We have to ensure continuously that the interactions amongst these configurations, as well as millions of instances of dynamic and data path state, are correct.
Consider a customer who experienced intermittent connectivity issues with its Skype traffic. The root cause had nothing to do with the application, the devices, the security policies, the bandwidth, or hundreds of other likely suspects. It was an overlapping subnet, dynamically imported via BGP, that sent video traffic the wrong way.
Imagine your conversations being leaked to an unintended recipient. Security just took a siesta. Not cool.
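To see how easily that can happen, here is a minimal sketch of longest-prefix matching in Python. The prefixes and path names are hypothetical, but the mechanics are real: once a more-specific overlapping route is imported, it silently wins.

```python
# Minimal sketch: an overlapping, more-specific prefix learned via BGP
# silently redirects traffic. Prefixes and next-hops are hypothetical.
from ipaddress import ip_address, ip_network

routes = {
    ip_network("10.20.0.0/24"): "intended-path",
    ip_network("10.20.0.128/25"): "bgp-imported-path",  # overlaps the /24
}

def lookup(dst: str) -> str:
    """Longest-prefix match: the most specific matching route wins."""
    matches = [net for net in routes if ip_address(dst) in net]
    return routes[max(matches, key=lambda net: net.prefixlen)]

print(lookup("10.20.0.42"))   # intended-path
print(lookup("10.20.0.200"))  # bgp-imported-path: traffic goes the wrong way
```

Note that no device is misbehaving here; every router is doing exactly what it was told. That is what makes such issues so hard to catch reactively.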
Complexity Galore. The magnitude of this problem becomes evident when we consider not just the inputs (the thousands of configurations, dynamic states, etc.) to the data center but also consider the state space.
For example, let’s say we have ten thousand security policies. Traditionally, security policies are defined over fields in packet headers and are specified over 144 bits (give or take a few). That means that there are 2^144 possible packet headers for which we want to assure correct security intent. That is a large number! The largest number I knew when I was six years old was a billion. 2^144 is considerably more than that!
You cannot send that many test packets to check whether your policies are configured correctly. Talk about a bad day at work!
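For perspective, a quick back-of-the-envelope calculation. The probe rate below is a deliberately generous assumption:

```python
# Back-of-the-envelope: exhaustively testing a 144-bit header space is
# hopeless, even at an absurdly optimistic probe rate.
header_space = 2 ** 144                # distinct packet headers, ~2.2e43
probes_per_second = 10 ** 12           # assume a trillion test packets/sec
seconds_per_year = 60 * 60 * 24 * 365

years = header_space / (probes_per_second * seconds_per_year)
print(f"{years:.1e} years to test every header")  # ~7.1e+23 years
```

For comparison, the universe is roughly 1.4 × 10^10 years old.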
Human Errors. What if you were tasked to add a couple of security policies to the ten thousand previously added by an operator … who is no longer with your company? What if some combination of previously configured policies overlaps with your new ones … for one of the 2^144 potential packet headers? What if you were asked to add a hundred? You get the – whatchamacallit – configuration drift.
It gets worse when we manage and configure devices from multiple vendors. What if your load balancer, VM and network were out of sync? Is it a human error, a dynamic issue … or something else? How do we find the root cause of these issues? How do we find the proverbial needle in the haystack?
Math to the Rescue. Large numbers are formidable to many but not to mathematicians. Some even claim to tame infinity … but I digress. The idea is to build an abstract model and develop smart techniques to reason over extremely vast abstract state spaces.
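To give a flavor of what reasoning symbolically over a vast state space means, here is a toy illustration (not the engine’s actual algorithm): two ternary match patterns – each bit ‘0’, ‘1’, or ‘*’ for wildcard – overlap exactly when no bit position forces them apart. That is one pass over 144 bits instead of 2^144 enumerations.

```python
# Toy illustration of symbolic reasoning: two ternary match patterns
# ("0", "1", or "*" per bit) overlap iff no bit position pins them to
# different values. The check is O(bits), not O(2^bits) enumeration.
def overlaps(rule_a: str, rule_b: str) -> bool:
    return all(a == b or a == "*" or b == "*" for a, b in zip(rule_a, rule_b))

# Two hypothetical 8-bit policies (real headers would use ~144 bits):
permit_web = "0101****"   # permit: traffic class 0101, any host bits
deny_db    = "01011***"   # a later rule shadowing part of permit_web

print(overlaps(permit_web, deny_db))   # True: a conflict, found without
                                       # sending a single test packet
```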
Mathematicians call such approaches formal techniques. The idea is not new. Researchers at Stanford, Illinois, Cornell, Berkeley, UT Austin, the Naval Postgraduate School, Purdue, and Princeton, at industry research labs, and at early startup companies had already done seminal research and product development in this area. Industries have used formal techniques wherever the cost of failure was prohibitive. Chip manufacturers do formal verification before shipping billions of transistors. The space industry uses formal verification. (Remember that heat shield?) And our peers in the software industry have a bevy of formal tools (static checking, dynamic testing, code verification, memory profiling) to ensure application health.
We network operators could only sigh in envy. Until …
Model-based Assurance for the Network Industry
In late 2015, we started the Candid Alpha Team to take advantage of formal mathematical techniques for the benefit of network operators in the data center. We set ourselves a goal: deliver at a fast and furious pace, and offer a comprehensive, highly differentiated solution. The result is a solution that – for the first time – provides closed-loop assurance for an intent-based, software-defined data center network and lays the foundation for self-healing networks.
How did we achieve this rapid innovation and execution? We learned from decades of expertise—from Advanced Services, from technical operations teams, and from software, hardware and QA engineers who build and test Cisco’s software-defined Application Centric Infrastructure (ACI) data center network.
Having access to that expertise, we were able to model what mattered most to customers:
What are the 10% of reasons that cause 70% of network failures?
What are the good practices that every data center operator should follow?
Being Comprehensive. Building a system that can behaviorally model a large intent-based network, such as ACI, is extremely complex. To be comprehensive, we modelled all aspects of the network—the underlay, overlay, and the tenant networks. To build accurate models, we sourced information from the controller, the switch software, and even the switch hardware (when appropriate). We modelled protocols, security, VLANs, end-points, VMs, ports, routes, interactions with BGP, resources, tenants, and many more.
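As a rough illustration only (the names below are made up and bear no relation to the real schema), a behavioral model starts by capturing intent as structured objects that a reasoning engine can later traverse:

```python
# A deliberately tiny, hypothetical sketch of "modeling the network":
# intent captured as structured objects. Real models also cover the
# underlay, overlay, protocols, hardware state, and much more.
from dataclasses import dataclass, field

@dataclass
class Endpoint:
    name: str
    vlan: int
    ip: str

@dataclass
class Contract:
    consumer: str   # who may initiate connections
    provider: str   # who provides the service
    port: int

@dataclass
class Tenant:
    name: str
    endpoints: list = field(default_factory=list)
    contracts: list = field(default_factory=list)

clinic = Tenant("clinic")
clinic.endpoints += [Endpoint("app-server", 100, "10.0.1.10"),
                     Endpoint("patient-db", 200, "10.0.2.10")]
clinic.contracts.append(Contract("app-server", "patient-db", 5432))
```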
We are rapidly expanding the scope of our models in multiple ways. First, we are building models for how 3rd party devices such as F5 and Citrix load balancers, firewalls, and virtual machine orchestrators/hypervisors interact with the network. Second, we’ve built integrations with 3rd party operations toolchains—Splunk and ServiceNow. Third, we are building integrations with workload optimization tools such as Turbonomic. It’s been a massive undertaking, and personally a great learning experience. Perhaps you’ve heard the phrase before: you only truly know something when you can teach it. Or, in our case, model it.
We developed in-house formal verification tools and heavily optimized our models so that they can run on low-footprint servers. The first version of the product runs on just three VMs and already supports fabrics beyond 100 leaf switches.
The Promise of Assurance. The main idea behind formal mathematical techniques is to build a precise abstract representation of your network. With such a model, you can reason about every conceivable behavior of your network and be prepared for any eventuality.
The upshot is that we can make network operations predictive and proactive! For example, in a network with, say, fifty thousand flows, we can predict whether any conceivable new flow will work as intended or not. Ditto for new routes, new VMs that join the network, new policies added to the data center, and so on.
This is a radically different approach from the past. It’s one thing to confirm that “your healthcare application is currently connected to your patient database server,” but it’s a whole lot more difficult to prove that “no other server can ever access your patient database” based on the current configurations and dynamic state of your data center. A model allows you to construct and prove such complex queries and assure your intent.
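Here is a deliberately tiny sketch of what proving such an isolation property can look like. The permitted-flow relation below is hand-written and hypothetical; a real engine derives it from the configurations and dynamic state of the fabric. The key point is that the check is exhaustive over the model, which is what entitles you to say “never”, not just “not right now”.

```python
# Minimal sketch of proving an isolation property over a model:
# "no server other than app-server can ever reach patient-db".
# The permitted-flow relation is a hypothetical toy; a real engine
# derives it from configuration and dynamic state.
permitted = {
    ("app-server", "patient-db"),
    ("app-server", "mail"),
    ("mail", "app-server"),
}
servers = {"app-server", "patient-db", "mail", "web-frontend"}

def isolation_violations(protected: str, allowed: set) -> list:
    """Every modeled flow that contradicts the intended isolation."""
    return [(src, protected)
            for src in servers - allowed - {protected}
            if (src, protected) in permitted]

violations = isolation_violations("patient-db", allowed={"app-server"})
print("intent assured" if not violations else f"violations: {violations}")
```

Because the model is finite and the check covers it completely, a clean run is a proof over the model, not a sample.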
Over the next few blogs, we will explore in more detail the promise of this technology. We will describe the reasons why now is the time for model-based assurance and how assurance becomes even more powerful when backed by intent. We’ll talk about what assurance can do and what it can’t.
Hasta la vista. We’ll be back. We assure you.
Good to see we are finally reaching assurance maturity to prevent incidents from happening
Thanks Ashish, appreciate your interest. Assurance is a long and fruitful journey, and yes, preventing outages proactively … especially those due to human errors … is a large part of the promise of the technology and product.