OpenAI runs some of the most demanding AI infrastructure on the planet, and when it came time to pick the supporting network security technology, they chose the option Cisco acquired in 2024. That technology is Isovalent, and the open-source core it’s built on – Cilium, running on eBPF – is the same foundation every major hyperscaler already trusts inside their managed Kubernetes offerings.
So when OpenAI’s case study went live earlier this year, I read it as the kind of decision that tells you a lot about where AI infrastructure is heading. Naturally, I wanted to dig into what they were actually buying, and why.
“The people running the most demanding AI infrastructure on the planet selected Cisco technology to help secure their critical infrastructure.”
The AI conversation no one is having
Most people think about models, but far fewer people think about the plumbing beneath them. The model is the visible part, but the network carries the weight when the visible part scales into thousands of GPUs and millions of requests per second.
AI clusters are unforgiving infrastructure. Containers, tool calls and MCP servers spin up and down by the second on shared hardware that’s being run flat-out to eke out every cent of utilization and efficiency as they prepare for IPO. In that environment, the failure modes turn existential – fast. A compromised container in one corner of the cluster has to stay there. A tenant’s workload should not be able to interfere with the workload next door, let alone quietly observe it. And when something breaks, the on-call engineer needs to see exactly who was talking to what, immediately.
Traditional networking, the kind built on IP addresses and hand-written firewall rules, was never designed for any of this. In a Kubernetes environment where pods come and go by the second, an IP-based rule is obsolete the moment it’s written.
“An IP-based rule is obsolete the moment it’s written.”
What Cilium changes
Cilium replaces address-based networking with identity-based networking. Rather than ‘this IP can talk to that IP’, the rule becomes ‘this workload, with this identity derived from its labels, can reach that workload.’ When pods move or restart, the policy follows them, so nobody is left chasing changing IPs around a config file.
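To make the difference concrete, here is a minimal sketch in Python. It is a toy model, not Cilium’s API or data structures: the `Pod` class, both rule functions, the labels and the IP addresses are all hypothetical, and it assumes a workload’s identity is simply its set of labels. It shows an IP rule going stale when a pod is rescheduled while the label-based rule keeps working.

```python
from dataclasses import dataclass


# Hypothetical toy model for illustration only; not Cilium's API.
@dataclass(frozen=True)
class Pod:
    name: str
    ip: str
    labels: frozenset  # e.g. frozenset({"app=inference"})


def ip_rule_allows(allowed_ip_pairs, src, dst):
    """Address-based check: only the exact (source IP, destination IP) pair matches."""
    return (src.ip, dst.ip) in allowed_ip_pairs


def identity_rule_allows(allowed_label_pairs, src, dst):
    """Identity-based check: a rule matches when each pod carries the rule's labels."""
    return any(
        src_selector <= src.labels and dst_selector <= dst.labels
        for src_selector, dst_selector in allowed_label_pairs
    )


gateway = Pod("gateway-1", "10.0.0.5", frozenset({"app=gateway"}))
inference = Pod("inference-1", "10.0.0.9", frozenset({"app=inference"}))

ip_rules = {("10.0.0.5", "10.0.0.9")}
label_rules = [(frozenset({"app=gateway"}), frozenset({"app=inference"}))]

# The inference pod is rescheduled: same labels, new name and new IP.
inference_restarted = Pod("inference-2", "10.0.3.41", frozenset({"app=inference"}))

print(ip_rule_allows(ip_rules, gateway, inference_restarted))           # IP rule is now stale
print(identity_rule_allows(label_rules, gateway, inference_restarted))  # label rule still applies
```

The label rule never mentions an address, so there is nothing to update when the pod comes back somewhere else.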
Cilium runs inside the Linux kernel via eBPF. That sounds deeply technical, and it is, but the practical effect is the part worth understanding. Policy and observability live so close to the metal that they add little overhead to workloads, and they can’t be silently bypassed by anything running above them.
For an AI cluster, that translates into two things you can point at directly –
Containment. When an attacker gets into a container, whether through a vulnerable dependency, a compromised tool or skill, or even a prompt injection that escapes a model sandbox, they hit a wall the moment they try to move sideways. The cleverness of the attack doesn’t matter, because the workload simply isn’t authorized to reach what the attacker was aiming for, and the enforcement happens in the kernel where there’s no realistic way to bypass it.
Forensics. When something does go wrong, the engineers can pull up a real-time view of every flow in the cluster and trace exactly what spoke to what. This means no four-hour root-cause meeting, and no guessing.
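Both properties can be pictured with a small Python sketch. Again, this is a toy under stated assumptions, not the interface of Cilium or of any real flow-observability tool: the `Flow` record, the verdict strings, the helper functions and the workload labels are all hypothetical. It assumes every connection attempt is recorded with the identities involved and whether policy forwarded or dropped it.

```python
from dataclasses import dataclass


# Hypothetical flow record for illustration only; not a real tool's schema.
@dataclass(frozen=True)
class Flow:
    timestamp: float
    source: str       # identity of the sending workload
    destination: str  # identity of the receiving workload
    verdict: str      # "FORWARDED" or "DROPPED"


def peers_of(flows, workload):
    """Every workload that exchanged, or tried to exchange, traffic with `workload`."""
    peers = set()
    for flow in flows:
        if flow.source == workload:
            peers.add(flow.destination)
        elif flow.destination == workload:
            peers.add(flow.source)
    return sorted(peers)


def denied_attempts(flows, workload):
    """Connection attempts from `workload` that policy dropped (blocked lateral movement)."""
    return [f for f in flows if f.source == workload and f.verdict == "DROPPED"]


flows = [
    Flow(1.0, "app=gateway", "app=inference", "FORWARDED"),
    Flow(2.0, "app=inference", "app=billing-db", "DROPPED"),
    Flow(3.0, "app=trainer", "app=dataset-cache", "FORWARDED"),
]

print(peers_of(flows, "app=inference"))
for flow in denied_attempts(flows, "app=inference"):
    print(f"{flow.timestamp}: {flow.source} -> {flow.destination} {flow.verdict}")
```

In this sketch the dropped flow is both the containment (the compromised workload never reached the database) and the forensic trail (the attempt is on record, with who tried to reach what).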
The signal hiding in plain sight
This is the part I find most telling – Cilium isn’t a Cisco-only story. The same open-source foundation underpins managed Kubernetes at every major hyperscaler in the world, and the companies that compete fiercely on every other layer of cloud infrastructure quietly converged on the same network layer.
When the people building AWS, Google Cloud and Azure all reach for the same answer to the same problem, they’re telling you something about the problem – that there’s one solution that holds up at this scale, and everyone running serious infrastructure has standardized on it.
“Everyone running serious infrastructure has standardized on the same answer.”
What this means for your organization
You don’t need OpenAI’s scale to face OpenAI’s problems. Anyone running AI workloads on Kubernetes is dealing with the same failure modes, just at smaller numbers.
A container compromise that spreads laterally hurts you whether your cluster has fifty pods or fifty thousand. None of these risks scale with cluster size – they scale with how well your infrastructure was set up to contain them in the first place.
So should you care?
If you’re building anything AI-shaped on Kubernetes, yes.
Cisco acquired Isovalent in 2024, but the case for Cilium predates that. The people running the biggest AI workloads in the world chose it on the merits, and they chose it before Cisco ever owned it. The acquisition just means the open-source foundation now sits inside a portfolio with the security and observability layers around it that turn it into something an enterprise can actually deploy and govern.
The infrastructure layer rarely gets the spotlight, but it’s what determines whether your AI ambitions hold up when the load is real.
The full OpenAI case study is here: OpenAI Uses Isovalent for a Common Networking Foundation