I’ve always believed that when it comes to building scalable AI infrastructure, you have to find a balance between performance and reliability. Technology offers speed, but a dependable infrastructure is what makes AI truly sustainable. My time at Cisco has reinforced that belief, so I wanted to share a few key thoughts on the topic.

Before I joined the business, I knew Cisco had always taken a balanced approach to infrastructure. The first internal cluster, built before ChatGPT was released, used 48 Volta-era GPUs on OpenStack and was created to explore and learn. To me, this showed an understanding that you can’t have capability without first having clarity. As demand for generative AI surged, Cisco then scaled methodically. A 128-GPU H100 cluster, dubbed Zeus, was built to validate Ethernet AI fabrics. Unsurprisingly, the benchmark numbers confirmed my view: AI training is often constrained not by compute, but by the network. That is the fundamental truth businesses should consider when making any infrastructure decision: GPUs are only as fast as the data flowing between them, so networking capability matters.
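To make the point about the network concrete, here is a rough back-of-the-envelope sketch. It uses the standard bandwidth-only estimate for a ring all-reduce, in which each GPU moves about 2(N-1)/N times the gradient payload per step, and it ignores latency and any overlap between communication and compute. The model size, per-step compute time, and link speeds are illustrative assumptions of mine, not measurements from Zeus or any other Cisco cluster.

```python
def ring_allreduce_seconds(num_gpus: int, payload_bytes: float,
                           link_bytes_per_s: float) -> float:
    """Bandwidth-only estimate for one ring all-reduce (latency ignored)."""
    if num_gpus < 2:
        return 0.0  # nothing to synchronize on a single GPU
    return 2 * (num_gpus - 1) / num_gpus * payload_bytes / link_bytes_per_s


def step_seconds(compute_s: float, comm_s: float) -> float:
    """Worst case: communication is not overlapped with compute."""
    return compute_s + comm_s


# Illustrative assumptions only (hypothetical workload, not benchmark data).
NUM_GPUS = 128
GRADIENT_BYTES = 14e9   # e.g. 7 billion parameters at 2 bytes each
COMPUTE_S = 0.5         # assumed per-step compute time in seconds

for gbps in (100, 400):
    link = gbps * 1e9 / 8  # convert Gbit/s to bytes/s
    comm = ring_allreduce_seconds(NUM_GPUS, GRADIENT_BYTES, link)
    total = step_seconds(COMPUTE_S, comm)
    print(f"{gbps} Gb/s: comm {comm:.2f}s, network share {comm / total:.0%}")
```

Under these assumptions, gradient synchronization takes longer than the compute itself at either link speed, and the faster link cuts the communication time by a factor of four. Real training stacks overlap communication with compute and use hierarchical collectives, so treat this as an upper bound on the network's share rather than a prediction.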

Returning to the present, I can see how the same logic has guided subsequent builds. A 512-GPU H200 cluster now supports Cisco’s Customer Experience (CX) team, where a custom Mistral model ingests more than 50 data signals to streamline support workflows. The result is a 20% reduction in renewal processing time. In other words, a more reliable network translated into faster output.

This idea of balance covers not just performance but security as well, because reliable functionality isn’t possible without a certain level of built-in safety. In Cisco’s case, this is evident in the fact that segmentation, encryption, visibility, and zero trust are treated as defaults. If we want AI to be part of the core operating fabric of modern enterprises, then it has to be built on a foundation that is as reliable as it is secure. This is only achievable when every cluster, model, and workload is governed by security-first principles.

Beyond technical capability, businesses need to understand the balance between performance and reliability if they want their AI to move from experimental use to large-scale operational deployment. To me, this is key to success. I’ll keep sharing more learnings as and when I can. In the meantime, I’m interested to hear how you’re approaching your AI investments and whether you’re managing that balance effectively.

#CiscoAI #EnterpriseAI #ScalableAI #SecureAI