In the modern world, Artificial Intelligence (AI)/Machine Learning (ML) has emerged as a pivotal force driving innovation across industries. From analytics and customer service, to data cleansing and validation, and, in the near future, robotic surgery and large-scale deployment of autonomous vehicles, AI/ML is reshaping how we interact with technology and data. This transformative power, however, hinges on the ability to process vast amounts of data at unprecedented speeds, putting a spotlight on the role of, and need for, datacenters optimized specifically for AI/ML workloads.

Traditional datacenters are architected to support a wide range of IT tasks. Unfortunately, they fall short when it comes to supporting AI/ML workloads, because the computation and networking performance these workloads demand – massive GPU interconnectivity, large-scale distributed databases, and a low-latency, low-loss networking fabric – is significantly different from traditional application processing and content hosting within datacenters.

Service Providers (SPs) see this as an opportunity to re-architect some of their data centers and turn them into specialized computational resources. These re-architected resources are built to handle the intensive, parallel processing workloads that power AI/ML, allowing customers to focus on the ‘what’ & ‘why’ rather than the ‘how’ of harnessing this technology.

One of our customers in Asia Pacific, an SP with a vision to support the next generation of digital solutions, has re-architected its datacenter with Cisco’s help, and found that the journey has not only been financially rewarding but has also provided benefits in other critical domains such as security and sustainability.

The Building Blocks of Datacenters Optimized for AI/ML Workloads

At the heart of the transformation of datacenters lies the strategic integration of Graphics Processing Unit (GPU) clusters, RDMA (Remote Direct Memory Access), its associated transport protocol (RoCEv2), and cutting-edge Ethernet switching technologies, such as Cisco Nexus Switches.

This combination might look like an alphabet soup at first, but it is the foundation upon which modern AI/ML datacenter infrastructure is built, ensuring that AI applications not only run but thrive.

GPU clusters provide incredible processing power, enabling the parallel computation that’s essential for crunching complex AI/ML algorithms. These clusters communicate over RDMA, which bypasses the CPU, dramatically reducing latency and ensuring that data moves at lightning speed.

Unlike traditional datacenter compute workloads, GPU AI/ML workloads have less entropy and are typically large (“elephant”) flows. This is where it’s important to talk about network switch buffers – sections of physical memory used to temporarily store data during its transit across the network. Buffers serve as essential pit stops in the network, allowing the datacenter to control the flow of data when there’s a deluge of information moving between GPUs. This allows the datacenter to effectively manage traffic, prevent bottlenecks, and ensure that the AI’s computational processes are smooth and swift.
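To make the point concrete, here is a toy simulation (not a model of any Nexus internals) of a fixed-size egress buffer with tail drop. The same total traffic survives intact when it arrives smoothly, but a synchronized elephant-flow burst overflows a shallow buffer and loses packets:

```python
from collections import deque

def simulate_buffer(burst_sizes, buffer_capacity, drain_per_tick):
    """Toy model of a switch egress buffer: each tick, a burst of
    packets arrives, then the link drains a fixed number of packets."""
    buffer = deque()
    dropped = 0
    for burst in burst_sizes:
        # Enqueue the burst, tail-dropping anything beyond capacity.
        for _ in range(burst):
            if len(buffer) < buffer_capacity:
                buffer.append(1)
            else:
                dropped += 1
        # Drain the egress link for this tick.
        for _ in range(min(drain_per_tick, len(buffer))):
            buffer.popleft()
    return dropped, len(buffer)

# 40 packets total in both runs, identical buffer and link speed.
bursty_drops, _ = simulate_buffer([40, 0, 0, 0], buffer_capacity=20, drain_per_tick=10)
smooth_drops, _ = simulate_buffer([10, 10, 10, 10], buffer_capacity=20, drain_per_tick=10)
print(bursty_drops, smooth_drops)  # prints: 20 0
```

The burst loses half its packets while the smooth pattern loses none, which is exactly why low-entropy, bursty GPU traffic puts so much pressure on buffering.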

In the precise and time-sensitive tasks that AI and ML models undertake, these buffers are the guardians of efficiency, enabling a flawless and uninterrupted dialogue between GPUs, which is critical for maintaining the momentum of AI’s learning and decision-making processes.

Why Buffers are a Big Deal & How Cisco Optimizes Buffering for AI/ML Workloads

Buffers play an indispensable role in the orchestration of AI and ML workloads. Without buffers, GPUs might stall while waiting for data, which would decrease performance and result in poor experiences for users. Without a doubt, managing buffers is critical – but the reality is that buffers also add latency. Too much buffering isn’t a good thing.

Simply put, to optimize datacenters for AI/ML workloads, the key is to balance the use of buffers so as to avoid packet loss and congestion without causing too much lag. This is precisely what Cisco does with Intelligent Buffering.

Cisco’s Intelligent Buffering on Nexus switches is a masterclass in this balancing act. It’s designed to adapt to the ebbs and flows of datacenter traffic, employing sophisticated algorithms that adjust buffer allocation on the fly. This adaptive approach ensures that during periods of high traffic, essential AI/ML data packets are queued efficiently, minimizing the chance of packet loss.

The system also prevents over-buffering, which can introduce unnecessary latency. By continuously monitoring network conditions, Cisco’s technology provides just the right amount of buffering – enough to smooth out data transfer without slowing down the network.
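One well-known way to express this balancing act (shown here as an illustrative sketch, not Cisco’s proprietary algorithm) is the classic dynamic-threshold rule for shared-memory switches: each queue may only grow in proportion to the buffer space currently free, so allowances shrink automatically as the shared pool fills:

```python
def dynamic_threshold(free_cells, alpha=2.0):
    """Dynamic-threshold rule for a shared buffer pool: a queue may
    grow up to alpha times the currently free buffer space, so
    per-queue allowances adapt as the pool fills or empties."""
    return alpha * free_cells

def admit(queue_len, free_cells, alpha=2.0):
    # Accept a packet only if the shared pool has space and this
    # queue is still below its current dynamic threshold.
    return free_cells > 0 and queue_len < dynamic_threshold(free_cells, alpha)

# With a nearly empty buffer, a bursty queue may grow deep;
# when the shared pool is tight, the same queue is throttled early.
print(admit(queue_len=100, free_cells=900))  # accepted: threshold is 1800
print(admit(queue_len=100, free_cells=40))   # rejected: threshold is only 80
```

The appeal of this family of schemes is that the threshold is recomputed continuously from network conditions rather than fixed per queue, which is the same principle the blog describes: enough buffering to absorb bursts, never so much that latency balloons.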

Cisco’s Nexus Switches take this a step further with intelligent traffic management: they are engineered to understand the type of traffic and can prioritize time-sensitive data, such as real-time analytics or model training information, ensuring that this data moves through the network with higher priority and lower latency.

The result is a network infrastructure that is not only robust and responsive but also finely tuned for the rigorous demands of AI and ML operations, facilitating advancements and innovations in the field.

Having discussed buffers at length, let’s now talk about the technology that connects everything together and facilitates the flow of data – Ethernet & RoCEv2 (RDMA over Converged Ethernet version 2).

RoCEv2 enhances data transport efficiency within the network, allowing direct memory access from one computer to another without involving the CPU. This protocol is integral in reducing latency and freeing up CPU resources for more critical processing tasks. In AI/ML operations, where every millisecond of processing time can be the difference between staying ahead and falling behind, RoCEv2 delivers the high-speed, low-latency communication necessary for complex machine learning algorithms and neural network training.
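A detail worth noting about RoCEv2 on Ethernet fabrics: it runs over UDP with a fixed destination port (4791), so the UDP source port becomes the main entropy knob for spreading flows across equal-cost paths. The toy sketch below illustrates the idea; real switches use vendor-specific hardware hash functions, and CRC32 here is only an illustrative stand-in:

```python
import zlib

ROCEV2_UDP_PORT = 4791  # IANA-assigned UDP destination port for RoCEv2

def ecmp_path(src_ip, dst_ip, src_port, n_paths=8):
    """Toy ECMP hash over the flow 5-tuple. With the protocol and
    destination port fixed by RoCEv2, varying the UDP source port
    is what steers different flows onto different links."""
    key = f"{src_ip},{dst_ip},udp,{src_port},{ROCEV2_UDP_PORT}".encode()
    return zlib.crc32(key) % n_paths

# Varying the source port per queue pair spreads otherwise identical
# elephant flows between the same two hosts across the fabric's links.
paths = {ecmp_path("10.0.0.1", "10.0.0.2", sp) for sp in range(49152, 49160)}
print(sorted(paths))
```

This is one reason low-entropy GPU traffic needs deliberate handling: without source-port variation, all traffic between two hosts would hash onto a single path.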

Cisco’s integration of RoCEv2 within its Nexus switches not only enables, but actively enhances, the performance of AI and ML applications.

This Is Just the Beginning for AI/ML

As enterprises strive to harness the power of AI/ML, SPs that can offer the necessary infrastructure will become invaluable allies. SPs that optimize some of their datacenters for AI/ML today won’t just attract new clientele but will also become integral to their customers’ ongoing digital evolution.

The shift towards AI/ML-optimized infrastructure is not just a technical upgrade but a strategic necessity for companies looking to gain a competitive edge through innovation and efficiency. SPs can offer not only the infrastructure but also the expertise and managed services that can help businesses navigate the complexities of AI/ML implementation.

The AI/ML optimized data center is a value proposition that goes beyond infrastructure provisioning, to encompass a comprehensive suite of services that support the entire lifecycle of AI/ML projects, from inception and training to deployment and ongoing optimization. As a result, SPs can tap into a growing market, foster long-term client relationships, and drive the advancement of AI and ML technologies across industries.

Our client – with their vision to harness AI/ML for business in their region – is now in a unique position. Having re-architected some of their datacenters using Cisco’s solutions, which include our Nexus switches (with Silicon One at their core), they’ve also gained benefits in the security and sustainability domains. While returns are already flowing in, the full extent of this transition can only be truly measured once the rush for AI/ML adoption plateaus, which is not likely to happen anytime soon.