Understanding the Networking Fabric Behind Modern AI Deployments

Data centers have taken most of the spotlight in recent discussions of the infrastructure needed to support Artificial Intelligence (AI) deployments. That focus is understandable: companies that acquire the storage and compute capabilities (GPUs and CPUs) to train and run complex AI models, manage large datasets, and keep AI services highly available and scalable are well positioned to win the AI race. However, data centers aren’t the only building block crucial to succeeding with this technology.

Companies looking to lead the new era by empowering workers and workflows with AI are quickly learning that they need a cutting-edge networking fabric to deliver results.

More specifically, organizations need RoCEv2 (RDMA over Converged Ethernet version 2) to connect the GPUs in the computational nodes of the data center supporting AI training and inferencing solutions, while minimizing latency and maximizing throughput for AI workloads.

RoCEv2 is an advanced networking protocol that offers robustness and high efficiency. Running over standard Ethernet and improving on its predecessor, RoCEv1, it delivers scalability and stability, making it well suited to demanding data environments. In the training cluster of an AI data center, for example, data ingest and processing workflows run simultaneously, which lets the algorithm arrive at a response quickly. Naturally, the volume and speed of the data that travels between the computational nodes to feed an AI algorithm are enormous, and RoCEv2 needs a lossless networking fabric to keep things running smoothly.

Looking under the hood of the RoCEv2 protocol

Before we dive into RoCEv2, let’s address a question: why do AI workloads need RoCEv2 when traditional data centers do fine with a standard Ethernet connection? The reason is that ordinary Ethernet tends to lose packets, whether because of a faulty cable, a loose connection, or a switch that was too congested to forward the traffic. RoCEv2 has many advantages, but the biggest is that, on a properly configured lossless fabric, it moves data between the computational nodes in the data center without dropping packets.

By design, RoCEv2 enables direct memory access without CPU intervention, offloading work from the CPU and freeing up resources for critical AI computations. Its main advantages are high-throughput, low-latency transfers of information at the memory level. To support this kind of traffic over Ethernet, a lossless network is essential.

Cisco network switches are built for data center networks and provide the required low latency. With up to 25.6 Tbps of bandwidth per ASIC, they deliver the very high throughput that AI/ML clusters running over RoCEv2 transport demand. They also maintain a lossless network through QoS techniques such as ECN (Explicit Congestion Notification) and PFC (Priority Flow Control), with software and hardware telemetry providing visibility into it.
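To put the 25.6 Tbps per-ASIC figure in perspective, the short Python sketch below works out how many ports of a given speed that bandwidth could feed at line rate, and how long an idealized bulk transfer would take on one port. The port speeds, the helper names, and the 100 GB payload are illustrative assumptions rather than a description of any particular switch model, and the arithmetic ignores protocol overhead.

```python
# Back-of-the-envelope arithmetic only; real port layouts depend on the platform.
ASIC_BANDWIDTH_GBPS = 25_600  # 25.6 Tbps per ASIC, the figure quoted in the article


def max_ports(port_speed_gbps: int, asic_gbps: int = ASIC_BANDWIDTH_GBPS) -> int:
    """Number of ports of the given speed the ASIC bandwidth could serve at line rate."""
    return asic_gbps // port_speed_gbps


def transfer_seconds(gigabytes: float, port_speed_gbps: float) -> float:
    """Ideal time to move `gigabytes` of data over one port (no overhead, no congestion)."""
    gigabits = gigabytes * 8
    return gigabits / port_speed_gbps


if __name__ == "__main__":
    # Hypothetical port speeds chosen for illustration.
    for speed in (100, 200, 400, 800):
        print(f"{speed}G ports: {max_ports(speed)}")
    # Hypothetical 100 GB payload moving over a single 400G port.
    print(f"100 GB over 400G: {transfer_seconds(100, 400)} s")
```

At 400 Gbps per port, for instance, the quoted bandwidth corresponds to 64 ports running at line rate.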

To visualize how ECN and PFC function, think of driving in city traffic. ECN acts like the alert from your car’s GPS that the road ahead is congested: the network marks packets so that the sending endpoints slow down before queues overflow. PFC is more like a traffic signal that briefly holds one lane: when a switch buffer is close to full, it tells the upstream device to pause a specific class of traffic until there is room again, so packets wait instead of being dropped.

Ultimately, ECN and PFC help manage the volume and speed of data so that AI algorithms work seamlessly.
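The interplay between the two mechanisms can be sketched with a toy model of a single switch queue. The Python example below is a simplified illustration, not Cisco’s implementation: the thresholds, the buffer size, the halve-on-mark sender behavior, and the `simulate` function are all assumptions made for the example. ECN marks traffic once the queue passes a threshold so that the sender can back off early, while PFC pauses the sender outright when the queue approaches its limit.

```python
# Toy model of one switch queue. All numbers are illustrative assumptions,
# not vendor defaults.
ECN_MARK_THRESHOLD = 40   # queue depth above which arriving packets are ECN-marked
PFC_XOFF_THRESHOLD = 80   # depth at which the switch asks the sender to pause
PFC_XON_THRESHOLD = 30    # depth at which the sender may resume
BUFFER_CAPACITY = 100     # hypothetical buffer size; exceeding it would mean drops


def simulate(ticks: int, send_rate: int, drain_rate: int, react_to_ecn: bool) -> dict:
    """Run the queue for `ticks` steps and report how ECN and PFC behaved."""
    depth = 0            # packets currently queued
    paused = False       # True while a PFC pause is in effect
    rate = send_rate     # sender's current rate in packets per tick
    max_depth = pause_ticks = marked = 0

    for _ in range(ticks):
        # PFC: a paused sender transmits nothing during this tick.
        arrivals = 0 if paused else rate
        if paused:
            pause_ticks += 1
        depth += arrivals
        max_depth = max(max_depth, depth)

        # ECN: packets arriving into a queue above the threshold get marked.
        congested = depth > ECN_MARK_THRESHOLD
        if congested:
            marked += arrivals

        # The switch forwards up to `drain_rate` packets per tick.
        depth = max(0, depth - drain_rate)

        # PFC with hysteresis: pause near full, resume once the queue has drained.
        if depth >= PFC_XOFF_THRESHOLD:
            paused = True
        elif depth <= PFC_XON_THRESHOLD:
            paused = False

        # A sender that honors ECN halves its rate when marked, else ramps back up.
        if react_to_ecn:
            rate = max(1, rate // 2) if congested else min(send_rate, rate + 1)

    return {"max_depth": max_depth, "pause_ticks": pause_ticks, "marked": marked}


if __name__ == "__main__":
    print("ECN ignored:", simulate(100, send_rate=20, drain_rate=10, react_to_ecn=False))
    print("ECN honored:", simulate(100, send_rate=20, drain_rate=10, react_to_ecn=True))
```

In this model, a sender that ignores ECN marks is repeatedly paused by PFC, which keeps the queue inside the buffer at the cost of stop-and-go traffic. A sender that honors the marks slows down early enough that PFC never has to step in.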

Aside from these congestion-related optimizations, Cisco also uses proprietary knowledge and expertise to tune networking hardware and software to optimize their performance when handling AI-related workloads. Hardware, in this case, includes interfaces, network adapters, and smart Network Interface Cards (smartNICs), while software includes the operating system, driver software, firmware, and more.

While tuning might not sound like a crucial part of the strategy, companies find it incredibly valuable. Even minor improvements in network performance can lead to significant gains in the speed and efficiency of AI model training and inference and, therefore, give them a leg up on competitors.

Seeing is believing with the Nexus Dashboard

While customers looking for the best networking solution for their AI deployment can choose from various combinations of solutions, the best way to see whether everything is working is with the Nexus Dashboard.

This dashboard provides a clear, comprehensive view that showcases the effectiveness of RoCEv2 implementations and the status of ECN and PFC across the network. With real-time analytics and historical data, network administrators can easily assess AI data traffic, identifying and addressing any performance bottlenecks promptly.

The Nexus Dashboard also offers advanced troubleshooting tools that facilitate quick resolution of network issues potentially impacting AI operations, ensuring the reliability of AI services. Additionally, it supports the automation of feature deployment and network tuning, allowing for streamlined configuration across multiple devices. This reduces the manual effort and complexity involved, enhancing the network’s ability to support AI workloads efficiently.

Innovating at speed with AI using cutting-edge data center networking

As AI continues to drive innovation across various sectors, the backbone of its success lies in a robust networking infrastructure.

Cisco’s integration of advanced protocols such as RoCEv2, alongside intelligent congestion management techniques like ECN and PFC, helps AI deployments achieve optimal performance with minimal latency and maximal throughput. The Nexus Dashboard further empowers organizations by providing a powerful visual tool to monitor, manage, and optimize these network environments in real time, so that AI systems operate seamlessly and efficiently.

With these technologies to support their AI ambitions, customers not only address today’s networking challenges but also pave the way for future advancements, establishing themselves as AI leaders today and for the foreseeable future.