Every company is looking to leverage AI and ML to harness its data, accelerate its digital transformation, and differentiate its business in ways that lead to a better experience for every customer, partner, and employee. Over the past decade, the computing power needed to make this possible has become available. Connectivity, the other piece of the puzzle, has been the limiting factor keeping companies from gaining the competitive advantage they’ve been eyeing.
AI and ML workloads at scale rarely fit on a single node or server, which is why connectivity in the form of networking infrastructure is critical. It enables the flow of information between different nodes in a computing network so that the organization’s algorithms and models can access and process data and create insights that decision makers can use.
Since nodes managing AI and ML workloads need to crunch large datasets incredibly quickly, the infrastructure connecting all the nodes in the network needs to have low latency, minimize ‘packet loss’, and be reliable.
If there’s a slowdown or any loss of data, the consequences can be severe: unpredictable system behavior, longer response times, reduced model accuracy, and wasted resources. All of these can cost the company dearly, hurt its reputation, and ruin the competitive advantage it is aspiring towards.
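To see why even a small loss rate matters at scale, here is a rough sketch. It assumes, purely for illustration, that every transfer in a synchronized processing step is lost independently with the same probability and that the step is delayed if any one transfer is lost. Real networks show correlated losses, and the function name and the numbers used are hypothetical.

```python
def prob_step_delayed(loss_rate: float, transfers_per_step: int) -> float:
    """Probability that at least one transfer in a step hits a loss.

    Assumes independent losses with identical probability, which is an
    illustrative simplification and not a model of any real fabric.
    """
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must be between 0 and 1")
    if transfers_per_step < 0:
        raise ValueError("transfers_per_step must not be negative")
    # P(at least one loss) = 1 - P(no loss in any transfer)
    return 1.0 - (1.0 - loss_rate) ** transfers_per_step


if __name__ == "__main__":
    # Hypothetical loss rate of 1 in 100,000 transfers.
    for n in (10, 1_000, 100_000):
        p = prob_step_delayed(1e-5, n)
        print(f"{n:>7} transfers per step: {p:.2%} chance of a delayed step")
```

Under these assumptions, a loss rate that looks negligible for a handful of transfers becomes a frequent source of delay once a step depends on many thousands of them.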
Due to this stringent requirement of low-to-no packet loss and the need for low latency, InfiniBand has for many years been widely used for AI and ML workloads. However, ethernet is now emerging as the networking technology of choice for many customers.
The leaders’ guide to RDMA, RoCEv2, and AI and ML workloads on ethernet
Ethernet has been in existence since the 1970s, and due to its open-standards nature and innovation throughout the years, it has become the network transport of choice.
What’s made ethernet suitable for AI and ML workloads is its ability to carry RDMA (remote direct memory access) traffic using a protocol called RoCEv2 (RDMA over Converged Ethernet version 2). Without RDMA, moving data between servers involves each device’s central processing unit (CPU) and operating system in a series of steps, such as copying data between buffers, which adds delay and, under load, can contribute to packet loss. With RDMA, the network adapter reads and writes application memory directly, and RoCEv2 carries that traffic over standard, routable ethernet and IP networks, removing most of that overhead.
Further, RoCEv2 enables higher bandwidth utilization and works with data centre bridging (DCB) technologies such as priority flow control (PFC), alongside explicit congestion notification (ECN). Just from the names, you can guess their function: PFC pauses a given class of traffic before a receiver’s buffers overflow, and ECN marks packets when queues build up so that senders slow down. Together they provide lossless, more deterministic networking, leading to a more resilient ethernet that can power AI and ML workloads seamlessly.
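The interplay of ECN and PFC can be sketched with a toy model of a single switch queue. Everything here is an assumption made for illustration: the `QueueConfig` thresholds, the `simulate` function, and especially the sender's reaction (halving its rate when marked), which is a simplification and not the congestion-control algorithm that real RoCEv2 adapters use. Real PFC also operates per traffic priority, which this sketch ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueConfig:
    """Hypothetical thresholds, in packets, for one switch queue."""
    capacity: int = 100       # packets the buffer can hold
    drain_per_tick: int = 8   # packets forwarded per tick
    ecn_threshold: int = 40   # mark traffic at or above this depth
    pfc_xoff: int = 80        # pause the sender at or above this depth
    pfc_xon: int = 30         # resume the sender at or below this depth


def simulate(ticks: int, max_rate: int, use_controls: bool,
             cfg: QueueConfig = QueueConfig()) -> dict:
    """Run one sender against one queue and count dropped packets."""
    depth = 0
    rate = max_rate
    paused = False
    dropped = 0
    delivered = 0
    for _ in range(ticks):
        # The sender transmits unless a PFC-style pause is in effect.
        offered = 0 if paused else rate
        accepted = min(offered, cfg.capacity - depth)
        dropped += offered - accepted  # overflow is lost
        depth += accepted

        # The switch forwards what it can this tick.
        sent = min(depth, cfg.drain_per_tick)
        depth -= sent
        delivered += sent

        if use_controls:
            # ECN-style feedback: back off when marked, else ramp up.
            if depth >= cfg.ecn_threshold:
                rate = max(1, rate // 2)
            else:
                rate = min(max_rate, rate + 1)
            # PFC-style feedback: pause near overflow, resume once drained.
            if depth >= cfg.pfc_xoff:
                paused = True
            elif depth <= cfg.pfc_xon:
                paused = False
    return {"dropped": dropped, "delivered": delivered}


if __name__ == "__main__":
    print("without controls:", simulate(200, 60, use_controls=False))
    print("with controls:   ", simulate(200, 60, use_controls=True))
```

In this toy setup, the uncontrolled sender overflows the buffer and loses packets, while the controlled sender is slowed by marking and, when needed, paused before the buffer fills. The thresholds leave headroom above the pause point, which is why an initial burst of this size is absorbed without loss.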
The business economics of ethernet for AI and ML workloads
Professionals in the IT community have historically preferred ethernet because they’re familiar with it, understand it, and know how to get the most out of it. The technology has many fans in the industry.
However, familiarity isn’t the only winning edge. Ethernet’s biggest advantage is compatibility.
Since the technology has been around for about half a century, the ecosystem of hardware and software solutions is extremely mature. What this means is that devices, servers, switches, and components are forwards-and-backwards compatible; there’s interoperability between devices, as well as protocol and application compatibility; finally, management and monitoring of an ethernet network and everything it connects is easy, and most importantly, reliable.
Cost-effectiveness is another factor when choosing ethernet. Because of the mature ecosystem and the fact that it has been the undisputed leader in the networking space for decades, the cost of hardware and software has fallen significantly. The low cost of the technology is key to AI and ML deployments because in such instances, the focus of the budget needs to be the data, the models, and the processing power. The connectivity needs to tie everything together without itself needing attention and effort.
Finally, security is a ‘top of mind, top of the agenda’ item for every business leader today. Ethernet has been in use for decades and has therefore evolved to address security concerns over the years.
It provides several security features and protocols that help protect data and communication within the network. Between virtual LANs (VLANs), network access control (NAC), network segmentation and firewalls, 802.1X authentication, which requires devices to authenticate before accessing the network, and other solutions, there are plenty of tools and mechanisms to keep the ethernet, and thus AI and ML workloads, secure.
Ethernet technology has been around for a very long time, and there are many experts available who understand it and can deploy and maintain it. Leveraging the ethernet ecosystem to build a lossless ethernet fabric with scheduling capabilities is the ideal networking choice for your AI/ML workload compared to any proprietary, closed architecture.
Ethernet is the practical choice for AI and ML workloads
Most leaders leveraging AI and ML are trying to gain a business advantage to drive differentiation in their market. Getting the technology right will be a game changer.
To win, leaders must focus their energy on several fronts: clarifying their goals and objectives and building clear use cases, data quality and accessibility, talent and expertise, security, and even compliance. With ethernet evolving and rising to meet the needs of modern AI and ML workloads, it should be the obvious choice for any organization. Its compatibility, cost-effectiveness, expertise and ecosystem, and security make it the most practical choice.