Avatar

Today’s guest post is from Rolf vandeVaart, a Senior CUDA Engineer with NVIDIA.

GPUs are becoming quite popular as accelerators in High Performance Computing clusters. For example, check out Titan; a recent entry into the Top 500 list from Oak Ridge Laboratories. Titan has 18,688 nodes (299,008 CPU cores) coupled with 18,688 NVIDIA Tesla K20 GPUs.

To help ease the programming burden working with GPU memory in MPI applications, support has been added to several MPI libraries such that the MPI library can directly send and receive the GPU buffers without the user having to stage them in host memory first. This has sometimes been referred to as “CUDA-aware MPI.”

Here is some psuedo code that shows the difference. This is typical application code that uses a “regular” (non-CUDA-aware) MPI library:

cudaMalloc(&gpuPtr, DATA_SIZE);
cudaMallocHost(&cpuPtr, DATA_SIZE);
/* Special CUDA mojo to launch computation kernel on GPU */
kernel<<grid, block>>(gpuPtr);
cudaMemcpy(cpuPtr, gpuPtr, DATA_SIZE, cudaMemcpyDefault);
MPI_Send(cpuPtr, NUM_ELEMENTS, MPI_DOUBLE, dest, tag, comm);

But a CUDA-aware MPI can hide the extra steps for you:

cudaMalloc(&gpuPtr, DATA_SIZE);
kernel<<grid, block>>(gpuPtr);
cudaDeviceSynchronize();
MPI_Send(gpuPtr, NUM_ELEMENTS, MPI_DOUBLE, dest, tag, comm);

The MPI library is doing a few things behind the scenes. First, CUDA supports the ability to determine if a buffer is a GPU buffer or a host buffer via this API function:

cuPointerGetAttribute(&memType, CU_POINTER_ATTRIBUTE_MEMORY_TYPE, ptr);

If memType is CU_MEMORYTYPE_DEVICE, the MPI library can initiate a copy to get the memory from the device before handing off the resulting buffer to the network API. While this CUDA query function needs to be invoked for every send and receive buffer, pains were taken to optimize this function and make its overhead minimal.

Note, too, that this is quite analogous to what happens with OpenFabrics-based networks: in every call to MPI_Send, the buffer must be looked up to see if it is already registered with the OpenFabrics network stack. Hence, this is not really a new concept.

Once inside the MPI library, there are basically two ways that the GPU buffers can be moved. First, they can be staged through internal host buffers in the MPI library. In this case, after the GPU data is copied into the internal host buffers, MPI just utilizes its existing protocols to send the data. Additionally, the internal host buffers can be registered with the CUDA system via cudaMemHostRegister(). With CUDA-registered memory, the data can be copied asynchronously by a DMA unit in the GPU. The MPI library than just polls every now and then to determine when the data is ready to be sent via standard host buffer protocols.

Alternatively, the MPI library can take advantage of GPU-to-GPU data movement capabilities where available. For example, within a single node, CUDA has a set of APIs that allows copying data directly between two GPUs without passing through host memory. Even better, the DMA units on the GPUs progress such copies asynchronously and without involvement of the main CPU. These APIs are called CUDA Interprocess Communication (IPC) functions; more details can be found here.

These types of CUDA support are appearing in more and more MPI implementations.

Here are some links to FAQs and papers that talk about this feature:

And if you are attending CTC 2013 March 18-21, 2013 in Santa Clara, there are some talks scheduled on this topic: