Avatar

I’ve talked before about how getting high performance in MPI is all about offloading to dedicated hardware.  You want to get software out of the way as soon as possible and let the underlying hardware progress the message passing at max speed.

But the funny thing about networking hardware: it tends to have limited resources.  You might have incredibly awesome NICs in your HPC cluster, but they only have a finite (small) amount of resources such as RAM, queues, queue depth, descriptors (for queue entries), etc.

MPI’s job is to manage these resources.  But with rising core counts — such as Intel E5 2690 v2’s 10 cores per socket — distributing those network / NIC resources between all the MPI processes on a server can require treading a fine line between aggressive performance and fair sharing.

In general, many MPI implementations divide up networking resources evenly/fairly at the beginning of a job.  Each MPI process gets 1/Nth of the networking resources on a server (i.e., shared among N MPI processes on that server).  Then the MPI implementation attempts to keep its hardware resources as full as possible so that the main CPU can go off to do other things and let the networking hardware progress itself.  

Much network hardware is implemented with some kind of queue-based interface back to the main CPU.  For example, the host enqueues a “send” or a “receive” descriptor on the networking hardware, and then networking hardware processor dequeues the descriptor and processes it.

In the case of a send, for example, the network hardware likely does something like this:

  • Dequeue the descriptor
  • Realize that it’s a send descriptor
  • Transfer the packet buffer from RAM (e.g., via DMA over the PCI bus) to local memory
  • If necessary, form a network header
  • Send the network header out on the physical layer (e.g., wire or fiber)
  • Send the packet buffer out on the physical layer

These actions are likely overlapped and/or pipelined — they don’t need to occur in serial.  And since the networking hardware is running on an ASIC that was created specifically for this purpose, it’s really, really fast and efficient.

Receive descriptors are usually a way of telling the NIC that when an incoming packet matching a certain pattern arrives, put it in a specific RAM location where the host can process the incoming packet.

But remember how I said above that network hardware has its limits?  These limits are generally much smaller than those out in main memory.  The queue to communicate with the NIC, for example, can only hold so many descriptors — usually a few thousand or so.

Now also remember that the main CPU is typically really flippin’ fast.  Probably much faster than the NIC hardware.  Meaning: it can probably queue up descriptors much faster than the network hardware can dequeue them.

What’s an MPI implementation to do, for example, in a case like this:

char message;
for (int i = 0; i < 2000000; ++i) {
MPI_Isend(&message, 1, MPI_CHAR, dest, tag, comm, &req[i]);
}

The MPI will likely aggressively enqueue send descriptors for the first bunch of those messages.

…but then the network hardware queue will fill up.  Yoinks!

Remember that MPI is a reliable and guaranteed message passing system.  So if you call MPI_Isend(), the message either has to (eventually) be delivered, or MPI_Isend() must fail (which may not be discovered until you test or wait on the resulting request, but let’s ignore that fact for the moment).

In this case, the MPI will likely queue up the messages in software, and wait for some space to become available in the network hardware queue before enqueueing the next bunch.  Simple enough, right?

Yes… and no.

Depending on how the network hardware works, send and receive descriptors may share the same queue.  And if you’re slamming that queue with send descriptors, you won’t be able to post any receive descriptors.  Which, if you’re simultaneously receiving network traffic (and you probably are), you need to keep re-posting new receive buffers.  Hence: you better keep a little space in that queue for re-posting receive buffers.

So it may not be a good idea to completely monopolize the hardware queue with send requests.

And what about other messages that need to get sent?  For example, the MPI may also need to send some control messages around, such as replying to large message rendezvous requests, etc.

Additionally, if the MPI is really smart, it may be able to coalesce some of these software-enqueued MPI_Isend requests.  For example, if it sees the same message going to the same receiver on the same communicator and tag, it can just tick off a counter and say “send N of this same message”, and then only actually send that message (and counter) across the network once.  The receiver will expand that message into N receives, and all is good.

(there’s a few variations possible on the above coalescing scheme, but note that that kind of situation usually only happens in benchmarks!)

The point: MPI has to carefully tread between absolute performance of a sequence of sends, or overall fairness/performance.  Allocating hardware resources fairly to ensure an overall level of high performance is a difficult balancing act.