It’s the eternal question: should I send lots and lots of small messages, or should I lump multiple small messages together into a single, bigger message?
Unfortunately, the answer is: it depends. There are a lot of factors in play.
The purist answer is, “it doesn’t matter.” MPI does not distinguish between “small” and “large” messages. Indeed, many may be surprised to learn that even though you may hear a lot about MPI’s “eager limit,” the MPI specification defines no such term. The “eager limit” is an artifact of an implementation technique that is common to many MPI implementations, but it is not part of the standard definitions in the MPI specification.
From an MPI standard perspective, sending one message is pretty much the same as sending another.
But in reality, it can be quite different. Here are a few factors that come into play:
1. MPI implementations need to send your data across a network (which may even be shared memory) between peer MPI processes. A certain amount of metadata must be sent with your MPI message, including information such as the communicator on which the message was sent, the message tag, the sending process’s ID, and possibly network transport information such as a MAC address, IP address, or OpenFabrics LID. For efficient communication, you may need to minimize the ratio of metadata to message data in a single network frame.
2. Some networks will pad frames up to a minimum size. For example, if the MPI metadata plus your message data is 32 bytes, the operating system, NIC driver, NIC hardware, and/or network switch may pad the message to a minimum of 64 bytes, just so that hardware ASICs can perform more efficiently by guaranteeing a minimum size for all messages. This may hurt individual message latency, but make hardware transfers more efficient overall, for example at larger message sizes.
3. The network fabric itself has a raw data rate, which serves as an upper bound on how many small messages can be sent per unit time. For example, QDR InfiniBand’s 32 Gb/s data rate means that it can send, at most, roughly 62.5 million 64-byte frames per second (32 Gb/s divided by 512 bits per frame). I arbitrarily picked a 64-byte frame size, which includes the OpenFabrics header, the MPI metadata, and the MPI message data; your frame size may vary, depending on these factors.
4. The small message injection rate is typically the maximum rate at which your NIC can inject small messages onto the network. The higher the quality of the NIC (and switch), the closer this rate approaches the theoretical maximum cited in #3. But remember: for “short” messages, the ratio of metadata to message data is typically high, so even if the injection rate is high, overall network efficiency may end up being low.
5. Today’s CPU and RAM technologies are getting more and more impressive. You need to compare the speed of memory copies vs. the small message injection rate of your network. For example, there is likely a message size under which it is more efficient to copy N MPI message data buffers into a single, larger buffer and send just 1 MPI message instead of sending N MPI messages (see the sketch just after this list). Memory copies are highly optimized these days, but can be affected by all kinds of factors such as process and memory locality; you’ll need to consider whether your copy’s source and destination buffers are local, whether the copying process is pinned, when/if the MPI implementation is itself aggregating messages, etc.
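To make that last point a bit more concrete, here’s a minimal sketch of the two approaches (the sizes NUM_MSGS and MSG_BYTES and the fixed-size buffers are made up purely for illustration; your data layout will almost certainly be messier):

#include <mpi.h>
#include <string.h>

/* Hypothetical sizes -- tune for your own application, network, and MPI */
#define NUM_MSGS  64
#define MSG_BYTES 32

/* Option 1: N small sends; each one pays its own metadata and injection cost */
void send_individually(char bufs[NUM_MSGS][MSG_BYTES], int peer, int tag, MPI_Comm comm)
{
    for (int n = 0; n < NUM_MSGS; ++n)
        MPI_Send(bufs[n], MSG_BYTES, MPI_BYTE, peer, tag, comm);
}

/* Option 2: memcpy everything into one contiguous buffer, send one larger message */
void send_coalesced(char bufs[NUM_MSGS][MSG_BYTES], int peer, int tag, MPI_Comm comm)
{
    char packed[NUM_MSGS * MSG_BYTES];
    for (int n = 0; n < NUM_MSGS; ++n)
        memcpy(packed + n * MSG_BYTES, bufs[n], MSG_BYTES);
    MPI_Send(packed, NUM_MSGS * MSG_BYTES, MPI_BYTE, peer, tag, comm);
}

The receiver, of course, has to agree on which scheme is used (e.g., one MPI_Recv of NUM_MSGS * MSG_BYTES bytes in the coalesced case), and whether the extra memory copies pay off depends on all the factors above.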
These are some factors off the top of my head. And, honestly, I could probably spend a whole blog post on each of them — I’ve omitted many details in each of the above bullets.
Re-reading the above text, I see that it seems to imply that you should find the message size where it is more efficient to pack data into a single buffer than to send individual messages. I don’t mean to imply that. Many hardware platforms and MPI implementations optimize for small messages, but not necessarily to the same degree. Although this strains the central tenet of portability of MPI applications, you will likely need to experiment with your MPI implementation and hardware to find the balance that achieves the best performance for your application.
For example, the following trivial code should quite likely be optimized to send a single message:
MPI_Send(&i, 1, MPI_INT, peer, tag, comm);
MPI_Send(&j, 1, MPI_INT, peer, tag, comm);
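A combined version might look something like this (just a sketch; buf is a scratch array introduced here for illustration):

int buf[2];
buf[0] = i;
buf[1] = j;
MPI_Send(buf, 2, MPI_INT, peer, tag, comm);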
But real applications are rarely this simple, and the optimizations may be more difficult to spot.
Bottom line: dig into your code and conduct a few trivial experiments; see if you can combine messages to get higher MPI throughput.
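If you want a starting point for such an experiment, here’s a rough sketch that times N small sends against 1 large send of the same total payload using MPI_Wtime (run it with exactly 2 processes; NUM_MSGS and MSG_BYTES are arbitrary values, and timing only the sender side is deliberately crude):

#include <mpi.h>
#include <stdio.h>

#define NUM_MSGS  1000
#define MSG_BYTES 8

int main(int argc, char **argv)
{
    int rank;
    char small[MSG_BYTES] = { 0 };
    char big[NUM_MSGS * MSG_BYTES] = { 0 };

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Time NUM_MSGS individual small sends */
        double t0 = MPI_Wtime();
        for (int n = 0; n < NUM_MSGS; ++n)
            MPI_Send(small, MSG_BYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        double t_many = MPI_Wtime() - t0;

        /* Time 1 send of the same total payload */
        t0 = MPI_Wtime();
        MPI_Send(big, NUM_MSGS * MSG_BYTES, MPI_BYTE, 1, 1, MPI_COMM_WORLD);
        double t_one = MPI_Wtime() - t0;

        printf("%d small sends: %g s; 1 big send: %g s\n",
               NUM_MSGS, t_many, t_one);
    } else if (rank == 1) {
        /* Matching receives for both phases */
        for (int n = 0; n < NUM_MSGS; ++n)
            MPI_Recv(small, MSG_BYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        MPI_Recv(big, NUM_MSGS * MSG_BYTES, MPI_BYTE, 0, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}

Timing only the sender mostly measures injection and eager-protocol behavior, but it’s usually enough to tell you whether combining messages is worth pursuing on your particular hardware and MPI implementation.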
Nice post, this is an interesting topic.
Since you are an expert, do you know whether MPI libraries are able to automatically coalesce messages? Some time ago I came across a parameter of Open MPI’s MCA framework called “btl_openib_use_message_coalescing” which should do that. However, I have trouble figuring out how such a thing is concretely implemented. It seems to me there should be some kind of timer that waits “enough time” for several messages to be in the send buffer and then dispatches them all together to the receiver… but this would have a very bad impact on latency… and I don’t think this would even be allowed in MPI.
So the question remains!
Open MPI coalesces messages only when it is stalled from sending.
For example, when the hardware send queue is full, but the application is still sending more new (short) messages. If successive new messages in this case have the same MPI signature (i.e., to the same receiver, on the same CID, with the same tag), then Open MPI will coalesce the messages together.
Hence, when the send queue eventually opens up, there will be fewer entries placed on the queue than MPI messages that are actually sent.
Make sense?
This was a particularly ugly feature to implement and debug. But I’d be surprised if other OpenFabrics-based MPI implementations don’t do similar things (i.e., coalesce only when otherwise blocked from sending).
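If it helps, here’s a toy model of the decision logic, just to show the general idea; all of the names and the tiny depth-2 “hardware queue” are made up for illustration, and this is not Open MPI’s actual code:

#include <stdio.h>

/* Toy model of the decision logic only */
typedef struct { int peer, cid, tag, nbytes; } frame_t;

#define QUEUE_DEPTH 2
static frame_t hw_queue[QUEUE_DEPTH];
static int     hw_count = 0;
static frame_t pending;            /* one pending, possibly-coalesced frame */
static int     have_pending = 0;

static void try_send(int peer, int cid, int tag, int nbytes)
{
    if (hw_count < QUEUE_DEPTH) {
        /* Room in the hardware send queue: post the message immediately */
        hw_queue[hw_count++] = (frame_t){ peer, cid, tag, nbytes };
        printf("posted a %d-byte frame\n", nbytes);
    } else if (have_pending && pending.peer == peer &&
               pending.cid == cid && pending.tag == tag) {
        /* Stalled, and the MPI signature matches: coalesce the payload */
        pending.nbytes += nbytes;
        printf("coalesced; pending frame is now %d bytes\n", pending.nbytes);
    } else {
        /* Stalled, but a different signature: start a new pending frame */
        pending = (frame_t){ peer, cid, tag, nbytes };
        have_pending = 1;
        printf("started a new pending frame of %d bytes\n", nbytes);
    }
}

int main(void)
{
    /* Four short sends with the same signature: the first two fit in the
       (toy) queue; the rest get coalesced into a single pending frame. */
    for (int n = 0; n < 4; ++n)
        try_send(1 /* peer */, 0 /* cid */, 42 /* tag */, 32 /* bytes */);
    return 0;
}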
Yes, it makes sense now!
For example, let’s say you have this sequence of 4 messages:
MPI_Send(&i, 1, MPI_INT, peer, tag, comm);
MPI_Send(&j, 1, MPI_INT, peer, tag, comm);
MPI_Send(&i, 1, MPI_INT, peer, tag, comm);
MPI_Send(&j, 1, MPI_INT, peer, tag, comm);
The way you explained the implementation makes me think that most likely the first message is dispatched immediately (assuming that the send queue only accepts 1 message) and the next 3 will be coalesced together and sent out.
Therefore this runtime solution does not solve the problem *completely*; if the developer had coalesced the sends in the source code, performance would probably increase (also considering all the points you discussed in the post).
Thanks again.
Correct — in that case, I’d say that the app developer probably should have coalesced the 4 messages into 1.
But just to clarify your point: hardware send queues are typically fairly deep, capable of holding thousands of pending sends. So the coalescing code in Open MPI probably won’t be triggered by just 4 sends. It’ll typically be triggered by ping-pong benchmarks (yet another reason benchmarks are not good indicators of real performance!) and other sending-many-thousands-of-sends-at-a-time types of codes.