This question is inspired by the fact that the “count” parameter to MPI_SEND and MPI_RECV (and friends) is an “int” in C, which is typically a signed 4-byte integer, meaning that its largest positive value is 2^31 - 1, or about 2 billion.
However, this is the wrong question.
The right question is: can MPI send and receive messages with more than 2 billion elements?
The answer is: yes.
I’ve written about this question before (see here and here), but I’m writing again because people keep asking me. I’m putting a hopefully-google-able title on this post so that it can be found.
You may have heard that MPI-3.0 defined a new “MPI_Count” type that is larger than an “int”, and is therefore not susceptible to the “2 billion limitation.”
That’s true.
But MPI-3.0 didn’t change the type of the “count” parameter in MPI_SEND (and friends) from “int” to “MPI_Count”. That would have been a total backwards compatibility nightmare.
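Instead, MPI-3.0 added a small set of new functions that take or return MPI_Count, such as the datatype query routines. A minimal sketch, where my_huge_type stands in for some previously committed datatype of your own:

MPI_Count type_size;
/* MPI_Type_size_x (new in MPI-3.0) reports a datatype's size in bytes
   as an MPI_Count, which can exceed 2^31 */
MPI_Type_size_x(my_huge_type, &type_size);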
First, let me describe the two common workarounds to get around the “2 billion limitation”:
1. If an application wants to send 8 billion elements to a peer, just send multiple messages: for example, four messages of 2 billion elements each. The overhead of the three extra messages is so vanishingly small compared to the cost of moving that much data that it effectively doesn’t matter (see the sketch after this list).
2. Use MPI_TYPE_CONTIGUOUS to create a datatype composed of multiple elements, and then send multiples of that datatype, creating a multiplicative effect. For example:
MPI_Type_contiguous(2000000000, MPI_INT, &two_billion_ints_type);
MPI_Type_commit(&two_billion_ints_type);
/* Send 4 billion integers */
MPI_Send(buf, 2, two_billion_ints_type, …);
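As a concrete illustration of workaround #1, here is a minimal sketch of the chunked-send loop (assuming buf is an int* and that dest and tag are defined elsewhere; the peer posts matching receives, and error checking is omitted):

/* Send 8 billion ints as a sequence of smaller messages */
size_t total = 8000000000;
size_t chunk = 1000000000;  /* comfortably below 2^31 - 1 */
for (size_t offset = 0; offset < total; offset += chunk) {
    size_t n = (total - offset < chunk) ? (total - offset) : chunk;
    MPI_Send(buf + offset, (int) n, MPI_INT, dest, tag, MPI_COMM_WORLD);
}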
These workarounds aren’t universally applicable, however.
The most often cited case where these workarounds don’t apply is MPI-IO: libraries built on MPI-IO need to know exactly how many bytes were written to disk, not how many datatypes (e.g., check the MPI_Status output from an MPI_FILE_IWRITE call).
For example, a write may put only 17 bytes on disk. That would be 4.25 MPI_INTs (assuming sizeof(int) == 4): no whole number of datatypes can describe it.
Ouch.
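To make that concrete, here is a minimal sketch of the status queries involved (fh, buf, and a committed datatype big_type are assumed to exist; error checking is omitted):

MPI_Request req;
MPI_Status status;
MPI_File_iwrite(fh, buf, 1, big_type, &req);
MPI_Wait(&req, &status);

/* MPI_Get_count reports whole datatypes; a partial write makes it
   return MPI_UNDEFINED */
int count;
MPI_Get_count(&status, big_type, &count);

/* MPI-3.0's MPI_Get_elements_x reports basic elements as an MPI_Count,
   so very large transfers can be represented exactly */
MPI_Count nelems;
MPI_Get_elements_x(&status, big_type, &nelems);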
See these two other blog posts for more details, including how the MPI_Count type introduced in MPI-3.0 is used to solve these kinds of issues:
Having tried to implement essentially all of the communication functions with large-count support, I must say that this post makes it sound much easier than it actually is. Send-recv is easy, but many of the collectives are not. Or try supporting counts > 2^31 without resorting to functions like MPI_TYPE_CREATE_STRUCT: if the count is prime, it’s impossible.
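For the curious, the struct-based construction looks roughly like this (a minimal sketch in the spirit of BigMPI; the helper name type_contiguous_x is mine, and error handling is omitted):

#include <limits.h>
#include <mpi.h>

/* Describe "count" elements of oldtype, where count may exceed INT_MAX */
static void type_contiguous_x(MPI_Count count, MPI_Datatype oldtype,
                              MPI_Datatype *newtype)
{
    int c = (int) (count / INT_MAX);  /* number of full-size chunks */
    int r = (int) (count % INT_MAX);  /* leftover elements */

    MPI_Datatype chunks, remainder;
    MPI_Type_vector(c, INT_MAX, INT_MAX, oldtype, &chunks);
    MPI_Type_contiguous(r, oldtype, &remainder);

    MPI_Aint lb, extent;
    MPI_Type_get_extent(oldtype, &lb, &extent);

    int blocklens[2] = { 1, 1 };
    MPI_Aint disps[2] = { 0, (MPI_Aint) c * INT_MAX * extent };
    MPI_Datatype types[2] = { chunks, remainder };
    MPI_Type_create_struct(2, blocklens, disps, types, newtype);

    MPI_Type_free(&chunks);
    MPI_Type_free(&remainder);
}

The caller then commits *newtype and passes it to MPI_Send (and friends) with a count of 1.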
And the multiple-message approach is not suitable for nonblocking operations, or for some collectives, if you want a drop-in replacement. Using MPI_IN_PLACE in large-count reductions appears to be very difficult.
Please see https://github.com/jeffhammond/BigMPI for my code. Be warned that it is a work in progress, and untested functions may not work.
Note also that BigMPI has motivated multiple tickets for the MPI Forum. Links to these tickets can be found at the bottom of the BigMPI home page.
Great! It’ll be interesting to see how your two proposed tickets proceed.