Jeff Hammond has recently started developing the BigMPI library.
BigMPI is intended to handle all the drudgery of sending and receiving large messages in MPI.
In Jeff’s own words:
[BigMPI is an] Interface to MPI for large messages, i.e. those where the count argument exceeds INT_MAX but is still less than SIZE_MAX. BigMPI is designed for the common case where one has a 64b address space and is unable to do MPI communication on more than 2^31 elements despite having sufficient memory to allocate such buffers.
“What’s the big deal?” you might ask yourself. “Even though the count arguments in MPI functions are ints, the MPI Forum has been telling us for years that you can just use MPI datatypes to send and receive messages of more than 2^31 bytes.”
It turns out that that only sorta works.
For example, one solution for sending huge messages is to chop them up and send them as multiple messages.
…but what about non-blocking operations? If you’ve done multiple sends/receives, you now have multiple requests to handle (vs. just one request from a single send/receive).
This is not a deal-breaker, but it is somewhat annoying.
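For concreteness, here is a minimal sketch (in C, with made-up buffer, peer, and tag names) of what the chop-it-up approach looks like for a non-blocking send of more than 2^31 ints. Note the bookkeeping: one request per chunk.

```c
#include <mpi.h>
#include <limits.h>
#include <stdlib.h>

/* Send "total" ints (possibly > INT_MAX of them) to "dest" by splitting the
 * buffer into chunks that each fit in an int count argument. */
static void isend_big(const int *buf, size_t total, int dest, MPI_Comm comm)
{
    size_t nchunks = (total + (size_t)INT_MAX - 1) / (size_t)INT_MAX;
    MPI_Request *reqs = malloc(nchunks * sizeof(MPI_Request));

    size_t offset = 0;
    for (size_t i = 0; i < nchunks; ++i) {
        size_t remaining = total - offset;
        int count = remaining > (size_t)INT_MAX ? INT_MAX : (int)remaining;
        MPI_Isend(buf + offset, count, MPI_INT, dest, (int)i /* tag */, comm,
                  &reqs[i]);
        offset += (size_t)count;
    }

    /* The annoying part: N requests to track instead of one. */
    MPI_Waitall((int)nchunks, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}
```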
Another solution is to use MPI datatypes (as mentioned above). Instead of sending six billion integers, make a datatype of one billion integers and send six of those.
That generally works well, because the overhead of creating/destroying an additional MPI datatype is dwarfed by the time to actually send (6B*sizeof(int)) bytes across a network (even on ludicrous speed networks).
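As a sketch (again in C, with illustrative names), the datatype trick for the six-billion-int example might look like this:

```c
#include <mpi.h>

#define BILLION 1000000000

/* Send six billion ints as six elements of a one-billion-int datatype, so
 * the int count argument (6) stays well under INT_MAX. */
static void send_6b_ints(const int *buf, int dest, MPI_Comm comm)
{
    MPI_Datatype billion_ints;
    MPI_Type_contiguous(BILLION, MPI_INT, &billion_ints);
    MPI_Type_commit(&billion_ints);

    MPI_Send(buf, 6, billion_ints, dest, 0 /* tag */, comm);

    MPI_Type_free(&billion_ints);
}
```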
…but what about reductions?
MPI’s built-in reductions, while usually nicely optimized in MPI implementations, are only defined for a subset of intrinsic datatypes. They are specifically not defined for user-defined MPI datatypes — not even contiguous ones.
This means that all reductions must use a custom, user-defined reduction operator that must be paired with the MPI datatype used for the large send.
Ouch. That’s a major bummer.
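To make the problem concrete, here is a rough sketch (my own illustration, not BigMPI’s code) of what you would have to write by hand: a user-defined operator that knows how many real elements are hiding inside each datatype “element.”

```c
#include <mpi.h>

#define BILLION 1000000000

/* User-defined reduction: each "element" seen by the op is actually a
 * contiguous block of one billion ints, so sum the whole block. */
static void sum_billion_ints(void *invec, void *inoutvec, int *len,
                             MPI_Datatype *datatype)
{
    const int *in = invec;
    int *inout = inoutvec;
    for (int e = 0; e < *len; ++e)
        for (long long i = 0; i < BILLION; ++i)
            inout[(long long)e * BILLION + i] += in[(long long)e * BILLION + i];
}

static void allreduce_6b_ints(const int *sendbuf, int *recvbuf, MPI_Comm comm)
{
    MPI_Datatype billion_ints;
    MPI_Type_contiguous(BILLION, MPI_INT, &billion_ints);
    MPI_Type_commit(&billion_ints);

    /* MPI_SUM is not valid with a user-defined datatype; a custom op is
     * required, and it must match the datatype's layout. */
    MPI_Op sum_op;
    MPI_Op_create(sum_billion_ints, 1 /* commutative */, &sum_op);

    MPI_Allreduce(sendbuf, recvbuf, 6, billion_ints, sum_op, comm);

    MPI_Op_free(&sum_op);
    MPI_Type_free(&billion_ints);
}
```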
BigMPI handles huge-count reductions for you. You won’t receive all the optimizations that the internal MPI implementation has (because BigMPI can’t use the MPI-implementation-provided reduction operators and has to provide its own), but MPI implementations may not be well-tuned for gigantic reduction operations, anyway.
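For comparison, calling BigMPI is roughly this simple. The function name and signature below reflect my recollection of BigMPI’s convention of mirroring MPI calls with an “_x” suffix and an MPI_Count count argument; check the BigMPI README for the authoritative interface.

```c
#include <mpi.h>
#include <bigmpi.h>

/* Reduce n ints, where n may exceed INT_MAX; BigMPI does the chunking and
 * datatype gymnastics internally.  Signature assumed, not verified. */
static void allreduce_big(const int *sendbuf, int *recvbuf, MPI_Count n,
                          MPI_Comm comm)
{
    MPIX_Allreduce_x(sendbuf, recvbuf, n, MPI_INT, MPI_SUM, comm);
}
```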
There is a light at the end of this tunnel, however.
Because of this work, Jeff has proposed some new features to the MPI Forum to ease the pain of using big data in MPI applications. These likely won’t happen until MPI-4.0, but it means that there is hope ahead.
“(6*sizeof(int))” is clearly a typo, because sending 24 bytes takes about 2 microseconds on a decent HPC interconnect.
Note that BigMPI can exploit optimized reductions for blocking collectives; it’s just not the default. You just need to configure with --enable-reduce-pipelining. I will eventually generalize this sort of thing to be a runtime option.
Good to know! I didn’t see that in the README.md. 🙂