
I’ve mentioned it before: the run-time systems of MPI implementations are frequently unsung heroes.

A lot of blood, sweat, tears, and innovation goes into parallel run-time systems, particularly those that can scale to very large systems.  But they’re not discussed often, mainly because they’re not as sexy as ultra-low-latency numbers or other popular MPI benchmarks.

Here’s one cool thing that we added to the Open MPI runtime a few years ago and have continued to improve over the years (including pretty pictures!).

In the 1990s, when clusters of Linux servers were a new concept, the only way to launch MPI processes on remote servers was via ssh (rsh was used for a while, but it eventually mostly died out).

While job schedulers and cluster resource managers tend to offer fast MPI/parallel job startup these days, a surprising number of users still use ssh-backed job startup mechanisms.  There are a number of good reasons for this, but we’ll explore them another time.

Let’s take a step back and look at what a job launcher does.

Conceptually, parallel job launchers are simple: loop over the target processes, starting each one on its target machine.  Keeping with the ssh theme, the figure below shows this model using individual ssh connections:

Launch by using ssh to start each MPI process.
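In code, the naive model boils down to a loop of fork/exec’ing ssh, one connection per target process.  Here’s a minimal sketch of that idea (my own illustration, not Open MPI’s actual source; the hostnames and "./my_mpi_app" are made up):

```c
/*
 * Minimal sketch of the serial launch model (illustrative only; not
 * Open MPI's actual source).  The hostnames and "./my_mpi_app" are
 * hypothetical.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

#define NPROCS 4

int main(void)
{
    const char *hosts[NPROCS] = { "node01", "node02", "node03", "node04" };
    pid_t pids[NPROCS];

    /* Loop over each target process: one ssh connection apiece.
     * All NPROCS connection setups are paid for by this one process. */
    for (int i = 0; i < NPROCS; ++i) {
        pids[i] = fork();
        if (pids[i] == 0) {
            /* Child: become the ssh session that runs the remote app */
            execlp("ssh", "ssh", hosts[i], "./my_mpi_app", (char *) NULL);
            perror("execlp");  /* only reached if exec fails */
            _exit(1);
        }
    }

    /* Wait for all of the remote processes to finish */
    for (int i = 0; i < NPROCS; ++i) {
        waitpid(pids[i], NULL, 0);
    }
    return 0;
}
```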

An obvious optimization (one that Open MPI has done since its inception) is to connect to each target machine only once, and then launch all of the desired target processes from that single connection:

Launch by ssh'ing once to each server, and launching all MPI processes from there.

(NOTE: the above figure is a bit simplified: mpirun actually launches a proxy daemon on each node; the daemon then forks each of the target MPI processes).
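If you’re curious, the core job of such a proxy daemon looks something like the hedged sketch below (Open MPI’s real daemon does far more than this; the app name and the MY_LOCAL_RANK environment variable are hypothetical):

```c
/*
 * Illustrative sketch of a per-node proxy daemon's core job: fork and
 * exec one child per local MPI process.  Open MPI's real daemon does
 * much more; "my_mpi_app" and MY_LOCAL_RANK are hypothetical names.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

static void launch_local_procs(int nlocal)
{
    for (int lrank = 0; lrank < nlocal; ++lrank) {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: tell the app which local rank it is, then become it */
            char buf[16];
            snprintf(buf, sizeof(buf), "%d", lrank);
            setenv("MY_LOCAL_RANK", buf, 1);  /* hypothetical variable */
            execlp("./my_mpi_app", "my_mpi_app", (char *) NULL);
            perror("execlp");
            _exit(1);
        }
    }

    /* Reap the children as they terminate */
    while (wait(NULL) > 0) {
        continue;
    }
}

int main(void)
{
    launch_local_procs(4);  /* e.g., 4 MPI processes on this node */
    return 0;
}
```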

As your parallel application grows in the number of servers it uses, such a serial launch mechanism becomes an obvious bottleneck.

It therefore makes sense to parallelize the launcher with a tree-based launch structure: have the job initiator (shown as “mpirun” in each figure) be the root of a tree, and let each server that has been launched upon launch on further servers in turn.  The inherent parallelization speeds up the overall launch from O(N) to O(log N):

Launch in parallel by allowing tree-based propagation of simultaneous ssh sessions to launch MPI processes.
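Here’s a toy illustration of the routing math behind such a tree (again, my own sketch, not Open MPI’s actual routing code): with a fan-out of k, node i is responsible for launching nodes k*i+1 through k*i+k, so the whole launch completes in roughly log_k(N) rounds of simultaneous ssh sessions:

```c
/*
 * Toy illustration of k-ary tree routing (not Open MPI's actual
 * routing code): node i launches nodes k*i+1 through k*i+k.  The tree
 * is only about log_k(N) levels deep, so the launch takes O(log N)
 * sequential "rounds" of ssh instead of O(N).
 */
#include <stdio.h>

#define FANOUT 2   /* each node launches two more nodes */
#define NNODES 15  /* hypothetical cluster size */

int main(void)
{
    for (int id = 0; id < NNODES; ++id) {
        printf("node %2d launches:", id);
        for (int c = 1; c <= FANOUT; ++c) {
            int child = FANOUT * id + c;
            if (child < NNODES) {
                printf(" node %d", child);
            }
        }
        printf("\n");
    }
    return 0;
}
```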

Schweet!

Open MPI debuted a tree-based ssh launcher back in the v1.3 series (circa 2009).  The first-generation tree-based launcher used a binomial tree; this shape effectively amortized the high cost of creating ssh connections.
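In a binomial tree, the number of active launchers doubles every round: node 0 launches node 1; then nodes 0 and 1 launch nodes 2 and 3; and so on.  Here’s a tiny sketch of that schedule (illustrative only, not Open MPI’s code):

```c
/*
 * Illustrative binomial tree launch schedule (not Open MPI's actual
 * code).  In round k, every already-launched node r launches node
 * r + 2^k, so the number of busy ssh connections doubles each round,
 * covering N nodes in ceil(log2(N)) rounds.
 */
#include <stdio.h>

#define NNODES 13  /* hypothetical cluster size */

int main(void)
{
    for (int round = 0; (1 << round) < NNODES; ++round) {
        printf("round %d:", round);
        for (int r = 0; r < (1 << round); ++r) {
            int child = r + (1 << round);
            if (child < NNODES) {
                printf("  %d -> %d", r, child);
            }
        }
        printf("\n");
    }
    return 0;
}
```

Every connection that has already been paid for stays busy launching in each subsequent round; that’s where the amortization comes from.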

Note that the tree-based ssh structure necessitates setting up password-less/passphrase-less ssh logins between each pair of servers in the HPC cluster.  If you use the same ssh keys on every server, this is trivial to set up.  If you use different ssh keys on each server, it’s a little more work.

That being said, Open MPI allows users to disable the tree-based launch and use the linear ssh launcher, if desired.
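For example, in the v1.x series the switch is an MCA parameter; if memory serves, something like the following falls back to the linear launcher (check ompi_info for the exact parameter name in your version):

```
shell$ mpirun --mca plm_rsh_no_tree_spawn 1 -np 64 ./my_mpi_app
```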

This blog entry is getting a bit long, so stay tuned: I’ll describe a few more fun things about the Open MPI ssh tree-based launching system in the next entry…