MPI (Message Passing Interface) is a standard describing a set of subroutines that can be used to exchange data between processes (computer programs).

Synopsis

In this introduction you will learn the basics of MPI, enough to judge whether it could be of use to you. Knowledge of C is a prerequisite for this introduction; MPI can, however, also be used from Fortran and Fortran90.

MPI is intended to be used in the following cases:

  • Your (serial) computer program is too slow, and its performance could be improved by parallelization.
  • Your computer program needs more memory than is available on one processor.

By using MPI a programmer is able to create a 'parallel' computer program: a program that uses the resources of more than one processor to accomplish the task.

This tutorial consists of a number of example programs that contain comments. You are encouraged to try these examples on a system on which MPI is available.

Probably it is best to see which example resembles most closely what you need, and to study its code in some depth.

First MPI program

DOWNLOAD First example

This example demonstrates:

  • Initialization of the MPI environment.
  • Querying the total number of processes.
  • Querying the rank of 'this' process.
  • Calculating the sum of the first 60 integers in parallel.
  • The use of MPI_Reduce.
  • The use of MPI_Bcast.
  • The use of MPI_Allreduce.

There is one important fact that a programmer should really be acquainted with:

  • An MPI program consists of a number of processes, running at the same time, usually on different processors. For the MPI programmer there is only one difference between the processes: the rank (a unique integer). Each process has its own rank, and the ranks range from 0 to the number of processes - 1. The processes run asynchronously and are not required to execute the same statements or even the same code. For a meaningful MPI program, however, the processes have to follow the same pattern in their calls to MPI functions.

One more thing before we get started: MPI has the notion of 'communicator'. In almost every MPI call a communicator is one of the parameters. So, what is this communicator? At this point it suffices to know that:

  • A communicator describes a collection of processes.
  • At initialization the communicator MPI_COMM_WORLD is created for you; it contains all processes that were started when your program was launched.
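
The downloadable example itself is not reproduced here. A minimal sketch along the lines listed above could look as follows; the variable names and the way the 60 integers are divided over the processes are our own choices and need not match the original example.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int rank, size, i, n, mysum, total;

  MPI_Init(&argc, &argv);                       /* initialize the MPI environment */
  MPI_Comm_size(MPI_COMM_WORLD, &size);         /* total number of processes      */
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);         /* rank of 'this' process         */

  n = 60;
  MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD); /* process #0 broadcasts n        */

  /* each process sums its own share of the integers 1..n */
  mysum = 0;
  for (i = rank + 1; i <= n; i += size)
    mysum += i;

  /* add the partial sums; only process #0 receives the result */
  MPI_Reduce(&mysum, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0)
    printf("process %d: the sum of 1..%d is %d\n", rank, n, total);

  /* the same, but now every process receives the result */
  MPI_Allreduce(&mysum, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
  printf("process %d: the sum of 1..%d is %d\n", rank, n, total);

  MPI_Finalize();
  return 0;
}

On most installations such a program is compiled with mpicc and started with mpirun or mpiexec, for example: mpirun -np 4 ./a.out.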

A few remarks about the previous program:

  • All processes execute the same program. This is not required, but most of the time an MPI program is written this way: it is in general easier to maintain one program than a collection of programs.
  • Although the processes execute the same program, they do not execute the same statements. In this example, process #0 has a special role. Giving process #0 a special role makes sense, because there is always a process #0, even when MPI is started with only one process.
  • For message passing only 'collective' functions are used: MPI_Bcast, MPI_Reduce and MPI_Allreduce. The use of collective functions is recommended over functions like MPI_Send and MPI_Recv. However, some kinds of programs lend themselves to the use of non-collective functions; we will give an example later on.
  • The output of the program is probably not what you expected. In general, the lines originating from different processes will be mixed up. However, if you look at the lines of, for example, process #2, you will see that they appear in the expected order.

Collective functions

Above we have seen a few collective MPI functions; here we give a more comprehensive list. A, B, C and D denote arrays of data or a single datum; P0 to P3 denote processes.

 P0   A                            A              P0
 P1                 MPI_Bcast      A              P1
 P2                 ---------->    A              P2
 P3                                A              P3

 P0   A                            A  B  C  D     P0
 P1   B             MPI_Gather                    P1
 P2   C             ---------->                   P2
 P3   D                                           P3

 P0   A  B  C  D                   A              P0
 P1                 MPI_Scatter    B              P1
 P2                 ---------->    C              P2
 P3                                D              P3

 P0   A                            A  B  C  D     P0
 P1   B             MPI_Allgather  A  B  C  D     P1
 P2   C             ---------->    A  B  C  D     P2
 P3   D                            A  B  C  D     P3

 P0   A0 A1 A2 A3                  A0 B0 C0 D0    P0
 P1   B0 B1 B2 B3   MPI_Alltoall   A1 B1 C1 D1    P1
 P2   C0 C1 C2 C3   ---------->    A2 B2 C2 D2    P2
 P3   D0 D1 D2 D3                  A3 B3 C3 D3    P3


The table above is inspired by the book 'Using MPI' by William Gropp et al.

The functions above (except MPI_Bcast) are also available in a form in which one can specify that the arrays have different lengths on the different processes: MPI_Gatherv, MPI_Scatterv, MPI_Allgatherv and MPI_Alltoallv.
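
As an indication of how these functions are called in practice, here is a small sketch of our own (not one of the downloadable examples): every process contributes one integer, process #0 gathers them into an array and scatters them back.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int rank, size, mine, back;
  int all[64];                              /* room for up to 64 processes in this sketch */

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (size > 64) MPI_Abort(MPI_COMM_WORLD, 1);

  mine = 100 + rank;                        /* the A, B, C, D of the table above */

  /* gather one integer from every process on process #0 */
  MPI_Gather(&mine, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);

  /* scatter them again: every process gets its own value back */
  MPI_Scatter(all, 1, MPI_INT, &back, 1, MPI_INT, 0, MPI_COMM_WORLD);

  printf("process %d: sent %d, got back %d\n", rank, mine, back);

  MPI_Finalize();
  return 0;
}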

Example of matrix transpose

Suppose we have a square (128x128) matrix that is divided among 4 processes, in such a way that process 0 (P0) owns columns 0-31, P1 columns 32-63, P2 columns 64-95 and P3 columns 96-127. We want to transpose this matrix.

How do we transpose it? Looking at the table above, MPI_Alltoall seems the obvious candidate.

DOWNLOAD Transpose example
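
The downloadable transpose example is not reproduced here; the following is a sketch of our own along the same lines. It assumes exactly 4 processes and a 128x128 matrix of doubles, stored so that each process holds its 32 columns as a 128x32 local array.

#include <stdio.h>
#include <mpi.h>

#define N     128                 /* order of the (square) matrix              */
#define NPROC 4                   /* number of processes this sketch assumes   */
#define NCOL  (N/NPROC)           /* number of columns owned by each process   */

/* in-place transpose of one square n x n block */
static void trans(double *a, int n)
{
  int i, j;
  double t;
  for (i = 0; i < n; i++)
    for (j = i + 1; j < n; j++) {
      t = a[i*n + j]; a[i*n + j] = a[j*n + i]; a[j*n + i] = t;
    }
}

int main(int argc, char **argv)
{
  int rank, size, p, i, j;
  static double a[N][NCOL], b[N][NCOL];   /* the local columns, before and after */

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (size != NPROC)                      /* wrong number of processes: give up  */
    MPI_Abort(MPI_COMM_WORLD, 1);

  /* fill the local columns: a[i][j] holds global element (i, rank*NCOL+j) */
  for (i = 0; i < N; i++)
    for (j = 0; j < NCOL; j++)
      a[i][j] = i*N + rank*NCOL + j;

  /* exchange the NCOL x NCOL row-blocks: block p of 'a' goes to process p */
  MPI_Alltoall(a, NCOL*NCOL, MPI_DOUBLE, b, NCOL*NCOL, MPI_DOUBLE, MPI_COMM_WORLD);

  /* each received block still has to be transposed locally */
  for (p = 0; p < NPROC; p++)
    trans(&b[p*NCOL][0], NCOL);

  /* check: b[i][j] should now hold global element (rank*NCOL+j, i) */
  for (i = 0; i < N; i++)
    for (j = 0; j < NCOL; j++)
      if (b[i][j] != (double)((rank*NCOL + j)*N + i))
        MPI_Abort(MPI_COMM_WORLD, 1);

  if (rank == 0) printf("transpose checked: OK\n");

  MPI_Finalize();
  return 0;
}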

Notes on the above program:

  • This is a simplified program; in practice one should also take care of rectangular matrices. This mainly concerns the function trans(): an in-place transpose of a rectangular matrix is somewhat more complex.
  • When the number of processes does not evenly divide the size of the matrix, MPI_Alltoallv must be used. This function uses extra information from the user to determine the size of the blocks of data to send and receive.
  • We see an example of the use of MPI_Abort(): this function can be called when one of the processes wants to abort the whole program, without being able to notify the other processes that they should stop. A softer end of the program could have been coded: for example each process could count the number of wrong elements. After an MPI_Allreduce of this count, with MPI_MAX as reduction operator, the processes could call MPI_Finalize() if the result is larger than zero.
  • The number of lines to actually perform the transpose is very small, so debugging is simple. Moreover, by using MPI_Alltoall and trusting the quality of the MPI library, one can be assured that an optimal strategy is used for the MPI part of the transpose.
  • The partitioning of the matrix is done by dividing the second dimension. Often people choose to partition the first dimension, but then one gets into trouble when using MPI_Alltoall(): this function wants the data blocks to be contiguous in memory, which is the case in this example. Dividing the matrix along the first dimension leaves gaps between the rows of a submatrix (size 32x32). One could overcome this by using derived datatypes, but that would complicate the code and make it less efficient. (With derived datatypes one can specify exactly the layout of a block of data in memory; a small sketch follows below.)
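
For the curious, here is a fragment of our own (not used in the example) showing how such a non-contiguous 32x32 block out of a row-major 32x128 local array could be described with a derived datatype:

MPI_Datatype blocktype;

/* 32 blocks of 32 doubles each; consecutive blocks start 128 doubles apart */
MPI_Type_vector(32, 32, 128, MPI_DOUBLE, &blocktype);
MPI_Type_commit(&blocktype);
/* ... blocktype can now be used in sends and receives ... */
MPI_Type_free(&blocktype);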

Yet another useful function: MPI_Sendrecv()

This function is used when a process wants to send data to one process and at the same time receive data from another process.

These two other processes can be different, so when every process receives from its left neighbour and sends to its right neighbour, a shift operation is effectively realized. For the left-most and right-most processes something special has to be done: they can be connected to each other, in which case a ring is formed, or they can specify that they receive from nobody or send to nobody, respectively (the special rank MPI_PROC_NULL serves that purpose).

DOWNLOAD Send-Receive example
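
The downloadable example is not reproduced here; a minimal sketch of our own of such a shift, with the outermost processes connected so that a ring is formed, could look like this (the tag value 12 is arbitrary):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int rank, size, left, right, mine, fromleft;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  left  = (rank - 1 + size) % size;    /* neighbours in a ring; use            */
  right = (rank + 1) % size;           /* MPI_PROC_NULL here for an open chain */

  mine = 100 + rank;

  /* send to the right neighbour, receive from the left neighbour */
  MPI_Sendrecv(&mine, 1, MPI_INT, right, 12,
               &fromleft, 1, MPI_INT, left, 12,
               MPI_COMM_WORLD, &status);

  printf("process %d received %d from process %d\n", rank, fromleft, left);

  MPI_Finalize();
  return 0;
}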

Notes:

  • tag: this is a small number. Something that is sent with a tag of, say, 12 will only be received by a process that is receiving with a tag of 12. Note that there are two tags: a send tag and a receive tag. Sometimes it is possible to have more than one send and receive outstanding at the same time by using tags carefully. If one wants to receive a message regardless of its tag, use MPI_ANY_TAG.
  • status: this is a status indicator; it is filled in by the receive. The status contains information about the message that has been received, for example its source and its tag. In this example we don't use it.
  • The concept of sending to and receiving from neighbours is so commonly used that MPI has special facilities for handling two or more dimensions. These facilities are typically used in calculations on regular grids. Look for MPI_Cart_create and related functions: there you will find how to make special communicators that automatically take care of the neighbours to the left, right, above, below and so on; a small sketch is given below.
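
As an indication (again a sketch of our own, not part of the examples), a two-dimensional periodic process grid with automatically determined neighbours could be set up like this:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int rank, size;
  int dims[2] = {0, 0};            /* let MPI choose the grid dimensions    */
  int periods[2] = {1, 1};         /* periodic in both directions (a torus) */
  int left, right, down, up;
  MPI_Comm grid;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  MPI_Dims_create(size, 2, dims);                       /* e.g. 4 -> 2 x 2  */
  MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);
  MPI_Comm_rank(grid, &rank);

  MPI_Cart_shift(grid, 0, 1, &left, &right);   /* neighbours in dimension 0 */
  MPI_Cart_shift(grid, 1, 1, &down, &up);      /* neighbours in dimension 1 */

  printf("process %d: left %d right %d down %d up %d\n",
         rank, left, right, down, up);

  MPI_Finalize();
  return 0;
}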

Plain send and receive

When the message passing part of your program can be expressed in a natural way using collective MPI functions, then that is the way to go. Sometimes, however, this is not possible. In those cases one uses plain MPI_Send() and MPI_Recv(), or varieties thereof that are beyond the scope of this tutorial. In MPI_Send() one specifies the address of the data, the amount and type of the data, a tag (see above) and the rank of the process that has to receive the data. The receiving process specifies the address of the buffer that is to receive the data, the amount and type of the data, a tag, the rank of the process that has to send the data and a status field. If a process wants to receive from anybody, it can use MPI_ANY_SOURCE for the rank.

Using MPI_Send and MPI_Recv one must be very careful not to create deadlock situations or race conditions. As a rule of thumb: only do an MPI_Send if it is certain that the process that has to receive the data is not busy with another MPI operation than the corresponding MPI_Recv. For example: when two processes (0 and 1) want to exchange data, the following is wrong (despite the fact that the code will often run, because MPI will buffer some data):

int other,rank;
 ...
if (rank == 0) other = 1; 
if (rank == 1) other = 0;

MPI_Send(... to other ...);
MPI_Recv(... from other ...);


Depending on, amongst other things, the amount of data involved, a deadlock can arise because both processes have to finish their MPI_Send before the MPI_Recv can be done. (Even with the non-blocking MPI_Isend() there is no guarantee that the send will complete if there is no corresponding receive.)

One should use an MPI_Sendrecv in this case, or code like this:

if (rank == 0) {
  MPI_Send( ... to 1 ...);
  MPI_Recv( ... from 1 ...);
}
if (rank == 1) {
  MPI_Recv( ... from 0 ...);
  MPI_Send( ... to 0 ... );
}


If there are no other pending MPI functions, process 0 can be sure that at some time there will be a corresponding MPI_Recv for its MPI_Send, and vice versa.
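
Written out as a complete program, the exchange sketched above could look like this (a sketch of our own; run it with at least two processes). The same exchange could of course also be written with a single MPI_Sendrecv per process.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int rank, size, mine, other;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (size < 2) MPI_Abort(MPI_COMM_WORLD, 1);  /* this sketch needs 2 processes */

  mine = 100 + rank;                  /* the value this process contributes */

  if (rank == 0) {                    /* process 0 sends first, then receives */
    MPI_Send(&mine, 1, MPI_INT, 1, 12, MPI_COMM_WORLD);
    MPI_Recv(&other, 1, MPI_INT, 1, 12, MPI_COMM_WORLD, &status);
  }
  if (rank == 1) {                    /* process 1 receives first, then sends */
    MPI_Recv(&other, 1, MPI_INT, 0, 12, MPI_COMM_WORLD, &status);
    MPI_Send(&mine, 1, MPI_INT, 0, 12, MPI_COMM_WORLD);
  }

  if (rank < 2)
    printf("process %d received %d\n", rank, other);

  MPI_Finalize();
  return 0;
}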

Master-slave paradigm

There is a class of programs where MPI_Send and MPI_Recv are the natural choice: the master-slave paradigm, also called 'farming'. In short: the slaves do the work, and the master distributes the work and collects the results. The next program is an artificial example of this method. The objective is to calculate the sum of the first N integers.

Each slave sends a message to the master, whereupon the master sends two numbers N and M to the slave: the slave is to compute the sum of M integers, starting at N. After receiving this answer the master sends another N and M to the slave.

The trickiest part of this program is telling everybody that there is nothing left to be done, so that everybody can call MPI_Finalize and exit.

Finally, before exiting, each slave sends some statistics to the master.

DOWNLOAD Master-Slave example
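
The downloadable example is not reproduced here. The following sketch of our own follows the description above; the tag values and the chunk size are arbitrary choices, and the 'statistics' are limited to the number of tasks each slave handled. Run it with at least two processes.

#include <stdio.h>
#include <mpi.h>

#define NTOTAL  60                /* compute the sum of 1 .. NTOTAL          */
#define CHUNK    7                /* number of integers per task (M)         */
#define WORKTAG  1                /* this message contains work or a result  */
#define STOPTAG  2                /* no more work: the slave should stop     */
#define STATTAG  3                /* this message contains slave statistics  */

int main(int argc, char **argv)
{
  int rank, size;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {                              /* the master */
    long total = 0, partial;
    int  task[2] = {0, 0}, next = 1, nstopped = 0, ntasks, slave;

    while (nstopped < size - 1) {
      /* a slave reports in; its message carries the previous result (or 0) */
      MPI_Recv(&partial, 1, MPI_LONG, MPI_ANY_SOURCE, MPI_ANY_TAG,
               MPI_COMM_WORLD, &status);
      slave  = status.MPI_SOURCE;
      total += partial;
      if (next <= NTOTAL) {                     /* more work: send N and M   */
        task[0] = next;
        task[1] = (NTOTAL - next + 1 < CHUNK) ? NTOTAL - next + 1 : CHUNK;
        next   += task[1];
        MPI_Send(task, 2, MPI_INT, slave, WORKTAG, MPI_COMM_WORLD);
      } else {                                  /* nothing left to do        */
        MPI_Send(task, 2, MPI_INT, slave, STOPTAG, MPI_COMM_WORLD);
        nstopped++;
      }
    }
    /* finally, collect the statistics of every slave */
    for (slave = 1; slave < size; slave++) {
      MPI_Recv(&ntasks, 1, MPI_INT, MPI_ANY_SOURCE, STATTAG,
               MPI_COMM_WORLD, &status);
      printf("slave %d handled %d tasks\n", status.MPI_SOURCE, ntasks);
    }
    printf("master: the sum of 1..%d is %ld\n", NTOTAL, total);
  } else {                                      /* a slave */
    long partial = 0;                           /* the first report is empty */
    int  task[2], i, ntasks = 0;

    for (;;) {
      MPI_Send(&partial, 1, MPI_LONG, 0, WORKTAG, MPI_COMM_WORLD);
      MPI_Recv(task, 2, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
      if (status.MPI_TAG == STOPTAG) break;
      partial = 0;                              /* sum M integers from N on  */
      for (i = task[0]; i < task[0] + task[1]; i++) partial += i;
      ntasks++;
    }
    MPI_Send(&ntasks, 1, MPI_INT, 0, STATTAG, MPI_COMM_WORLD);
  }

  MPI_Finalize();
  return 0;
}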

Notes

  • Despite the simplicity of the tasks to be performed, the program is rather complex, and there is the issue of how to end the program properly. This is one of the reasons why collective MPI functions are recommended: there, these problems have been solved for you.
  • This program has a nice feature: it is automatically load-balancing: fast slaves do more work than slow slaves, provided that the number of tasks to be performed is much larger than the number of slaves.
  • There are a few varieties of MPI_Send and MPI_Recv. Especially the non-blocking ones are popular. Feel free to use them, but be aware: deadlocks, livelocks and other phenomena can be expected.

This ends this MPI introduction. We hope that you now have a feeling for how MPI is used.
