All collective communication, including
that used by the communicator manipulation
functions, is internally implemented in most
MPI implementations as a set of point-to-point communication
operations. For example, an MPI_Bcast on a communicator happens
in two stages: (a) each process dynamically configures a
tree, using the total number of processes in the communicator,
with its own position in the tree determined by its rank; and (b) each
process other than the root waits for the broadcast message from its
parent using a receive statement, and every process then forwards the
message to its children, if any,
using send statements. Assume a communicator with 6 processes.
In a simple complete binary tree, process 0 would be the
root, processes 1 and 2 its children, processes 3 and 4
the children of process 1, and process 5 the child of process
2. The send and receive statements issued at each
process follow directly from the tree: process 1, for example, receives
from process 0 and sends to processes 3 and 4. A tree rooted at some
process other than 0 is
formed by creating the basic tree described above and swapping
the root with the desired root. The same tree construction algorithm
is used for all collective communication calls, including
barriers and reductions.
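
As an illustration of the tree layout described above, the following C sketch computes a process's parent and children from its rank, the communicator size, and the desired root. The heap-style numbering (children of position p at 2p+1 and 2p+2) is an assumption that reproduces the example with processes 0 through 5. The names tree_links, swap_root, and tree_neighbours are hypothetical and are not part of MPI or of any particular implementation.

```c
#include <assert.h>

/* Hypothetical helper type: a process's neighbours in the tree.
   A value of -1 means "no such neighbour". */
typedef struct {
    int parent;
    int left;
    int right;
} tree_links;

/* Exchange the roles of rank 0 and the desired root; every other rank
   keeps its place. The mapping is its own inverse, so it converts a
   rank to a tree position and a tree position back to a rank. */
static int swap_root(int x, int root) {
    if (x == root) return 0;
    if (x == 0) return root;
    return x;
}

/* Assumed layout: complete binary tree with heap-style numbering. */
tree_links tree_neighbours(int rank, int size, int root) {
    assert(size > 0 && rank >= 0 && rank < size);
    assert(root >= 0 && root < size);

    int pos = swap_root(rank, root);      /* position in the basic tree */
    int l = 2 * pos + 1;
    int r = 2 * pos + 2;

    tree_links t;
    t.parent = (pos == 0) ? -1 : swap_root((pos - 1) / 2, root);
    t.left   = (l < size) ? swap_root(l, root) : -1;
    t.right  = (r < size) ? swap_root(r, root) : -1;
    return t;
}
```

With this sketch, tree_neighbours(1, 6, 0) gives parent 0 and children 3 and 4. With root 2, ranks 0 and 2 simply trade places, so rank 0 has parent 2 and a single child, 5.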
Consequently, all that is needed to implement a collective communication call is a receive statement with exactly the same functionality as MPI_Recv and a send statement with the same functionality as any of the MPI send statements. To prevent point-to-point receives from matching messages sent as part of a collective communication operation, such messages carry a special tag that is reserved for collective operations.
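
The effect of the reserved tag can be modelled without any MPI calls. The sketch below is a single-process simulation, not code from an MPI implementation: the message struct, the match_receive function, and the value of COLLECTIVE_TAG are all hypothetical, chosen only to show that a receive posted with a user tag skips a pending collective message from the same source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reserved tag; the value is arbitrary for this sketch. */
#define COLLECTIVE_TAG (-2)

/* Hypothetical model of a message waiting in a receive queue. */
typedef struct {
    int source;
    int tag;
    int payload;
    int delivered;   /* 0 while the message is still pending */
} message;

/* Deliver the first pending message whose source and tag both match.
   Returns its index in the queue, or -1 if nothing matches. */
int match_receive(message *queue, size_t n, int source, int tag) {
    assert(queue != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!queue[i].delivered &&
            queue[i].source == source &&
            queue[i].tag == tag) {
            queue[i].delivered = 1;
            return (int)i;
        }
    }
    return -1;
}
```

If a collective message and a user message tagged 7 are both pending from process 0, a call such as match_receive(queue, n, 0, 7) returns the user message even when the collective message arrived first. The collective message stays in the queue until a receive that names COLLECTIVE_TAG claims it.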