Barrier Elimination: Barrier elimination for explicitly parallel constructs has been studied by Hatcher and Quinn[HQ91] and Philippsen and Heinz[Hei95], both in the context of shared memory machines. Given a set of intervals spanned by def-use, use-def and use-use dependences, Hatcher and Quinn use a simple greedy algorithm to deduce the minimum number of barriers needed to implement them. Philippsen and Heinz perform a more fine grained analysis: given a directed acyclic graph with data dependence, control dependence and evaluation ordering edges (for expressions), they perform a topological sort to determine the minimum number of barriers to implement it. In comparison to Hatcher and Quinn, their analysis additionally uses code restructuring to further minimize barriers, as well as minimize the storage required for intermediate results. In contrast to both the above, our work (parts of which have previously appeared in [Pra93,Pra]) is for distributed memory machines, hence barrier minimization is only one part of the problem. We have developed schemes to correctly handle remote data reads and writes and coordinate them with barrier synchronization. In addition, we have proposed schemes to weaken barrier synchronization both for static and dynamic communication patterns.
Fortran 90D/HPF: The Fortran 90D/HPF compiler described in [BCF94] essentially uses pattern matching techniques first suggested by Li and Chen[LC91] to generate communication for each statement of the FORALL construct. Global synchronization is required for each statement in which communication information is hard to extract. In comparison, we find the minimal number of global synchronizations required to implement a group of statements. Moreover, since these global synchronizations are not necessarily placed immediately before or after a statement, some communication computation overlap is achieved.
Dynamic Communication Patterns:
Most work in the area of
dataparallel languages and parallelizing
compilers concentrates on improving performance using compile-time
available information
to partition computation and/or coalesce and eliminate communication
and/or overlap
communication and computation.
When compile time information is unavailable (or
the available information is hard to analyze),
most methods e.g.
Fortran D [HKT92],
Id Nouveau [RP94], Fortran 90D/HPF [BCF94],
Crystal [LC91],
[HQ91], SUPERB [ZBG86].
generate inefficient code. Saltz et al
[SCM90] did pioneering work
in showing that it is possible to efficiently collect and
use run-time information. This is the basis of the inspector-executor
strategy, used for implementing irregular communication patterns.
The inspector executor method eliminates global synchronization
for repetitive dynamic communication patterns, by capturing
information about the participants of the communication
in the first repetition, and using this information in subsequent repetitions.
It does not work for non-repetitive dynamic communication patterns
however. In comparison,
our method for implementing reader unknown communication
patterns eliminates global synchronization in non-repetitive
dynamic communication patterns as well as repetitive patterns.
I-structures: One of the key concepts we have used in implementing reader unknown patterns is that of variables which hold read requests until definition. This idea was first explored by Arvind et al[ANP89] in the context of functional languages. They introduced I-structures as a mechanism for allowing parallel operations on large data structures, with the parallelism being dynamically extracted by allowing reads on any element of the data structure only after its definition. I-structures make the language nonfunctional, but allow the programmer to write correct programs without overspecifying their control structure. In this chapter, we have shown that communication implied by parallel assignment statements in data parallel programs can be treated in much the same way i.e. it can be dynamically extracted using write-once variables thus avoiding over-restrictive global synchronizations.
Distributed Simulation: The problem of servicing messages in order of their timestamp is similar to the problem faced in implementing reader unknown communication patterns. The conservative approach[Mis86a] is to guarantee that no message of an earlier timestamp can arrive before proceeding forward. The optimistic approach[Jef85] is to proceed forward after saving the current state, so that if a message with an earlier timestamp does arrive, it can be serviced after rolling back to the saved state. Our approach is similar to the optimistic approach, in that we save the state (in definition variables) and proceed without waiting. Rollback is never required, since the only type of requests that can arrive for definition variables are read requests. We also bound the memory used by state saving, by bounding the looseness of synchronization. Bacon and Strom in [Bac91] propose a method for optimistically parallelizing communicating sequential processes. Their method is a general method for parallelizing two pieces of code S1;S2, by guessing the values of the data needed from S1 by S2, and executing S2 in parallel with S1. Clearly, mechanisms for both state saving and rollback are required (and provided). In comparison, our method is optimized for the context of parallel assignment statements in dataparallel languages, since we do not use rollback.