next up previous
Next: Summary of Contributions Up: Introduction Previous: Motivation

Overview

This dissertation consists of two distinct parts. In the first part, we study the compilation of data parallel programs for distributed memory machines. Parallel assignment statements like the FORALL statement of HPF are among the most common statements in a data parallel program. We suggest a set of techniques for efficient implementation of these statements on distributed memory multicomputers. The two main problems in translating a data parallel and hence synchronous program with a single address space to a task parallel program with asynchronous processes and separate address spaces are (a) Data distribution and (b) Communication and synchronization insertion. We first describe how existing dependence analysis tools may be utilized to identify appropriate points in the translated code for the insertion of remote read and write accesses and barrier synchronizations. We then present a set of operator strength reduction optimizations which replace barrier synchronizations with point-to-point communications or clustered synchronizations. These optimizations eliminate barrier synchronizations and reduce blocking time only for statements that result in communication patterns which are known at compile time. Finally, for statements that specify communication patterns that are known only at run-time, we present an adaptive synchronization method that uses write-once variables to eliminate barrier synchronizations and hence reduce blocking time. In contrast to the well known inspector-executor strategy for program compilation, we show that this method can eliminate global synchronizations even for non-repetitive communication patterns. The adaptive synchronization method has been implemented on the IBM SP2. Experimental results from the implementation are presented to demonstrate its impact on the performance of a set of synthetic benchmarks. Additional results are presented from a set of simulation experiments that estimate the impact of the proposed technique on architectures with very low communication latencies.

The second and major part of the dissertation describes the design of a simulation engine for data parallel programs, and a simulation engine for message passing programs. In general, due to the presence of a large number of parallel architectures, and an even greater number of parallel languages, there is a need for tools which efficiently predict the performance of parallel programs on a variety of architectures, many of which might be unavailable to a user. Many simulation engines e.g. Proteus, Tango, RPPT, WWT and LAPSE can be used to predict parallel program performance. However, the sequential ones are too slow (for Proteus, slowdowns of 35-100 (per processor being simulated) are typical) and the parallel ones are not portable for various reasons. We show that parallel simulation can be used to efficiently predict their performance. For data parallel programs, we have developed and implemented a simulation engine called DPSIM, which exploits the deterministic nature of data parallel programs to optimize the underlying conservative simulation protocol. We present results for the validation and performance of DPSIM on the IBM-SP2 for programs including Gaussian Elimination, FFT and Matrix multiplication. When run in parallel, simulators for these programs show significant speedups as compared to when they are run sequentially. Results are also presented for the use of DPSIM to evaluate the impact of architectural characteristics like processor speed and message communication latency on the performance of the above mentioned scientific applications.

For general message passing programs, we have developed and implemented a simulation library called MPISIM. MPISIM can be used to predict the performance of existing MPI programs, with little or automatic modification, against parameters like the number of processors used and the message latency. The simulator produced using MPISIM can be executed sequentially (using multithreading) or in parallel. The simulation protocol underlying MPISIM is a combination of the null message protocol and the conditional event protocol. Several application characteristics are exploited to improve the speed of the simulation: The null message protocol provides fast progress when good lookahead is available. In the absence of program analysis, lookaheads are extracted dynamically. Using program analysis, lookaheads may be extracted statically as well. When good lookaheads are not available, the conditional event protocol automatically provides fast progress. For deterministic sections of application code (dynamically detected at runtime), MPISIM automatically bypasses the simulation protocol and hence adds virtually no slowdown to the simulation. We present results on the validation and performance of MPISIM for a subset of the NAS Parallel Benchmarks (NPB 2).



next up previous
Next: Summary of Contributions Up: Introduction Previous: Motivation



Andy Kahn
Wed Jun 25 20:28:02 PDT 1997