Simulators for parallel programs can be effectively utilized to test, debug, and predict performance of parallel programs on a diverse set of parallel machines. A variety of simulation engines have been designed[BDC91,DGH91,RHL93,rpp91,Dic94] to estimate the performance of a parallel program. Most simulation engines were designed to estimate the performance of asynchronous or task parallel programs. With few exceptions, these simulation engines fall broadly into two classes: the resulting simulator itself is sequential and can be ported to almost any sequential workstation; or the simulator is parallel but designed to execute on a specific parallel machine. The former category includes simulations engines like Proteus and Tango, which can be used to simulate the execution of a program on a very detailed model of the target parallel machine. The primary drawback with this approach is that accurate simulations of target programs that will operate on large datasets and execute on a large number of processors, are typically slow.
Many techniques have been used to improve the execution time of program simulators, including the use of direct execution where local code is executed rather than simulated, and the use of high-level models where components of the parallel machine are modeled at an abstract level. However, in spite of these innovations, it is not unusual to have a factor of anywhere from 10-1000 slowdown in the execution of the simulator when compared with the execution time for the actual program. This slowdown is perhaps more significant for sparse and irregular computations, where it is harder for a programmer to extrapolate the results obtained from the execution of smaller datasets. Parallel implementations of the simulator offer significant potential for reducing the execution time for the simulator.
A data parallel program may be executed on both synchronous and
asynchronous parallel machines. In the latter case, the program
is compiled into a SPMD program, with (typically) one
process assigned to each processor of the target machine.
(In general, it is possible for the compiler to create multiple threads
for each processor; we assume one thread/processor for simplicity in
exposition). Sequential simulators interleave the simulation of each process
in the target program on a single processor
leading to significant slowdown in its execution.
Parallel simulators can use multiple
available processors, such that each processor in the simulation
interleaves a smaller number of processes from the target
program. Thus given a target program with N processes, a
sequential simulator must interleave N processes on one processor,
whereas a parallel simulator with S, S<N, available
processors can interleave approximately
processes on
each processor. However, the parallel simulator must ensure that
the simulation processes executing on different processors
are synchronized such that all events in the model
are executed in their correct global order. This synchronization can add
to the overhead of parallel execution of the simulator, limiting its
potential benefit. This overhead can, in turn, be reduced by using
a number of optimizations that exploit semantic knowledge of
specific data parallel constructs.
In this chapter, we describe the design and implementation of a program simulation engine called DPSIM. The simulation engine is designed for execution on both sequential workstations and parallel distributed memory machines. The primary contributions of this chapter are:
The rest of the chapter is organized as follows: the next section gives a brief description of the primary constructs found in most data parallel languages. Section 3.4 introduces the simulation methodology. Section 3.5 contains the detailed design of the simulation engine and experimental results that validate it and describe the speedups that can be achieved from parallel execution of the resulting simulators compared to a sequential implementation. Section 3.6 discusses the relationship between this and existing program simulation engines and section 3.7 contains a brief summary.