Next: Summary Up: Simulation of Data Previous: Simulator Predictions

Related Work

A number of sequential simulation engines have been designed to evaluate the performance of parallel machines and programs. Existing engines include Proteus[BDC91], Tango[DGH91] and the RPPT simulation engine[rpp91], among others.

In Proteus, the application to be simulated is written in a superset of C, with constructs provided to control the placement of data. Library routines are supplied for message passing, thread management, memory management and data collection. The target machine is specified in terms of its interconnection medium (bus, direct or indirect network), the sizes of shared and private memory at each node, and other features such as interprocessor interrupts and handlers. The simulation engine creates an application- and machine-specific simulator which, upon execution, produces a trace file that can be interpreted by Proteus tools. The simulator uses a custom lightweight threads package. It uses direct execution for most instructions of the application program, but must simulate message passing and shared memory access instructions. As these instructions are the costliest to simulate, the simulator provides low- and high-accuracy network and shared memory modules that let the user trade off speed against accuracy in model execution.
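The speed/accuracy trade-off between network modules can be illustrated with a minimal sketch. This is not Proteus code or its API: the class names `FixedLatencyNetwork` and `MeshNetwork`, the `send` method, the 2D mesh topology, and all cycle counts are assumptions made purely for illustration. The low-accuracy model charges a constant cost per message, while the higher-accuracy model walks the route hop by hop and accounts for link contention, which costs more simulation work per message.

```python
from dataclasses import dataclass


@dataclass
class FixedLatencyNetwork:
    """Hypothetical low-accuracy model: every message costs the same."""
    latency: int = 20  # assumed constant cost in target cycles

    def send(self, src, dst, now):
        # O(1) work per message, ignores distance and contention.
        return now + self.latency


class MeshNetwork:
    """Hypothetical higher-accuracy model: 2D mesh, dimension-order
    routing, per-hop cost, and contention on each link."""

    def __init__(self, width, hop_cycles=2):
        self.width = width
        self.hop_cycles = hop_cycles
        self.link_free_at = {}  # link -> time at which it becomes idle

    def _route(self, src, dst):
        # Node ids are laid out row by row; route along x first, then y.
        x, y = src % self.width, src // self.width
        dx, dy = dst % self.width, dst // self.width
        path = []
        while x != dx:
            nx = x + (1 if dx > x else -1)
            path.append(((x, y), (nx, y)))
            x = nx
        while y != dy:
            ny = y + (1 if dy > y else -1)
            path.append(((x, y), (x, ny)))
            y = ny
        return path

    def send(self, src, dst, now):
        # Work grows with path length; a busy link delays the message.
        t = now
        for link in self._route(src, dst):
            t = max(t, self.link_free_at.get(link, 0)) + self.hop_cycles
            self.link_free_at[link] = t
        return t
```

Both classes return the simulated arrival time of a message, so a simulator written against this interface could swap one model for the other without changing the application being simulated.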

Tango has primarily been used to simulate the execution of programs written in typical shared-memory programming notations on shared-memory computers. The target application is written in C or FORTRAN, using macros to emulate a variety of programming paradigms, such as locks, barriers, distributed loops and messages. These macros eventually expand into routines which implement equivalent primitives of the target machine. The code is augmented for direct execution. The simulator executes as a set of Unix processes. Target machine primitives are implemented by interactions between the processes (using shared memory and semaphores), and by interactions between the processes and the memory simulator, which can have varying degrees of accuracy.

In RPPT, the application is written in Concurrent C[Mad87]. The target machine is specified in terms of processor modules and global memory modules, a process mapping which describes the mapping of logical nodes to physical nodes, and a routine UserSend (a CSIM[Sch85] routine written by the user) which essentially describes the interconnection medium. A preprocessor translates the application into a simulation program by inserting calls to UserSend, and possibly to a global memory simulator, at the required points. A profiler is used to augment the code for direct execution.

Parallel simulation engines include the Wisconsin Wind Tunnel[RHL93] (WWT) and the Large Application Parallel Simulation Environment[Dic94] (LAPSE). WWT is a simulator of cache-coherent, shared-memory computers that runs on the Thinking Machines CM-5. It provides fast parallel simulation by direct execution of local code, with one host processor per target processor. Shared memory of the target machine is simulated by trapping on accesses to invalid blocks. Each trap runs a software protocol to obtain the required block, and the simulation is charged only with the time the same operation would have taken on the target machine. A conservative distributed simulation protocol is executed to synchronize processors. The processors execute a barrier after every Q simulation cycles, in order to ensure that all messages sent in the current quantum are received before the next quantum starts. The quantum satisfies Q < T, where T is the minimum message latency of the target machine. A more recent paper[CH96] describes the implementation of a fully optimistic and a partially optimistic simulation protocol, but both have proven to be worse than the conservative protocol.
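The quantum-based barrier scheme can be sketched as follows. This is a sequential, single-process illustration under stated assumptions, not WWT code: the function names `run_quanta` and `ping_pong`, the event representation, and the example values of Q and T are all hypothetical. The point it shows is that, because every message takes at least T > Q cycles, a message sent during one quantum can only arrive in a later quantum, so delivering messages at the barrier never hands a processor an event from its own past.

```python
import heapq
import itertools


def run_quanta(num_procs, initial_events, handler, Q, T, num_quanta):
    """Sequential sketch of quantum-based conservative synchronization.

    initial_events: list of (time, dest_proc, payload)
    handler(proc, time, payload) -> list of (dest_proc, latency, payload)
    Returns the list of (time, proc, payload) events in processing order.
    """
    assert 0 < Q < T, "quantum must be shorter than the minimum latency"
    seq = itertools.count()  # tie-breaker so payloads are never compared
    queues = [[] for _ in range(num_procs)]
    for time, dest, payload in initial_events:
        heapq.heappush(queues[dest], (time, next(seq), payload))

    log = []
    for k in range(num_quanta):
        end = (k + 1) * Q          # this quantum covers [k*Q, end)
        in_flight = []
        # Each iteration stands in for one host processor running
        # independently of the others until the barrier.
        for proc in range(num_procs):
            q = queues[proc]
            while q and q[0][0] < end:
                time, _, payload = heapq.heappop(q)
                log.append((time, proc, payload))
                for dest, latency, out in handler(proc, time, payload):
                    assert latency >= T
                    in_flight.append((time + latency, dest, out))
        # Barrier: deliver everything sent during this quantum.
        for arrival, dest, out in in_flight:
            assert arrival >= end  # guaranteed by latency >= T > Q
            heapq.heappush(queues[dest], (arrival, next(seq), out))
    return log


def ping_pong(proc, time, payload):
    """Hypothetical two-processor workload: bounce a message back."""
    return [(1 - proc, 10, payload)]


if __name__ == "__main__":
    for event in run_quanta(2, [(0, 0, "ping")], ping_pong,
                            Q=8, T=10, num_quanta=5):
        print(event)
```

In a real parallel implementation the inner loop over processors would run concurrently on separate hosts, and the delivery step would coincide with the barrier. The sketch keeps everything in one process so that the ordering argument is easy to follow.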

LAPSE was designed to simulate the performance of asynchronous parallel programs. It is a direct execution parallel simulation environment which uses a conservative synchronization mechanism on parallel machines. LAPSE has been implemented on the Intel Paragon, and can simulate message passing application programs which use the Intel message passing library calls.






Andy Kahn
Wed Jun 25 20:28:02 PDT 1997