Friday, January 16, 1998
Initial Results for the Small Problem Size
MPI-SIM, a modular, parallel simulation facility for MPI programs has been developed at UCLA. This simulator can be used to evaluate the performance of existing MPI programs as a function of number of processors, or processor and interconnection network characteristics, or message library implementations. The simulator can also be used to evaluate the performance of parallel input/output systems. Supported capabilities include a number of different caching algorithms: collective I/O techniques, cache replacement algorithms, and I/O device models.
A translator has also been provided to automatically link existing MPI programs with the simulation library and execute the simulator on both sequential and parallel architectures. The parallel implementations allow us to considerably reduce the execution time of the simulator as compared to purely sequential executions.
The utility of MPI-SIM for performance prediction has been demonstrated with two types of parallel programs: I/O intensive and communication intensive. Recent experiments and results of the simulation of I/O intensive programs are highlighted in a recent publication titled, "Parallel Simulation of Parallel File Systems and I/O Programs" presented at Supercomputing ?97 (a reprint is available over the web at http://pcl.cs.ucla.edu/papers/. The study included results from two of the NAS-BTIO benchmarks as well as results from an out-of-core matrix multiplication program and a synthetic I/O workload generator. Running these programs within the simulation environment allows for a much wider range of studies than would be otherwise possible. To date, we have been able to show the scalability of applications as a system?s computational and I/O parallelisms are increased. We have examined the effectiveness of intelligent collective I/O routines for reducing the number of I/O operations sent to disk. For the NAS-BTIO benchmarks, well over 20000 write requests were eliminated by issuing larger, collective requests. We have seen the benefits of using cooperative caching techniques to coordinate the management of a parallel system?s many separate caches. The matrix multiplication program was able to operate more efficiently with larger matrices when cache coordination allowed larger data sets to be kept in memory. Running the simulation model in parallel has shown promising reductions in the execution time of the experiments, up to 6 times improvement on 16 processors.
SWEEP3D
We have used MPI-SIM to simulate the kernel of an ASCI-relevant application developed at Los Alamos National Laboratory: the SWEEP3D benchmark.
SWEEP3D solves a 1-group time-independent discrete ordinates 3D Cartesian geometry neutron transport problem. Parallelism is achieved by domain decomposition and message-passing. MPI-SIM was used to evaluate the scalability of the kernel as a function of number of processors, message latencies and blocking factor. The target architecture was assumed to be the IBM SP2 multicomputer.
Problem Scalability
A few representative results are summarized: first, for a ?small? problem size (503), the number of target processors were increased from 6 to 240, and the performance was best for up to 96, where the execution time was found to be about 7 times faster as compared with 6 processors. Beyond that increasing the number of processors performance gains were small. Figure 1 shows the predicted runtime for up to 240 processors. The size of the problem is 503, and used a blocking size of 10 for the k dimension and a blocking size of 3 for the blocking of angles.
Figure 1 Problem size 503, mk=10, mmi=3
Impact on message latency on performance
We are currently investigating the scalability of the
problem for larger problem sizes and hope to identify the maximum number
of processors that can be used effectively for this problem. The study
also looked at the performance of the application as a function of message
latencies: four different latencies were used: 0, s, 4s, and 10s, where
s is the current latency on the SP2. Surprisingly, the actual latency appeared
to have almost no impact on the execution time of the SWEEP3D kernel for
configurations with up to 240 processors (see Figure 2). Figure 2 shows
the behavior of the program only for a 2x3 processor grid, but similar
results have been obtained for other processor configurations. The runtime
independent of small latencies is an encouraging result as it points towards
the viability of platforms like the computational plant for this application.
Figure 2 Problem size 503, mk=10, mmi=3, latencies
vary from 0 to 10 times the latency of an SP
Only when the latency is increased to 100 times that of an SP, the differences in performance become apparent. Figure 3 shows the runtimes for the small problem size, with mk=10, mmi=3. The latencies are varied from 100 to a 1000 times that of an SP.
Figure 3 Runtime for latencies up to 1000 times
that of an SP
Surprisingly, we do not notice the classic tradeoff between sending many small messages versus sending a fewer number of larger messages (small blocking size versus large blocking size) as the latency increases.
Impact of blocking size on performance
The simulator was also used to investigate the effect of data decomposition on the execution time of the application. SWEEP3D uses block distribution of data---one of the spatial dimensions is block-distributed among processors. MPI-SIM was used to determine the optimal size of data blocks for the small problem size and a given number of processors. On a system with IBM SP message passing characteristics, where the message latency is low, the best blocking size was one for the 6 to 240 processor range. The smaller blocking size implies that a large amount of small messages yield better overall performance as opposed to the larger blocking size which results in larger, albeit fewer messages. This result is counter-intuitive as the software overhead from a large number of smaller messages was expected to have worse performance.
Figure 4 Runtime with varying blocking sizes for k with SP latency
Figure 5 Runtime for various blocking sizes for angles
with SP latency, mk =1
We have studied the effect of the blocking size for the angles. Again, the smaller blocking size performed best, ( see Figures 5 and 6). Figure 6 also demonstrates that while the difference in runtime is not significant when the correct blocking size for k is chosen (size of 1), that difference becomes greater when the size of blocks for k is poorly chosen.
Figure 6 Runtime for various blocking sizes with SP latency,
mk =50
Performance of MPI-SIM
We have studied the performance of the simulator itself. MPI-SIM performed well when simulating SWEEP3D.
Figure 7: Performance of MPI-SIM for various target processor
configurations
Figure 7 shows the performance of the simulator for up
to 16 processors and with various target processor configurations.
Figure 8 shows the corresponding speedups. The simulator
reaches a speedup of 9.21 using 16 processors
Figure 8 Speedup of MPI-SIM simulating SWEEP3D
We have shown the importance of using real applications to study system performance. Currently, we can simulate programs written in Fortran77 and C, which use message-passing and MPIO for disk access. Since software development, especially for parallel platforms, is expensive and complex, it is useful to be able to estimate the performance of a program, before the actual software has been fully developed. The simulation could expose performance bottlenecks and guide the programmer in making parallelization decisions during the software development process. Obviously, there needs to be a means of conceptually representing the future software. For a serial program, a control flow graph is sufficient. However, for parallel programs, a more sophisticated structure is needed---a structure that can expose the parallelism of the program. This can be accomplished with the use of a task graph.
A task is a block of computation that is executed by a single process. The computation within a task needs only the data, which are available to the task at its beginning. The task graph describes the computational entities in the program and exposes the dependencies between the data needed by individual tasks. The scheduling function for the graph assigns individual tasks to processes.
The runtime of a single task can be estimated using runs performed on a single processor. The I/O performed by the tasks can be modeled by the MPI-SIM simulator. When two tasks with data dependencies belong to the same process, no communication is needed. Otherwise, data has to be communicated between the processes. The task graph incorporates explicit edges for communications. These in turn can be modeled by MPI-SIM. A program that reads the task graph with the scheduling information and incorporates it with the MPI-SIM simulator will be developed.
The task graph in conjunction with MPI-SIM can also be used to evaluate the performance of experimental hardware architectures. This will enable the design and evaluation of software in parallel with the new hardware.