We have validated MPISIM and measured its performance using a subset of the NAS (Numerical Aerodynamic Simulation) Parallel Benchmarks (NPB 2) [NAS95] on an IBM-SP2. Each node of the IBM-SP2 we used is a POWER2 node with 128 KB of cache and 256 MB of main memory. Nodes are connected by a high-performance switch that offers a point-to-point bandwidth of 40 MB/s and has a hardware latency of 500 ns. The NPB 2 benchmarks are a set of programs developed by the NASA NAS program to help evaluate the performance of parallel supercomputers. They are derived from computational fluid dynamics (CFD) applications and are written in Fortran 77 with embedded MPI calls for communication. To the best of our knowledge, NPB 2 is the only publicly available benchmark suite containing real applications that use MPI. Since target programs need to be privatized (see Section 5.4.1.1) before being linked with MPISIM, and the privatization code has been written for C programs, the benchmarks had to be converted to C source. We were able to convert four of the five benchmarks to C using f2c [f2c90], a Fortran-to-C converter. The converted ones are:
There are four classes of each benchmark: S, A, B, and C. Class S is the smallest problem size, and classes A, B, and C are progressively larger. Privatization multiplies the memory used by the permanent variables of a program by the maximum number of threads per process, and all the benchmark codes use many permanent variables. As a result, we were able to privatize the code to contain up to 16 threads per process only for the class S benchmarks. For the other classes, privatization ballooned the static memory usage (i.e., program text size plus permanent variable size) so much that the simulations thrashed and performed poorly. The LU benchmark is an exception: we were forced to use class A because the class S problem is too small to execute on more than 4 processors. Consequently, we were able to privatize LU to contain only up to 4 threads per process.
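The memory growth described above can be illustrated with a minimal C sketch. It assumes that privatization replaces each permanent variable with an array holding one copy per thread, and that the program text size is unchanged; both are simplifying assumptions made here, not a description of the actual transformation in Section 5.4.1.1. The names `MAX_THREADS`, `residual`, `residual_for`, and `static_footprint` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed maximum number of threads per process (16, as for class S). */
#define MAX_THREADS 16

/* Before privatization, a permanent variable has one copy per process:
 *     static double residual;
 * After privatization (as assumed in this sketch), it has one copy per
 * thread, so its storage is multiplied by MAX_THREADS. */
static double residual[MAX_THREADS];

/* Hypothetical accessor: returns the calling thread's private copy. */
double *residual_for(int thread_id)
{
    assert(thread_id >= 0 && thread_id < MAX_THREADS);
    return &residual[thread_id];
}

/* Static memory usage after privatization, under the assumption that
 * only the permanent variables are replicated and the text is not. */
size_t static_footprint(size_t text_bytes,
                        size_t permanent_bytes,
                        size_t max_threads)
{
    return text_bytes + permanent_bytes * max_threads;
}
```

As a purely illustrative calculation with made-up sizes, a program with 1 MB of text and 8 MB of permanent variables would need 1 + 8 × 16 = 129 MB of static memory at 16 threads per process, but only 1 + 8 × 4 = 33 MB at 4 threads per process.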