MPI-SIM Output Guide

by Stephen Docy

Section I. Introduction

In addition to any output generated by the target program, the simulator produces supplemental output showing a number of performance and behavioural metrics for both the target program and the simulator itself. This output covers a wide variety of areas and can be used to analyze and compare results from simulation experiments. How useful each statistic is depends on the experiments being run and the information of immediate interest. The metrics currently reported are by no means an exhaustive set: as new applications have been added and new experiments designed, the output has grown to cover new areas of interest.

Output from the simulation can generally be placed into one of the following categories:

  1. Application metrics: data related to the predicted performance of the target program, such as execution time, number and size of messages sent, or number and type of I/O operations performed.

  2. Simulation metrics: data specifying performance characteristics of the simulator code, such as actual simulator execution time, number and type of simulation protocol messages sent, or number of messages sent to simulator processes on other processors.

  3. System metrics: data detailing the simulated communication and file system configuration and behaviour, such as communication latency factors, number of compute nodes (cnodes) and I/O nodes (inodes), or number and type of disk operations. Many of these metrics are presented only when the detailed file system model is used (-DUSE_FS_SIM is specified at compile time).

In addition to simple performance metrics and statistics, the simulator is also an excellent tool for creating trace records for a variety of events. Since all communication and I/O is simulated, message events and I/O events are prime candidates for event tracing. Because the tracing overhead occurs outside of the measured execution time of the target program, detailed tracing can be performed without perturbing the target program's predicted behaviour, though the simulator's own execution time may be affected.

Section II. Application Metrics

Target application statistics measured by the simulator are printed to stdout at the end of the program's execution. All data is presented in a table with the label Begin Program Statistics. Each column of the table contains one measured statistic, and each row represents one thread of the target application (not a simulator thread).

Note that each row of the table is contained within one ASCII line that is over 120 characters long; screen wrapping of these lines may make the data hard to read.

Figure 1 shows sample target application statistics as they appear on an 80-column-wide screen, where the header and each data row wrap onto a second line. The output shows the data for five target application threads.


**********Begin Program Statistics*********************    
ENT#      EXEC TIME      BLOCK TIME    RD    WR    SK      BYTES_R      BYTES_W
SSEND    SSENDTIME BSEND    BSENDTIME  MSGTHERE  MSGWAIT         
  0       198273028       124017835     0     0     0            0            0
  397       296675     0            0   121   256   
  1       198271408       124418440     0     0     0            0            0
  397       296675     0            0   121   256   
  2       198271408       123778955     0     0     0            0            0
  397       296675     0            0   121   256   
  3       198271461       121783181     0     0     0            0            0
  397       296675     0            0   121   256   
  4       198271408       124273504     0     0     0            0            0
  397       296675     0            0   121   256

Figure 1. Target application statistics produced by MPI-SIM


For each target thread, the following information is provided:

Item  Label
      Meaning

1.    ENT#
      The 0-based identification number of the entity (thread).
2.    EXEC TIME
      Predicted execution time of the target thread (in microseconds).
3.    BLOCK TIME
      Amount of time the target thread was blocked.
4.    RD, WR, SK
      Number of read, write, and seek operations performed by the target thread.
5.    BYTES_R, BYTES_W
      Number of bytes read and written by the target thread.
6.    SSEND, SSENDTIME
      Number of synchronous sends and their accumulated latencies.
7.    BSEND, BSENDTIME
      Number of buffered sends and their accumulated latencies.
8.    MSGTHERE
      Number of times a message-receive had to wait for an appropriate message to arrive.
9.    MSGWAIT
      Number of times there was a message waiting in the queue.
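
Because wrapped rows are awkward to read by eye, it can help to load the table into a script. The following Python sketch is one possible way to do that and is not part of MPI-SIM. It assumes the column order shown in Figure 1, that every value is a non-negative integer, and that the table ends at the first non-numeric line after the data. The function name parse_program_stats and the COLUMNS list are illustrative choices, not simulator identifiers.

```python
# Column labels in the order they appear in the Figure 1 header (assumed fixed).
COLUMNS = ["ENT#", "EXEC TIME", "BLOCK TIME", "RD", "WR", "SK",
           "BYTES_R", "BYTES_W", "SSEND", "SSENDTIME",
           "BSEND", "BSENDTIME", "MSGTHERE", "MSGWAIT"]


def parse_program_stats(text):
    """Return one dict per target thread from captured simulator output.

    Works on wrapped or unwrapped output because it pools every integer
    token after the banner and regroups them by column count. It does not
    handle a single number that the terminal split across two lines.
    """
    tokens = []
    in_table = False
    for line in text.splitlines():
        if "Begin Program Statistics" in line:
            in_table = True
            continue
        if not in_table:
            continue
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if all(f.isdigit() for f in fields):
            tokens.extend(int(f) for f in fields)
        elif tokens:
            break  # first non-numeric line after the data ends the table
        # otherwise this is a header line; skip it
    n = len(COLUMNS)
    return [dict(zip(COLUMNS, tokens[i:i + n]))
            for i in range(0, len(tokens) - n + 1, n)]
```

With rows parsed this way, derived figures are simple to compute. For thread 0 in Figure 1, BLOCK TIME divided by EXEC TIME is roughly 0.63, assuming both columns are reported in the same unit.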

The application statistics table is followed by the total number of I/O operations and bytes read and written across all threads. Figure 2 shows data from a program that performs no I/O at all.


Total read requests : 0
Total write requests: 0
Total seek requests : 0
Total bytes read    : 0
Total bytes written : 0
Figure 2. Target application I/O stats produced by MPI-SIM.
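
The totals block can be read into a dictionary and compared against the per-thread columns. The Python sketch below is illustrative only. It assumes the "Total ... : value" layout of Figure 2 and assumes that each total equals the sum of the matching per-thread column (RD, WR, SK, BYTES_R, BYTES_W), which the description above suggests but which has not been verified against the simulator source. The names parse_io_totals, totals_match, and TOTAL_TO_COLUMN are hypothetical.

```python
# Assumed mapping from each totals label to its per-thread column label.
TOTAL_TO_COLUMN = {
    "Total read requests": "RD",
    "Total write requests": "WR",
    "Total seek requests": "SK",
    "Total bytes read": "BYTES_R",
    "Total bytes written": "BYTES_W",
}


def parse_io_totals(text):
    """Collect lines of the form 'Total <something> : <integer>'."""
    totals = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        label = label.strip()
        if sep and label.startswith("Total ") and value.strip().isdigit():
            totals[label] = int(value)
    return totals


def totals_match(rows, totals):
    """True if every total equals the sum of its per-thread column.

    `rows` is a list of dicts keyed by the per-thread column labels.
    """
    return all(totals[label] == sum(row[column] for row in rows)
               for label, column in TOTAL_TO_COLUMN.items())
```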


Section III. Simulator Metrics

Simulator statistics are provided in the simulator's output in the table labelled "Begin Simulator Statistics". Figure 3 shows sample output for two simulator threads, again wrapped at 80 columns.


**********Begin Simulator Statistics*******************
PROC#.      EXEC TIME.     BLOCK TIME. PROT MSGS.  LOC MSGS.  OFF MSGS.   NULLTO
T.   ACKNULL.    ACKNULLTIME.   MSGNULL.    MSGNULLTIME.   CONDTOT.   ACKCOND.  
  ACKCONDTIME.   MSGCOND.    MSGCONDTIME.   ACKBOTH.    ACKBOTHTIME.   MSGBOTH. 
   MSGBOTHTIME.NUMLOCBLKS.    TIMELOCBLKS.   DETRCVS.NONDETRCVS.  SWTCHOPT.  SWT
CHEIT.REQLSTSIZE.RCVINRLIST.MSGLSTSIZE.   MSGLISTTRAVS.       MSGSSEEN.MAXMSGSIZ
E.
    0       299145137       224576803          0          0        589          
0          0               0          0               0          0          0   
            0          0               0          0               0          0  
             0        428        74255193        377          0        569      
    0          0          0          0               0               0      6000
0
    1       299230106       224994271          0          0        613          
0          0               0          0               0          0          0   
            0          0               0          0               0          0  
             0        620        73852968        329          0        617      
    0
Figure 3. Simulator statistics


For each simulation thread (there is one thread per processor used to run the simulation), the following information is provided:

Item  Label
      Meaning

1.    PROC#
      The 0-based identification number of the entity (thread).
2.    EXEC TIME
      Execution time of this thread in microseconds.
3.    BLOCK TIME
      Amount of time this thread was blocked.
4.    PROT MSGS
      Number of simulation protocol messages sent.
5.    LOC MSGS, OFF MSGS
      Number of local and off-processor messages sent.
6.    CONDTOT, ACKCOND, ACKCONDTIME, MSGCOND, MSGCONDTIME
      Conditional-event-message protocol statistics.
7.    ACKBOTH, ACKBOTHTIME, MSGBOTH, MSGBOTHTIME
      Accelerated-message protocol statistics (??).
8.    NUMLOCBLKS, TIMELOCBLKS
      Number of local execution code blocks and the total time spent executing local code blocks.
9.    DETRCVS, NONDETRCVS
      Number of deterministic and non-deterministic receives.
10.   SWTCHOPT, SWTCHEIT, REQLSTSIZE, RCVINRLIST, MSGLSTSIZE, MSGLISTTRAVS, MSGSSEEN, MAXMSGSIZE
      ???
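
In Figure 3 every column label ends with a period, which makes it possible to recover labels that contain spaces (such as EXEC TIME) without a hard-coded list. The Python sketch below relies on that observation. It is an assumption drawn from the sample output rather than a documented format, and it expects the header and the data row to have been captured as single unwrapped lines. The function names are illustrative.

```python
def split_dotted_header(header_line):
    """Split a header such as 'PROC#.  EXEC TIME.  BLOCK TIME.' into labels."""
    return [name.strip() for name in header_line.split(".") if name.strip()]


def parse_simulator_row(header_line, row_line):
    """Pair each label from the header with the integer in the same position."""
    names = split_dotted_header(header_line)
    values = [int(v) for v in row_line.split()]
    if len(names) != len(values):
        raise ValueError(
            "expected %d values, found %d" % (len(names), len(values)))
    return dict(zip(names, values))
```

Given a parsed row, ratios such as BLOCK TIME over EXEC TIME, or TIMELOCBLKS over NUMLOCBLKS, can be computed directly, with the usual caveat that the units of the time columns other than EXEC TIME are not stated in this guide.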

Section IV. System Metrics

The system metrics show the configuration of the simulated parallel system as well as its performance statistics. Many of these metrics are printed only when the detailed file system model is specified at compile time. This information is printed before the program statistics. Figure 4 shows sample output for a system with 1 compute node and 16 I/O nodes. Configuration lines in this category are prefixed with <***>.



<***>cnodes 1 ionodes 16 disks 1 myCoop 0 greedy 0
<***>CNsize 0 CNassoc 0 IOsize 0 IOassoc 0 CNdonate 0
<***>w_thru 1 w_alloc 0 CACHE_XFER 1966
<***>disk... seek 8500 rot 4800 xfer 97 simple disk model
<***>sector 512 block 65536
syntime 26836 - 0 = 26836

cnode #0 all i/o complete @268362051
cnode #0 N_NETMSG 10000 N_NETBLOCK 10000 N_DISKWR 0(0) N_DISKRD 10000(10000)
cnode #0 N_CNHIT 0 N_IOHIT 0 N_CCHIT 0(0) N_REHIT 0
cnode #0 RDBLOCKS 10000 WRBLOCKS 0
cnode #total TOTCN 0 TOTIO 0 TOTCC 0(0) TOTRE 0

disk #0 reads 664(664) writes 0(0)
disk #1 reads 610(610) writes 0(0)
disk #2 reads 611(611) writes 0(0)
disk #3 reads 596(596) writes 0(0)
disk #4 reads 577(577) writes 0(0)
disk #5 reads 639(639) writes 0(0)
disk #6 reads 668(668) writes 0(0)
disk #7 reads 614(614) writes 0(0)
disk #8 reads 665(665) writes 0(0)
disk #9 reads 657(657) writes 0(0)
disk #10 reads 629(629) writes 0(0)
disk #11 reads 602(602) writes 0(0)
disk #12 reads 598(598) writes 0(0)
disk #13 reads 634(634) writes 0(0)
disk #14 reads 599(599) writes 0(0)
disk #15 reads 637(637) writes 0(0)
Figure 4. System statistics


Using the figure as a guide, configuration data includes:

Item  Label
      Meaning

1.    cnodes 1 ionodes 16
      Number of compute and I/O nodes.
2.    disks 1
      Number of disks attached to each I/O node.
3.    SLOW_NETWORK-n
      Communication latency factor of n (not shown in figure).
4.    myCoop 0 greedy 0
      Cooperative caching algorithms being used.
5.    CNsize 0 CNassoc 0 IOsize 0 IOassoc 0
      Cache size and associativity for both compute and I/O nodes.
6.    CNdonate 0
      Amount of compute node cache used for centrally coordinated caching (a cooperative caching technique) ???
7.    w_thru 1 w_alloc 0
      Cache write policies being used (write-through vs. write-back, write-around vs. write-allocate).
8.    CACHE_XFER 1966
      Cache hit time (time to retrieve a data block from cache).
9.    disk... seek 8500 rot 4800 xfer 97
      Disk seek, rotation, and data transfer times.
10.   simple disk model
      Disk model used (simple vs. detailed).
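
When comparing runs, it is convenient to collect the configuration lines into a single dictionary. The Python sketch below is one way to do this and is not part of the simulator. It assumes that configuration lines start with <***> and consist mostly of "name value" pairs with integer values, as in Figure 4. Free-text fragments such as "disk..." and "simple disk model" are skipped. The names PREFIX, PAIR, and parse_config are illustrative.

```python
import re

PREFIX = "<***>"
# A name (letters, digits, underscore) followed by whitespace and an integer.
PAIR = re.compile(r"\b([A-Za-z_]\w*)\s+(\d+)\b")


def parse_config(text):
    """Collect every 'name value' pair found on lines prefixed with <***>."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith(PREFIX):
            continue  # not a configuration line
        for key, value in PAIR.findall(line[len(PREFIX):]):
            config[key] = int(value)
    return config
```

Applied to the configuration lines of Figure 4, this would yield entries such as cnodes, ionodes, seek, and block, while leaving out the unprefixed syntime, cnode, and disk lines.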

Performance statistics are provided for each compute node in the system. The information provided includes:

Item  Label
      Meaning

1.    cnode #0 all i/o complete @268362051
      Execution time of the target thread running on that compute node.
2.    N_NETMSG 10000 N_NETBLOCK 10000
      Number of messages and number of data blocks sent across interconnect.
3.    RDBLOCKS 10000 WRBLOCKS 0
      Number of data blocks read and written.
4.    N_DISKWR 0(0) N_DISKRD 10000(10000)
      Number of data blocks which required a disk operation (were not serviced by cache) ???
5.    N_CNHIT 0 N_IOHIT 0 N_CCHIT 0(0) N_REHIT 0
      Number of cache hits in local compute node cache, I/O node cache, and remote compute node cache.
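
In Figure 4 the sixteen per-disk read counts add up to 10000, the same value reported as N_DISKRD for cnode #0. Whether this holds in general (for example with several compute nodes or with caching enabled) is not stated here, so the Python sketch below should be read as a quick sanity check under that assumption and not as a documented invariant. It uses only the first number on each disk line, since the meaning of the parenthesized value is not described in this guide. The names DISK_LINE and disk_totals are illustrative.

```python
import re

# Matches lines such as: disk #3 reads 596(596) writes 0(0)
DISK_LINE = re.compile(
    r"disk #(\d+) reads (\d+)\((\d+)\) writes (\d+)\((\d+)\)")


def disk_totals(text):
    """Sum the first read and write count over every per-disk line."""
    reads = writes = 0
    for match in DISK_LINE.finditer(text):
        reads += int(match.group(2))
        writes += int(match.group(4))
    return reads, writes
```

The returned read total can then be compared with the N_DISKRD values printed on the cnode lines.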