next up previous
Next: Simulation Protocols Up: Simulation of Message Previous: Simulation Model

Predicting the Execution Time of a Local Code Block

  Even though a local code block may have the same process trace in the target program execution and in the simulation, it may have different execution times in each, for several reasons:

  1. The processor and memory hierarchy in the nodes of the host and target machines may be different.
  2. Even if they are the same, the simulator will surely have additional code interspersed in the code of the target program. Upon compilation, this additional simulation code may alter the assembly code generated for the pieces of target program code preceding and following it. For example, register allocation could be altered.
  3. Even if all additional simulation code is in the form of function calls, which would minimize the alteration described in the previous point, the simulator data structures will, at run time, alter the state of the cache. Hence, a local code block following a simulation code block may experience more cache misses than it would have in the target program execution.
Due to the reasons stated above, measuring the execution time of each local code block during the simulation (termed runtime measurement) might result in an inaccurate prediction of its execution time on the target machine. Additionally, each process in the target program execution uses two types of data structures: program data structures and OS data structures. OS data structures are accessed through system calls. Since the target program executes as one process per processor, and each processor has a separate copy of the OS on most distributed memory machines (like the IBM-SP2), processes do not share any OS data structures. However, in the simulation, since many LPs are mapped to the same processor, they automatically share all OS data structures. Such unintended interference between LPs may alter the process trace of a local code block that contains a system call. The parts of the OS in which this effect is expected to be significant must be modeled separately by each LP (i.e. not by direct execution).

For runtime measurement to work, we must ensure that the host and target processors are the same, that additional simulation code is in the form of function calls and that OS interference has been eliminated. The only factor that remains to be accounted for is the cache alteration effect by the data of the simulation code. The cache alteration effect is hard to quantify since the number of additional cache misses in the target program code due to the simulation code depends on many factors ranging from the size of the cache to the data access pattern of the target program, and the actual effect of the additional cache misses depends on the target program and the processor architecture i.e. its ability to tolerate misses using (a) write buffers and (b) exploitation of instruction level parallelism (ILP) in the target program. Hence, we cannot easily estimate and subtract the cache alteration effect from the measured execution time to get an accurate estimate of the execution time. Consequently, we make the assumption that additional simulation code is invoked infrequently by the simulation. Since simulation code needs to be invoked primarily during communication statements, this assumption limits us to programs that communicate infrequently.

Besides runtime measurement, the other methods that might be adopted to predict the execution time of a local code block are:

  1. Instruction Level Simulation: Complete instruction level simulation of the target program (compiled for the target architecture) and the cache of the target processor. This is very time consuming, and requires detailed modeling of every target processor architecture and cache. This method does not require the host machine to be the same as the target machine.
  2. Profiling(Compile time estimation of basic block execution time): In this method (first used in [CMM88,MF88]), the target program is compiled, and the assembly code of each basic block of the target program is examined to estimate its execution time in the number of processor cycles, and (each basic block) is augmented with additional instructions which increment a counter by this number. The simulator executes the augmented target program (as opposed to just the target program). Hence, in the simulation, the counter maintains a count of the total number of target processor cycles used, which can be easily translated to an execution time estimate. However, this count cannot be an accurate count, for there are two additional factors that cannot be precisely known, namely
    1. The number of bubbles in the instruction processing pipeline(s) of the processor. Bubbles reduce the throughput and increase the number of cycles that it takes to execute a basic block. The number of bubbles in the instruction processing pipeline(s) is in part determined by: (a) Number of cache misses (which depends on the data pulled into the cache by the preceding basic blocks in the execution, hence known only at runtime), (b) Amount of ILP (high ILP implies instructions can be executed in parallel, hence pipeline(s) can operate at full speed, otherwise not. Dependent instructions may be revealed only at runtime (e.g. across basic blocks, or even if two instructions within the same basic block are dependent on each other), and hence the amount of ILP may be known only then) (c) Branch prediction success at basic block boundaries (influences number of pipeline startups, and clearly known only at runtime).
    2. How long system calls would take to execute. Until this point, we have assumed that when the target program is compiled, the resultant assembly code has all the code that may be executed during the lifetime of the process that executes it. However, it cannot contain code for system calls. System calls appear in the code as function calls that generate software interrupts to the location in the operating system code that contains the code for the system call. In order to profile system calls, the operating system needs to be augmented and recompiled.

    Due to the above reasons, only a lower bound on the cycle count can be calculated. Presumably, this count can be improved by statistical estimates of the number of cache misses, ILP, successful branch predictions and duration of system calls expected in the program. Profiling does not require the host processor to be the same as the target processor. If they are not, the target program is compiled for both the target and the host instruction sets. The target assembly code is examined to get the execution time estimates of each basic block. The corresponding basic blocks of the host assembly code are augmented with these estimates, since the host assembly code is what is executed by the simulation. Such augmentation is feasible because basic blocks are program dependent and not architecture dependent.



next up previous
Next: Simulation Protocols Up: Simulation of Message Previous: Simulation Model



Andy Kahn
Wed Jun 25 20:28:02 PDT 1997