Next: Simulation Protocols
Up: Simulation of Message
Previous: Simulation Model
Even though a local code block may have the same process trace in the
target program execution and in the simulation, it may have different
execution times in each, for several reasons:
- The processor and memory hierarchy
in the nodes of the host and target machines may be different.
- Even if they are the same, the simulator
will surely have additional code interspersed in the code of
the target program. Upon compilation, this additional simulation code may alter
the assembly code generated for the pieces of
target program code preceding and following it. For example,
register allocation could be altered.
- Even if all additional simulation code is in the form of
function calls, which would minimize the alteration described in the
previous point, the simulator data structures will, at run time, alter the state
of the cache. Hence, a local code block following a simulation
code block may experience more cache misses than it would have in the target
program execution.
Due to the reasons stated above, measuring the execution time of
each local code block during the simulation (termed runtime measurement)
might result in an inaccurate prediction of its execution time on the target
machine. Additionally, each process in the target program execution uses two
types of data structures: program data structures and OS
data structures. OS data structures are accessed through
system calls. Since the target program
executes as one process per processor, and each processor has
a separate copy of the OS on most distributed memory
machines (like the IBM-SP2), processes do not share any
OS data structures. However, in the simulation,
since many LPs are mapped to the same processor, they automatically
share all OS data structures. Such
unintended interference between LPs may alter the process trace
of a local code block that contains a system call.
The parts of the OS in which this effect is
expected to be significant must be modeled separately by
each LP (i.e. not by direct execution).
For runtime measurement to work,
we must ensure
that the host and target processors are the same, that additional
simulation code is in the form of function calls and that OS interference
has been eliminated.
The only factor that remains
to be accounted for is the cache alteration effect by the
data of the simulation code. The cache alteration effect is hard to
quantify since the number of additional cache misses in the target
program code due to the
simulation code depends on many factors ranging from the size of
the cache to the data access pattern of the target program,
and the actual effect of the additional cache misses depends on the
target program and the processor architecture i.e. its ability to
tolerate misses using (a) write buffers and (b)
exploitation of instruction level parallelism (ILP) in the target program.
Hence, we cannot easily estimate and subtract the cache alteration effect
from the measured execution time to get an accurate estimate of the execution
time. Consequently, we make the assumption that additional simulation code is
invoked infrequently by the simulation.
Since simulation code needs to be invoked primarily during
communication statements, this assumption limits
us to programs that communicate infrequently.
Besides runtime measurement, the other methods
that might be adopted to predict the execution time of a local code
block are:
- Instruction Level Simulation: Complete
instruction level simulation of the target program (compiled
for the target architecture) and
the cache of the target processor.
This is very time consuming, and requires detailed modeling of
every target processor architecture and cache. This method
does not require the host machine to be the same as the target machine.
- Profiling(Compile time estimation of basic block execution time):
In this
method (first used in [CMM88,MF88]),
the target program is compiled, and
the assembly code of each basic block of the target program is
examined to estimate its execution time in the number of processor cycles,
and (each basic block) is augmented with
additional instructions which increment a counter by this number.
The simulator executes the augmented target program (as opposed to just the
target program).
Hence, in the simulation, the counter maintains a count of the
total number of target processor cycles used, which can be easily
translated to an execution time estimate. However, this count cannot be an
accurate count, for there are two additional factors that cannot be precisely
known,
namely
- The number of bubbles in the instruction processing pipeline(s)
of the processor.
Bubbles reduce the throughput and increase the number of
cycles that it takes to execute a basic block.
The number of bubbles in the instruction processing pipeline(s)
is in part determined by:
(a) Number of cache misses (which depends on the data pulled
into the cache by the
preceding basic blocks in the execution, hence known only at runtime),
(b) Amount of ILP
(high ILP implies instructions can be executed in parallel,
hence pipeline(s) can operate at full speed, otherwise not.
Dependent instructions may be revealed only at runtime (e.g. across
basic blocks, or even if two instructions within the
same basic block are dependent on each other), and
hence the amount of ILP may be known only then)
(c) Branch prediction success
at basic block boundaries (influences number of
pipeline startups, and clearly known only at runtime).
- How long system calls would take to execute. Until this point, we
have assumed that when the target program is compiled, the
resultant assembly code has all the code that may be executed during the
lifetime of the process that executes it.
However, it cannot contain code for system calls. System calls
appear in the code as function calls that generate software interrupts
to the location in the operating
system code that contains the code for the system call.
In order to profile system calls, the operating system needs to
be augmented and recompiled.
Due to the above reasons, only a lower bound on the cycle
count can be calculated. Presumably, this count can be improved by
statistical estimates of the number of cache misses, ILP,
successful branch predictions and duration of system calls expected
in the program.
Profiling does not require the host
processor to be the same as the target processor. If they are not,
the target program is compiled for
both the target and the host instruction sets. The target assembly
code is examined to get the execution time estimates of
each basic block. The corresponding basic blocks of the host assembly code
are augmented with these estimates, since the host assembly code is what
is executed by the simulation. Such augmentation is feasible because
basic blocks are
program dependent and not architecture dependent.
Next: Simulation Protocols
Up: Simulation of Message
Previous: Simulation Model
Andy Kahn
Wed Jun 25 20:28:02 PDT 1997