We compared all modes of each simulator against the quantum protocol. The performance of the simulation protocol in each simulator mode is gauged by the number of rounds of protocol messages, R, sent per processor. The performance of the quantum protocol is gauged by the number of global synchronizations it would have taken to simulate the same target program. A round of protocol messages is similar to a global synchronization, although it is frequently less expensive, since in many cases, a processor does not need to wait to receive protocol messages from all other processors.
Given a target processor configuration, we found that R decreases only modestly as the number of host processors used to simulate the configuration increases. Figures 5.4, 5.5, 5.6 and 5.7 show the variation of R with the simulator modes for two representative target and host processor configurations of each benchmark. In each graph, the number of rounds of protocol messages is normalized against the number of global synchronizations of the quantum protocol. The X axis shows the simulator mode, where "N+C" refers to the NMP+CEP mode and "N+C+D" refers to the NMP+CEP+Det mode.
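The normalization used in the graphs can be made concrete with a minimal sketch. The function names and the sample counts are hypothetical, chosen only for illustration; they are not measurements from the benchmarks.

```python
def normalized_rounds(rounds_per_processor, quantum_syncs):
    """Rounds of protocol messages per processor (R), divided by the number
    of global synchronizations the quantum protocol would have needed."""
    if quantum_syncs <= 0:
        raise ValueError("quantum_syncs must be positive")
    return rounds_per_processor / quantum_syncs


def fraction_eliminated(rounds_per_processor, quantum_syncs):
    """Share of the quantum protocol's synchronizations that a mode avoids.
    A negative value means the mode sent more rounds than the quantum
    protocol would have needed synchronizations."""
    return 1.0 - normalized_rounds(rounds_per_processor, quantum_syncs)


if __name__ == "__main__":
    # Hypothetical counts, not taken from the figures.
    print(normalized_rounds(200, 1000))
    print(fraction_eliminated(200, 1000))
```

Under this reading, a normalized value below 1 means the mode needed fewer rounds than the quantum protocol needed global synchronizations.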
Considering only the CEP mode, the amount of improvement over the quantum protocol depends strongly on the average duration for which an LP (i.e., thread) executes before getting blocked. Table 5.2 shows this average duration for each benchmark and each target program configuration. L is the minimum message latency of the target machine. The 9 processor BT benchmark has the largest average uninterrupted execution time per thread, and in its simulation, the NMP+CEP mode is able to eliminate more than 80% of the global synchronizations of the quantum protocol. The 16 processor MG benchmark has the smallest average uninterrupted execution time per thread, and the NMP+CEP mode is unable to significantly reduce the number of global synchronizations of the quantum protocol. The NMP+CEP+Det mode of the simulators for the BT and SP benchmarks is able to eliminate all global synchronizations, since these benchmarks are recognizably deterministic. The MG benchmark has very few recognizably deterministic receives, and hence the NMP+CEP+Det mode is only about 5% better than the NMP+CEP mode. The LU benchmark has virtually no recognizably deterministic receives.
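As an illustration of the metric reported in Table 5.2, the following sketch computes an average uninterrupted execution time for one thread and expresses it as a multiple of L. The interval representation (pairs of simulation times at which the thread starts running and at which it blocks), the function name, and the sample values are assumptions made for this example; the simulators may collect the statistic differently.

```python
from statistics import mean


def average_uninterrupted_execution(intervals, min_latency):
    """Mean length of the (start, blocked_at) execution intervals of one
    thread, expressed in multiples of the minimum message latency L."""
    if min_latency <= 0:
        raise ValueError("min_latency must be positive")
    durations = [blocked_at - start for start, blocked_at in intervals]
    if not durations:
        raise ValueError("at least one execution interval is required")
    return mean(durations) / min_latency


if __name__ == "__main__":
    # Hypothetical trace: the thread runs over three intervals of
    # simulation time and blocks at the end of each one.
    trace = [(0, 4), (6, 12), (15, 17)]
    print(average_uninterrupted_execution(trace, min_latency=2))
```

A larger value means a thread typically covers more simulation time between blocking points.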
The performance of the CEP mode is significantly better than that of the NMP mode only for the 9 processor BT benchmark. For that benchmark, the NMP mode eliminates 40% of the global synchronizations in the quantum protocol, and the CEP mode eliminates 80%. As described in Section 4.5.2.2, this is because the CEP significantly improves over the NMP only when some LPs are far ahead of the others in simulation time, which requires the other LPs to exchange many rounds of null messages to update their simulation times. This situation is more likely to occur when the average duration of uninterrupted execution is long, as in the 9 processor BT benchmark.
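A toy model may help show why a large gap in simulation time is costly under null messages. It assumes, as a simplification not taken from the protocol description, that each round of null messages lets a lagging LP advance its simulation time by at most one lookahead (for instance, the minimum message latency L). The function name and the numbers are hypothetical.

```python
import math


def null_message_rounds(gap, lookahead):
    """Rounds of null messages a lagging LP needs to close a gap in
    simulation time, assuming each round advances its clock by at most
    one lookahead. This is a simplification for illustration only."""
    if lookahead <= 0:
        raise ValueError("lookahead must be positive")
    if gap <= 0:
        return 0  # the LP is not behind, so no catch-up rounds are needed
    return math.ceil(gap / lookahead)


if __name__ == "__main__":
    # Hypothetical values: the farther ahead the leading LP is,
    # the more rounds the lagging LPs have to exchange.
    for gap in (1.0, 10.0, 100.0):
        print(gap, null_message_rounds(gap, lookahead=1.0))
```

In this simplified model the number of rounds grows linearly with the gap, which is consistent with the observation that long uninterrupted executions favor the CEP over the NMP.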
The NMP mode almost never performs better than the CEP mode, and the NMP+CEP mode is not significantly better than the CEP mode alone. This is because all the benchmarks predominantly use a single communicator, and consequently the null message protocol is unable to extract and use information on the communication topology.
Table 5.2: Average Uninterrupted Execution Time
Figure 5.4: Performance of Simulators for SP
Figure 5.5: Performance of Simulators for BT
Figure 5.6: Performance of Simulators for MG
Figure 5.7: Performance of Simulators for LU