Parallel Switch-Level Circuit Simulation
Yu-an Chen and Rajive Bagrodia
Circuit simulation is a significant bottleneck in the design of VLSI circuits and parallel execution offers the potential for considerably reducing the simulation time for large circuits. We have developed MIRSIM, a parallel switch-level circuit simulator. This is a Maisie implementation of IRSIM, an existing event-driven simulator that incorporates a linear model of MOS transistors. The implementation has been used to simulate a number of circuits using both conservative and optimistic protocols on the IBM SP2 using a variety of circuit partitioning techniques.
An interactive circuit partitioning environment
For parallel execution, the model must first be divided into a number of partitions, each of which may potentially be executed on a processor. The objective of the partitioning algorithm is to generate balanced partitions (to try and achieve a uniform distribution of the computation) such that a minimum number of nets are cut (to minimize the message communication among the corresponding partitions in the simulation).

Figure 1: An interactive circuit partitioning interface
In addition to the existing iterative partitioning algorithms (K-FM), we also support acyclic algorithms (K-AFM) that impose the constraint that generated partitions do not form cycles. The preceding set of partitioning algorithms can be invoked automatically for a given circuit. A graphical facility has been provided to allow a designer to manually partition the circuit (Figure 1). The hierarchical structure of a circuit is displayed as a tree. Circles represent leaf nodes and rectangles represent higher level functional blocks. A circuit designer can use his/her knowledge about the structure of the circuit to manually partition components into approximately balanced parts with low communication overheads. Manual partitioning method can also be combined with other partitioning algorithms. A designer can assign an initial manual partition to the circuit and apply automatic partitioning algorithms on some of the sub-partitions to obtain a balanced decomposition of the circuit into a larger number of partitions
Multiple synchronization protocols
In order to simulate distributed events in the correct global order, MIRSIM provides the following synchronization algorithms:
Benchmarks
A set of circuits used in this performance study are listed in Table 1. For each circuit, we indicate the simulation time using IRSIM and the sequential execution of MIRSIM. As seen from the figure, on average MIRSIM is only 2.1% slower. A major difference between IRSIM and MIRSIM is that the former keeps track of all state transition of all nodes for interactive debugging. However, MIRSIM only keeps the state histories of nodes specified in the command file; this accounts for the few cases where MIRSIM is actually faster. Although it is hard to quantify the performance impact of the additional trace for MIRSIM, preliminary experiments indicate that it is not a significant overhead.
|
Circuits |
# of trans. |
# of input |
IRSIM |
MIRSIM |
Slow down |
|
Square |
15620 |
20 |
8.9 |
8.4 |
-5% |
|
Filter |
13464 |
100 |
30.3 |
30.4 |
0.3% |
|
Chip2u |
3336 |
320 |
23.8 |
24.4 |
2.5% |
|
DDFS |
11495 |
100 |
34.7 |
36.7 |
5.7% |
|
CDMA |
87030 |
439 |
68.7 |
70.2 |
2.1% |
|
CODEC |
27528 |
204 |
87.1 |
90.1 |
3.4% |
Table 1: Six benchmarks for MIRSIM
Figure 2 summarizes the results of the simulation study for the VLSI circuits. The experiments were executed on a 32-node IBM SP2 running AIX -- each node is a RS/6000 workstation processor with 256M bytes of main memory and 66.7MHz clock speed. For each circuit in our test suite, we present the best speedup that was obtained using both conservative and optimistic algorithms and the specific partitioning algorithm that yielded the corresponding result. The speedup is measured related to the original IRSIM execution. As can be seen from the figure, no single synchronization or partitioning algorithm dominated. However, where feasible, acyclic partitions appeared to yield the best performance. Acyclic partitions yield better performance as they eliminate circular dependencies in the parallel model. Consider a circuit decomposition that contains two partitions P1 and P2. Assume that the output from some transistor in P1 is an input to another transistor that is in P2. This requires that a message be sent from P1 to P2 whenever the value of this output changes; we say that P1 is dependent on P2. In an acyclic partitioning, if a partition, say Pi, is dependent on Pj, then Pj will not be dependent on Pi and vice versa. For conservative algorithms, this significantly improves the lookahead properties of the entities resulting in a dramatic reduction in null messages. For optimistic algorithms, it reduces the number of rollbacks, although the improvements in execution time are not as dramatic.
Where acyclic partitions were not feasible, manual partitioning yielded better performance as the designer could use structural information from the circuits to minimize communication and synchronization overheads. The figure also includes the speedup predicted by the ISP protocol. Interestingly, in many cases, the performance of the simulator is close to the lower bound predicted by ISP, which clearly indicates that further improvements in performance are likely to come from better partitioning algorithms or more efficient IPC implementations than from any improvements in the synchronization mechanism itself.

Figure 2: Circuit Simulation Results