

# Handling Latch Loops in Timing Analysis with Improved Complexity and Divergent Loop Detection

Xizhe Shi<sup>1\*</sup>, Zizheng Guo<sup>1,2\*</sup>, Yibo Lin<sup>1,2,3†</sup> Runsheng Wang<sup>1,2,3</sup>, Ru Huang<sup>1,2,3</sup>

<sup>1</sup>School of Integrated Circuits, Peking University <sup>2</sup>Institute of Electronic Design Automation, Peking University

<sup>3</sup>Beijing Advanced Innovation Center for Integrated Circuits

Email: xizheshi@stu.pku.edu.cn, {gzz, yibolin, r.wang, ruhuang}@pku.edu.cn

**Abstract**—Latch loops introduce feedback cycles in timing graphs for static timing analysis (STA), disrupting timing propagation in topological order. Existing timers handle latch loops by checking the convergence of global iterations in timing propagation without lookahead detection of divergent loops. Such a strategy ends up with the worst-case runtime complexity  $O(n^2)$ , where  $n$  is the number of pins in the timing graph. This can be extremely time-consuming, when  $n$  goes to millions and beyond. In this paper, we address this challenge by proposing a new algorithm consisting of two steps. First, we identify the strongly connected components (SCCs) and levelize them into different stages. Second, we implement parallelized arrival time (AT) propagation between SCCs while conducting sequential iterations inside each SCC. This strategy significantly reduces the runtime complexity to  $O(\sum_i k_i^2)$  from the previous global propagation, where  $k_i$  is the number of pins in each SCC. Our timer also detects timing information divergent loops in advance, avoiding over-iteration. Experimental results on industrial designs demonstrate  $10.31 \times$  and  $8.77 \times$  speed-up over PrimeTime and OpenSTA on average, respectively.

## I. INTRODUCTION

STA plays an essential role in the electronic design automation (EDA) flow for assessing circuit timing under varying scenarios [1]. Therefore, as integral components of EDA tools, STA engines are vital for identifying timing violations at different design stages, aiding in circuit verification [2]. As sequential elements exhibiting distinct timing behaviors compared to flip-flops (FFs), latches are widely used in modern VLSI circuits for their numerous benefits, including reduced area, lower power, and time borrowing capability [3], [4].

The STA procedure first converts the circuits to directed acyclic graphs (DAGs) and analyzes them in a level-by-level manner following the topological order. However, the incorporation of latches into a circuit may result in the emergence of feedback loops. Unlike combinational logic loops that are likely to lead to divergent AT results, latch feedback loops may create convergent AT propagation but still destroy the directed acyclic nature of the original graph. This may cause a failure of timing analysis or lead to incorrect timing results [5].

\* Equal contribution, † Corresponding author.

This project is supported in part by the Natural Science Foundation of Beijing, China (Grant No. Z230002), the National Science Foundation of China (Grant No. T2293701), and the 111 Project (B18001).

In order to address feedback loops resulting from latches, subsequently referred to as latch loops, a series of methods have been employed in industry. In early industrial EDA tools, STA engines employed cycle cutting technology in the detected latch loops, selecting some timing arcs in the loops as loop breaker arcs to remove [6]. Although this makes the resulting graph acyclic, it prevents timing propagation in the selected broken timing arcs, which may turn out to be critical. As a result, this method offers high efficiency in analysis but may occasionally compromise accuracy.

To derive accurate timing reports considering latch loops, various academic research works have proposed different algorithms. The approach proposed in [7] computes circuit yields at the cost of additional extraction of constraint graphs by verifying the absence of positive or negative loops in the corresponding constraint graphs. In [8], graph decomposition methods are initially employed, with subsequent improvements made to the form of graph traversal. However, this heuristic algorithm also inherently results in some loss of accuracy. The research in [9] improves accuracy by iteratively propagating timing information to analyze latch loop convergence. However, this method increases analytical complexity and requires an additional build-up of the reduced timing graph (RTG). The research work in [10], optimizes RTG-based convergence detection, reducing iterations and STA runtime.

Most of the mentioned methods improve accuracy but incur much higher time complexity, often quadratic, due to additional construction of RTGs or other auxiliary graphs. Therefore, these methods lack practicality and are not widely adopted in industry. As of today, leading-edge STA engines including PrimeTime [11] and OpenSTA [12] still use a vanilla iteration-till-converge approach to handle latch loops. Although this approach works well in practice, its worst theoretical complexity can also reach  $O(n^2)$ .

In this work, we propose a novel algorithm framework to accelerate the STA process for circuits with latch loops. Our timer, based on this framework, can ground-breakingly predict divergent loops and outperform widely used tools like PrimeTime and OpenSTA in runtime. We summarize our contributions as follows:

- 1) We propose a novel graph decomposition algorithm for circuits containing latch loops according to loop locality.



Fig. 1: Graph representation of Flip-Flops and latches in GBA.



Fig. 2: Graph representation of latch loop in GBA.

Thanks to this partitioning, we achieve parallel waste-free latch AT propagation without accuracy loss.

- 2) We integrate an efficient algorithm using SCC-based iteration into loop analysis to enable early detection of divergent loops and prevent excessive iterations of AT propagation.
- 3) Our algorithm has provably better runtime complexity compared to prior works. We prove that our algorithm runs in  $O(\sum_i k_i^2)$  where  $k_i$  denotes the number of pins in each SCC. It is much smaller than the  $O(n^2)$  algorithm used by PrimeTime and OpenSTA where  $n$  is the full graph size.

We integrate our algorithm into a cutting-edge STA engine and demonstrate an average speed-up of  $10.31\times$  over PrimeTime and  $8.77\times$  over OpenSTA on large circuit designs.

## II. PRELIMINARIES

### A. Latch Loops in Graph-Based Timing Analysis

STA is typically deployed in two phases: graph-based analysis (GBA) and path-based analysis (PBA). GBA uses a graph-based approach to analyze timing across circuits, while PBA targets specific critical paths to identify and optimize timing violations. In GBA, the circuit is represented as a DAG, with nodes as pins and edges as arcs carrying timing information.

In latches, timing information such as AT at output pin can be derived from data input pin after timing checks due to their time borrowing property, which is different from FFs. So when representing latches in a graph, there exists an arc from data input pin to data output pin, as shown in Figure 1.

Figure 2 illustrates how a latch loop can create a feedback loop in the extracted graph, making the conventional topological timing propagation approach based on DAGs inapplicable. Thus, circuits with latch loops require a different methodology.

### B. AT Propagation and Convergence in Latch Loops

In STA, arcs in the circuit graph are annotated with minimum and maximum delays to account for process variations, creating the early/late split model for hold/setup violation checks. In the late case of the model, the propagation of timing information, such as ATs, via relaxation always preserves the

larger possible values during the updating process [13]. However, once feedback loops exist in circuits, the propagation of AT must be operated iteratively until convergence. Divergence in this process will result in unbounded increases in ATs, leading to setup timing violations at endpoints.

Unlike combinational loops, latch loops contain D-Q arcs of latches. The relaxation process of them is more specific, requiring setup time checks and potentially resulting in a decreased AT result due to the change of clock domains [7], [10]. The presence of decreased AT values is fundamental to the convergence of ATs during loop iterations, indicating that convergence is possible for latch loops.

### C. Global Iteration for AT Propagation in Current Timers

In a DAG, AT propagation does not require repeated iterations; however, loops or SCCs consisting of multiple pins (MSCCs) necessitate them, until ATs converge or diverge.

Taking the late case as an example, current timers first break all D-Q arcs to ensure AT propagation on the DAG. Subsequent setup time checks determine if the Q pins' ATs are updated by the D pins, requiring re-updates of downstream ATs if true. Its worst-case complexity can be up to  $O(n^2)$ , leading to unacceptable runtimes with a high pin count [14]. Moreover, their iterative process lacks preemptive assessment of potential divergence scenarios and does not terminate early but iterates until a timing violation occurs.

Figure 3 reveals how AT propagates in global iteration in a latch loop chain. The number of AT updates per loop continues to grow linearly as the order of the loops increases from left to right. Assuming each update requires a fixed amount of time, the total runtime will be the sum of these updates, increasing quadratically with the total number of loops.

## III. ALGORITHMS

To address the limitations of current timers in AT propagation in circuits with latch loops, we hope to introduce SCC-based AT iteration. According to graph theory, a directed graph can be decomposed into a collection of SCCs [15]. This generates a DAG of SCCs, after which the longest-path algorithms can be applied to propagate AT inside each SCC, with AT iterations carried out in topological order for each SCC. In this case, ATs within each SCC converge before propagating into downstream SCCs, reducing the runtime overhead caused by repeated updates compared to the AT propagation in Figure 3.

Following these ideas, our algorithm first decomposes the extracted graph into SCCs. Subsequently, after performing levelization on SCCs, AT iterations are then confined inside each SCC to reduce complexity, while parallel AT propagation between SCCs is enabled to further enhance efficiency. The overall flow of our algorithm is shown in Algorithm 1 and Figure 4.

### A. Graph Decomposition and SCC Levelization

To decompose the circuit graph into SCCs first, we adopt the ideas of Tarjan's algorithm, whereby the full graph is traversed using depth-first search (DFS) [15].



Fig. 3: Global AT propagation iterations of latch loop chains in current timers.

---

**Algorithm 1:** Overall-Flow-AT-Propagation(Graph G)

---

```

1 for each pin  $v$  in the circuit graph  $G$  do
2    $DFS-SCC(v)$ 
3    $SCC-Levelization(G)$ 
4    $AT-Prop-In-Parallel-Timing-Checks(G)$ 
```

---

Algorithm 2 details the workings of graph decomposition. In this algorithm, an *index* is assigned to each pin, with the search order stored in the array  $dfn$ . The array  $low$  holds the earliest order of any pin in the *stack* that can be traced back for pin  $u$  or its subtree. We invoke Algorithm 2 on all pins in the graph, achieving a linear complexity of  $O(n)$ , where  $n$  is the number of pins in the circuit graph [15]. Ultimately, the pins contained in each SCC are stored in the corresponding element of  $sccs$ . In turn, we also use  $scc\_id[pin]$  to store the SCC number to which  $pin$  belongs.

---

**Algorithm 2:** DFS-SCC(Pin  $v$ ).

---

```

1  $index \leftarrow index + 1$ 
2  $dfn[v] \leftarrow index$ ,  $low[v] \leftarrow index$ 
3  $Push(stack, v)$ 
4 for each  $v$ 's fanout arc  $e$  do
5    $u \leftarrow destination\_pin[e]$ 
6   if  $u$  is not visited then
7     mark  $u$  as visited
8      $DFS-SCC(u)$ 
9      $low[v] \leftarrow min(low[v], low[u])$ 
10  else if  $u$  is in stack then
11     $low[v] \leftarrow min(low[v], dfn[u])$ 
12  if  $low[v] = dfn[v]$  then
13    initialize a new SCC  $scc$ 
14    while stack is not empty do
15       $u \leftarrow Pop(stack)$ 
16       $Pushback(scc, u)$ 
17    assign  $scc\_id[pin]$  for all pins in  $scc$ 
18     $Pushback(sccs, scc)$ 
```

---

After the decomposition process, the pin-based Directed Cyclic Graph is transformed into an SCC-based DAG. These

SCCs are then leveled into different stages to enable parallel AT propagation among SCCs at the same level while ensuring the correct propagation order.

Algorithm 3 determines the level of each SCC in three steps. First, the fanin count of each SCC is computed and stored in  $nFanin[scc]$ . This step involves examining each fanout arc of every pin. If an arc points to a pin of an external SCC, the fanin count of the target SCC is incremented. In the second step, SCCs with an initial  $nFanin[scc]$  of 0 are assigned to  $levels[0]$  as the first-level SCCs. Finally, starting from  $levels[0]$ , each successive level of SCCs is determined iteratively and stored into  $levels[level\_i]$ . It is worth noting that the Levelization process can also be parallelized.

---

**Algorithm 3:** SCC-Levelization(Graph G).

---

```

1 for each pin  $v$  in the circuit graph  $G$  do
2   for each  $v$ 's fanout arc  $e$  do
3      $u \leftarrow destination\_pin[e]$ 
4     if  $scc\_id[u]! = scc\_id[v]$  then
5        $scc \leftarrow sccs[scc\_id[u]]$ 
6        $nFanin[scc] \leftarrow nFanin[scc] + 1$ 
7   for each  $scc$  in  $sccs$  with  $nFanin[scc] = 0$  do
8      $Pushback(levels[0], scc)$ 
9    $level\_i \leftarrow 0$ 
10  while  $levels[level\_i]$  is not empty do
11     $level\_i \leftarrow level\_i + 1$ 
12    for each  $scc$  in  $levels[level\_i - 1]$  do
13      for each pin  $v$  in  $scc$  do
14        for each  $v$ 's fanout arc  $e$  do
15           $u \leftarrow destination\_pin[e]$ 
16          if  $scc\_id[u]! = scc\_id[v]$  then
17             $scc\_u \leftarrow sccs[scc\_id[u]]$ 
18             $nFanin[scc\_u] \leftarrow$ 
19             $nFanin[scc\_u] - 1$ 
20            if  $nFanin[scc\_u] = 0$  then
21               $Pushback(levels[level\_i], scc\_u)$ 
```

---

### B. Local Iteration of AT Inside Each SCC

To avoid global iteration of AT in the current STA engines discussed in Section II, which results in quadratic complexity,

our algorithm localizes the AT iteration inside each SCC.

It is essential to clarify that the localization of the AT iteration occurs at the level of individual SCCs, as outlined in Algorithm 4. For single-pin SCCs, the AT is computed by relaxing all fanin arcs. Note that when relaxing D-Q arcs of latches, a setup time check is performed simultaneously.

---

**Algorithm 4:** AT-Prop(SCC scc).

```

1 if size(scc) = 1 then
2   for each v's fanin arc e do
3     Relax arc e
4 else
5   for each pin u in scc do
6     Enque(queues[scc], u)
7     q_cnt[u] ← q_cnt[u] + 1
8   while queues[scc] is not empty do
9     u ← Pop(queues[scc])
10    for each u's unrelaxed fanin arc e do
11      from ← source_pin[e]
12      if scc_id[u]! = scc_id[from] then
13        Relax arc e
14    for each u's fanout arc e do
15      to ← destination_pin[e]
16      if scc_id[u] = scc_id[to] then
17        Relax arc e
18        if AT[to] is changed && to is not in
19          queues[scc] then
20          Enque(queues[scc], to)
21          q_cnt[to] ← q_cnt[to] + 1
22        if q_cnt[to] > size(scc) then
23          Error(Divergent Loop!)

```

---

But for MSCCs, the iteration of AT inside each SCC is realized by an adapted SPFA. As described in Algorithm 4, all internal pins of the SCC must be enqueued into the pending queue  $queues[scc]$ , and the corresponding element in the array  $q\_cnt$ , which tracks the number of times each pin is enqueued, should be initialized to one.

AT propagation is then executed for each pin in the queue, beginning with relaxations on its fanin arcs originating from other SCCs. This process marks the relaxed arc to prevent further relaxations, as the AT value of its source pin must have been fixed by the time it is relaxed. Therefore, excess relaxations more than once will not update the AT of the destination pin again. The necessary relaxations for the AT iteration, similar to those in SPFA, actually occur on the arcs between pins inside the SCC. For each pin, we relax its fanout arcs inside the SCC; if this alters the AT of a destination pin not in  $queues[scc]$ , we should get that pin enqueued for further iteration. This is due to the fact that the AT of this pin has not yet converged and the updated AT may affect the ATs of its fanout pins. This process will continue until the queue is emptied.

However, the number of queue entries for each pin must be limited to avoid infinite iterations. For possible ATs that do

not converge, our algorithm employs a judgment mechanism similar to SPFA. If a pin enters the queue more times than the total number of pins in the SCC, it indicates both a non-converging AT and a divergent loop inside the SCCs.

According to [16], the complexity of Algorithm 4 adopted from SPFA for a single SCC is  $O(ke)$ , where  $k$  is the number of pins and  $e$  is the number of arcs. Since the number of pins and arcs in timing graphs are often similar or linear, the complexity can be approximated as  $O(k^2)$ .

### C. Parallel AT Propagation between SCCs and Timing Checks

After the graph decomposition and levelization of SCCs, AT can propagate level by level, with parallelism achieved within each level. Algorithm 5 illustrates the parallel framework for AT propagation and the final timing checks. AT propagates in parallel among the SCCs, starting from those in  $levels[0]$  and continuing to the final level. After AT propagation, timing checks are performed, with setup time checks for all latches already completed during propagation, leaving only other timing checks to be conducted also in parallel.

---

**Algorithm 5:** AT-Prop-In-Parallel-Timing-Checks(G)

```

1 for level_i in 0..levels.size() do
2   for each scc in levels[level_i] do
3     AT-Prop(scc)
4 for each endpoint in endpoints do
5   slack[endpoint] ← TimingChecks(endpoint)

```

---

### D. Complexity Analysis

To justify the efficiency of our algorithm, we derive the following theory results:

**Theorem 1.** *The time complexity of the overall flow of the algorithm above is  $O(\sum_i k_i^2)$ , where  $k_i$  denotes the size of SCC numbering  $i$ .*

*Proof.* The time complexity has three parts, *Graph decomposition*, *SCC Levelization* and *AT propagation*. Graph decomposition involves a DFS traversal of the entire graph, while SCC levelization uses a BFS traversal, both taking  $O(n)$  time. The time complexity of AT iteration for each SCC with  $k_i$  pins is  $O(k_i^2)$ , and overall complexity is their sum,  $O(\sum_i k_i^2)$ . Since  $\sum_i k_i$  is equal to  $n$ , it is clear that  $\sum_i k_i^2$  is greater than  $n$ . Therefore, the overall time complexity of the algorithm is  $O(\sum_i k_i^2)$ .  $\square$

It is clear that  $\sum_i k_i^2 < (\sum_i k_i)^2 = n^2$ , so this method will have a complexity advantage in the case of multiple MSCCs considering the cost of graph decomposition.

## IV. EXPERIMENTAL RESULTS

### A. Experimental Setup

We implement our STA flow algorithm for latch-loop analysis using high-performance C++ and Rust. Experiments are conducted on a 64-bit Linux machine with a 128-core Intel Xeon Platinum 8358 CPU running at a base frequency



Fig. 4: Overall flow of our algorithm on AT propagation.

TABLE I: Overall efficiency comparison between PrimeTime, OpenSTA and our timer on benchmarks.

| Benchmark           | Circuit_Statistics |         |        |       |            | PT_RT<br>8 Threads | OpenSTA_RT<br>8 Threads | Ours_RT  |           | PT_RTR<br>8 Threads | OpenSTA_RTR<br>8 Threads | Ours_TRTR |           |
|---------------------|--------------------|---------|--------|-------|------------|--------------------|-------------------------|----------|-----------|---------------------|--------------------------|-----------|-----------|
|                     | #Gates             | #Pins   | #MSCCs | #APPS | #Endpoints |                    |                         | 1 Thread | 8 Threads |                     |                          | 1 Thread  | 8 Threads |
| SDLL                | 5                  | 15      | 1      | 10    | 16         | 305.83             | 52.84                   | 0.008    | 0.01      | 30583               | 5284                     | 0.8       | 1.00      |
| ELLC_1000           | 72000              | 184018  | 8000   | 18    | 128000     | 278.12             | 179.61                  | 1.28     | 0.85      | 327.20              | 211.31                   | 1.51      | 1.00      |
| ELLC_2000           | 144000             | 368018  | 16000  | 18    | 256000     | 395.01             | 684.55                  | 2.47     | 1.59      | 248.43              | 430.53                   | 1.55      | 1.00      |
| ELLC_3000           | 216000             | 552018  | 24000  | 18    | 384000     | 859.59             | 1660.48                 | 3.78     | 2.46      | 349.42              | 674.99                   | 1.53      | 1.00      |
| ELLC_4000           | 288000             | 736018  | 32000  | 18    | 512000     | 1495.11            | 3148.02                 | 5.02     | 2.97      | 503.40              | 1059.93                  | 1.69      | 1.00      |
| ac97_ctrl_latch     | 17480              | 48537   | 642    | 14    | 9244       | 2.46               | 1.21                    | 0.31     | 0.21      | 11.71               | 5.76                     | 1.47      | 1.00      |
| aes_core_latch      | 23788              | 68453   | 94     | 55    | 4212       | 0.76               | 0.64                    | 0.45     | 0.31      | 2.45                | 2.06                     | 1.45      | 1.00      |
| b19_iccad_latch     | 269553             | 815147  | 1106   | 557   | 84106      | 79.63              | 30.50                   | 20.91    | 16.16     | 4.93                | 1.89                     | 1.29      | 1.00      |
| leon2_iccad_latch   | 1748121            | 4543132 | 18774  | 71    | 605884     | 485.62             | 491.12                  | 54.28    | 30.28     | 16.04               | 16.22                    | 1.79      | 1.00      |
| leon3mp_iccad_latch | 1363261            | 3582603 | 16955  | 64    | 911134     | 453.43             | 489.90                  | 53.69    | 31.83     | 14.25               | 15.39                    | 1.69      | 1.00      |
| netcard_iccad_latch | 1625851            | 4248741 | 24788  | 52    | 863618     | 520.78             | 561.90                  | 61.80    | 33.03     | 15.76               | 17.01                    | 1.87      | 1.00      |
| vga_lcd_latch       | 144626             | 396023  | 4974   | 10    | 129454     | 69.08              | 43.69                   | 11.07    | 7.66      | 9.02                | 5.70                     | 1.45      | 1.00      |
| vga_lcd_iccad_latch | 256899             | 628677  | 6690   | 16    | 121932     | 118.33             | 86.91                   | 17.91    | 14.20     | 8.33                | 6.12                     | 1.26      | 1.00      |

RT: Runtime in seconds. RTR: Runtime ratio of this timer to our timer at 8 threads. APPS: Average number of pins in each SCC.

TRTR: Runtime ratio of 1 thread to 8 threads in our timer.



Fig. 5: Chain of  $n$  latch loops.



Fig. 6: Circuit construction for industrial benchmarks [10].

of 2.60 GHz and 1.0 TB of memory. The tests utilized different numbers of threads for evaluation. Previous academic methods lack practicality and industry adoption, as discussed in Section I, so we compare our timer with two advanced STA engines: PrimeTime [11] and OpenSTA [12].

We validate our algorithm on two sets of benchmarks. Since our primary objective is to examine the acceleration of AT propagation, we preset the cell delays of these benchmarks in each timer to eliminate discrepancies arising from different delay computation models, meanwhile setting the net delays to zero for simplicity as well. In the first set, five special cases were constructed to validate the detection of divergent loops and to demonstrate the superiority of our algorithm in time complexity. In the second set, to examine the advantages of our timer on real designs, we constructed eight industrial circuits from TAU contests [17] using the method in [18], with some other changes to the netlist, lib and sdc files. 60% FFs in the initial circuits were converted into latches, and

then latches and some combinational logic were duplicated, as illustrated in Figure 6. The clocks clk\_1 and clk\_2 formed a two-phase clock scheme and were set to inverted clocks with clock phase shift  $T/2$  and duty cycle 0.5. We have uploaded two examples that allow readers to reproduce and observe the prolonged PrimeTime runtimes, which stem from the presence of divergent loops and the methodology adopted by the timer in handling latch-loop circuits.<sup>1</sup>

For accuracy check of our timer, it is evaluated by comparing the Mean Absolute Error (MAE) of each endpoint slack between our timer and PrimeTime. When testing the runtime of three timers, it is assessed by the execution time of command `report_timing` in PrimeTime, `report_checks` in OpenSTA, and can be monitored by command `time` in TCL tools. Our timer is timed with the help of rust's internal `std :: time` library. Note that all the runtime values reported are collected by averaging 10 runs of each experiment.

#### B. Accuracy and Runtime Results of Industrial Benchmarks

We finally compare the runtime results obtained by three timers at different threads, as shown in Table I. For accuracy checks, our results perfectly match PrimeTime's endpoint slacks, with all MAEs less than 0.01%.

In eight industrial benchmarks, as depicted in Figure 7 and Table I, our timer exhibits superior accuracy with an average speedup ratio of  $10.31\times$  and  $8.77\times$  at eight threads compared to PrimeTime and OpenSTA, respectively. Despite including sequential components, such as the non-parallelized graph decomposition, our timer still achieves an average  $1.5\times$  speedup at 8 threads compared to single-thread scenarios, allowing it to outperform the baselines, even their fully parallelized versions. Concurrently, especially for graphs with a

<sup>1</sup><https://github.com/xiaoshixuexi/Latch-Loop-Examples>



Fig. 7: Runtime ratio of PrimeTime and OpenSTA over our timer at 8 threads.



Fig. 8: Runtime values at different numbers of threads for three timers on two benchmarks.



Fig. 9: Runtime breakdown on design **netcard\_iccad** at 8 threads.

greater number of MSCCs such as `leon2_iccad_latch`, `leon3mp_iccad_latch` and `netcard_iccad_latch`, our timer exhibits a more pronounced speedup up to around  $17\times$  and  $15\times$  at eight threads, which aligns with the effect of localized AT iterations inside SCCs.

As observed in the runtime speedup at multi-threaded tests, our timer successfully implements parallelization for AT propagation between SCCs. Figure 8 shows runtime values for the benchmarks `leon2_iccad_latch` and `netcard_iccad_latch` at different threads. It is clear that runtimes of three timers saturate at 8 threads and our timer exhibits good parallelism.

We further investigate the performance of our algorithms by conducting a runtime breakdown across different components. As illustrated in Figure 9, compared to the longer time costs associated with iterative AT propagation and endpoint checks in commercial timers, our algorithm achieves significant gains by leveraging approximately one-third of the total runtime for preparation time within this circuit. This also suggests that future optimization efforts will be more focused on graph decomposition and AT propagation.

### C. Divergent Loop Detection

Now we examine the results on the first set of benchmarks in Table I. Figure 2 shows the circuit graph for the first special

Fig. 10: Runtime result of four ELLC cases at single thread.

case named single-divergent-latch-loop (SDLL), which is in fact a latch loop comprising four latches and an AND gate. It is designed as a circuit with a divergent loop with large clock period and a small total loop delay. As is shown in Table I, over-iteration of PrimeTime and OpenSTA results in runtime ranging from 50 to 300 seconds, while our timer detects the divergent loop in microseconds.

### D. Analysis on Cases of Eight-Latch-Loop-Chain

Figure 5 shows the circuit designs of the remaining four special cases named eight-latch-loop-chain (ELLC). These benchmarks all consist of eight latch loop chains, each chain of which only differs in the number of latch loops.

As can be observed in Table I and Figure 10, the runtime of our timer increases in a linear relation to the loop number, achieving a speedup of hundreds of times in processing these circuits in comparison to PrimeTime and OpenSTA in both single-threaded and multi-threaded experiments. The runtimes of the latter two timers, on the other hand, demonstrate a clear square relationship, which matches our statement in Section II.

## V. CONCLUSION

In this work, we propose a novel STA algorithm for circuits containing latch loops. By strategically partitioning the graph into SCCs and leveraging parallelized AT propagation, we make full use of loop locality in latch-enabled circuits. The innovative incorporation of the SPFA for local iteration inside SCCs and the pre-identification of divergent loops have reduced the worst-case complexity of the algorithm to  $O(\sum_i k_i^2)$  where  $k_i$  is the SCC size. Experimental evaluations on industrial designs confirm the superiority of our timer, with an average speed-up of  $10.31\times$  over PrimeTime and  $8.77\times$  over OpenSTA. Our future work includes attempts to parallelize and accelerate the AT iteration within each SCC, as well as applying it to automatic time borrowing flows.

## REFERENCES

- [1] K. Kang, B. C. Paul, and K. Roy, "Statistical timing analysis using leveled covariance propagation," in *Design, Automation and Test in Europe*. IEEE, 2005, pp. 764–769.
- [2] J. Bhasker and R. Chadha, *Static timing analysis for nanometer designs: A practical approach*. Springer Science & Business Media, 2009.
- [3] S. Paik, L.-e. Yu, and Y. Shin, "Statistical time borrowing for pulsed-latch circuit designs," in *2010 15th Asia and South Pacific Design Automation Conference (ASP-DAC)*. IEEE, 2010, pp. 675–680.
- [4] H. Cheng, X. Li, Y. Gu, and P. A. Beerel, "Saving power by converting flip-flop to 3-phase latch-based designs," in *2020 Design, Automation & Test in Europe Conference & Exhibition (DATE)*. IEEE, 2020, pp. 574–579.
- [5] N. Xiromeritis, S. Simoglou, C. Sotiriou, and N. Sketopoulos, "Graph-based sta for asynchronous controllers," in *2019 29th International Symposium on Power and Timing Modeling, Optimization and Simulation (PATMOS)*. IEEE, 2019, pp. 9–16.
- [6] S. Simoglou, C. Sotiriou, and N. Blias, "Timing errors in sta-based gate-level simulation," in *2020 26th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC)*. IEEE, 2020, pp. 1–2.
- [7] R. Chen and H. Zhou, "Statistical timing verification for transparently latched circuits," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, vol. 25, no. 9, pp. 1847–1855, 2006.
- [8] X. Yuan and J. Wang, "Statistical timing verification for transparently latched circuits through structural graph traversal," in *2010 15th Asia and South Pacific Design Automation Conference (ASP-DAC)*. IEEE, 2010, pp. 663–668.
- [9] L. Zhang, J. Tsai, W. Chen, Y. Hu, and C. C.-P. Chen, "Convergence-provable statistical timing analysis with level-sensitive latches and feedback loops," in *Proceedings of the 2006 Asia and South Pacific Design Automation Conference*, 2006, pp. 941–946.
- [10] B. Li, N. Chen, and U. Schllichtmann, "Statistical timing analysis for latch-controlled circuits with reduced iterations and graph transformations," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, vol. 31, no. 11, pp. 1670–1683, 2012.
- [11] Synopsys, *PrimeTime*, version: S-2021.06-SP1, Synopsys, Inc., Mountain View, California, USA, 2021, synopsys, Inc. Software. [Online]. Available: <https://www.synopsys.com/implementation-and-signoff/signoff/primetime.html>
- [12] "OpenSTA, version: 2.5.0," <https://github.com/abk-openroad/OpenSTA>.
- [13] R. Chen, L. Zhang, V. Zolotov, C. Visweswariah, and J. Xiong, "Static timing: back to our roots," in *2008 Asia and South Pacific Design Automation Conference*. IEEE, 2008, pp. 310–315.
- [14] M. Panda and A. Mishra, "A survey of shortest-path algorithms," *International Journal of Applied Engineering Research*, vol. 13, no. 9, pp. 6817–6820, 2018.
- [15] R. Tarjan, "Depth-first search and linear graph algorithms," *SIAM journal on computing*, vol. 1, no. 2, pp. 146–160, 1972.
- [16] "Segmented spfa: An improvement to the shortest path faster algorithm," <https://konaeakira.github.io/posts/segmented-spfa-an-improvement-to-the-shortest-path-faster-algorithm.html>.
- [17] J. Hu, G. Schaeffer, and V. Garg, "TAU 2015 contest on incremental timing analysis," in *Proc. ICCAD*. IEEE, 2015, pp. 882–889.
- [18] T. G. Szymanski, "Computing optimal clock schedules," in *[1992] Proceedings 29th ACM/IEEE Design Automation Conference*. IEEE, 1992, pp. 399–404.