

# International Symposium on Physical Design



## IncreMacro: Incremental Macro Placement Refinement

Yuan Pu<sup>1,2</sup>, Tinghuan Chen<sup>1</sup>, Zhuolun He<sup>1,2</sup>, Chen Bai<sup>1</sup>,  
Haisheng Zheng<sup>2</sup>, Yibo Lin<sup>3</sup>, Bei Yu<sup>1</sup>

<sup>1</sup>The Chinese University of Hong Kong, Hong Kong SAR

<sup>2</sup>Shanghai AI Laboratory, Shanghai, China

<sup>3</sup>Peking University, Beijing, China



# Outline

① Introduction

② Algorithm

③ Experimental Results

④ Conclusion

# Introduction

# Background

- Macro placement plays an important role for the QoR of the following physical design flows (e.g. std placement, routing).
- Analytical placers (e.g. DREAMPlace<sup>1</sup>) and AutoDMP<sup>2</sup> may cause **macro blockage**, leading to discontinuous space for standard cell placement, thus bad wirelength, routability and timing.



(a) Macro placement by analytical placers.  
(b) Macro placement by DREAMPlace

<sup>1</sup>Yibo Lin et al. (2019). “Dreamplace: Deep learning toolkit-enabled GPU acceleration for modern vlsi placement”. In: *Proc. DAC*, pp. 1–6.

<sup>2</sup>Anthony Agnesina et al. (2023). “AutoDMP: Automated DREAMPlace-based Macro Placement”. In: *Proc. ISPD*, pp. 149–157.

# Previous Work<sup>3</sup>

- Data Structure-based Metaheuristics
- Methodology:
  - Three-stage macro placement. (**placement prototype** → **macro placement** → **standard cell placement**).
  - Macro positions → specific data structure (e.g., MP tree<sup>3</sup>).
  - Simulated Annealing: iteratively perturbation.
- Objective: push macros to chip boundary, optimize macro displacement, wirelength, routability, etc.
- Disadvantage:
  - Disturb relative macro positions by placement prototype → disturb the timing optimization by analytical placers.



(a) Macro placement by analytical placers



(b) Macro placement by academic methods

<sup>3</sup>Tung-Chieh Chen et al. (2007). "MP-trees: A packing-based macro placement algorithm for mixed-size designs". In: *Proc. DAC*, pp. 447–452.

# Previous Work

- RL-based Approach<sup>45</sup>.
- Macro placement → sequential Markov Decision Process (MDP).
  - State: intermediate macro placement solution.
  - Action: assign one macro to a legalized position on the chip.
- Generate macro placement solution **from scratch**.
- Disadvantages:
  - From scratch → computationally prohibitive for parameters exploration.
  - No utilization of timing-optimization by analytical placement prototype.

---

<sup>4</sup>Ruoyu Cheng and Junchi Yan (2021). “On Joint Learning for Solving Placement and Routing in Chip Design”. In: *Proc. NeurIPS 34*, pp. 16508–16519.

<sup>5</sup>Azalia Mirhoseini et al. (2020). “Chip placement with deep reinforcement learning”. In: *arXiv preprint arXiv:2004.10746*.

# Problem Formulation

## Incremental Macro Placement Refinement

Given the placement prototype by an analytical placer, find the legalized coordinates of macros to optimize the objective of wirelength.

Requirements satisfied by IncreMacro:

- Pushing macros to chip periphery.
- Macro relative position (timing-opt by analytical placers) preserved.



# Algorithm

# Framework

- Overall flow of IncreMacro:



# KD-tree-based Macro Diagnosis

- Regularly-placed macro satisfies one of the following conditions:
  - clings to chip boundary
  - nearest four 4 macros surround it
- Motivation: only poorly-placed macros need to be adjusted.
- Methodology for macro diagnosis: KD-tree based nearest neighbor search
- Example:
  - $m_1$ : **poorly-placed**
  - $m_3$ : **regularly-placed**



# Gradient-based Macro Shift



The Macro shifting gradient descent process.

- Make analogy between macro placement and deep learning:
  - macro coordinate offset → model weight
  - wirelength and periphery cost → cost function
  - netlist → data instance
- Implemented by pytorch
- Only shift poorly-placed macros

$$\min \quad \mathcal{L} = \sum_{e \in Net_{macro}} WL_e + \alpha \sum_{m_i \in \mathcal{M}_{poor}} P_i, \quad (1)$$

where  $WL_e$  is wirelength cost for net  $e$ ,  $P_i$  is periphery cost for macro  $m_i$ .

# Gradient-based Macro Shift: Wirelength Cost

- Half-perimeter wirelength (HPWL) for net  $e$ :

$$HPWL_e = \max_{v_i \in e} \{x_i\} - \min_{v_i \in e} \{x_i\} + \max_{v_i \in e} \{y_i\} - \min_{v_i \in e} \{y_i\}. \quad (2)$$

- HPWL is not differentiable.
- Weighted-average wirelength (WA)<sup>6</sup> for a net  $e$ :

$$WA_e = \frac{\sum_{v_i \in e} x_i e^{\frac{x_i}{\gamma}}}{\sum_{v_i \in e} e^{\frac{x_i}{\gamma}}} - \frac{\sum_{v_i \in e} x_i e^{-\frac{x_i}{\gamma}}}{\sum_{v_i \in e} e^{-\frac{x_i}{\gamma}}} + \frac{\sum_{v_i \in e} y_i e^{\frac{y_i}{\gamma}}}{\sum_{v_i \in e} e^{\frac{y_i}{\gamma}}} - \frac{\sum_{v_i \in e} y_i e^{-\frac{y_i}{\gamma}}}{\sum_{v_i \in e} e^{-\frac{y_i}{\gamma}}}. \quad (3)$$

- Use WA as wirelength cost.

---

<sup>6</sup>Meng-Kai Hsu, Valeriy Balabanov, and Yao-Wen Chang (2013). "TSV-aware analytical placement for 3-D IC designs based on a novel weighted-average wirelength model". In: *IEEE TCAD* 32.4, pp. 497–509.

# Gradient-based Macro Shift: Periphery cost



(a) Visualization of the peripheral cost



(b) Vertical peripheral cost

- Motivation: push macros to chip boundary.

$$\begin{aligned}
 P_i^H &= \left| \frac{w_{core}}{2} - x_i \right| + \frac{\left( \frac{w_{core}}{2} \right)^2}{\left| \frac{w_{core}}{2} - x_i \right|}, \\
 P_i^V &= \left| \frac{h_{core}}{2} - y_i \right| + \frac{\left( \frac{h_{core}}{2} \right)^2}{\left| \frac{h_{core}}{2} - y_i \right|}, \\
 P_i &= P_i^H + P_i^V.
 \end{aligned} \tag{4}$$

# Constraint-graph based LP for Macro Legalization<sup>7</sup>



- Purpose:
  - Eliminate overlaps generated by last step (Macro shift)
  - Further push macros to nearest chip periphery
- Constraint graphs:
  - Relative positional relationship among macros, horizontally and vertically.
  - Constructed on the macro placement prototype by analytical placers.
- Objective for LP (minimization):
  - Macro displacement
  - Distance to the nearest periphery

<sup>7</sup>Jason Cong and Min Xie (2006). "A robust detailed placement for mixed-size IC designs". In: Proc. ASPDAC, 7-pp.

# Macro Legalization: Linear Programming Formulation

$$\begin{aligned} \min \quad & \sum_{i=1}^{|M_{poor}|} |x'_i - x_i| + |y'_i - y_i| + \min(x_i, W - x_i) + \min(y_i, H - y_i) \\ \text{s.t.} \quad & x'_j - x'_i \geq \frac{w_i + w_j}{2}, \quad \forall e_{ij} \in G_h \\ & y'_j - y'_i \geq \frac{h_i + h_j}{2}, \quad \forall e_{ij} \in G_v \\ & \frac{w_i}{2} \leq x'_i \leq W - \frac{w_i}{2}, \frac{h_i}{2} \leq y'_i \leq H - \frac{h_i}{2}. \quad \forall m_i \in \mathcal{M}_{poor} \end{aligned} \tag{5}$$

- $x_i/x'_i$ : x-coordinate before/after macro legalization.
- Original objective for macro displacement:  $|x'_i - x_i| + |y'_i - y_i|$ .
- $\min(x_i, W - x_i) + \min(y_i, H - y_i)$ : distance to the nearest boundary.
- First and second inequality: non-overlap constraint.
- Last inequality: out of boundary constraint.

# Macro Legalization: Linear Programming Formulation (Con't)

$$\begin{aligned}
 \min \quad & \sum_{i=1}^{|M_{poor}|} x_i^p + x_i^q + y_i^p + y_i^q + \min(x_i, W - x_i) + \min(y_i, H - y_i) \\
 \text{s.t.} \quad & x_i^p - x_i^q = x'_i - x_i, \quad y_i^p - y_i^q = y'_i - y_i, \quad \forall m_i \in \mathcal{M}_{poor} \\
 & x_i^p \geq 0, \quad x_i^q \geq 0, \quad y_i^p \geq 0, \quad y_i^q \geq 0, \quad \forall m_i \in \mathcal{M}_{poor} \\
 & x'_j - x'_i \geq \frac{w_i + w_j}{2}, \quad \forall e_{ij} \in G_h \\
 & y'_j - y'_i \geq \frac{h_i + h_j}{2}, \quad \forall e_{ij} \in G_v \\
 & \frac{w_i}{2} \leq x'_i \leq W - \frac{w_i}{2}, \quad \frac{h_i}{2} \leq y'_i \leq H - \frac{h_i}{2}. \quad \forall m_i \in \mathcal{M}_{poor}
 \end{aligned} \tag{6}$$

- First and second constraints: transformation on displacement objective (remove absolute value).

# Experimental Results

# Experiment Setting

- Benchmark: OpenSource RISC-V SOCs from chipyard<sup>8</sup>.
- PDK: ASAP7<sup>9</sup>.
- VLSI flow: standard cell placement → CTS → routing (commercial tool).
- Baseline: DREAMPlace, AutoDMP.

**Table:** Design information

| Benchmark          | # Macros | # Std cells | # Nets  | freq. (MHz) |
|--------------------|----------|-------------|---------|-------------|
| Rocket             | 121      | 203633      | 208595  | 500.0       |
| GemminiRocket      | 737      | 1176141     | 1189717 | 333.3       |
| Sha3Rocket         | 121      | 235944      | 240907  | 666.7       |
| LargeBoomAndRocket | 636      | 856576      | 874017  | 333.3       |
| FFTRocket          | 121      | 211543      | 216507  | 1000.0      |
| SmallBoom          | 314      | 362479      | 372623  | 500.0       |
| HwachaRocket       | 292      | 679667      | 687204  | 250.0       |

<sup>8</sup>Alon Amid et al. (2020). “Chipyard: Integrated design, simulation, and implementation framework for custom SoCs”. In: *IEEE Micro* 40.4, pp. 10–21.

<sup>9</sup>Lawrence T Clark et al. (2016). “ASAP7: A 7-nm FinFET predictive process design kit”. In: *Microelectronics Journal* 53, pp. 105–115.

# Experiment: Post-Route (DREAMPlace)

**Table:** Post-Route PPA results for the incorporation of IncreMacro to DREAMPlace

| Benchmark          | DREAMPlace (two stage placement) |             |               |               |                | DREAMPlace + IncreMacro |              |                 |               |                |
|--------------------|----------------------------------|-------------|---------------|---------------|----------------|-------------------------|--------------|-----------------|---------------|----------------|
|                    | WL<br>(um)                       | WNS<br>(ns) | TNS<br>(ns)   | Power<br>(mW) | Runtime<br>(s) | WL<br>(um)              | WNS<br>(ns)  | TNS<br>(ns)     | Power<br>(mW) | Runtime<br>(s) |
| Rocket             | 14838552                         | -1.40       | -67.16        | 72.54         | 64.55          | <b>11794141</b>         | -0.05        | -0.88           | 68.90         | 101.77         |
| GemminiRocket      | <b>62420588</b>                  | -3.06       | -15146.20     | 520.73        | 207.83         | 69708870                | <b>-2.45</b> | <b>-4927.20</b> | <b>497.50</b> | 418.82         |
| Sha3Rocket         | 11584499                         | -0.22       | <b>-40.84</b> | 95.49         | 77.15          | <b>11570282</b>         | <b>-0.21</b> | -48.60          | <b>95.08</b>  | 99.61          |
| LargeBoomAndRocket | 36201457                         | 0           | 0             | 132.57        | 298.71         | <b>33286377</b>         | 0.09         | 0               | <b>131.46</b> | 533.97         |
| FFTRocket          | 10740986                         | -0.01       | -0.02         | 70.80         | 63.19          | <b>9716426</b>          | <b>0.02</b>  | 0               | <b>67.06</b>  | 72.89          |
| SmallBoom          | 13967377                         | -0.13       | -22.01        | 98.10         | 112.40         | <b>13547457</b>         | <b>0.01</b>  | 0               | <b>96.55</b>  | 153.29         |
| HwachaRocket       | 48096700                         | -0.10       | -2.82         | 145.55        | 146.64         | <b>40552488</b>         | <b>0.03</b>  | 0               | <b>137.48</b> | 253.25         |
| Normalize          | 1.065                            | 1.599       | 1.639         | 1.033         | <b>1.000</b>   | <b>1.000</b>            | <b>1.000</b> | <b>1.000</b>    | <b>1.000</b>  | 1.600          |

## Performance Improvement:

- wirelength: +6.5%
- power: +3.3%
- WNS: +59.9%
- TNS: +63.9%

# Experiment: Post-Route (AutoDMP)

**Table:** Post-Route PPA results for the incorporation of IncreMacro to AutoDMP

| Benchmark     | AutoDMP (two-stage placement) |             |             |               |                | AutoDMP + IncreMacro |              |              |               |                |
|---------------|-------------------------------|-------------|-------------|---------------|----------------|----------------------|--------------|--------------|---------------|----------------|
|               | WL<br>(um)                    | WNS<br>(ns) | TNS<br>(ns) | Power<br>(mW) | Runtime<br>(s) | WL<br>(um)           | WNS<br>(ns)  | TNS<br>(ns)  | Power<br>(mW) | Runtime<br>(s) |
| Rocket        | 14704880                      | -0.41       | -354.87     | 74.33         | 69.24          | <b>11186898</b>      | <b>0.01</b>  | <b>0</b>     | <b>68.43</b>  | 82.99          |
| GemminiRocket | 41031737                      | -0.24       | -17.76      | 301.39        | 205.95         | <b>34580828</b>      | <b>-0.01</b> | <b>-0.01</b> | <b>287.51</b> | 381.58         |
| FFTRocket     | <b>8108137</b>                | -0.001      | -0.001      | 55.43         | 55.14          | 8176606              | <b>0.05</b>  | <b>0</b>     | <b>55.31</b>  | 101.73         |
| HwachaRocket  | 31675727                      | -0.05       | -0.14       | 130.37        | 193.24         | <b>22684613</b>      | <b>0.35</b>  | <b>0</b>     | <b>121.66</b> | 321.65         |
| Normalize     | 1.168                         | 1.996       | 1.999       | 1.049         | <b>1.000</b>   | <b>1.000</b>         | <b>1.000</b> | <b>1.000</b> | <b>1.000</b>  | 1.640          |

## Performance Improvement:

- wirelength: +16.8%
- power: +4.9%
- WNS: +99.6%
- TNS: +99.9%

# Result Visualization



(f) GemminiRocket:  
DREAMPlace



(g) GemminiRocket:  
DREAMPlace +  
IncreMacro



(h) GemminiRocket:  
AutoDMP



(i) GemminiRocket:  
AutoDMP + IncreMacro

# Runtime Analysis



Average runtime breakdown of IncreMacro + DREAMPlace.

# Conclusion

# Conclusion

- IncreMacro: incrementally enhances macro placement by SOTA analytical placers.
- Three main techniques:
  - KD-tree-based macro diagnosis
  - gradient-based macro shift
  - constraint-graph-based linear programming for macro legalization
- Two requirements for macro placement:
  - push macros to the chip boundary
  - Macro relative position preserved

**THANK YOU!**