

# **DREAMPlaceFPGA**: An Open-Source Analytical Placer for Large Scale Heterogeneous FPGAs using Deep-Learning Toolkit

Rachel Selina Rajarathnam<sup>1</sup>, Mohamed Baker Alawieh<sup>1</sup>, Zixuan Jiang<sup>1</sup>,  
Mahesh A. Iyer<sup>2</sup>, David Z. Pan<sup>1</sup>

<sup>1</sup>*ECE Department, The University of Texas at Austin, TX, USA*

<sup>2</sup>*Intel Corporation, CA, USA*

This work was supported in part by Intel Strategic Research Alliance, Intel & VMware, and NSF

# HETEROGENEOUS PLATFORMS IN DATA CENTERS

- ⊕ XPU: Application/workload-specific Processing Unit



# FPGA IN DATA CENTERS

- ⊕ Better processing performance per watt than CPUs or GPUs on data-centric acceleration tasks as thread count increases



# FPGA DESIGN IMPLEMENTATION



# FPGA PLACEMENT



- ⊕ Determine locations of instances on a fixed-floorplan chip with limited resources
- ⊕ Concurrent optimization of wirelength, routing congestion, etc.
- ⊕ Placement stages: Global Placement, Legalization, and Detailed Placement

# GLOBAL PLACEMENT

---

- Obtain roughly legal locations for instances
- Optimization problem formulated as:
  - Simulated Annealing
  - Quadratic Placement
  - Non-linear Placement



# LEGALIZATION

- Assign instances to sites
- Meet resource-type and routability constraints



# DETAILED PLACEMENT

---

- Further improve metrics of legalized placement
- Formulated as:
  - Independent Set Matching
  - Slot Assignment



# FPGA PLACEMENT

- ⊕ Global Placement (GP): **Focus of this work**
  - Obtain roughly legal locations
- ⊕ Legalization (LG):
  - Assign instances to sites
- ⊕ Detailed Placement (DP):
  - Further improve metrics of legalized placement



# EXISTING FPGA PLACEMENT ACCELERATION SCHEMES

---

- ⊕ Multi-threaded CPU [W. Li+, TCAD'2018; W. Li+, ICCAD'2019]
  - Runtime benefit plateaus after certain thread count
- ⊕ FPGA [S. Dhar+, FPL'2019; S. Dhar+, HPC'2019]
  - Hybrid CPU + FPGA platform
    - Limited by total memory on FPGA
- ⊕ GPU [C. Fobel+, NEWCAS'2012; Y. Meng+, TCAD'2021]
  - Expertise on underlying hardware for efficient kernel development

**Non-trivial effort to adapt acceleration schemes to changes in placement formulations!**

**Can we do better?**

# DREAMPLACEFPGA

---

- ⊕ Adapt the state-of-the-art `elfPlace` placer to a DL toolkit
  - Inspired by the `DREAMPlace` framework for ASICs
- ⊕ Re-use advanced libraries available in the deep learning toolkit
- ⊕ Toolkit consists of low-level optimized operators and a high-level programming interface => *low development cost*
- ⊕ Contribute to FPGA open-source ecosystem

# ELFPLACE OVERVIEW

- ⊕ A flat, nonlinear placement algorithm for large-scale heterogeneous FPGAs

$$\min_{x,y} W(x,y) \ \text{ s.t. } \ D(x,y) \leq D_0 \quad \Rightarrow \quad \min_{x,y} f(x,y) = \sum_{e \in E} W_e(x,y) + \lambda D(x,y)$$

- ⊕ Casts the placement density cost $D(x,y)$ to the potential energy $\Phi(x,y)$ of an electrostatic system
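The penalized objective above can be sketched in a few lines. This is a minimal illustration, not the elfPlace implementation: it assumes half-perimeter wirelength (HPWL) for $W_e$, and the names `hpwl`, `penalized_objective`, and `toy_density` are hypothetical. The density cost is passed in as a function because its real form (the electrostatic potential energy) is not spelled out on this slide.

```python
def hpwl(net, x, y):
    """Half-perimeter wirelength of one net (a list of instance ids)."""
    xs = [x[i] for i in net]
    ys = [y[i] for i in net]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def penalized_objective(nets, x, y, density_cost, lam):
    """f(x, y) = sum_e W_e(x, y) + lambda * D(x, y)."""
    wirelength = sum(hpwl(net, x, y) for net in nets)
    return wirelength + lam * density_cost(x, y)


def toy_density(x, y):
    """Placeholder density cost (assumption): count of overlapping instances."""
    locations = list(zip(x, y))
    return float(len(locations) - len(set(locations)))


# Three instances, two nets; instances 1 and 2 sit on the same location.
x = [0.0, 2.0, 2.0]
y = [0.0, 1.0, 1.0]
nets = [[0, 1], [1, 2]]
f_value = penalized_objective(nets, x, y, toy_density, lam=10.0)
```

With a larger `lam`, the overlap between instances 1 and 2 dominates the objective, which is what pushes the optimizer to spread instances apart.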



# ELFPLACE OVERVIEW

---

- ⊕ Each resource type $s \in S = \{\text{LUT}, \text{FF}, \text{DSP}, \text{RAM}\}$ is solved as a separate electrostatic system
- ⊕ Formulated using the Augmented Lagrangian Method

$$\min_{x,y} f(x, y) = \tilde{W}(x, y) + \sum_{s \in S} \lambda_s \left( \Phi_s(x, y) + \frac{c_s}{2} \Phi_s(x, y)^2 \right)$$

- ⊕ Spectral methods (Discrete Cosine Transform) are used to numerically solve Poisson's equation for each electrostatic system
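As a rough sketch of how the augmented Lagrangian objective combines its terms, the snippet below evaluates it for given values of the wirelength and the per-resource potential energies. The function name and the numeric values are illustrative assumptions; the potentials would in practice come from the spectral Poisson solve, which is not reproduced here.

```python
def augmented_lagrangian(wirelength, potential, lam, c):
    """f = W + sum_s lam_s * (Phi_s + (c_s / 2) * Phi_s ** 2)."""
    total = wirelength
    for s, phi in potential.items():
        total += lam[s] * (phi + 0.5 * c[s] * phi ** 2)
    return total


# Illustrative numbers only: one potential energy per resource type.
potential = {"LUT": 2.0, "FF": 1.0, "DSP": 0.0, "RAM": 0.0}
lam = {"LUT": 0.5, "FF": 0.5, "DSP": 1.0, "RAM": 1.0}
c = {s: 1.0 for s in potential}
f_value = augmented_lagrangian(10.0, potential, lam, c)
```

Each resource type carries its own multiplier, so a resource that is already well spread (zero potential here for DSP and RAM) adds nothing to the objective.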

# ELFPLACE OVERVIEW

- ⊕ Best routed wirelength (WL), compared with UTPlaceF (-13.6%), RippleFPGA (-11.3%), GPlace3.0 (-8.9%), and UTPlaceF-DL (-7.1%)



- ⊕ Placement Runtime: UTPlaceF (3.97x), RippleFPGA (1.04x), GPlace3.0 (3.42x), & UTPlaceF-DL (3.63x)

# DREAMPLACE OVERVIEW

- ⊕ Novel **Analogy** by casting the nonlinear placement optimization into a neural network training problem
- ⊕ Greatly leverage deep learning hardware (GPU) and software (e.g., PyTorch)
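The analogy can be illustrated with a toy training loop in which instance locations play the role of trainable weights and wirelength plays the role of the loss. This is a dependency-free sketch under simplifying assumptions (1-D locations, quadratic wirelength on 2-pin nets, hand-written gradients, plain gradient descent); DREAMPlace itself relies on PyTorch, and all names below are hypothetical.

```python
def location(name, pos, fixed):
    """Look up a movable instance or a fixed pin."""
    return pos[name] if name in pos else fixed[name]


def loss_and_grad(pos, fixed, nets):
    """Quadratic wirelength 'loss' and its gradient w.r.t. movable locations."""
    loss = 0.0
    grad = {name: 0.0 for name in pos}
    for a, b in nets:
        d = location(a, pos, fixed) - location(b, pos, fixed)
        loss += d * d
        if a in grad:
            grad[a] += 2.0 * d
        if b in grad:
            grad[b] -= 2.0 * d
    return loss, grad


def train(pos, fixed, nets, lr=0.1, steps=200):
    """Gradient descent: locations are updated like network weights."""
    pos = dict(pos)
    for _ in range(steps):
        _, grad = loss_and_grad(pos, fixed, nets)
        for name in pos:
            pos[name] -= lr * grad[name]
    return pos


fixed = {"io_left": 0.0, "io_right": 10.0}
nets = [("io_left", "u1"), ("u1", "io_right")]
placed = train({"u1": 0.0}, fixed, nets)
```

The single movable instance settles midway between the two fixed pins, the minimizer of the quadratic loss. In a DL toolkit the gradient would come from automatic differentiation or a custom backward operator instead of being written by hand.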



# DREAMPLACE OVERVIEW

Leverage highly optimized deep learning toolkit PyTorch



# DREAMPLACE OVERVIEW

## RePlAce [Cheng+, TCAD'18]

- CPU: 24-core 3.0 GHz Intel Xeon
- 64 GB memory allocated

## DREAMPlace

- CPU: Intel v4 @ 2.20 GHz
- GPU: 1 NVIDIA Tesla V100
- Single CPU thread used

## DREAMPlace vs. RePlAce

- Same quality of results!
- 34x speedup
- 10M-cell design finishes within 5 min, compared with 3 h


# WHY INTEGRATE ELFPLACE & DREAMPLACE?

---

- ⊕ Quality of Results: **elfPlace** is a state-of-the-art academic placer
- ⊕ Acceleration using Deep Learning (DL) Hardware & Software
- ⊕ Low development overhead
  - High-level Python Programming + low-level operators in C++/CUDA
- ⊕ Extensible
  - Easy to incorporate new algorithms and acceleration techniques
  - Build on existing framework – similar to **DREAMPlace** (ASIC)

# DREAMPLACEFPGA

---

Adapt the ASIC framework to FPGAs $\Rightarrow$ Address various challenges

- ⊕ Heterogeneous resource types with discrete site locations
  - Implement different resources as separate electrostatic systems
- ⊕ Legality constraints for FFs & LUTs
- ⊕ High-fanout nets: Clock nets have $> 100k$ pin connections
- ⊕ GPU-friendly data structures
  - Connectivity representation & maps $\rightarrow$ Multiple flat arrays
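One common way to turn nested connectivity into flat arrays is a compressed layout: one array holding all pin ids back to back, plus an offsets array marking where each net starts. The sketch below assumes that layout; the function and variable names are illustrative, not taken from the DREAMPlaceFPGA source.

```python
def flatten_net2pin(net2pin):
    """Convert a list of per-net pin lists into (flat_pins, start)."""
    flat_pins = []
    start = [0]
    for pins in net2pin:
        flat_pins.extend(pins)
        start.append(len(flat_pins))
    return flat_pins, start


def pins_of_net(flat_pins, start, net_id):
    """Recover the pins of one net from the flat representation."""
    return flat_pins[start[net_id]:start[net_id + 1]]


net2pin = [[0, 3], [1, 2, 4], [5]]
flat_pins, start = flatten_net2pin(net2pin)
```

Two contiguous integer arrays can be copied to the GPU as tensors, and a kernel working on net `i` only needs `start[i]` and `start[i + 1]` to find its pins. This keeps memory access regular even when a few nets, such as clocks, have very high fanout.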

# DREAMPLACEFPGA FLOW

- ⊕ Overflow conditions based on instance size:
  - LUT/FF instances: 10%
  - Large DSP/RAM instances: 20%
- ⊕ GPU-accelerated operators (marked in **red** in the flow diagram):
  - Wirelength
  - Density
  - Instance Area Update
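The overflow conditions can be read as per-resource thresholds that the flow checks during global placement. The sketch below encodes that reading; treating the percentages as upper bounds that must all hold at once is an assumption, and `OVERFLOW_LIMIT` and `overflow_conditions_met` are hypothetical names.

```python
# Thresholds from the slide: 10% for LUT/FF, 20% for large DSP/RAM instances.
OVERFLOW_LIMIT = {"LUT": 0.10, "FF": 0.10, "DSP": 0.20, "RAM": 0.20}


def overflow_conditions_met(overflow):
    """True when every resource type is at or below its overflow limit."""
    return all(overflow[r] <= limit for r, limit in OVERFLOW_LIMIT.items())


ok = overflow_conditions_met({"LUT": 0.08, "FF": 0.10, "DSP": 0.15, "RAM": 0.20})
not_ok = overflow_conditions_met({"LUT": 0.12, "FF": 0.05, "DSP": 0.10, "RAM": 0.10})
```

The looser limit for DSP and RAM reflects their size: a single large instance moving between bins changes the overflow far more than a single LUT or FF does.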



# DREAMPLACEFPGA OVERVIEW

Leverage highly optimized deep learning toolkit PyTorch



# DREAMPLACEFPGA: PLACEMENT API



# DREAMPLACEFPGA: DENSITY OPERATOR



- ⊕ Each resource type treated as a separate electrostatic system
- ⊕ Demand map computation done in parallel for instances of a resource type
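A simplified picture of the demand map: for each resource type, a grid of bins accumulates the area of the instances that fall into each bin. The sketch below assumes point-like instances whose whole area lands in the bin containing their center; a real density operator would split area across overlapped bins and process instances in parallel on the GPU. All names are hypothetical.

```python
def demand_maps(instances, num_bins_x, num_bins_y, bin_w, bin_h):
    """Build one demand grid per resource type.

    instances: iterable of (resource_type, x, y, area) tuples.
    """
    maps = {}
    for rtype, x, y, area in instances:
        if rtype not in maps:
            maps[rtype] = [[0.0] * num_bins_y for _ in range(num_bins_x)]
        # Clamp to the last bin so instances on the boundary stay in range.
        bx = min(int(x // bin_w), num_bins_x - 1)
        by = min(int(y // bin_h), num_bins_y - 1)
        maps[rtype][bx][by] += area
    return maps


instances = [
    ("LUT", 0.5, 0.5, 1.0),
    ("LUT", 0.7, 0.2, 1.0),
    ("FF", 1.5, 0.5, 0.5),
    ("LUT", 1.2, 1.8, 1.0),
]
maps = demand_maps(instances, num_bins_x=2, num_bins_y=2, bin_w=1.0, bin_h=1.0)
```

Because each resource type owns its grid, the LUT and FF maps never interact, which matches treating them as separate electrostatic systems.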

# DREAMPLACEFPGA: CLUSTERING AWARE AREA UPDATE

Adjust LUT/FF instance areas to ensure legality constraints are met

- ⊕ Configurable Logic Block (CLB) consists of 8 Basic Logic Elements (BLEs)
  - BLE consists of 2 LUTs & 2 FFs
- ⊕ For LUTs: maximum input (I/P) count
  - LUT6 occupies the entire BLE
- ⊕ For FFs: Control set
  - Shared clock (CK), set/reset (SR) and clock enable (CE)
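The constraints above can be sketched as two small helpers: one that reports how many LUT slots of a BLE a LUT consumes, and one that groups FFs by control set. This only illustrates the constraints and is not the area-update formula, which the slide does not give. Charging every LUT smaller than a LUT6 exactly one slot is a simplifying assumption, and all names are hypothetical.

```python
from collections import defaultdict

BLES_PER_CLB = 8   # from the slide: a CLB consists of 8 BLEs
LUTS_PER_BLE = 2   # from the slide: a BLE consists of 2 LUTs & 2 FFs


def lut_slots_used(num_inputs):
    """LUT slots consumed in a BLE; a LUT6 occupies the entire BLE."""
    return LUTS_PER_BLE if num_inputs == 6 else 1


def group_ffs_by_control_set(ffs):
    """Group FF names by their (CK, SR, CE) control set.

    ffs: iterable of (name, ck, sr, ce) tuples.
    """
    groups = defaultdict(list)
    for name, ck, sr, ce in ffs:
        groups[(ck, sr, ce)].append(name)
    return dict(groups)


ffs = [
    ("ff0", "clk_a", "rst", "en0"),
    ("ff1", "clk_a", "rst", "en0"),
    ("ff2", "clk_b", "rst", "en0"),
]
groups = group_ffs_by_control_set(ffs)
```

FFs in the same group are candidates for sharing logic resources, while `ff2` needs resources compatible with a different clock.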



# DREAMPLACEFPGA: DSP/RAM LEGALIZER

---

- ⊕ Large instances (DSPs & block RAMs) are legalized at the end of GP
- ⊕ Sparse resources in the considered Xilinx UltraScale architecture
  - 1728 block RAMs
  - 768 DSPs
- ⊕ Min-cost flow (CPU) used to legalize these sparse resource types
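To show what this legalization step optimizes, the sketch below assigns each large instance to a distinct site so that total Manhattan displacement is minimal. It uses exhaustive search over assignments as a plainly named stand-in for min-cost flow, which is only practical for tiny inputs. Unit site capacity and the displacement cost are assumptions, and the names are hypothetical.

```python
from itertools import permutations


def legalize_by_exhaustive_assignment(instances, sites):
    """Return (cost, assignment) minimizing total Manhattan displacement.

    instances, sites: lists of (x, y); assignment[i] is the site index
    chosen for instance i. Each site holds at most one instance.
    """
    best_cost, best_perm = None, None
    for perm in permutations(range(len(sites)), len(instances)):
        cost = sum(
            abs(instances[i][0] - sites[s][0]) + abs(instances[i][1] - sites[s][1])
            for i, s in enumerate(perm)
        )
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, list(best_perm)


dsps = [(1.2, 0.4), (1.1, 0.6), (4.8, 2.2)]
dsp_sites = [(1.0, 0.0), (1.0, 1.0), (5.0, 2.0), (5.0, 3.0)]
cost, assignment = legalize_by_exhaustive_assignment(dsps, dsp_sites)
```

A min-cost flow solver reaches the same optimum for unit-capacity sites without enumerating permutations, which matters at the scale of the device's 1728 block RAM and 768 DSP sites.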

# EXPERIMENTAL SETUP

- Design suite: ISPD'2016 benchmarks
- Placers: *elfPlace-CPU*, *elfPlace-GPU*
- Router: *Xilinx Vivado v2015.4*

CPU (8T): Intel Core i9-7900X @ 3.30 GHz

GPU: NVIDIA TITAN Xp (Pascal)

- Comparison Metrics
  - Routed Wirelength
  - Placement Runtime

| Design    | #LUT | #FF   | #RAM | #DSP |
|-----------|------|-------|------|------|
| FPGA01    | 50k  | 55k   | 0    | 0    |
| FPGA02    | 100k | 66k   | 100  | 100  |
| FPGA03    | 250k | 170k  | 600  | 500  |
| FPGA04    | 250k | 172k  | 600  | 500  |
| FPGA05    | 250k | 174k  | 600  | 500  |
| FPGA06    | 350k | 352k  | 1000 | 600  |
| FPGA07    | 350k | 355k  | 1000 | 600  |
| FPGA08    | 500k | 216k  | 600  | 500  |
| FPGA09    | 500k | 366k  | 1000 | 600  |
| FPGA10    | 350k | 600k  | 1000 | 600  |
| FPGA11    | 480k | 363k  | 1000 | 400  |
| FPGA12    | 500k | 602k  | 600  | 500  |
| Resources | 538k | 1075k | 1728 | 768  |

\*LG & DP run on *elfPlace-CPU*

# RESULTS: ROUTED WIRELENGTH




# RESULTS: PLACEMENT RUNTIME




# DREAMPLACEFPGA RUNTIME BREAKDOWN



# DREAMPLACEFPGA SUMMARY

- ⊕ Open-source accelerated FPGA placement framework
- ⊕ Adapts *elfPlace* to a DL toolkit, inspired by the ASIC *DREAMPlace* framework
- ⊕ Global placement runs 5.35x faster than *elfPlace-CPU* and 1.19x faster than *elfPlace-GPU*, with similar quality of results

| Metric                    | *elfPlace-CPU* | *elfPlace-GPU* | DREAMPlaceFPGA (GPU) |
|---------------------------|----------------|----------------|----------------------|
| Routed Wirelength         | **0.997**      | **0.997**      | 1.000                |
| Global Placement Runtime  | **5.35**       | **1.19**       | **1.00**             |
| Overall Placement Runtime | **1.82**       | **0.99**       | 1.00                 |

<https://github.com/rachelselinar/DREAMPlaceFPGA>

# FUTURE WORK

---

- ⊕ Acceleration of other placement stages
  - Legalization
  - Detailed Placement
- ⊕ Support for different FPGA architectures
  - Currently, only the Xilinx UltraScale architecture is supported

# THANK YOU