



# RapidWright Documentation

*Release 2023.1.4-beta*

**AMD Research and Advanced Development  
Copyright 2018-2023, Advanced Micro Devices, Inc.**

Oct 27, 2023



# CONTENTS

- **1 Introduction**
  - 1.1 What is RapidWright?
  - 1.2 Why RapidWright?
  - 1.3 What about RapidSmith?
  - 1.4 Vivado and RapidWright
- **2 Getting Started**
  - 2.1 Quick Start
  - 2.2 Install
  - 2.3 RapidWright Eclipse Setup
  - 2.4 RapidWright IntelliJ Setup
  - 2.5 RapidWright Jupyter Notebook Kernel Setup
- **3 FPGA Architecture Basics**
  - 3.1 What is an FPGA?
  - 3.2 CPU vs. FPGA
  - 3.3 Lookup Tables (LUTs)
  - 3.4 State Elements
  - 3.5 Carry Chains
  - 3.6 DSP Blocks
  - 3.7 Block RAMs
- **4 Xilinx Architecture Terminology**
  - 4.1 BEL (Basic Element of Logic)
  - 4.2 Site
  - 4.3 Tile
  - 4.4 FSR (Fabric Sub Region or Clock Region)
  - 4.5 SLR (Super Logic Region)
  - 4.6 Device
- **5 RapidWright Overview**
  - 5.1 Device Package
  - 5.2 EDIF Package (Logical Netlist)
  - 5.3 Design Package (Physical Netlist)
- **6 Design Checkpoints**
  - 6.1 What is a Design Checkpoint?
  - 6.2 What is Inside a Design Checkpoint?
  - 6.3 RapidWright and Design Checkpoint Files
- **7 Implementation Basics**
  - 7.1 Placement
  - 7.2 Routing
- **8 Merging Designs**
  - 8.1 Customizing Merge Behavior
- **9 Bitstream Manipulation**
  - 9.1 Disclaimer
  - 9.2 Overview
  - 9.3 Bitstream Packet Model
  - 9.4 Configuration Array Model
  - 9.5 Example Usages: Modify User State Bits
  - 9.6 Example Usages: Find and Print the Frames of a Placed Cell
- **10 FPGA Interchange Format**
  - 10.1 What is the FPGA Interchange Format?
  - 10.2 What does the FPGA Interchange Format enable?
  - 10.3 How is RapidWright related to the FPGA Interchange Format?
  - 10.4 Additional Resources
- **11 RapidWright Publications**
  - 11.1 Original RapidWright Publication - FCCM 2018
  - 11.2 Additional RapidWright Publications
  - 11.3 Select Community Publications
- **12 A Pre-implemented Module Flow**
  - 12.1 Background and Flow Comparison
  - 12.2 High Performance Flow
  - 12.3 Rapid Prototyping Flow
- **13 RapidWright Tutorials**
  - 13.1 RWRoute Timing-driven Routing
  - 13.2 RWRoute Wirelength-driven Routing
  - 13.3 RWRoute Partial Routing
  - 13.4 RapidWright Report Timing Example
  - 13.5 Reuse Timing-closed Logic As A Shell
  - 13.6 Use DREAMPlaceFPGA to Place a Netlist via FPGA Interchange Format
  - 13.7 Polynomial Generator: Placed and Routed Circuits in Seconds
  - 13.8 Inserting and Routing a Debug Core As An ECO
  - 13.9 Create Placed and Routed DCP to Cross SLR
  - 13.10 Build an IP Integrator Design with Pre-Implemented Blocks
  - 13.11 RapidWright PipelineGenerator Example
  - 13.12 RapidWright PipelineGeneratorWithRouting Example
  - 13.13 Pre-implemented Modules - Part I
  - 13.14 Pre-implemented Modules - Part II
  - 13.15 Create and Use an SLR Bridge
  - 13.16 RapidWright FPGA 2019 Deep Dive Tutorial
  - 13.17 RapidWright FCCM 2019 Workshop
  - 13.18 RapidWright FPL 2019 Tutorial
  - 13.19 RapidWright ICCAD 2023 Hands-on Tutorial
- **14 Tech Articles**
  - 14.1 Call RapidWright from C/C++ Using GraalVM
  - 14.2 Using RapidWright Directly in Python 3
  - 14.3 Setup JUnit 5 Tests in RapidWright
  - 14.4 RapidWright Data Files
- **15 Frequently Asked Questions**
  - 15.1 I can't open my DCP in RapidWright, I get 'ERROR: Couldn't determine a proper EDIF netlist to load with the DCP file ...', what should I do?
  - 15.2 Can RapidWright be used for designs targeting the AWS F1 platform?
  - 15.3 When should I use RapidWright and when should I use Vivado?
  - 15.4 What languages does RapidWright support, and how do I interact with them?
  - 15.5 Why is the framework called RapidWright?
  - 15.6 Can RapidWright generate bitstreams?
  - 15.7 Does RapidWright provide device timing information?
  - 15.8 Does RapidWright support partial reconfiguration (PR)?
  - 15.9 Is there any published work on RapidWright?
- **16 Glossary**
- **Index**



## INTRODUCTION

### Table of Contents

- *Introduction*
  - *What is RapidWright?*
  - *Why RapidWright?*
  - *What about RapidSmith?*
  - *Vivado and RapidWright*

## 1.1 What is RapidWright?

RapidWright is an open source Java framework that enables netlist and implementation manipulation of modern Xilinx FPGA and SoC designs. It complements [Xilinx's Vivado® Design Suite](#) and provides developers with capabilities such as:

- Fast-loading, accurate device model views for all Vivado-supported Xilinx devices (7 Series, UltraScale™, and UltraScale+™)
- Reading and writing of unencrypted Vivado Design Checkpoint files (.dcp)
- Hundreds of APIs to help build customized solutions to a wide variety of implementation challenges
- Examples of how to pre-implement (pre-place and pre-route) IP, relocate such blocks and compose pre-implemented blocks together

---

**Note:** RapidWright is not an official product from Xilinx and designs created or derived from it are not warranted. Please see [LICENSE.TXT](#) for full details.

---

## 1.2 Why RapidWright?

We believe that when people are empowered to create tailored solutions to their own specific challenges, innovation takes place. We are building RapidWright to be an environment that fosters this caliber of innovation. The commercial FPGA CAD world is in the unfortunate state of being closed source. We hope that with the release and continued development of RapidWright, we can change the status quo of how we develop and interact with FPGAs.

RapidWright's mission is to:

- Facilitate rapid creation of custom design implementation solutions for FPGAs
- Foster an ecosystem of research and development in academia and industry
- Be fast, efficient, light-weight and easy-to-use
- Serve as a platform that can grow into an open source FPGA implementation flow (future work)

## 1.3 What about RapidSmith?

RapidWright is a next generation RapidSmith. Previously, RapidSmith was created to enable FPGA CAD tool creation for older Xilinx devices, specifically those supported under ISE. RapidSmith is dependent on the Xilinx Design Language (XDL) which was discontinued in Vivado. Therefore, RapidSmith doesn't work with newer devices supported exclusively in Vivado (although some valiant efforts have been made to bridge the gap<sup>1,2</sup>).

RapidWright has been significantly overhauled from its parent RapidSmith code. The FPGA device model is cleaner, more data-rich, faster, and more memory-efficient, and it adds several insights and capabilities from the Vivado design paradigm. A distinguishing and enabling capability of RapidWright is its ability to read and write unencrypted Vivado Design Checkpoint files. It also maintains a full representation of both the logical and physical netlist of FPGA designs.

## 1.4 Vivado and RapidWright

The [Vivado Design Suite](#) is the tool environment for developing and implementing designs for Xilinx FPGAs and SoCs. Vivado provides both a GUI environment and a Tcl scripting interface to control the various tools and steps involved in development. The Tcl scripting interface is quite powerful in that it provides users with hundreds of commands to manipulate their design. However, despite the breadth of functionality that the Tcl interface offers, it does have some shortcomings.

- First, some tasks that a user would want to complete using Tcl constructs and commands take an inordinate amount of runtime, making the task infeasible, especially for large designs. For example, attempting to import routing information via Tcl commands for a full design can take several hours or days.
- Second, constructing large, complex operations out of Tcl commands can be inefficient due to its interpreted nature. Many users would also prefer a more mainstream object-oriented language with wider support for developing solutions.
- Lastly, if the user wants a particular capability that is not available in the provided library of Tcl commands in Vivado, there is generally no alternative.

RapidWright addresses these shortcomings by providing a means to import, modify and export Vivado-based designs independent of the Tcl interface. It achieves this capability by providing APIs that can read and write design checkpoint files (Vivado's design file format) into and out of the RapidWright framework as illustrated below.

---

<sup>1</sup> White, Brad S., "Tincr: Integrating Custom CAD Tool Frameworks with the Xilinx Vivado Design Suite" (2014). All Theses and Dissertations. 4338. <http://scholarsarchive.byu.edu/etd/4338>

<sup>2</sup> Townsend, Thomas James, "Vivado Design Interface: Enabling CAD-Tool Design for Next Generation Xilinx FPGA Devices" (2017). All Theses and Dissertations. 6492. <http://scholarsarchive.byu.edu/etd/6492>



RapidWright includes a compact, fast-loading device model and hundreds of APIs to help manipulate implementations. These capabilities will enable users to develop new implementation strategies and capabilities that have not been available previously in Vivado. We believe RapidWright provides a foundational framework that opens the door for innovation in the FPGA CAD space.



## GETTING STARTED

### How would you like to use RapidWright?

- *Quick Start* – “Just want to try it out”
- *Install* – “Ready to write code”

### Setting up Development Environments

- *RapidWright Eclipse Setup*
- *RapidWright IntelliJ Setup*
- *RapidWright Jupyter Notebook Kernel Setup*

## 2.1 Quick Start

---

**Note:** The only major prerequisite is Java (1.8 minimum, 11 or later recommended) - Any distribution such as [Adoptium](#) should work

---

Download and run the latest stand-alone RapidWright release jar file: [Linux](#) | [Windows](#):

#### Linux:

```
wget https://github.com/Xilinx/RapidWright/releases/download/v2023.1.0-beta/rapidwright-2023.1.0-standalone-lin64.jar
java -jar rapidwright-2023.1.0-standalone-lin64.jar
```

#### Windows:

```
curl -L -O https://github.com/Xilinx/RapidWright/releases/download/v2023.1.0-beta/rapidwright-2023.1.0-standalone-win64.jar
java -jar rapidwright-2023.1.0-standalone-win64.jar
```

This will start the RapidWright [Jython](#) (Python 2 in Java) interpreter with most RapidWright classes loaded. You can test your install by running the following at the prompt:

```
>>> DeviceBrowser.main([])
```

You should see the GUI come up similar to this screenshot:



If you have gotten to this point, congrats! Your RapidWright install is correctly configured and you are ready to start experimenting.

Note that the standalone jar comes pre-packaged with a few select devices:

- AWS-F1: Virtex UltraScale+ VU9P (xcvu9p)
- PYNQ-Z1: Zynq 7020 (xc7z020)
- Virtex UltraScale VU440 (xcvu440)

Additional devices are downloaded over the Internet on demand when the code attempts to load them. See [RapidWright Data Files](#) for more details.

## 2.2 Install

### 2.2.1 TL;DR

```
git clone https://github.com/Xilinx/RapidWright.git
cd RapidWright
./gradlew compileJava
export PATH=`pwd`/bin:$PATH
```

### 2.2.2 What You Need to Get Started

1. Java (1.8 minimum, 11 or later recommended) - Any distribution such as [Adoptium](#) should work. If you already have Vivado, it includes Java, see [Using Java distributed with Vivado](#) below on how to use it.
2. [Git](#) (source revision control system)
3. If you are running Linux and want to run the GUI portion of RapidWright, you may need an older libpng12 library. For those running Debian/Ubuntu-based distros, try the following:

```
wget -O /tmp/libpng12.deb \
  https://snapshot.debian.org/archive/debian/20160413T160058Z/pool/main/libp/libpng/libpng12-0_1.2.54-6_amd64.deb \
  && sudo dpkg -i /tmp/libpng12.deb \
  && rm /tmp/libpng12.deb
```

For CentOS/RedHat/Fedora distros, try the following:

```
sudo yum install libpng12
```

### 2.2.3 Additional Recommendations

1. [Vivado Design Suite 2018.3 or later](#) (Not essential to run RapidWright, but makes it useful)
2. An IDE such as [IntelliJ](#) or [Eclipse](#)

RapidWright includes the [Gradle Wrapper](#) (automatic build tool), so a Gradle installation is not necessary.
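The Java requirement listed above (1.8 minimum, 11 or later recommended) can be checked against the version string a JVM reports. The following is a minimal Python sketch, not part of RapidWright; the function names `java_major_version` and `check_java` are hypothetical. It relies on the fact that Java 8 and earlier report versions in the form `1.x`, while later releases lead with the major number.

```python
import re


def java_major_version(version_string):
    """Return the major Java version from a string such as '1.8.0_392' or '11.0.11'."""
    match = re.match(r"(\d+)(?:\.(\d+))?", version_string)
    if match is None:
        raise ValueError("unrecognized Java version: %r" % version_string)
    first = int(match.group(1))
    second = int(match.group(2)) if match.group(2) else 0
    # Java 8 and earlier use the "1.x" scheme; Java 9+ put the major number first.
    return second if first == 1 else first


def check_java(version_string):
    """Classify a Java version against the stated prerequisite (hypothetical helper)."""
    major = java_major_version(version_string)
    if major < 8:
        return "unsupported"
    if major < 11:
        return "supported"
    return "recommended"


if __name__ == "__main__":
    for version in ("1.7.0_80", "1.8.0_392", "11.0.11", "17"):
        print(version, "->", check_java(version))
```

The version string itself can be read from the output of `java -version`.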

### 2.2.4 Install Steps

The easiest way to get RapidWright set up is to run these commands:

**Linux ( /bin/sh or compatible):**

```
git clone https://github.com/Xilinx/RapidWright.git
cd RapidWright
./gradlew compileJava
export PATH=`pwd`/bin:$PATH
```

---

**Note:** For C-style shells (csh or tcsh), replace the last line with `` setenv PATH `pwd`/bin:$PATH ``

---

**Windows ( cmd.exe ):**

```
git clone https://github.com/Xilinx/RapidWright.git
cd RapidWright
.\gradlew compileJava
set "PATH=%CD%\bin;%PATH%"
```

---

**Note:** For Windows PowerShell users, replace the last line with `$env:PATH="$pwd\bin;$env:PATH"`

---

This will clone a copy of RapidWright from GitHub, download jar dependencies, compile the Java code and add the rapidwright wrapper to your PATH. Checking out and compiling the code can also be accomplished by using an IDE (see [RapidWright Eclipse Setup](#) or [RapidWright IntelliJ Setup](#)).

To perform a quick test to ensure RapidWright is set up correctly, try running the following:

```
rapidwright DeviceBrowser
```

---

**Note:** If you prefer to run with `java` directly (you will need to set the `CLASSPATH` appropriately; see [CLASSPATH](#) below for details), the same tool can be invoked with `java com.xilinx.rapidwright.device.browser.DeviceBrowser`.

---

You should see the GUI come up similar to this screenshot:



If you have gotten to this point, congrats! Your RapidWright install is correctly configured and you are ready to start experimenting.

### 2.2.5 RapidWright Wrapper

For those new to Java, RapidWright includes a `rapidwright` wrapper script (`rapidwright.bat` for Windows users) that sets the Java class path and provides a convenient interface to the various use modes. The install steps above add the wrapper to your PATH.

The `rapidwright` wrapper has the following options (printed when run without parameters):

```
rapidwright com.xilinx.rapidwright.<ClassName> -- to execute main() method of Java class
rapidwright <application> -- to execute a specific application
rapidwright --list-apps -- to list all available applications
rapidwright jython -- to enter interactive Jython shell
rapidwright jython -c "..." -- to execute specific Jython command
```

To pass options to `java`, it is recommended to use the `_JAVA_OPTIONS` environment variable, for example:

```
_JAVA_OPTIONS=-Xmx32736m rapidwright RWRoute
```

This sets the Java Virtual Machine (JVM) maximum heap size to just under 32 GB. This value is useful because it is the largest heap size available by default before the JVM must expand all object references from 4 bytes to 8 bytes.
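The arithmetic behind the `-Xmx32736m` value can be sketched as follows. This is an illustrative Python calculation, assuming the commonly cited 32 GiB boundary at which the HotSpot JVM stops using 4-byte compressed references; the exact cutoff can vary with the JVM and its settings, and the names below are hypothetical.

```python
MIB = 1024 * 1024  # the JVM "m" suffix denotes units of 1024 * 1024 bytes


def heap_bytes(xmx_megabytes):
    """Convert an -Xmx<N>m setting to bytes."""
    return xmx_megabytes * MIB


# Assumed boundary: 32 GiB, above which references grow from 4 to 8 bytes.
COMPRESSED_REFERENCE_LIMIT = 32 * 1024 * MIB

requested = heap_bytes(32736)
headroom_mib = (COMPRESSED_REFERENCE_LIMIT - requested) // MIB

print(requested < COMPRESSED_REFERENCE_LIMIT)  # True: the heap stays below the boundary
print(headroom_mib)                            # 32 MiB of margin below 32 GiB
```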

### 2.2.6 Vivado Compatibility & Versioning

RapidWright aims to be as compatible as possible with Vivado in terms of the device models it offers and its ability to load design checkpoints (DCPs) as far back as 2018.2.

RapidWright version numbers indicate the latest version of Vivado with which a release is compatible. For example, RapidWright 2023.1.0 is compatible with Vivado 2023.1 and previous versions back to 2018.2. Conversely, a DCP created in Vivado 2023.1 will likely not be readable in earlier versions of RapidWright (2022.2.0, 2022.1.0, etc.). The same is true for device models: if a device is first released in Vivado 2023.1, it will not be available in earlier versions of RapidWright.
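The versioning rule above can be expressed as a small check. This is a hedged Python sketch of the stated policy only, not a RapidWright API; `parse_release` and `can_read_dcp` are hypothetical names, and actual compatibility for a given DCP may differ (the text says "likely").

```python
OLDEST_SUPPORTED_VIVADO = (2018, 2)  # earliest Vivado version named in the text


def parse_release(version):
    """Turn '2023.1.0', '2023.1.4-beta' or '2023.1' into (2023, 1)."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def can_read_dcp(rapidwright_version, vivado_version):
    """True if the stated policy suggests this RapidWright release can load the DCP."""
    rw = parse_release(rapidwright_version)
    vivado = parse_release(vivado_version)
    # The DCP must come from a Vivado no older than 2018.2 and no newer
    # than the Vivado release that the RapidWright version is named after.
    return OLDEST_SUPPORTED_VIVADO <= vivado <= rw


if __name__ == "__main__":
    print(can_read_dcp("2023.1.0", "2023.1"))  # True
    print(can_read_dcp("2022.2.0", "2023.1"))  # False: DCP is newer than the release
```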

### 2.2.7 Notes for Advanced/Legacy Users

#### Using Java distributed with Vivado

The easiest way to find out where the Java runtime is packaged with your installation of Vivado is to run the following at the Vivado Tcl prompt:

```
which java
```

Based on where your installed Vivado is located, it should produce a full path, something like this:

```
/opt/Vivado/2022.2/tps/lnx64/jre11.0.11_9/bin/java
```

To use this version of Java instead of the system Java (or instead of installing Java separately), update your `PATH` and `JAVA_HOME` environment variables:

```
export PATH=/opt/Vivado/2022.2/tps/lnx64/jre11.0.11_9/bin:$PATH
export JAVA_HOME=/opt/Vivado/2022.2/tps/lnx64/jre11.0.11_9
```

Or, if using Windows, search for “edit environment variables” and add a new entry for `PATH` and `JAVA_HOME` appropriately.

#### CLASSPATH

Java has the notion of a `CLASSPATH`: a list of locations where `java` can look for compiled Java code (`.class` files or `.jar` files) to execute at runtime. The `CLASSPATH` can be set on the command line (`java -cp <CLASSPATH_HERE>`) or via the environment variable `CLASSPATH`. If a script to set the `CLASSPATH` variable (in Linux) is desired, the following command can be run from the RapidWright directory:

```
echo "export CLASSPATH=`pwd`/bin:`pwd`/jars/*" > bin/rapidwright_classpath.sh
```

This sets up the environment so the `-cp bin:jars/*` classpath option doesn't need to be set as an argument when invoking `java`, for example:

```
source bin/rapidwright_classpath.sh  
java com.xilinx.rapidwright.device.browser.DeviceBrowser
```

This should start the DeviceBrowser just as before.
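To make the structure of that classpath explicit, here is a small Python sketch that builds the same two entries (`bin` and `jars/*`) for a given checkout directory. The function name `rapidwright_classpath` and the example path are hypothetical. Note that the trailing `*` is left unexpanded on purpose: Java itself interprets a classpath entry ending in `/*` as all jar files in that directory.

```python
import os


def rapidwright_classpath(repo_dir, sep=os.pathsep):
    """Build a classpath string with the compiled classes and all jars of a checkout."""
    entries = [
        os.path.join(repo_dir, "bin"),        # compiled .class files
        os.path.join(repo_dir, "jars", "*"),  # wildcard expanded by java, not by the shell
    ]
    return sep.join(entries)


if __name__ == "__main__":
    # Hypothetical checkout location, used only for illustration.
    print(rapidwright_classpath("/home/user/RapidWright"))
```

The separator defaults to `os.pathsep`, which is `:` on Linux and `;` on Windows, matching what `java -cp` expects on each platform.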

#### RAPIDWRIGHT\_PATH

The environment variable RAPIDWRIGHT\_PATH is no longer required. RapidWright data files have a default location (see [RapidWright Data Files](#)). To override the default location, the environment variable RAPIDWRIGHT\_PATH can be set and the data files will be placed in \$RAPIDWRIGHT\_PATH/data.
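The override rule described above can be sketched in Python as follows. This is an illustration of the stated behavior, not RapidWright's own code; `data_directory` is a hypothetical helper, and `default_dir` is a placeholder for the default location documented in RapidWright Data Files.

```python
import os


def data_directory(default_dir, environ=os.environ):
    """Return where data files would be placed under the rule described above."""
    override = environ.get("RAPIDWRIGHT_PATH")
    if override:
        # When the variable is set, data files go under $RAPIDWRIGHT_PATH/data.
        return os.path.join(override, "data")
    # Otherwise the default location applies (passed in here as a placeholder).
    return default_dir


if __name__ == "__main__":
    print(data_directory("<default data location>"))
```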

### 2.2.8 RapidWright Installer (Obsolete)

The RapidWright installer is no longer the preferred method of installation. Please use the steps above; the installer instructions below are included for legacy purposes only.

1. Download rapidwright-installer.jar (or run command below in Linux) to the directory where you would like RapidWright to reside.

```
wget http://www.rapidwright.io/docs/_downloads/rapidwright-installer.jar
```

2. From a terminal in that directory, run (To open a terminal on Windows, search and run ‘cmd.exe’ from the Start orb):

```
java -jar rapidwright-installer.jar
```

3. Use one of the BASH/CSH/BAT scripts created at the end of the install to set the proper environment variables for subsequent invocations of RapidWright.
4. Set up your IDE (if applicable):
  - [RapidWright Eclipse Setup](#)
  - [RapidWright IntelliJ Setup](#)

Once complete, you can run the DeviceBrowser within your respective IDE to test the installation.

## 2.3 RapidWright Eclipse Setup

### 2.3.1 Eclipse Step-by-Step Instructions

1. Make sure you have Java JDK 1.8 (or later) installed: <http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html> Follow the instructions when running the downloaded executable. Add the \$(YOUR\_JDK\_INSTALL\_LOCATION)/jdk1.x.x\_x/bin folder to your PATH environment variable.
2. Download Eclipse: <http://www.eclipse.org/downloads/packages/eclipse-ide-java-developers/oxygen2>
3. Install Eclipse by extracting the archive into a desired folder on your computer
4. Run Eclipse (you may want to add the executable to your path)
5. In Eclipse, choose the File->Import... menu option. This will bring up a dialog, choose the Git/Projects from Git option as shown in the screenshot below (click Next):



6. Choose Clone URI and click Next:



7. Copy and paste <https://github.com/Xilinx/RapidWright.git> into the URI box as shown below. The Host and Repository path fields should automatically be populated. Enter user and password (if applicable).



8. Choose the master branch, then click Next:



9. Choose the location where you want Eclipse to put your RapidWright workspace. Preferably, choose the workspace directory that holds your other Eclipse projects, for example /home/user/workspace/RapidWright. Click Next to have Eclipse clone the repo into your workspace.



### 2.3.2 Setup Eclipse with Existing Repo

If you already have the RapidWright repository checked out, you can import it into an Eclipse workspace by following these steps (you can skip to Step 5 if you already have Eclipse installed and open).

1. Make sure you have Java JDK 1.8 (or later) installed: <http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html> Follow the instructions when running the downloaded executable. Add the \${YOUR\_JDK\_INSTALL\_LOCATION}/jdk1.x.x\_x/bin folder to your PATH environment variable.
2. Download Eclipse: <http://www.eclipse.org/downloads/packages/eclipse-ide-java-developers/oxygen2>
3. Install Eclipse by extracting the archive into a desired folder on your computer
4. Run Eclipse (you may want to add the executable to your path)
5. In Eclipse, choose the File->Import... menu option. This will bring up a dialog, choose the Git/Projects from Git option as shown in the screenshot below (click Next):



6. Choose 'Existing local repository', then click Next



7. Select the existing repository by clicking the 'Add...' button



8. Enter the location of the repository in the 'Directory:' text box, check the box next to the name of the repo once it appears in the lower window. Click 'Finish' and then 'Next' on the previous window.



9. On the Wizard selection window, choose 'Import existing projects'. Then, click 'Next'.



10. Finally, click 'Finish' to finalize the import.



11. Eclipse will then import the project, compile all the source and it should look similar to the screenshot below:



## 2.4 RapidWright IntelliJ Setup

### 2.4.1 Step-by-Step Instructions

1. Make sure you have Java JDK 1.8 (or later) installed: <http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html> Follow the instructions when running the downloaded executable. Add the `$(YOUR_JDK_INSTALL_LOCATION)/jdk1.x.x_x/bin` folder to your PATH environment variable.
2. Download IntelliJ: <https://www.jetbrains.com/idea/download/>
3. Install IntelliJ by running the setup executable.
4. Start IntelliJ, and navigate through setting selection (if necessary) to the welcome screen.
5. Choose Open from the Welcome screen and navigate to the RapidWright directory where RapidWright has been installed, then click OK.



6. The RapidWright project should open and IntelliJ may indicate that it is indexing the project. Click the **Project** button in the top-left sidebar to expand the project tree:



7. Expand the source tree to navigate to the DeviceBrowser class, `RapidWright/src/com.xilinx.rapidwright/device/browser/DeviceBrowser`, as shown in the screenshot above.

8. (If running Linux, skip this step.) On Windows, the GUI library jar must be switched from the lin64 version (the default) to the win64 version. To do this, choose File->Project Structure..., then select Libraries under Project Settings at the top left. This should produce a list of jar file names in the right window pane. Use the - and + buttons to remove `qtjambi-lin64*.jar` and add `qtjambi-win64*.jar` in its place:



9. You should now be able to run any of the programs in RapidWright in the IntelliJ environment. For example, right-click on `DeviceBrowser` and choose `Run DeviceBrowser.main()` from the menu. If successful, the `DeviceBrowser` will run similar to the screenshot below:



10. The IntelliJ environment should be correctly configured at this point. If you have problems, try setting the `RAPIDWRIGHT_PATH` environment variable to point to your RapidWright install directory prior to running IntelliJ.

## 2.5 RapidWright Jupyter Notebook Kernel Setup

A [Jupyter Notebook](#) is an open source web application that allows you to create and share live documents that can embed and run code. As RapidWright has a built-in Python interpreter ([Jython](#), a Python interpreter implemented in Java), RapidWright can harness the Jupyter Notebook paradigm for tutorials, demonstrations and design analysis. This page describes how to set up a Jython kernel for use on a local machine to enable RapidWright-capable notebooks.

### 2.5.1 Pre-requisites

- RapidWright 2023.1 or above
- [Python](#) and [Jupyter Notebook](#), see [installation details here](#).
- A web browser

### 2.5.2 Step-by-Step Instructions

1. Make sure Python and Jupyter Notebook are installed, following the [directions provided by Project Jupyter](#).
2. If running RapidWright from the standalone jar, run:

```
java -jar rapidwright-2018.3.0-standalone-lin64.jar --create_jupyter_kernel
```

For all other installs, run:

```
rapidwright StandaloneEntrypoint --create_jupyter_kernel
```

3. Install the Jython 2.7 kernel by running the following at the command line:

```
jupyter kernelspec install <path_to_kernel_file_dir>
```

Two other commands are useful if you make a mistake and need to undo an install.

To list all the installed kernels, run:

```
jupyter kernelspec list
```

To remove an installed kernel by name (obtained from list command), run:

```
jupyter kernelspec uninstall <kernel_name>
```

4. Start the Jupyter Notebook server by running:

```
jupyter notebook
```

The console output should look similar to the following:

```
C:\home\test_jupyter>jupyter notebook
[I 12:26:45.388 NotebookApp] Serving notebooks from local directory: C:\home\test_jupyter
[I 12:26:45.388 NotebookApp] The Jupyter Notebook is running at:
[I 12:26:45.388 NotebookApp] http://localhost:8888/?token=6cbbbe09dd77c5afc0514b47c64e6b7d1bf74f38bbc9c6e29
[I 12:26:45.388 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 12:26:45.397 NotebookApp]

Copy/paste this URL into your browser when you connect for the first time,
to login with a token:
http://localhost:8888/?token=6cbbbe09dd77c5afc0514b47c64e6b7d1bf74f38bbc9c6e29
```

If the Jupyter Notebook directory page does not open automatically, copy and paste the URL from the console output above into your browser.



5. Create a new notebook by clicking on the new button and selecting ‘Jython 2.7’ from the drop down menu as shown in the screenshot below:



6. In the new notebook, you’ll see a long rectangular box where you can enter code. This is called a cell.



7. To get started, try entering some import commands as shown in the screenshot below. You can then click on the "Run" button to run the code from the cell in the kernel.



```
from com.xilinx.rapidwright.device import Device
from com.xilinx.rapidwright.design import Design
```

8. The results from executed commands persist from one cell to the next, and all state variables are maintained, as long as the Jython kernel stays alive. The "Run" button also creates a new cell below the last one executed.

9. Now that we have imported some RapidWright libraries, we can use tab completion to see the visible methods and members of objects.



10. For a quick example, we can create an empty design and write out a design checkpoint. Another way to execute a cell is with the keyboard shortcut **CTRL+ENTER**.



```
design = Design("HelloWorld",Device.AWS_F1)
design.writeCheckpoint(design.getName() + ".dcp")
```

11. By going back to the Jupyter Dashboard (click on the Jupyter logo at the top left of the page), we can find the recently created DCP and select it for download.



Using Jupyter Notebooks with RapidWright is still new, but it allows for easy demonstration of coding examples and techniques for using RapidWright. We hope to leverage its infrastructure significantly in the future.



## FPGA ARCHITECTURE BASICS

### Table of Contents

- *FPGA Architecture Basics*
  - *What is an FPGA?*
  - *CPU vs. FPGA*
  - *Lookup Tables (LUTs)*
  - *State Elements*
  - *Carry Chains*
  - *DSP Blocks*
  - *Block RAMs*

This section is meant as a brief introduction to FPGA architecture and technology. Most people familiar with FPGAs can easily skip this section.

## 3.1 What is an FPGA?

A field programmable gate array (FPGA) is a special kind of chip (integrated circuit, silicon device, microchip, computer chip, or whatever designation is most familiar) that can be programmed to behave essentially like any other chip. One might think that a microprocessor or CPU fits this description, as it is programmable through software compilation. However, an FPGA and a CPU differ significantly in architecture and programming model.

## 3.2 CPU vs. FPGA

A central processing unit (CPU or just processor) follows the Von Neumann compute-based architecture as illustrated in the figure below.

Fig. 1: Basic Von Neumann Processing Model for CPUs (Source: Labtron, Creative Commons).

A control unit driven by instructions fetched from memory drives the flow of input data through the processor's registers and logic, producing outputs. The data paths, instruction set, register counts and memory interface are all fixed at the time of fabrication of the CPU. That is, they are unchanging attributes of the processor and cannot be customized later.

In stark contrast to the CPU architecture, an FPGA has highly configurable logic and data paths. This is enabled by a bit-wise, fine-grained architectural model to realize computation. In order to better understand how FPGAs work, it is beneficial to comprehend their atomic units of computation. Although modern FPGAs have a wide variety of components, at their heart is a large array of replicated programmable look-up tables (LUTs), flip-flops (or registers) and programmable wires called interconnect, as seen in Fig. 2 and Fig. 3.

## 3.3 Lookup Tables (LUTs)

At the heart of configurable logic in FPGAs lies a basic atomic unit of computation: the lookup table, or LUT. A LUT has a single-bit output that is calculated from the input signal values and the configurable table (or memory) entries, as shown in Fig. 4.

Although mainstream FPGAs typically use 6-input LUTs, this example illustrates a 3-input LUT for simplicity. The principle of operation is the same.

LUTs are typically constructed using an N:1 multiplexer (shown in green in Fig. 4b) and an Nx1-bit memory (shown in blue). The example in Fig. 4 is a LUT where N=8. The number of inputs of a LUT is the log base 2 of N, so this example has 3 inputs.

The memory entries in blue boxes in part (b) of Fig. 4 represent the configurable table entries under the ‘out’ column in part (a). The vector of programming bits {a, b, ... h} ultimately decides how the LUT will behave given different values presented on the inputs {i0, i1, i2}. For example, to program the LUT to evaluate “i0 XOR i1” on the inputs, the programming vector {a=0,b=1,c=1,d=0,e=0,f=1,g=1,h=0} would be used. A LUT can implement any Boolean logic equation, limited only by its number of inputs. This characteristic is illustrated in Fig. 5. LUTs are commonly chained or combined in series to implement larger Boolean equations.
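As a minimal sketch of the mechanism described above, the following plain Python models a LUT as a memory addressed by its inputs. It does not use any RapidWright API. The helper name `make_lut` is hypothetical, and the code assumes that entry `a` corresponds to all inputs being 0 and that `i0` is the least significant address bit, which is consistent with the XOR programming vector given in the text.

```python
def make_lut(table):
    """Return a function modelling a LUT whose memory holds `table`.

    `table` is the list of programming bits (a, b, ... h in the text).
    Its length N must be a power of two, giving log2(N) inputs.
    """
    n = len(table)
    num_inputs = n.bit_length() - 1
    if n == 0 or (1 << num_inputs) != n:
        raise ValueError("table length must be a power of two")

    def lut(*inputs):
        if len(inputs) != num_inputs:
            raise ValueError("expected %d inputs" % num_inputs)
        # Assumption: i0 is the least significant bit of the memory address.
        address = sum((bit & 1) << pos for pos, bit in enumerate(inputs))
        # The N:1 multiplexer simply selects one memory entry.
        return table[address]

    return lut


# Programming vector from the text: {a=0, b=1, c=1, d=0, e=0, f=1, g=1, h=0}
xor_lut = make_lut([0, 1, 1, 0, 0, 1, 1, 0])
```

Calling `xor_lut(i0, i1, i2)` returns `i0 XOR i1` for every input combination and ignores `i2`. Changing only the table contents changes the function, which is the sense in which a LUT is programmable.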

In some devices, some of the LUTs have additional functionality that enables them to act as small RAMs. These RAMs can be chained together to build larger RAMs as well.



Fig. 2: Hypothetical FPGA logic array of LUTs, flip flops and programmable wires (interconnect)



Fig. 3: Close up view of replicated tiles of the logic array and interconnect



Fig. 4: (a) Truth table relationship of a LUT (b) Diagram of logical behaviour of a LUT



Fig. 5: Examples of several (but not all) logic functions a LUT can potentially implement

## 3.4 State Elements

Once a value is computed from a LUT, it often is desirable to store it. For this purpose, most FPGAs pair their LUTs with a D-flip-flop or equivalent state element. Often the storage element has configurable reset/clear and clock enable signals with an option of making it behave as a latch. These state elements have dedicated clocking paths to help minimize clock skew.

By chaining together LUTs and storing results in flip flops, FPGAs can implement a wide range of functions and computations, limited only by the number of resources on the device and its delay.
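To illustrate how LUTs and flip flops combine, here is a small sketch of a 2-bit counter built from two lookup tables and two D flip flops. It is plain Python for illustration only. The names `DFlipFlop`, `lut` and `run_counter` are hypothetical, and the model ignores timing, clock skew and latch behavior.

```python
class DFlipFlop(object):
    """A D flip flop with optional clock enable and synchronous reset."""

    def __init__(self, init=0):
        self.q = init

    def clock(self, d, enable=1, reset=0):
        # State only changes when the clock edge is modelled by this call.
        if reset:
            self.q = 0
        elif enable:
            self.q = d & 1
        return self.q


def lut(table, *inputs):
    """Look up one table entry; the first input is the least significant bit."""
    address = sum((bit & 1) << pos for pos, bit in enumerate(inputs))
    return table[address]


NOT_TABLE = [1, 0]        # 1-input LUT: out = NOT i0
XOR_TABLE = [0, 1, 1, 0]  # 2-input LUT: out = i0 XOR i1


def run_counter(cycles):
    """Clock the counter `cycles` times and return the value after each edge."""
    ff0, ff1 = DFlipFlop(), DFlipFlop()
    values = []
    for _ in range(cycles):
        # Both LUT outputs are computed from the current state first...
        d0 = lut(NOT_TABLE, ff0.q)
        d1 = lut(XOR_TABLE, ff0.q, ff1.q)
        # ...and then captured together on the clock edge.
        ff0.clock(d0)
        ff1.clock(d1)
        values.append(ff1.q * 2 + ff0.q)
    return values
```

Running `run_counter(5)` steps through the values 1, 2, 3, 0, 1. The LUTs hold the combinational next-state logic and the flip flops hold the state between clock edges.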

Xilinx offers a variant of LUTs that enable them to also store data in the lookup portion of the table such that they can perform as small memories, shifters or FIFOs. More information on this can be found in [Series 7 CLB User's Guide](#) or [UltraScale CLB User's Guide](#).

## 3.5 Carry Chains

Carry chain blocks are primitive elements that are provided with a group of LUTs to enable more efficient programmable arithmetic. Primarily, they provide dedicated paths for the carry logic of simple arithmetic operations (add, subtract, comparisons, equals, etc.). Implementing the carry logic of these operations in LUTs alone would use resources inefficiently and performance would suffer.
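The following sketch shows the general idea of splitting an adder between LUTs and a dedicated carry path. It is a simplified behavioral model in plain Python, not a description of any specific Xilinx primitive. The function name `carry_chain_add` is hypothetical, and the division of work (a LUT computes the per-bit propagate signal, while a dedicated XOR and mux handle the sum and carry) is an assumption made for illustration.

```python
def carry_chain_add(a, b, width=8, carry_in=0):
    """Add two `width`-bit values bit by bit, returning (sum, carry_out)."""
    carry = carry_in & 1
    result = 0
    for i in range(width):
        a_bit = (a >> i) & 1
        b_bit = (b >> i) & 1
        # Computed in a LUT: the carry propagates when exactly one input is 1.
        propagate = a_bit ^ b_bit
        # Dedicated XOR on the carry path produces the sum bit.
        sum_bit = propagate ^ carry
        # Dedicated mux: pass the incoming carry along, or generate a new one.
        # When propagate is 0 both inputs are equal, so a_bit is the carry out.
        carry = carry if propagate else a_bit
        result |= sum_bit << i
    return result, carry
```

For example, `carry_chain_add(200, 100, width=8)` yields a sum of 44 with a carry out of 1, matching 300 truncated to 8 bits. Only the propagate signal needs a LUT per bit; the carry itself never leaves the dedicated path.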

For more detailed information of Xilinx carry chains, please see [Series 7 CLB User's Guide](#) or [UltraScale CLB User's Guide](#).

## 3.6 DSP Blocks

Multiplication is a common operation, but it can be quite expensive when implemented in LUTs. Therefore, dedicated hard blocks that provide integer multiplication have been present in FPGAs for many years. As applications have evolved, multiplier blocks have evolved to support a variety of DSP-friendly operations such as multiply-accumulate (MAC), wide AND/XOR and several others.

For more detailed information of Xilinx DSP blocks, please see [Series 7 DSP User's Guide](#) or [UltraScale DSP User's Guide](#).

## 3.7 Block RAMs

Memories larger than those available from LUTs are also a significant resource on FPGAs. Each block generally provides several kilobits of storage (Xilinx typically makes 18 Kb or 36 Kb available). These memories are provided in the fabric and are highly configurable and composable, so that larger memories with a variety of features can be built.

For more detailed information of Xilinx Block RAMs, please see [Series 7 Memory User's Guide](#) or [UltraScale Memory User's Guide](#).



## XILINX ARCHITECTURE TERMINOLOGY

### Table of Contents

- *Xilinx Architecture Terminology*
  - *BEL (Basic Element of Logic)*
  - *Site*
  - *Tile*
  - *FSR (Fabric Sub Region or Clock Region)*
  - *SLR (Super Logic Region)*
  - *Device*

In order to use RapidWright, an understanding of Xilinx FPGA architecture and hierarchy will be necessary in navigating your way around the device APIs. In Xilinx FPGAs, there are six major levels of hierarchy that describe basic components all the way up to the entire device. This hierarchy can be seen in the figure below:

We begin our discussion with a bottom-up approach starting with the lowest level of hierarchy, the basic element of logic.

## 4.1 BEL (Basic Element of Logic)

At the lowest level, the atomic unit of Xilinx FPGAs is a BEL (Basic Element of Logic). BELs are the smallest, indivisible, representable component in the fabric of an FPGA. There are two kinds of BELs: logic BELs and routing BELs. A logic BEL is a configurable logic-based site that can support the implementation of a design cell. Each BEL can support one or more types of UNISIM cells (UNISIM cells are described in Libraries Guides [UG953](#) for Series 7 devices and [UG974](#) for UltraScale™ devices). The mapping between a leaf cell in the netlist and a BEL site is referred to as the ‘placement’ of the cell (non-leaf cells do not represent implementable hardware, just hierarchy). Thus, when one runs the Vivado command `place_design`, it is essentially mapping all leaf cells in the netlist to compatible and legal BEL sites.

Routing BELs are programmable routing muxes used to route signals between BELs. Routing BELs do not support any design elements (logic cells from the netlist do not occupy routing BEL sites), they are used only for routing. However, some routing BELs do have optional inversions.

BELs have input and output pins. BELs also have configurable connections that connect an input pin to an output pin. These BEL-based configurable connections are called site PIPs (where PIP stands for Programmable Interconnect Point). Both logic BELs and routing BELs can have site PIPs. However, in the case of a logic BEL, the BEL site must be unoccupied by a cell in order for the route through to be usable. Often, these site PIPs, when implemented in logic BELs (a LUT is a common example), are referred to as a “route through” or “route-thru.” When routing a design, in order to physically route a net it is sometimes necessary to route through unused LUTs or other logic BELs with site PIPs.

Fig. 1: Levels of architectural hierarchy in Xilinx FPGAs.

Fig. 2: Vivado representation of two routing muxes (routing BELs) and two flip flops (logic BELs).

## 4.2 Site

A group of related elements and their connectivity is referred to as a site. Inside of a site, one can find three major categories of objects:

1. BELs (Logic BELs and/or Routing BELs)
2. Site Pins (External input and output pins to the site)
3. Site wires (connecting elements to each other and site pins)

Sites are instances of a type and each site has a unique name with an `_X#Y#` suffix denoting its location in the site type grid. Each site type has its own XY coordinate grid, independent of the others. The only exception is that the SLICEL and SLICEM types share the same grid space. SLICEL and SLICEM are the most common site types and are the basic configurable logic building blocks, containing the LUTs and flip flops that form the backbone of the FPGA fabric.
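As a small illustration of the `_X#Y#` naming convention, the sketch below splits a name into its prefix and grid coordinates using a regular expression. This is plain Python rather than a RapidWright API, the helper name `parse_xy_name` is hypothetical, and the sample names are taken from examples elsewhere in this document.

```python
import re

# Everything before the final _X<digits>Y<digits> is treated as the prefix.
_XY_NAME = re.compile(r"^(?P<prefix>.+)_X(?P<x>\d+)Y(?P<y>\d+)$")


def parse_xy_name(name):
    """Split a name such as 'SLICE_X2Y2' into ('SLICE', 2, 2)."""
    match = _XY_NAME.match(name)
    if match is None:
        raise ValueError("not an _X#Y# style name: %r" % name)
    return match.group("prefix"), int(match.group("x")), int(match.group("y"))
```

For instance, `parse_xy_name("SLICE_X2Y2")` gives `("SLICE", 2, 2)`. Because each site type has its own grid, coordinates from names with different prefixes are generally not comparable.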

### 4.2.1 Site Type

Sites are heavily replicated across the device and each instance of a site corresponds to a site type of that device’s architecture family. Additionally, sites found in an FPGA device are sometimes capable of hosting different types; however, when a tile is queried, a ‘primary’ site type is designated.



Fig. 3: An UltraScale+ SLICEL site, where logic BELs are magenta, routing BELs are green, site pins are red and site wires are yellow.

## 4.3 Tile

At an abstract level, Xilinx devices are created by assembling a grid of tiles. Similar to sites, each tile is an instance of a type and each tile has a unique name with an `_X#Y#` suffix. Tiles are the building blocks used when constructing an FPGA device. Tiles are designed to abut one another when laid down to construct an FPGA device.

Not all tiles contain sites, and those that do can have more than one. Unlike sites and BELs, tiles do not have user-visible pins. Instead, tiles contain uniquely named wires that can connect to site pins or other wires through a programmable interconnect point (PIP). PIPs are programmable muxes that connect two wires together in the same tile. Most PIPs are present in switch box tiles (those with the “INT” prefix). Columns of switch box tiles are designed to connect to all fabric resources such as CLBs, DSPs, and BRAMs. When tiles abut, they are designed such that certain wires in the adjoining tiles line up and connect as shown in the figure below:



### 4.3.1 Node

As there are no pins on tiles, the notion of a node is used to describe the connectivity of wires in between tiles. A node is a collection of electrically connected wires that spans one or more tiles. The figure below shows how four wires that abut among four tiles form a node:



Nodes and wires exist as first class Tcl objects in Vivado and the example above can be queried as follows:

```
% get_wires -of [get_node INT_X12Y101/EE2_W_BEG5]
INT_X12Y101/EE2_W_BEG5 INT_X13Y101/EE2_W_END5 CLEL_R_X12Y101/EASTBUSIN_FT0_21 CLE_M_X13Y101/EASTBUSIN_FT0_21
%
```
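The wire names returned by the query above follow a `TILE/WIRE` pattern, so the tiles spanned by the node can be recovered from the names alone. The sketch below does this in plain Python on the output string copied from the example. The helper name `split_wire_names` is hypothetical and no Vivado or RapidWright API is involved.

```python
def split_wire_names(tcl_output):
    """Turn a whitespace-separated list of TILE/WIRE names into pairs."""
    pairs = []
    for name in tcl_output.split():
        # Split on the first '/' only: tile name before, wire name after.
        tile, wire = name.split("/", 1)
        pairs.append((tile, wire))
    return pairs


node_wires = split_wire_names(
    "INT_X12Y101/EE2_W_BEG5 INT_X13Y101/EE2_W_END5 "
    "CLEL_R_X12Y101/EASTBUSIN_FT0_21 CLE_M_X13Y101/EASTBUSIN_FT0_21"
)
# The set of tiles that this single node passes through.
tiles_spanned = set(tile for tile, _ in node_wires)
```

Here the node consists of four wires, one in each of four tiles, which matches the description of a node as a collection of electrically connected wires spanning one or more tiles.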

For additional resources regarding Vivado objects, see [UG912: Vivado Design Suite Properties Reference Guide](#).

### 4.3.2 Tile Type

Each tile belongs to a type or definition. A tile type contains the inventory list of all wires, PIPs and site types. Vivado does not directly represent the tile type as an object, but it is listed as a property value under each tile.

Xilinx traditionally has leveraged a columnar-based architectural approach to tile layout. That is, with a few exceptions, all tiles within a column are of the same type but tiles occupying the same row are typically different types.

## 4.4 FSR (Fabric Sub Region or Clock Region)

A fabric sub region, also known as a clock region, is a replicated 2D array of tiles in the fabric. In the UltraScale architecture, all FSRs are 60 CLBs tall, but their width will vary depending on the mix of tile types used in its construction.

Clock routing and distribution lines are represented at the same granularity as FSRs. In UltraScale architectures, there are 24 horizontal routing tracks, 24 vertical routing tracks, 24 horizontal distribution tracks and 24 vertical distribution tracks. These routing and distribution tracks abut to tracks in neighboring FSRs to form the device clock network resource set. For more information specific to clocking resources, please see [UG472: Series 7 Clocking Resources User Guide](#) or [UG572: UltraScale Architecture Clocking Resources User Guide](#).

## 4.5 SLR (Super Logic Region)

This level of hierarchy is only present on devices that use stacked silicon interconnect technology (SSIT), also known as 2.5D packaging using a silicon interposer. As multiple dies (or dice) are packaged together, each die becomes a super logic region or SLR. SLRs contain a 2D array of FSRs and are typically identical as each die is fabricated from the same mask set.

In order for logic to communicate between SLRs, the UltraScale architecture employs special tiles in the FSRs neighbouring the abutment of two SLRs. A column of CLBs is removed and replaced with special tiles called Laguna tiles that have dedicated flip flop sites to aid in crossing the SLR divide.

## 4.6 Device

At the highest level of Xilinx architecture is the device. This is generally a 2D array of FSRs for single die products or two or more SLRs abutted vertically.

The core RapidWright object for any Xilinx device is the `Device` class, which is described in the next section.

## RAPIDWRIGHT OVERVIEW

### Table of Contents

- *RapidWright Overview*
  - *Device Package*
  - *EDIF Package (Logical Netlist)*
  - *Design Package (Physical Netlist)*

This page aims to help bridge the gap between Xilinx architectural constructs and classes and APIs found within the RapidWright code base. There are three core packages within RapidWright: device, edif and design.

## 5.1 Device Package

The device package contains the classes that correspond to constructs in the hardware and/or silicon devices. The most prominent and important class in this package is aptly named the `Device` class. The `Device` class represents a specific product family member (xcku040, for example) but does not carry package, speed grade or temperature grade information. These additional unique attributes are captured in the `Package` class. When a specific device is combined with its package and grade information, this uniquely identifies a Xilinx part, represented by the `Part` class.

The details of speed grade, package and temperature grade are most commonly handled by using a String of the part name to uniquely identify a part. RapidWright automatically interprets all valid and supported Xilinx devices by part name and can correctly load a device whether or not that information is included. For example, the following lines of code all load the same device, even though the part name is slightly different:

```
Device device = null;
device = Device.getDevice("xcku040");
device = Device.getDevice("xcku040-fbva676-2");
device = Device.getDevice("xcku040ffva1156");
device = Device.getDevice("xcku040-sfva784-1LV-i");
device = Device.getDevice("xcku040ffva1156-2");
```

The `Device` class maintains a singleton map to avoid loading the same device more than once. Device files are stored in `com.xilinx.rapidwright.util.FileTools.DEVICE_FOLDER_NAME` and are provided by the maintainers of the RapidWright project, typically refreshed with each production release of Vivado (2017.3, 2017.4, 2018.1, ...). A significant amount of information is stored in the device files, so they are highly compressed to avoid consuming excessive disk space.

The Device class makes available all of the architectural resources through various APIs and data objects that follow the same hierarchical model as shown previously in the [Xilinx Architecture Terminology](#) section. For convenience, here again is the logical hierarchy of Xilinx devices:



Fig. 1: Levels of architectural hierarchy in Xilinx FPGAs.

These levels of hierarchy are available in RapidWright and the table below shows basic getters in both RapidWright and Vivado.

| RapidWright Class | RapidWright Java API                   | Vivado Object | Vivado Tcl API                           |
|-------------------|----------------------------------------|---------------|------------------------------------------|
| SLR               | Device.getSLR(int id)                  | SLR           | get_slrs -filter SLR_INDEX==\$idx        |
| ClockRegion       | Device.getClockRegion(int row,int col) | Clock Region  | get_clock_regions -filter NAME==\$name   |
| Tile              | Device.getTile(String name)            | Tile          | get_tiles -filter NAME==\$name           |
| Site              | Device.getSite(String name)            | Site          | get_sites -filter NAME==\$name           |
| BEL               | Site.getBEL(String name)               | BEL           | get_bels -of \$site -filter NAME==\$name |

The `Device` class is the top level object in RapidWright and has direct accessors to all other levels of hierarchy except for BELs. All classes in the hierarchy are static and do not change based on a user design. Most of the interaction between a user's design and the device occurs at the Tile, Site and BEL levels of hierarchy. The `BEL` class can be one of three kinds of non-routing objects in a Site: a Logic BEL, a Routing BEL and a Port (port of the Site). This is designated by its class member enum of type `BELClass`. Most components within the device architecture are assigned an integer index. This helps to lower memory usage by not always having to explicitly represent a component of the architecture with a dedicated object. It also helps by providing faster lookups. In some cases, such as `TileTypeEnum` and `SiteTypeEnum`, the index has been explicitly enumerated and an enum is used instead.

In parallel with the logical hierarchy of Xilinx devices, there also exist several constructs for representing routing resources. At the lowest level are pins on BELs represented by the `BELPin` class. Pins on `Site` objects can be referenced by creating dynamic objects of type `SitePin`. Inside a `Site`, wires called ‘site wires’ connect `BELPin` objects. Connectivity of a site wire is stored in each `BELPin` and also in the `Site` object. Site wires do not have an explicit object for representation, but their name, index and connectivity are available on `Site` and `BELPin` objects.

Remaining faithful to the Vivado representation of inter-site routing resources, RapidWright provides `Wire`, `Node` and `PIP` (Programmable Interconnect Points) objects. These objects are generated on the fly as needed as there can be several millions of unique instances of each. The figure below correlates a Vivado device GUI representation with an example of the different routing resources types available in RapidWright.



Fig. 2: Examples of different routing resources in Xilinx FPGAs.

## 5.2 EDIF Package (Logical Netlist)

In Vivado, all post-synthesis designs have a logical netlist that can be exported in the EDIF netlist format. EDIF (Electronic Design Interchange Format) 2 0 0 is the netlist format used in RapidWright. This is because it is included in Vivado’s design checkpoint file format and because Vivado has facilities to read and write it (`read_edif` and `write_edif`).

RapidWright reads, represents and writes logical netlist information in the EDIF format and the EDIF package is written to explicitly accommodate this need. It was written with Vivado-generated EDIF in mind and may not support every corner case of the EDIF 2 0 0 specification.

Parsing EDIF is performed by the `EDIFParser` class. EDIF is normally handled when reading or writing a DCP, but it can be parsed/exported independently as follows:

```
// Imports needed by this snippet
import com.xilinx.rapidwright.edif.EDIFNetlist;
import com.xilinx.rapidwright.edif.EDIFParser;

// Read in my_edif_file.edf
EDIFParser parser = new EDIFParser("my_edif_file.edf");
EDIFNetlist netlist = parser.parseEDIFNetlist();
// Work some netlist magic...
// ...
// Now write it out
netlist.exportEDIF("my_edif_file_post_rapidwright.edf");
```

The EDIFNetlist is the top level class that contains the netlist and cell libraries. All EDIF-related objects have EDIF as a class name prefix. The EDIFNetlist keeps a reference to the top cell, which is wrapped in the EDIFDesign class. It also maintains a top cell instance reference that is generated when the file is loaded.

Although a full explanation of netlist modeling and relationships is beyond the scope of this documentation, an attempt to clarify the contextual meaning of some of the classes will be made. One important distinction to make is between EDIFPort and EDIFPortInst. At one level, an EDIFPort belongs to an EDIFCell and an EDIFPortInst belongs to an EDIFCellInst. Another distinction is that an EDIFPort can be a bus-based object whereas an EDIFPortInst can only represent a single bit. An EDIFNet defines connectivity inside an EDIFCell by connecting EDIFPortInst objects together (port references on cell instances inside the cell or to external port references entering/leaving the cell).



Fig. 3: Snapshot of the Vivado netlist viewer with references to RapidWright EDIF classes

Most classes inherit from `EDIFName`. EDIF has peculiar naming rules and provides for a mechanism to map the original name to a legal EDIF name. The `EDIF` package in RapidWright attempts to hide all of the String gymnastics necessary to maintain both name spaces and simply present the user with the original intended name.

Several classes also inherit from `EDIFPropertyObject` (which also inherits from `EDIFName`). `EDIFPropertyObject` endows objects with the ability to store properties which are key/value pairs. Properties are a mapping between an `EDIFName` object and a `EDIFPropertyObject`. These properties can contain key programmable information such as LUT equations or attributes specific to BEL sites.

## 5.3 Design Package (Physical Netlist)

The design package is the collection of objects used to describe how a logical netlist maps to the device. The design is also referred to as the physical netlist or implementation. It contains all of the primitive logical cell mappings to hardware, specifically the cell to BEL placements and the mapping of physical nets to programmable interconnect or routing.

The `Design` class in RapidWright is the central hub of information for a design. It keeps track of the logical netlist, physical netlist, constraints, the device and part references among other things. The `Design` class is most similar to a design checkpoint in that it contains all the information necessary to create a DCP file.

Since a design programs a device, there are some one-to-one mappings between the device and design representation in RapidWright. For example:



Fig. 4: Illustration representing how a Cell, SiteInst and Design map to BEL, Site and Device respectively

### 5.3.1 SiteInst

Design representation and implementation in Vivado is BEL-centric (BELs and cells). The `SiteInst` keeps track of the cells placed onto its BELs, the site PIPs used in routing and how routing resources map to nets.

Each `SiteInst` maps to a specific compatible site within a device. The `SiteInst` has a type using a `SiteTypeEnum` as the designator. It also maintains a map of named leaf cells from the logical netlist that are physically placed onto the BEL sites within the site. RapidWright also preserves the same Vivado “fixed” flag that is used in certain situations by Vivado to prevent components inside the site from being moved.

Routing nets inside of a site (intra-site) is different from routing outside of sites (inter-site). Routing nets outside of sites consists of finding a path of `Node` objects from a source site pin to a sink site pin by turning on a set of PIPs. In contrast, routing inside of a site can be a bit more complex as it must also account for site context and consider which BELs are occupied. In general, Vivado attempts to automate the intra-site routing task. RapidWright also strives to do the same (see `SiteInst.routeSite()`); however, it may not always fully automate tasks as expected and the user may be required to call additional APIs when placing/routing design elements.

One of the ways routing is accomplished inside a site is through a `SitePIP`, which is a programmable interconnect point that exists on a routing BEL. Generally, a `SitePIP` will establish a connection through a routing BEL or, in some cases, a logic BEL from an element input pin to an element output pin, thus connecting two separate site wires. The `SiteInst` is the object in RapidWright where site PIP usage is recorded and maintained. By default, all site PIPs are turned off. If a site PIP is added to the `SiteInst`, it is interpreted as being turned on or used.

### 5.3.2 Net

Routing outside of a site is represented by the Net class. A Net in RapidWright is typically named after the logical driver source pin and represents the entire set of logically equivalent nets that map to the same electrically equivalent net. For example, consider the net depicted in the following netlist screenshot:



This figure shows the logical netlist connection of three cells over one physical net. However, there are 11 separate nets in the logical netlist that must be traversed in order to make the connection.

A Net is a physical net that implements a route using PIPs (programmable interconnect points) that, when combined, connect nodes into a path from a source site pin to one or more sink site pins. A Net starts and stops at site pins represented by `SitePinInst` objects (design instances of `SitePin` objects). The physical implementation of the 11 logical nets above is shown in the figure below:



The net is also referenced when routing inside a site, but the site routing implementation is captured in the `SiteInst` object.

### 5.3.3 Cell (A BEL Instance)

At the lowest level, a RapidWright Cell maps a logical leaf cell from the EDIF netlist (EDIFCellInst) to a BEL. The cell name is typically the full hierarchical logical name of the leaf cell it maps to. The cell also maintains the library cell type name (for example, FDRE for a flip flop with reset) and the mapping from logical cell pins to physical cell pins (pins on the BEL).

### 5.3.4 Module

A module is a physical netlist container construct available in RapidWright. A RapidWright module is represented by the `Module` class in the `design` package. A module contains both a logical and physical netlist that provides all the details necessary for a full implementation. It is most similar to a placed and routed out-of-context DCP, however RapidWright enables the implementation to be replicated or relocated to multiple compatible areas of the fabric—capabilities that are not yet available in Vivado. A module is a definition object in that the `SiteInst` and `Net` objects it contains are a prototype or blueprint for a pre-implemented block that can potentially be ‘stamped’ out and relocated in valid locations around a device. The `ModuleInst` represents the instance object of a `Module` and is part of the implemented portion of a physical netlist.

### 5.3.5 Module Instance

A module instance quite simply is an instance of a module. RapidWright supports module instances in a design using the `ModuleInst` class in the `design` package. Module instances have a unique name within the design and as each module has a collection of `SiteInst` and `Net` objects, these containers are prefixed hierarchically with the module instance name. For example, if a module had a `SiteInst` named “SLICE\_X2Y2” and a `Net` named `data_ready`, a newly created module instance named “fred” would have counterpart `SiteInst` and `Net` objects called “`fred/SLICE_X2Y2`” and “`fred/data_ready`”.
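The hierarchical naming rule described above amounts to joining the instance name and the module-level name with a `/` separator. A minimal sketch in plain Python, using the names from the example in the text (the helper `instance_names` is hypothetical and is not part of the RapidWright API):

```python
def instance_names(module_inst_name, module_object_names):
    """Prefix each module-level name with the module instance name."""
    return [module_inst_name + "/" + name for name in module_object_names]


# A module with one SiteInst and one Net, instantiated as "fred".
fred_names = instance_names("fred", ["SLICE_X2Y2", "data_ready"])
```

This produces `fred/SLICE_X2Y2` and `fred/data_ready`, so two instances of the same module never collide as long as their instance names differ.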

A module instance will typically have one of its site instances selected as what is called an ‘anchor’. The anchor site instance is a common reference point by which all other site instances and nets in the instance can be referenced. This is useful for determining if a potential location on the fabric is compatible with the module instance for placement.
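To make the role of the anchor concrete, the sketch below expresses every site of a module as an offset from the anchor and then checks whether a candidate anchor location leaves all sites on available coordinates. This is a simplified illustration in plain Python and not how RapidWright implements placement checks: the function names are hypothetical, sites are reduced to bare (x, y) pairs, and real compatibility also depends on site types, tile types and routing.

```python
def relative_offsets(anchor, sites):
    """Express each (x, y) site as an offset from the anchor site."""
    ax, ay = anchor
    return [(x - ax, y - ay) for x, y in sites]


def relocate(new_anchor, offsets):
    """Compute absolute site coordinates for a candidate anchor location."""
    nx, ny = new_anchor
    return [(nx + dx, ny + dy) for dx, dy in offsets]


def fits(new_anchor, offsets, available):
    """True if every relocated site lands on an available coordinate."""
    return all(site in available for site in relocate(new_anchor, offsets))


# Hypothetical module occupying three sites, anchored at (2, 2).
offsets = relative_offsets((2, 2), [(2, 2), (2, 3), (3, 2)])
```

With these offsets, `relocate((10, 20), offsets)` gives the three sites the module would occupy if its anchor were moved to (10, 20), and `fits` rejects a candidate anchor as soon as one of them is unavailable.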

The `Module` and `ModuleInst` concept is not available in Vivado. However, if a design in RapidWright is written out without being flattened (See `Design.flattenDesign()`), RapidWright will save module metadata in the DCP and the modules and instances can be reloaded if the DCP is reloaded in RapidWright. If the DCP is read by Vivado and then written back out, the module metadata will be lost.



## DESIGN CHECKPOINTS

### Table of Contents

- *Design Checkpoints*
  - *What is a Design Checkpoint?*
  - *What is Inside a Design Checkpoint?*
  - *RapidWright and Design Checkpoint Files*

## 6.1 What is a Design Checkpoint?

A design checkpoint (DCP) is a file used by the Vivado Design Suite that represents a snapshot of a design at any stage of the design process. The snapshot includes the netlist, constraints and implementation results.

## 6.2 What is Inside a Design Checkpoint?

A design checkpoint file (extension .dcp) is a Vivado file format that contains a synthesized netlist and design constraints, and can also contain placement and routing information. RapidWright provides readers and writers to parse and export the various components.

## 6.3 RapidWright and Design Checkpoint Files

RapidWright can freely read and write checkpoint files with the following exceptions:

- If the design is encrypted, RapidWright cannot open it. RapidWright is not capable of decrypting files.
  - Sometimes, however, a design may not be secured or designated to be encrypted but the EDIF file in the DCP is encrypted. This is due to RTL source references being stored in the EDIF file. Vivado will allow you to write out an EDIF file (without RTL source references) with the `write_edif` Tcl command. RapidWright can read in the alternate EDIF file alongside the DCP if it has the same root name (.edf extension instead of .dcp).
- If the design checkpoint file is created with a much newer version of Vivado compared with the RapidWright release, it may not be able to read the file.
- Conversely, older versions of Vivado may not be able to read RapidWright checkpoint files.

Here are a few ways to read/write a design checkpoint in RapidWright:

```
Design design = Design.readCheckpoint("my_design_routed.dcp");
// or if the EDIF inside the DCP is encrypted because of source references,
// you can alternatively supply a separate EDIF
design = Design.readCheckpoint("my_design_routed.dcp", "my_design_edif.edf");

// To write out a design
design.writeCheckpoint("my_design_post_rapidwright.dcp");
```
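The "same root name" convention for a side-by-side EDIF file can be sketched as a small path helper. This is only an illustration of the naming rule described above; `CompanionEdifSketch` and `companionEdif` are hypothetical names and are not part of RapidWright.

```java
/** Hypothetical helper: derives the side-by-side EDIF file name for a DCP. */
final class CompanionEdifSketch {

    /** Replaces a trailing ".dcp" with ".edf", keeping the root name unchanged. */
    static String companionEdif(String dcpPath) {
        if (!dcpPath.endsWith(".dcp")) {
            throw new IllegalArgumentException("Not a .dcp file: " + dcpPath);
        }
        return dcpPath.substring(0, dcpPath.length() - ".dcp".length()) + ".edf";
    }

    public static void main(String[] args) {
        System.out.println(companionEdif("my_design_routed.dcp"));
    }
}
```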

The interface that enables RapidWright to read and write checkpoints is handled by the RapidWright API Library in the provided rapidwright-api-library-<ver>.jar. The APIs in this tool are used in the Design class with readCheckpoint() and writeCheckpoint(). Note that it is licensed separately from the rest of RapidWright under a modified Xilinx EULA. Also note that RapidWright is not an official product from Xilinx and designs created or derived from it are not warranted. Please see [LICENSE.TXT](#) for full details.

## IMPLEMENTATION BASICS

### Table of Contents

- *Implementation Basics*
  - *Placement*
  - *Routing*

Implementation, in the context of RapidWright and compiling designs for FPGAs, is defined as the placement and routing of a synthesized/mapped netlist to a specific FPGA device. This section will describe the detailed mechanics of how placement and routing can be achieved in RapidWright.

## 7.1 Placement

As opposed to Vivado, RapidWright enables three layers or levels of placement in its design abstraction: BEL level, site level and module level. Vivado primarily only enables BEL placement (previously in ISE, sites were the major unit of placement). This section details how RapidWright represents and interacts with design elements at the three levels of placement mentioned.

### 7.1.1 BEL Placement

---

**Note:** Reliable automatic BEL placement in RapidWright is still a work in progress and care should be taken when attempting this capability.

---

Creating correct BEL placements is quite tricky as several factors must be taken into consideration when placing a cell onto a BEL site. Some questions one might need to ask when placing a cell onto a BEL site are:

1. Is the BEL site already occupied and are all pins mappable to the surrounding BEL connections?
2. Are all of the cell connections routable within the site and interconnect?
3. Are the clock and set/reset domains compatible with those already used within the site or are there resources available to route alternatives?
4. Does this cell depend on any dedicated inter-site wires (such as carry chains or DSP cascades) that are not available?

Placing a cell correctly can necessitate updates to the design in the following categories:

1. Mapping of a Cell object to a BEL in RapidWright
2. Pin mappings between the logical and physical cell pins must be added and/or routed within the site (conditions will vary).
3. Use of one or more SitePIPs as part of routing the site (stored in the respective SiteInst)

Generic pin mappings are assigned when a cell is created and placed. However, these mappings may need to be adjusted based on the context.

A SitePIP configures a routing BEL to propagate a signal from one of its inputs to its output pin. SitePIPs must be turned on in the respective SiteInst when a cell is placed onto a BEL as the common convention in Vivado is to always leave the site in a legally routed state.

### 7.1.2 Site Placement

Within RapidWright, it can be straightforward to move a SiteInst from one site to another. An example of how to relocate a site instance from one location to another is shown below:

```
Design d = Design.readCheckpoint("example.dcp");
SiteInst si = d.getSiteInstFromSiteName("SLICE_X0Y0");
si.place(d.getDevice().getSite("SLICE_X1Y1"));
```

The user is responsible for changing any existing routing resources that previously routed to the old site.

### 7.1.3 Module Placement

One of RapidWright's unique capabilities is providing another level of hierarchy in implementation. Through the `Module` and `ModuleInst` classes, a complex cell can be replicated and/or relocated across the device. When a pre-implemented module is created for a device, all valid locations are pre-calculated and stored for the anchor site within the `Module`. Therefore, placement of a `ModuleInst` is simply a matter of selecting one of the valid anchor sites and applying it.

## 7.2 Routing

In Vivado, there are roughly three different types of routing: intra-site, inter-site and clock routing. This section provides a brief overview of each.

### 7.2.1 Site (Intra-site) Routing

When a cell is placed onto a BEL, typical Vivado convention is to route the intra-site net portions immediately after. Routing a site implies mapping the physical net to site wires and site PIPs. In RapidWright, some of this intra-site routing happens when the cell is placed and there are a few methods that can also help finish intra-site routing in special cases. SiteInst.routeIntraSiteNet() will attempt to route one BELPin to another for intra-site nets. SiteInst.routeSite() will attempt to route all the nets that pertain to the site.

### 7.2.2 Interconnect (Inter-site) Routing

The majority of work in routing a design is in inter-site routing. This is the task of selecting a set of routing resources that enable a path between a source site pin and one or more sink site pins. The physical routing of a net in RapidWright is simply described by a list of PIPs. RapidWright comes with a rudimentary router for UltraScale architectures, but it is still a work in progress. It doesn't fully resolve congestion, but provides a working example for more specialized tasks.

### 7.2.3 Clock Routing

Clock routing is very architecture specific and is similar to inter-site routing in that it is also implemented by a list of PIPs. However, there are key steps and constraints that must be satisfied beyond typical inter-site routing.

### 7.2.4 RWRoute

RWRoute is a full design router that has been developed in the RapidWright framework leveraging its *lightweight timing model*. It is capable of routing designs in both wirelength-driven and timing-driven modes, enabling the open source community to innovate and develop new algorithms. The open source aspect enables creation of domain-specific algorithms such as bundle routing in customized cost functions for the desired figure of merit. It also supports a partial routing mode, which is an essential capability for a future library-based customized flow.

**Note:** RWRoute has some limitations:

1. It currently only supports UltraScale+ devices.
2. The timing model in RapidWright does not estimate hold time and thus RWRoute cannot address hold time violations.
3. For the most accurate clock routing in timing-driven mode, certain files will need to be created (see tcl/rwroute/README for more information).
4. When attempting to route designs in timing-driven mode, for the most accurate timing estimates on hard blocks (such as DSPs), the design must be pre-analyzed and a set of files must be created to feed into RWRoute (see tcl/rwroute/README for more information).

By default, RWRoute runs in timing-driven mode, routing a design from scratch. To run an instance of RWRoute the syntax is:

```
rapidwright RWRoute /PATH/TO/INPUT/DCP/design.dcp /PATH/TO/OUTPUT/DCP/design_routed.dcp
```

The following options are available:

[`--nonTimingDriven`] for wirelength-driven routing. RWRoute is non-timing-driven with this option, relying on the Manhattan distance to guide the routing expansion and optimize total wirelength.

[`--partialRouting`] for partial routing. RWRoute strictly preserves routed nets of a design and works only on the unrouted nets of the design.

[`--softPreserve`] for enabling an experimental feature during `--partialRouting`, allowing RWRoute to rip up and re-route otherwise routed (and strictly preserved) nets.

[`--wirelengthWeight <arg>`] to redefine the wirelength weighting factor (alpha). The greater alpha is, the less runtime the router takes, at the expense of longer wirelength. It is within [0, 1]. Runtime usually converges when alpha is larger than 0.7. The default value is 0.8.

[`--timingWeight <arg>`] to redefine the timing-driven weighting factor. The smaller the timing weight is, the better critical path delay will be, at the expense of longer runtime. It is within [0, 1]. The default value is 0.35.

[`--shareExponent <arg>`] to redefine the sharing exponent for timing-driven routing. It is used to control the routing resource sharing when routing connections. When the sharing exponent is 0, the sharing mechanism is criticality-unaware and encourages resource sharing, even when connections are long and timing-critical. With an increasing sharing exponent, the resource sharing is discouraged for critical connections, allowing more suitable routes for them to optimize timing. As a result, the wirelength and routing time are increased. For an effective criticality-aware sharing mechanism, the sharing exponent should be no less than 1. The default value is 2 for an optimized trade-off between the critical path delay reduction and the wirelength-runtime product increase.
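The effect of the sharing exponent can be illustrated numerically. The sketch below assumes one formulation that is consistent with the description above, namely a sharing weight of `(1 - criticality)^shareExponent`, where criticality is a value in [0, 1]. This formula and the names `ShareExponentSketch` and `shareWeight` are assumptions made for illustration; consult the RWRoute sources for the cost function actually used.

```java
/** Illustrative only: an assumed criticality-aware sharing weight. */
final class ShareExponentSketch {

    /**
     * Returns a weight in [0, 1] for how strongly a connection is encouraged
     * to share routing resources (assumed form: (1 - criticality)^exponent).
     */
    static double shareWeight(double criticality, double shareExponent) {
        return Math.pow(1.0 - criticality, shareExponent);
    }

    public static void main(String[] args) {
        double[] criticalities = {0.0, 0.5, 0.9};
        double[] exponents = {0, 1, 2};
        for (double exp : exponents) {
            for (double crit : criticalities) {
                System.out.printf("exponent=%.0f criticality=%.1f weight=%.3f%n",
                        exp, crit, shareWeight(crit, exp));
            }
        }
    }
}
```

With an exponent of 0 the weight is 1 for every connection, so sharing is criticality-unaware. As the exponent grows, the weight of highly critical connections drops towards 0, which discourages them from sharing resources.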

There are three tutorials that provide information about using RWRoute in different routing modes:

1. [\*RWRoute Wirelength-driven Routing Tutorial\*](#)
2. [\*RWRoute Timing-driven Routing Tutorial\*](#)
3. [\*RWRoute Partial Routing Tutorial\*](#)

For all other configuration options, please refer to [src/com/xilinx/rapidwright/rwroute/RWRouteConfig.java](#).


## MERGING DESIGNS

One useful technique in constructing an FPGA implementation is the ‘divide and conquer’ approach. Vivado can often achieve higher density and quality of results when it can focus on smaller parts of a design rather than the entire implementation at once.

After dividing a design into separate pieces, it can be tricky to re-assemble the components back into a cohesive implementation. The logical netlist must be consistent as well as the physical netlist. A popular approach is to separate the design by module hierarchy, implementing each module or cell out of context. However, not all designs benefit or have the right hierarchy necessary for this approach. To provide a more robust method of assembly, we have added a design merge capability in RapidWright.

Merging two or more designs in RapidWright can be accomplished with the API:

```
public static Design MergeDesign.mergeDesigns(Design ... designs);
```

This method uses Java’s variable argument construct (`Varargs`), which can accept any Java Collection object (`List<Design>`, `Set<Design>`, `Collection<Design>`), an array (`Design[]`) or a simple comma-separated list (`design0, design1, ..., designN`). The return value is the resulting merged design, which is the first design passed as an argument (`design0` in the comma-separated list case). All other designs are destructively changed to support the merge.
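The calling convention (first argument is returned, the others are consumed) can be shown with a toy stand-in. Nothing below is RapidWright code: `ToyDesign` and `mergeAll` are hypothetical names, and a "design" is reduced to a list of cell names purely to demonstrate the array and comma-separated call forms of a varargs method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy illustration of a varargs merge that returns its first argument. */
final class VarargsMergeSketch {

    /** Stand-in for a design: a name plus a bag of cell names. */
    static final class ToyDesign {
        final String name;
        final List<String> cells = new ArrayList<>();

        ToyDesign(String name, String... cells) {
            this.name = name;
            this.cells.addAll(Arrays.asList(cells));
        }
    }

    /** Merges all designs into the first one and returns it; the others are emptied. */
    static ToyDesign mergeAll(ToyDesign... designs) {
        ToyDesign result = designs[0];
        for (int i = 1; i < designs.length; i++) {
            result.cells.addAll(designs[i].cells);
            designs[i].cells.clear(); // mimics the destructive change to the other inputs
        }
        return result;
    }

    public static void main(String[] args) {
        ToyDesign a = new ToyDesign("design0", "ff0");
        ToyDesign b = new ToyDesign("design1", "lut0", "lut1");
        ToyDesign merged = mergeAll(a, b);                  // comma-separated form
        System.out.println(merged.name + " " + merged.cells);

        ToyDesign[] more = { merged, new ToyDesign("design2", "bram0") };
        System.out.println(mergeAll(more).cells);           // array form
    }
}
```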

### 8.1 Customizing Merge Behavior

As there might be different valid ways to merge a design, the merge process employs an `AbstractDesignMerger` to allow a user to implement the desired merging behavior. There are five major object types that must be resolved in a merge and they are captured in the abstract class that inheritors must implement:

```
public abstract void mergePorts(EDIFPort p0, EDIFPort p1);
public abstract void mergeLogicalNets(EDIFNet n0, EDIFNet n1);
public abstract void mergeCellInsts(EDIFCellInst i0, EDIFCellInst i1);
public abstract void mergeSiteInsts(SiteInst s0, SiteInst s1);
public abstract void mergePhysicalNets(Net n0, Net n1);
```

The first three (`EDIFPort`, `EDIFNet` and `EDIFCellInst`) are all logical netlist objects. The last two (`SiteInst` and `Net`) are physical netlist objects. In the `DefaultDesignMerger`, a general merge behavior is implemented. The order in which the objects are merged is the same as that listed above. Here is a brief description of the default merge behavior of the 5 object types in `DefaultDesignMerger`.

#### 8.1.1 Merging EDIFPorts

When merging ports, the two sets of ports on the top cell of both designs are examined. All names that are unique are merged into the resulting design. If both designs contain a port with the same name, the directionality of the ports is checked. If they are of opposite directionality (one is an input and the other an output), generally both ports will be removed and their connected nets will be joined. The net attached to the input port will be eliminated and the net on the output port will assume the sinks of the input port.

If both ports are inputs, the extra copy is removed and its sinks are added to the first design's port. If both ports are outputs, the same approach is taken except if the merging is incompatible (two different sources that cannot be merged), an error will be thrown.
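The port rules above amount to a small decision table. The sketch below encodes only that decision, not the netlist edits themselves; `PortMergeSketch`, `Direction`, `Action` and `decide` are hypothetical names, and bidirectional ports are not modeled.

```java
/** Hypothetical decision table for the default port-merge rules described above. */
final class PortMergeSketch {

    enum Direction { INPUT, OUTPUT }

    enum Action {
        KEEP_BOTH,                  // names differ: both ports go into the merged design
        JOIN_NETS_AND_REMOVE_PORTS, // input meets output: ports removed, nets joined
        MERGE_INTO_FIRST_INPUT,     // both inputs: extra copy removed, sinks moved over
        MERGE_OUTPUTS_OR_FAIL       // both outputs: merge if compatible, otherwise an error
    }

    static Action decide(String name0, Direction dir0, String name1, Direction dir1) {
        if (!name0.equals(name1)) {
            return Action.KEEP_BOTH;
        }
        if (dir0 != dir1) {
            return Action.JOIN_NETS_AND_REMOVE_PORTS;
        }
        return dir0 == Direction.INPUT
                ? Action.MERGE_INTO_FIRST_INPUT
                : Action.MERGE_OUTPUTS_OR_FAIL;
    }

    public static void main(String[] args) {
        System.out.println(decide("data", Direction.OUTPUT, "data", Direction.INPUT));
        System.out.println(decide("clk", Direction.INPUT, "clk", Direction.INPUT));
    }
}
```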

#### 8.1.2 Merging EDIFNets (Logical)

All unique nets are included in the merged design. If each input design has a net with the same name, sinks are moved from one copy to the merged net. If one net has a top level port source and the other has a real (hard cell pin) source, the merging will choose the real source to be included in the final merged net and the port will be omitted.

#### 8.1.3 Merging EDIFCellInsts

All unique instances are included in the merged design. When two designs contain instances of the same name, only one is kept. Each pin on both copies of the instance is examined; if a pin is unconnected or undriven on one copy, the connection from the other source design is used. If both copies of a pin are driven by a source or connection that cannot be merged, an error is thrown.

To illustrate some of these merging concepts, consider the following two designs, Design A and Design B:

Fig. 1: Input Design A

Fig. 2: Input Design B

If Design A and Design B were merged using the RapidWright API, the resulting design would be:

Fig. 3: Result of Merging Design A and Design B

#### 8.1.4 Merging SiteInsts

Generally, if a merge of two or more designs is attempted, their implementations should not overlap unless curated in a predictable manner. Merging more than one `SiteInst` from two different sources both placed onto the same site can be complicated and error prone. The merge API will attempt to merge placed cells and site routing even if they occupy the same site.

#### 8.1.5 Merging Nets (Physical)

All unique physical nets are merged in the final result. If more than one copy of a physical net exists in the design inputs, the routing is combined simply by taking the union of the PIPs belonging to each copy. GND and VCC are common cases where the physical net is merged.
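The union rule for physical nets can be sketched with plain sets. PIPs are represented here as opaque strings with made-up names; `PipUnionSketch` and `mergeRouting` are hypothetical and only illustrate that PIPs present in both copies appear once in the merged routing.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative union of two PIP lists belonging to copies of the same physical net. */
final class PipUnionSketch {

    /** Returns the PIPs of both copies without duplicates, preserving first-seen order. */
    static Set<String> mergeRouting(List<String> pips0, List<String> pips1) {
        Set<String> merged = new LinkedHashSet<>(pips0);
        merged.addAll(pips1);
        return merged;
    }

    public static void main(String[] args) {
        // PIP names below are placeholders, not real device resources.
        List<String> copy0 = List.of("tileA/pip0", "tileA/pip1");
        List<String> copy1 = List.of("tileA/pip1", "tileB/pip0");
        System.out.println(mergeRouting(copy0, copy1));
    }
}
```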

## BITSTREAM MANIPULATION

### Table of Contents

- *Bitstream Manipulation*
  - *Disclaimer*
  - *Overview*
  - *Bitstream Packet Model*
  - *Configuration Array Model*
  - *Example Usages: Modify User State Bits*
  - *Example Usages: Find and Print the Frames of a Placed Cell*

This section describes the useful capabilities available in RapidWright when working on placed and routed designs and bitstreams created by Vivado.

### 9.1 Disclaimer

RapidWright cannot generate bitstreams on its own. It is necessary to create bitstreams using Vivado. RapidWright does not contain the information needed to translate a placed and routed design into a bitstream. RapidWright also has no encryption/decryption capabilities and will not be able to successfully parse any bitstreams that are encrypted. As with any files generated by RapidWright, modified bitstreams are not warranted; RapidWright is intended as an experimental platform only.

### 9.2 Overview

RapidWright has some new, useful, documented bitstream capabilities that can be provided for existing placed and routed circuits when a Vivado-generated bitstream is readily available. This section will describe at least three capabilities:

1. Update existing user-defined initialization state such as flip-flop, LUTRAM and BRAM initialization values.
2. Coarse-grained correlation of placed and routed circuits to approximate locations in the bitstream for reliability analysis and related analysis.
3. For highly constrained and well-planned sockets, it presents the opportunity to relocate partial bitstreams into different DFX regions (documentation coming soon).

In order to support these capabilities, RapidWright has been augmented with a set of APIs that provide bitstream parsing and a configuration array model. These two models are heavily influenced and derived from existing Xilinx Configuration User Guides:

- UltraScale Architecture: [UG570](#)
- Series 7 FPGAs: [UG470](#)
- Zynq-7000 SoC Technical Reference Manual: [UG585](#)

Users are highly encouraged to review these guides to gain a better understanding of the mechanics of bitstream delivery and structure as most of these details will not be duplicated in this description.

There are two ways to represent a bitstream in RapidWright. The first is through a packet stream model represented by the `Bitstream` class. The second is a configuration array model (see `ConfigArray` class) that loosely represents the memory array of the device as configured by the packets delivered from the bitstream. Each model is briefly described below.

## 9.3 Bitstream Packet Model

A `.bit` file is essentially a sequence of packets that contain instructions to read and write configuration registers (see the configuration user guides above for greater detail). RapidWright has several classes and enums that parse and represent the different components of a bitstream: `Bitstream`, `BitstreamHeader`, `Packet`, `OpCode`, `PacketType` and `RegisterType`. A key point is that a bitstream contains a list of packets that read and write registers. One register in particular is the Frame Data Register Input (FDRI), through which data is written to the configuration memory of the device.

### 9.3.1 BitstreamHeader

The bitstream header appears at the beginning of a `.bit` file and is a list of 32-bit words that contain some metadata about the bitstream (creation date/time, target part name, design name, etc.). It also contains some dummy pad words and bus width detection packets. The header ends with the sync word (0xAA995566). An excerpt from a Series 7 bitstream is shown in Table 5-19 below:

### 9.3.2 Packet and PacketType

Each packet has a header word (32 bits) and often a payload. There are two kinds of packets, most of which are of type 1. Type 2 packets are used for very large payloads (such as configuration array data). The bit fields, taken from the configuration user guide, are shown in the packet header tables below:

### 9.3.3 RegisterType and Frame Address Register

There are several configuration register types; please refer to your architecture's respective guide (listed above) for details. One of the most important registers is the frame address register (FAR). The FAR holds the address in the configuration array to which a frame of configuration data is written.

The configuration array is divided into smaller segments called configuration rows. Rows are divided into columns of blocks, and each unique block is divided into a number of frames. A block is the same height as a clock region in the fabric.

A frame address has several fields that are architecture specific; see the tables below for the bit fields used. For example, a Series 7 device distinguishes the top and bottom halves of a device as separate regions, whereas UltraScale and UltraScale+ do not:



**Table 5-19: Sample XC7K325T Bitstream**

| <b>Configuration Data Word (hex)</b> | <b>Description</b>            |
|--------------------------------------|-------------------------------|
| FFFFFFFF                             | Dummy pad word, word 1        |
| FFFFFFFF                             | Dummy pad word, word 2        |
| ...                                  | Dummy pad words 3-7           |
| FFFFFFFF                             | Dummy pad word, word 8        |
| 000000BB                             | Bus width auto detect, word 1 |
| 11220044                             | Bus width auto detect, word 2 |
| FFFFFFFF                             | Dummy pad word                |
| FFFFFFFF                             | Dummy pad word                |
| AA995566                             | Sync word                     |
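Locating the end of the header can be sketched as a scan for the sync word. The class below is not how RapidWright's `Bitstream` or `BitstreamHeader` classes are implemented; `SyncWordSketch` and `findEndOfSync` are hypothetical names. It assumes the file bytes are in the order shown in the table (most significant byte first) and scans byte by byte because the metadata in front of the pad words is not necessarily a multiple of four bytes long.

```java
/** Illustrative scan for the sync word that terminates the bitstream header. */
final class SyncWordSketch {

    static final int SYNC_WORD = 0xAA995566;

    /** Returns the byte offset just past the first sync word, or -1 if none is found. */
    static int findEndOfSync(byte[] data) {
        for (int i = 0; i + 4 <= data.length; i++) {
            int word = ((data[i] & 0xFF) << 24)
                     | ((data[i + 1] & 0xFF) << 16)
                     | ((data[i + 2] & 0xFF) << 8)
                     |  (data[i + 3] & 0xFF);
            if (word == SYNC_WORD) {
                return i + 4;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] data = {
            (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, // dummy pad word
            0x00, 0x00, 0x00, (byte) 0xBB,                      // bus width auto detect, word 1
            0x11, 0x22, 0x00, 0x44,                             // bus width auto detect, word 2
            (byte) 0xAA, (byte) 0x99, 0x55, 0x66                // sync word
        };
        System.out.println("Packets start at byte " + findEndOfSync(data));
    }
}
```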

**Table 9-16: Type 1 Packet Header Format**

| Header Type | Opcode  | Register Address | Reserved | Word Count |
|-------------|---------|------------------|----------|------------|
| [31:29]     | [28:27] | [26:13]          | [12:11]  | [10:0]     |

**Notes:**

1. "R" means the bit is not used and reserved for future use. The reserved bits should be written as 0s.

**Table 9-18: Type 2 Packet Header**

| Header Type | Opcode  | Word Count |
|-------------|---------|------------|
| [31:29]     | [28:27] | [26:0]     |

**Table 9-17: OPCODE Format**

| OPCODE | Function |
|--------|----------|
| 00     | NOOP     |
| 01     | Read     |
| 10     | Write    |
| 11     | Reserved |
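The header fields in the tables above can be extracted with shifts and masks. The sketch below is independent of RapidWright's `Packet` class (`PacketHeaderSketch` and its methods are hypothetical names); it only applies the bit ranges listed in the Type 1 and Type 2 header tables and the opcode encoding.

```java
/** Illustrative decoder for Type 1 and Type 2 packet header words. */
final class PacketHeaderSketch {

    static int headerType(int word)      { return (word >>> 29) & 0x7; }    // bits [31:29]
    static int opcode(int word)          { return (word >>> 27) & 0x3; }    // bits [28:27]
    static int registerAddress(int word) { return (word >>> 13) & 0x3FFF; } // bits [26:13], Type 1 only

    /** Word count is bits [10:0] for Type 1 and bits [26:0] for Type 2. */
    static int wordCount(int word) {
        return headerType(word) == 2 ? (word & 0x07FFFFFF) : (word & 0x7FF);
    }

    static String opcodeName(int word) {
        switch (opcode(word)) {
            case 0:  return "NOOP";
            case 1:  return "Read";
            case 2:  return "Write";
            default: return "Reserved";
        }
    }

    static String describe(int word) {
        int type = headerType(word);
        if (type == 1) {
            return String.format("Type 1 %s reg=%d words=%d",
                    opcodeName(word), registerAddress(word), wordCount(word));
        }
        return String.format("Type %d %s words=%d", type, opcodeName(word), wordCount(word));
    }

    public static void main(String[] args) {
        System.out.println(describe(0x30008001)); // Type 1 write of one word to register address 4
        System.out.println(describe(0x50000065)); // Type 2 write with a 0x65-word payload
    }
}
```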

**Table 5-24: Frame Address Register Description (Series 7)**

| Address Type   | Bit Index | Description                                                                                                                          |
|----------------|-----------|--------------------------------------------------------------------------------------------------------------------------------------|
| Block Type     | [25:23]   | Valid block types are CLB, I/O, CLK (000), block RAM content (001), and CFG_CLB (010). A normal bitstream does not include type 011. |
| Top/Bottom Bit | 22        | Select between top-half rows (0) and bottom-half rows (1).                                                                           |
| Row Address    | [21:17]   | Selects the current row. The row addresses increment from center to top and then reset and increment from center to bottom.          |
| Column Address | [16:7]    | Selects a major column, such as a column of CLBs. Column addresses start at 0 on the left and increase to the right.                 |
| Minor Address  | [6:0]     | Selects a frame within a major column.                                                                                               |

**Table 9-20: UltraScale FPGA Frame Address Register Description**

| Address Type   | Bit Index | Description                                                                                                               |
|----------------|-----------|---------------------------------------------------------------------------------------------------------------------------|
| Block type     | [25:23]   | Valid block types are CLB, I/O, CLK (000), block RAM content (001). A normal bitstream does not include types 010 or 011. |
| Row address    | [22:17]   | Selects the current row. The row addresses increment from bottom to top.                                                  |
| Column address | [16:7]    | Selects a major column, such as a column of CLBs. Column addresses start at 0 on the left and increase to the right.      |
| Minor address  | [6:0]     | Selects a frame within a major column.                                                                                    |

**Table 9-21: UltraScale+ FPGA Frame Address Register Description**

| Address Type   | Bit Index | Description                                                                                                                       |
|----------------|-----------|-----------------------------------------------------------------------------------------------------------------------------------|
| Block type     | [26:24]   | Valid block types are CLB, I/O, CLK (000), block RAM content (001). A normal bitstream does not include types 010 or 011, or 100. |
| Row address    | [23:18]   | Selects the current row. The row addresses increment from bottom to top.                                                          |
| Column address | [17:8]    | Selects a major column, such as a column of CLBs. Column addresses start at 0 on the left and increase to the right.              |
| Minor address  | [7:0]     | Selects a frame within a major column.                                                                                            |
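The three frame address layouts differ only in their bit ranges, so a single field extractor covers all of them. The sketch below uses the bit indices from the tables above; `FrameAddressSketch`, `Arch`, `bits` and `decode` are hypothetical names and not RapidWright classes.

```java
/** Illustrative frame address register (FAR) field decoder for three architectures. */
final class FrameAddressSketch {

    enum Arch { SERIES7, ULTRASCALE, ULTRASCALE_PLUS }

    /** Extracts bits [hi:lo] of word as an unsigned value. */
    static int bits(int word, int hi, int lo) {
        int width = hi - lo + 1;
        return (word >>> lo) & ((1 << width) - 1);
    }

    static String decode(Arch arch, int far) {
        switch (arch) {
            case SERIES7:
                return String.format("block=%d top/bottom=%d row=%d col=%d minor=%d",
                        bits(far, 25, 23), bits(far, 22, 22), bits(far, 21, 17),
                        bits(far, 16, 7), bits(far, 6, 0));
            case ULTRASCALE:
                return String.format("block=%d row=%d col=%d minor=%d",
                        bits(far, 25, 23), bits(far, 22, 17), bits(far, 16, 7), bits(far, 6, 0));
            default: // ULTRASCALE_PLUS
                return String.format("block=%d row=%d col=%d minor=%d",
                        bits(far, 26, 24), bits(far, 23, 18), bits(far, 17, 8), bits(far, 7, 0));
        }
    }

    public static void main(String[] args) {
        // Block type 1, row 3, column 5, minor 7, packed with the UltraScale layout.
        int far = (1 << 23) | (3 << 17) | (5 << 7) | 7;
        System.out.println(decode(Arch.ULTRASCALE, far));
        System.out.println(decode(Arch.SERIES7, far));
    }
}
```

The same 32-bit word decodes differently per architecture, which is why the device family has to be known before a frame address can be interpreted.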



When a frame of data is written to the FDRI, the FAR automatically increments each time a complete frame of data is written. Thus, no additional packets to set the FAR are necessary, although there are debug CRC bitstreams that can be generated where the FAR address is set explicitly for each frame (see [UG908](#), Table A-1, BITSTREAM.GENERAL.DEBUGBITSTREAM YES).

## 9.4 Configuration Array Model

The `ConfigArray` class represents the array as defined by the address space of the FAR and holds all the frame data written to it by the packet list from the bitstream. The config array is essentially a list of configuration rows; each row is a list of configuration blocks and each block is a list of configuration frames.
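The rows, blocks and frames hierarchy can be pictured as nested lists. The sketch below is a stand-alone toy model, not the RapidWright `ConfigArray` implementation: all names are hypothetical, every row is given the same number of blocks and every block the same number of frames (real devices vary by column), and the block type dimension of the frame address is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a configuration array: rows contain blocks, blocks contain frames. */
final class ConfigArraySketch {

    static final class Frame {
        final int[] words;
        Frame(int wordCount) { words = new int[wordCount]; }
    }

    static final class Block {
        final List<Frame> frames = new ArrayList<>();
    }

    static final class Row {
        final List<Block> blocks = new ArrayList<>();
    }

    final List<Row> rows = new ArrayList<>();

    /** Builds a uniform array; the uniform sizes are a simplifying assumption. */
    static ConfigArraySketch create(int rowCount, int blocksPerRow, int framesPerBlock, int wordsPerFrame) {
        ConfigArraySketch array = new ConfigArraySketch();
        for (int r = 0; r < rowCount; r++) {
            Row row = new Row();
            for (int b = 0; b < blocksPerRow; b++) {
                Block block = new Block();
                for (int f = 0; f < framesPerBlock; f++) {
                    block.frames.add(new Frame(wordsPerFrame));
                }
                row.blocks.add(block);
            }
            array.rows.add(row);
        }
        return array;
    }

    /** Looks up a frame by its row, column and minor address. */
    Frame frame(int row, int column, int minor) {
        return rows.get(row).blocks.get(column).frames.get(minor);
    }

    public static void main(String[] args) {
        ConfigArraySketch array = create(2, 3, 4, 5);
        array.frame(1, 2, 3).words[0] = 0xCAFE; // write one word of one frame
        System.out.println(Integer.toHexString(array.frame(1, 2, 3).words[0]));
    }
}
```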

## 9.5 Example Usages: Modify User State Bits

**Note:** The API ConfigArray.updateUserStateBits() only updates user state bits as documented in the logic location file generated from write\_bitstream -logic\_location\_file.

---

```
public static void main(String[] args) {
    Design design = Design.readCheckpoint(args[0]);
    Bitstream bitstream = Bitstream.readBitstream(args[1]);
    ConfigArray configArray = bitstream.configureArray();

    // Changes the initialization of the FF to 1
    Cell cell = design.getCell("myFF");
    cell.setProperty("INIT").setValue("1");
    configArray.updateUserStateBits(cell);
    bitstream.updatePacketsFromConfigArray();

    design.writeCheckpoint(args[2]);
    bitstream.writeBitstream(args[3]);
}
```

## 9.6 Example Usages: Find and Print the Frames of a Placed Cell

```
public static void main(String[] args) {
    Design design = Design.readCheckpoint(args[0]);
    Bitstream bitstream = Bitstream.readBitstream(args[1]);
    ConfigArray configArray = bitstream.configureArray();

    // Find Configuration Block of a resource and print frames
    Cell cell = design.getCell("myFF");
    Block block = configArray.getConfigBlock(cell.getTile());
    for(Frame frame : block.getFrames()) {
        System.out.println(frame.toString(true));
    }
}
```



## FPGA INTERCHANGE FORMAT

### 10.1 What is the FPGA Interchange Format?

The FPGA Interchange Format (FPGAIF) is a standard exchange format designed to provide all the information necessary to perform placement and routing in an open source context. It contains three major schemas that define how to transfer the following kinds of data in an architecture-independent way:

1. FPGA architecture device model: The available placement and programmable routing resources on the FPGA
2. Logical netlist: Cell definitions, networks, pins, hierarchy, etc.
3. Physical netlist: Placement mappings and routing configurations, i.e. mapping the logical netlist to the FPGA architecture device model

The FPGAIF is hosted as an [open source project](#) under the [CHIPS Alliance](#) and original development was started in 2020.

### 10.2 What does the FPGA Interchange Format enable?

Primarily it allows tools—both commercial and open source—an open way to exchange FPGA device and design data to enable customized place and route solutions. Some tools and efforts that support the FPGAIF:

- [DREAMPlaceFPGA](#) – An open source GPU accelerated FPGA placer ([FPGAIF Support Page](#))
- [ISFPGA 2024 Runtime-first Routing Contest](#)
- [python-fpga-interchange](#) – A Python module for reading and writing FPGA Interchange Files
- [RapidWright](#) – Full support for all AMD-Xilinx architectures and design files

### 10.3 How is RapidWright related to the FPGA Interchange Format?

RapidWright has a full reference implementation of the entire FPGA Interchange schema. It is able to generate nearly all supported FPGA devices in the format and can read and write Interchange designs. It can convert those designs to and from design checkpoint files to be exported and imported from Vivado.

### 10.4 Additional Resources

- [AMD-Xilinx announcement of support for the FPGA Interchange Format](#)
- [Google Open Source Blog Article on the FPGA Interchange Format](#)
- [ReadTheDocs Documentation for the FPGA Interchange Schema](#)
- [FPGA Interchange Schema GitHub Repository](#)

## RAPIDWRIGHT PUBLICATIONS

### 11.1 Original RapidWright Publication - FCCM 2018

RapidWright: Enabling Custom Crafted Implementations for FPGAs |  
Slides | BibTex

### 11.2 Additional RapidWright Publications

| Conference | Title                                                                                           |
|------------|-------------------------------------------------------------------------------------------------|
| FPGA 2022  | RapidStream: Parallel Physical Implementation of FPGA HLS Designs (**Best Paper Award Winner**) |
| FPT 2021   | RWRoute: An Open-source Timing-driven Router for Commercial FPGAs                               |
| FPT 2019   | An Open-source Lightweight Timing Model for RapidWright (Slides)                                |
| FPGA 2019  | Build Your Own Domain-specific Solutions with RapidWright (Slides)                              |

### 11.3 Select Community Publications

| Conference  | Authors             | Title                                                                            |
|-------------|---------------------|----------------------------------------------------------------------------------|
| ICCD 2022   | Kwadjo, D., et al.  | Accelerating Hybrid Quantized Neural Networks on Multi-tenant Cloud FPGA         |
| J. of PDC*  | Kwadjo, D., et al.  | Towards a Component-based Acceleration of Convolutional Neural Networks on FPGAs |
| IPDPSW 2021 | Kwadjo, D., et al.  | Exploring a Layer-based Pre-implemented Flow for Mapping CNN on FPGA             |
| ASPLOS 2020 | Zha, Y., et al.     | Virtualizing FPGAs in the Cloud                                                  |
| FPT 2019    | Mandebi, J., et al. | Automatic Generation of Application-Specific FPGA Overlays with RapidWright      |
| FPL 2019    | Hale, R., et al.    | Preallocating Resources for Distributed Memory Based FPGA Debug                  |
| FCCM 2019   | Liu, L., et al.     | RapidRoute: Fast Assembly of Communication Structures for FPGA Overlays          |
| FPL 2018    | Hale, R., et al.    | Enabling Low Impact, Rapid Debug for Highly Utilized FPGA Designs                |

\*Journal of Parallel and Distributed Computing, May 2022

## A PRE-IMPLEMENTED MODULE FLOW

This section describes a pre-implemented module flow that can operate in two ways:

1. As a high performance flow that reuses high quality, customized solutions.
2. As a rapid prototyping demonstration vehicle that hints at a future of fast compile times.

### 12.1 Background and Flow Comparison

Both flows (high performance and rapid prototyping) start with the Tcl command provided by RapidWright, `rapid_compile_ipi`. This command can be loaded by running `source ${::env(RAPIDWRIGHT_PATH)}/tcl/rapidwright.tcl` in the Vivado Tcl interpreter. Optionally, you can configure Vivado to source the script each time it starts by modifying `Vivado_init.tcl` (see the section ‘Loading and Running Tcl Scripts’ in [UG894: Vivado Design Suite User Guide - Using Tcl Scripting](#)).

---

**Note:** If you are using a standalone jar, you can extract the `rapidwright.tcl` (and other device/data) by running `java -jar <standalone.jar> --unpack_data` and setting the environment variable `RAPIDWRIGHT_PATH` to the standalone jar location.

---

This command runs on an open IP Integrator design by synthesizing, placing and routing all IP blocks out-of-context (OOC). Each block is given a pblock (an area constraint applied before placement to improve its reusability). The implemented result for each IP is stored in the Vivado IP cache. RapidWright then uses the cache for each subsequent run and pre-implements only one of each kind of IP, so a design with multiple instances of the same IP needs only one run per type. After all IPs have been implemented OOC, the command invokes the BlockStitcher in RapidWright to stitch all of the pre-implemented blocks together, place the blocks and route them into a final implementation (note: the RapidWright router is currently disabled). This command can function in two modes, as described previously. Here is a quick comparison of the high performance and rapid prototyping modes for pre-implemented blocks:

|                  | High Performance Flow          | Rapid Prototyping Flow       |
|------------------|--------------------------------|------------------------------|
| PBlock Selection | Application Architect (Manual) | PBlock Generator             |
| Block Placement  | Application Architect (Manual) | Block Placer                 |
| Global Routing   | Vivado                         | RapidWright Router OR Vivado |

The high performance flow (described in more detail in the [High Performance Flow](#) section below) requires input from the application architect of the design. This involves extra effort, but can lead to the highest quality implementation results. The [Rapid Prototyping Flow](#) is optimized for fast compile times by automating both pblock selection for each block/IP involved and placement of the blocks.

#### 12.1.1 Module Cache

To make module loading fast, RapidWright uses an efficient file format for storing modules in a directory called a cache. The facilities for reading and writing these module storage files are in the `BlockCreator` class of the `ipi` package. As each IP to be implemented in a design might have different physical contexts or placement pblocks, multiple implementations of the same `Module` are stored in a `ModuleImpls` object, which is simply an extended `ArrayList<Module>`. This allows all the implementations to reside in the same object and file, and each unique implementation to be referenced with an index. Each RapidWright module entry has three relevant files:

1. Input: A metadata text file generated from Vivado to communicate information about the IP, its ports, clocks, constraints and approximate delays on inputs and outputs. This file is read into RapidWright during the module file creation process.
2. Output: To store the physical implementation data of each module implementation, a ‘.dat’ file is created from `BlockCreator`.
3. Output: The logical netlist is shared among all implementations and is stored in a compressed EDIF file format with a ‘.kryo’ extension.
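The idea of keeping every implementation of one module in a single index-addressable list can be sketched as follows. This is a minimal illustration only: `ModuleImplsSketch`, `ImplSketch` and the pblock range strings are hypothetical stand-ins, not RapidWright's `ModuleImpls` or `Module` classes.

```java
import java.util.ArrayList;

/** Illustrative only: an index-addressable list of implementations of one module. */
final class ModuleImplsSketch extends ArrayList<ImplSketch> {
    private static final long serialVersionUID = 1L;

    /** Returns the index of the first implementation built for the pblock, or -1. */
    int indexOfPblock(String pblockRange) {
        for (int i = 0; i < size(); i++) {
            if (get(i).pblockRange.equals(pblockRange)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        ModuleImplsSketch impls = new ModuleImplsSketch();
        impls.add(new ImplSketch("SLICE_X0Y0:SLICE_X9Y59"));
        impls.add(new ImplSketch("SLICE_X20Y0:SLICE_X24Y119"));
        // Each implementation is referenced by its position in the list.
        System.out.println(impls.indexOfPblock("SLICE_X20Y0:SLICE_X24Y119")); // prints 1
    }
}

/** Hypothetical stand-in for one placed-and-routed implementation. */
final class ImplSketch {
    final String pblockRange; // physical context this implementation was built for

    ImplSketch(String pblockRange) {
        this.pblockRange = pblockRange;
    }
}
```

Because the list type is shared, all implementations can be serialized together and an instance only needs to remember an integer index.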

The RapidWright module cache builds on top of the [IP cache in Vivado](#). By default RapidWright puts the cache in the `$HOME/blockCache` directory. This can be changed by setting the environment variable `IP_CACHE_PATH` before running the flow.
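The documented lookup order (use `IP_CACHE_PATH` when set, otherwise fall back to `$HOME/blockCache`) can be mirrored in a few lines. This is a sketch of the behavior described above, not RapidWright's own code; the class and method names are hypothetical, and treating an empty variable as unset is an assumption.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;

final class BlockCachePathSketch {
    /**
     * Resolves the cache directory from an environment map and a home directory.
     * Assumption: an empty IP_CACHE_PATH is treated the same as an unset one.
     */
    static Path resolve(Map<String, String> env, String home) {
        String override = env.get("IP_CACHE_PATH");
        if (override != null && !override.isEmpty()) {
            return Paths.get(override);
        }
        return Paths.get(home, "blockCache");
    }

    public static void main(String[] args) {
        // Uses the real process environment and the JVM's notion of the home directory.
        System.out.println(resolve(System.getenv(), System.getProperty("user.home")));
    }
}
```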

The IP cache generated by Vivado is supplemented by RapidWright by providing placed and routed DCPs and module files in each hash-named directory for each non-trivial IP. By default, the flow only creates a single implementation for each IP. Later, we describe how a user can create an implementation guide file (‘.igf’) directing the flow to create multiple unique implementations of the same module/IP.

#### 12.1.2 Block Stitcher

The block stitcher (found in the class `BlockStitcher` of the `ipi` package) is the heart of the pre-implemented design flow. It manages the flow progress and ensures that all blocks have been cached and retrieved appropriately. It also reads in the IP Integrator netlist file (EDIF) that describes the block connectivity and stitches together the block implementations in the physical netlist. It also reads and parses the implementation guide file (if provided) and creates the block implementations accordingly.

### 12.2 High Performance Flow

One of the key attributes of RapidWright is the ability to capture optimized placement and routing solutions for a module and reuse them in multiple contexts or locations on a device. Vivado often provides good results for small implementation problems (smaller than 10k LUTs within a clock region). However, when those same modules are combined into a large system, total compile time increases and the probability of timing closure is reduced. This phenomenon limits achievable performance and timing closure predictability of larger designs.

RapidWright endows users with a new design vocabulary by caching, reusing and relocating pre-implemented blocks. We believe this to be an enabling concept and offer a three-step high performance design strategy:

1. Restructure the Design: Expose all modular pieces and replication in an IP Integrator design.
2. Packing & Placement Planning: Craft custom pblocks and placement patterns to match architecture layout and resources.
3. Stitch, Place & Route Implementation: Run the automated flow to create a final implementation.



Fig. 1: High level visual of the three step process for the high performance module-based design strategy

The first step requires the design architect to restructure the proposed design such that it can take full advantage of the benefits provided by pre-implemented modules. We define restructuring as a design refactoring that reflects three favorable design characteristics: (1) modularity, (2) module replication and (3) latency tolerance. Modularity uncovers design structure so it can be strategically mapped to architectural patterns. When modules are replicated, reuse of those high quality solutions and architectural patterns can be exploited to increase the benefits. Finally, if the modules within a design tolerate additional latency, inserting pipeline elements between them improves both timing performance and relocatability.

After the design architect has successfully restructured and modularized a design, step two is followed. Here, the design architect creates an implementation guide file that captures how best to map the modules of a design to the architecture of the target device. Specifically, pblocks are chosen for those pre-implemented modules of interest and physical locations are chosen for each instance. This step provides the design architect an opportunity to navigate FPGA fabric discontinuities. These discontinuities include boundaries such as IO columns, processor subsystems, and most significantly, SLR crossings. Such architectural obstacles cause design disruptions when targeting high performance. However, by leveraging a pre-implemented methodology provided in RapidWright, custom-created implementation solutions can be identified and planned out to manage the fabric discontinuities by custom module placement. Ultimately, this process is iterative and can inform useful RTL/design changes by focusing design structure to better match architectural resources.

Step three of the design strategy is an automated flow provided with RapidWright (depicted in Fig. 2). We leverage a design input method in Vivado called IP Integrator (IPI). IPI offers an interactive block-based approach for system design by providing an IP library, IP creation flow and IP caching. RapidWright takes advantage of IPI by using leaf IP blocks as de-facto pre-implemented blocks and also by leveraging the IP caching mechanism. The RapidWright pre-implemented flow extends the caching mechanism to go beyond synthesis, by performing OOC placement and routing on the block within a constrained area. The flow begins by invoking Vivado's typical IPI synthesis and creating pre-implemented blocks for each module if not already found in the cache. RapidWright has an IPI Design Parser (EDIF-based) that creates a black-box netlist where each instance of a module is empty, ready to receive the pre-implemented module guts. The block stitcher reads the IP cache and populates the IPI design netlist. After stitching, the blocks are placed either by loading the implementation guide file or by invoking a simulated annealing module placer to place the blocks onto the fabric automatically. Once all the blocks are placed, RapidWright creates a DCP file that is read into Vivado, which completes the final routes.

Fig. 2: High level view of the pre-implemented flow process and interactions between Vivado and RapidWright

#### 12.2.1 Implementation Guide File

An implementation guide file (extension \*.igf) allows the application architect to communicate all of the specific implementation customization aspects of the packing and placement phase. The file has the following syntax structure (note the use of ... which indicates a potential repetition of the previous construct):

```
PART <part_name>
BLOCK <ip_cache_id> <# of implementations> <# of instances in the design> <# of clocks used in this block>
IMPL <implementation index> [<# of sub implementation entries>] <Pblock range>
  [SUB_IMPL <sub implementation index> '<Tcl command returning a subset of cells in the module>' <pblock range>]
  ...
  ...
INST <instance name> <implementation index to apply> <lower left corner site to place implementation on fabric>
  ...
CLOCK <clock name> <clock period constraint (ns)> <BUFGCE site (to use for skew estimation)>
  ...
END_BLOCK
  ...
END_BLOCKS
```

A parser and exporter for the IGF format can be found in `com.xilinx.rapidwright.design.blocks.ImplGuide.readImplGuide(String fileName)` and `com.xilinx.rapidwright.design.blocks.ImplGuide.writeImplGuide(String fileName)`.
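To make the field order of the syntax concrete, the sketch below reads the `PART`, `BLOCK`, `INST` and `CLOCK` lines of a small guide file. It is not RapidWright's parser: the class `IgfSketch`, its nested types and every value in `SAMPLE` (part name, cache id, instance names, sites) are hypothetical, and `IMPL`/`SUB_IMPL` lines are deliberately skipped.

```java
import java.util.ArrayList;
import java.util.List;

final class IgfSketch {
    /** One INST line: instance name, implementation index, lower-left anchor site. */
    static final class Inst {
        final String name;
        final int implIndex;
        final String anchorSite;
        Inst(String name, int implIndex, String anchorSite) {
            this.name = name;
            this.implIndex = implIndex;
            this.anchorSite = anchorSite;
        }
    }

    /** One CLOCK line: clock name, period constraint in ns, BUFGCE site. */
    static final class Clock {
        final String name;
        final double periodNs;
        final String bufgceSite;
        Clock(String name, double periodNs, String bufgceSite) {
            this.name = name;
            this.periodNs = periodNs;
            this.bufgceSite = bufgceSite;
        }
    }

    /** Hypothetical guide file contents, used only for illustration. */
    static final String SAMPLE =
            "PART xcvu3p-ffvc1517-2-e\n"
          + "BLOCK abc123 1 2 1\n"
          + "IMPL 0 SLICE_X0Y0:SLICE_X9Y59\n"
          + "INST top/u0 0 SLICE_X0Y0\n"
          + "INST top/u1 0 SLICE_X0Y60\n"
          + "CLOCK clk 3.0 BUFGCE_X0Y0\n"
          + "END_BLOCK\n"
          + "END_BLOCKS\n";

    String part;
    int blockCount;
    final List<Inst> insts = new ArrayList<>();
    final List<Clock> clocks = new ArrayList<>();

    /** Reads whitespace-separated fields line by line; no error recovery is attempted. */
    static IgfSketch parse(String text) {
        IgfSketch guide = new IgfSketch();
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] f = line.split("\\s+");
            switch (f[0]) {
                case "PART":
                    guide.part = f[1];
                    break;
                case "BLOCK":
                    guide.blockCount++;
                    break;
                case "INST":
                    guide.insts.add(new Inst(f[1], Integer.parseInt(f[2]), f[3]));
                    break;
                case "CLOCK":
                    guide.clocks.add(new Clock(f[1], Double.parseDouble(f[2]), f[3]));
                    break;
                default:
                    break; // IMPL, SUB_IMPL, END_BLOCK and END_BLOCKS are skipped in this sketch
            }
        }
        return guide;
    }

    public static void main(String[] args) {
        IgfSketch guide = parse(SAMPLE);
        System.out.println(guide.part + ": " + guide.insts.size() + " instances, "
                + guide.clocks.size() + " clock(s)");
    }
}
```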

##### BLOCK (IP Cache Entry)

The block construct describes all of the potential implementations for a particular block/IP. For each uniquely configured IP (entry in the IP cache), there exists a block. Multiple instances of the same block/IP can exist and this construct allows the application architect to map instances by name to a specific implementation.

##### IMPL (Implementation)

Each block has one or more IMPLs. Each implementation carries a pblock and optionally some SUB\_IMPL entries, which allow sub pblocks to be applied to portions of the logic inside the block. Each IMPL is indexed so that it can be referenced and applied to specific instances of the block. The application architect takes special care in selecting implementations and their pblocks to maximize their potential performance, architectural footprint and placement packing efficiency.

##### SUB\_IMPL (Sub Implementation)

This is an optional construct that allows more fine-grained pblocks to be applied to a subset of the block/IP in an implementation. One field requires a Tcl command that returns the subset of cells that should be included in the sub implementation and its associated pblock. Multiple sub implementation entries can exist for each implementation. As an example, if a particular IP is tall and narrow and there are specific cells that need to be placed at the top and/or bottom, the SUB\_IMPL construct can be used to constrain those cells to sub pblocks at the top and bottom of the overall implementation.

##### INST (Instance)

In each design, there will be one or more instances of a block/IP. Each instance has a unique name and must be assigned to an implementation. Each instance also requires a placement, which is provided by naming the specific site onto which the lower left corner of the pblock of the respective implementation will be placed.
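The coordinate arithmetic implied by "lower left corner" placement can be sketched as a translation of a pblock range so that its lower left site lands on the site named in the INST entry. This is a simplified illustration with hypothetical names (`PblockAnchorSketch`, `moveTo`); it assumes a rectangular range written as `<lower left site>:<upper right site>` and ignores whether the destination columns are actually compatible with the source, which a real relocation has to check.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PblockAnchorSketch {
    // Matches names such as SLICE_X22Y115: group 1 is the site type, groups 2 and 3 are X and Y.
    private static final Pattern SITE = Pattern.compile("(.+)_X(\\d+)Y(\\d+)");

    private static Matcher parse(String site) {
        Matcher m = SITE.matcher(site);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a site name: " + site);
        }
        return m;
    }

    /** Translates the range so that its lower left corner sits on anchorSite. */
    static String moveTo(String range, String anchorSite) {
        String[] corners = range.split(":");
        Matcher ll = parse(corners[0]);
        Matcher ur = parse(corners[1]);
        Matcher anchor = parse(anchorSite);
        if (!ll.group(1).equals(anchor.group(1))) {
            throw new IllegalArgumentException("Anchor site type differs from the range site type");
        }
        int dx = Integer.parseInt(anchor.group(2)) - Integer.parseInt(ll.group(2));
        int dy = Integer.parseInt(anchor.group(3)) - Integer.parseInt(ll.group(3));
        int urX = Integer.parseInt(ur.group(2)) + dx;
        int urY = Integer.parseInt(ur.group(3)) + dy;
        return anchorSite + ":" + ur.group(1) + "_X" + urX + "Y" + urY;
    }

    public static void main(String[] args) {
        // prints SLICE_X20Y60:SLICE_X29Y119
        System.out.println(moveTo("SLICE_X0Y0:SLICE_X9Y59", "SLICE_X20Y60"));
    }
}
```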

##### CLOCK (Clock Input)

The clock construct describes a clock input to the block or IP and allows it to apply a clock period constraint in nanoseconds. It also requires the BUFGCE site from which the clock will be driven so that during placement and routing, the clock skew can be estimated.

##### Basic Example

The diagram below illustrates a basic BLOCK example with many of the different fields highlighted.

Fig. 3: BLOCK example with multiple implementations, instances and clocks

### 12.3 Rapid Prototyping Flow

When an implementation guide file is not provided to the `rapid_compile_ipi` command, the flow defaults to a rapid prototyping flow that targets faster compilation. As no user input is provided to guide pblock selection or block placement, RapidWright provides facilities that accomplish these tasks automatically, albeit with lower average performance than an application architect would achieve.

#### 12.3.1 Automatic PBlock Generator

The automatic pblock generator is found in the `design.blocks` package in the class called `PBlockGenerator`. It takes two files as input to calculate an appropriate pblock for a given circuit. First, it uses a utilization report file (produced by Vivado's `report_utilization` command) to identify the types of resources needed and their quantity. Second, it reads a shapes report file that describes all of the shapes in the design to ensure that the pblock size can easily accommodate all shapes. Shapes are an internal Vivado construct to help small groups of cells be placed together (such as carry chains). In the pre-implemented flow, the `PBlockGenerator` is always invoked for each IP that is created. The specific Tcl commands are found in the `tclScripts/rapidwright.tcl` file in the `compile_block_dcp` proc.
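A rough sketch of how resource counts and the tallest shape could bound a pblock size is shown below. It is not the `PBlockGenerator` algorithm. It assumes an UltraScale-style slice with 8 LUTs and 16 flip-flops, a caller-chosen overhead factor, and a roughly square rectangle; all names are hypothetical.

```java
final class PblockSizeSketch {
    static final int LUTS_PER_SLICE = 8;  // assumption: UltraScale/UltraScale+ style slice
    static final int FFS_PER_SLICE = 16;  // assumption: UltraScale/UltraScale+ style slice

    /** Slices needed for the given LUT and flip-flop counts, padded by an overhead factor. */
    static int slicesNeeded(int luts, int ffs, double overhead) {
        int byLut = (luts + LUTS_PER_SLICE - 1) / LUTS_PER_SLICE;
        int byFf = (ffs + FFS_PER_SLICE - 1) / FFS_PER_SLICE;
        return (int) Math.ceil(Math.max(byLut, byFf) * overhead);
    }

    /**
     * Returns {columns, rows} of a near-square rectangle holding the slices,
     * made at least as tall as the tallest shape (for example a carry chain).
     */
    static int[] dimensions(int slices, int tallestShapeRows) {
        int rows = Math.max(tallestShapeRows, (int) Math.ceil(Math.sqrt(slices)));
        int cols = (slices + rows - 1) / rows;
        return new int[] { cols, rows };
    }

    public static void main(String[] args) {
        int slices = slicesNeeded(100, 120, 1.5);
        int[] dims = dimensions(slices, 8);
        System.out.println(slices + " slices in " + dims[0] + " x " + dims[1]); // prints 20 slices in 3 x 8
    }
}
```

The overhead factor trades area for routability: a value close to 1.0 packs tightly, while larger values leave slack so that place and route is less likely to fail.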

One of the techniques used by the `PBlockGenerator` is to identify the most common tile column patterns (see `TileColumnPattern` class in the `device.helper` package) found in a particular device and place the pblock onto the most common match for a given resource footprint to maximize the place-ability of the block.
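The idea of favoring the most frequently occurring column pattern can be illustrated by sliding a window across a left-to-right list of column types and counting each distinct window that contains the required resources. This is a sketch only: it does not use the `TileColumnPattern` class, and the column labels and layout are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ColumnPatternSketch {
    /**
     * Returns the most frequent window of the given width that contains every
     * required column type, or null if no window qualifies. Ties go to the
     * pattern seen first when scanning left to right.
     */
    static String mostCommon(String[] columns, int width, List<String> required) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + width <= columns.length; i++) {
            String[] window = Arrays.copyOfRange(columns, i, i + width);
            if (!Arrays.asList(window).containsAll(required)) {
                continue;
            }
            counts.merge(String.join(",", window), 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical column layout of a small device region.
        String[] columns = { "CLEL", "CLEM", "BRAM", "CLEL", "CLEM", "DSP", "CLEL", "CLEM", "BRAM" };
        System.out.println(mostCommon(columns, 3, Arrays.asList("BRAM"))); // prints CLEL,CLEM,BRAM
    }
}
```

A block built for a pattern that occurs many times across the device has more legal relocation targets than one built for a rare pattern.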

Expectations for performance should be modest: the pblock generator prioritizes producing a pblock that won't cause place and route to fail, and it lacks knowledge of the particular design context where the block may be destined. For this reason, it is highly recommended that any performance critical block or design use the implementation guide file to better optimize the pblock for a particular application.

Additional research and development work has produced an improved horizontal block density algorithm, described in Improved Horizontal Block Density.

#### 12.3.2 Block Placer

The Block Placer (found in the class `BlockPlacer2` of the package `placer.blockplacer`) uses a simple simulated annealing schedule to place the blocks onto the fabric. The cost function is a function of total wire length between blocks. Like the pblock generator, the block placer attempts to produce valid results, with less emphasis on performance.
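A textbook simulated annealing loop with a total wire length cost looks roughly like the sketch below. It is not `BlockPlacer2`: the two-pin net model, Manhattan distance, swap move, cooling schedule and all names are assumptions chosen to keep the example short.

```java
import java.util.Random;

final class AnnealPlacerSketch {
    /** Total Manhattan wire length; pos[b] = {x, y} of block b, nets[i] = {blockA, blockB}. */
    static int wireLength(int[][] pos, int[][] nets) {
        int total = 0;
        for (int[] n : nets) {
            total += Math.abs(pos[n[0]][0] - pos[n[1]][0]) + Math.abs(pos[n[0]][1] - pos[n[1]][1]);
        }
        return total;
    }

    /** Swaps block locations under a geometric cooling schedule and returns the best placement seen. */
    static int[][] place(int[][] initial, int[][] nets, long seed) {
        Random rng = new Random(seed);
        int[][] cur = copy(initial);
        int[][] best = copy(initial);
        int curCost = wireLength(cur, nets);
        int bestCost = curCost;
        for (double t = 10.0; t > 0.01; t *= 0.95) {
            for (int step = 0; step < 50; step++) {
                int a = rng.nextInt(cur.length);
                int b = rng.nextInt(cur.length);
                swap(cur, a, b);
                int newCost = wireLength(cur, nets);
                int delta = newCost - curCost;
                // Always accept improvements; accept worse moves with probability exp(-delta / t).
                if (delta <= 0 || rng.nextDouble() < Math.exp(-delta / t)) {
                    curCost = newCost;
                    if (curCost < bestCost) {
                        bestCost = curCost;
                        best = copy(cur);
                    }
                } else {
                    swap(cur, a, b); // rejected: undo the move
                }
            }
        }
        return best;
    }

    private static void swap(int[][] p, int a, int b) {
        int[] tmp = p[a];
        p[a] = p[b];
        p[b] = tmp;
    }

    private static int[][] copy(int[][] p) {
        int[][] c = new int[p.length][];
        for (int i = 0; i < p.length; i++) {
            c[i] = p[i].clone();
        }
        return c;
    }

    public static void main(String[] args) {
        int[][] initial = { { 0, 0 }, { 3, 0 }, { 0, 3 }, { 3, 3 } };
        int[][] nets = { { 0, 3 }, { 1, 2 } };
        int[][] placed = place(initial, nets, 1L);
        System.out.println(wireLength(initial, nets) + " -> " + wireLength(placed, nets));
    }
}
```

Accepting some cost-increasing moves at high temperature is what lets the search escape local minima; as the temperature falls, the loop behaves more and more like greedy improvement.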

#### 12.3.3 Router

The router is a very simple maze router with very limited routing congestion avoidance. Its clock router is still a work in progress and is currently disabled. It is currently tuned to work with UltraScale and UltraScale+ architectures. The `Router` class is found in the `router` package.
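For readers unfamiliar with the term, a maze router expands a wavefront from the source until it reaches the sink. The sketch below shows the classic breadth-first (Lee-style) form on a small grid where blocked cells stand for resources that are already in use. It is a generic illustration with hypothetical names, not the `Router` class, which works on the device's routing resources rather than a grid.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

final class MazeRouteSketch {
    /** Returns the number of steps on a shortest path from (sr, sc) to (tr, tc), or -1 if unroutable. */
    static int route(boolean[][] blocked, int sr, int sc, int tr, int tc) {
        int rows = blocked.length;
        int cols = blocked[0].length;
        int[][] dist = new int[rows][cols];
        for (int[] row : dist) {
            Arrays.fill(row, -1); // -1 marks cells the wavefront has not reached yet
        }
        Deque<int[]> frontier = new ArrayDeque<>();
        dist[sr][sc] = 0;
        frontier.add(new int[] { sr, sc });
        int[][] steps = { { 1, 0 }, { -1, 0 }, { 0, 1 }, { 0, -1 } };
        while (!frontier.isEmpty()) {
            int[] cur = frontier.poll();
            if (cur[0] == tr && cur[1] == tc) {
                return dist[tr][tc];
            }
            for (int[] s : steps) {
                int r = cur[0] + s[0];
                int c = cur[1] + s[1];
                if (r < 0 || c < 0 || r >= rows || c >= cols) {
                    continue;
                }
                if (blocked[r][c] || dist[r][c] != -1) {
                    continue;
                }
                dist[r][c] = dist[cur[0]][cur[1]] + 1;
                frontier.add(new int[] { r, c });
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        boolean[][] blocked = new boolean[3][3];
        blocked[1][0] = true;
        blocked[1][1] = true;
        // The path has to detour around the two blocked cells.
        System.out.println(route(blocked, 0, 0, 2, 0)); // prints 6
    }
}
```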



## RAPIDWRIGHT TUTORIALS

### 13.1 RWRoute Timing-driven Routing

Routes an example design (e.g. “gnl\_2\_4\_7\_3.0\_gnl\_3500\_03\_7\_80\_80.dcp”).

This example shows the default way to use RWRoute in timing-driven mode and to validate the routing results with Vivado.

#### 13.1.1 Steps to Run

1. Download the example gnl\_2\_4\_7\_3.0\_gnl\_3500\_03\_7\_80\_80.dcp design:

```
wget http://www.rapidwright.io/docs/_downloads/gnl_2_4_7_3.0_gnl_3500_03_7_80_80.dcp
```

2. Invoke RWRoute via gradle (this will ensure code is compiled before running):

```
rapidwright RWRoute gnl_2_4_7_3.0_gnl_3500_03_7_80_80.dcp gnl_2_4_7_3.0_gnl_3500_03_7_80_80_routed.dcp
```

The main entry point of RWRoute is RWRoute.java and is reproduced here for convenience:

```
/**
 * The main interface of RWRoute that reads in a design checkpoint,
 * and parses the arguments for the RWRouteConfig Object of the router.
 * It instantiates a RWRoute Object or a PartialRouter Object
 * based on the partialRouting parameter and calls the route method to route the
 * design.
 * @param args An array of strings that are used to create a RWRouteConfig Object for
 * the router.
 */
public static void main(String[] args) {
    if (args.length < 2) {
        System.out.println("USAGE: <input.dcp> <output.dcp>");
        return;
    }
    // Reads the output directory and set the output design checkpoint file name
    String routedDCPfileName = args[1];

    CodePerfTracker t = new CodePerfTracker("RWRoute", true);

    // Reads in a design checkpoint and routes it
    Design routed = RWRoute.routeDesignWithUserDefinedArguments(Design.readCheckpoint(args[0]), args);

    // Writes out the routed design checkpoint
    routed.writeCheckpoint(routedDCPfileName, t);
    System.out.println("\nINFO: Write routed design\n " + routedDCPfileName + "\n");
}
```

Please refer to the Javadoc and source code for more implementation details. The Java source code for RWRoute is located in: [src/com/xilinx/rapidwright/rwroute/](#).

#### 13.1.2 Example Output

Example output using the `gnl_2_4_7_3.0_gnl_3500_03_7_80_80.dcp` design is included below:

```
=====
==                               RWRoute                               ==
=====
=====
==           Reading DCP: gnl_2_4_7_3.0_gnl_3500_03_7_80_80.dcp      ==
=====
XML Parse & Device Load:      2.593s
    EDIF Parse:        1.003s
    Read XDEF Header:   0.026s
    Read XDEF Caches:   0.045s
    Read XDEF Placement: 1.900s
    Read XDEF Routing:  0.071s
-----
[No GC] *Total*:      5.638s
=====
==                               Route Design                           ==
=====
INFO: Route 2123 pins of GLOBAL_LOGIC1
INFO: Estimated pre-routing max delay: 1969
-----
          Generated       RRG       Routed      Nodes With      CPD      Total Run
Iteration     RRG Nodes   Time (s)   Connections   Overlaps   (ps)   Time (s)
-----  -----  -----  -----  -----  -----  -----  -----
 1          238180     1.56      14952      3804     2469     2.32
 2          14115      0.11      6923      2366     2308     0.71
 3          14333      0.10      5103      1354     2308     0.83
 4          14207      0.10      2933      542      2308     0.71
 5          12593      0.09      1169      119      2323     0.66
 6          11313      0.05      274       6      2331     0.29
 7             587      0.00      56        0      2331     0.10
-----
INFO: Route 0 direct connections
INFO: No PIP overlaps
=====
==                               Statistics                            ==
=====
Total wirelength:          12860
Route design:              8.21s
  Initialization:          1.95s
  Routing:                 6.26s
-----
Timing Report
-----
Timing requirement (ps):   3000
Critical path delay (ps):  2331
Slack (ps):                669
With timing closure guarantee:
Critical path delay (ps):  2500
Slack (ps):                500

Detail delays:
-----
Logic (ps)  Net (ps)  (intrasite (ps))  Total (ps)  Netlist Resource(s)
-----
0           0         0                 0           superSource
78          451       0                 529         FD_n/Q
                                                    net: opr[57]
90          0         0                 90          LUT5_1b7/I0
0           53        0                 53          LUT5_1b7/O
                                                    net: nfd6
125         0         0                 125         LUT3_1b6/I1
0           193       0                 193         LUT3_1b6/O
                                                    net: nfd7
35          0         0                 35          LUT5_1bd/I3
0           137       0                 137         LUT5_1bd/O
                                                    net: n100c
115         0         0                 115         LUT6_2_a4/LUT5/I2
0           242       60                242         LUT6_2_a4/LUT5/O
                                                    net: LUT6_2_a4/O5
115         0         0                 115         LUT6_2_a5/LUT5/I2
0           253       60                253         LUT6_2_a5/LUT5/O
                                                    net: LUT6_2_a5/O5
100         0         0                 100         LUT5_1bc/I3
0           194       60                194         LUT5_1bc/O
                                                    net: n1012
100         0         0                 100         LUT2_1b5/I1
0           50        50                50          LUT2_1b5/O
                                                    net: nfea
0           0         0                 0           FD_jmm/D
-----
Arrival time:              2331
-----
-----
Write EDIF:                0.145s
Writing XDEF Header:       0.195s
Writing XDEF Placement:    0.367s
Writing XDEF Routing:      0.453s
Writing XDEF Finalizing:   0.030s
Writing XDC:               0.006s
-----
[No GC] *Total*:           1.196s
-----
INFO: Write routed design
 gnl_2_4_7_3.0_gnl_3500_03_7_80_80_routed.dcp
```

The output contains four main sections: reading the design checkpoint, RWRoute processing information, routing statistics, and the timing report. The log shows that RWRoute successfully routes the design. The originally calculated critical path delay is 2331 ps, which is adjusted to a more pessimistic 2500 ps under the timing closure guarantee.
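The arithmetic behind the timing report can be checked by hand: the arrival time is the sum of the per-stage totals in the detail delay listing, and slack is the timing requirement minus the critical path delay. The sketch below uses the values from the log above; the class and method names are hypothetical.

```java
final class SlackSketch {
    /** Sums the per-stage delay totals (ps) along one path. */
    static int arrivalPs(int[] stageTotalsPs) {
        int sum = 0;
        for (int d : stageTotalsPs) {
            sum += d;
        }
        return sum;
    }

    /** Slack is the timing requirement minus the path delay; positive slack meets timing. */
    static int slackPs(int requirementPs, int delayPs) {
        return requirementPs - delayPs;
    }

    public static void main(String[] args) {
        // "Total (ps)" column of the detail delays in the example log.
        int[] totals = { 0, 529, 90, 53, 125, 193, 35, 137, 115, 242, 115, 253, 100, 194, 100, 50, 0 };
        int arrival = arrivalPs(totals);
        System.out.println(arrival);                // prints 2331
        System.out.println(slackPs(3000, arrival)); // prints 669
        System.out.println(slackPs(3000, 2500));    // prints 500 (with timing closure guarantee)
    }
}
```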

#### 13.1.3 Validation with Vivado

To validate the routed design by Vivado, run the following at the prompt:

```
vivado -mode tcl gnl_2_4_7_3.0_gnl_3500_03_7_80_80_routed.dcp
```

and then run the following command at the Vivado Tcl prompt:

```
report_route_status
```

The resulting output should show that the design is successfully routed: all the routable nets are fully routed and there are no nets with routing errors.

```
Design Route Status
:      # nets :
-----
# of logical nets..... : 4937 :
  # of nets not needing routing..... : 1082 :
    # of internally routed nets..... : 932 :
    # of implicitly routed ports..... : 150 :
  # of routable nets..... : 3855 :
    # of fully routed nets..... : 3855 :
  # of nets with routing errors..... : 0 :
```
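If the report text is captured to a string, the same check can be automated. The sketch below pulls the counts out of rows shaped like the ones above with a regular expression; the class name, the regex and the pass criterion are assumptions for illustration and are not part of Vivado or RapidWright.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RouteStatusSketch {
    // Matches rows such as "# of fully routed nets..... : 3855 :".
    private static final Pattern ROW = Pattern.compile("#\\s+of\\s+(.*?)\\.*\\s*:\\s*(\\d+)\\s*:");

    /** Rows copied from the example report in this section. */
    static final String SAMPLE =
            "# of logical nets..... : 4937 :\n"
          + "  # of nets not needing routing..... : 1082 :\n"
          + "  # of routable nets..... : 3855 :\n"
          + "    # of fully routed nets..... : 3855 :\n"
          + "  # of nets with routing errors..... : 0 :\n";

    /** Returns the count reported for the given row label, or -1 if the row is absent. */
    static int count(String report, String label) {
        Matcher m = ROW.matcher(report);
        while (m.find()) {
            if (m.group(1).trim().equals(label)) {
                return Integer.parseInt(m.group(2));
            }
        }
        return -1;
    }

    /** Assumed pass criterion: every routable net is fully routed and no net has errors. */
    static boolean fullyRouted(String report) {
        return count(report, "routable nets") == count(report, "fully routed nets")
                && count(report, "nets with routing errors") == 0;
    }

    public static void main(String[] args) {
        System.out.println(fullyRouted(SAMPLE)); // prints true
    }
}
```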

In Vivado 2020.1, the timing report shows that the design routed by RWRoute has a data path delay of 2.331 ns (2331 ps) for the same critical path. The full Vivado timing report is shown below:

```
-----
| Tool Version      : Vivado v.2020.1 (lin64) Build 2902540 Wed May 27 19:54:35 MDT 2020
| Date              : Mon Nov 8 22:20:55 2021
| Host              : yun-Latitude-3470 running 64-bit Ubuntu 16.04.7 LTS
| Command           : report_timing
| Design            : gnl_3500_03_7_80_80
| Device            : xcvu3p-ffvc1517
| Speed File        : -2 PRODUCTION 1.27 02-28-2020
| Temperature Grade : E
-----
Timing Report

Slack (MET) :          0.649ns (required time - arrival time)
Source:           FD_n/C
                  (rising edge-triggered cell FDRE clocked by clk {rise@0.000ns fall@1.500ns period=3.000ns})
Destination:      FD_jmm/D
                  (rising edge-triggered cell FDRE clocked by clk {rise@0.000ns fall@1.500ns period=3.000ns})
Path Group:       clk
```

| Path Type:                     | Setup (Max at Slow Process Corner)                        |                               |                    |                 |         |
|--------------------------------|-----------------------------------------------------------|-------------------------------|--------------------|-----------------|---------|
| Requirement:                   | 3.000ns (clk rise@3.000ns - clk rise@0.000ns)             |                               |                    |                 |         |
| Data Path Delay:               | 2.331ns (logic 0.753ns (32.304%) route 1.578ns (67.696%)) |                               |                    |                 |         |
| Logic Levels:                  | 7 (LUT2=1 LUT3=1 LUT5=5)                                  |                               |                    |                 |         |
| Clock Path Skew:               | -0.010ns (DCD - SCD + CPR)                                |                               |                    |                 |         |
| Destination Clock Delay (DCD): | 0.020ns = ( 3.020 - 3.000 )                               |                               |                    |                 |         |
| Source Clock Delay (SCD):      | 0.030ns                                                   |                               |                    |                 |         |
| Clock Pessimism Removal (CPR): | 0.000ns                                                   |                               |                    |                 |         |
| Clock Uncertainty:             | 0.035ns ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE               |                               |                    |                 |         |
| Total System Jitter (TSJ):     | 0.071ns                                                   |                               |                    |                 |         |
| Total Input Jitter (TIJ):      | 0.000ns                                                   |                               |                    |                 |         |
| Discrete Jitter (DJ):          | 0.000ns                                                   |                               |                    |                 |         |
| Phase Error (PE):              | 0.000ns                                                   |                               |                    |                 |         |
| Location                       | Delay type                                                | Incr(ns)                      | Path(ns)           | Netlist         |         |
| Resource(s)                    |                                                           |                               |                    |                 |         |
| SLICE_X22Y115                  | (clock clk rise edge)                                     | 0.000<br>0.000                | 0.000 r<br>0.000 r | clk (IN)<br>clk |         |
|                                | net (fo=1161, unset)                                      | 0.030                         | 0.030              | r FD_n/C        |         |
| SLICE_X22Y115                  | FDRE (Prop_DFF_SLICEM_C_Q)                                | 0.078                         | 0.108 r            | FD_n/Q          |         |
| SLICE_X14Y100                  | net (fo=21, routed)                                       | 0.523                         | 0.631              | opr[57]         |         |
|                                | LUT5 (Prop_G6LUT_SLICEL_I0_O)                             | 0.089                         | 0.720 r            | LUT5_1b7/       |         |
| O                              | net (fo=2, routed)                                        | 0.050                         | 0.770              | nfd6            |         |
| SLICE_X14Y100                  | LUT3 (Prop_B6LUT_SLICEL_I1_O)                             | 0.124                         | 0.894 r            | LUT3_1b6/       |         |
| O                              | net (fo=19, routed)                                       | 0.226                         | 1.120              | nfd7            |         |
| SLICE_X13Y95                   | LUT5 (Prop_F6LUT_SLICEM_I3_O)                             | 0.037                         | 1.157 r            | LUT5_1bd/       |         |
| O                              | net (fo=4, routed)                                        | 0.139                         | 1.296              | LUT6_2_         |         |
| a4/I2                          | SLICE_X14Y95                                              | LUT5 (Prop_C5LUT_SLICEL_I2_O) | 0.110              | 1.406 r         | LUT6_2_ |
| a4/LUT5/O                      |                                                           |                               |                    |                 |         |
|                                | net (fo=8, routed)                                        | 0.219                         | 1.625              | LUT6_2_         |         |
| a5/I2                          | SLICE_X14Y95                                              | LUT5 (Prop_B5LUT_SLICEL_I2_O) | 0.116              | 1.741 r         | LUT6_2_ |
| a5/LUT5/O                      |                                                           |                               |                    |                 |         |
|                                | net (fo=2, routed)                                        | 0.213                         | 1.954              | n100f           |         |
| SLICE_X14Y95                   | LUT5 (Prop_A5LUT_SLICEL_I3_O)                             | 0.100                         | 2.054 r            | LUT5_1bc/       |         |
| O                              | net (fo=2, routed)                                        | 0.157                         | 2.211              | n1012           |         |
| SLICE_X14Y95                   | LUT2 (Prop_H6LUT_SLICEL_I1_O)                             | 0.099                         | 2.310 r            | LUT2_1b5/       |         |
| O                              | net (fo=1, routed)                                        | 0.051                         | 2.361              | nfea            |         |

(continues on next page)

(continued from previous page)

| SLICE_X14Y95 | FDRE                        | r      | FD_jmm/D         |
|--------------|-----------------------------|--------|------------------|
| <hr/>        |                             |        |                  |
|              | (clock clk rise edge)       | 3.000  | 3.000 r          |
|              |                             | 0.000  | 3.000 r clk (IN) |
|              | net (fo=1161, unset)        | 0.020  | 3.020 clk        |
| SLICE_X14Y95 | FDRE                        |        | r FD_jmm/C       |
|              | clock pessimism             | 0.000  | 3.020            |
|              | clock uncertainty           | -0.035 | 2.985            |
| SLICE_X14Y95 | FDRE (Setup_HFF_SLICEL_C_D) | 0.025  | 3.010 FD_jmm     |
| <hr/>        |                             |        |                  |
|              | required time               |        | 3.010            |
|              | arrival time                |        | -2.361           |
| <hr/>        |                             |        |                  |
|              | slack                       |        | 0.649            |

Note that the critical path reported by Vivado can differ from the one reported by RWRoute for the same routed design. This is expected, since the two tools use different timing models. The main point is that RWRoute estimates a critical path delay similar to the one produced by Vivado timing analysis.
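The arithmetic behind the Vivado report above can be checked by hand. The sketch below is illustrative only: it copies the increments from the report, converted to picoseconds, and recomputes the logic/route split, the arrival time, the required time, and the slack. The class and method names (`SetupSlackCheck`, `sum`, `requiredTimePs`) are made up for this example and are not part of RapidWright or Vivado.

```java
import java.util.Locale;

// Illustrative helper (not RapidWright API): re-derives the numbers in the
// Vivado report above. All delays are in picoseconds to keep the math exact.
final class SetupSlackCheck {

    /** Adds up a list of delay increments given in picoseconds. */
    static int sum(int... incrementsPs) {
        int total = 0;
        for (int d : incrementsPs) {
            total += d;
        }
        return total;
    }

    /**
     * Required time as built up in the report: capture edge, plus destination
     * clock delay, plus clock pessimism, minus clock uncertainty, plus the
     * value listed on the FDRE setup row.
     */
    static int requiredTimePs(int captureEdgePs, int destClockDelayPs,
                              int pessimismPs, int uncertaintyPs, int setupRowPs) {
        return captureEdgePs + destClockDelayPs + pessimismPs - uncertaintyPs + setupRowPs;
    }

    public static void main(String[] args) {
        // "Incr(ns)" values of the cell rows (the FDRE, then the seven LUTs)
        int logicPs = sum(78, 89, 124, 37, 110, 116, 100, 99);
        // "Incr(ns)" values of the routed net rows
        int routePs = sum(523, 50, 226, 139, 219, 213, 157, 51);
        int dataPathPs = logicPs + routePs;
        int arrivalPs = 30 + dataPathPs; // source clock delay + data path delay
        int requiredPs = requiredTimePs(3000, 20, 0, 35, 25);
        int slackPs = requiredPs - arrivalPs;

        System.out.printf(Locale.ROOT, "logic %d ps (%.3f%%), route %d ps (%.3f%%)%n",
                logicPs, 100.0 * logicPs / dataPathPs,
                routePs, 100.0 * routePs / dataPathPs);
        System.out.printf(Locale.ROOT, "arrival %d ps, required %d ps, slack %d ps%n",
                arrivalPs, requiredPs, slackPs);
    }
}
```

With these inputs the data path delay is 2331 ps, the arrival time 2361 ps, the required time 3010 ps, and the slack 649 ps, which matches the report.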

## 13.2 RWRoute Wirelength-driven Routing

Routes an example design (e.g. “gnl\_2\_4\_7\_3.0\_gnl\_3500\_03\_7\_80\_80.dcp”).

This example shows how to use RWRoute in the faster, wirelength-driven mode and validate routing results with Vivado.

### 13.2.1 Steps to Run

1. Download the example gnl\_2\_4\_7\_3.0\_gnl\_3500\_03\_7\_80\_80.dcp design:

```
wget http://www.rapidwright.io/docs/_downloads/gnl_2_4_7_3.0_gnl_3500_03_7_80_80.dcp
```

2. Invoke RWRoute via gradle (this will ensure code is compiled before running):

```
rapidwright RWRoute gnl_2_4_7_3.0_gnl_3500_03_7_80_80.dcp gnl_2_4_7_3.0_gnl_3500_03_7_80_80_routed.dcp --nonTimingDriven
```

Please refer to the documentation Javadoc and code for more implementation details. The Java source code for RWRoute is located in: [RapidWright/src/com/xilinx/rapidwright/rwroute/](#).

### 13.2.2 Example Output

Example output using the gnl\_2\_4\_7\_3.0\_gnl\_3500\_03\_7\_80\_80.dcp design is included below:

```
=====
==                               RWRoute                                ==
=====
==        Reading DCP: gnl_2_4_7_3.0_gnl_3500_03_7_80_80.dcp            ==
=====
 XML Parse & Device Load:     2.293s
              EDIF Parse:     2.078s
        Read XDEF Header:     0.054s
        Read XDEF Caches:     0.060s
     Read XDEF Placement:     0.511s
       Read XDEF Routing:     0.100s
=====
         [No GC] *Total*:     5.097s
=====
==                             Route Design                             ==
=====
INFO: Route 2123 pins of GLOBAL_LOGIC1
=====
Iteration  Generated  RRG       Routed       Nodes With  Total Run
           RRG Nodes  Time (s)  Connections  Overlaps    Time (s)
---------  ---------  --------  -----------  ----------  ---------
1          226748     0.94      14952        3847        1.83
2          11743      0.08      7082         2483        0.58
3          15013      0.10      5378         1343        0.70
4          18160      0.13      3235         566         0.73
5          17743      0.13      1411         106         0.53
6          5002       0.04      328          7           0.40
7          1654       0.01      30           0           0.06
=====
INFO: Route 0 direct connections
INFO: No PIP overlaps
=====
==                              Statistics                              ==
=====
Total wirelength:          12309
Route design:              5.57s
└ Initialization:          0.25s
└ Routing:                 5.32s
=====
             Write EDIF:     0.128s
    Writing XDEF Header:     0.169s
 Writing XDEF Placement:     0.464s
   Writing XDEF Routing:     0.614s
Writing XDEF Finalizing:     0.051s
            Writing XDC:     0.007s
=====
        [No GC] *Total*:     1.433s
=====
INFO: Write routed design
gnl_2_4_7_3.0_gnl_3500_03_7_80_80_routed.dcp
=====
```

The output contains three main sections regarding reading the design checkpoint, RWRoute processing info, and routing statistics.
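When scanning the routing section of a log like the one above, the column to watch is the number of nodes with overlaps, which drops to 0 in the final iteration of this run. The following is a minimal sketch for pulling the per-iteration rows out of a captured log, assuming the six-column row layout shown above. `RouteIterationLog` and its members are hypothetical helper names, not RapidWright API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not RapidWright API) that reads the per-iteration
// rows of an RWRoute log and reports whether the last one is overlap-free.
final class RouteIterationLog {

    /** One data row of the iteration table. */
    static final class Row {
        final int iteration;
        final long generatedRrgNodes;
        final double rrgTimeSec;
        final int routedConnections;
        final int nodesWithOverlaps;
        final double totalRunTimeSec;

        Row(String[] f) {
            iteration = Integer.parseInt(f[0]);
            generatedRrgNodes = Long.parseLong(f[1]);
            rrgTimeSec = Double.parseDouble(f[2]);
            routedConnections = Integer.parseInt(f[3]);
            nodesWithOverlaps = Integer.parseInt(f[4]);
            totalRunTimeSec = Double.parseDouble(f[5]);
        }
    }

    /** Rows copied from the example output above, plus two non-data lines. */
    static final String SAMPLE = String.join("\n",
            "INFO: Route 2123 pins of GLOBAL_LOGIC1",
            "Iteration  Generated  RRG       Routed       Nodes With  Total Run",
            "1          226748     0.94      14952        3847        1.83",
            "2          11743      0.08      7082         2483        0.58",
            "3          15013      0.10      5378         1343        0.70",
            "4          18160      0.13      3235         566         0.73",
            "5          17743      0.13      1411         106         0.53",
            "6          5002       0.04      328          7           0.40",
            "7          1654       0.01      30           0           0.06");

    /** Keeps only lines that consist of exactly six numeric fields. */
    static List<Row> parse(String log) {
        List<Row> rows = new ArrayList<>();
        for (String line : log.split("\\R")) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length != 6) {
                continue;
            }
            try {
                rows.add(new Row(fields));
            } catch (NumberFormatException e) {
                // Six tokens but not all numeric (e.g. an INFO line): skip it.
            }
        }
        return rows;
    }

    /** True if the last parsed iteration reports no nodes with overlaps. */
    static boolean converged(List<Row> rows) {
        return !rows.isEmpty() && rows.get(rows.size() - 1).nodesWithOverlaps == 0;
    }

    public static void main(String[] args) {
        List<Row> rows = parse(SAMPLE);
        System.out.println("iterations parsed: " + rows.size());
        System.out.println("overlap-free after last iteration: " + converged(rows));
    }
}
```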

### 13.2.3 Validation with Vivado

To validate the routed design with Vivado, run the following at the prompt:

```
vivado -mode tcl gnl_2_4_7_3.0_gnl_3500_03_7_80_80_routed.dcp
```

and then run the following command at the Tcl prompt:

```
report_route_status
```

The design is successfully routed: all the routable nets are fully routed and there are no nets with routing errors.

```
Design Route Status
                                       :  # nets :
-------------------------------------- : ------- :
# of logical nets..................... :    4937 :
# of nets not needing routing......... :    1082 :
# of internally routed nets........... :     932 :
# of implicitly routed ports.......... :     150 :
# of routable nets.................... :    3855 :
# of fully routed nets................ :    3855 :
# of nets with routing errors......... :       0 :
-------------------------------------- : ------- :
```
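The same pass/fail check can be scripted when the report is captured as text. The following is a minimal sketch, assuming the line format shown above (`# of <label>..... : <count> :`). `RouteStatusCheck` is a hypothetical helper and is not part of Vivado or RapidWright.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not Vivado or RapidWright API) that extracts the
// counts from a report_route_status listing and applies the success rule
// used in the text: every routable net fully routed, no routing errors.
final class RouteStatusCheck {

    // Matches lines such as "# of routable nets..... : 3855 :"
    private static final Pattern LINE =
            Pattern.compile("#\\s*of\\s+(.+?)\\.*\\s*:\\s*(\\d+)");

    /** Report text copied from the example above. */
    static final String SAMPLE = String.join("\n",
            "Design Route Status",
            ":      # nets :",
            "# of logical nets..... : 4937 :",
            "# of nets not needing routing..... : 1082 :",
            "# of internally routed nets..... : 932 :",
            "# of implicitly routed ports..... : 150 :",
            "# of routable nets..... : 3855 :",
            "# of fully routed nets..... : 3855 :",
            "# of nets with routing errors..... : 0 :");

    /** Maps each label (without the dot leader) to its net count. */
    static Map<String, Integer> parse(String report) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = LINE.matcher(report);
        while (m.find()) {
            counts.put(m.group(1).trim(), Integer.parseInt(m.group(2)));
        }
        return counts;
    }

    /** True if all routable nets are fully routed and no errors are listed. */
    static boolean isFullyRouted(Map<String, Integer> counts) {
        Integer routable = counts.get("routable nets");
        Integer routed = counts.get("fully routed nets");
        return routable != null
                && routable.equals(routed)
                && counts.getOrDefault("nets with routing errors", 0) == 0;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = parse(SAMPLE);
        counts.forEach((label, n) -> System.out.println(label + " = " + n));
        System.out.println("fully routed: " + isFullyRouted(counts));
    }
}
```

A report that lists unrouted nets, such as the one for a partially routed checkpoint, has fewer fully routed nets than routable nets and therefore fails this check.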

## 13.3 RWRoute Partial Routing

Routes an example design (e.g. “picoblaze\_partial.dcp”).

This example shows how to use RWRoute in partial mode for wirelength-driven routing and how to validate the routing results with Vivado.

### 13.3.1 Steps to Run

1. Download the example picoblaze\_partial.dcp design:

```
wget http://www.rapidwright.io/docs/_downloads/picoblaze_partial.dcp
```

2. Invoke RWRoute via gradle (this will ensure code is compiled before running):

```
rapidwright PartialRouter picoblaze_partial.dcp picoblaze_partial_routed.dcp --nonTimingDriven
```

Please refer to the documentation Javadoc and code for more implementation details. The Java source code for RWRoute is located in: src/com/xilinx/rapidwright/rwroute/.

### 13.3.2 Example Output

Example output using the picoblaze\_partial.dcp design is included below:

```
=====
==                               RWRoute                                ==
=====
==                  Reading DCP: picoblaze_partial.dcp                  ==
=====
 XML Parse & Device Load:     2.365s
              EDIF Parse:     1.108s
        Read XDEF Header:     0.027s
        Read XDEF Caches:     0.148s
     Read XDEF Placement:     5.297s
       Read XDEF Routing:     3.486s
-----
         [No GC] *Total*:    12.430s
=====
==                             Route Design                             ==
=====
Iteration  Generated  RRG       Routed       Nodes With  Total Run
           RRG Nodes  Time (s)  Connections  Overlaps    Time (s)
---------  ---------  --------  -----------  ----------  ---------
1          2705791    13.74     12144        10146       16.63
2          689195     1.94      6496         5093        5.14
3          482106     1.41      4037         1604        3.84
4          292903     0.94      1609         298         2.49
5          176537     0.54      336          45          1.62
6          178330     0.61      59           10          1.20
7          261196     1.40      12           2           2.83
8          250050     1.75      3            0           2.65
=====
INFO: Route 0 direct connections
INFO: No PIP overlaps
=====
==                              Statistics                              ==
=====
Total wirelength:          101840
Route design:              41.73s
└ Initialization:          2.10s
└ Routing:                 39.62s
=====
             Write EDIF:     0.209s
    Writing XDEF Header:     1.744s
 Writing XDEF Placement:     5.939s
   Writing XDEF Routing:     3.902s
Writing XDEF Finalizing:     0.246s
            Writing XDC:     0.008s
-----
        [No GC] *Total*:    12.048s
INFO: Write routed design
picoblaze_partial_routed.dcp
```

The output contains three main sections regarding reading the design checkpoint, RWRoute processing info, and routing statistics.

### 13.3.3 Validation with Vivado

1. To visualize the original design in the Vivado device view, run Vivado in GUI mode:

```
vivado
```

2. To load the original checkpoint, run the following command in the Tcl console:

```
open_checkpoint picoblaze_partial.dcp
```

3. After the original checkpoint is loaded, to highlight unrouted nets, run:

```
highlight_objects -color red [get_nets * -filter {ROUTE_STATUS == UNROUTED}]
```

As a result, the device view of Vivado will show:



Nets highlighted in red are unrouted.

4. To check the route status of the original design checkpoint, run:

```
report_route_status
```

The design route status is as follows:

| Design Route Status           | # nets |
|-------------------------------|--------|
| # of logical nets             | 147009 |
| # of nets not needing routing | 58434  |
| # of internally routed nets   | 47124  |
| # of nets with no loads       | 11132  |
| # of implicitly routed ports  | 178    |
| # of routable nets            | 88575  |
| # of unrouted nets            | 12144  |
| # of fully routed nets        | 76431  |
| # of nets with routing errors | 0      |

The report shows 12144 unrouted nets.

5. To load the routed design checkpoint into Vivado and validate the design routed by RWRoute, run:

```
open_checkpoint picoblaze_partial_routed.dcp
report_route_status
```



| Design Route Status           | # nets |
|-------------------------------|--------|
| # of logical nets             | 147009 |
| # of nets not needing routing | 58434  |
| # of internally routed nets   | 47124  |
| # of nets with no loads       | 11132  |
| # of implicitly routed ports  | 178    |
| # of routable nets            | 88575  |
| # of fully routed nets        | 88575  |
| # of nets with routing errors | 0      |

The design is successfully routed, as all the routable nets are fully routed.

## 13.4 RapidWright Report Timing Example

Reports the critical path within an example design (e.g. “microblaze4.dcp”).

### 13.4.1 Background

Please see our FPT'19 paper, "An Open-source Lightweight Timing Model for RapidWright" (Presentation) for background details on our RapidWright Timing Model.

### 13.4.2 Steps to Run

1. Download the example microblaze4.dcp design:

```
wget http://www.rapidwright.io/docs/_downloads/microblaze4.dcp
```

2. Invoke ReportTimingExample via gradle (this will ensure code is compiled before running):

```
rapidwright ReportTimingExample microblaze4.dcp
```

The source code for ReportTimingExample.java is provided below for easy reference:

```
public static void main(String[] args) {
    if(args.length != 1) {
        System.out.println("USAGE: <dcp_file_name>");
        return;
    }
    CodePerfTracker t = new CodePerfTracker("Report Timing Example");
    t.useGCToTrackMemory(true);

    // Read in an example placed and routed DCP
    t.start("Read DCP");
    Design design = Design.readCheckpoint(args[0], CodePerfTracker.SILENT);

    // Instantiate and populate the timing manager for the design
    t.stop().start("Create TimingManager");
    TimingManager tim = new TimingManager(design);

    // Get and print out worst data path delay in design
    t.stop().start("Get Max Delay");
    GraphPath<TimingVertex, TimingEdge> criticalPath = tim.getTimingGraph().getMaxDelayPath();

    // Print runtime summary
    t.stop().printSummary();
    System.out.println("\nCritical path: " + ((int) criticalPath.getWeight()) + " ps");
    System.out.println("\nPath details:");
    System.out.println(criticalPath.toString().replace(", ", ",\n")+"\n");
}
```

### 13.4.3 Example Output

Please refer to the timing library Javadoc and code for more implementation details. The Java source code for the timing library is located in: [RapidWright/src/com/xilinx/rapidwright/timing/](#).

Example output using the microblaze4.dcp design is included below:

```
=====
==                        Report Timing Example                         ==
=====
            Read DCP:     6.275s   436.922MBs
Create TimingManager:     1.838s    19.600MBs
       Get Max Delay:     0.087s     0.213MBs
=====
             *Total*:     8.200s   456.734MBs

Critical path: 1921 ps

Path details:
[superSource -> microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Data_Flow_I/Operand_Select_I/EX_Op2_reg[31]/Q,
 microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Data_Flow_I/Operand_Select_I/EX_Op2_reg[31]/Q -> microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Data_Flow_I/ALU_I/Using_FPGA.ALL_Bits[31].ALU_Bit_I1/Not_Last_Bit.I_ALU_LUT_V5/Using_FPGA.Native/LUT6/I0,
 microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Data_Flow_I/ALU_I/Using_FPGA.ALL_Bits[31].ALU_Bit_I1/Not_Last_Bit.I_ALU_LUT_V5/Using_FPGA.Native/LUT6/I0 -> microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Data_Flow_I/ALU_I/Using_FPGA.ALL_Bits[31].ALU_Bit_I1/Not_Last_Bit.I_ALU_LUT_V5/Using_FPGA.Native/LUT6/O,
 microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Data_Flow_I/ALU_I/Using_FPGA.ALL_Bits[31].ALU_Bit_I1/Not_Last_Bit.I_ALU_LUT_V5/Using_FPGA.Native/LUT6/O -> microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Data_Flow_I/ALU_I/Use_Carry_Decoding.CarryIn_MUXCY/Using_FPGA.Native_CARRY4_CARRY8/S[1],
 microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Data_Flow_I/ALU_I/Use_Carry_Decoding.CarryIn_MUXCY/Using_FPGA.Native_CARRY4_CARRY8/S[1] -> microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Data_Flow_I/ALU_I/Use_Carry_Decoding.CarryIn_MUXCY/Using_FPGA.Native_CARRY4_CARRY8/CO[7],
 microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Data_Flow_I/ALU_I/Use_Carry_Decoding.CarryIn_MUXCY/Using_FPGA.Native_CARRY4_CARRY8/CO[7] -> microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Data_Flow_I/ALU_I/Using_FPGA.ALL_Bits[24].ALU_Bit_I1/Not_Last_Bit.MUXCY_XOR_I/Using_FPGA.Native_I1_CARRY4_CARRY8/CI,
 microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Data_Flow_I/ALU_I/Using_FPGA.ALL_Bits[24].ALU_Bit_I1/Not_Last_Bit.MUXCY_XOR_I/Using_FPGA.Native_I1_CARRY4_CARRY8/CI -> microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Data_Flow_I/ALU_I/Using_FPGA.ALL_Bits[24].ALU_Bit_I1/Not_Last_Bit.MUXCY_XOR_I/Using_FPGA.Native_I1_CARRY4_CARRY8/O[2],
 microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Data_Flow_I/ALU_I/Using_FPGA.ALL_Bits[24].ALU_Bit_I1/Not_Last_Bit.MUXCY_XOR_I/Using_FPGA.Native_I1_CARRY4_CARRY8/O[2] -> microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Decode_I/Using_FPGA.Native_i_1_73/I0,
 microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Decode_I/Using_FPGA.Native_i_1_73/I0 -> microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Decode_I/Using_FPGA.Native_i_1_73/O,
 microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Decode_I/Using_FPGA.Native_i_1_73/O -> microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Decode_I/PreFetch_Buffer_I1/Instruction_Prefetch_Mux[33].Gen_Instr_DFF/Using_FPGA.Native_i_1_33/I0,
 microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Decode_I/PreFetch_Buffer_I1/Instruction_Prefetch_Mux[33].Gen_Instr_DFF/Using_FPGA.Native_i_1_33/I0 -> microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Decode_I/PreFetch_Buffer_I1/Instruction_Prefetch_Mux[33].Gen_Instr_DFF/Using_FPGA.Native_i_1_33/O,
 microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Decode_I/PreFetch_Buffer_I1/Instruction_Prefetch_Mux[33].Gen_Instr_DFF/Using_FPGA.Native_i_1_33/O -> microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Data_Flow_I/Operand_Select_I/EX_Branch_CMP_Op1_reg[22]/D,
 microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Data_Flow_I/Operand_Select_I/EX_Branch_CMP_Op1_reg[22]/D -> superSink]
```

The `GraphPath<TimingVertex, TimingEdge>` object contains a path/set of edges. From each `TimingEdge` you can access the net delay and/or the logic delay values for more detail.

This example was designed to illustrate the default way to use the timing library to report the critical path and its associated data path delay.

### 13.4.4 Compare with Vivado

To compare the output of the RapidWright timing model to Vivado, run the following at the prompt:

```
vivado -mode tcl microblaze4.dcp
```

and then run the following command at the Tcl prompt:

```
report_timing
```

In Vivado 2020.1, the timing report shows a data path delay of 1.846 ns (1846 ps). Relative to the 1921 ps critical path reported by RapidWright above, this is a difference of 75 ps, or about 4.1%. The full Vivado timing report is shown below:

```
-----
| Tool Version      : Vivado v.2020.1 (lin64) Build 2902540 Wed May 27 19:54:35 MDT 2020
| Date              : Mon Nov 8 22:17:03 2021
| Host              : yun-Latitude-3470 running 64-bit Ubuntu 16.04.7 LTS
| Command           : report_timing
| Design            : design_1
| Device            : xcvu3p-ffvc1517
| Speed File        : -2 PRODUCTION 1.27 02-28-2020
| Temperature Grade : E
-----
Timing Report

Slack (MET) :          0.051ns (required time - arrival time)
Source:            microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Data_Flow_I/Operand_Select_I/EX_Branch_CMP_Op1_reg[8]/C
                   (rising edge-triggered cell FDRE clocked by TS_clk  {rise@0.000ns fall@1.000ns period=2.000ns})
Destination:       microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Use_Debug_Logic.Master_Core.Debug_Perf/single_step_count_reg[0]/CE
                   (rising edge-triggered cell FDRE clocked by TS_clk  {rise@0.000ns fall@1.000ns period=2.000ns})
Path Group:        TS_clk
Path Type:         Setup (Max at Slow Process Corner)
Requirement:       2.000ns (TS_clk rise@2.000ns - TS_clk rise@0.000ns)
Data Path Delay:   1.846ns (logic 0.730ns (39.545%) route 1.116ns (60.455%))
Logic Levels:      7 (CARRY8=4 LUT2=1 LUT4=1 LUT6=1)
Clock Path Skew:   -0.007ns (DCD - SCD + CPR)
Destination Clock Delay (DCD): 0.021ns = ( 2.021 - 2.000 )
Source Clock Delay (SCD):      0.028ns
```

| Location      | Delay type                           | Incr(ns) | Path(ns) | Netlist Resource(s) |
|---------------|--------------------------------------|----------|----------|---------------------|
|               | (clock TS_clk rise edge)             | 0.000    | 0.000 r  |                     |
|               |                                      | 0.000    | 0.000 r  | Clk_0 (IN)          |
|               | net (fo=835, unset)                  | 0.028    | 0.028    | microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Data_Flow_I/Operand_Select_I/Clk |
| SLICE_X75Y108 | FDRE                                 |          | r        | microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Data_Flow_I/Operand_Select_I/EX_Branch_CMP_Op1_reg[8]/C |
| SLICE_X75Y108 | FDRE (Prop_DFF2_SLICEL_C_Q)          | 0.081    | 0.109 f  | microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Data_Flow_I/Operand_Select_I/EX_Branch_CMP_Op1_reg[8]/Q |
|               | net (fo=1, routed)                   | 0.300    | 0.409    | microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Data_Flow_I/Zero_Detect_I/Using_FPGA.Native_0[21] |
| SLICE_X75Y108 | LUT6 (Prop_C6LUT_SLICEL_I2_O)        | 0.088    | 0.497 r  | microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Data_Flow_I/Zero_Detect_I/S0_inferred_3/i_0 |
|               | net (fo=1, routed)                   | 0.010    | 0.507    | microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Data_Flow_I/Zero_Detect_I/Part_Of_Zero_Carry_Start_lopt_5 |
| SLICE_X75Y108 | CARRY8 (Prop_CARRY8_SLICEL_S[2]_CO[7]) | 0.155  | 0.662 f  | microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Data_Flow_I/Zero_Detect_I/Part_Of_Zero_Carry_Start_Using_FPGA.Native_CARRY4_CARRY8_CO[7] |
|               | net (fo=1, routed)                   | 0.026    | 0.688    | microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Decode_I/jump_logic_I1/MUXCY_JUMP_CARRY2/jump_carry1 |
| SLICE_X75Y109 | CARRY8 (Prop_CARRY8_SLICEL_CI_CO[1]) | 0.042    | 0.730 f  | microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Decode_I/jump_logic_I1/MUXCY_JUMP_CARRY2/Using_FPGA.Native_CARRY4_CARRY8_CO[1] |
|               | net (fo=1, routed)                   | 0.117    | 0.847    | microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Decode_I/jump_logic_I1/MUXCY_JUMP_CARRY3/ex_jump_wanted |
| SLICE_X74Y109 | LUT4 (Prop_D6LUT_SLICEM_I0_O)        | 0.051    | 0.898 r  | microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Decode_I/jump_logic_I1/MUXCY_JUMP_CARRY3/Using_FPGA.Native_i_1_100/O |
|               | net (fo=1, routed)                   | 0.025    | 0.923    | microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Decode_I/mem_wait_on_ready_N_carry_or/MUXCY_I_lopt_9 |
| SLICE_X74Y109 | CARRY8 (Prop_CARRY8_SLICEM_S[3]_CO[7]) | 0.163  | 1.086 r  | microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Decode_I/mem_wait_on_ready_N_carry_or/MUXCY_I_Using_FPGA.Native_CARRY4_CARRY8_CO[7] |
|               | net (fo=1, routed)                   | 0.026    | 1.112    | microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Decode_I/Use_MuxCy[7].OF_Piperun_Stage/MUXCY_I/of_PipeRun_carry_5 |
| SLICE_X74Y110 | CARRY8 (Prop_CARRY8_SLICEM_CI_CO[4]) | 0.099    | 1.211 r  | microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Decode_I/Use_MuxCy[7].OF_Piperun_Stage/MUXCY_I/Using_FPGA.Native_CARRY4_CARRY8/CO[4] |
|               | net (fo=324, routed)                 | 0.472    | 1.683    | microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Use_Debug_Logic.Master_Core.Debug_Perf/of_piperun_for_ce |
| SLICE_X80Y99  | LUT2 (Prop_F6LUT_SLICEM_I1_O)        | 0.051    | 1.734 r  | microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Use_Debug_Logic.Master_Core.Debug_Perf/single_step_count[0]_i_1/O |
|               | net (fo=2, routed)                   | 0.140    | 1.874    | microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Use_Debug_Logic.Master_Core.Debug_Perf/single_step_count[0]_i_1_n_0 |
| SLICE_X80Y99  | FDRE                                 |          | r        | microblaze_0/U0/MicroBlaze_Core_I/Performance.Core/Use_Debug_Logic.Master_Core.Debug_Perf/single_step_count_reg[0]/CE |
| <hr/>                                                                                                                        |                                      |        |        |                    |
| ↳ ----                                                                                                                       |                                      |        |        |                    |
|                                                                                                                              | (clock TS_clk rise edge)             | 2.000  | 2.000  | r                  |
|                                                                                                                              |                                      | 0.000  | 2.000  | r Clk_0 (IN)       |
| SLICE_X80Y99                                                                                                                 | net (fo=835, unset)                  | 0.021  | 2.021  | microblaze_0/      |
| ↳ MicroBlaze_Core_I/Performance.Core/Use_Debug_Logic.Master_Core.Debug_Perf/Clk                                              |                                      |        |        |                    |
|                                                                                                                              | FDRE                                 |        |        | r microblaze_0/U0/ |
| ↳ MicroBlaze_Core_I/Performance.Core/Use_Debug_Logic.Master_Core.Debug_Perf/single_step_count_reg[0]/C                       |                                      |        |        |                    |
|                                                                                                                              | clock pessimism                      | 0.000  | 2.021  |                    |
|                                                                                                                              | clock uncertainty                    | -0.035 | 1.986  |                    |
| SLICE_X80Y99                                                                                                                 | FDRE (Setup_HFF2_SLICEM_C_CE)        | -0.061 | 1.925  | microblaze_0/      |
| ↳ U0/MicroBlaze_Core_I/Performance.Core/Use_Debug_Logic.Master_Core.Debug_Perf/single_step_count_reg[0]                      |                                      |        |        |                    |
| <hr/>                                                                                                                        |                                      |        |        |                    |
|                                                                                                                              | required time                        |        | 1.925  |                    |
|                                                                                                                              | arrival time                         |        | -1.874 |                    |
| <hr/>                                                                                                                        |                                      |        |        |                    |
|                                                                                                                              | slack                                |        | 0.051  |                    |

## 13.5 Reuse Timing-closed Logic As A Shell

### 13.5.1 Background

Often in FPGA development, a desirable timing-closed implementation is only achieved after several iterations or many parallel implementation runs of a design. Elusive timing closure can be caused by one or a few stubborn modules in a design that have tight constraints or a large number of moderately difficult paths that have a lower probability of timing closure on any given run.

One advantageous strategy to improve timing closure success can be to preserve and enable reuse of a known good implementation of the stubborn logic. By preserving the implementation, place and route tools can (hopefully) avoid rediscovering difficult timing closure and simply focus on the other logic.

Some traditional approaches in Vivado to employ this preservation strategy are [Incremental Implementation Flows](#) and [Dynamic Function eXchange](#) (DFX, previously known as partial reconfiguration or PR). Incremental Implementation Flows can work if the design has mostly converged and the number of future changes to the design is small. However, if significant development still remains, this strategy is unlikely to save compile time.

Using DFX, one can lock down a portion of the design to form a reusable shell along with one or more reconfigurable partitions that contain logic under development. However, using DFX for this reuse methodology comes with some additional restrictions, such as requiring area constraints and partition pin placements between the static and dynamic partitions of the design. It is also more difficult to achieve an overlap of the preserved logic and the new logic, and the nature of DFX requires additional DRCs that would not otherwise be run.

This tutorial offers an alternative to the DFX flow with fewer restrictions and the ability to reuse timing-closed logic without the need for area constraints by using the capabilities inherent in RapidWright.

### 13.5.2 Approach

To enable reuse of a timing-closed design as a shell in RapidWright, the original design will need some minor modifications.



1. The design should be logically partitioned into two parts: static and dynamic (as shown in the diagram above). The static part of the design is everything that should be preserved and be part of the “shell”. For example, many designs include components for handling network, DDR memory or a PCIe interface. These kinds of modules typically will have more demanding timing constraints and benefit from reusing their timing closure. The dynamic component is the portion of the design that the designer wants to change over time. The main requirement is that the dynamic component must be composed of one or more logical modules. If there is logic that needs to be modified at the top level of the design, it should be migrated into an existing module or a new module should be created and the logic added to it.
2. The interface of the dynamic modules must be consistent with all future logic modules that will populate it. In theory, this is straightforward. However, during synthesis, design optimization, placement and routing, optimizations can modify the original interface of a module so that it no longer is consistent with the original definition and subsequent runs can cause divergence. To avoid this, dynamic modules should have the `DONT_TOUCH` synthesis attribute applied to the module instance. The alternative `KEEP_HIERARCHY` is not sufficient as `DONT_TOUCH` will stay persistent on the netlist through routing whereas `KEEP_HIERARCHY` will only persist through synthesis.

Note that applying `DONT_TOUCH` to a module instance means that Vivado cannot add or remove pins of the instance, but can connect or disconnect pins and optimize logic inside the hierarchical module. Once the design is properly partitioned and synthesis attributes applied to dynamic modules, the design should be implemented using the typical implementation flow in Vivado. Once a fully placed and routed implementation that meets all requirements has been achieved, this design can be preserved as a design checkpoint (DCP) and used to seed the shell creation process.

This candidate shell design can then be loaded into RapidWright and all dynamic modules turned into black boxes.

### 13.5.3 Getting Started

#### 1. Prerequisites

To run this tutorial, you will need:

1. RapidWright 2023.1.3 or later
2. Vivado 2023.1 or later

#### 2. Creating a Candidate Implementation

For ease of demonstration in this tutorial, we have chosen a simple RISCV design targeting a KCU105 board (Kintex UltraScale xcku040-ffval1156-2-e). The design was created using the [Linux on LiteX-VexRiscv](#) project, but we will recreate the design using a minimal set of steps and dependencies.

---

**Note:** This design compilation step can take up to 30 minutes to complete and it is highly recommended to skip past it to save time. To do so, you can download the output files instead by running:

```
wget http://www.rapidwright.io/docs/_downloads/kcu105_step2.zip  
unzip kcu105_step2.zip  
cd kcu105  
vivado &
```

and then skip to step 3.

---

To get started, follow the commands below to download the source files:

```
wget http://www.rapidwright.io/docs/_downloads/kcu105_example.zip  
unzip kcu105_example.zip  
cd kcu105  
vivado -source kcu105.tcl &
```

The included script will create a Vivado project, load the generated Verilog and synthesize, optimize and place and route the design. The Verilog module for one of the RISCV CPUs has already been annotated for you with `DONT_TOUCH` and will serve as our dynamic module for this tutorial. The script will take several minutes to complete but will generate a placed and routed DCP and EDIF file ready for RapidWright. Notice we are running Vivado in the background as we will come back to the terminal shortly.

A sample result is shown in the image below with the leaf cells of the CPU core (`cores_1_cpu_logic_cpu`) highlighted in yellow.



For convenience in this tutorial, we will generate the logic that will populate the dynamic module directly from this project. We simply need to change the top of the design to the `VexRiscv_1` core and then resynthesize using the `-mode out_of_context` option:

```
set_property top VexRiscv_1 [current_fileset]
reset_run synth_1
synth_design -mode out_of_context
write_checkpoint riscv_1_synth.dcp
```

At this point we should have two DCPs, one placed and routed candidate DCP to be made into a shell and one synthesized RISCV core that will populate the dynamic region in our shell.

#### 3. Creating a Shell

To create a shell implementation, we need to take our top-level RISCV design that has the static portion meeting all necessary constraints and remove all logic from the dynamic components.

To remove the logic in the dynamic module, we need to use RapidWright to carefully separate the static logic from the dynamic logic, as no area constraints (i.e. pblocks) were used to separate the two. Vivado can create a black box, but it can only do so correctly when the module being turned into a black box was constrained such that none of its logic shares a site with static logic. RapidWright has a built-in command that accepts a DCP and one or more cell instance names and produces a shell-based design with those cell instances turned into black boxes. For our example, we can run RapidWright from the command line (outside of Vivado):

```
rapidwright MakeBlackBox kcu105_route.dcp kcu105_route_shell.dcp \
  VexRiscvLitexSmpCluster_Cc4_Iw64Is8192Iy2_Dw64Ds8192Dy2_ITs4DTs4_Ldw512_Cdma_Ood/cores_1_cpu_logic_cpu
```

This will create a new “shell” DCP (`kcu105_route_shell.dcp`) where the dynamic module has been turned into a black box. This DCP can then be used again and again as a base starting point as it contains an implemented solution for all of the static logic and we will use Vivado (and RapidWright in the future) to place and route additional dynamic modules on top of it.
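When a shell has more than one dynamic module, it can be convenient to build the `MakeBlackBox` command line from a script. The following is a minimal sketch, not part of RapidWright itself: the helper name `make_black_box_cmd` is hypothetical, and it assumes that additional cell instance names are simply appended as further positional arguments after the output DCP.

```python
import shlex


def make_black_box_cmd(input_dcp, output_dcp, cell_names):
    """Build the argument list for `rapidwright MakeBlackBox`.

    Assumption: every extra cell instance name is passed as one more
    positional argument after the output DCP.
    """
    if not cell_names:
        raise ValueError("at least one cell instance name is required")
    return ["rapidwright", "MakeBlackBox", input_dcp, output_dcp, *cell_names]


cmd = make_black_box_cmd(
    "kcu105_route.dcp",
    "kcu105_route_shell.dcp",
    [
        "VexRiscvLitexSmpCluster_Cc4_Iw64Is8192Iy2_Dw64Ds8192Dy2_"
        "ITs4DTs4_Ldw512_Cdma_Ood/cores_1_cpu_logic_cpu"
    ],
)

# Print a copy-and-paste friendly version of the command.
print(shlex.join(cmd))

# To actually run it (requires RapidWright on the PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with the long hierarchical cell names.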

#### 4. Populating a Black Box

Returning to our running Vivado instance, we can close our previous project and load the shell DCP using `open_checkpoint` at the Tcl command prompt:

```
close_project  
open_checkpoint kcu105_route_shell.dcp
```

---

**Note:** Due to the large number of constraints generated in RapidWright, opening the checkpoint might take a few minutes.

---

If RapidWright was able to correctly create the black box, you should see exactly one critical warning, which may show up in a dialog from Vivado as shown below:



The implemented design will look similar to the original design, except that the cells previously highlighted in yellow above will be missing:



You may also notice that several BEL sites have been marked with a **PROHIBIT** property that prevents any cells from being placed in those locations. Through experimentation, it has been found that cells placed in the same half SLICE as those in the existing static logic portion of the design can lead to congestion. Therefore, RapidWright adds the **PROHIBIT** property to the remaining BEL sites in any occupied half SLICEs to avoid this issue. These prohibited locations can be seen in the image below (the red circles with a slash):



We can also verify that the design is consistent by checking the routing status:

```
report_route_status
```

This should return a result similar to the following:

| Design Route Status           | # nets |
|-------------------------------|--------|
| # of logical nets             | 65546  |
| # of nets not needing routing | 23431  |
| # of internally routed nets   | 20613  |
| # of nets with no loads       | 2818   |
| # of routable nets            | 42115  |
| # of unrouted nets            | 38     |
| # of fully routed nets        | 42077  |
| # of nets with routing errors | 0      |

The key element to look for is that there are no nets with routing errors. Since that value is 0, we can proceed.
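If the shell is created in an automated flow, the same check can be scripted. The sketch below assumes the output of `report_route_status` has been captured as text; the dotted layout of the sample string is a reconstruction for illustration, and the function name `parse_route_status` is hypothetical. The counts are the ones shown above.

```python
import re


def parse_route_status(report):
    """Map each '# of ...' label in the report text to its net count."""
    counts = {}
    for line in report.splitlines():
        # Matches lines such as "# of nets with routing errors.... :  0 :"
        match = re.search(r"#\s*of\s+(.+?)\.*\s*:\s*(\d+)\s*:", line)
        if match:
            counts[match.group(1).strip()] = int(match.group(2))
    return counts


# Sample text; the exact spacing and dot fill are assumptions.
SAMPLE_REPORT = """\
Design Route Status
                                               :      # nets :
   # of logical nets.......................... :       65546 :
       # of nets not needing routing.......... :       23431 :
           # of internally routed nets........ :       20613 :
           # of nets with no loads............ :        2818 :
       # of routable nets..................... :       42115 :
           # of unrouted nets................. :          38 :
           # of fully routed nets............. :       42077 :
       # of nets with routing errors.......... :           0 :
"""

counts = parse_route_status(SAMPLE_REPORT)
if counts["nets with routing errors"] != 0:
    raise SystemExit("shell has routing errors; do not proceed")
print("no routing errors, unrouted nets:", counts["unrouted nets"])
```

The unrouted nets are expected at this stage, since the nets that connected to the removed dynamic module no longer have a complete route.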

At this point, we want to lock down the implementation so that further place and route runs do not upset the timing closure of the design. We can do this by running the Vivado Tcl command:

```
lock_design -level routing
```

This tags the netlist, placement and routing such that `place_design` and `route_design` do not modify the netlist of the existing implementation—thus preserving the original timing closure.

To populate the black box with the synthesized, out-of-context version of the RISCV core, we can load it directly in Vivado with `read_checkpoint -cell` (this is different from `open_checkpoint`).

```
read_checkpoint -cell \
  VexRiscvLitexSmpCluster_Cc4_Iw64Is8192Iy2_Dw64Ds8192Dy2_ITs4DTs4_Ldw512_Cdma_Ood/cores_1_cpu_logic_cpu \
  riscv_1_synth.dcp
```

Once the dynamic module has been loaded with the synthesized RISCV core, we can implement the design and check the results:

```
# We need to waive a DRC due to the nature of the design
set_msg_config -id {Common 17-55} -new_severity {Warning}
set_property SEVERITY {Warning} [get_drc_checks REQP-1753]
place_design
route_design
report_route_status
report_timing
```

Results should look similar to:

| Design Route Status           | # nets |
|-------------------------------|--------|
| # of logical nets             | 75917  |
| # of nets not needing routing | 28293  |
| # of internally routed nets   | 24553  |
| # of nets with no loads       | 3740   |
| # of routable nets            | 47624  |
| # of nets with fixed routing  | 41853  |
| # of fully routed nets        | 47624  |
| # of nets with routing errors | 0      |

and should meet timing:

| Timing Report                 |                                                                                                                                        |
|-------------------------------|----------------------------------------------------------------------------------------------------------------------------------------|
| Slack (MET)                   | 0.253ns (required time - arrival time)                                                                                                 |
| Source                        | main_crg_idelayctrl_ic_reset_reg/C (rising edge-triggered cell FDRE clocked by main_crg_clkout1 {rise@0.000ns fall@2.500ns period=5.000ns}) |
| Destination                   | IDELAYCTRL_REPLICATED_0_2/RST (recovery check against rising-edge clock main_crg_clkout1 {rise@0.000ns fall@2.500ns period=5.000ns})   |
| Path Group                    | **async_default**                                                                                                                      |
| Path Type                     | Recovery (Max at Slow Process Corner)                                                                                                  |
| Requirement                   | 5.000ns (main_crg_clkout1 rise@5.000ns - main_crg_clkout1 rise@0.000ns)                                                                |
| Data Path Delay               | 3.838ns (logic 0.117ns (3.048%) route 3.721ns (96.952%))                                                                               |
| Logic Levels                  | 0                                                                                                                                      |
| Clock Path Skew               | -0.211ns (DCD - SCD + CPR)                                                                                                             |
| Destination Clock Delay (DCD) | 5.765ns = ( 10.765 - 5.000 )                                                                                                           |


| Location              | Delay type                                             | Incr(ns) | Path(ns) | Netlist Resource(s)                 |
|-----------------------|--------------------------------------------------------|----------|----------|-------------------------------------|
|                       | (clock main_crg_clkout1 rise edge)                     | 0.000    | 0.000    | r                                   |
| G10                   |                                                        | 0.000    | 0.000    | r clk125_p (IN)                     |
|                       | net (fo=0)                                             | 0.001    | 0.001    | IBUFDS/I                            |
| HPIOBDIFFINBUF_X1Y59  | DIFFINBUF (Prop_DIFFINBUF_HPIOBDIFFINBUF_DIFF_IN_P_O)  | 0.521    | 0.522    | r IBUFDS/DIFFINBUF_INST/O           |
|                       | net (fo=1, routed)                                     | 0.090    | 0.612    | IBUFDS/OUT                          |
| G10                   | IBUFCTRL (Prop_IBUFCTRL_HPIOB_I_O)                     | 0.000    | 0.612    | r IBUFDS/IBUFCTRL_INST/O            |
|                       | net (fo=1, routed)                                     | 0.750    | 1.362    | IBUFDS_n_0_BUFG_inst_n_0            |
| BUFGCE_X1Y52          | BUFGCE (Prop_BUFGCE_BUFGCE_I_O)                        | 0.083    | 1.445    | r IBUFDS_n_0_BUFG_inst/O            |
|                       | net (fo=9, routed)                                     | 1.687    | 3.132    | main_crg_clkin                      |
| MMCME3_ADV_X1Y2       | MMCME3_ADV (Prop_MMCME3_ADV_CLKIN1_CLKOUT1)            | -0.231   | 2.901    | r MMCME2_ADV/CLKOUT1                |
|                       | net (fo=1, routed)                                     | 0.437    | 3.338    | main_crg_clkout1                    |
| BUFGCE_X1Y69          | BUFGCE (Prop_BUFGCE_BUFGCE_I_O)                        | 0.083    | 3.421    | r BUFG/O                            |
| X0Y1 (CLOCK_ROOT)     | net (fo=31, routed)                                    | 2.666    | 6.087    | idelay_clk                          |
| SLICE_X0Y139          | FDRE                                                   |          |          | r main_crg_idelayctrl_ic_reset_reg/C |
| SLICE_X0Y139          | FDRE (Prop_HFF2_SLICEL_C_Q)                            | 0.117    | 6.204    | f main_crg_idelayctrl_ic_reset_reg/Q |
|                       | net (fo=25, routed)                                    | 3.721    | 9.925    | main_crg_idelayctrl_ic_reset        |
| BITSLICE_CONTROL_X0Y3 | IDELAYCTRL                                             |          |          | f IDELAYCTRL_REPLICATED_0_2/RST     |
|                       | (clock main_crg_clkout1 rise edge)                     | 5.000    | 5.000    | r                                   |
| G10                   |                                                        | 0.000    | 5.000    | r clk125_p (IN)                     |
|                       | net (fo=0)                                             | 0.001    | 5.001    | IBUFDS/I                            |
| HPIOBDIFFINBUF_X1Y59  | DIFFINBUF (Prop_DIFFINBUF_HPIOBDIFFINBUF_DIFF_IN_P_O)  | 0.324    | 5.325    | r IBUFDS/DIFFINBUF_INST/O           |
|                       | net (fo=1, routed)                                     | 0.051    | 5.376    | IBUFDS/OUT                          |
| G10                   | IBUFCTRL (Prop_IBUFCTRL_HPIOB_I_O)                     | 0.000    | 5.376    | r IBUFDS/IBUFCTRL_INST/O            |
|                       | net (fo=1, routed)                                     | 0.649    | 6.025    | IBUFDS_n_0_BUFG_inst_n_0            |
| BUFGCE_X1Y52          | BUFGCE (Prop_BUFCE_BUFGCE_I_O)                         | 0.075    | 6.100    | r IBUFDS_n_0_BUFG_inst/O            |
|                       | net (fo=9, routed)                                     | 1.524    | 7.624    | main_crg_clkin                      |
| MMCME3_ADV_X1Y2       | MMCME3_ADV (Prop_MMCME3_ADV_CLKIN1_CLKOUT1)            | 0.335    | 7.959    | r MMCME2_ADV/CLKOUT1                |
|                       | net (fo=1, routed)                                     | 0.372    | 8.331    | main_crg_clkout1                    |
| BUFGCE_X1Y69          | BUFGCE (Prop_BUFCE_BUFGCE_I_O)                         | 0.075    | 8.406    | r BUFG/O                            |
| X0Y1 (CLOCK_ROOT)     | net (fo=31, routed)                                    | 2.359    | 10.765   | idelay_clk                          |
| BITSLICE_CONTROL_X0Y3 | IDELAYCTRL                                             |          |          | r IDELAYCTRL_REPLICATED_0_2/REFCLK  |
|                       | clock pessimism                                        | 0.112    | 10.876   |                                     |
|                       | clock uncertainty                                      | -0.065   | 10.812   |                                     |
| BITSLICE_CONTROL_X0Y3 | IDELAYCTRL (Recov_CONTROL_BITSLICE_CONTROL_REFCLK_RST) | -0.633   | 10.179   | IDELAYCTRL_REPLICATED_0_2           |
|                       | required time                                          |          | 10.179   |                                     |
|                       | arrival time                                           |          | -9.925   |                                     |
|                       | slack                                                  |          | 0.253    |                                     |
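To see how the slack in the report is derived, the `Incr(ns)` column can be summed by hand. The sketch below is only an arithmetic check on the values printed above, not something Vivado provides: the arrival time is the source clock path plus the data path, and the required time is the destination clock path (starting at the 5.000 ns edge) plus clock pessimism, clock uncertainty and the recovery requirement.

```python
from decimal import Decimal

# Incr(ns) values copied from the timing report above, in path order.
SOURCE_CLOCK = ["0.000", "0.000", "0.001", "0.521", "0.090", "0.000", "0.750",
                "0.083", "1.687", "-0.231", "0.437", "0.083", "2.666"]
DATA_PATH = ["0.117", "3.721"]
DEST_CLOCK = ["5.000", "0.000", "0.001", "0.324", "0.051", "0.000", "0.649",
              "0.075", "1.524", "0.335", "0.372", "0.075", "2.359"]
# Clock pessimism, clock uncertainty and the IDELAYCTRL recovery requirement.
ADJUSTMENTS = ["0.112", "-0.065", "-0.633"]


def total(values):
    """Sum decimal strings exactly, avoiding binary floating-point error."""
    return sum((Decimal(v) for v in values), Decimal("0"))


arrival = total(SOURCE_CLOCK) + total(DATA_PATH)
required = total(DEST_CLOCK) + total(ADJUSTMENTS)
slack = required - arrival

print("arrival time :", arrival)
print("required time:", required)
print("slack        :", slack)
```

Summing the rounded increments reproduces the arrival time of 9.925 ns and the required time of 10.179 ns. The resulting slack differs from the reported 0.253 ns by 0.001 ns, which is consistent with the report rounding each printed value to three decimal places.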

The final implementation with the newly populated dynamic module highlighted in green is shown below.



Complexity can vary widely amongst different designs, so not all designs may benefit from this approach. However, please [reach out](#) to the RapidWright team if you encounter challenges when applying this approach for your own projects.

## 13.6 Use DREAMPlaceFPGA to Place a Netlist via FPGA Interchange Format

### 13.6.1 Background

DREAMPlaceFPGA is an open source GPU-accelerated placer for FPGAs that uses a deep learning toolkit. It is being developed at the University of Texas at Austin in Dr. David Pan's research group. DREAMPlaceFPGA has published work demonstrating some compelling placement runtime acceleration compared to other published placers. DREAMPlaceFPGA has also adopted support for the [FPGA Interchange Format](#).

The [FPGA Interchange Format](#) (FPGAIF) is a standard exchange format designed to provide all the information necessary to perform placement and routing in an open source context. See [FPGA Interchange Format](#) for additional details and resources.

### 13.6.2 Approach

This tutorial will demonstrate how to convert an existing design from Vivado into the FPGA Interchange Format to be placed in DREAMPlaceFPGA. It will then demonstrate how the resulting placed design can be routed either by the router in Vivado or in RapidWright via RWRoute as shown in the diagram below.



### 13.6.3 Getting Started

#### 1. Prerequisites

To run this tutorial, you will need:

1. RapidWright 2023.1.3 or later
2. Vivado 2023.1 or later
3. DREAMPlaceFPGA commit fb6d086

**Attention:** If you are using a pre-configured AWS Instance from a RapidWright hands-on conference event, DREAMPlaceFPGA has already been set up for you in `~/DREAMPlaceFPGA`.

To check out and build DREAMPlaceFPGA, please see their [build instructions](#). Also see the [note here](#) for how to generate an FPGA Interchange device model file. Our notes on the install process for CentOS 7 can be found here: [Notes on Setting Up DREAMPlaceFPGA](#).

#### 2. Getting an example design and converting it to the FPGA Interchange Format

For ease of demonstration in this tutorial, we have chosen a simple design targeting a VU3P (Virtex UltraScale+ xcvu3p-ffvc1517-2-e). To get started, follow the commands below (alternate design DCP download link here: [gnl_2_4_7_3.0_gnl_3500_03_7_80_80.dcp](#)):

```
wget http://www.rapidwright.io/docs/_downloads/gnl_2_4_7_3.0_gnl_3500_03_7_80_80.dcp
rapidwright DcpToInterchange gnl_2_4_7_3.0_gnl_3500_03_7_80_80.dcp
```

This will convert the design checkpoint file into two files:

1. `gnl_2_4_7_3.0_gnl_3500_03_7_80_80.netlist` – a logical netlist file in the FPGA Interchange Format
2. `gnl_2_4_7_3.0_gnl_3500_03_7_80_80.phys` – a physical netlist (placement and routing information) file in the FPGA Interchange Format

For this tutorial, we are only interested in #1 (the logical netlist) as we will be generating a new implementation with the tools mentioned above.
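When scripting the later steps, it helps to derive the two output names from the DCP path. The sketch below assumes, as the list above suggests, that the outputs are named after the DCP and sit next to it; the helper name `interchange_outputs` is hypothetical. Note that the design name itself contains a dot (`3.0`), so only the final `.dcp` suffix should be replaced.

```python
from pathlib import Path


def interchange_outputs(dcp_path):
    """Return the expected (.netlist, .phys) paths for a given DCP.

    Assumption: DcpToInterchange names its outputs after the DCP and
    writes them to the same directory.
    """
    dcp = Path(dcp_path)
    # with_suffix() replaces only the last suffix, so "3.0" is preserved.
    return dcp.with_suffix(".netlist"), dcp.with_suffix(".phys")


logical, physical = interchange_outputs("gnl_2_4_7_3.0_gnl_3500_03_7_80_80.dcp")
print(logical.name)
print(physical.name)
```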

#### 3. Placing the design with DREAMPlaceFPGA

A few preparatory steps are needed before running placement with DREAMPlaceFPGA. Currently, DREAMPlaceFPGA reads Interchange files by first converting them to the bookshelf format used in the [ISPD'16 contest](#). Convert the example logical netlist with the following command:

```
cd DREAMPlaceFPGA # Or wherever your DREAMPlaceFPGA installation is located
python3 IFsupport/IF2bookshelf.py --netlist ../gnl_2_4_7_3.0_gnl_3500_03_7_80_80.netlist
```

Next, DREAMPlaceFPGA needs a JSON settings file that configures the placement run. An example settings file for our design is shown below; it can also be downloaded directly (`gnl_2_4_7_3.0_gnl_3500_03_7_80_80.json`):

```
wget -O test/gnl_2_4_7_3.0_gnl_3500_03_7_80_80.json http://www.rapidwright.io/docs/_downloads/gnl_2_4_7_3.0_gnl_3500_03_7_80_80.json
```

```
{
  "aux_input" : "benchmarks/IF2bookshelf/gnl_2_4_7_3.0_gnl_3500_03_7_80_80/design.aux",
  "gpu" : 0,
  "num_bins_x" : 512,
  "num_bins_y" : 512,
  "global_place_stages" : [
    {"num_bins_x" : 512, "num_bins_y" : 512, "iteration" : 2000, "learning_rate" : 0.01,
     "wirelength" : "weighted_average", "optimizer" : "nesterov"}
  ],
  "routability_opt_flag" : 0,
  "target_density" : 1.0,
  "density_weight" : 8e-5,
  "random_seed" : 1000,
  "scale_factor" : 1.0,
  "global_place_flag" : 1,
  "legalize_flag" : 1,
  "detailed_place_flag" : 0,
  "dtype" : "float32",
  "plot_flag" : 0,
  "num_threads" : 1,
  "deterministic_flag" : 1,
  "enable_if" : 1,
  "part_name" : "xcvu3p-ffvc1517-2-e"
}
```

By default, GPU acceleration is disabled (`"gpu" : 0`) so that the tutorial runs on a wider range of machines. GPU acceleration is available with a compatible GPU (see [DREAMPlaceFPGA External Dependencies](#) for details). For a full description of the available options, see [Running DREAMPlaceFPGA](#).
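When trying several designs, it may be convenient to generate the settings file rather than edit it by hand. The sketch below reproduces the keys and values of the example settings file; the function `make_settings` is hypothetical, and treating `"gpu" : 1` as the way to enable acceleration is an assumption that should be checked against the DREAMPlaceFPGA documentation.

```python
import json


def make_settings(design, part_name, use_gpu=False, bins=512):
    """Return a settings dict with the same keys as the example above."""
    return {
        "aux_input": f"benchmarks/IF2bookshelf/{design}/design.aux",
        "gpu": 1 if use_gpu else 0,  # assumption: 1 enables GPU acceleration
        "num_bins_x": bins,
        "num_bins_y": bins,
        "global_place_stages": [
            {"num_bins_x": bins, "num_bins_y": bins, "iteration": 2000,
             "learning_rate": 0.01, "wirelength": "weighted_average",
             "optimizer": "nesterov"}
        ],
        "routability_opt_flag": 0,
        "target_density": 1.0,
        "density_weight": 8e-5,
        "random_seed": 1000,
        "scale_factor": 1.0,
        "global_place_flag": 1,
        "legalize_flag": 1,
        "detailed_place_flag": 0,
        "dtype": "float32",
        "plot_flag": 0,
        "num_threads": 1,
        "deterministic_flag": 1,
        "enable_if": 1,
        "part_name": part_name,
    }


settings = make_settings("gnl_2_4_7_3.0_gnl_3500_03_7_80_80", "xcvu3p-ffvc1517-2-e")
print(json.dumps(settings, indent=2))
```

The printed text can be redirected into `test/<design>.json` in the DREAMPlaceFPGA directory.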

To run DREAMPlaceFPGA with the configuration file, run the following at a terminal:

```
python3 dreamplacefpga/Placer.py test/gnl_2_4_7_3.0_gnl_3500_03_7_80_80.json
```

Placement may take a few minutes. When it finishes, a new FPGA Interchange physical netlist file is written to `results/design/design.phys`.
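The steps so far can be collected into a small driver script. The sketch below only assembles and prints the commands already shown in this tutorial (a dry run); the function name `placement_flow` and the assumption that DREAMPlaceFPGA lives in a `DREAMPlaceFPGA` subdirectory are illustrative.

```python
import shlex

DESIGN = "gnl_2_4_7_3.0_gnl_3500_03_7_80_80"


def placement_flow(design=DESIGN, dreamplace_dir="DREAMPlaceFPGA"):
    """Return (working_directory, argv) pairs for the steps shown above."""
    return [
        (".", ["rapidwright", "DcpToInterchange", f"{design}.dcp"]),
        (dreamplace_dir, ["python3", "IFsupport/IF2bookshelf.py",
                          "--netlist", f"../{design}.netlist"]),
        (dreamplace_dir, ["python3", "dreamplacefpga/Placer.py",
                          f"test/{design}.json"]),
    ]


# Dry run: print each command instead of executing it.
for cwd, argv in placement_flow():
    print(f"(cd {cwd} && {shlex.join(argv)})")
```

To actually execute a step, each pair could be passed to `subprocess.run(argv, cwd=cwd, check=True)`.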

#### 4. Converting the placed design to a DCP and routing it in Vivado

Now that the design is fully placed by DREAMPlaceFPGA, we can convert it back to a DCP and open it in Vivado by running the following command:

```
rapidwright PhysicalNetlistToDcp ../gnl_2_4_7_3.0_gnl_3500_03_7_80_80.netlist results/design/design.phys ../gnl_2_4_7_3.0_gnl_3500_03_7_80_80.xdc placed.dcp --out_of_context
```

This command will invoke RapidWright to load the logical netlist (which has not changed) and physical netlist (which now contains the new placement information) into a placed design checkpoint (placed.dcp), readable by Vivado. Opening this design in Vivado will show the resulting placement solution:

```
vivado placed.dcp &
```



By default, all cells in the design are locked (notice the orange cells that have been placed), as this is advantageous for some implementation flows used by RapidWright. The placement can be unlocked with the Vivado Tcl command `lock_design -unlock -level placement`. Also, the command above used the `--out_of_context` option so that Vivado treats the DCP as an out-of-context implementation and does not automatically insert buffers on all the top-level ports.

Now that the placed design is loaded in Vivado, we can route it by running the following Tcl command in Vivado:

```
route_design
```

Afterwards, we should see something like this:



We can then validate the solution of the route by running:

```
report_route_status
```

Which should report something similar to this:

```
Design Route Status
                                               :      # nets :
   ------------------------------------------- : ----------- :
   # of logical nets.......................... :        4937 :
       # of nets not needing routing.......... :         898 :
           # of internally routed nets........ :         748 :
           # of implicitly routed ports....... :         150 :
       # of routable nets..................... :        4039 :
           # of fully routed nets............. :        4039 :
       # of nets with routing errors.......... :           0 :
   ------------------------------------------- : ----------- :

The key metric to look for is the last one to ensure there are 0 nets with routing errors.
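In a scripted flow, this check can be automated by saving the report text and parsing it. The sketch below assumes the report lines follow the `# of <label>.... : <count> :` pattern seen in this tutorial; the exact layout may differ between Vivado versions, and `parse_route_status` is a hypothetical helper.

```python
import re


def parse_route_status(report_text):
    """Map each '# of ...' label in a route status report to its count."""
    counts = {}
    for line in report_text.splitlines():
        m = re.match(r"\s*# of (.+?)\.*\s*:\s*(\d+)\s*:", line)
        if m:
            counts[m.group(1).strip()] = int(m.group(2))
    return counts


sample = """
   # of logical nets.......................... :        4937 :
       # of routable nets..................... :        4039 :
           # of fully routed nets............. :        4039 :
       # of nets with routing errors.......... :           0 :
"""
counts = parse_route_status(sample)
if counts["nets with routing errors"] == 0:
    print("routing is clean")
```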

As an alternative to Vivado, we can also use RWRoute (the main router in RapidWright) to route the design—showing how the FPGA Interchange Format allows placement and routing to happen in different open source tools on the same design.

#### 5. Routing the placed solution with RWRoute in RapidWright

If we return to the placed solution of our design generated by DREAMPlaceFPGA, we can take another path through RapidWright to have it routed by its main router, RWRoute. To load the FPGA Interchange design files in RWRoute, we need to have the `.netlist` and `.phys` files in the same directory with the same root name. We can accomplish this by simply copying the files over and invoking RWRoute:

```
cp ../gnl_2_4_7_3.0_gnl_3500_03_7_80_80.netlist .
cp results/design/design.phys gnl_2_4_7_3.0_gnl_3500_03_7_80_80.phys
rapidwright RWRoute gnl_2_4_7_3.0_gnl_3500_03_7_80_80.phys rwroute_routed.dcp --nonTimingDriven --outOfContext
```

The last `rapidwright` command will accomplish 3 things:

1. Load the existing FPGA Interchange placed result from DREAMPlaceFPGA into RapidWright
2. Route the design using RWRoute (non-timing driven mode)
3. Export a routed design checkpoint called `rwroute_routed.dcp` once routing is complete

The `--outOfContext` option is added since the example design's top-level ports do not connect to IOBs; it allows Vivado to import the design without inserting buffers.
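The requirement that the `.netlist` and `.phys` files sit in the same directory with the same root name can also be handled by a small helper. This is a sketch using only the Python standard library; `stage_for_rwroute` is a hypothetical name and the paths in the usage comment are the ones from this tutorial.

```python
import shutil
from pathlib import Path


def stage_for_rwroute(netlist, placed_phys, work_dir="."):
    """Copy a netlist and a placed .phys side by side with a shared root name."""
    netlist = Path(netlist)
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    staged_netlist = work_dir / netlist.name
    # The .phys file takes the netlist's root name, e.g. design.phys -> <root>.phys
    staged_phys = staged_netlist.with_suffix(".phys")
    if netlist.resolve() != staged_netlist.resolve():
        shutil.copyfile(netlist, staged_netlist)
    shutil.copyfile(placed_phys, staged_phys)
    return staged_netlist, staged_phys


# Example (run from the DREAMPlaceFPGA directory):
# stage_for_rwroute("../gnl_2_4_7_3.0_gnl_3500_03_7_80_80.netlist",
#                   "results/design/design.phys")
```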

#### 6. Validate the RWRoute routing solution in Vivado

We can open the routed DCP from RWRoute by running the following in our existing Vivado Tcl prompt:

```
open_checkpoint rwroute_routed.dcp
```

The result should look similar to the solution below:



We can similarly validate the routed solution with Vivado by running the Tcl command:

```
report_route_status
```

This should produce a report identical to the one shown above for the Vivado-routed solution:

| Design Route Status           | # nets |
|-------------------------------|--------|
| # of logical nets             | 4937   |
| # of nets not needing routing | 898    |
| # of internally routed nets   | 748    |
| # of implicitly routed ports  | 150    |
| # of routable nets            | 4039   |
| # of fully routed nets        | 4039   |
| # of nets with routing errors | 0      |

## 13.7 Polynomial Generator: Placed and Routed Circuits in Seconds

### 13.7.1 Background

Often, FPGA compilation runtime can be long due to the complexity and nature of problems that are solved in mapping a user's design onto the fabric of an FPGA. However, if we scope a user's design to a specific domain, the compilation process can become significantly simplified.

This tutorial aims to provide a limited scope proof-of-concept of this idea. Consider the mathematical formula called a [polynomial](#). A polynomial equation consists of an expression of variables and coefficients that involves only the operations of addition, subtraction, multiplication and positive integer-powers of variables. Since a polynomial relies on a finite set of mathematical operations, we can devise circuit “generators” for these operators that can be created on-the-fly. These circuit generators (adders, subtractors, multipliers and raise-to-integer-power) have very predictable implementation patterns on FPGA fabric and can thus be placed and internally routed very quickly.

In this application, the polynomials supported have the following attributes:

- Coefficients are integers and the first coefficient is positive
- No division operations are present
- Four mathematical operators: addition, subtraction, multiplication and positive integer powers

RapidWright has three generators to support the `PolynomialGenerator` that implement the four supported mathematical operators. These include a combination adder/subtractor generator for addition and subtraction, and a multiplier generator for multiplication and for raising to an integer power (by chaining multiple multipliers together).
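To illustrate how raising to an integer power can be built from multipliers alone, the sketch below models a simple linear chain of two-input multiplications in software. The linear arrangement is an assumption made for illustration; the actual generator may organize or share its multipliers differently.

```python
def power_by_chaining(x, exponent):
    """Compute x**exponent with a linear chain of two-input multiplies.

    Returns (value, number_of_multiplies). Illustrative model only.
    """
    if exponent < 1:
        raise ValueError("exponent must be a positive integer")
    result = x
    multiplies = 0
    for _ in range(exponent - 1):
        result = result * x  # one more multiplier in the chain
        multiplies += 1
    return result, multiplies


value, count = power_by_chaining(3, 4)
print(value, count)  # 81 3
```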

### 13.7.2 Getting Started

#### 1. Prerequisites

To run this tutorial, you will need:

1. RapidWright 2023.1.4 or later
2. Vivado 2023.1 or later

#### 2. Creating a Simple Polynomial Circuit in Seconds

The interface to run the `PolynomialGenerator` is quite simple:

```
rapidwright PolynomialGenerator
```

Which should produce the usage message:

```
USAGE: <polynomial> <bit width 1 to 18> [--hand-placer]
```

The polynomial syntax requires explicit operators and a fully expanded set of terms (no parentheses or factored forms), for example:

$$(x - 1)(x + 2) = x^2 + x - 2$$

should be rewritten as `x^2+x-2`. Coefficients also require the explicit multiplication operator `*`, for example:

$$3x^2 + x - 2$$

should be rewritten as `3*x^2+x-2`.
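To check that a polynomial string follows these rules before passing it to the generator, a small reference parser can be useful. The sketch below is not RapidWright's parser; it assumes single-letter variable names and a first term without a leading sign, and the helper names are hypothetical. It also evaluates the polynomial so that hardware results can be compared against software values.

```python
import re

# A factor is an integer, or a single-letter variable with an optional ^power.
FACTOR = re.compile(r"(\d+)|([A-Za-z])(?:\^([1-9]\d*))?")


def parse_polynomial(text):
    """Return a list of (coefficient, {variable: power}) terms."""
    text = text.replace(" ", "")
    if not re.fullmatch(r"[^+\-()]+([+-][^+\-()]+)*", text):
        raise ValueError("expected expanded terms joined by + or -, no parentheses")
    terms = []
    for sign, body in re.findall(r"([+-]?)([^+-]+)", text):
        coeff = -1 if sign == "-" else 1
        powers = {}
        for factor in body.split("*"):
            m = FACTOR.fullmatch(factor)
            if m is None:
                raise ValueError(f"bad factor {factor!r}: use explicit '*' and '^'")
            if m.group(1):
                coeff *= int(m.group(1))
            else:
                var = m.group(2)
                powers[var] = powers.get(var, 0) + int(m.group(3) or 1)
        terms.append((coeff, powers))
    return terms


def evaluate(terms, **values):
    """Evaluate parsed terms for the given variable values."""
    total = 0
    for coeff, powers in terms:
        product = coeff
        for var, power in powers.items():
            product *= values[var] ** power
        total += product
    return total


terms = parse_polynomial("3*x^2+x-2")
print(evaluate(terms, x=2))  # 12
```

A string such as `3x^2+x-2` (implicit multiplication) or `(x-1)*(x+2)` raises a `ValueError` in this sketch.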

The mathematical generators will create placed and routed circuits up to 18 bits of width. Although the FPGA fabric can support much larger dimensions, for simplicity we limit this proof-of-concept to 18 bits. Further work could push this limit far beyond 18 bits.

We can generate this polynomial with 16-bit operators with the following command:

```
rapidwright PolynomialGenerator 3*x^2+x-2 16
```

With the following output (`RWRoute` output removed for brevity):

```
=====
==                               Polynomial Generator
=====
Load Device:      1.692s
Init Operators:   0.878s
Build Operator Tree: 0.183s
...
<Removed RWRoute Output>
...
Final Route:     0.474s
Write DCP:        0.455s
-----
[No GC] *Total*:    3.681s
Wrote DCP: polynomial.dcp
```

The resulting DCP, `polynomial.dcp`, should be generated in just a few seconds and can be examined in Vivado by running:

```
vivado polynomial.dcp &
```

(Let's run it in the background so we can return to the terminal later with Vivado still running).

Once loaded, we can zoom to the placed and routed circuit in clock region X3Y3. We can also highlight the individual operators by color by running the following Tcl command in the Vivado Tcl prompt:

```
foreach c [get_cells] { incr i; highlight_objects -leaf_cells $c -color_index $i }
```

The resulting circuit should look similar to this:



We can also run `report_route_status`:

```
report_route_status
```

| Design Route Status           | # nets |
|-------------------------------|--------|
| # of logical nets             | 1775   |
| # of nets not needing routing | 1659   |
| # of internally routed nets   | 1334   |
| # of nets with no loads       | 292    |
| # of implicitly routed ports  | 33     |
| # of routable nets            | 116    |
| # of fully routed nets        | 116    |
| # of nets with routing errors | 0      |

This shows that the design is fully routed without any errors or violations. We can also check timing with `report_timing`:

| Timing summary                 | Value |
|--------------------------------|-------|
| Slack (MET)                    | 0.310ns (required time - arrival time) |
| Source                         | `mult2_51/mult/DSP_OUTPUT_INST/CLK` (rising edge-triggered cell DSP_OUTPUT clocked by clk {rise@0.000ns fall@0.646ns period=1.291ns}) |
| Destination                    | `mult_34/mult/DSP_A_B_DATA_INST/A[9]` (rising edge-triggered cell DSP_A_B_DATA clocked by clk {rise@0.000ns fall@0.646ns period=1.291ns}) |
| Path Group                     | clk |
| Path Type                      | Setup (Max at Slow Process Corner) |
| Requirement                    | 1.291ns (clk rise@1.291ns - clk rise@0.000ns) |
| Data Path Delay                | 0.577ns (logic 0.207ns (35.875%) route 0.370ns (64.125%)) |
| Logic Levels                   | 0 |
| Clock Path Skew                | -0.095ns (DCD - SCD + CPR) |
| Destination Clock Delay (DCD)  | 1.763ns = ( 3.054 - 1.291 ) |
| Source Clock Delay (SCD)       | 2.047ns |
| Clock Pessimism Removal (CPR)  | 0.189ns |
| Clock Uncertainty              | 0.035ns `((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE` |
| Total System Jitter (TSJ)      | 0.071ns |
| Total Input Jitter (TIJ)       | 0.000ns |
| Discrete Jitter (DJ)           | 0.000ns |
| Phase Error (PE)               | 0.000ns |

| Location       | Delay type                                           | Incr(ns) | Path(ns) | Netlist Resource(s) |
|----------------|------------------------------------------------------|----------|----------|---------------------|
|                | (clock clk rise edge)                                | 0.000    | 0.000 r  |                     |
|                |                                                      | 0.000    | 0.000 r  | `clk` (IN)          |
|                | net (fo=107, unset)                                  | 2.047    | 2.047    | `mult2_51/mult/CLK` |
| DSP48E2_X12Y73 | DSP_OUTPUT                                           |          | r        | `mult2_51/mult/DSP_OUTPUT_INST/CLK` |
| DSP48E2_X12Y73 | DSP_OUTPUT (`Prop_DSP_OUTPUT_DSP48E2_CLK_P[41]`)     | 0.207    | 2.254 r  | `mult2_51/mult/DSP_OUTPUT_INST/P[41]` |
|                | net (fo=1, routed)                                   | 0.370    | 2.624    | `mult_34/mult/A[9]` |
| DSP48E2_X12Y72 | DSP_A_B_DATA                                         |          | r        | `mult_34/mult/DSP_A_B_DATA_INST/A[9]` |
|                | (clock clk rise edge)                                | 1.291    | 1.291 r  |                     |
|                |                                                      | 0.000    | 1.291 r  | `clk` (IN)          |
|                | net (fo=107, unset)                                  | 1.763    | 3.054    | `mult_34/mult/CLK`  |
| DSP48E2_X12Y72 | DSP_A_B_DATA                                         |          | r        | `mult_34/mult/DSP_A_B_DATA_INST/CLK` |
|                | clock pessimism                                      | 0.189    | 3.243    |                     |
|                | clock uncertainty                                    | -0.035   | 3.207    |                     |
| DSP48E2_X12Y72 | DSP_A_B_DATA (`Setup_DSP_A_B_DATA_DSP48E2_CLK_A[9]`) | -0.273   | 2.934    | `mult_34/mult/DSP_A_B_DATA_INST` |
|                | required time                                        |          | 2.934    |                     |
|                | arrival time                                         |          | -2.624   |                     |
|                | slack                                                |          | 0.310    |                     |

By default, the clock constraint in the polynomial design is set to 775MHz, or the highest specification of the DSP in speed grade 2 UltraScale+ devices. As can be seen above, this circuit has been placed and routed successfully and has margin to spare to run at this frequency. Of course, as polynomials grow larger, this frequency may be impacted, but it strives to run at the spec of the device.
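The relationship between the numbers in the timing report can be checked with a few lines of arithmetic. The sketch below uses only values from the report above; the "path limit" it computes applies to this one reported path and is not a claim about what the device supports.

```python
def period_ns_to_mhz(period_ns):
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns


period_ns = 1.291      # clock period from the report
required_ns = 2.934    # required time from the report
arrival_ns = 2.624     # arrival time from the report

slack_ns = required_ns - arrival_ns
constraint_mhz = period_ns_to_mhz(period_ns)
path_limit_mhz = period_ns_to_mhz(period_ns - slack_ns)

print(f"slack      = {slack_ns:.3f} ns")
print(f"constraint = {constraint_mhz:.1f} MHz")
print(f"path limit = {path_limit_mhz:.1f} MHz (this path only)")
```

The constraint works out to roughly 775 MHz, matching the default described above.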

We can repeat this process for a more complex polynomial in the next step—keep your Vivado instance open so we can reload the next iteration more quickly.

#### 3. More Complex Polynomial and Inspection with the RapidWright Hand Placer

For the next step, let's consider a more complex polynomial:

$$8y^4 + 43yx^3 + 7x^2 - 14$$

```
rapidwright PolynomialGenerator 8*y^4+43*y*x^3+7*x^2-14 18 --hand-placer
```

This will generate a multi-variable polynomial with inputs x and y. Before the design is routed, the command invokes the RapidWright hand placer so that the user can inspect the placement of the polynomial. After running the command, a window should pop up that looks similar to this:



This is a simplified device model view in RapidWright of the targeted device with the operator module overlaid in green and orange. The user can use a mouse scroll wheel up to zoom in (or **CTRL + -**) and down to zoom out (or **CTRL + =**). Alternatively, there are toolbar buttons to control zoom, or zoom to a selected module (which can be selected on the right window pane list).



After zooming in, try selecting one of the module instances by moving the mouse over one of the shapes, click and hold the left mouse button and move the block around the fabric as shown in the animation below:



Notice that the block changes color based on the area of the fabric over which it is located. Green means a valid placement location, red is invalid, and orange is valid although its footprint overlaps with another module. Also notice that when a module instance is being dragged, it has translucent lines to other module instances. The thickness of these lines is determined by the number of net connections between those two module instances. In this fashion, the modules can be placed or re-placed by hand.
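The color and line-thickness rules described above can be summarized in a short model. This is not the hand placer's implementation; both functions are hypothetical, and apart from `add18_0` the instance names in the example are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def placement_color(is_valid_site, overlaps_other_module):
    """Color of a dragged block under the rules described above."""
    if not is_valid_site:
        return "red"
    return "orange" if overlaps_other_module else "green"


def connection_counts(nets):
    """Count nets shared by each pair of module instances.

    `nets` is an iterable of sets, each holding the instance names one net touches.
    """
    counts = Counter()
    for instances in nets:
        for pair in combinations(sorted(instances), 2):
            counts[pair] += 1
    return counts


example_nets = [{"add18_0", "mult_1"}, {"add18_0", "mult_1"}, {"mult_1", "mult_2"}]
for pair, n in connection_counts(example_nets).items():
    print(pair, n)
```

A thicker line in the hand placer corresponds to a larger count for that pair of instances.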

Try moving the add18\_0 block away from the rest of the module instances onto another valid location (where the block turns green) as shown in the image below:



If you make a mistake, the hand placer also has an Undo/Redo stack so **CTRL + Z** will undo the last movement and **CTRL + SHIFT + Z** will redo a movement. When completed, the window can be closed and the new placement is automatically applied.

Once the window is closed, the `PolynomialGenerator` will automatically resume and generate the `polynomial.dcp`. We can then use `refresh_design` in Vivado's Tcl prompt to re-load the DCP:

```
refresh_design
set i 0; foreach c [get_cells] { incr i; highlight_objects -leaf_cells $c -color_index $i }
```

After we zoom in, we should see a very similar floorplan layout to the one chosen interactively in the hand placer:



Notice in the snapshot above, the `add18_0` module instance is in the top right of the screen, in step with where we placed it in the hand placer.

Again, we can repeat the Vivado Tcl commands `report_route_status` and `report_timing` to validate the result. Although we do not replicate the output here, the design should be valid and meet timing as in step 2.

At this point you are invited to try different polynomials of your own and try making your own placements in the hand placer to explore the several possibilities available to you in this proof-of-concept.

## 13.8 Inserting and Routing a Debug Core As An ECO

### 13.8.1 Context

An Engineering Change Order, or ECO, is a method that allows small modifications to be made to an existing design without reimplementing it from scratch. By preserving as much of the existing implementation as possible and making only incremental changes, ECOs can save compilation runtime.

In this tutorial, we will demonstrate how simple trace-buffer(s) can be rapidly inserted into an existing place-and-routed design and then unintrusively connected to signals of interest to aid debugging.

This trace-buffer consists of a FIFO36 primitive configured as a ring-buffer that continuously samples its 36-bit data input on each clock cycle. Once the clock is stopped, this trace-buffer will contain a 1024 cycle history of the activity on those inputs. Unloading the contents of this trace-buffer is assumed to be realized using the [Readback Capture](#) process, which leverages built-in configuration resources (as opposed to the user-programmable resources) to transparently extract the contents of the user state including the contents of block RAMs that host our FIFO36.
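The ring-buffer behavior can be modeled in software to make the "1024 cycle history" concrete. The sketch below is a behavioral model only, built on Python's `collections.deque`; the class name `TraceBuffer` and its methods are hypothetical and do not describe the FIFO36 primitive's ports or the Readback Capture mechanism.

```python
from collections import deque


class TraceBuffer:
    """Behavioral model: keep the most recent `depth` samples of a `width`-bit bus."""

    def __init__(self, depth=1024, width=36):
        self.mask = (1 << width) - 1
        self.samples = deque(maxlen=depth)  # oldest entries are dropped automatically

    def clock(self, data):
        """Sample the data input on one clock cycle."""
        self.samples.append(data & self.mask)

    def readback(self):
        """Return the stored history, oldest sample first."""
        return list(self.samples)


buf = TraceBuffer()
for cycle in range(2000):
    buf.clock(cycle)

history = buf.readback()
print(len(history), history[0], history[-1])  # 1024 976 1999
```

After 2000 cycles only the last 1024 samples remain, so the oldest retained sample is from cycle 976.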

Even though this particular debug core is simplistic, the techniques described in this tutorial can be extended to more complex cores. An overview of the sections that follow is shown below:



### 13.8.2 Getting Started

#### 1. Prerequisites

To run this tutorial, you will need:

1. Java 11 or later
2. Vivado 2023.1 or later
3. `git`

In this tutorial, RapidWright will be used as a precompiled library downloaded from a Java package distribution site (Maven Central).

#### 2. Setup

Start by cloning and entering the tutorial repository:

```
git clone https://github.com/eddieh-xlnx/eco_insert_route_debug
cd eco_insert_route_debug
```

This repository contains:

- The Gradle Wrapper (`gradlew`) which is a script for launching the [Gradle Build Tool](#).
- Gradle settings (`build.gradle`) for this project, indicating what its dependencies (e.g. RapidWright) are, where to download them from, as well as the location of source files.
- Java sources used in this tutorial (e.g. `src/EcoInsertRouteDebug.java`).
- Example Vivado Design Checkpoints (DCPs) for use in this tutorial.

The example design that we will be using in this tutorial is an open source RISC-V processor core by the name of [Berkeley Out-of-Order Machine](#) that has been placed-and-routed onto a Xilinx UltraScale+ XCVU3P device. The configuration used (MediumBoomConfig) resulted in a design that occupies around 36,000 LUTs.

This design can be examined by opening it up in Vivado:

```
vivado files/boom_medium_routed.dcp
```

Here, the placed and routed result is shown:



Note that only the upper-center part of the device is occupied by the user design, leaving a significant amount of free resources to aid debug.

Next, we can examine our simplified debug core by also opening it with Vivado. This debug core was generated from an RTL description and synthesized out-of-context, placed, and routed as a standard Vivado project. An [out-of-context synthesis run](#) refers to compilation of a sub-module that is intended to be integrated with a top-level design at some future time. In such a flow, for example, any top-level ports will not have I/O buffer cells inserted. Run the following command using the Tcl Console located in the lower portion of the Vivado GUI:

```
open_checkpoint files/fifo36_routed.dcp
```

A new window will appear with this design. Although it may look like the device is empty at first, navigating to “Leaf Cells” in the “Netlist” tab in the left-hand side of the Vivado GUI and selecting the FIFO36E2\_inst element will zoom to the FIFO36 primitive, which is located in the lower-left corner of the device:



Note that this debug core contains a number of unconnected inputs (specifically, its write clock and data inputs which are to be connected later to the design under debug) as well as control inputs (e.g. write enable, sleep, etc.) that are pre-routed to VCC or GND as appropriate. In particular, GND is supplied from LUT resources situated to the right of the block RAM primitive.

Once you are satisfied with the state of both designs, please close both Vivado windows.

In the following sections, we will demonstrate how to use RapidWright to combine both the base design and the debug core into a single design in a way that preserves the placement and routing of both. Additionally, we show how to incrementally connect and re-route the signals of interest without disrupting this placement and routing, as well as how to instantiate and relocate multiple debug cores.

#### 3. Inserting the debug core into a place-and-routed design

RapidWright will be used to merge both the base design and the debug core into a single design without losing any of its existing placement and routing. The Java code to achieve this is available at `src/EcoInsertRouteDebug.java`, the relevant parts of which are reproduced below:

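```
import com.xilinx.rapidwright.design.Design;
import com.xilinx.rapidwright.design.Module;
import com.xilinx.rapidwright.design.ModuleInst;

class EcoInsertRouteDebug {
    public static void main(String[] args) {
        // Load the base design and the debug core
        Design baseDesign = Design.readCheckpoint("files/boom_medium_routed.dcp");
        Design debugDesign = Design.readCheckpoint("files/fifo36_routed.dcp");

        // Convert the debug design into a reusable Module, keeping its static (VCC/GND) routing
        boolean unrouteStaticNets = false;
        Module debugModule = new Module(debugDesign, unrouteStaticNets);

        // Instantiate the module under "debug1" and place it at its original location
        ModuleInst debug1ModuleInst = baseDesign.createModuleInst("debug1", debugModule);
        debug1ModuleInst.placeOnOriginalAnchor();

        // << commented out code omitted >>

        // Write the merged design to disk
        baseDesign.writeCheckpoint("boom_medium_debug.dcp");
    }
}
```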

This code describes a Java class with a single “main” method that serves as its entrypoint when executed.

The two `Design.readCheckpoint()` calls load the two DCPs into RapidWright’s data structures. Next, the design containing the debug core is converted into a RapidWright `Module` object representing a “template” that can be copied and moved into other designs. This `Module` object is then instantiated inside the base design (under a level of hierarchy named `debug1`) and placed at its original location. Lastly, the newly merged design is written to disk.

Compile and run this source code with the following command that invokes the Gradle wrapper, and then open Vivado (in the background) to examine the generated DCP:

```
./gradlew -Dmain=EcoInsertRouteDebug :run  
vivado boom_medium_debug.dcp &
```

Once again, it is not immediately obvious that the debug core has been merged in with the base design; select “debug1 > Leaf Cells > FIFO36E2\_inst” from the left-hand “Netlist” tab to verify its existence and location. The following image shows the result after zooming out six steps:



To verify the state of the design,

```
report_route_status
```

can be run in the Vivado Tcl Console to give the following result:

```
report_route_status
Design Route Status
                                          :  # nets :
----------------------------------------- : ------- :
# of logical nets........................ :   87712 :
# of nets not needing routing............ :   33882 :
# of internally routed nets.............. :   30546 :
# of nets with no loads.................. :    3299 :
# of routable nets....................... :   53830 :
# of fully routed nets................... :   53830 :
# of nets with routing errors............ :      37 :
# of nets with no driver................. :      37 :
----------------------------------------- : ------- :

Nets with Routing Errors: (only the first 10 nets are listed)
debug1/DIN[0]
debug1/DIN[10]
debug1/DIN[11]
debug1/DIN[12]
debug1/DIN[13]
debug1/DIN[14]
debug1/DIN[15]
debug1/DIN[16]
debug1/DIN[17]
debug1/DIN[18]
```

This output reports that 37 nets have no driver: these are the 36 unconnected data inputs of the debug core plus their accompanying clock signal.
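If this check is to be scripted rather than read by eye, the counters of interest can be pulled out of the report text. The following is a minimal sketch only, assuming the report has already been captured into a string (how it is captured is not covered here); the class name `RouteStatusCheck` and its `count()` helper are hypothetical and are not part of RapidWright or Vivado.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RouteStatusCheck {
    // Matches rows such as "# of nets with routing errors..... : 37 :"
    private static final Pattern ROW =
            Pattern.compile("#\\s*of\\s+(.+?)\\.*\\s*:\\s*(\\d+)\\s*:");

    /** Returns the count reported for the given row label, or -1 if the row is absent. */
    static int count(String report, String label) {
        Matcher m = ROW.matcher(report);
        while (m.find()) {
            if (m.group(1).trim().equals(label)) {
                return Integer.parseInt(m.group(2));
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Rows copied from the report shown above
        String report = String.join("\n",
                "# of routable nets..... : 53830 :",
                "# of fully routed nets..... : 53830 :",
                "# of nets with routing errors..... : 37 :",
                "# of nets with no driver..... : 37 :");
        System.out.println("routing errors: " + count(report, "nets with routing errors"));
        System.out.println("no driver: " + count(report, "nets with no driver"));
    }
}
```

A result of 0 for "nets with routing errors" is the condition the later steps of this tutorial aim for.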

Please keep Vivado open as we will be reusing it in the next section.

#### 4. Connecting the debug core

Now that the debug core has been inserted into the base design, the next step is to use RapidWright to connect and route the signals of interest from the design under debug into the debug core for tracing.

Return to `src/EcoInsertRouteDebug.java` and uncomment the commented lines of code to get:

```
class EcoInsertRouteDebug {
    public static void main(String[] args) {
        Design baseDesign = Design.readCheckpoint("files/boom_medium_routed.dcp");
        Design debugDesign = Design.readCheckpoint("files/fifo36_routed.dcp");

        boolean unrouteStaticNets = false;
        Module debugModule = new Module(debugDesign, unrouteStaticNets);

        ModuleInst debug1ModuleInst = baseDesign.createModuleInst("debug1", debugModule);
        debug1ModuleInst.placeOnOriginalAnchor();

        List<ModuleInst> debugInsts = new ArrayList();
        debugInsts.add(debug1ModuleInst);

        String clkName = "clock_uncore_clock_IBUF_BUFG";
        List<String> netNames = new ArrayList();
        for (int i = 0; i < 36; i++) {
            netNames.add("system/tile_prci_domain/tile_reset_domain_tile/core/csr/s1_pc_reg[" + i + "]");
        }
        EDIFNetlist baseNetlist = baseDesign.getNetlist();
        List<String> netPinList = buildNetPinList(baseNetlist, clkName, netNames, debugInsts);
        ECOTools.connectNet(baseDesign, netPinList);

        PartialRouter.routeDesignPartialNonTimingDriven(baseDesign, null);

        baseDesign.writeCheckpoint("boom_medium_debug.dcp");
    }
}
```

These new lines of code are responsible for connecting nets from the base design to the debug core. This includes specifying the base design's global clock net (named `clock_uncore_clock_IBUF_BUFG`) that will form the write clock of our debug core, and collecting a list of all program counter (PC) nets in the RISC-V core (nets `system/tile_prci_domain/tile_reset_domain_tile/core/csr/s1_pc_reg[35:0]`) to be connected to the debug core's data inputs.

The mapping of each net to its debug core input (captured in the `netPinList` local variable) is done in the `buildNetPinList()` method, which is not shown. `ECOTools.connectNet()` (a RapidWright method modelled on Vivado's `connect_net` Tcl API) is then provided with this mapping and connections are made through the design hierarchy as needed.
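Since `buildNetPinList()` is not shown, the following is only a rough sketch of the kind of mapping it might produce, not the tutorial's actual method. It is plain Java with no RapidWright dependency, and several things are assumptions: each entry is taken to be a net name and a pin name separated by a space, instances are passed by name rather than as `ModuleInst` objects, the data pins are assumed to be named `DIN[i]` (following the net names listed by `report_route_status` earlier), and the clock pin name `clk` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NetPinListSketch {
    static final int CORE_WIDTH = 36;     // data inputs per debug core (from the tutorial)
    static final String CLK_PIN = "clk";  // hypothetical clock pin name on the debug core

    /** Pairs each net with a debug core input pin, filling one core before the next. */
    static List<String> buildNetPinList(String clkName, List<String> netNames,
                                        List<String> instNames) {
        List<String> netPinList = new ArrayList<>();
        // Every debug core instance receives the same clock net
        for (String inst : instNames) {
            netPinList.add(clkName + " " + inst + "/" + CLK_PIN);
        }
        for (int i = 0; i < netNames.size(); i++) {
            int core = i / CORE_WIDTH;
            if (core >= instNames.size()) {
                throw new IllegalArgumentException(
                        "Not enough debug cores for " + netNames.size() + " nets");
            }
            String pin = instNames.get(core) + "/DIN[" + (i % CORE_WIDTH) + "]";
            netPinList.add(netNames.get(i) + " " + pin);
        }
        return netPinList;
    }

    public static void main(String[] args) {
        List<String> netNames = new ArrayList<>();
        for (int i = 0; i < 36; i++) {
            netNames.add("system/tile_prci_domain/tile_reset_domain_tile/core/csr/s1_pc_reg["
                    + i + "]");
        }
        List<String> pairs = buildNetPinList("clock_uncore_clock_IBUF_BUFG", netNames,
                Arrays.asList("debug1"));
        System.out.println(pairs.size() + " connections"); // 36 data nets plus 1 clock
        for (String pair : pairs.subList(0, 3)) {
            System.out.println(pair);
        }
    }
}
```

The total of 37 connections for a single core corresponds to the 37 undriven nets reported before the debug core was connected.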

Lastly, `PartialRouter.routeDesignPartialNonTimingDriven()` calls a variant of RapidWright's router (named RWRoute) that incrementally routes only the newly connected pins using only unoccupied resources, without disrupting any part of the existing place and route solution.

Re-compile and execute the modified source code by running from the terminal

```
./gradlew -Dmain=EcoInsertRouteDebug :run
```

again. Once complete, reload the design in Vivado using the following Tcl command:

```
refresh_design
```

which will reload `boom_medium_debug.dcp` from disk to give the following output:



Notice that there now exists routing (green lines) connecting the design under debug in the upper portion of the device with the debug core in the lower left corner. Running `report_route_status` now shows that the design contains no routing errors:

```
Design Route Status
                                          :  # nets :
----------------------------------------- : ------- :
# of logical nets........................ :   87675 :
# of nets not needing routing............ :   33814 :
# of internally routed nets.............. :   30515 :
# of nets with no loads.................. :    3299 :
# of routable nets....................... :   53861 :
# of fully routed nets................... :   53861 :
# of nets with routing errors............ :       0 :
----------------------------------------- : ------- :
```

#### 5. Relocating the debug core

During the original creation of the debug core, the placer decided to locate it in the bottom left corner of the device. Given its distance from the design under debug, routing delays caused by connecting any signals of interest to this debug core may cause an undesirable performance impact. RapidWright's ModuleInst functionality allows the debug core to be relocated to legal positions closer to the design under debug. For the scope of this tutorial, we will visually identify a new location for placing the debug core but it should be noted that automated methods also exist.

Using Vivado (which should still have the last `boom_medium_debug.dcp` open) it can be observed that there are free block RAM resources to the left and right of the design under debug which would represent better locations for any debug core.

Select and zoom into the following site:

```
select_objects [get_sites RAMB36_X7Y34]
```

Note that this site is unoccupied, and that the LUT resources to the right of this RAM resource are also unoccupied. The latter matters because those LUTs are needed to host a number of GND sources.

Edit `src/EcoInsertRouteDebug.java` again, comment out the `placeOnOriginalAnchor()` call and instead place the debug core at this new location, as shown below:

```
ModuleInst debug1ModuleInst = baseDesign.createModuleInst("debug1", debugModule);  
// debug1ModuleInst.placeOnOriginalAnchor(); // Comment out this line  
  
Device device = baseDesign.getDevice(); // Add this and the following line  
debug1ModuleInst.place(device.getSite("RAMB36_X7Y34"));  
  
List<ModuleInst> debugInsts = new ArrayList();
```

Re-compile and execute the modified source code by calling

```
./gradlew -Dmain=EcoInsertRouteDebug :run
```

and execute

```
refresh_design
```

inside Vivado to view this latest result. Ensure that this result is also legal with a call to

```
report_route_status
```

#### 6. Inserting and routing multiple debug cores

A single debug core (in this example, supporting the tracing of up to 36 signals) may not be sufficient. Besides being able to relocate a single ModuleInst, RapidWright also supports the creation of multiple instantiations of the same Module object. Incidentally, the program counter of the BOOM processor is 40 bits wide, thus requiring a second debug core for full visibility.
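The number of debug cores needed follows from a ceiling division of the signal count by the per-core width. A small sketch of that arithmetic is given below; the class and method names are illustrative only.

```java
class DebugCoreCount {
    /** Number of debug cores needed to trace the given number of signals. */
    static int coresNeeded(int signals, int widthPerCore) {
        return (signals + widthPerCore - 1) / widthPerCore; // ceiling division
    }

    public static void main(String[] args) {
        int signals = 40; // width of the BOOM program counter
        int width = 36;   // signals traced by one debug core
        int cores = coresNeeded(signals, width);
        int usedOnLast = signals - (cores - 1) * width;
        // Prints: 2 cores; last core uses 4 of 36 inputs
        System.out.println(cores + " cores; last core uses " + usedOnLast
                + " of " + width + " inputs");
    }
}
```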

Edit `src/EcoInsertRouteDebug.java` to create and place a second instantiation, then connect that up, so that the `main` method looks like the following:

```
public static void main(String[] args) {
    Design baseDesign = Design.readCheckpoint("files/boom_medium_routed.dcp");
    Design debugDesign = Design.readCheckpoint("files/fifo36_routed.dcp");

    boolean unrouteStaticNets = false;
    Module debugModule = new Module(debugDesign, unrouteStaticNets);

    ModuleInst debug1ModuleInst = baseDesign.createModuleInst("debug1", debugModule);
    // debug1ModuleInst.placeOnOriginalAnchor();
    Device device = baseDesign.getDevice();
    debug1ModuleInst.place(device.getSite("RAMB36_X7Y34"));

    // Second instantiation and placement into new site
    // >>>>
    ModuleInst debug2ModuleInst = baseDesign.createModuleInst("debug2", debugModule);
    debug2ModuleInst.place(device.getSite("RAMB36_X4Y41"));
    // <<<<

    List<ModuleInst> debugInsts = new ArrayList();
    debugInsts.add(debug1ModuleInst);
    // Addition of second debug core to list of instances
    // >>>>
    debugInsts.add(debug2ModuleInst);
    // <<<<

    String clkName = "clock_uncore_clock_IBUF_BUFG";
    List<String> netNames = new ArrayList();
    // Increase PC from 36 bits to full 40 bits
    // >>>>
    for (int i = 0; i < /*36*/ 40; i++) {
        // <<<<
        netNames.add("system/tile_prci_domain/tile_reset_domain_tile/core/csr/s1_pc_reg[" + i + "]");
    }
    EDIFNetlist baseNetlist = baseDesign.getNetlist();
    List<String> netPinList = buildNetPinList(baseNetlist, clkName, netNames, debugInsts);
    ECOTools.connectNet(baseDesign, netPinList);

    PartialRouter.routeDesignPartialNonTimingDriven(baseDesign, null);

    baseDesign.writeCheckpoint("boom_medium_debug.dcp");
}
```

Re-compile and execute the modified source code by calling

```
./gradlew -Dmain=EcoInsertRouteDebug :run
```

and execute

```
refresh_design
```

inside Vivado to view this latest result. Again, verify the result by calling

```
report_route_status
```

and close Vivado once you are satisfied it is legal.

#### 7. Inserting and routing debug cores without leaving Vivado

It is possible to adapt these techniques into a standalone application that is run directly from, and integrated with, Vivado. The source code for this standalone application is located at `src/EcoInsertRouteDebugApp.java`. It differs from that in the prior section by accepting two command-line arguments corresponding to the input and output DCPs to be processed, and by accepting signals for tracing as marked inside the Vivado GUI. To build this standalone application, execute the following command:

```
./gradlew -Dmain=EcoInsertRouteDebugApp :fatJar
```

This produces an all-in-one “JAR” (Java Archive) file containing all of its compiled code and dependencies.

Next, create a new Tcl source file named `eco_insert_route_debug.tcl` with the following contents:

```
# Write the design
write_checkpoint -force eco_input.dcp
write_edif -force eco_input.edf
# Execute the EcoInsertRouteDebugApp.jar and display its output upon exit
puts [exec java -jar EcoInsertRouteDebugApp.jar eco_input.dcp eco_output.dcp]
# Close the old checkpoint
close_design
# Re-open the modified checkpoint
open_checkpoint eco_output.dcp
# Check design is fully routed
report_route_status
# Find all signals marked for debug and display them in a new GUI tab
show_objects -name find_1 [get_nets -hierarchical -top_net_of_hierarchical_group -filter { MARK_DEBUG == "TRUE" } ]
```

Lastly, launch Vivado with our original base design once again:

```
vivado files/boom_medium_routed.dcp
```

We will use the “Mark Debug” feature within the Vivado GUI to select the signals to be connected to the debug core. From the “Netlist” tab in the left hand side, open up the top-level “Nets” folder and right click on the `tl_slave_0_a_bits_data_OBUF (64)` entry and select “Mark Debug” as shown below:



From the Tcl Console, execute the previously created script in the following manner:

```
source eco_insert_route_debug.tcl
```

As the comments in the Tcl script indicate, this causes the base design (with signals marked for debug) to be written to disk, operated on by the `EcoInsertRouteDebugApp` and then re-opened in Vivado, all without leaving the Vivado interface. Verify that all traced nets are indeed fully routed.

#### 8. Closing Comments

In this tutorial, we've demonstrated how RapidWright can be used as part of a custom application that is capable of inserting, relocating, connecting and routing one or more debug cores (trace buffers) without disrupting the existing placement and routing of the base design.

More specifically, we've demonstrated how RapidWright's `Module` capabilities can be used to insert and relocate designs within other designs, how `ECOTools` can be used to connect nets and pins from such merged designs, and how `PartialRouter` can be used to incrementally route just the unrouted pins.

Beyond those, RapidWright contains many more capabilities. For example, `ECOTools` also supports disconnecting pins from nets, removing cells, creating new nets and cells, etc. [Pre-implemented Modules](#) is a separate tutorial that discusses `Module`-s in more detail; it describes both a manual `HandPlacer` (with GUI) and an automated, simulated-annealing based `BlockPlacer`, either of which could be adapted to ease the process of finding module placements.

## 13.9 Create Placed and Routed DCP to Cross SLR

### What You'll Need to Get Started:

- RapidWright 2023.1 or later
- Vivado 2018.2 or later

One of the example programs provided with RapidWright solves a challenging problem on UltraScale+ devices (this approach is not valid for 7 Series or UltraScale parts). Crossing super logic region (SLR) boundaries at high speed can prove quite difficult in conventional Vivado flows. The hardware provides dedicated TX/RX flip flops in Laguna sites to enable the creation of paths with very short delay, but these paths experience two significant problems:

1. The dedicated super long lines (SLLs) that connect TX and RX Laguna flip flop pairs are often sensitive to hold time violations due to the higher multi-die variability.
2. Paths crossing the SLR boundary are taxed with an additional delay penalty called “Inter-SLR Compensation” (ISC). This penalty increases the calculated delay and reduces its potential for high speed.



Fig. 1: Example Vivado tooltip window describing the Inter-SLR Compensation delay penalty

In RapidWright, we have created a parameterized, stand-alone application that can automatically generate a placed and routed DCP from scratch. The DCP implements a circuit that eliminates the first and minimizes the second of the two challenges mentioned above. First, the application creates a netlist with pairs of flops that are connected, placed and routed across SLR crossings using the dedicated Laguna TX/RX flip flop sites. Next, it custom routes the clock (the circuit has its own BUFGCE) such that it can individually tune the leaf clock buffers (LCBs) for each direction on each side of the SLR. By tuning the LCBs, the hold time problem described in the first challenge is eliminated. To minimize the ISC penalty, a clock root is generated for each clock region (CR) that contains an SLR crossing.

### 13.9.1 Steps to Run

1. Ensure you have RapidWright correctly set up and/or installed. See the [Getting Started](#) page for details.
2. Run the command below to print the options available for parameterizing the SLR crossing output:

```
rapidwright SLRCrosserGenerator -h
```

Example output below:

```
=====
==                               SLR Crossing DCP Generator                   ==
=====
This RapidWright program creates a placed and routed DCP that can be
imported into UltraScale+ designs to aid in high speed SLR crossings. See
RapidWright documentation for more information.

Option                                Description
-----
-?, -h                                 Print Help
-a [String: Clk input net name]          (default: clk_in)
-b [String: Clock BUFGCE site name]     (default: BUFGCE_X0Y218)
-c [String: Clk net name]                (default: clk)
-d [String: Design Name]                (default: slr_crosser)
-i [String: Input bus name prefix]       (default: input)
-l [String: Comma separated list of
    Laguna sites for each SLR crossing]
-n [String: North bus name suffix]        (default: _north)
-o [String: Output DCP File Name]        (default: slr_crosser.dcp)
-p [String: UltraScale+ Part Name]       (default: xcvu9p-flgc2104-2-i)
-q [String: Output bus name prefix]       (default: output)
-r [String: INT clk Laguna RX flops]    (default: GCLK_B_0_1)
-s [String: South bus name suffix]        (default: _south)
-t [String: INT clk Laguna TX flops]    (default: GCLK_B_0_0)
-u [String: Clk output net name]          (default: clk_out)
-v [Boolean: Print verbose output]        (default: true)
-w [Integer: SLR crossing bus width]      (default: 512)
-x [Double: Clk period constraint (ns)]  (default: 1.538)
-y [String: BUFGCE cell instance name]    (default: BUFGCE_inst)
-z [Boolean: Use common centroid]         (default: false)
```

3. If no options are provided, a default scenario of a single bi-directional crossing of 512 bits is generated at the LAGUNA\_X2Y120 site on a VU9P part. The DCP is generated in the current working directory with the name `slr_crosser.dcp` unless the `-o` option is specified.

```
rapidwright SLRCrosserGenerator
```

```
=====
==                               SLRCrosserGenerator                   ==
=====
Init:      4.787s
Create Netlist: 0.123s
Place SLR Crossings: 0.121s
Custom Clock Route: 3.756s
Route VCC/GND: 0.079s
Write EDIF: 0.148s
Writing XDEF Header: 0.090s
Writing XDEF Placement: 0.213s
Writing XDEF Routing: 0.404s
Writing XDEF Finalizing: 0.079s
Writing XDC: 0.039s
-----
[No GC] *Total*: 9.839s
Wrote final DCP: /home/user/slr_crosser.dcp
```

4. Open the DCP using Vivado to view the design. It should look similar to the annotated screenshot below:



Fig. 2: Vivado screenshot with bubble annotations of a single, bi-directional 512-bit SLR crossing circuit.

5. You can also unzip the DCP (treating it like an ordinary ZIP file) and inside you'll find Verilog and VHDL stubs that can be imported into RTL designs for black box inclusion. Example output below:

```
$ unzip slr_crosser.dcp
Archive: slr_crosser.dcp
  inflating: slr_crosser.edf
  inflating: slr_crosser.xdef
  inflating: slr_crosser_late.xdc
  inflating: slr_crosser_stub.v
  inflating: slr_crosser_stub.vhdl
  inflating: dcp.xml
$ cat slr_crosser_stub.v
// This file was generated by RapidWright 2018.2.0.

// This empty module with port declaration file causes synthesis tools to infer a
// black box for IP.
// Please paste the declaration into a Verilog source file or add the file as an
// additional source.
module slr_crosser(clk_in, clk_out, input0_north, input0_south, output0_north,
                   output0_south);
  input clk_in;
  output clk_out;
  input [511:0]input0_north;
input [511:0]input0_south;
output [511:0]output0_north;
output [511:0]output0_south;
endmodule
$
```

Optionally, you can open the DCP in Vivado and write out the netlist as EDIF, Verilog or VHDL to be packaged as an IP. The DCP can then be dropped into the IP cache later.

6. As one additional example, the generator is capable of using every SLL in the device. To generate such a DCP for a VU9P device, run:

```
rapidwright SLRCrosserGenerator -w 720 -l LAGUNA_X0Y120,LAGUNA_X2Y120,LAGUNA_X4Y120,LAGUNA_X6Y120,LAGUNA_X8Y120,LAGUNA_X10Y120,LAGUNA_X12Y120,LAGUNA_X14Y120,LAGUNA_X16Y120,LAGUNA_X18Y120,LAGUNA_X20Y120,LAGUNA_X22Y120,LAGUNA_X0Y360,LAGUNA_X2Y360,LAGUNA_X4Y360,LAGUNA_X6Y360,LAGUNA_X8Y360,LAGUNA_X10Y360,LAGUNA_X12Y360,LAGUNA_X14Y360,LAGUNA_X16Y360,LAGUNA_X18Y360,LAGUNA_X20Y360,LAGUNA_X22Y360
```

The resultant DCP should look similar to the following in Vivado:



Fig. 3: Vivado screenshot of all SLLs being used at potentially 760 MHz for a speed grade 2 device.
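The long `-l` argument in the command above follows a regular pattern (even X columns from 0 to 22, at rows Y120 and Y360), so it may be easier to generate than to type. The sketch below rebuilds that same list; the class name `LagunaSiteList` is illustrative, and the column range and rows are simply read off the VU9P command above rather than queried from the device.

```java
import java.util.ArrayList;
import java.util.List;

class LagunaSiteList {
    /** Builds the comma-separated -l value from even X columns at each given Y row. */
    static String build(int maxX, int[] rows) {
        List<String> sites = new ArrayList<>();
        for (int y : rows) {
            for (int x = 0; x <= maxX; x += 2) {
                sites.add("LAGUNA_X" + x + "Y" + y);
            }
        }
        return String.join(",", sites);
    }

    public static void main(String[] args) {
        // X range and Y rows taken from the VU9P command above
        String sites = build(22, new int[] {120, 360});
        System.out.println("rapidwright SLRCrosserGenerator -w 720 -l " + sites);
    }
}
```

For other parts, the valid Laguna site names differ and would need to be looked up for that device.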

## 13.10 Build an IP Integrator Design with Pre-Implemented Blocks

---

**Note:** This tutorial has been retired and efforts are being made to replace it or refresh it with a more stable example.

---

## 13.11 RapidWright PipelineGenerator Example

This example generates a placed and routed circuit of flops that form a pipelined bus (think 2-D array of flops) with parameterizable spacing between pipeline stages. The generated .dcp file can be loaded into Vivado for viewing.

### 13.11.1 Input Parameters

- Width (bits)
- Depth (pipeline stages)
- Distance (tiles)
- Direction (horizontal or vertical)

### 13.11.2 Background

The selected device is a Xilinx VU3P (UltraScale+ device).

Figure 1-3 on pg. 8 of user guide UG574 shows the FFs contained within an UltraScale+ CLB (similar to UltraScale devices). For a fuller description, please see: [https://www.xilinx.com/support/documentation/user\_guides/ug574-ultrascale-clb.pdf](https://www.xilinx.com/support/documentation/user_guides/ug574-ultrascale-clb.pdf).



**Figure 1-3: LUTs and Storage Elements in One Slice**

In this example, the PipelineGenerator places flops (instantiated as FDRE) by specifying slice locations and the individual FF BEL sites that are within each slice. These are grouped in pairs and referenced by a letter. Please note that each letter contains a pair of FFs.

### 13.11.3 Steps to Run

1. Ensure you are familiar with the RapidWright directories and have an IDE project created for RapidWright. Using an IDE such as IntelliJ or Eclipse is highly recommended for exercises in this tutorial for easy compilation and for help with the RapidWright libraries and functions. While we don't provide any IDE "how to" steps within this tutorial, if you do have questions please feel free to ask.
2. If you need to recompile the code, run: `./gradlew compileJava` from within your `<workspace_dir>/RapidWright` subdirectory. Alternatively, build this example using your IDE.
3. After compiling, run: `rapidwright PipelineGenerator`. This will generate an output called `"pipeline.dcp"`, containing the placed and routed circuit design.
4. To see a list of available input options, specify `"-h"` as an argument. Note: the horizontal direction is assigned within the source code, but it can be changed to vertical within `main()`. The source code for this example is located in: `<workspace_dir>/RapidWright/src/com/xilinx/rapidwright/examples/PipelineGenerator.java`.

```
=====
==                               Pipeline Generator                         ==
=====
This RapidWright program creates an example pipelined bus as a placed and routed DCP.
See the RapidWright documentation for more information.

Option                                Description
-----
-?, -h                                Print Help
-c [String: Clk net name]                (default: clk)
-d [String: Design Name]                 (default: pipeline)
-l [Integer: distance]                  (default: 10)
-m [Integer: depth]                     (default: 3)
-n [Integer: width]                     (default: 10)
-o [String: Output DCP File Name]       (default: pipeline.dcp)
-p [String: Ultrascale/UltraScale+
Part Name]                            (default: xcvu3p-ffvc1517-2-e)
-s [String: Lower left slice to be
used for pipeline]                     (default: SLICE_X42Y70)
-v [Boolean: Print verbose output]       (default: true)
-x [Double: Clk period constraint (ns)] (default: 1.291)
```

### 13.11.4 Example Design

- Width = 10 bits
- Depth = 3 pipeline stages
- Distance = 10 tiles
- Direction = horizontal





The above screenshot shows the device view, zoomed in on the placed and routed circuit. This circuit consists of three pairs of slices, using the horizontal direction and a spacing distance of 10 tiles.

Although each CLB FF letter site contains a pair of flops, as described above, this example only makes use of the first flop in each pair, as a demo. This means that the lower slice for each of the pairs uses eight flops, and the upper slice uses two flops to satisfy the <width> request of ten bits. This was done intentionally towards setting up an example that could be easily modified to use both of the flops in the pair. The screenshot below shows a zoomed in view of the lower slice.



Please refer to the example code for more implementation details. The Java source code for this example is located in: `<workspace_dir>/RapidWright/src/com/xilinx/rapidwright/examples/PipelineGenerator.java`.

This example was designed to illustrate basic functions. Please feel free to modify it to experiment with building other implementations.
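The flop assignment described above (one flop per lettered pair, eight bits in the lower slice and the remainder in the upper slice) can be sketched as a mapping from bit index to slice and BEL. This is an illustration of the arithmetic only, not the code in `PipelineGenerator.java`. It assumes the first flop of each pair is named `AFF` through `HFF` and that the upper slice is the next one up in Y; both are assumptions, as are the class and method names.

```java
class FlopAssignment {
    static final int FLOP_PAIRS_PER_SLICE = 8; // lettered pairs A through H

    /** Y offset (relative to the lower slice) of the slice that holds the given bit. */
    static int sliceOffset(int bit) {
        return bit / FLOP_PAIRS_PER_SLICE;
    }

    /** Assumed name of the first flop in the lettered pair used for the given bit. */
    static String belName(int bit) {
        char letter = (char) ('A' + bit % FLOP_PAIRS_PER_SLICE);
        return letter + "FF";
    }

    public static void main(String[] args) {
        int width = 10; // default width of the example design
        for (int bit = 0; bit < width; bit++) {
            System.out.println("bit " + bit + " -> slice +" + sliceOffset(bit)
                    + ", BEL " + belName(bit));
        }
    }
}
```

With a width of 10, bits 0 to 7 land in the lower slice and bits 8 and 9 in the upper slice, matching the eight and two flops described above.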

### 13.11.5 Additional Exercises

1. Try modifying the PipelineGenerator to use all 16 flip flops in an UltraScale slice. This will lead to a more compact usage of CLBs at the potential expense of greater routing congestion.



Hint: When using all sixteen flops or designs with higher bit widths, the minimum distance should be at least 10 tiles for routing.

2. This example of a PipelineGenerator is ideal for creating a long haul pipelined bus connection at high speed. This would be useful for connecting two modules that are physically distant on a device but need to communicate at high speed. Currently, the implementation can only pipeline in a single plane (horizontal or vertical). Modify the example such that it can pipeline both vertically and horizontally.

## 13.12 RapidWright PipelineGeneratorWithRouting Example

As part of the introduction of our new RapidWright timing library, we have extended the Pipeline Generator tutorial to demonstrate how the library might be called when implementing a timing-driven router.

### 13.12.1 Background

Please see our FPT'19 paper, "An Open-source Lightweight Timing Model for RapidWright" (Presentation) for background details on our RapidWright Timing Model. Our model abstracts groups of low level wires and MUXes into an abstraction that we call timing groups (TGs).

Please note that this tutorial does not cover some deeper background details, and it assumes a basic knowledge of routing concepts and the routing resources available within Xilinx devices. For more background, the tutorial library contains a separate tutorial dedicated to routing, including implementation examples that are depth-based (non-timing driven). That tutorial includes some device model details specific to RapidWright. RapidWright also provides a more substantial depth-based router implementation within its source library than the one in the dedicated tutorial.

In this tutorial we further model our router cost function to compare the net delays estimated for different paths as a component of the cost. However, our cost function, algorithm, and overall implementation are for illustration purposes and have not been optimized for runtime performance or modular design. Our goal here is merely to present an example of using our timing library for exploring potential routing resources.

This tutorial and router method are provided in Java, as we are leveraging the circuit generator from the earlier Pipeline Generator tutorial. In the Pipeline Generator tutorial, we describe a circuit generator that instantiates and connects a 2-D array of flip flops, which represent an n-bit wide bus pipelined over multiple clock cycles. For this tutorial, we consider and generate a one-bit wide bus over two cycles (essentially we connect only a pair of flops). The more interesting aspect is that our pipeline generator has parameters for placement allowing the user to select the distance between flops and a relative direction.

### 13.12.2 Steps to Run

1. Ensure you are familiar with the RapidWright directories and have an IDE project created for RapidWright. Using an IDE such as IntelliJ or Eclipse is highly recommended for exercises in this tutorial for easy compilation and for help with the RapidWright libraries and functions.
2. If you need to recompile the code, run:

```
./gradlew compileJava
```

from within your RapidWright directory. Alternatively, build this example using your IDE.

3. After compiling, run:

```
rapidwright PipelineGeneratorWithRouting
```

This will generate an output called `pipeline.dcp`, containing the placed and routed circuit design.

4. To see a list of available input options specify “-h” as an argument. Note: the source code for this example is located in: [RapidWright/src/com/xilinx/rapidwright/examples/PipelineGeneratorWithRouting.java](#).

### 13.12.3 Example Design

- Width = 1 bit
- Depth = 2 pipeline stages
- DistanceY = 16 tiles
- DistanceX = 4 tiles

The logical view of the circuit as a Vivado schematic is shown below (pair of flops).



The selected device is a Xilinx VU3P (UltraScale+ device). The zoomed out device view in Vivado is shown below, with an example placed and routed circuit in the lower left corner.



*distanceY = 16, distanceX = 4  
(in terms of absolute tile coordinates for rows, columns)*

As discussed, we specify parameters for the distance between flops, and we do this by giving a distanceX and a distanceY change in coordinates. The example shown here has a diagonal direction. The Vivado screenshot below shows a routed implementation with multiple wires, shown in white, connecting the first flop (lower left) to the second flop (upper right).



Please refer to the [example code](#) for more implementation details.

### 13.12.4 Router Method

In our example router implementation, we maintain a queue of candidate next hop locations as well as a set of candidate solutions. A main while loop iterates, extending each candidate closer to the target. We terminate the loop when either (a) the candidate next hops have been exhausted as not feasible or (b) the watchdog timer expires. The watchdog timer is initialized to 500,000 and counts down.

#### Within the while loop:

1. We get the distance in the horizontal and vertical dimensions to our target location.
2. We record that we have visited the current TG location.
3. If we have reached our target location, then we add the solution to our list of candidate solutions.
4. Else, we call the cost function.

#### Within the cost function:

1. We get an array of all potential next hop TGs from our current TG location.
2. We consider distance and direction. We choose the direction in which we have the furthest to go. We choose the distance by selecting from a set of bands: SAME, NEAR, MID, or FAR. These are defined within the code.
3. We use the filter function with distance and direction band to filter our array of potential next hops to the relevant distance and direction.
4. We compute an artificial cost that has a component for the distance remaining to the target and also the delay so far.

5. For each potential next hop TG:
   - a. We store the delays within a table.
   - b. We keep track of the history of moving from the current TG to the next hop TG.
   - c. We add the next hop TG to the queue.

#### After the while loop terminates:

1. We select the solution with lowest delay cost.
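The overall shape of the loop described above can be sketched on a toy model. The code below is not the router in `PipelineGeneratorWithRouting.java` and does not use the RapidWright timing library: it replaces timing groups with positions on a plain 2-D grid, uses made-up hop lengths in place of the NEAR/MID/FAR bands, folds the cost function into the loop body, and uses an invented delay model. All names (`ToyTimingRouter`, `Hop`, `hopDelay`) are hypothetical. What it keeps is the structure: a queue of candidates, a watchdog of 500,000, a visited record, a direction chosen by the furthest remaining distance, an artificial cost combining remaining distance and delay so far, and a final selection of the lowest-delay solution.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

class ToyTimingRouter {
    /** One partial route: a position, the delay so far, and the hop it came from. */
    static final class Hop {
        final int x, y, delay;
        final Hop prev;
        Hop(int x, int y, int delay, Hop prev) {
            this.x = x; this.y = y; this.delay = delay; this.prev = prev;
        }
    }

    // Hypothetical hop lengths in tiles, standing in for the NEAR, MID and FAR bands
    static final int[] HOP_LENGTHS = {1, 4, 12};

    /** Hypothetical delay model: a fixed cost per hop plus a per-tile cost. */
    static int hopDelay(int length) {
        return 40 + 10 * length;
    }

    /** Artificial cost: distance remaining to the target plus the delay so far. */
    static int cost(Hop h, int dstX, int dstY) {
        return Math.abs(dstX - h.x) + Math.abs(dstY - h.y) + h.delay;
    }

    static Hop route(int srcX, int srcY, int dstX, int dstY) {
        PriorityQueue<Hop> queue =
                new PriorityQueue<>(Comparator.comparingInt((Hop h) -> cost(h, dstX, dstY)));
        Set<Long> visited = new HashSet<>();
        List<Hop> solutions = new ArrayList<>();
        int watchdog = 500_000;
        queue.add(new Hop(srcX, srcY, 0, null));

        while (!queue.isEmpty() && watchdog-- > 0) {
            Hop cur = queue.poll();
            int dx = dstX - cur.x;
            int dy = dstY - cur.y;
            if (dx == 0 && dy == 0) {          // reached the target
                solutions.add(cur);
                continue;
            }
            long key = ((long) cur.x << 32) | (cur.y & 0xffffffffL);
            if (!visited.add(key)) {
                continue;                       // this location was already expanded
            }
            // Move along the axis with the furthest still to go
            boolean horizontal = Math.abs(dx) >= Math.abs(dy);
            int remaining = horizontal ? Math.abs(dx) : Math.abs(dy);
            int sign = Integer.signum(horizontal ? dx : dy);
            for (int len : HOP_LENGTHS) {
                if (len > remaining) {
                    continue;                   // filter out hops that would overshoot
                }
                int nx = horizontal ? cur.x + sign * len : cur.x;
                int ny = horizontal ? cur.y : cur.y + sign * len;
                queue.add(new Hop(nx, ny, cur.delay + hopDelay(len), cur));
            }
        }

        Hop best = null;                        // select the solution with the lowest delay
        for (Hop s : solutions) {
            if (best == null || s.delay < best.delay) {
                best = s;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Hop best = route(0, 0, 4, 16);          // distanceX = 4, distanceY = 16
        int hops = 0;
        for (Hop h = best; h.prev != null; h = h.prev) {
            hops++;
        }
        System.out.println("delay = " + best.delay + ", hops = " + hops);
    }
}
```

Under this toy delay model, the example prints `delay = 320, hops = 3` (a 12-tile hop up, a 4-tile hop across, then a 4-tile hop up). The units and numbers are arbitrary and say nothing about real device delays.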

### 13.12.5 Interpreting Program Output

We print the TGs for the chosen solution. This printout includes the nodes included by each TG and the wires included by each node. Please feel free to skip over the low level details.

Last, after routing, if “verbose2” is set to true, then we open the Design with the TimingModel to find the critical path within the completed design. This computes a more accurate timing estimate according to our model. It will print a list of TGs that have delay. It will show net delay and logic delay, and sum them for the total datapath delay.

Mainly, we are showing how to get from a current TG to the next hop and how to see TG delays. Our printed messages show some example TGs as well as the low-level individual wires that the TGs abstract away, so that the user does not have to consider them.

This example was designed to illustrate basic functions. Please feel free to modify it to experiment with building other implementations.

### 13.12.6 Additional Exercises

1. Try other values for distanceX and distanceY.
2. Try modifying the PipelineGeneratorWithRouting to use the RapidWright built-in depth-based (non-timing driven) router to compare the results.

## 13.13 Pre-implemented Modules - Part I

*"If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?" – Seymour Cray*

This tutorial has two parts. In this first part, we illustrate how you can create pre-implemented modules tailored to fit your architecture. In the second part of this tutorial, we show how the modules can be used and replicated as part of a design. At a high level, we will complete three tasks in Part I:

1. *Design Utilization Analysis*: Examine a synthesized PicoBlaze module and identify its footprint.
2. *Architecture Pattern Analysis*: Identify the best instance patterns for our PicoBlaze module.
3. *PBlock Selections*: Create a set of pblocks for implementing our pre-implemented PicoBlaze.

### 13.13.1 Background

Often, when accelerating an application on an FPGA, a specific computation or routine is parallelized and reused many times. However, the conventional FPGA compilation flow may not always take full advantage of this optimization opportunity. One of RapidWright's key features is the ability to preserve, replicate and reuse placed and routed circuitry in the form of a pre-implemented module.

For the sake of simplicity and ease of implementation for this tutorial, consider the PicoBlaze. The PicoBlaze is an 8-bit programmable micro-controller provided by Xilinx (see the block diagram below, Figure 1-1 from UG129, p.8):




The PicoBlaze is a small module that consumes 1 Block RAM and ~20 CLBs. In this tutorial we will examine how to create a reusable, pre-implemented PicoBlaze to construct a programmable processing overlay on a Xilinx VU3P device.

### 13.13.2 Getting Started

For convenience, we have provided a synthesized, out-of-context PicoBlaze design as a starting point DCP. This was built using the reference RTL for PicoBlaze available from [Xilinx.com](#). To get started, let's do the following:

1. Open a terminal and create a new directory called picoblaze.

```
mkdir picoblaze
cd picoblaze
```

2. Download picoblaze\_synth.dcp to your new picoblaze directory and open it in Vivado.

```
vivado picoblaze_synth.dcp
```

#### 1. Design Utilization Analysis

Once the design has been loaded in Vivado, let's get the utilization report by choosing Reports->Report Utilization... then click OK at the window prompt. A report window similar to the one below will open:



From this report we can analyze the synthesized resources used by the PicoBlaze. As expected, 1 block RAM is consumed, with 115 LUTs, 117 flip flops and 7 CARRY8 blocks. In the UltraScale architecture, each SLICE/CLB contains 8 LUTs, 16 flip flops and 1 CARRY8 block. Therefore the minimum number of SLICES needed for the PicoBlaze is:

$$\begin{aligned}
 &= \text{ceiling}(\max(115/8, 117/16, 7/1)) \\
 &= \text{ceiling}(\max(14.375, 7.3125, 7)) \\
 &= \text{ceiling}(14.375) \\
 &= 15
 \end{aligned}$$

So, in the absolute best case, we could squeeze a PicoBlaze into 15 UltraScale SLICES. To attempt this, we would create a pblock (area constraint) that would force the placer to only use 15 SLICES and 1 Block RAM tile. A block RAM tile is 5 SLICES tall in the UltraScale architecture, so we would need 3 nearby columns of SLICES in order to make a compact rectangle. If we tried to use 2 SLICE columns instead of three, our SLICE footprint height would be 8 ( $\text{ceiling}(15/2)$ ) which would not stride well with the 5 SLICE height of the block RAM.
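The footprint arithmetic above can be reproduced with a few lines of plain Python. This is only an illustrative sketch: the function names are made up for this example, and the resource counts and per-SLICE capacities are the ones quoted in the text.

```python
import math

# Per-SLICE capacity in UltraScale, as stated in the text above.
LUTS_PER_SLICE = 8
FLOPS_PER_SLICE = 16
CARRY8_PER_SLICE = 1

def min_slices(luts, flops, carry8):
    """Lower bound on SLICEs, set by the scarcest resource."""
    return math.ceil(max(luts / LUTS_PER_SLICE,
                         flops / FLOPS_PER_SLICE,
                         carry8 / CARRY8_PER_SLICE))

def footprint_height(slices, columns):
    """SLICE rows needed when the SLICEs are spread over `columns` columns."""
    return math.ceil(slices / columns)

n = min_slices(luts=115, flops=117, carry8=7)
print("minimum SLICEs:", n)
for columns in (2, 3):
    print(columns, "columns ->", footprint_height(n, columns), "rows")
```

Only the three-column shape yields a height that matches the 5 SLICE height of a block RAM tile.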

To create the pblock, run the following Tcl constraints:

```
create_pblock pblock_1
resize_pblock pblock_1 -add {SLICE_X27Y60:SLICE_X29Y64 RAMB18_X2Y24:RAMB18_X2Y25 RAMB36_X2Y12:RAMB36_X2Y12}
add_cells_to_pblock pblock_1 -top
set_property CONTAIN_ROUTING 1 [get_pblocks pblock_1]
```

Note that we also use the CONTAIN\_ROUTING property on the pblock of the PicoBlaze. This will ensure that the implementation is more amenable to relocation (can be more densely packed) later. Without this attribute, the routing will not be very reusable as it will be allowed to spread out far around the rectangle of the pblock. Once the pblock is created, it should look like this:



We will also need to add a timing constraint so that implementation delivers the best performance possible. To do so, we should choose a target frequency just beyond what the tools can achieve with timing closure. To begin, we'll add a 400MHz clock constraint and also provide a skew estimation target for the clock buffer (`HD.CLK_SRC`) to get a more accurate timing estimation:

```
create_clock -period 2.5 -name clk -waveform {0.000 1.25} [get_ports clk]
set_property HD.CLK_SRC BUFGCTRL_X0Y2 [get_ports clk]
```

By running place\_design we can gauge the feasibility of using this footprint size for implementation (spoiler... this will not fit). The placer will report errors similar to the following:

```
ERROR: [Place 30-488] Failed to commit 4 instances:
processor/reset_lut/LUT6 with block Id: 119 (LUT) at SLICE_X85Y150
processor/reset_lut/LUT6 with block Id: 119 (LUT) at SLICE_X85Y150
processor/reset_lut/LUT6 with block Id: 119 (LUT) at SLICE_X85Y150
processor/stack_loop[0].lsb_stack.stack_muxcy_CARRY4_CARRY8 with block Id: 134 (CARRY) at SLICE_X85Y150
```

**Warning:** Vivado 2022.2 may crash with a segmentation fault when attempting to place this pblock. Users of that version are advised to skip this step.

It turns out the logic is packed too tightly into the area. Another way to gauge logic density is to check the pblock statistics. Select the pblock in Vivado by running the Tcl command:

```
select_objects [get_pblocks pblock_1]
```

Then choose the Statistics tab of Pblock Properties, which should show something similar to the following:



A quick analysis shows that we are attempting to use ~96% of the LUTs in that area which is unlikely to place correctly. Again, since BRAM tiles are stacked vertically, we must grow horizontally to ensure that we can step and repeat without blocking access to other BRAMs with used SLICEs. Close and re-open the checkpoint then stretch/grow the pblock with the following Tcl commands:

```
close_design
open_checkpoint picoblaze_synth.dcp
create_pblock pblock_1
resize_pblock pblock_1 -add {SLICE_X26Y60:SLICE_X29Y64 RAMB18_X2Y24:RAMB18_X2Y25 RAMB36_X2Y12:RAMB36_X2Y12}
add_cells_to_pblock pblock_1 -top
set_property CONTAIN_ROUTING 1 [get_pblocks pblock_1]
create_clock -period 2.5 -name clk -waveform {0.000 1.25} [get_ports clk]
set_property HD.CLK_SRC BUFGCTRL_X0Y2 [get_ports clk]
```

To validate our new footprint, we can run the Tcl command:

```
place_design
```

again to see if we can get things to fit. This time, Vivado should successfully place the design.
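To see why the wider pblock is easier to place, it helps to compare LUT utilization for the two footprints. The sketch below is plain Python with a hypothetical helper name; it uses only the LUT count from the utilization report and the 8 LUTs per SLICE stated earlier.

```python
LUTS_USED = 115        # from the utilization report
LUTS_PER_SLICE = 8     # UltraScale SLICE capacity

def lut_utilization(columns, rows):
    """Fraction of LUTs used in a columns x rows rectangle of SLICEs."""
    return LUTS_USED / (columns * rows * LUTS_PER_SLICE)

# First attempt: 3 columns x 5 rows. Second attempt: 4 columns x 5 rows.
for columns in (3, 4):
    share = lut_utilization(columns, 5)
    print(f"{columns} x 5 SLICEs: {share:.1%} LUT utilization")
```

Adding one SLICE column drops the LUT utilization from roughly 96% to roughly 72%, which gives the placer considerably more freedom.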

#### 2. Architecture Pattern Analysis

With a feasible pblock shape, we can now examine the architectural patterns that will lead to the highest number of compatible places this instance of a PicoBlaze could be placed. Xilinx architectures are column-based, meaning that every tile or resource type is the same for a column of the device layout. Consider the device floorplan view below where major tile types have been highlighted:



Tiles of the same type have all of the same logic and local interconnect and are repetitive in their respective columns. The main constraint for the PicoBlaze is a block RAM and we can leverage RapidWright to help us analyze the fabric to find the most repeated tile column patterns adjacent to block RAMs. To do this, in our terminal open the RapidWright Python interpreter by running:

```
rapidwright Jython
```

Then in the terminal we can use a class called `TileColumnPattern` to analyze the fabric and create a map of all the tile patterns in the device. We can do this by running:

```
device = Device.getDevice("xcvu3p-ffvc1517-2-i")
colMap = TileColumnPattern.genColumnPatternMap(device)
```

After a few seconds it will create a map where the keys are a sequence of tile type names (a tile column pattern) and values are a list of fabric tile column indices where the keyed tile column pattern begins. As a simple example, we can filter the map down to a pattern of 1 BRAM to find out how many BRAM columns exist in the device:

```
filtered = list(filter(lambda e: TileTypeEnum.BRAM in e.getKey() and e.getKey().size() == 1, colMap.entrySet()))
print filtered
```

The output should look like this:

```
[[BRAM]=[75, 97, 137, 193, 268, 331, 340, 396, 471, 534, 571, 594]]
```

In this example, we have a tile column pattern of length 1, a BRAM tile. The BRAM appears at tile column indices 75, 97, 137, ... as shown in the image below:



Note that the tile column numbers appear much higher than what would be expected based on the number of visible columns. This is expected as there are several tile columns not necessarily shown in the Vivado GUI, but RapidWright is able to filter and account for the non-visible tiles. Now, for our pattern, we need to filter the map down to only include those keys that:

- 1) Have 1 BRAM column
- 2) Have 4 SLICE (CLB) columns

To do this, we can run the following code that will print out the patterns we are interested in and sort them by most number of instances first:

```
filtered = list(filter(lambda e: TileTypeEnum.BRAM in e.getKey() and not TileTypeEnum.DSP in e.getKey() and e.getKey().size() == 5, colMap.entrySet()))
filtered.sort(key=lambda x: x.getValue().size(), reverse=True)
from pprint import pprint
pprint(filtered)
```

The output should look like this:

```
[[CLEM, CLEL_R, BRAM, CLEL_R, CLEM]=[94, 134, 265, 328, 337, 468, 531, 568],
 [CLEL_R, CLEM, CLEL_R, BRAM, CLEL_R]=[93, 131, 262, 334, 465],
 [CLEL_R, CLEM, CLEL_R, BRAM, CLEM]=[70, 188, 391],
 [CLEM, CLEM, CLEL_R, BRAM, CLEM]=[68, 186, 389],
 [CLEL_R, CLEM, CLEL_R, CLEM, BRAM]=[65, 183, 386],
 [CLEL_R, BRAM, CLEL_R, CLEM, CLEM_R]=[593],
 [CLEM_R, CLEM, CLEL_R, CLEM, CLEM_R]=[591],
 [CLEL_R, BRAM, CLEL_R, CLEM, CLEL_R]=[330],
 [CLEM, CLEL_R, CLEM, CLEL_R, BRAM]=[129],
 [BRAM, CLEL_R, CLEM, CLEL_R, BRAM]=[331],
 [CLEL_R, CLEM_R, CLEL_R, BRAM, CLEL_R]=[588],
 [BRAM, CLEL_R, CLEL_R, CLEM_R, CLEL_R]=[594]]
```

Our first pattern match ([CLEM, CLEL\_R, BRAM, CLEL\_R, CLEM]) is the most common with 8 instances in the fabric (we can determine this by there being 8 indices in the value array). To help visualize the pattern, here is the first instance (index 94) outlined in the previous floorplan view above with the highlighted tiles:



The second match is a juxtaposition of the first pattern and covers the same columns (note that its indices are very close to those of the first). The third, fourth and fifth are also juxtapositions of each other, but they cover a set of BRAM columns not covered by the first pattern, so any one of them will cover 3 more unique BRAM columns. That leaves the final BRAM column, 594, which can be covered by the 6th, 7th, 11th or 12th pattern. For this tutorial, we will use the following three patterns:

```
[CLEM, CLEL_R, BRAM, CLEL_R, CLEM]=[94, 134, 265, 328, 337, 468, 531, 568]
[CLEL_R, CLEM, CLEL_R, BRAM, CLEM]=[70, 188, 391]
[CLEL_R, CLEM_R, CLEL_R, BRAM, CLEL_R]=[588]
```
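As a quick sanity check, the start indices of the three chosen patterns can be matched against the BRAM column indices printed earlier. The plain Python sketch below rests on one loudly stated assumption: each pattern instance uses the first BRAM column at or after its start index. The dictionary labels and the helper name are invented for this example.

```python
from bisect import bisect_left

# BRAM tile column indices printed by the single-BRAM filter above.
BRAM_COLUMNS = [75, 97, 137, 193, 268, 331, 340, 396, 471, 534, 571, 594]

# Start indices of the three chosen patterns, copied from the output above.
CHOSEN_STARTS = {
    "first": [94, 134, 265, 328, 337, 468, 531, 568],
    "second": [70, 188, 391],
    "third": [588],
}

def bram_column_for(start):
    """Assumption: an instance uses the first BRAM column at or after `start`."""
    return BRAM_COLUMNS[bisect_left(BRAM_COLUMNS, start)]

covered = {name: [bram_column_for(s) for s in starts]
           for name, starts in CHOSEN_STARTS.items()}
uncovered = set(BRAM_COLUMNS) - {c for cols in covered.values() for c in cols}

for name, cols in covered.items():
    print(name, "pattern covers BRAM columns", cols)
print("uncovered BRAM columns:", sorted(uncovered))
```

Under that assumption, the three patterns together reach every one of the 12 BRAM columns.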

#### 3. PBlock Selections

Now that we have identified the tile column patterns for our PicoBlaze to be implemented, we must select actual locations on the fabric to produce our replicate-able implementation. A few architectural considerations to take into account when deciding the set of pblocks to use for an implementation are:

- 1) Laguna tiles: In multi-SLR devices, some SLICEs along the top and bottom clock region rows are replaced with SLR-crossing resources called Laguna tiles. These tiles cause discontinuities in the regularity of the fabric and can require special handling when creating pre-implemented modules. To achieve coverage in those regions, separate implementations placed in the neighborhood of the Laguna tiles are needed.
- 2) Device edge: Around the edge of a device or SLR, the regular routing patterns are replaced by U-turn interconnect. These U-turns actually make routing easier around the edge of the device. However, if the pre-implemented module is to include routing, instances at the edge must be a separate implementation.
- 3) Clock region edge: Another routing edge case relates to clock region edges. Even with `CONTAIN_ROUTING=1` set on the pblock, some routes of an instance at the edge of a clock region can have side loads that differ from those of other instances. If timing is especially critical and the pre-implemented module was created at a non-edge location, these larger side loads can be just enough to miss timing. In Vivado, these side loads can be seen by clicking the settings (gear icon) at the top right of the device window and turning on `Device->Nets->Used Stub` as shown in the screenshot below.



An example of these stubs (side loads) in a PicoBlaze implementation is shown in the image below:



As this PicoBlaze instance moves around the fabric, if the used stubs ever cross a clock region boundary, their timing will be increased slightly and can cause the pre-implemented module to close timing at a slightly lower frequency (0 to 5%). To avoid this problem, one can pre-implement the replicated circuit at both the top and bottom edges of a clock region so that the worst case timing is already factored in. The top or bottom implementation can then be used throughout the middle of a clock region without affecting its timing characteristics negatively.

Heterogeneous architectures can become an obstacle to relocatability. However, with the proper pblock selection, full coverage can be achieved. To keep this tutorial simple, we will work around these issues by ignoring clock region timing edge effects and not using areas next to Laguna and SLR edges. Ultimately, we must do the following to create re-usable pblocks:

- 1) Decide on the number of instances required for the desired coverage
- 2) Identify the proper origin of the pblock(s)
- 3) Correctly calculate the pblock ranges by capturing all resource coordinate systems

To make things simple, we will only use three pblocks to achieve complete coverage in the center three clock region rows.

Next, we can use the Vivado Device view of an already open instance of the PicoBlaze design to help us visually locate our pblock origins.

For our first pblock, we can select the bottom of a middle clock region with an instance of the first pattern:

```
[CLEM, CLEL_R, BRAM, CLEL_R, CLEM]=[94, 134, 265, 328, 337, 468, 531, 568]
```

The first instance column is 94, meaning the pattern begins with a CLEM tile column at index 94. In RapidWright, we can query the device for a tile in that column:

```
device.getTile(1, 94)
```

Which returns:

```
CLEM_X9Y299
```

**Note:** We used a row index of 1 (0 is the edge of the device), yet the Y coordinate is 299. The row/column coordinate system has its origin at the top left (North West) corner of the device, whereas the X/Y coordinate system has its origin at the bottom left (South West) corner.

As we expect, the tile type is CLEM. We must now create a pblock that captures the pattern on the edge of a middle clock region. By subtracting 60 (the height of a clock region in SLICE rows) from the Y coordinate, we arrive at tile CLEM\_X9Y239. We can select this tile in Vivado by running the Tcl command:

```
select_objects [get_tiles CLEM_X9Y239]
```

then using the toolbar button for pblock creation, we can use the mouse to create an outlined rectangular region that includes 20 SLICEs and 1 RAMB36 as shown in the screenshot below:



A confirmation window will pop up. Make sure all the *Grids* are selected, then click OK. By using this technique, we can be assured of getting the proper ranges for both BRAM and SLICEs in our pblock. To get the created pblock ranges, run the Tcl command:

```
get_property GRID_RANGES [get_selected_objects]
```

This should print:

```
RAMB36_X1Y47:RAMB36_X1Y47 RAMB18_X1Y94:RAMB18_X1Y95 SLICE_X13Y235:SLICE_X16Y239
```

This is our first pblock. We can repeat this process for the other two patterns to get the following list of pblocks:

```
RAMB36_X1Y47:RAMB36_X1Y47 RAMB18_X1Y94:RAMB18_X1Y95 SLICE_X13Y235:SLICE_X16Y239
RAMB36_X0Y47:RAMB36_X0Y47 RAMB18_X0Y94:RAMB18_X0Y95 SLICE_X7Y235:SLICE_X10Y239
RAMB36_X11Y47:RAMB36_X11Y47 RAMB18_X11Y94:RAMB18_X11Y95 SLICE_X157Y235:SLICE_X160Y239
```
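Before saving the ranges, it can be worth confirming that every pblock has the same resource footprint. The following plain Python sketch (the regular expression and the `site_counts` helper are illustrative, not part of RapidWright or Vivado) counts the sites covered by each rectangular range, assuming each range spans a full rectangle in its own site grid.

```python
import re
from collections import Counter

# Matches one range such as SLICE_X13Y235:SLICE_X16Y239.
RANGE = re.compile(r"([A-Z0-9]+)_X(\d+)Y(\d+):([A-Z0-9]+)_X(\d+)Y(\d+)")

def site_counts(pblock_line):
    """Count the sites of each type covered by one line of pblock ranges."""
    counts = Counter()
    for rng in pblock_line.split():
        m = RANGE.fullmatch(rng)
        if m is None or m.group(1) != m.group(4):
            raise ValueError("unexpected range: " + rng)
        x0, y0, x1, y1 = (int(m.group(i)) for i in (2, 3, 5, 6))
        counts[m.group(1)] += (x1 - x0 + 1) * (y1 - y0 + 1)
    return dict(counts)

PBLOCKS = [
    "RAMB36_X1Y47:RAMB36_X1Y47 RAMB18_X1Y94:RAMB18_X1Y95 SLICE_X13Y235:SLICE_X16Y239",
    "RAMB36_X0Y47:RAMB36_X0Y47 RAMB18_X0Y94:RAMB18_X0Y95 SLICE_X7Y235:SLICE_X10Y239",
    "RAMB36_X11Y47:RAMB36_X11Y47 RAMB18_X11Y94:RAMB18_X11Y95 SLICE_X157Y235:SLICE_X160Y239",
]
for line in PBLOCKS:
    print(site_counts(line))
```

Each line should report 20 SLICEs, 1 RAMB36 and 2 RAMB18 sites, matching the rectangle drawn in Vivado.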

Now store these three pblocks in a text file (or download the one we have already created) called `picoblaze_pblocks.txt` in our `picoblaze` directory. With these three pblocks, we are ready to move on to full implementation of these modules. Please continue with *Pre-implemented Modules - Part II*.

## 13.14 Pre-implemented Modules - Part II

*“If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?” – Seymour Cray*

This tutorial has two parts. In the first part, we showed how you can create pre-implemented modules tailored to fit your architecture. In this second part of the tutorial, we show how to assemble the PicoBlaze instances into a programmable overlay. To accomplish this, we will perform the following tasks:

1. *Implementation Optimization*: Use RapidWright and Vivado to get the best PicoBlaze implementations.
2. *Building the Overlay*: Replicate and stitch our pre-implemented PicoBlazes into an overlay.

### 13.14.1 1. Implementation Optimization

In *Pre-implemented Modules - Part I*, we finished by creating three pblocks to be used for our PicoBlaze implementation. Now that we know our three pblock ranges, we can use `PerformanceExplorer`, a tool provided with RapidWright, to help us explore the implementation performance of each of these instances. The `PerformanceExplorer` is able to parallelize many different runs of place and route using different directives and also sweep clock uncertainty to explore the solution space. By leveraging Vivado and RapidWright’s `PerformanceExplorer`, we are able to capture the best implementation runs for reuse.

The RapidWright `PerformanceExplorer` can be run directly from the command line:

```
rapidwright PerformanceExplorer -h
```

which prints help and options detail:

```
=====
==                               DCP Performance Explorer                  ==
=====
This RapidWright program will place and route the same DCP in a variety of
ways with the goal of achieving higher performance in timing closure. This
tool will launch parallel jobs with the cross product of:
    < placer directives x router directives x clk uncertainty settings >

Option (* = required)           Description
-----
-?, -h                          Print Help
-b <String: PBlock file, one set of
    ranges per line>
* -c <String: Name of clock to
    optimize>
-d [String: Run directory (jobs data
    location)]                   (default: <current directory>)
* -i <String: Input DCP>
-m [String: Min clk uncertainty (ns)]   (default: -0.1)
-p [String: Comma separated list of
    place_design -directives]        (default: Default, Explore)
-q [Boolean: Sets attribute on pblock
    to contain routing]            (default: true)
-r [String: Comma separated list of
    route_design -directives]      (default: Default, Explore)
-s [String: Clk uncertainty step (ns)]  (default: 0.025)
* -t <String: Target clock period (ns)>
-u [String: Comma separated list of
    clk uncertainty values (ns)]
-x [String: Max clk uncertainty (ns)]  (default: 0.25)
-y [String: Specifies vivado path]      (default: vivado)
-z [Integer: Max number of concurrent
    job when run locally]               (default: 12)
```

To run `PerformanceExplorer` for our PicoBlaze design and three selected pblocks, we would run the following at the command line (where `picoblaze_pblocks.txt` is the pblock file from [Part I](#)):

**Danger:** **DO NOT USE THIS IN A TUTORIAL VIRTUAL MACHINE**, it will crash the VM. `PerformanceExplorer` is best used with a compute cluster (such as [LSF](#)). It can be used on a single workstation, but the number of parallel runs combined with their length can quickly add up to days of compute time.

```
# DON'T RUN THIS IN A TUTORIAL VIRTUAL MACHINE
rapidwright PerformanceExplorer -c clk -i picoblaze_synth.dcp -t 2.85 -b picoblaze_pblocks.txt
```

The `PerformanceExplorer` will then create a unique directory and launch a Vivado run for each unique job specification. There are four main parameters by which a job can be specified:

- 1) Placer Directive (`place_design -directive` option)
- 2) Router Directive (`route_design -directive` option)
- 3) Clock Uncertainty (applied before placement, then removed before routing)
- 4) PBlock (optional)

In our run of `PerformanceExplorer` above, we have the following set:

- 1) [Default, Explore]
- 2) [Default, Explore]
- 3) [-0.100, -0.075, -0.050, -0.025, 0.0, 0.025, 0.050, 0.075, 0.100, 0.125, 0.150, 0.175, 0.200, 0.225, 0.250]
- 4) [pb0, pb1, pb2]
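The job list is the cross product of these four sets. The sketch below enumerates it in plain Python, purely as an illustration of the counting; it does not reflect how `PerformanceExplorer` is implemented internally, and the helper name is invented. The sweep limits and step are the defaults shown in the help output.

```python
from itertools import product

def uncertainty_sweep(lo, hi, step):
    """Evenly spaced clock uncertainty values from lo to hi inclusive (ns)."""
    count = round((hi - lo) / step) + 1
    return [round(lo + i * step, 3) for i in range(count)]

placer_directives = ["Default", "Explore"]
router_directives = ["Default", "Explore"]
uncertainties = uncertainty_sweep(-0.1, 0.25, 0.025)
pblocks = ["pb0", "pb1", "pb2"]

# One job per combination of the four parameters.
jobs = list(product(placer_directives, router_directives, uncertainties, pblocks))
print(len(uncertainties), "uncertainty values,", len(jobs), "jobs")
print("jobs per pblock:", len(jobs) // len(pblocks))
```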

This yields a total of $2 \times 2 \times 15 \times 3 = 180$ runs. On a single workstation, this would take several hours depending on the number of parallel cores used (defaults to half the number of CPU cores; use the `-z` option to specify the core count). To avoid this lengthy step in the tutorial, we provide histograms of the results and best implementations here:



It seems Vivado was able to get the best performance from pblock0 which is the one with the floorplan that occurs most often. Although the histograms provide a view of what was achieved across 60 runs for each pblock, we really only care about the best results as those are what we move on with to the next step. For those curious, full performance results can be downloaded here: picoblaze\_results.xlsx.

| PBlock  | WNS (2.850ns period) | Max Operating Freq. |
|---------|----------------------|---------------------|
| pblock0 | 0.300ns              | 392MHz              |
| pblock1 | 0.178ns              | 374MHz              |
| pblock2 | 0.207ns              | 378MHz              |
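The maximum operating frequency in the table follows directly from the period constraint and the worst negative slack (WNS): positive slack shortens the achievable period. A small Python sketch of that conversion, with a hypothetical function name, is shown here.

```python
def max_frequency_mhz(period_ns, wns_ns):
    """Highest clock frequency implied by a period constraint and its WNS."""
    return 1000.0 / (period_ns - wns_ns)

# WNS in ns for each pblock, measured against the 2.850 ns period constraint.
RESULTS = {"pblock0": 0.300, "pblock1": 0.178, "pblock2": 0.207}
for name, wns in RESULTS.items():
    print(name, round(max_frequency_mhz(2.850, wns)), "MHz")
```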

Download the best placed and routed implementations here: picoblaze\_best.zip into your picoblaze directory then unzip the file:

```
unzip picoblaze_best.zip
```

### 13.14.2 2. Building the Overlay

Each PicoBlaze instance has a set of 4, 8-bit input ports $\{a, b, c, d\}$ and 4, 8-bit output ports $\{w, x, y, z\}$. Our array of PicoBlaze instances will create columns on top of BRAM sites. The inter-module connectivity pattern for each column of PicoBlaze instances will follow this pattern:



For each column, there will be one 8-bit top-level input that will drive any inputs that don't have matching connecting instances. There will be one 8-bit top-level output driven by the top PicoBlaze's output $z$; all other outputs without matching connecting instances will be left unconnected.

RapidWright Java code to instantiate and place the three PicoBlaze pre-implemented modules and stitch them together is found in `RapidWright/com/xilinx/rapidwright/examples/PicoBlazeArray.java`. This can be run at the command line with the following command:

```
rapidwright PicoBlazeArray
```

Without any parameters, we get a simple usage message:

```
USAGE: <pblock dcp directory> <part> <output_dcp> [--no_hand_placer]
```

To run, we must provide the path to the directory where our pblock DCPs are located, the target device part name and an output DCP name:

```
rapidwright PicoBlazeArray `pwd` xcvu3p-ffvc1517-2-i picoblaze_array.dcp
```

The program will read each of the pblock DCPs and stitch them together, printing out runtime numbers for each step. By default, the program will open the HandPlacer to enable the user to examine the placed PicoBlazes (you can skip the hand placer by adding --no\_hand\_placer as the last argument). Here is a screenshot of the tool:



You can zoom in/out using the scroll wheel of your mouse (or **Ctrl** + **-** to Zoom Out and **Ctrl** + **=** to Zoom In) and can move the pre-implemented PicoBlaze instances if you wish to change any of their placement. As you move the blocks, you'll notice two things. First, the color of the block will change depending on its contextual location:

- Green = Valid Placement
- Orange = Valid Placement but overlapping
- Red = Invalid Placement



You'll also notice colored lines that appear as you drag the blocks. These lines show the high-level connectivity of the block to other blocks. The thicker the line, the more tightly connected the block is to that neighbor. If you choose to change the placement, the result will automatically be saved. Close the Hand Placer window, and the program will write out a placed and routed PicoBlaze array DCP.

Close any existing DCPs that are open in Vivado and open our new `picoblaze_array.dcp`:

```
close_design  
open_checkpoint picoblaze_array.dcp
```

Once the design opens in Vivado, we find that RapidWright has “copied and pasted” our PicoBlaze 396 times in the center clock rows of the VU3P as shown in the screenshot below:



To finalize the design, we simply need to update the clock tree, route the interconnections between PicoBlaze instances and check timing. This can be performed with the following Tcl commands:

```
update_clock_routing
route_design
report_timing_summary -delay_type min_max -report_unconstrained -check_timing_verbose \
    -max_paths 10 -input_pins -routable_nets -name timing_1
```

Once we are done, we should get a fully routed implementation that looks similar to this (or you can download our result here [picoblaze\\_array\\_routed.dcp](#)):



In our example, we had over 100ps of positive slack on the worst setup paths and met all hold requirements with at least 10ps of slack:

| Design Timing Summary                       |                                          |                                                          |
|---------------------------------------------|------------------------------------------|----------------------------------------------------------|
| Setup                                       | Hold                                     | Pulse Width                                              |
| Worst Negative Slack (WNS): <b>0.108 ns</b> | Worst Hold Slack (WHS): <b>0.010 ns</b>  | Worst Pulse Width Slack (WPWS): <b>0.883 ns</b>          |
| Total Negative Slack (TNS): <b>0.000 ns</b> | Total Hold Slack (THS): <b>0.000 ns</b>  | Total Pulse Width Negative Slack (TPWS): <b>0.000 ns</b> |
| Number of Failing Endpoints: <b>0</b>       | Number of Failing Endpoints: <b>0</b>    | Number of Failing Endpoints: <b>0</b>                    |
| Total Number of Endpoints: <b>167904</b>    | Total Number of Endpoints: <b>167904</b> | Total Number of Endpoints: <b>62568</b>                  |

All user specified timing constraints are met.

Although our clock period constraint is 2.85ns, the positive setup slack means we could run the array a bit faster, at about 365MHz. With some additional effort, we could increase the number of instances on the VU3P to 720 if we were to work around device edge cases, Laguna tiles and one of the columns that wasn't utilized due to pattern overlap.
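As a check on that figure, the achievable frequency follows from the period constraint and the worst setup slack reported in the timing summary (the symbols $f_{\max}$ and $T_{\text{clk}}$ are introduced here only for this worked example):

$$\begin{aligned}
f_{\max} &= \frac{1}{T_{\text{clk}} - \text{WNS}} \\
 &= \frac{1}{2.850\,\text{ns} - 0.108\,\text{ns}} \\
 &= \frac{1}{2.742\,\text{ns}} \\
 &\approx 364.7\,\text{MHz}
\end{aligned}$$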

### 13.14.3 Conclusion

Although building an array of PicoBlaze microcontrollers probably won't be used as the next architecture for deep learning accelerators or crypto miners, it has demonstrated how RapidWright and Vivado can be used together to achieve some interesting architectural structures in FPGA fabric. Specifically we have shown:

- 1) **PBlock / Area Constraint Analysis** - Getting the area constraint to the right footprint size
- 2) **Tile Column Pattern Analysis** - Picking the right patterns for maximum placement coverage
- 3) **Performance Exploration** - Using RapidWright and Vivado to find and harvest the best implementations
- 4) **Overlay Construction** - Using RapidWright to *copy & paste* implementations and stitch them together

## 13.15 Create and Use an SLR Bridge

The goal of this tutorial is to combine a RapidWright generated circuit with a Vivado design.

### 13.15.1 Background

In this example, we implement a 4-to-1 TDM (Time-division Multiplexing) design that reduces the number of valuable *SLR (Super Logic Region)* crossing resources by 4X. SLR crossing resources (super long lines or *SLLs*) are inter-die connectivity resources within the package and are often in high demand. RapidWright can generate a highly tuned SLR bridge within seconds as a drop-in implementation (.DCP) capable of running at near-spec performance (~750MHz). This tutorial will demonstrate how to use such a bridge and maintain high performance in common design flows.

The TDM circuit and its connectivity with a RapidWright SLR bridge is shown in the figure below.



The TDM circuit switches between 4 low frequency signals (1X CLK) to drive data into the faster clock domain (4X CLK), and vice versa. The red-dotted line shows the boundary and encompasses the circuit that will be generated directly from RapidWright. Due to the challenging nature of crossing SLRs, RapidWright has a dedicated circuit generator for SLR crossings that can custom route the clock to avoid hold time issues and minimize inter-SLR delay penalties to provide an implementation that achieves high performance (>700MHz).

By taking this approach, greater bandwidth over the SLR boundaries can be achieved and/or the total number of SLLs used can be minimized, leaving them available for other applications such as when building a [shell](#).
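To make the 4-to-1 multiplexing concrete, here is a small behavioral model in plain Python. It is only a sketch of the idea and not the RTL used in this tutorial: the function names are invented, clocking is reduced to list positions, and the 128-signal figure is an illustrative assumption.

```python
RATIO = 4  # 4-to-1 TDM: four slow-clock signals share one SLL

def tdm_send(slow_words):
    """Serialize groups of RATIO slow-clock values into one fast-clock stream."""
    stream = []
    for group in slow_words:            # one group per 1X clock cycle
        assert len(group) == RATIO
        stream.extend(group)            # one value per 4X clock cycle
    return stream

def tdm_receive(stream):
    """Regroup the fast-clock stream into RATIO values per slow-clock cycle."""
    return [tuple(stream[i:i + RATIO]) for i in range(0, len(stream), RATIO)]

def slls_needed(signals):
    """SLLs required when every RATIO signals share one crossing wire."""
    return -(-signals // RATIO)         # ceiling division

slow = [(1, 0, 1, 1), (0, 0, 1, 0)]     # two 1X cycles of four signals
wire = tdm_send(slow)
print(wire)
print(tdm_receive(wire) == slow)
print(slls_needed(128))                 # illustrative signal count
```

The model shows the trade: each SLL carries four signals, at the cost of running the crossing at four times the source clock rate.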

### 13.15.2 Getting Started

#### Building the Bridge

Begin by creating a directory for our work in this tutorial:

```
mkdir bridge_tutorial
cd bridge_tutorial
```

Our first task is to generate an SLR crossing bridge from RapidWright. RapidWright has a dedicated generator for this purpose called the **SLRCrosserGenerator**, which can be run from the command line. To print the tool's help and options, simply run:

```
rapidwright SLRCrosserGenerator -h
```

This should produce the following output:

| SLR Crossing DCP Generator option                                       | Description / default          |
|-------------------------------------------------------------------------|--------------------------------|
| -?, -h                                                                  | Print Help                     |
| -a [String: Clk input net name]                                         | (default: clk_in)              |
| -b [String: Clock BUFGCE site name]                                     | (default: BUFGCE_X0Y218)       |
| -c [String: Clk net name]                                               | (default: clk)                 |
| -d [String: Design Name]                                                | (default: slr_crosser)         |
| -i [String: Input bus name prefix]                                      | (default: input)               |
| -l [String: Comma separated list of Laguna sites for each SLR crossing] | (default: LAGUNA_X2Y120)       |
| -n [String: North bus name suffix]                                      | (default: _north)              |
| -o [String: Output DCP File Name]                                       | (default: slr_crosser.dcp)     |
| -p [String: UltraScale+ Part Name]                                      | (default: xcvu7p-flva2104-2-i) |
| -q [String: Output bus name prefix]                                     | (default: output)              |
| -r [String: INT clk Laguna RX flops]                                    | (default: GCLK_B_0_1)          |
| -s [String: South bus name suffix]                                      | (default: _south)              |
| -t [String: INT clk Laguna TX flops]                                    | (default: GCLK_B_0_0)          |
| -u [String: Clk output net name]                                        | (default: clk_out)             |
| -v [Boolean: Print verbose output]                                      | (default: true)                |
| -w [Integer: SLR crossing bus width]                                    | (default: 512)                 |
| -x <Double: Clk period constraint (ns)>                                 |                                |
| -y [String: BUFGCE cell instance name]                                  | (default: BUFGCE_inst)         |
| -z [Boolean: Use common centroid]                                       | (default: false)               |
As you can see, this generator has several parameterizable options. In this case, we will want a bridge that provides 32 wires in both directions using a single column of Laguna tiles. We will use the xcvu7p-flva2104-2-i part for our example and use the far edge Laguna column for our crossing. As RapidWright must custom route the clock to preserve the carefully tuned leaf clock buffer delays, it must include a BUFGCE instance. We also specify the location of the BUFG to improve timing reproducibility in the application context. To generate such a bridge run the following at the command line:

```
rapidwright SLRCrosserGenerator -l LAGUNA_X20Y120 -b BUFGCE_X1Y80 -w 32 -o slr_crosser_vu7p_32.dcp -p xcvu7p-flva2104-2-i
```
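
If the generator is invoked from a script, the same command line can be assembled programmatically. The following is a minimal sketch, assuming the `rapidwright` launcher used above is available on the `PATH`. The helper name `slr_crosser_command` is hypothetical, and only flags listed in the option table are used.

```python
import shlex


def slr_crosser_command(laguna_sites, bufgce_site, width, out_dcp, part):
    """Build the argument list for the SLRCrosserGenerator invocation."""
    return [
        "rapidwright", "SLRCrosserGenerator",
        "-l", ",".join(laguna_sites),   # comma separated list of Laguna sites
        "-b", bufgce_site,              # clock BUFGCE site name
        "-w", str(width),               # SLR crossing bus width
        "-o", out_dcp,                  # output DCP file name
        "-p", part,                     # UltraScale+ part name
    ]


cmd = slr_crosser_command(
    ["LAGUNA_X20Y120"], "BUFGCE_X1Y80", 32,
    "slr_crosser_vu7p_32.dcp", "xcvu7p-flva2104-2-i",
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems when site names or file names come from other parts of a script.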

After several seconds, a new file, `slr_crosser_vu7p_32.dcp`, should appear in our working directory. Let's open it in Vivado to examine what we have created.

```
vivado slr_crosser_vu7p_32.dcp
```

Once open, the device view (Window->Device) should look something like this:



We can also add a timing constraint to test the pre-implemented performance of the bridge with the following Tcl commands:

```
create_clock -name clk -period 1.333 [get_nets clk]
report_timing_summary -delay_type min_max -report_unconstrained -check_timing_verbose -max_paths 10 -input_pins -routable_nets -name timing_1
```

We have specified a 750MHz clock constraint (1.333 ns period) and the timing report should show positive slack for both setup and hold. Close this design in Vivado once you are done (don't save your changes):

```
close_design
```

## Combining the Designs

Now that we know we have a correct bridge, we can begin on our main design. To do so, we have provided a synthesized version of our TDM circuit where N=32. To open `synth32_BB.dcp`, run the following Tcl commands in Vivado's Tcl prompt:

```
exec wget http://www.rapidwright.io/docs/_downloads/synth32_BB.dcp
open_checkpoint synth32_BB.dcp
```

Look at the Vivado netlist view of the `synth32_BB.dcp` design. The SLR bridge (the `crossing` instance) has been left open as a black box to be populated with our RapidWright bridge implementation. See the screenshot below for reference:



---

**Note:** For ease of use of this tutorial, we have provided a synthesized circuit with a black box. However, in common practice, the generated DCP from RapidWright can simply be instantiated in Verilog/VHDL directly and the DCP added to the sources of the project.

---

To import our SLR bridge, we will use the `read_checkpoint` command at the Tcl prompt:

```
read_checkpoint -cell crossing slr_crosser_vu7p_32.dcp
```

Note that the netlist icon next to `crossing` should change from dark to white. The black box has now been populated with our custom SLR bridge implementation we just created in RapidWright.

## Implementation

We can now constrain the design and run place and route by sourcing the `run_PnR.tcl` script. Run the following commands at Vivado's Tcl prompt:

```
exec wget http://www.rapidwright.io/docs/_downloads/run_PnR.tcl
source run_PnR.tcl
```

Alternatively, you can copy and paste the contents of the Tcl file below into the Tcl console in Vivado:

```
# Add pblocks
create_pblock pblock_top
add_cells_to_pblock pblock_top [get_cells [list T_top]] -clear_locs
resize_pblock [get_pblocks pblock_top] -add {CLOCKREGION_X5Y5:CLOCKREGION_X5Y5}
create_pblock pblock_bot
add_cells_to_pblock pblock_bot [get_cells [list T_bot]] -clear_locs
resize_pblock [get_pblocks pblock_bot] -add {CLOCKREGION_X5Y4:CLOCKREGION_X5Y4}
# Implement design and save
place_design
route_design
write_checkpoint -force routed_32.dcp
```

This can take several minutes (up to 30 minutes inside the tutorial virtual machine). For those wishing to skip ahead, we have provided our own implementation of the results of the above Tcl commands here: `routed_32.dcp`. In the Device model view, our implementation looks like this:



Additional timing analysis can be performed on the specific paths crossing the SLR and leading up to it by sourcing the `run_timing.tcl` script. Run the following commands at Vivado's Tcl prompt:

```
exec wget http://www.rapidwright.io/docs/_downloads/run_timing.tcl
source run_timing.tcl
```

Alternatively, you can copy and paste the contents of the Tcl file below into the Tcl console in Vivado:

```
# report to GUI
report_timing_summary -delay_type min_max -report_unconstrained -check_timing_verbose -max_paths 10 -input_pins -routable_nets -name timing_all
report_timing -from {*/input0_north_reg*} -delay_type min_max -max_paths 10 -sort_by group -input_pins -name timing_North
report_timing -from {*/output0_north_reg*} -delay_type min_max -max_paths 10 -sort_by group -input_pins -name timing_North_after
report_timing -to {*/input0_north_reg*} -delay_type min_max -max_paths 10 -sort_by group -input_pins -name timing_North_before
report_timing -from {*/input0_south_reg*} -delay_type min_max -max_paths 10 -sort_by group -input_pins -name timing_South
report_timing -from {*/output0_south_reg*} -delay_type min -max_paths 10 -sort_by group -input_pins -name timing_South_after
report_timing -to {*/input0_south_reg*} -delay_type min_max -max_paths 10 -sort_by group -input_pins -name timing_South_before
```

This will produce several tabs in the Timing window tab as shown below:



The clock constraint for the design is 1.4ns and our implementation met timing with 0.02ns of positive slack, meaning it can be implemented with a >710MHz fast (4X) clock. This is quite close to the spec of the VU7P which is 775MHz.
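
The relationship between the period constraint, the reported slack, and the achievable clock frequency can be checked with a few lines of arithmetic. The sketch below is illustrative only. The helper names `period_to_mhz` and `fmax_mhz` are our own and are not part of Vivado or RapidWright.

```python
def period_to_mhz(period_ns: float) -> float:
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns


def fmax_mhz(period_ns: float, setup_slack_ns: float) -> float:
    """Estimate the highest passing frequency from the worst setup slack."""
    # Positive slack means the critical path is shorter than the period.
    return 1000.0 / (period_ns - setup_slack_ns)


constraint = period_to_mhz(1.4)   # frequency requested by the 1.4 ns constraint
estimate = fmax_mhz(1.4, 0.02)    # frequency implied by 0.02 ns of positive slack
print(f"constraint: {constraint:.1f} MHz, estimated Fmax: {estimate:.1f} MHz")
```

Both values lie above 710 MHz, consistent with the figure quoted above. By the same conversion, the 1.333 ns constraint used earlier for the standalone bridge corresponds to roughly 750 MHz.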

## Conclusion

We have shown how pre-implemented designs can be integrated into existing Vivado design flows to achieve near-spec performance.

## 13.16 RapidWright FPGA 2019 Deep Dive Tutorial

Before starting the tutorials, see [Getting Started](#) below to set up your machine.

| Tutorial Segment                                                                  | Time    | Purpose                                             |
|-----------------------------------------------------------------------------------|---------|-----------------------------------------------------|
|  | 5 mins  | Intro to RapidWright within Jupyter Notebook        |
|  | 10 mins | How to build a netlist from scratch                 |
|                                                                                   | 15 mins | How to generate a circuit in RapidWright            |
|                                                                                   | 15 mins | How to create a pre-implemented module              |
|                                                                                   | 15 mins | How to use and relocate pre-implemented modules     |
|  | 20 mins | Fast probe routing on existing implementation       |
|  | 15 mins | How to use a SAT engine to solve routing congestion |
|                                                                                   | 20 mins | Combine Vivado and RapidWright generated circuits   |



= Jupyter Notebook Tutorial

These tutorials were given in the Sunday afternoon session of FPGA 2019 (February 24th).

### 13.16.1 Supplementary Materials:

- Slides from the Sunday morning session: [FPGA19-RapidWright-Presentation.pdf](#)
- The invited tutorial paper: [FPGA19-RapidWright.pdf](#)

### 13.16.2 Getting Started

Before attempting the tutorials above, please install and/or set up the following tools:

1. [RapidWright 2018.3.1](#)
2. [Vivado 2018.3](#)
3. [Eclipse or IntelliJ](#) (not required, but mentioned in )
4. [Jupyter Notebook and the RapidWright Kernel](#) (for Jupyter Notebook tutorials)
5. Download the RapidWright-binder repository by running the following at the command line:

```
git clone https://github.com/clavin-xlnx/RapidWright-binder.git
```

6. Start the Jupyter notebook server and point it at your RapidWright-binder directory:

```
jupyter notebook --notebook-dir=RapidWright-binder
```

At this point the above Jupyter notebook tutorial links should open properly.

## 13.17 RapidWright FCCM 2019 Workshop

| Tutorial Segment                                                                  | Time    | Purpose                                             |
|-----------------------------------------------------------------------------------|---------|-----------------------------------------------------|
|  | 5 mins  | Intro to RapidWright within Jupyter Notebook        |
|  | 10 mins | How to build a netlist from scratch                 |
|                                                                                   | 15 mins | How to generate a circuit in RapidWright            |
|                                                                                   | 15 mins | How to create a pre-implemented module              |
|                                                                                   | 15 mins | How to use and relocate pre-implemented modules     |
|  | 20 mins | Fast probe routing on existing implementation       |
|  | 15 mins | How to use a SAT engine to solve routing congestion |
| (Linux only)                                                                      | 20 mins | Combine Vivado and RapidWright generated circuits   |
|  | 20 mins | How to build a basic router in RapidWright          |

These tutorials were given in the Wednesday morning workshop of FCCM 2019 (May 1st).

### 13.17.1 Getting Started

Before attempting the tutorials above, please install and/or set up the following tools:

1. [RapidWright 2018.3.3](#)
2. [Vivado 2018.3](#)
3. [Eclipse or IntelliJ](#) (not required, but mentioned in )
4. [Jupyter Notebook and the RapidWright Kernel](#) (for Jupyter Notebook tutorials)
5. Download the RapidWright-binder repository by running the following at the command line:

```
git clone https://github.com/clavin-xlnx/RapidWright-binder.git
```

6. Start the Jupyter notebook server and point it at your RapidWright-binder directory:

```
jupyter notebook --notebook-dir=RapidWright-binder
```

At this point the above Jupyter notebook tutorial links should open properly.



= Jupyter Notebook Tutorial

## 13.18 RapidWright FPL 2019 Tutorial

**Title:** RapidWright: Enabling Application-optimized FPGA Implementations

**Where:** Vertex building at the UPC/BSC Campus, Barcelona, Spain - FPL 2019

**When:** Thursday, September 12th, 2019, morning half-day

**Organizers:** Chris Lavin and Alireza Kaviani

### 13.18.1 What is RapidWright?

RapidWright is an open source framework providing a gateway to Vivado's back-end implementation tools. It enables a broad range of new capabilities related to FPGA implementation such as:

- Build well-defined placed and routed circuits in seconds
- Enable parameterizable placed and routed circuit generators
- Reuse and relocate P&R circuits from Vivado
- Quickly combine P&R circuits to enable efficient shells and overlays

Additionally, RapidWright provides a new validation path for FPGA CAD researchers. New techniques and algorithms can be demonstrated on the latest commercial devices—crisply quantifying their contributions to both industry and academia.

### 13.18.2 Tutorial Content

This tutorial will combine presentation and hands-on tutorials. An overview of RapidWright, its capabilities, and vision for the future will be presented. For the hands-on portion, attendees will be provided with a USB flash drive and instructions to run the tutorials on their own laptop using a virtual machine. The hands-on session will consist of 1:1 Q&A while participants work through selected tutorials at their own pace.

**The list of tutorial topics will include (but is not limited to):**

- Building placed and routed circuits from scratch
- Creating parameterized, generated circuits
- Walk-through of how to build a ~400 instance PicoBlaze accelerator overlay
- A fast, non-intrusive ILA (ChipScope) debug probe re-router
- Using a SAT engine to resolve routing congestion
- How to combine a RapidWright SLR bridge with Vivado-based designs

**Attendees of this tutorial can expect to:**

1. Gain a deeper understanding of how to leverage Xilinx architecture
2. Know how to use RapidWright and apply its capabilities in their own designs
3. Learn about design methodologies that can lead to near-spec performance

RapidWright opens a new path for domain-specific implementation tools that will improve performance and productivity and we invite the community to help us further the potential of FPGA technology. Please join us in Barcelona on September 12th for the RapidWright Tutorial held at FPL!

### 13.18.3 Details

You can register for the tutorial at the [FPL 2019 website](#).

**Attendees will need to bring a laptop that can support the following:**

- A laptop capable of reading and writing to a USB 3.1 flash drive with a Type A port
- Able to install [Virtual Box 6.0.x](#)
- Laptop can support a virtual machine with 6GB of RAM (8GB preferred)
- Enable Intel VT / AMD-V (64-bit OS virtualization) in BIOS

- If running Linux, you'll need to install exFAT packages in order to mount the USB flash drive:

```
sudo yum install exfat-utils fuse-exfat # CentOS / RedHat
sudo apt install exfat-fuse exfat-utils # Ubuntu / Debian
```

### 13.18.4 Questions?

**Contact organizers:**

- Chris Lavin - chris.lavin‘at‘xilinx.com
- Alireza Kaviani - alireza.kaviani‘at‘xilinx.com

## 13.19 RapidWright ICCAD 2023 Hands-on Tutorial

**Title:** RapidWright: Unleashing the Full Power of FPGA Technology with Domain-Specific Tooling

**Organizers:** Chris Lavin and Eddie Hung

**Where:** Artisan Room, Hyatt Regency San Francisco Downtown SOMA, [ICCAD 2023](#)

**When:** Wednesday, November 1st, 2023, 11:00am PDT

- 11:00am - 11:05am : Machine Allocation
- 11:05am - 11:15am : Introduction and Overview
- 11:15am - 1:00pm : Hands-on, self-guided tutorials

| Featured Tutorial Segments | Time    | Description                                                            |
|----------------------------|---------|------------------------------------------------------------------------|
|                            | 30 mins | Create a pre-implemented shell from an existing design without pblocks |
|                            | 25 mins | Use a 3rd party placer with the FPGA Interchange Format                |
|                            | 15 mins | Create placed and routed circuits from scratch in seconds              |
|                            | 35 mins | Add debug logic without changing existing placement and routing        |

| Additional Tutorial Segments                                                        | Time    | Description                                |
|-------------------------------------------------------------------------------------|---------|--------------------------------------------|
|  | 5 mins  | Intro to RapidWright in Jupyter Notebooks  |
|  | 10 mins | How to build a netlist from scratch        |
|  | 15 mins | How to create a pre-implemented module     |
|  | 15 mins | Use & relocate pre-implemented modules     |
|  | 15 mins | Use SAT to solve hard routing congestion   |
|  | 20 mins | Combine Vivado & RapidWright circuits      |
|  | 20 mins | How to build a basic router in RapidWright |



= Jupyter Notebook Tutorial

---

**Note:** To run the Jupyter Notebook tutorials (those marked with the  icon above), first run

```
cd ~/RapidWright-binder  
jupyter notebook
```

in a separate terminal in the AWS Instance to start the server, then click on the corresponding tutorial segments above.

---

### 13.19.1 Questions?

**Contact organizers:**

- Chris Lavin - chris.lavin‘at‘amd.com
- Eddie Hung - eddie.hung‘at‘amd.com



## 14.1 Call RapidWright from C/C++ Using GraalVM

Several RapidWright users have wondered about the prospects of using RapidWright in a C or C++ application even though it is written in Java. Previously, the only option was to use the [Java Native Interface \(JNI\)](#) and run an instance of a JVM in order to make such communication possible. However, a new project called [GraalVM](#) provides some exciting new capabilities to Java as it is a universal virtual machine and compiler ecosystem built around the JVM. It has [several features](#), but some highlights are:

- As GraalVM is a JVM, it comes with new just-in-time compilation technology to run Java faster
- Compile Java applications to native code for fast startup times
- Write Java programs using interpreted languages such as Python, Ruby, JavaScript and also support their C extensions
- Compile Java code as a native shared object library

In this article, we'll focus on that last feature which enables us to package up RapidWright as a shared object library with header files to be called by C/C++ applications. To get started, we are going to target a Linux environment and use Bash commands for our example (GraalVM is still in the early stages for support in Windows).

For the impatient, we have provided an example tarball with example source code and a Makefile to run the entire flow. Just run these four commands:

```
wget http://www.rapidwright.io/docs/_downloads/GraalVMExample.tar.gz
tar zxf GraalVMExample.tar.gz
cd GraalVMExample
make
```

For a more in-depth explanation of how this all works, see the rest of the article below.

### 14.1.1 Get Setup

First, navigate to a directory where you would like to install/practice the steps provided in this article. We'll need to install GraalVM and use the [GraalVM Updater](#) to install its native-image package:

```
wget https://github.com/oracle/graal/releases/download/vm-19.0.0/graalvm-ce-linux-amd64-19.0.0.tar.gz
tar zxf graalvm-ce-linux-amd64-19.0.0.tar.gz
export PATH=$PWD/graalvm-ce-19.0.0/bin:$PATH
gu install native-image
```

Next we'll install RapidWright and set `RAPIDWRIGHT_PATH`:

```
git clone https://github.com/Xilinx/RapidWright.git
cd RapidWright
./gradlew compileJava
export RAPIDWRIGHT_PATH=$PWD
```

It turns out that the native compilation feature of GraalVM does not support certain kinds of reflection that are used in Jython, so we need to remove that dependency and associated code in order to create the shared object library:

```
rm $RAPIDWRIGHT_PATH/src/com/xilinx/rapidwright/util/RapidWright.java
rm $RAPIDWRIGHT_PATH/bin/com/xilinx/rapidwright/util/RapidWright.class
rm $RAPIDWRIGHT_PATH/jars/{jython-standalone-2.7.0,jupyter-kernel-jsr223,jeromq-0.3.6,json,junit-4.12}.jar
```

### 14.1.2 Building a Bridge

Now that GraalVM and RapidWright have been installed and prepared, we can focus on building the bridge between Java and our native application. As Java and C/C++ are fundamentally different languages with differing runtimes, some additional effort is needed to enable cross-language APIs callable from C/C++. This article provides an example of how to create a few API wrappers for C/C++. For more advanced usage, we refer the reader to the [GraalVM documentation](#) and [Javadocs](#).

We will choose a couple of RapidWright APIs that we would like to make available in C++. They are the Java methods:

- Device.getDevice(String deviceName)
- Device.getTile(int row, int column)

To expose these two APIs to C/C++ using GraalVM, we need to declare two new methods and annotate them with `@CEntryPoint`. A method annotated with `@CEntryPoint` must meet a few requirements, namely:

1. The Java method must be declared static
2. The `@CEntryPoint` annotation requires the C API name (`name = "functionName"`)
3. The first parameter must be an execution context (`IsolateThread` or `Isolate`)
4. All other parameters must be Java primitive values (int, long, char, ...), C helper classes (`CCharPointer`, `CIntPointer`, ...) or a Java enum annotated with `@CEnumLookup`

Below is an example Java class `RapidWrightAPI.java` that illustrates how these two Java APIs could be implemented to provide the C interface requirements while accessing RapidWright Java functionality. Note that this Java class will need to be compiled with GraalVM as it imports special features from its native-image library.

```
package com.xilinx.rapidwright.examples;

import org.graalvm.nativeimage.IsolateThread;
import org.graalvm.nativeimage.c.function.CEntryPoint;
import org.graalvm.nativeimage.c.type.CCharPointer;
import org.graalvm.nativeimage.c.type.CTypeConversion;

import com.xilinx.rapidwright.device.Device;

public class RapidWrightAPI {

    // Callable from C/C++ under the name given in the annotation
    @CEntryPoint(name = "loadDevice")
    public static void loadDevice(IsolateThread thread, CCharPointer deviceName) {
        // Convert the C string into a Java String
        String devName = CTypeConversion.toJavaString(deviceName);
        System.out.print("Loading device " + devName + "...");
        Device d = Device.getDevice(devName);
        System.out.println("DONE!");
    }

    @CEntryPoint(name = "getTileName")
    public static CCharPointer getTileName(IsolateThread thread, CCharPointer deviceName, int row, int column) {
        String devName = CTypeConversion.toJavaString(deviceName);
        Device d = Device.getDevice(devName);
        // Convert the Java String holding the tile name back into a C string
        return CTypeConversion.toCString(d.getTile(row, column).getName()).get();
    }
}
```

The `loadDevice()` API is redundant because `getTileName()` will also load the device if it is not already in memory; it is included only to provide a second point of illustration. Also note that GraalVM provides a set of utility methods in `CTypeConversion` to convert between Java and C types, such as Java Strings and C `char*`.

### 14.1.3 Ready to Build a .so (Linux Shared Object Library)

Now that we have a few APIs, we can test them out by using GraalVM to compile our example and then create a shared object library and header file as shown in the flow below:



Run the following commands to download the example API code, compile it and create a shared object library using GraalVM:

```

wget http://www.rapidwright.io/docs/_downloads/RapidWrightAPI.java -O $RAPIDWRIGHT_PATH/src/com/xilinx/rapidwright/examples/RapidWrightAPI.java
export CLASSPATH=$RAPIDWRIGHT_PATH/bin:$(find $RAPIDWRIGHT_PATH/jars -name '*.jar' | grep -Ev 'jython|jupyter|win64|jeromq|json|junit' | tr '\n' ':')
javac $RAPIDWRIGHT_PATH/src/com/xilinx/rapidwright/examples/RapidWrightAPI.java -d $RAPIDWRIGHT_PATH/bin
native-image --no-server -cp $CLASSPATH --no-fallback --initialize-at-build-time --shared -H:Name=librapidwright

```

If all goes well, you should now have a `librapidwright.so` and `librapidwright.h` file present in your current directory.

### 14.1.4 Testing it Out

Now for the fun part, we can create a C or C++ application that will make use of the new RapidWright APIs! Here's a small C++ program that prints out a grid of tile names for a given device:

```
#include <cstdio>
#include <iostream>
// This is the header file created by native-image (Graal)
#include <librapidwright.h>

using namespace std;

int main(int argc, char **argv) {
    // This is some Graal boilerplate code
    graal_isolate_t *isolate = NULL;
    graal_isolatethread_t *thread = NULL;

    if (graal_create_isolate(NULL, &isolate, &thread) != 0) {
        fprintf(stderr, "graal_create_isolate error\n");
        return 1;
    }
    // End boilerplate

    int maxRow = 105;
    int maxCol = 105;
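    // The device name is expected as the first command line argument
    if (argc < 2) {
        fprintf(stderr, "usage: %s <device name>\n", argv[0]);
        return 1;
    }
    char *devName = argv[1];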

    // Load the device in RapidWright, the device will be
    // persistent in memory until it is unloaded
    loadDevice(thread, devName);

    // Get tile names based on row/column indices and print out
    // the tile names for a few tiles
    for (int row = 100; row < maxRow; row++) {
        for (int col = 100; col < maxCol; col++) {
            std::cout << "Tile[" << col << "," << row << "] = \""
                      << getTileName(thread, devName, row, col) << "\"" << std::endl;
        }
    }

    // Clean up Graal stuff
    if (graal_detach_thread(thread) != 0) {
        fprintf(stderr, "graal_detach_thread error\n");
        return 1;
    }

    return 0;
}
```

There is some GraalVM boilerplate before and after we use the APIs in RapidWright, but we can compile this with any C++ compiler. The program prints out all the tiles in the grid between the tiles located at (100,100) and (104,104) inclusive, or 25 different tile names. We can compile and run this program by running the following:

```
wget http://www.rapidwright.io/docs/_downloads/RapidWrightExample.cpp
g++ RapidWrightExample.cpp -I. -L. -lrapidwright
export LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH && ./a.out xcvu9p
```

If all goes well, you should see the following output:

```
Loading device xcvu9p...DONE!
Tile[100,100] = "CLEL_R_X10Y803"
Tile[101,100] = "NULL_X101Y832"
Tile[102,100] = "NULL_X102Y832"
Tile[103,100] = "CLEM_X11Y803"
Tile[104,100] = "INT_X11Y803"
Tile[100,101] = "CLEL_R_X10Y802"
Tile[101,101] = "NULL_X101Y831"
Tile[102,101] = "NULL_X102Y831"
Tile[103,101] = "CLEM_X11Y802"
Tile[104,101] = "INT_X11Y802"
Tile[100,102] = "CLEL_R_X10Y801"
Tile[101,102] = "NULL_X101Y830"
Tile[102,102] = "NULL_X102Y830"
Tile[103,102] = "CLEM_X11Y801"
Tile[104,102] = "INT_X11Y801"
Tile[100,103] = "CLEL_R_X10Y800"
Tile[101,103] = "NULL_X101Y829"
Tile[102,103] = "NULL_X102Y829"
Tile[103,103] = "CLEM_X11Y800"
Tile[104,103] = "INT_X11Y800"
Tile[100,104] = "CLEL_R_X10Y799"
Tile[101,104] = "NULL_X101Y828"
Tile[102,104] = "NULL_X102Y828"
Tile[103,104] = "CLEM_X11Y799"
Tile[104,104] = "INT_X11Y799"

```

If you have questions or ideas on how to make better use of GraalVM, please post ideas and questions on the [RapidWright forum](#).

## 14.2 Using RapidWright Directly in Python 3

### 14.2.1 TL;DR

```
pip install rapidwright
```

### 14.2.2 Introduction

Although RapidWright is written in Java, there is significant interest in accessing it from Python. Python has many features that make it a great choice for rapid prototyping and scripting solutions. In fact, RapidWright ships with [Jython](#) (Python implemented in Java) to provide an authentic Python experience.

Despite RapidWright's Jython integration, real-world Python development has transitioned to Python 3 and depends on packages with native implementations that are incompatible with Jython. This has generally excluded RapidWright (with the exception of the experimental [GraalVM's Python](#)) from working directly with Python 3.

However, there is a Python package called [JPype](#) that enables Python to call Java packages directly as if they were native APIs. This tutorial shows how RapidWright can take advantage of this package so that you can use RapidWright directly in your Python projects.

### 14.2.3 Python Virtual Environments

A highly recommended way to develop in Python is to use [Virtual Environments](#). Python Virtual Environments allow you to isolate your Python modules and installation from the default system installation. As each project can have a variety of specific needs and version dependencies, having a dedicated Virtual Environment per project can make for a smoother development experience and minimize conflicts.

### 14.2.4 Pre-requisites

- Python 3
- Java 1.8 or later

### 14.2.5 Setting up a Virtual Python Environment

The Python module used to create a virtual environment is called `venv`. For more details about configuring a virtual environment, please refer to the [`venv` documentation](#). The default settings of a virtual environment can be set up with the following command:

```
python3 -m venv venv
```

This will create a directory called `venv` which will contain the essential ingredients for a Python interpreter and its environment. To activate the virtual environment, run:

```
source venv/bin/activate
```

or on Windows, run:

```
venv\Scripts\activate
```

In either case your terminal prompt should now have a prefix (`venv`). To leave or deactivate the virtual environment, simply run:

```
deactivate
```
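
The same environment can also be created from a Python script through the standard-library `venv` module. This is a minimal sketch. The function name `create_env` is our own, and `with_pip=True` is assumed here to mirror what `python3 -m venv venv` does by default.

```python
import venv
from pathlib import Path


def create_env(env_dir, with_pip=True):
    """Create a virtual environment in env_dir and return its path."""
    builder = venv.EnvBuilder(with_pip=with_pip)
    builder.create(env_dir)
    return Path(env_dir)
```

Calling `create_env("venv")` produces a `venv` directory containing a `pyvenv.cfg` file. Activation still happens in the shell, as described above.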

### 14.2.6 Running RapidWright in the Virtual Environment

Now that the virtual environment is set up, we can begin to experiment with RapidWright. As mentioned in the introduction, `JPype1` is listed as a dependency, so if we simply run:

```
pip install rapidwright
```

It will be installed automatically. Then we can run Python:

```
python
```

To use RapidWright inside the Python interpreter (or a script), all we need to do is simply:

```
import rapidwright
```

On the very first invocation of this import, it will take a few seconds to get things set up. After the first time, it will be faster.

At this point, you can import Java classes to access any RapidWright Java API:

```
from com.xilinx.rapidwright.device import Device
device = Device.getDevice(Device.AWS_F1)
```

At this point you can also get tab-completion on the Java classes, for example:

```
>>> device.
device.AWS_F1                       device.getClockRegions()              device.getPackages()
device.DEVICE_FILE_VERSION          device.getColumns()                   device.getRows()
device.FRAMEWORK_NAME               device.getDevice()                    device.getSLR()
device.FRAMEWORK_NAME_AND_VERSION   device.getDeviceName()                device.getSLRByConfigOrderIndex()
device.KCU105                       device.getDeviceVersion()             device.getSLRs()
device.PYNQ_Z1                      device.getFamilyType()                device.getSeries()
device.QUIET_MESSAGE                device.getMasterSLR()                 device.getSite()
device.RAPIDWRIGHT_MINOR_VERSION    device.getName()                      device.getSiteFromPackagePin()
device.RAPIDWRIGHT_QUARTER_VERSION  device.getNode()                      device.getSitePin()
device.RAPIDWRIGHT_VERSION          device.getNumOfClockRegionRows()      device.getSiteTypeCount()
device.RAPIDWRIGHT_YEAR_VERSION     device.getNumOfClockRegionsColumns()  device.getTile()
device.getClass()                   device.getNumOfSLRs()                 device.getTileTypeCount()
device.getClockRegion()             device.getPIP()                       device.getTiles()
device.getClockRegionFromTile()     device.getPackage()
>>> device.
```

This is quite handy. Return values of simple types (int, String, ...) are translated, while Java objects are preserved and can be accessed through their APIs as well:

```
>>> device.getName()
'xcvu9p'
>>> device.getTiles()
<java array 'com.xilinx.rapidwright.device.Tile[][]'>
```

Although there is limited interaction, you can also run RapidWright GUI applications from Python:

```
>>> from com.xilinx.rapidwright.device.browser import DeviceBrowser
>>> DeviceBrowser.main([])
```

We expect this integration capability with Python to help increase RapidWright's applicability to a wider number of projects. There are more opportunities for integration as well, so stay tuned!



Fig. 1: Screen capture of RapidWright’s Device Browser called from Python

### 14.2.7 Java Development and Python

When you install the Python RapidWright package, it downloads the standalone jar so it can run without any extra setup. However, if you already have a git repo checked out and compiled, you can tell the Python RapidWright package to point to your local install by setting the following environment variables:

```
RAPIDWRIGHT_PATH=<path_to_RapidWright_directory_checked_out_from_GitHub>
CLASSPATH=$RAPIDWRIGHT_PATH/bin:$RAPIDWRIGHT_PATH/jars/*
```

This way, the Python RapidWright package will use your development copy of RapidWright.

### 14.2.8 Things to Know When Using RapidWright in Python

#### Equality

In Java, there are two main ways to check for equality:

1. Reference equality, `==` operator
2. Object equality, `equals()` method

Reference equality checks whether two references point to the same object in memory, whereas `equals()` invokes the method defined by the referenced object's class.

JPype has chosen to map the Python `==` operator to the Java `equals()` method, so the Java `==` operator is not directly accessible. More on this can be found in the [JPype documentation](#).
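To make the distinction concrete, the following plain-Java sketch compares two distinct `String` objects that hold the same characters. It is independent of RapidWright and of the Python bindings; the class name `EqualityDemo` and its helper methods are purely illustrative.

```java
// Illustrative only: contrasts Java reference equality with object equality.
final class EqualityDemo {

    // True only when both variables refer to the very same object.
    static boolean sameReference(Object a, Object b) {
        return a == b;
    }

    // Delegates to the equals() method defined by a's class.
    static boolean sameValue(Object a, Object b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        // Two distinct String objects holding identical characters.
        String first = new String("xcvu9p");
        String second = new String("xcvu9p");

        System.out.println(sameReference(first, second)); // false
        System.out.println(sameValue(first, second));     // true
        System.out.println(sameReference(first, first));  // true
    }
}
```

Given the mapping described above, a Python `==` comparison of two Java objects would be expected to behave like `sameValue` here rather than `sameReference`.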

## 14.3 Setup JUnit 5 Tests in RapidWright

RapidWright uses JUnit 5 for unit testing. This article gives an overview of how to run the tests, as well as how to write your own.

### 14.3.1 Running the Tests

All testcases are located in the `test/` directory. JUnit does not need a central list of testcases. Instead, it searches the directory for all classes that contain tests. Tests are marked by annotations (see *Writing Testcases*). After building a list of all tests, it executes them one by one.

Some tests depend on DCPs which are stored in a Git submodule — a feature that allows a specific commit of another Git repository to exist as a subdirectory of the current repository. To check out the specific commit of a submodule, run:

```
git submodule update --init
```

from the parent RapidWright repository where `--init` is only strictly necessary (but harmless otherwise) on the first invocation.

To run the tests via Gradle, use the task `test` or `build` (which depends on `test`). After running the tests, Gradle will output the results both as an HTML document in `build/reports/tests/test*` and as JUnit-internal XML in `build/test-results/test*`. Note that Gradle knows whether the input to the tests changed and will not rerun them if they are up to date.

There is integration for JUnit in all major IDEs. When loading RapidWright into your IDE, you should set `test` as the source directory for tests. Your IDE should allow you to run all the tests or choose a single class to run. Alternatively, you can execute Gradle from the command line with the `testJava` or `testPython` task to run all tests, or restrict the run to specific tests with one or more `--tests <filter>` arguments. For example:

```
./gradlew testJava --tests com.xilinx.rapidwright.design.* --tests *PartNameTools.testGetPartCase
```

would run all Java test methods under all classes within the `com.xilinx.rapidwright.design` package, as well as the single test method `testGetPartCase` from the `com.xilinx.rapidwright.device.TestPartNameTools` class. Note that the `test` task depends on `testJava` and `testPython` but does not support filtering.

### 14.3.2 Writing Testcases

JUnit uses annotations to tag methods as testcases. While there are more specialized annotations, most testcases will be tagged with the annotation `@Test` (from the `org.junit.jupiter.api` package).

A test class with a single (empty) test method might look like this:

```
import org.junit.jupiter.api.Test;

public class MyTestClass {
    // An empty test passes because no exception is thrown
    @Test
    public void test() {
    }
}
```

Test methods should be `public`, must not be `static`, and must not have parameters. JUnit creates the instance of the class itself, so the class's constructor cannot take parameters.

Testcases communicate failures by throwing an exception. JUnit will then mark it accordingly. Instead of using an `if` to check for something and then manually creating an exception, you can use the `Assertions` class (from the package `org.junit.jupiter.api`). It offers convenience methods for often used checks:

- `assertEquals`
- `assertArrayEquals`
- `assertNotEquals`
- `assertSame`

All these methods have a parameter for an expected value and an actual value. Optionally, a message parameter can be passed to explain what part of the test encountered an error.

A very simple test to check that addition works as expected might look like this:

```
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;

public class MyTestClass {
    @Test
    public void test() {
        // Arguments are (expected, actual)
        Assertions.assertEquals(2, 1 + 1);
    }
}
```

### 14.3.3 Parameterized Tests

Normal test methods do not have parameters. If you want to run the same test on a range of data, you can use a loop. However, once the test fails for one set of data, the whole testcase execution is over. Data after the first failure will not be run.
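The drawback of the loop approach can be illustrated with a plain-Java sketch that does not use JUnit. The class `LoopFailureDemo` and its `checkNonzero` helper are hypothetical stand-ins for a test method and an assertion: once one value fails, the remaining values are never examined.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a single test method that loops over its data.
final class LoopFailureDemo {

    // Stand-in for an assertion: throws if the value is zero.
    static void checkNonzero(int value) {
        if (value == 0) {
            throw new AssertionError("value must be nonzero");
        }
    }

    // Returns the values that were examined before the first failure
    // aborted the loop (or all values if nothing failed).
    static List<Integer> runInLoop(int[] data) {
        List<Integer> examined = new ArrayList<>();
        try {
            for (int value : data) {
                examined.add(value);
                checkNonzero(value);
            }
        } catch (AssertionError e) {
            // In a real test, the exception would end the testcase here.
        }
        return examined;
    }

    public static void main(String[] args) {
        // The 0 fails, so 3 and 4 are never examined.
        System.out.println(runInLoop(new int[] {1, 2, 0, 3, 4})); // [1, 2, 0]
    }
}
```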

JUnit allows parameters on testcases. They are marked with `@ParameterizedTest` instead of `@Test`. The annotation has an optional parameter (`name`) that allows you to override the generated test's name to make it more descriptive.

You need to specify a source of values for these parameters. One option is to use a separate method that returns a `Collection<Arguments>` or `Stream<Arguments>`. One instance of `Arguments` describes one invocation of the testcase method. The value source is specified as another annotation (here: `@MethodSource`).

A simple example that calls `testNonzero(int i)` on all numbers from 1 to 10:

```
import java.util.stream.IntStream;
import java.util.stream.Stream;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.Arguments;
import org.junit.jupiter.params.provider.MethodSource;

public class MyTestClass {
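    // {0} in the name is replaced by the first argument of each invocation
    @ParameterizedTest(name = "Check that {0} is nonzero")
    // Without an explicit name, JUnit looks for a static factory method
    // with the same name as the test method (here: testNonzero)
    @MethodSource()
    public void testNonzero(int i) {
        Assertions.assertNotEquals(0, i);
    }

    // Supplies one Arguments instance per invocation: the values 1 to 10
    public static Stream<Arguments> testNonzero() {
        return IntStream.rangeClosed(1, 10).mapToObj(i -> Arguments.of(i));
    }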
}
```

### 14.3.4 RapidWright-specific Considerations

RapidWright's tests are automatically run on GitHub Actions. There is a rather strict limit on maximum memory (7 GB), and some parts of RapidWright can exceed that limit. You should keep this limitation in mind while writing testcases:

- Testcases should be limited to a single Device. If you have to use multiple Devices, take care that only one Device is referenced at the same time.
- When instantiating a Design, use a small Device for it.

To identify issues with files being left open, there is a JUnit extension that compares the list of open files before and after a testcase. It will fail the testcase if there are changes. This extension (`com.xilinx.rapidwright.support.CheckOpenFilesExtension`) is automatically registered when JUnit tests are run with Gradle.

#### Testcase DCPs

To add tests that require new DCP(s), you will need to fork the [RapidWrightDCP](#) repository to gain write permissions.

The DCP(s) to be added should have no encrypted components inside and the EDIF inside the DCP should be readable (not encrypted). A readable EDIF file can be generated using Vivado either [automatically](#) upon load in RapidWright or via `write_edif` (see *RapidWright and Design Checkpoint Files*). Use the `ReplaceEDIFInDCP` tool to replace the EDIF inside a DCP, for example:

```
rapidwright ReplaceEDIFInDCP design.dcp readable_design.edf
```

will replace the EDIF file inside `design.dcp` with `readable_design.edf`.

Next, execute the following:

```
# Add, commit, push new DCP(s) into new branch on fork
cd test/RapidWrightDCP
git remote add fork https://github.com/<user>/RapidWrightDCP # Only necessary on first invocation
git checkout -b <branch>
git add <dcp_name>
git commit
git push -u fork <branch>
cd ../../

# Commit new submodule reference
git commit test/RapidWrightDCP -s -m "(Description)"
```

The submodule can now be used as a regular Git repository during development; remember to commit new submodule references from the RapidWright repository using:

```
git commit test/RapidWrightDCP -s -m "(Description)"
```

Once ready, please create new pull requests in both the upstream RapidWright and RapidWrightDCP repositories. When both pull requests have been approved, the following situation will be present:

```
RapidWrightDCP (upstream) ... o--o-----x
                           \           /
                           (PR#123)
RapidWrightDCP (fork)      ... o--o ... o--o
                           ^ (commit `abc`)

RapidWright (upstream) ... o--o-----x
                           \           /
                           (PR#456)
RapidWright (fork)      ... o--o ... o--o
                           ^ (submodule refers to commit `abc`
                           on RapidWrightDCP fork)
```

Here, RapidWright's PR#456 refers to commit `abc`, which is present only on the fork. The expectation is that RapidWrightDCP's PR#123 is merged first, after which PR#456 can update its RapidWrightDCP submodule reference to include upstream's newly merged result:

```
RapidWrightDCP (upstream) ... o--o-----o (commit `def` including
                           \           /   PR#123)
RapidWrightDCP (fork)      ... o--o ... o--o

RapidWright (upstream) ... o--o-----o
                           \           /   / (PR#456)
RapidWright (fork)      ... o--o ... o--o--o
                           ^ (submodule updated to commit `def`
                           on RapidWrightDCP upstream)
```

This submodule reference can be updated back to upstream as follows:

```
# Return submodule to upstream master
cd test/RapidWrightDCP
git checkout master
git pull
cd ../..

# Commit new submodule reference
git commit test/RapidWrightDCP
```

## 14.4 RapidWright Data Files

### Table of Contents

- *RapidWright Data Files*
  - *On-demand Data File Downloads*
  - *Local Storage of Data Files*
  - *Avoiding On-demand Download & Generation of Data Files*

RapidWright maintains support for the full set of devices publicly available in the latest Vivado release. The information needed to populate RapidWright device models is stored in binary data files distributed with RapidWright. Starting in the 2021.1.0 release, these data files began to be distributed via a download on-demand model. This was done to accelerate installation, reduce disk space requirements and provide an easier path to upgrade.

### 14.4.1 On-demand Data File Downloads

All of the code involved in downloading and checking for data files is in the open source portion of RapidWright. Most of the code is found in `com.xilinx.rapidwright.util.FileTools`. All data files are specified by an MD5 checksum, with a master list checked in at `src/com/xilinx/rapidwright/util/DataVersions.java`. When the user calls an API that requires a RapidWright data file, RapidWright checks the MD5 of the local file against the entry in `DataVersions.java` to ensure they match. For speed, RapidWright caches the current data file's MD5 in a small file with a `.md5` extension alongside the data file. If the data file is missing or does not match the expected MD5, RapidWright attempts to download it. This happens behind the scenes, transparently to the user, except that the first call takes a bit longer because the file is being downloaded.
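The checksum comparison described above can be sketched with the Java standard library alone. This is a minimal illustration, not the code in `FileTools`: the class `Md5CheckSketch`, its method names, and the convention of appending `.md5` to the full data file name are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch; this is not RapidWright's FileTools implementation.
final class Md5CheckSketch {

    // Returns the MD5 checksum of the given bytes as a lowercase hex string.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // Returns true if dataFile exists and its (possibly cached) MD5 matches
    // expectedMd5. The "<name>.md5" sidecar naming is an assumption.
    static boolean isUpToDate(Path dataFile, String expectedMd5) {
        if (!Files.isRegularFile(dataFile)) {
            return false; // missing file: a download would be needed
        }
        Path sidecar = dataFile.resolveSibling(dataFile.getFileName() + ".md5");
        try {
            String actual;
            if (Files.isRegularFile(sidecar)) {
                // Fast path: reuse the cached checksum
                actual = new String(Files.readAllBytes(sidecar), StandardCharsets.UTF_8).trim();
            } else {
                // Slow path: hash the data file and cache the result
                actual = md5Hex(Files.readAllBytes(dataFile));
                Files.write(sidecar, actual.getBytes(StandardCharsets.UTF_8));
            }
            return actual.equalsIgnoreCase(expectedMd5);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dataFile = Files.createTempFile("example", ".dat");
        Files.write(dataFile, "abc".getBytes(StandardCharsets.UTF_8));
        // MD5("abc") is a well-known test vector
        System.out.println(isUpToDate(dataFile, "900150983cd24fb0d6963f7d28e17f72")); // true
    }
}
```

Caching the checksum in a sidecar file avoids re-hashing large data files on every call, at the cost of trusting that the sidecar still describes the file next to it.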

If desired, a user can turn off the on-demand data file download feature by calling `FileTools.setOverrideDataFileDownload(true)` at the start of their RapidWright program.

### 14.4.2 Local Storage of Data Files

RapidWright data files are stored in two ways depending on how RapidWright has been installed.

#### Standalone Jar (Binary)

If RapidWright is installed using the standalone jar downloaded directly from a GitHub release or a Python pip install, the files are located in an OS-specified user directory:

- For Windows, `%USER%\AppData\Roaming\RapidWright` or a path set by the environment variable `APPDATA`

- For Linux, `~/.local/share/RapidWright` or a path set by the environment variable `XDG_DATA_HOME`

It should be noted that the first time RapidWright is invoked using the standalone jar method, it will unpack a minimal set of data files that were included with the standalone jar to the directory cited above.

#### GitHub Clone (Source Code)

If RapidWright is installed by a clone of the GitHub repository (or a snapshot of the source code), the default directory is the directory created by the clone of the code (`./RapidWright`).

#### Override Data File Location

Both standalone jar and GitHub clone options can be overridden by setting the environment variable `RAPIDWRIGHT_PATH`. This will avoid the creation of the default OS/user specific directories.
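For the standalone jar case, the lookup order described in this section can be summarized in a short sketch. It is not RapidWright's actual code: the class `DataDirSketch` and its methods are hypothetical, the Windows fallback uses the user's home directory, and appending a `RapidWright` folder to the `APPDATA` or `XDG_DATA_HOME` value is an assumption. The environment is passed in as a map so that the logic can be exercised without changing real environment variables.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the lookup order; not RapidWright's actual code.
final class DataDirSketch {

    // Returns env.get(name), or null if the variable is unset or empty.
    static String envOrNull(Map<String, String> env, String name) {
        String value = env.get(name);
        return (value == null || value.isEmpty()) ? null : value;
    }

    // Resolves the base directory for a standalone jar install.
    static String resolve(Map<String, String> env, String osName, String userHome) {
        // 1. An explicit override always wins
        String override = envOrNull(env, "RAPIDWRIGHT_PATH");
        if (override != null) {
            return override;
        }
        // 2. Windows: APPDATA, else the roaming folder under the user's home
        if (osName.toLowerCase(Locale.ROOT).startsWith("windows")) {
            String appData = envOrNull(env, "APPDATA");
            String base = (appData != null) ? appData : userHome + "\\AppData\\Roaming";
            return base + "\\RapidWright"; // appended folder name is assumed
        }
        // 3. Linux: XDG_DATA_HOME, else ~/.local/share
        String xdg = envOrNull(env, "XDG_DATA_HOME");
        String base = (xdg != null) ? xdg : userHome + "/.local/share";
        return base + "/RapidWright"; // appended folder name is assumed
    }

    public static void main(String[] args) {
        System.out.println(resolve(System.getenv(),
                System.getProperty("os.name"), System.getProperty("user.home")));
    }
}
```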

### 14.4.3 Avoiding On-demand Download & Generation of Data Files

Two potential challenges exist for on-demand data file download and generation:

1. Lack of persistent Internet connectivity
2. Collisions due to independent, parallel instances of RapidWright runs

To alleviate the need for Internet access, the easiest option is to invoke the API `FileTools.updateAllDataFiles()` when Internet connectivity is available. After successful completion of calling this method, every potential data file that RapidWright could download will have been downloaded on the local system. To run this method from the command line run:

```
rapidwright jython -c 'FileTools.updateAllDataFiles()'
```

Note that this does not generate device cache files, and generating them can also cause collisions if independent RapidWright instances are run simultaneously.

To eliminate file download/generation collisions, the API `FileTools.ensureDataFilesAreStaticInstallFriendly(String... devices)` has been created. Due to the overhead of generating a device cache file for each device, the user can specify the specific devices anticipated during future runs. As an example, to run this API from the command line for the `xc7a100t` and `xc7a200t` devices, run:

```
rapidwright jython -c 'FileTools.ensureDataFilesAreStaticInstallFriendly("xc7a100t", "xc7a200t")'
```

Another option to avoid on-demand download is to obtain the `rapidwright_data.zip` file associated with the current release (see assets from the corresponding [GitHub Releases](https://github.com/Xilinx/RapidWright/releases)) and replace the data directory in the RapidWright directory with its contents.

---

**Note:** Due to GitHub size limitations, from the 2022.1.0 release through the 2022.2.3 release, the data files were split into two downloads (`rapidwright_data.zip` and `rapidwright_data2.zip`). In 2023.1.0, we switched to Zstandard compression for all of our data files, which allowed the release to be consolidated back into a single zip file.

---

## FREQUENTLY ASKED QUESTIONS

## 15.1 I can't open my DCP in RapidWright, I get 'ERROR: Couldn't determine a proper EDIF netlist to load with the DCP file ...', what should I do?

RapidWright is able to read any unencrypted design file. If a design/DCP has been encrypted, you'll need to generate a new file without encryption in order to use it with RapidWright.

However, Vivado will sometimes encrypt the EDIF file present in a DCP automatically, even when encryption was not explicitly requested (this is quite common). To enable reading the DCP within RapidWright, load the DCP in Vivado and then create a similarly named EDIF file (mydesign.dcp → mydesign.edf) by running the command `write_edif mydesign.edf`. This will generate an unencrypted EDIF file (only if encryption is turned off and the design does not contain any encrypted IPs) that RapidWright can recognize and load in with the rest of the DCP.

RapidWright comes with a small utility called `ReplaceEDIFInDCP` that can avoid the use of two files for situations that may require that convenience.

New in 2021.1.0, RapidWright can invoke Vivado automatically to call `write_edif` on the DCP being loaded at runtime. However, a compatible Vivado version must be on the PATH of the RapidWright program at runtime.

## 15.2 Can RapidWright be used for designs targeting the AWS F1 platform?

Yes, there are some ways in which parts of a design generated in RapidWright can be inserted into an existing AWS-F1 design. One technique uses the Vivado command `read_checkpoint -cell <cell_instance_name> <checkpoint.dcp>`. If you insert a blackbox that matches your DCP (see the stub files inside the DCP file) into your AWS-F1 design, you can use the `read_checkpoint` command to pull in a synthesized, placed and/or routed DCP into the existing design.

Note that RapidWright cannot read in the AWS F1 shell design as it is encrypted and user design data is encrypted by default.

## 15.3 When should I use RapidWright and when should I use Vivado?

We recommend that Vivado be used for all tasks where it meets the user's expectations. If you have designs that are running successfully and meeting your design constraints, there is no need to use RapidWright. However, if you are seeking to improve performance and/or productivity because of unique insights you might have into your application and/or the FPGA architecture being targeted, RapidWright might be able to help. Vivado will always be part of the flow for validating designs (DRC/Timing) and creating bitstreams. However, there may be strategic design structures that can be created, preserved and/or replicated in RapidWright that might help you achieve your performance goals.

## 15.4 What languages does RapidWright support, and how do I interact with them?

RapidWright is written in Java. RapidWright is also packaged with a Python interpreter called [Jython](#) that enables it to run pure Python scripts and code. For more compute-intensive work, we recommend Java as the language of choice because it will execute faster. Python is especially useful for interacting with RapidWright in a command-line fashion. This allows device and design objects to remain persistent as the user examines their work and chooses to make changes on the fly.

For C and C++, we have a tech article (see [Call RapidWright from C/C++ Using GraalVM](#)) that describes how you can create a RapidWright shared object library enabling APIs to be called from C or C++ using a compiler framework called [GraalVM](#).

## 15.5 Why is the framework called RapidWright?

The ‘Rapid’ portion indicates speed and efficiency. It also echoes the name of a previous-generation framework called RapidSmith. The ‘Wright’ portion is a common English surname that means maker or builder. RapidWright is a framework to help you quickly build designs for Vivado.

## 15.6 Can RapidWright generate bitstreams?

No. There is currently no bitstream information in RapidWright. Any designs will need to be put back into Vivado for DRC and bitstream generation.

## 15.7 Does RapidWright provide device timing information?

RapidWright now includes a lightweight timing model for UltraScale+ devices (see [RapidWright Report Timing Example](#)). For other devices, timing results can be obtained by exporting a design using the `Design.writeCheckpoint()` command and loading the design in Vivado to report timing.

## 15.8 Does RapidWright support partial reconfiguration (PR)?

RapidWright does not have specific support for PR, but it can be used to generate designs or partial designs intended to be partially reconfigured. This can be done by generating designs and then importing them into PR-based projects in Vivado using `read_checkpoint -cell <cell_name> <dcp_name>`, where the cell is a black box.

## 15.9 Is there any published work on RapidWright?

Yes, we had a paper at [FCCM 2018](#) (The 26th IEEE International Symposium on Field-Programmable Custom Computing Machines). A preprint copy of the paper is available here: [FCCM18-RapidWright.pdf](#). The presentation slides are available here: FCCM18-RapidWright-Presentation.pdf.



---

CHAPTER  
SIXTEEN

---

## GLOSSARY

**Laguna** When a device is composed of multiple dies (using [SSIT](#)), CLBs are replaced with Laguna Tiles and Sites to provide dedicated logic for crossing from one die to the next. Laguna sites contain dedicated RX and TX flip flops that connect to [SLLs](#).

**Shell** A static FPGA design that provides a common interface to off-chip resources (DDR, PCIe,...) intended for multiple applications.

**SLL** Super long line, these are the wires that cross between dies in a multi-die device (see [SSIT](#)).

**SLR** Super logic region: in multi-die devices, each super logic region is one die connected to the other dies via an interposer. The routing wires that connect these SLRs are [SLLs](#). Also see [SLR \(Super Logic Region\)](#).

**SSIT** Stacked silicon interconnect technology: Xilinx uses an interposer substrate to package multiple FPGA dies into a single package.

**Tile Pattern** A sequence of tile types. For example, the 7 series device family has four types of CLB tiles (CLBLL\_L, CLBLL\_R, CLBM\_R, CLBM\_L). A pblock can cover several tile columns, thus spanning several heterogeneous tile types. If the logic implemented within the pblock is relocated on the device, it must use the same tile pattern, meaning the sequence of CLB tile types must match.


