

# The RISC-V Berkeley Out-of-Order Machine:

## An update on BOOM and the wider ecosystem (Chisel, FIRRTL, & Rocket-chip)

Christopher Celio

ORCONF

2016 October

`celio@eecs.berkeley.edu`

<http://ucb-bar.github.io/riscv-boom>



Berkeley  
Architecture  
Research



# An Update on the Berkeley Architecture Research Infrastructure (Oh, and BOOM is cool too.)




# What is BOOM?

- superscalar, out-of-order RISC-V processor written in Berkeley's Chisel hardware construction language (HCL)
- It is synthesizable
- It is parameterizable
- It is open-source





# How do you make an OoO processor?

- Start with a new hardware construction language
  - Verilog is awful and will just get in the way.
- Start with a working processor.
  - Way easier than writing everything from scratch!
  - PTWs, FPUs, uncore, devices, off-chip IOs are unglamorous.



# It takes a village.

- RISC-V ISA
  - very out-of-order friendly!
- Chisel hardware construction language
  - object-oriented, functional programming
- FIRRTL (**brand new!**)
  - exposed RTL intermediate representation (IR)
- Rocket-chip
  - A full working SoC platform built around the Rocket in-order core
- Thanks to:
  - Krste Asanović, Rimas Avizienis, Jonathan Bachrach, Scott Beamer, David Biancolin, Christopher Celio, Henry Cook, Palmer Dabbelt, John Hauser, Adam Izraelevitz, Sagar Karandikar, Ben Keller, Donggyu Kim, Jack Koenig, Jim Lawson, Yunsup Lee, Richard Lin, Eric Love, Martin Maas, Chick Markley, Albert Magyar, Howard Mao, Miquel Moreto, Quan Nguyen, Albert Ou, David A. Patterson, Brian Richards, Colin Schmidt, Wenyu Tang, Stephen Twigg, Huy Vo, Andrew Waterman, Angie Wang, and more...

# The Rocket-Chip SoC Generator Ecosystem



- BOOM is a piece of the Rocket-chip ecosystem
- Started in 2011
- taped out **10** (12?) times by Berkeley
- runs at **1.65 GHz** in IBM 45nm
- MMU supports page-based virtual memory
- IEEE 754-2008 compliant FPU
  - supports SP, DP FMA with hw support for subnormals
- cache coherent, non-blocking L1 data cache, L2 cache, and more



Rocket 5-stage pipeline



<https://github.com/ucb-bar/rocket-chip>

<https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-17.html>

# BOOM fits into Rocket-chip SoC



# Rocket-chip Updates

- The devs graduated!
  - started SiFive start-up to support RISC-V, rocket-chip and design custom chips around them
- work in progress on a new Rocket-chip Foundation
- new generation of Berkeley student devs!
  - Berkeley is committed to open-source Rocket-chip
  - (our research depends on it!)
- 37 contributors
  - SiFive, Berkeley, LowRISC, Boston U., and more

the vision...



Rocket-chip





# Rocket-chip Updates

- committed to open-source since Oct 2011
  - 3588 commits!!!
  - 310 commits last four weeks!
- Better documentation
  - <https://dev.sifive.com/documentation/u5-coreplex-series-manual/>
- removed git submodules
  - speed up development (reduce merge headaches)
- RISC-V Privileged Spec v1.9
- uncached loads/stores (memory-mapped IO)
- RISC-V External Debug
  - breakpoints
  - stand-alone boot!
  - non-standard HTIF (host/target interface) has been removed!
- More configurations
  - RV32 support + M/A/F as options
  - RISC-V Compressed ISA support
  - blocking data cache (MSHRs = 0)
- Updated to Chisel3
  - No more C++ backend, only Verilog is emitted.
  - we use Verilator in the build system for free and fast simulation.
- And a lot more...
  - tilelink updates, multi-clock domains, I/Os, devices, ...
  - I'm having trouble keeping up!



# Chisel

- Hardware Construction Language embedded in **Scala**
- not a high-level synthesis language
- hardware module is a data structure in Scala
- Full power of Scala for writing generators
  - object-oriented programming
    - factory objects, traits, overloading
  - functional programming
    - higher-order functions, anonymous functions, currying
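
To make "a hardware module is a data structure" concrete, here is a minimal plain-Scala sketch. It does not use Chisel. The names `Node`, `Input`, `Op`, and `TreeGen` are hypothetical and exist only for this illustration. It shows a curried, higher-order generator that builds a balanced reduction tree from any number of inputs.

```scala
// Plain-Scala illustration (not the Chisel API): a circuit is a data
// structure, and a "generator" is an ordinary function that builds it.
sealed trait Node
final case class Input(name: String) extends Node
final case class Op(kind: String, a: Node, b: Node) extends Node

object TreeGen {
  // Curried, higher-order generator: combine neighbouring nodes level by
  // level with a caller-supplied function until one root remains.
  def reduceTree(leaves: Seq[Node])(combine: (Node, Node) => Node): Node = {
    require(leaves.nonEmpty, "need at least one leaf")
    if (leaves.size == 1) leaves.head
    else {
      val nextLevel = leaves.grouped(2).map {
        case Seq(a, b) => combine(a, b)
        case Seq(a)    => a // an odd element passes through to the next level
      }.toSeq
      reduceTree(nextLevel)(combine)
    }
  }

  // Number of two-input operators in the generated structure.
  def opCount(n: Node): Int = n match {
    case Input(_)    => 0
    case Op(_, a, b) => 1 + opCount(a) + opCount(b)
  }

  // Longest path from the root to any input, counted in operators.
  def depth(n: Node): Int = n match {
    case Input(_)    => 0
    case Op(_, a, b) => 1 + math.max(depth(a), depth(b))
  }

  def main(args: Array[String]): Unit = {
    val inputs = (0 until 8).map(i => Input(s"in$i"))
    val adderTree = reduceTree(inputs)(Op("add", _, _))
    println(s"adders = ${opCount(adderTree)}, depth = ${depth(adderTree)}")
  }
}
```

Changing the input count or the combining function yields a different structure from the same generator, which is the point of writing generators rather than fixed instances.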





# Chisel3 (frontend) and FIRRTL (backend)

## Goal

- turn research-ware into a quality compiler platform
- open up the IR for hardware designers to write their own IR transform passes

## Chisel3

- embedded in Scala
- generates FIRRTL RTL code

## FIRRTL (Flexible IR for RTL)

- serves as an IR for hardware (i.e., LLVM for hw!)
- generates Verilog

## Success!

- can add your own transformations
- you can throw away Chisel and write your own front-end
- <https://github.com/ucb-bar/chisel3> (alpha version)
- <https://github.com/ucb-bar/firrtl> (alpha version)
  - **Spec:** <https://github.com/ucb-bar/firrtl/tree/master/spec>

# FIRRTL IR Passes



- Code coverage
  - how much of my circuit is being exercised?
- Scan chain insertion
- SRAM
  - FPGAs vs ASIC memories
  - re-sizing (transforming skinny/tall to rectangular)
- early statistics gathering (e.g., gate count)
  - synthesis tools take a long time...
- Decoupling target time from host time
  - e.g., adding a "stop-the-world" button
- assertion support (TBA)
  - turning simulator assertions into a real HW assertion
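
As a rough picture of what an IR pass looks like, the sketch below defines a toy expression IR in plain Scala. It is not FIRRTL's actual data model or API, and `Expr`, `Circuit`, `ToyPasses`, `constFold`, and `opCounts` are hypothetical names. It has one transform pass (constant folding) and one analysis pass (an operator count, in the spirit of the early statistics gathering listed above).

```scala
// Toy IR for illustration only; FIRRTL's real IR is richer than this.
sealed trait Expr
final case class Ref(name: String) extends Expr
final case class Lit(value: BigInt) extends Expr
final case class Prim(op: String, args: List[Expr]) extends Expr

final case class Connect(target: String, expr: Expr)
final case class Circuit(stmts: List[Connect])

object ToyPasses {
  // A transform pass maps a circuit to a new circuit.
  type Pass = Circuit => Circuit

  // Helper: apply a rewrite bottom-up to every expression node.
  def mapExpr(e: Expr)(f: Expr => Expr): Expr = e match {
    case Prim(op, args) => f(Prim(op, args.map(a => mapExpr(a)(f))))
    case leaf           => f(leaf)
  }

  // Transform pass: fold an "add" of two literals into a single literal.
  val constFold: Pass = c => Circuit(c.stmts.map { s =>
    s.copy(expr = mapExpr(s.expr) {
      case Prim("add", List(Lit(a), Lit(b))) => Lit(a + b)
      case other                             => other
    })
  })

  // Analysis pass: count primitive operators by kind, a cheap estimate
  // available long before a synthesis run finishes.
  def opCounts(c: Circuit): Map[String, Int] = {
    def ops(e: Expr): List[String] = e match {
      case Prim(op, args) => op :: args.flatMap(ops)
      case _              => Nil
    }
    c.stmts.flatMap(s => ops(s.expr)).groupBy(identity).map { case (k, v) => k -> v.size }
  }
}
```

Because each transform has the same `Circuit => Circuit` shape, passes compose by ordinary function composition.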

# New & Upcoming Chisel Features



- parameterized Verilog blackbox support
- Analog types
  - represents outside wires that are "not digital"
  - e.g., connecting **inout** I/Os to blackboxes
- multi-clock domain support
  - fixed-ratio has first-class citizen support
  - you provide your own clock domain crossings\*
  - e.g., async FIFOs
- \*async reset
  - TBD, but with an eye towards multi-clock support
- DSP support
  - FixedPoint type
  - can provide value ranges, not bit widths
- Annotations
  - allow FIRRTL back-end to get additional information from Chisel

# Chisel2



- provides compatibility warnings to help migration to Chisel3
- migration documentation is available
- End of life...?
- if you use Chisel2 and depend on it let us know!

# Questions on BAR?



# What is BOOM?



- superscalar, out-of-order RISC-V processor written in Berkeley's Chisel RTL
- It is synthesizable
- It is parameterizable
- It is open-source
- started in **2012**
- **10k** lines of code
- implements **RV64G**



2-wide BOOM (16kB/16kB) 1.2mm<sup>2</sup> @ 45nm



- PRF
  - explicit renaming
  - holds speculative and committed data
  - holds both x-reg, f-reg
- Unified Issue Window
  - holds all instructions
- split ROB/issue window design

# Benefits of using Chisel & Rocket-chip



- **~10,000 loc** in BOOM github repo
- plus an additional **~12,000 loc** instantiated from other libraries
  - **~5,000 loc** from Rocket core repository
    - 90 (integer ALU)
    - 150 (unpipelined mul/div)
    - 550 (floating point units)
    - 1,000 (non-blocking datacache)
    - 300 (icache)
    - 300 (next line predictor/BTB/RAS)
    - 200 (decoder minimization logic)
    - 200 (page-table walker)
    - 200 (TLB)
    - 400 (control/status register file)
    - 300 (instruction definitions + constants)
  - **~4,500 loc** from uncore
    - coherence hubs, L2 caches, networks
  - **~2,000 loc** from hardfloat
    - floating point hard units

# Parameterized Superscalar

**Dual-issue (5r,3w)**



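```
import scala.collection.mutable.ArrayBuffer

// Collect the execution units that make up the back end.
val exe_units = ArrayBuffer[ExecutionUnit]()
// Unit 0: ALU that also resolves branches and hosts the FPU and multiplier.
exe_units += Module(new ALUExeUnit(is_branch_unit = true,
                                   has_fpu = true,
                                   has_mul = true))
// Unit 1: ALU combined with the memory unit, plus the divider.
exe_units += Module(new ALUMemExeUnit(fp_mem_support = true,
                                      has_div = true))
```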

**OR**

**Quad-issue (9r,4w)**

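```
// Four units: a branch-resolving ALU, an ALU with the FPU and multiplier,
// an ALU with the divider, and a dedicated memory unit.
exe_units += Module(new ALUExeUnit(is_branch_unit = true))
exe_units += Module(new ALUExeUnit(has_fpu = true,
                                   has_mul = true))
exe_units += Module(new ALUExeUnit(has_div = true))
exe_units += Module(new MemExeUnit())
```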
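
One plausible reading of the `(5r,3w)` and `(9r,4w)` labels is that the register file needs the sum of each execution unit's read and write ports. The sketch below checks that arithmetic in plain Scala. The per-unit port counts are assumptions made for this illustration, not values stated in the slides, and `UnitPorts` and `PortCount` are hypothetical names.

```scala
// Hypothetical summary of one execution unit's register-file ports.
final case class UnitPorts(name: String, reads: Int, writes: Int)

object PortCount {
  // Assumption: an FPU needs a third read port for fused multiply-add operands.
  def alu(hasFpu: Boolean = false): UnitPorts =
    UnitPorts("ALUExeUnit", if (hasFpu) 3 else 2, 1)

  // Assumption: the combined ALU+memory unit has separate write ports for
  // the ALU result and the load result.
  val aluMem: UnitPorts = UnitPorts("ALUMemExeUnit", 2, 2)

  // Assumption: a stand-alone memory unit reads two operands, writes one result.
  val mem: UnitPorts = UnitPorts("MemExeUnit", 2, 1)

  // Total (read, write) ports across a set of execution units.
  def total(units: Seq[UnitPorts]): (Int, Int) =
    (units.map(_.reads).sum, units.map(_.writes).sum)

  val dualIssue: Seq[UnitPorts] = Seq(alu(hasFpu = true), aluMem)
  val quadIssue: Seq[UnitPorts] = Seq(alu(), alu(hasFpu = true), alu(), mem)
}
```

Under these assumptions `PortCount.total(PortCount.dualIssue)` gives `(5, 3)` and `PortCount.total(PortCount.quadIssue)` gives `(9, 4)`, matching the two labels.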

# ARM Cortex-A9 vs. RISC-V BOOM



| Category             | ARM Cortex-A9                            | RISC-V BOOM-2w                         |
|----------------------|------------------------------------------|----------------------------------------|
| ISA                  | 32-bit ARMv7                             | 64-bit RISC-V v2 (RV64G)               |
| Architecture         | 2-wide, 3+1 issue, out-of-order, 8-stage | 2-wide, 3 issue, out-of-order, 6-stage |
| Performance          | **3.59** CoreMarks/MHz                   | **4.61** CoreMarks/MHz                 |
| Process              | TSMC 40GPLUS                             | TSMC 40GPLUS                           |
| Area with 32K caches | 2.5 mm<sup>2</sup>                       | 1.00 mm<sup>2</sup>                    |
| Area efficiency      | 1.4 CoreMarks/MHz/mm<sup>2</sup>         | 4.6 CoreMarks/MHz/mm<sup>2</sup>       |
| Frequency            | 1.4 GHz                                  | 1.5 GHz                                |

Caveats: the A9 includes NEON; BOOM is 64-bit and has IEEE 754-2008 fused multiply-add.
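
The area-efficiency row is the performance row divided by the area row. As a quick check of the rounding, using only the figures in the table (units are CoreMarks/MHz/mm<sup>2</sup>):

$$
\frac{3.59}{2.5} \approx 1.44 \;\;\text{(Cortex-A9)}, \qquad
\frac{4.61}{1.00} = 4.61 \;\;\text{(BOOM-2w)}, \qquad
\frac{4.61}{1.44} \approx 3.2
$$

So on these numbers BOOM-2w delivers roughly 3.2 times the CoreMark throughput per unit area, subject to the caveats above.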



# BOOM Updates

- open sourced in Jan 2016
  - <http://ucb-bar.github.io/riscv-boom/>
- ~60 page BOOM Design Specification
  - <https://ccelio.github.io/riscv-boom-doc/>
- used for case study in ISCA 2016 paper ([strober.org](http://strober.org))
- first external contribution (visualizer)
- ported to Chisel3/FIRRTL
- supports uncached loads, stores (allows for memory-mapped IO)
- updated to Privileged Spec v1.9
- supports RISC-V External Debug Spec
- added High Performance Monitor counters (HPM)
- branch predictor improvements
- beginning tape-out



# Zynq FPGA Repository

- BOOM runs on a Xilinx Zynq zc706
- <https://github.com/ucb-bar/fpga-zynq>
- Updated to handle latest Privileged Spec v1.9, External Debug spec, self-booting
- was "tethered" via the Debug Transport Module, but...
- ... just finished up a "Tether Serial Interface" update

# Visualization



- The visualization shows:
  - SD-to-LBU hazard in Dhrystone
  - instructions dependent on the LBU wait for the LBU to execute
  - independent instructions issue out of order, before the LBU executes



# I accept contributions!

- Visualizer is BOOM's first external contribution
- Happy to accept more!
  - code that crashes
  - fixes to code that crashes
  - performance analysis tools
  - debugging or visualization tools
  - performance improvements
  - new features

# Branch Predictors

## Instructions

0x100: add r1, r2, r3  
0x104: add r3, r0, r4  
0x108: lw r2, 0(r3)  
0x10c: beq r1, r2, 0x200



# GShare Predictor

- global history
  - track outcome of last N branches
- 2-bit saturating counter table
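
A textbook gshare predictor can be modelled in a few lines of software. The sketch below is a behavioural model in plain Scala, not BOOM's hardware implementation. The table size, the `pc >>> 2` index hash, and the names `GShareModel`, `GShare`, and `accuracy` are assumptions chosen for illustration.

```scala
object GShareModel {
  final class GShare(historyBits: Int) {
    private val size = 1 << historyBits
    private val mask = size - 1
    // 2-bit saturating counters in 0..3; values >= 2 predict taken.
    // Every counter starts at 1 (weakly not-taken).
    private val counters = Array.fill(size)(1)
    // Global history: outcomes of the last `historyBits` branches.
    private var ghist = 0

    // Index the table with (PC bits) XOR (global history).
    private def index(pc: Long): Int = ((pc >>> 2).toInt ^ ghist) & mask

    def predict(pc: Long): Boolean = counters(index(pc)) >= 2

    def update(pc: Long, taken: Boolean): Unit = {
      val i = index(pc)
      counters(i) =
        if (taken) math.min(3, counters(i) + 1) else math.max(0, counters(i) - 1)
      // Shift the newest outcome into the history register.
      ghist = ((ghist << 1) | (if (taken) 1 else 0)) & mask
    }
  }

  // Fraction of correct predictions over a trace of (pc, taken) pairs.
  def accuracy(p: GShare, trace: Seq[(Long, Boolean)]): Double = {
    val correct = trace.count { case (pc, taken) =>
      val hit = p.predict(pc) == taken
      p.update(pc, taken)
      hit
    }
    correct.toDouble / trace.size
  }
}
```

For example, a single branch that strictly alternates taken and not-taken defeats a lone 2-bit counter, but this model learns it after a short warm-up because the two outcomes land in different table entries once the history is folded into the index.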



# BOOM's Branch Prediction



- next-line predictor (NLP)
  - BTB, BHT, RAS
  - combinational
- backing predictor (BPD)
  - global history predictor
  - SRAM (1 r/w port)



**Front-end**



- change "2-bit counter" state machine
  - on misprediction, jump from weak to strong
    - 00 -> 01 -> 11
  - this will allow us to reduce the read/write requirements
  - described in the Hennessy & Patterson computer architecture textbook



# GShare Predictor - Fitting it in SRAM



- p-bit (prediction)
  - read every cycle
  - write on mispredict (value of h-bit)
- h-bit (hysteresis)
  - write every branch
  - read on mispredict
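
The read/write rules above can be written as a tiny state machine. This is a sketch of one interpretation: it assumes the two-bit state is written as (p, h) and that the h-bit is overwritten with the actual branch outcome. `SplitCounter` and `PH` are hypothetical names.

```scala
object SplitCounter {
  // One counter split into a prediction bit (p) and a hysteresis bit (h).
  final case class PH(p: Boolean, h: Boolean) {
    // Predicting needs only the p-bit.
    def predictTaken: Boolean = p

    def update(taken: Boolean): PH = {
      val mispredicted = p != taken
      // p is rewritten only on a mispredict, taking the old h value;
      // h is rewritten on every branch with the actual outcome (assumption).
      PH(if (mispredicted) h else p, taken)
    }

    // State rendered as the two-bit string "ph".
    def bits: String = (if (p) "1" else "0") + (if (h) "1" else "0")
  }
}
```

Starting from `PH(false, false)` and applying two taken branches walks through `00`, `01`, `11`, which matches the sequence on the earlier slide: the second mispredict jumps from the weak not-taken state straight to the strong taken state. The h-bit is only read on a mispredict, so the p-table and h-table see different access patterns.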



# GShare in single-ported SRAM



- delayed ghistory update
- superscalar predictions
- ghistory is fetch packet granularity
- banked p-table
- reset ghistory on misspeculations
- update during commit

# Abstract Branch Predictors



# BOOM's Branch Predictors



- Null
  - predicts not-taken
- Random
  - serves as the baseline worst-case predictor
  - useful for testing the pipeline
- Simple GShare
  - demonstrates how to interface with the branch predictor framework
  - not synthesizable
- GShare
  - targeting 1r/1w SRAM (dual-ported)...
  - ... or 1rw SRAM (banked)
- 2bc-GSkew
  - based on the EV8 (Alpha 21464) predictor
  - 1rw (or 1r/1w) SRAM
  - took 12 hours to implement
- TAGE
  - a super awesome predictor
  - still in prototyping; WIP to make it synthesizable and scalable
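
To illustrate what a common predictor interface buys, here is a hedged software sketch of the two simplest entries in the list, Null and Random, behind one abstract trait. BOOM's real framework is Chisel hardware. The trait `BranchPredictor` and everything else in this block are hypothetical names, not BOOM's classes.

```scala
import scala.util.Random

object PredictorSketch {
  // Hypothetical software interface shared by all predictors.
  trait BranchPredictor {
    def predict(pc: Long): Boolean
    // Stateless predictors can keep this default no-op.
    def update(pc: Long, taken: Boolean): Unit = ()
  }

  // "Null": always predicts not-taken.
  object NullPredictor extends BranchPredictor {
    def predict(pc: Long): Boolean = false
  }

  // "Random": a coin flip per prediction, seeded so runs are repeatable.
  final class RandomPredictor(seed: Long) extends BranchPredictor {
    private val rng = new Random(seed)
    def predict(pc: Long): Boolean = rng.nextBoolean()
  }

  // Fraction of correct predictions over a trace of (pc, taken) pairs.
  def accuracy(p: BranchPredictor, trace: Seq[(Long, Boolean)]): Double = {
    val correct = trace.count { case (pc, taken) =>
      val hit = p.predict(pc) == taken
      p.update(pc, taken)
      hit
    }
    correct.toDouble / trace.size
  }
}
```

With this shape, the same `accuracy` harness can score any predictor, so swapping Null for Random or for a gshare model needs no other changes.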



# Thank you!

- Source
  - <https://ucb-bar.github.io/riscv-boom>
- Documentation
  - <https://ccelio.github.io/riscv-boom-doc>
- Google group
  - <https://groups.google.com/forum/#!forum/riscv-boom>
- Tech Report
  - <https://www.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-167.html>
- Twitter
  - <https://twitter.com/boom_cpu>

# Funding Acknowledgements



- *Research partially funded by DARPA Award Number HR0011-12-2-0016, the Center for Future Architecture Research, a member of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA, and ASPIRE Lab industrial sponsors and affiliates Intel, Google, Huawei, Nokia, NVIDIA, Oracle, and Samsung.*
- *Approved for public release; distribution is unlimited. The content of this presentation does not necessarily reflect the position or the policy of the US government and no official endorsement should be inferred.*
- *Any opinions, findings, conclusions, or recommendations in this paper are solely those of the authors and do not necessarily reflect the position or the policy of the sponsors.*