

# From Open Hardware Design to Silicon: Lessons From Some Big (Academic) Chips

Jonathan Balkind

ArchLab, UC Santa Barbara

# Building Realistic Scalable Prototypes

- Seen as expensive and risky
- Need open IP to reuse
- Scalability and configurability are challenging
  
- OpenPiton tackles these issues
- For our users, OpenPiton acts as:
  - A research platform for manycore SoC development
  - A nexus for open source hardware



G.K.Chen et al. JSSC 2023



Piton



CIFER



DECADES

Coming  
soon!  
Currently  
in fab

Polara

# OpenPiton & P-Mesh Coherence

- Open source manycore (since 2015)
- P-Mesh coherence scales to  $\frac{1}{2}$  billion cores
  - Innovation: we connect cores of different ISAs in one open-source system!
  - Core interface to cache coherence (TRI)
  - **We have connected >15 cores of six ISAs**
- Configurable core, uncore
- Simulation in VCS, ModelSim, Incisive, Verilator, Icarus, Riviera\*, Vivado
- Runs full stack multi-user Debian Linux
- Used in >75 public research works



# Piton Chip

- 25 cores
  - Modified 64 bit OpenSPARC T1 Core
- 3 P-Mesh NoCs
  - 64 bit, 2D Mesh
  - Extend off-chip enabling multichip systems
- P-Mesh Directory-Based Cache System
  - 64kB L2 Cache per core (Shared)
  - 8kB L1.5 & L1 Data Caches
  - 16kB L1 Instruction Cache
- IBM 32nm SOI Process
  - 6mm x 6mm
  - 460 Million Transistors - Among the largest academic chips
- Target: 1 GHz Clock @ 900 mV
- Received silicon and runs full-stack Debian in lab



# System Overview - Chip



- Extending NoCs off-chip enables multi-chip
- Off-chip chipset logic was developed post-tapeout
- Packets can be arbitrarily modified off-chip
- **Lesson:** This enables agile prototyping and addition of peripherals
- **Lesson:** Over-provisioned packet format enabled research modifications

# Configuration Options

- Configuration provided both in core and uncore
- Many options needed to enable design space exploration
- Parameterisation enabled by PyHP preprocessing
  - Python can be embedded in Verilog to produce complex Verilog code
- **Lesson:** PyHP and heterogeneity can be at odds
  - Leaning more on native Verilog today

| Component             | Configurability Options                       |                          |
|-----------------------|-----------------------------------------------|--------------------------|
| Cores (per chip)      | Up to 65,536                                  |                          |
| Cores (per system)    | Up to 500 million                             |                          |
| Core Type             | OpenSPARC T1                                  | Ariane 64 bit RISC-V     |
| Threads per Core      | 1/2/4                                         | 1                        |
| Floating-Point Unit   | FP64/32                                       | FP64/32/16/FP8, BFLOAT16 |
| TLBs                  | 8/16/32/64 entries                            | Number of entries        |
| L1 I-Cache            | Number of Sets, Ways                          |                          |
| L1 D-Cache            | Number of Sets, Ways                          |                          |
| L1.5 Cache            | Number of Sets, Ways                          |                          |
| L2 Cache              | Number of Sets, Ways                          |                          |
| Intra-chip Topologies | 2D Mesh, Crossbar                             |                          |
| Inter-chip Topologies | 2D Mesh, 3D Mesh, Crossbar, Butterfly Network |                          |
| Bootloading           | SD/SDHC Card, UART, RISC-V JTAG Debug         |                          |

# Tool Flows

- Simulation (**sims**)
  - Flow came with VCS simulation
  - Today further supports Cadence NCSim, Mentor ModelSim, Icarus Verilog, Verilator, Xilinx Vivado, simulation on MacOS
- FPGA (**protosyn**)
- Synthesis and Backend (**rftf**)
- **Lesson:** supporting all users means supporting open tools
- **Lesson:** push-button, single command flows are crucial to research productivity
- **Lesson:** porting FPGAs/ASIC libraries must be easy



# FPGA Prototyping

- Chipset crossbars generated from XML description of devices
  - Chip has no knowledge of chipset
  - Packets from chip rewritten to route to correct port
- Many devices supported for use bare metal and from Linux
- **Lesson:** Productivity enhancements for repeated menial tasks are worthwhile at scale
  - Debian boot time reduced from ~45 minutes to ~5 minutes



# Open Source System Stack

- All levels of the stack are open source
- Unmodified applications can run in Debian Linux on OpenPiton
- Or users can modify at any level of the stack!
- Debian boots on SPARC + RISC-V
  - Minor Linux patches used (and open)
- Modifications to reset code and hypervisor made for SPARC



# Open Source Design and Data

The diagram illustrates the OpenPiton ecosystem. At the center is a large blue cloud containing the text "OpenPiton" and the website "http://www.openpiton.org". Above the cloud, the word "Design" is written above "OpenPiton" and "Characterization Data" is written below it. To the left of the cloud is a diagram of a chip architecture with a "Chip Bridge" at the top, connected to "Tile 0", "Tile 1", "Tile 5", "Tile 10", "Tile 15", "Tile 16", "Tile 20", "Tile 21", "Tile 22", "Tile 23", and "Tile 24". A double-headed arrow connects the "Design" and "Characterization Data" sections. To the right of the cloud are three graphs. The top graph shows "Measured Power (W)" vs "Hops" for various benchmarks: "nop", "and", "and", "mix", "mix", "sqrt", "sqrt", "fadd", "fadd", "fdiv", "fdiv", "ldt", "ldt", "stx (N)", "stx (N)", and "time (T)". It includes trendlines for FSW (-16.68 pJ/hop), FSVA (-16.98 pJ/hop), HSW (-11.18 pJ/hop), and NSW (-3.58 pJ/hop). The middle graph is a bar chart of "Measured Power (W)" for SRAM Dynamic Power, Core Dynamic Power, SRAM Static Power, and Core Static Power across different memory configurations. The bottom graph is a scatter plot of "Measured Power (W)" vs "Hops" for the same benchmarks as the top graph, with trendlines for FSW, FSVA, HSW, and NSW.

Design  
Characterization Data

OpenPiton

OpenPiton  
<http://www.openpiton.org>

Measured Power (W)

Trendlines: FSW (-16.68 pJ/hop), FSVA (-16.98 pJ/hop), HSW (-11.18 pJ/hop), NSW (-3.58 pJ/hop)

Measured Power (W)

SRAM Dynamic Power, Core Dynamic Power, SRAM Static Power, Core Static Power

Measured Power (W)

Hops

First detailed power and energy characterization of an open source manycore!

# Open Source & Community

- 13 releases since 2015
- Transitioned to GitHub in 2018
- Open source ecosystem
  - Research: OpenPiton+Ariane, PyOCN, ILA
  - Tools: Icarus, Verilator, FuseSoC, Yosys, OpenTimer, OpenSTA
  - Peripherals: SDHC, DRAM, VGA, AXI
  - Industrial use: Emulation, training
- Community building
  - Thirteen tutorials
  - Mailing list for announcements
  - Google group for support



# Open Source & Community

- **Lesson:** Lowering barrier to entry keeps users engaged
  - Wrote getting started guide
  - Prebuilt docker, virtual machine images
- **Lesson:** Tutorial materials increase access
  - Slides and pre-recorded tutorial available
- **Lesson:** Archiving issues helps in future
  - Users find answers and see active community
- **Lesson:** Open development improves code quality and adoption



# Core Interface Options

- Several interface options with different trade-offs
- We want to minimise effort to connect new cores
- We define the Transaction-Response Interface (TRI)
  - Simple, extensible, low overhead



Provide 3 cache levels  
Interface before L1

Provide 1 cache level  
Interface before LLC

Provide 2 cache levels  
Interface after L1

| Core to Cache      | Cache to Core        | Subset       |
|--------------------|----------------------|--------------|
| Load               | Load Response        | Load/Store   |
| Store              | Store Ack            | Load/Store   |
| Instruction Miss   | Instruction Return   | Instruction  |
|                    | Invalidation Request | Invalidation |
| Atomic             | Atomic Response      | Atomic       |
| Outbound Interrupt | Inbound Interrupt    | Interrupt    |

# Cores Connected With TRI Subsets

- CVA6/Ariane (RISC-V 64)
  - All subsets
- OpenSPARC T1 (SPARC V9)
  - All subsets
- BlackParrot (RISC-V 64)
  - All subsets
- PicoRV32 (RISC-V 32)
  - Load/store + Atomic
- ao486 (x86) 
  - Instruction + Load/store
- QEMU (Many ISAs)
  - Load/store
- VexRiscv (RISC-V 32)
  - Instruction
- Microwatt (OpenPOWER)
  - Instruction + Load/store\*

| Core to Cache      | Cache to Core        | Subset       |
|--------------------|----------------------|--------------|
| Load               | Load Response        | Load/Store   |
| Store              | Store Ack            | Load/Store   |
| Instruction Miss   | Instruction Return   | Instruction  |
|                    | Invalidation Request | Invalidation |
| Atomic             | Atomic Response      | Atomic       |
| Outbound Interrupt | Inbound Interrupt    | Interrupt    |

# TRIcky Design Details

- TRI subsets
- L1 cache organisation
- Supporting existing protocols
- Configurability
- Write-back L1 caches
- Handshakes
- Reset strategy
- ISA complexity

# Enabled Research

- OpenPiton team:
  - Execution Drafting [McKeown MICRO 2014]
  - Coherence Domain Restriction [Fu MICRO 2015]
  - MITTS [Zhou ISCA 2016]
  - PiCL [Nguyen MICRO 2018]
  - Piton Characterisation [McKeown HPCA 2018]
  - JuxtaPiton [Lim FPGA 2019]
  - BYOC [Balkind ASPLOS 2020]
  - MAPLE [Orenes Vera ISCA 2022]
  - Cohort [Wei ASPLOS 2023]
  - SMAPPIC [Chirkov ASPLOS 2023]
- Collaborators:
  - Oblivious RAM [Ren TDSC 2017]
  - POSH/IDEA/SDH programs
- External:
  - Hardware trojan detection [Elnaggar ITC 2017, Basu DAC 2019, Elnaggar T-VLSI 2019]
  - IoT reliability lifetime [Lerner ISQED 2017, Lerner T-VLSI 2018]
  - NCFET characterization [Rapp DAC 2019, Salamin ISLPED 2019]
  - Scan test power analysis [Indino 2017]
  - Modeling and control of EMI [Gorman 2018, Gorman MICRO 2019]
  - 3D ICs [Pentapati IEEE Micro 2019, Bamberg DATE 2020]
- Student research projects:
  - Transactional Memory, Heterogeneous ISA, MIAOW GPU, ...
- Industrial:
  - Intel, Western Digital, Google, ...

# CIFER Architecture Overview

- 4×  CVA6
- 3×  Tiny-core clusters
  - 6-core, MIMD, 32-bit RISC-V (18 in total)
- Embedded FPGA
  - RTL design with PRGA
  - Synthesized with standard cells
  - 6,720 basic logic elements (LUT6 + 2× DFF)
  - 54KB BRAM
- 12nm FinFET, 456M transistors, 4x4mm
  - $V_{nom}$  [ $V_{min} - V_{max}$ ] (V): 0.80 [0.68 – 1.10]
  - $F_{max}@V_{max}$  (GHz): 1.2



# DECADES Architecture

- Accelerator-rich heterogeneous many-core architecture (108 tiles)
- 2.2B transistors, 8x8mm in 12nm process

| Tile type                    | Count |
|------------------------------|-------|
| Ariane (RV64 CPU)            | 60    |
| Accelerator<br>(GeMM/Conv2D) | 24    |
| IS (Intelligent Storage)     | 23    |
| Controller                   | 1     |
| eFPGA                        | 1     |



# DECADES Architecture



- Taped out in GF12
- 2.2 billion transistors
- All work was completed by a group of graduate students and postdoc



# Designing modern chips is hard

- Scale and heterogeneity have become blockers
- Simulation speed plummets as design size grows
- Programming the many heterogeneous elements requires specialised code and knowledge
- What can we learn from software?



# Software Oriented Acceleration



# Bird's Eye View of Cohort

```
fifo_t pf = fifo_init(...);  
fifo_t cf = fifo_init(...);  
cohort_register(acc, pf, cf);  
push(data, pf);  
int result = pop(cf);
```



# Evaluation: SHA Accelerator

IPC Performance with SHA accelerator



Program Latency with SHA accelerator



**UC SANTA BARBARA**