



PULP PLATFORM

Open Source Hardware, the way it should be!

---

# *Working with RISC-V from open ISA to open Architecture to open Hardware*

Part 1 of 5 : Introduction to RISC-V ISA

Luca Benini

Davide Rossi

<[luca.benini@unibo.it](mailto:luca.benini@unibo.it)>

<[davide.rossi@unibo.it](mailto:davide.rossi@unibo.it)>



<http://pulp-platform.org>



@pulp\_platform



<https://www.youtube.com/pulp_platform>

**ETH** zürich





# Summary

- **Part 1 – Introduction to RISC-V ISA**
  - What RISC-V is about
  - Description of ISA, and basic principles
  - Simple 32-bit implementation (Ibex by lowRISC)
  - How to extend the ISA (CV32E40P by OpenHW Group)
- **Part 2 – Advanced RISC-V Architectures**
- **Part 3 – PULP concepts**
- **Part 4 – PULP Extensions and Accelerators**
- **Part 5 – PULP based chips**



# RISC-V Instruction Set Architecture



- Started by UC-Berkeley in 2010
- Contract between SW and HW
  - Partitioned into user and privileged spec
  - External Debug
- Standard governed by RISC-V foundation
  - ETHZ is a founding member of the foundation
  - Necessary for the continuity
- Defines 32, 64 and 128 bit ISA
  - No implementation, just the ISA
  - Different implementations (both open and closed source)
- At ETHZ+UNIBO we specialize in efficient implementations of RISC-V cores





# RISC-V is essentially maintained as a PDF document

*Screenshot: the "Specifications" page of the RISC-V International website, which links to the Unprivileged Specification, the Privileged ISA Specification and the Debug Specification. The text of the page reads:*

Please note, RISC-V ISA and related specifications are developed, ratified and maintained by RISC-V International contributing members within the RISC-V International Technical Committee. Operating details of the Technical Committee can be found in the RISC-V International [Tech Group](#). Work on the specification is [performed on GitHub](#) and the GitHub issue mechanism can be used to provide input into the specification.

## ISA Specification

The specifications shown below represent the current, ratified releases:

- Volume 1, Unprivileged Spec v. 20191213 [\[PDF\]](#) [\[GitHub \(latest\)\]](#)
- Volume 2, Privileged Spec v. 20190608 [\[PDF\]](#) [\[GitHub \(latest\)\]](#)

## Debug Specification

- External Debug Support v. 0.13.2 [\[PDF\]](#)





# The ISA defines the instructions that a processor uses





**C++ program translated to RISC-V instructions defined by ISA.**

```
# Compiler: RISC-V rv64gc clang (trunk)
.LBB0_7:
        ld      a0, -56(s0)        # load a 64-bit base pointer from the stack frame
        lw      a1, -76(s0)        # load 32-bit operands from the stack frame
        lw      a2, -60(s0)
        mul     a1, a1, a2         # a1 = a1 * a2
        lw      a2, -72(s0)
        addw    a1, a1, a2         # 32-bit add: a1 = a1 + a2
        slli    a1, a1, 2          # scale the index by 4 bytes (one float)
        add     a0, a0, a1         # address = base pointer + byte offset
        flw     ft0, 0(a0)         # load a single-precision float
        ld      a0, 0(s0)
        lw      a1, -68(s0)
        lw      a3, -64(s0)
        mul     a1, a1, a3
        addw    a1, a1, a2
        slli    a1, a1, 2
        add     a0, a0, a1
        flw     ft1, 0(a0)
        fadd.s  ft0, ft1, ft0      # single-precision add
        fsw     ft0, 0(a0)         # store the result back to memory
```

Screen shot from the excellent Compiler Explorer by Matt Godbolt  
<https://godbolt.org/>



# RISC-V Ecosystem

- Binutils – upstream
- GCC – upstream
- LLVM – upstream
- Simulator:
  - "Spike" - reference
  - QEMU, Gem5
- OpenOCD
- OS
  - Linux, seL4, FreeRTOS, Zephyr
- Runtimes
  - Jikes, OCaml, Go
- SW maintained by different parties
  - Binutils and GCC by SiFive, a Berkeley start-up





# RISC-V ISA is divided into extensions

| Extension | Description |
|---|------------------------------------------|
| I | Integer instructions (frozen)            |
| E | Reduced number of registers              |
| M | Multiplication and Division (frozen)     |
| A | Atomic instructions (frozen)             |
| F | Single-Precision Floating-Point (frozen) |
| D | Double-Precision Floating-Point (frozen) |
| C | Compressed Instructions (frozen)         |
| X | Non Standard Extensions                  |

- Kept very simple and extendable
  - Wide range of applications from IoT to HPC
- Naming scheme: RV + word width + extensions
  - RV32IMC: 32-bit, integer, multiplication and division, compressed
- User specification:
  - Separated into extensions, only I is mandatory
- Privileged Specification (WIP):
  - Governs OS functionality: Exceptions, Interrupts
  - Virtual Addressing
  - Privilege Levels
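The naming scheme above (RV + word width + extension letters) can be sketched in a few lines of Python. This is only an illustration: `parse_isa_string` and the `EXTENSIONS` dictionary are hypothetical names, and the sketch deliberately ignores shorthands such as `G` and multi-letter extension names.

```python
import re

# Single-letter extensions taken from the table above (illustrative subset).
EXTENSIONS = {
    "I": "Integer instructions",
    "E": "Reduced number of registers",
    "M": "Multiplication and Division",
    "A": "Atomic instructions",
    "F": "Single-Precision Floating-Point",
    "D": "Double-Precision Floating-Point",
    "C": "Compressed Instructions",
}


def parse_isa_string(name):
    """Split a name such as 'RV32IMC' into (word width, extension letters)."""
    match = re.fullmatch(r"RV(32|64|128)([A-Z]+)", name.upper())
    if match is None:
        raise ValueError(f"not a simple RISC-V ISA string: {name!r}")
    width = int(match.group(1))
    letters = list(match.group(2))
    # The base integer ISA (I, or the reduced E variant) must come first.
    if letters[0] not in ("I", "E"):
        raise ValueError("the base must be I or E")
    unknown = [letter for letter in letters if letter not in EXTENSIONS]
    if unknown:
        raise ValueError(f"extensions not covered by this sketch: {unknown}")
    return width, letters


if __name__ == "__main__":
    width, letters = parse_isa_string("RV32IMC")
    print(width, [EXTENSIONS[letter] for letter in letters])
```
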





# Work continues on new RISC-V extensions

- Foundation members work in **task-groups**
- Dedicated task-groups
  - Formal specification
  - Memory Model
  - Marketing
  - External Debug Specification
- ETH Zurich also contributes
  - Bit manipulation
  - Packed SIMD, DSP

| Extension | Description |
|---|----------------------------------|
| Q | Quad-precision Floating-Point |
| L | Decimal Floating Point |
| B | Bit Manipulation |
| T | Transactional Memory |
| P | Packed SIMD |
| J | Dynamically Translated Languages |
| V | Vector Operations |
| N | User-Level Interrupts |





# What is so special about RISC-V?

RISC-V base ISAs have either little-endian or big-endian memory systems, with the privileged architecture further defining bi-endian operation. Instructions are stored in memory as a sequence of 16-bit little-endian parcels, regardless of memory system endianness. Parcels forming one instruction are stored at increasing halfword addresses, with the lowest-addressed parcel holding the lowest-numbered bits in the instruction specification.

---

*We originally chose little-endian byte ordering for the RISC-V memory system because little-endian systems are currently dominant commercially (all x86 systems; iOS, Android, and Windows for ARM). A minor point is that we have also found little-endian memory systems to be more natural for hardware designers. However, certain application areas, such as IP networking,*

- Major design decisions have been properly motivated and explained
- **Reserved space for extensions, modular**
- **Open standard, you can help decide how it is developed**





# The FREEDOM in RISC-V is implementation

- You can access all ISAs without (many) restrictions
  - SW tools need to be developed so that they can generate code for that ISA
- Most ISAs are **closed**. Only specific vendors can implement them
  - To use a core that implements an ISA, you have to license/buy it from vendor
  - Open source SW (for the ISA) is possible but **building HW is not allowed**

# RISC-V

## Integer Register-Register Operations

RV32I defines several arithmetic R-type operations. All operations read the *rs1* and *rs2* registers as source operands and write the result into register *rd*. The *funct7* and *funct3* fields select the type of operation.

| Bits     | 31:25   | 24:20 | 19:15 | 14:12        | 11:7 | 6:0    |
|----------|---------|-------|-------|--------------|------|--------|
| Field    | funct7  | rs2   | rs1   | funct3       | rd   | opcode |
| Width    | 7       | 5     | 5     | 3            | 5    | 7      |
| Contents | 0000000 | src2  | src1  | ADD/SLT/SLTU | dest | OP     |
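As a minimal sketch of how the R-type fields above pack into one 32-bit word, the following Python helper assembles `add x3, x1, x2`. The function name `encode_r_type` is hypothetical; the numeric values for the `OP` opcode and the ADD/SLT/SLTU `funct3` codes are taken from the RV32I base specification, not from the slide.

```python
OP = 0b0110011        # major opcode for register-register operations
FUNCT3_ADD = 0b000
FUNCT3_SLT = 0b010
FUNCT3_SLTU = 0b011


def encode_r_type(funct7, rs2, rs1, funct3, rd, opcode):
    """Pack the six R-type fields into one 32-bit instruction word."""
    fields = ((funct7, 7), (rs2, 5), (rs1, 5), (funct3, 3), (rd, 5), (opcode, 7))
    for value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"field value {value} does not fit in {bits} bits")
    return (
        (funct7 << 25) | (rs2 << 20) | (rs1 << 15)
        | (funct3 << 12) | (rd << 7) | opcode
    )


if __name__ == "__main__":
    # add x3, x1, x2  ->  rd = 3, rs1 = 1, rs2 = 2
    word = encode_r_type(0b0000000, 2, 1, FUNCT3_ADD, 3, OP)
    print(f"{word:#010x}")  # prints 0x002081b3
```
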

# ARM

## C2.9 ADD

Add without Carry.

### Syntax

`ADD{S}{cond} {Rd}, Rn, Operand2`

`ADD{cond} {Rd}, Rn, #imm12 ; T32, 32-bit encoding only`



# Are RISC-V processors better than XYZ?

- Actual performance depends on the implementation
  - RISC-V does not specify implementation details (on purpose)
- Modern design, should deliver comparable performance
  - If implemented well, it should perform as well as other modern ISA implementations
  - In our experiments, we see **no major weaknesses** when compared to other ISAs
  - It also is not magically 2x better
- High-end processor performance is not so much about ISA
  - Implementation “details” like microarchitecture, memory hierarchy, target technology, power management are more important.






# What is not so good about RISC-V?

- **Still in development**
  - Some standards (privilege, vector, debug etc.) still being refined, adjusted.
  - Tools and development environments need to catch up.
- **No canonical implementation (“the” RISC-V core)**
  - It is free to implement, so many people did so, resulting in many cores
- **Higher end (out of order, superscalar) cores not yet mature**
  - In theory there is nothing to prevent a RISC-V based Linux laptop.
  - It will take some more time until RISC-V implementations can compete with other commercial processors (which needed hundreds of man months of work)
  - Getting there (Alibaba XT910, SiFive P550, Esperanto ET-Maxion, Semidynamics Avispado, Rivos ??? and more coming every day!)



# Reduced Instruction Set: all in one page

Free & Open RISC-V Reference Card

**Base Integer Instructions: RV32I, RV64I, and RV128I**

| Category    | Name                   | Fmt | RV32I Base        |
|-------------|------------------------|-----|-------------------|
| Loads       | Load Byte              | I   | LB rd,rs1,imm     |
|             | Load Halfword          | I   | LH rd,rs1,imm     |
|             | Load Word              | I   | LW rd,rs1,imm     |
|             | Load Byte Unsigned     | I   | LBU rd,rs1,imm    |
|             | Load Half Unsigned     | I   | LHU rd,rs1,imm    |
| Stores      | Store Byte             | S   | SB rs1,rs2,imm    |
|             | Store Halfword         | S   | SH rs1,rs2,imm    |
|             | Store Word             | S   | SW rs1,rs2,imm    |
| Shifts      | Shift Left             | R   | SLL rd,rs1,rs2    |
|             | Shift Left Immediate   | I   | SLLI rd,rs1,shamt |
|             | Shift Right            | R   | SRL rd,rs1,rs2    |
|             | Shift Right Immediate  | I   | SRLI rd,rs1,shamt |
|             | Shift Right Arithmetic | R   | SRA rd,rs1,rs2    |
|             | Shift Right Arith Imm  | I   | SRAI rd,rs1,shamt |
| Arithmetic  | ADD                    | R   | ADD rd,rs1,rs2    |
|             | ADD Immediate          | I   | ADDI rd,rs1,imm   |
|             | SUBtract               | R   | SUB rd,rs1,rs2    |
|             | Load Upper Imm         | U   | LUI rd,imm        |
|             | Add Upper Imm to PC    | U   | AUIPC rd,imm      |
| Logical     | XOR                    | R   | XOR rd,rs1,rs2    |
|             | XOR Immediate          | I   | XORI rd,rs1,imm   |
|             | OR                     | R   | OR rd,rs1,rs2     |
|             | OR Immediate           | I   | ORI rd,rs1,imm    |
|             | AND                    | R   | AND rd,rs1,rs2    |
|             | AND Immediate          | I   | ANDI rd,rs1,imm   |
| Compare     | Set <                  | R   | SLT rd,rs1,rs2    |
|             | Set < Immediate        | I   | SLTI rd,rs1,imm   |
|             | Set < Unsigned         | R   | SLTU rd,rs1,rs2   |
|             | Set < Imm Unsigned     | I   | SLTIU rd,rs1,imm  |
| Branches    | Branch =               | SB  | BEQ rs1,rs2,imm   |
|             | Branch ≠               | SB  | BNE rs1,rs2,imm   |
|             | Branch <               | SB  | BLT rs1,rs2,imm   |
|             | Branch ≥               | SB  | BGE rs1,rs2,imm   |
|             | Branch < Unsigned      | SB  | BLTU rs1,rs2,imm  |
|             | Branch ≥ Unsigned      | SB  | BGEU rs1,rs2,imm  |
| Jump & Link | J&L                    | UJ  | JAL rd,imm        |
|             | Jump & Link Register   | I   | JALR rd,rs1,imm   |
| Sync        | Synch thread           | I   | FENCE             |
|             | Synch Instr & Data     | I   | FENCE.I           |
| System      | System CALL            | I   | SCALL             |
|             | System BREAK           | I   | SBREAK            |
| Counters    | Read CYCLE             | I   | RDCYCLE rd        |
|             | Read CYCLE upper Half  | I   | RDCYCLEH rd       |
|             | Read TIME              | I   | RDTIME rd         |
|             | Read TIME upper Half   | I   | RDTIMEH rd        |
|             | Read INSTR Retired     | I   | RDINSTRET rd      |

**RV Privileged Instructions**

| Category      | Name                          | RV mnemonic       |
|---------------|-------------------------------|-------------------|
| CSR Access    | Atomic R/W                    | CSRRW rd,csr,rs1  |
|               | Atomic Read & Set Bit         | CSRRS rd,csr,rs1  |
|               | Atomic Read & Clear Bit       | CSRRC rd,csr,rs1  |
|               | Atomic R/W Imm                | CSRRWI rd,csr,imm |
|               | Atomic Read & Clear Bit Imm   | CSRRCI rd,csr,imm |
| Change Level  | Env. Call                     | ECALL             |
|               | Environment Breakpoint        | EBREAK            |
| Trap Redirect | Trap Redirect to Supervisor   | MRTS              |
|               | Redirect Trap to Hypervisor   | MRTH              |
|               | Hypervisor Trap to Supervisor | HRTS              |
| Interrupt     | Wait for Interrupt            | WFI               |
| MMU           | Supervisor FENCE              | SFENCE.VM rs1     |

**Optional Compressed Instructions: RVC (only the rows legible on the slide)**

| Category    | Name                 | RVC                |
|-------------|----------------------|--------------------|
| Shifts      | Shift Left Imm       | C.SLLI rd,imm      |
| Branches    | Branch=0             | C.BEQZ rs1',imm    |
|             | Branch≠0             | C.BNEZ rs1',imm    |
| Jump        | Jump                 | C.J imm            |
|             | Jump Register        | C.JR rs1           |
| Jump & Link | J&L                  | C.JAL imm          |
|             | Jump & Link Register | C.JALR rs1         |
| System      | Env. BREAK           | C.EBREAK           |

**RISC-V Integer Base (RV32I/64I/128I)**, privileged, and optional compressed extension (RVC). Registers x1-x31 and the pc are 32 bits wide in RV32I, 64 in RV64I, and 128 in RV128I (x0 = 0). RV64I/128I add 10 instructions for the wider formats. The RVI base of <50 classic integer RISC instructions is required. Every 16-bit RVC instruction matches an existing 32-bit RVI instruction. See riscv.org.

# Multiply/Divide (M)

| Category  | Name                    | Fmt | RV32M             | +RV{64,128} |
|-----------|-------------------------|-----|-------------------|-------------|
| Multiply  | MULtiply                | R   | MUL rd,rs1,rs2    | MUL{W\|D}   |
|           | MULtiply upper Half     | R   | MULH rd,rs1,rs2   |             |
|           | MULtiply Half Sign/Uns  | R   | MULHSU rd,rs1,rs2 |             |
|           | MULtiply upper Half Uns | R   | MULHU rd,rs1,rs2  |             |
| Divide    | DIVide                  | R   | DIV rd,rs1,rs2    | DIV{W\|D}   |
|           | DIVide Unsigned         | R   | DIVU rd,rs1,rs2   |             |
| Remainder | REMainder               | R   | REM rd,rs1,rs2    | REM{W\|D}   |
|           | REMainder Unsigned      | R   | REMU rd,rs1,rs2   | REMU{W\|D}  |

# Atomic Extensions (A)

**Optional Atomic Instruction Extension: RVA**

| Category | Name              | Fmt | RV32A (Atomic)       | +RV{64,128}               |
|----------|-------------------|-----|----------------------|---------------------------|
| Load     | Load Reserved     | R   | LR.W rd,rs1          | LR.{D\|Q} rd,rs1          |
| Store    | Store Conditional | R   | SC.W rd,rs1,rs2      | SC.{D\|Q} rd,rs1,rs2      |
| Swap     | SWAP              | R   | AMOSWAP.W rd,rs1,rs2 | AMOSWAP.{D\|Q} rd,rs1,rs2 |
| Add      | ADD               | R   | AMOADD.W rd,rs1,rs2  | AMOADD.{D\|Q} rd,rs1,rs2  |
| Logical  | OR                | R   | AMOOR.W rd,rs1,rs2   | AMOOR.{D\|Q} rd,rs1,rs2   |
| Min/Max  | MINimum           | R   | AMOMIN.W rd,rs1,rs2  | AMOMIN.{D\|Q} rd,rs1,rs2  |
|          | MAXimum           | R   | AMOMAX.W rd,rs1,rs2  | AMOMAX.{D\|Q} rd,rs1,rs2  |
|          | MINimum Unsigned  | R   | AMOMINU.W rd,rs1,rs2 | AMOMINU.{D\|Q} rd,rs1,rs2 |
|          | MAXimum Unsigned  | R   | AMOMAXU.W rd,rs1,rs2 | AMOMAXU.{D\|Q} rd,rs1,rs2 |

**Three Optional Floating-Point Instruction Extensions: RVF, RVD, & RVQ**

| Category       | Name                       | RV32{F\|D\|Q}             | Operands       |
|----------------|----------------------------|---------------------------|----------------|
| Move           | Move from Integer          | FMV.{H\|S}.X              | rd,rs1         |
|                | Move to Integer            | FMV.X.{H\|S}              | rd,rs1         |
| Convert        | Convert from Int           | FCVT.{H\|S\|D\|Q}.W       | rd,rs1         |
|                | Convert from Int Unsigned  | FCVT.{H\|S\|D\|Q}.WU      | rd,rs1         |
|                | Convert to Int             | FCVT.W.{H\|S\|D\|Q}       | rd,rs1         |
|                | Convert to Int Unsigned    | FCVT.WU.{H\|S\|D\|Q}      | rd,rs1         |
| Load           | Load                       | FL{W,D,Q}                 | rd,rs1,imm     |
| Store          | Store                      | FS{W,D,Q}                 | rs1,rs2,imm    |
| Arithmetic     | ADD                        | FADD.{S\|D\|Q}            | rd,rs1,rs2     |
|                | SUBtract                   | FSUB.{S\|D\|Q}            | rd,rs1,rs2     |
|                | MULtiply                   | FMUL.{S\|D\|Q}            | rd,rs1,rs2     |
|                | DIVide                     | FDIV.{S\|D\|Q}            | rd,rs1,rs2     |
|                | SQuare RooT                | FSQRT.{S\|D\|Q}           | rd,rs1         |
| Mul-Add        | Multiply-ADD               | FMADD.{S\|D\|Q}           | rd,rs1,rs2,rs3 |
|                | Multiply-SUBtract          | FMSUB.{S\|D\|Q}           | rd,rs1,rs2,rs3 |
|                | Negative Multiply-SUBtract | FNMSUB.{S\|D\|Q}          | rd,rs1,rs2,rs3 |
|                | Negative Multiply-ADD      | FNMADD.{S\|D\|Q}          | rd,rs1,rs2,rs3 |
| Sign Inject    | SiGN source                | FSGNJ.{S\|D\|Q}           | rd,rs1,rs2     |
|                | Negative SiGN source       | FSGNJN.{S\|D\|Q}          | rd,rs1,rs2     |
|                | Xor SiGN source            | FSGNJX.{S\|D\|Q}          | rd,rs1,rs2     |
| Min/Max        | MINimum                    | FMIN.{S\|D\|Q}            | rd,rs1,rs2     |
|                | MAXimum                    | FMAX.{S\|D\|Q}            | rd,rs1,rs2     |
| Compare        | Compare Float =            | FEQ.{S\|D\|Q}             | rd,rs1,rs2     |
|                | Compare Float <            | FLT.{S\|D\|Q}             | rd,rs1,rs2     |
|                | Compare Float ≤            | FLE.{S\|D\|Q}             | rd,rs1,rs2     |
| Categorization | Classify Type              | FCLASS.{S\|D\|Q}          | rd,rs1         |

**RISC-V Calling Convention**

| Register | ABI Name | Saver  | Description                      |
|----------|----------|--------|----------------------------------|
| x0       | zero     | ---    | Hard-wired zero                  |
| x1       | ra       | Caller | Return address                   |
| x2       | sp       | Callee | Stack pointer                    |
| x3       | gp       | ---    | Global pointer                   |
| x4       | tp       | ---    | Thread pointer                   |
| x5-7     | t0-2     | Caller | Temporaries                      |
| x8       | s0/fp    | Callee | Saved register/frame pointer     |
| x9       | s1       | Callee | Saved register                   |
| x10-11   | a0-1     | Caller | Function arguments/return values |
| x12-17   | a2-7     | Caller | Function arguments               |
| x18-27   | s2-11    | Callee | Saved registers                  |
| x28-31   | t3-6     | Caller | Temporaries                      |
| f0-7     | ft0-7    | Caller | FP temporaries                   |
| f8-9     | fs0-1    | Callee | FP saved registers               |
| f10-11   | fa0-1    | Caller | FP arguments/return values       |
| f12-17   | fa2-7    | Caller | FP arguments                     |
| f18-27   | fs2-11   | Callee | FP saved registers               |
| f28-31   | ft8-11   | Caller | FP temporaries                   |

*RISC-V calling convention and five optional extensions: 10 multiply-divide instructions (RV32M); 11 optional atomic instructions (RV32A); and 25 floating-point instructions each for single-, double-, and quadruple-precision (RV32F, RV32D, RV32Q). The latter add registers f0-f31, whose width matches the widest precision, and a floating-point control and status register `fcsr`. Each larger address adds some instructions: 4 for RVM, 11 for RVA, and 6 each for RVF/D/Q. Using regex notation, `{}` means set, so `L{D|Q}` is both LD and LQ. See riscv.org. (8/21/15 revision)*



# RISC-V Architectural State

- There are 32 integer registers, each 32 / 64 / 128 bits wide
  - Named x0 to x31
  - x0 is hard-wired to zero
  - The standard 'E' extension uses only 16 registers (RV32E)
- In addition, one program counter (PC)
  - Byte-based addressing; the PC advances by 4 bytes per 32-bit instruction (2 bytes per compressed instruction)
- Floating-point operations use 32 additional FP registers
- Additional Control and Status Registers (CSRs)
  - Encodings for up to 4096 CSRs are reserved; not all are used
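The integer part of this state is small enough to model directly. The sketch below is a toy model, not taken from any PULP code base: the class name `RegisterFile` and its methods are hypothetical, and FP registers and CSRs are left out.

```python
class RegisterFile:
    """Toy model of the RISC-V integer state: x0..x31 plus the PC."""

    def __init__(self, xlen=32):
        if xlen not in (32, 64, 128):
            raise ValueError("xlen must be 32, 64 or 128")
        self.mask = (1 << xlen) - 1   # keeps values within the register width
        self.regs = [0] * 32          # x0 .. x31
        self.pc = 0

    def read(self, index):
        if not 0 <= index < 32:
            raise IndexError("register index out of range")
        return self.regs[index]

    def write(self, index, value):
        if not 0 <= index < 32:
            raise IndexError("register index out of range")
        if index != 0:                # writes to x0 are silently discarded
            self.regs[index] = value & self.mask


if __name__ == "__main__":
    rf = RegisterFile(xlen=32)
    rf.write(0, 123)
    rf.write(5, -1)
    print(rf.read(0), hex(rf.read(5)))  # prints 0 0xffffffff
```
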



# RISC-V Instructions four basic types

- **R** register to register operations
- **I** operations with immediate/constant values
- **S / SB** operations with two source registers
- **U / UJ** operations with large immediate/constant value
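
To make the formats concrete, here is a minimal sketch of how the R and I formats pack their fields into a 32-bit word. The helper names `encode_r` and `encode_i` are hypothetical (not part of any toolchain); the field positions follow the base ISA encoding.

```python
# Sketch: pack the fields of R-type and I-type instructions into 32-bit words.
# encode_r / encode_i are illustrative helper names, not a real library API.

OP = 0b0110011      # opcode group for register-register ALU operations
OP_IMM = 0b0010011  # opcode group for register-immediate ALU operations


def encode_r(opcode, rd, funct3, rs1, rs2, funct7):
    """R-type: funct7 | rs2 | rs1 | funct3 | rd | opcode."""
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


def encode_i(opcode, rd, funct3, rs1, imm):
    """I-type: imm[11:0] | rs1 | funct3 | rd | opcode (imm is a signed 12-bit value)."""
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


add_word = encode_r(OP, rd=5, funct3=0, rs1=6, rs2=7, funct7=0)   # add  x5, x6, x7
addi_word = encode_i(OP_IMM, rd=5, funct3=0, rs1=6, imm=-1)       # addi x5, x6, -1
print(f"{add_word:#010x} {addi_word:#010x}")
```

Both formats keep `rd`, `rs1` and `funct3` at the same bit positions, which keeps the register decoding logic simple.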





# Encoding of the instructions, main groups

- Reserved opcodes for standard extensions
- Rest of opcodes free for **custom** implementations
- Standard extensions are frozen once ratified and will not change in the future




| inst[6:5] (rows) / inst[4:2] (columns) | 000    | 001      | 010        | 011      | 100    | 101        | 110              | 111 (> 32b) |
|----------------------------------------|--------|----------|------------|----------|--------|------------|------------------|-------------|
| 00                                     | LOAD   | LOAD-FP  | *custom-0* | MISC-MEM | OP-IMM | AUIPC      | OP-IMM-32        | 48b         |
| 01                                     | STORE  | STORE-FP | *custom-1* | AMO      | OP     | LUI        | OP-32            | 64b         |
| 10                                     | MADD   | MSUB     | NMSUB      | NMADD    | OP-FP  | *reserved* | *custom-2/rv128* | 48b         |
| 11                                     | BRANCH | JALR     | *reserved* | JAL      | SYSTEM | *reserved* | *custom-3/rv128* | ≥ 80b       |



# RISC-V is a load/store architecture

- All operations work on internal registers
  - Data in memory cannot be manipulated directly
- Load instructions copy data from memory to registers
- R-type or I-type instructions operate on the registers
- Store instructions copy data from registers back to memory
- Branch and jump instructions change the control flow



# Constants (Immediates) in Instructions

- A 32-bit instruction cannot hold a 32-bit constant
  - Immediate bits are distributed across the instruction word and then sign-extended
  - The **Load Upper Immediate (lui)** instruction is used to assemble larger constants
- Instruction types are named according to their immediate encoding
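
A minimal sketch of how a 32-bit constant can be split into the 20-bit part for `lui` and the 12-bit part for a following `addi`. The function names `split_constant` and `rebuild` are hypothetical. Because `addi` sign-extends its immediate, the upper part has to be adjusted whenever bit 11 of the constant is set.

```python
# Sketch: split a 32-bit constant into a lui part (upper 20 bits) and an
# addi part (signed 12 bits). Helper names are illustrative only.

def split_constant(value):
    value &= 0xFFFFFFFF
    lo = value & 0xFFF
    if lo >= 0x800:
        lo -= 0x1000              # addi will sign-extend, so lo becomes negative
    hi = ((value - lo) >> 12) & 0xFFFFF   # compensate in the upper 20 bits
    return hi, lo


def rebuild(hi, lo):
    """What 'lui rd, hi' followed by 'addi rd, rd, lo' leaves in a 32-bit register."""
    return ((hi << 12) + lo) & 0xFFFFFFFF


hi, lo = split_constant(0xDEADBEEF)
print(hex(hi), lo, hex(rebuild(hi, lo)))
```

This is the same adjustment an assembler performs when it expands a load-immediate pseudo-instruction on RV32.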





# Load from memory (ld), how immediates work

**ld x9, 64(x22)**



- A 32b address does not fit directly into a 32b instruction encoding
  - Take the content of the source register (**rs1**) and add the sign-extended immediate (**imm**) to it. This is the **address**
  - Read from this **address** in memory and load the value into the destination (**rd**) register
- RISC-V tries to minimize the number of instructions
  - The **ld** instruction may seem overly complicated, but this single base + offset form can be used for everything
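
The address calculation can be sketched as a tiny model. Everything here is an assumption for illustration: `regs` is a plain list of 32 integers, `memory` is a `bytearray`, data is stored little-endian, and the helper names are hypothetical.

```python
# Sketch of 'ld rd, imm(rs1)': address = regs[rs1] + sign_extend(imm).
# The register file and memory model are simplified assumptions.

def sign_extend(value, bits):
    """Interpret the low 'bits' bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def load_doubleword(regs, memory, rd, rs1, imm12):
    addr = (regs[rs1] + sign_extend(imm12, 12)) & 0xFFFFFFFFFFFFFFFF
    data = int.from_bytes(memory[addr:addr + 8], "little")
    if rd != 0:                   # x0 is hardwired to zero, writes are dropped
        regs[rd] = data
    return addr


regs = [0] * 32
regs[22] = 0x100
memory = bytearray(0x200)
memory[0x140:0x148] = (0x1122334455667788).to_bytes(8, "little")

addr = load_doubleword(regs, memory, rd=9, rs1=22, imm12=64)   # ld x9, 64(x22)
print(hex(addr), hex(regs[9]))
```

With `rs1 = x0` the same instruction addresses the lowest 2 KiB directly, and with the immediate set to 0 it is a plain register-indirect load.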





# Branching, how addresses come together

**bne x10, x11, 2000** // if $x10 \neq x11$, branch 2000 bytes ahead



- Similar problem: how to encode the jump address in branches
  - Branch on Equal (**beq**) and Branch on Not Equal (**bne**)
  - They use B-type encoding and need two source registers
- Jumps are relative to the Program Counter (PC)
  - The **immediate** (constant) shows how far we have to jump (PC-relative addressing)
  - Works for targets within ±4 KiB of the PC. To branch further, we need several instructions.
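
A minimal sketch of how the branch offset is scattered over the B-type encoding. The function names are hypothetical; the bit positions follow the base ISA. Bit 0 of the offset is always zero and is not stored, which is how a 12-bit field reaches ±4 KiB.

```python
# Sketch: encode and decode the PC-relative offset of a B-type instruction.
# encode_branch / decode_branch_offset are illustrative names only.

BRANCH_OPCODE = 0b1100011
BNE_FUNCT3 = 0b001


def encode_branch(funct3, rs1, rs2, offset):
    if offset % 2 != 0 or not -4096 <= offset <= 4094:
        raise ValueError("branch offset must be even and within -4096..4094")
    imm = offset & 0x1FFF         # 13-bit two's complement, bit 0 is implicit
    return (
        (((imm >> 12) & 0x1) << 31)     # imm[12]
        | (((imm >> 5) & 0x3F) << 25)   # imm[10:5]
        | (rs2 << 20)
        | (rs1 << 15)
        | (funct3 << 12)
        | (((imm >> 1) & 0xF) << 8)     # imm[4:1]
        | (((imm >> 11) & 0x1) << 7)    # imm[11]
        | BRANCH_OPCODE
    )


def decode_branch_offset(word):
    imm = (
        (((word >> 31) & 0x1) << 12)
        | (((word >> 7) & 0x1) << 11)
        | (((word >> 25) & 0x3F) << 5)
        | (((word >> 8) & 0xF) << 1)
    )
    return imm - 0x2000 if imm & 0x1000 else imm   # sign-extend from 13 bits


word = encode_branch(BNE_FUNCT3, rs1=10, rs2=11, offset=2000)   # bne x10, x11, 2000
print(f"{word:#010x}", decode_branch_offset(word))
```

The branch target is then `PC + offset`, where PC is the address of the branch instruction itself.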



# RISC-V Instruction Length is Encoded

- The least-significant bits of the instruction tell how long the instruction is
- Supports instructions of 16, 32, 48, 64, 80, 96, ..., 320 bits
  - Allows RISC-V to have Compressed instructions
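
A minimal sketch of the length decoding for the common cases, assuming the encoding in the unprivileged specification. The function name is hypothetical, and lengths above 64 bits are deliberately not decoded here.

```python
# Sketch: derive the instruction length from the lowest bits of the first
# 16-bit parcel. Only 16/32/48/64-bit encodings are handled; anything longer
# returns None in this illustrative helper.

def instruction_length_bits(parcel):
    if (parcel & 0b11) != 0b11:
        return 16                 # compressed instruction
    if (parcel & 0b11100) != 0b11100:
        return 32                 # standard 32-bit instruction
    if (parcel & 0b111111) == 0b011111:
        return 48
    if (parcel & 0b1111111) == 0b0111111:
        return 64
    return None                   # 80 bits and longer, not decoded here


for parcel in (0x4501, 0x02B3, 0x001F, 0x003F):
    print(hex(parcel), instruction_length_bits(parcel))
```

A fetch unit only needs these few bits to find the start of the next instruction, which is what makes mixing 16-bit and 32-bit instructions cheap.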



*Figure: instruction parcels at byte addresses base, base+2 and base+4.*





# Compressed Instruction extension ‘C’

- Use 16-bit instructions for common operations
  - Code size reduction by 34%
  - Compressed instructions increase fetch-bandwidth
  - Allow for macro-op fusion of common patterns



**x86-64:** 3.71 bytes / instruction   **RV64IC:** 3.00 bytes / instruction





# So, how to build RISC-V cores?

- **RISC-V ISA tells you the function**
  - You know which instructions are supported
  - How they are encoded
  - What they are supposed to do
- **It does *not* tell you any implementation details**
  - Pipeline stages, memory hierarchy, computation units, in-order or out-of-order
  - Everyone is free to figure out how to best implement these
- **Need to come up with a micro-architecture to implement it**
  - Determine which standard extensions are supported, and how
  - Choose a micro-architecture that fits performance requirements



# What are the Performance Metrics

## Area

- In kGE (kilo gate equivalents, i.e. number of simple logic gates) or mm² (technology dependent)

## Frequency

- Depends on the number of gates on the longest path

## Power

- Strongly depends on the above metrics
- **Leakage**: dissipated even when not working (scales with area)
- **Dynamic Power**: dissipated on logic transitions (scales with frequency and area)

## CPU Design

- **IPC** (Instructions per cycle)
  - IPC is implicitly measured in commonly used benchmarks (CoreMark, Dhrystone, SPECint)
- **Energy Efficiency**: OPs/Joule

## Hardware Designer

- Tries to find a good balance
- Application dependent
  - IoT and HPC have different requirements
- One size does not fit all



# RISC-V cores developed at ETH Zurich

| Width  | Class                  | Core        | ISA and features                                            |
|--------|------------------------|-------------|-------------------------------------------------------------|
| 32 bit | Low Cost Core          | Zero-riscy  | RV32-IC(M)                                                  |
| 32 bit | Low Cost Core          | Micro-riscy | RV32-EC                                                     |
| 32 bit | DSP Enhanced Core      | RI5CY       | RV32-ICMDFX; SIMD, HW loops, bit manipulation, fixed point  |
| 32 bit | Streaming Compute Core | Snitch      | RV32-ICMDFX                                                 |
| 64 bit | Linux capable Core     | Ariane      | RV64-IC(MA); full privileged specification                  |




# Zero-riscy / Ibex, small core for control applications

- 2-stage pipeline
- Optimized for area
  - Area: 19 kGE (Zero-riscy), 12 kGE (Micro-riscy)
  - Critical path: ~30 logic levels
- New name: Ibex
  - lowRISC took over Zero-riscy / Micro-riscy in 2019



- Two configurations:
  - **Zero-riscy**: RV32IMC (2.44 CoreMark/MHz)
    - 32 registers, hardware multiplier
  - **Micro-riscy**: RV32EC (0.91 CoreMark/MHz)
    - 16 registers (E), software-emulated multiplication

P. Davide Schiavone et al., "Slow and steady wins the race? A comparison of ultra-low-power RISC-V cores for Internet-of-Things applications," 2017 27th International Symposium on Power and Timing Modeling, Optimization and Simulation (PATMOS), 2017, pp. 1-8.





# Ibex continues to grow with lowRISC

- **Contributors:** 40+
- **Pull Requests:** 680
- **GitHub Issues:** 314

*Ibex is a small and efficient, 32-bit, in-order RISC-V core with a 2-stage (or optionally 3-stage) pipeline that implements the RV32IMCB instruction set architecture.*

*Since being contributed to lowRISC by ETH Zürich, it has seen substantial investment of development effort.*





# Roadmap of Ibex





# Growth of Ibex measured with Coremark/MHz




# RI5CY / CV32E40P our main 32bit RISC-V core

- Zero-riscy / Ibex is suitable for simple applications
  - Control applications, book-keeping
- For number crunching, we need more capable cores
  - Mainly used in clusters for signal processing / machine learning applications
- Tuned for energy efficiency
  - Not necessarily lowest power
- Makes use of *custom extensions*
  - The Xpulp extensions enhance the capabilities
  - Several Xpulp extensions are under discussion for ratification



# Simplified pipeline for RI5CY / CV32E40P





# RI5CY: Our 32-bit workhorse

- 4-stage pipeline
  - 41 kGE
  - Coremark/MHz 3.19
- Includes Xpulp extensions
  - SIMD
  - Fixed point
  - Bit manipulations
  - HW loops



- Different Options:
  - **FPU:** IEEE 754 single precision
    - Including hardware support for FDIV, FSQRT, FMAC, FMUL
  - **Privilege support:**
    - Supports privilege mode **M** and **U**

M. Gautschi et al., "Near-Threshold RISC-V Core With DSP Extensions for Scalable IoT Endpoint Devices," in *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 25, no. 10, pp. 2700-2713, Oct. 2017.





# RISC-V has space for custom instructions (X)

- There is a reserved decoding space for custom instructions
  - Allows **everyone** to add new instructions to the core
  - This encoding space is **reserved**; it will not be used by future standard extensions
  - Implementations supporting custom instructions will be compatible with standard ISA
    - Code compiled for standard RISC-V will run without issues
  - The user has to provide support to take advantage of the additional instructions
    - Compiler that generates code for the custom instructions
- We make extensive use of this degree of freedom
  - Great tool for exploring
  - The goal is to help ratify these extensions as standards through working groups



# Our extensions to RI5CY & support in GCC, LLVM

- Post-incrementing load/store instructions
- Hardware Loops (**lp.start**, **lp.end**, **lp.count**)
- ALU instructions
  - Bit manipulation (count, set, clear, leading bit detection)
  - Fused operations: (add/sub-shift)
  - Immediate branch instructions
- Multiply Accumulate (32x32 bit and 16x16 bit)
- SIMD instructions (2x16 bit or 4x8 bit) with scalar replication option
  - add, min/max, dotproduct, shuffle, pack (copy), vector comparison

For 8-bit values, the following can be executed in a single cycle (**pv.dotup.b**):

$$Z = D_1 \times K_1 + D_2 \times K_2 + D_3 \times K_3 + D_4 \times K_4$$
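
A behavioural sketch of the formula above, assuming the four 8-bit values of `D` and `K` are packed into one 32-bit word each and are treated as unsigned. The function name `dotup_b` and the lane ordering are assumptions for illustration, not the core's reference model.

```python
# Sketch: unsigned 4 x 8-bit dot product over two packed 32-bit words,
# modelling Z = D1*K1 + D2*K2 + D3*K3 + D4*K4. Illustrative only.

def dotup_b(d_word, k_word):
    total = 0
    for lane in range(4):
        d = (d_word >> (8 * lane)) & 0xFF   # extract one unsigned byte of D
        k = (k_word >> (8 * lane)) & 0xFF   # extract the matching byte of K
        total += d * k
    return total & 0xFFFFFFFF               # result fits in a 32-bit register


# D = (1, 2, 3, 4), K = (5, 6, 7, 8), lowest byte first
z = dotup_b(0x04030201, 0x08070605)
print(z)
```

In software without the extension, the same result needs four byte extractions per operand, four multiplications and three additions.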



# RISC-V ISA extensions improve performance

```
for (i = 0; i < 100; i++)
    d[i] = a[i] + b[i];
```

## Baseline

```
li x5, 0
li x4, 100
Lstart:
    lb x2, 0(x10)
    lb x3, 0(x11)
    addi x10,x10, 1
    addi x11,x11, 1
    add x2, x3, x2
    sb x2, 0(x12)
    addi x4, x4, -1
    addi x12,x12, 1
    bne x4, x5, Lstart
```

11 cycles/output

## Auto-incr load/store

```
li x5, 0
li x4, 100
Lstart:
    lb x2, 0(x10!)
    lb x3, 0(x11!)
    addi x4, x4, -1
    add x2, x3, x2
    sb x2, 0(x12!)
    bne x4, x5, Lstart
```

8 cycles/output

## HW Loop

```
lp.setupi 100, Lend
    lb x2, 0(x10!)
    lb x3, 0(x11!)
    add x2, x3, x2
Lend: sb x2, 0(x12!)
```

5 cycles/output

## Packed-SIMD

```
lp.setupi 25, Lend
    lw x2, 0(x10!)
    lw x3, 0(x11!)
    pv.add.b x2, x3, x2
Lend: sw x2, 0(x12!)
```

1.25 cycles/output
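
The packed-SIMD variant works because four independent 8-bit additions fit into one 32-bit word. A minimal behavioural sketch, assuming little-endian packing and wrap-around byte arithmetic; the helper names are hypothetical and this is not a model of the actual datapath.

```python
# Sketch: lane-wise 8-bit addition on packed 32-bit words, compared against
# the byte-by-byte loop. 100 byte additions become 25 word additions.

def pv_add_b(a_word, b_word):
    result = 0
    for lane in range(4):
        shift = 8 * lane
        lane_sum = ((a_word >> shift) + (b_word >> shift)) & 0xFF  # no carry into next lane
        result |= lane_sum << shift
    return result


def add_bytes(a, b):
    """Reference: one 8-bit addition per output, as in the baseline loop."""
    return bytes((x + y) & 0xFF for x, y in zip(a, b))


def add_packed(a, b):
    """Same result using one packed addition per four outputs."""
    out = bytearray()
    for i in range(0, len(a), 4):
        a_word = int.from_bytes(a[i:i + 4], "little")
        b_word = int.from_bytes(b[i:i + 4], "little")
        out += pv_add_b(a_word, b_word).to_bytes(4, "little")
    return bytes(out)


a = bytes(range(100))
b = bytes((3 * i + 200) % 256 for i in range(100))
print(add_packed(a, b) == add_bytes(a, b))
```

Going from 11 to 1.25 cycles/output corresponds to a speed-up of 8.8x for this loop.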





# Runtime for three different applications





# Different cores for different area budgets





# Different cores for different power budgets





# Energy Efficiency: 2D-Convolution @55MHz, 0.8V





# This was a short overview of basics of RISC-V

- Tomorrow, more advanced cores
  - 64bit RISC-V core
  - Discussion on performance
  - Vector processing
- On Wednesday-Friday, we learn about PULP systems
  - Cores alone cannot do much; they need a system around them
  - Many-core systems
  - Managing Data
  - Acceleration
  - Actual Integrated Circuits from the PULP group



# PULP

Parallel Ultra Low Power

Luca Benini, Davide Rossi, Andrea Borghesi, Michele Magno, Simone Benatti, Francesco Conti, Francesco Beneventi, Daniele Palossi, Giuseppe Tagliavini, Antonio Pullini, Germain Haugou, Manuele Rusci, Florian Glaser, Fabio Montagna, Bjoern Forsberg, Pasquale Davide Schiavone, Alfio Di Mauro, Victor Javier Kartsch Morinigo, Tommaso Polonelli, Fabian Schuiki, Stefan Mach, Andreas Kurth, Florian Zaruba, Manuel Eggimann, Philipp Mayer, Marco Guermandi, Xiaying Wang, Michael Hersche, Robert Balas, Antonio Mastrandrea, Matheus Cavalcante, Angelo Garofalo, Alessio Burrello, Gianna Paulin, Georg Rutishauser, Andrea Cossettini, Luca Bertaccini, Maxim Mattheeuws, Samuel Riedel, Sergei Vostrikov, Vlad Niculescu, Hanna Mueller, Matteo Perotti, Nils Wistoff, Thorir Ingulfsson, Thomas Benz, Paul Scheffler, Moritz Scherer, Matteo Spallanzani, Andrea Bartolini, Frank K. Gurkaynak,  
and many more that we forgot to mention



<http://pulp-platform.org>



@pulp\_platform