

# ECE4700J Lab 4

## Introduction of RISC-V

Haoyang Zhang

UM-SJTU Joint Institute

*zhy-sjtu-jc@sjtu.edu.cn*

June 9, 2022



# Overview

1 Administrivia

2 RISC-V

3 Assignments

# Administrivia

- Lab 3 Assignment's live demo will be on 8:00pm - 10:30pm, Friday Jun. 10th (Beijing Time). Lab 3 Assignment code submission will be due on 23:59pm, Jun. 10th (Beijing Time).
- Homework 2 will be due on 02:59am, Jun. 13rd (Beijing Time).
- Lab 4 Assignment will be released soon (in this week). It will be due on Friday Jun. 24th (Beijing Time).
- Homework 3 will come soon.

# Administrivia

- Start early on Lab 4!!!
- The Lab 4 Assignment will be significantly harder than former Lab assignments, and you will learn a lot.
- We will cover hints in lab lectures.

# For Online Live Demo

- Please put yourself in the demo queue: <https://sjtu.feishu.cn/sheets/shtcn5u6ycQejXPQ2dQpLq24nhb>.
  - We will invite students into a discussion room to do the demo.
  - Others please wait in the main room.

# What is RISC-V

- RISC-V (pronounced “risk-five”) is an open, free ISA enabling a new era of processor innovation through open standard collaboration. Born in academia and research, RISC-V ISA delivers a new level of free, extensible software and hardware freedom on architecture, paving the way for the next 50 years of computing design and innovation. - RISC-V Foundation
  - Take the last part of with a huuuuuuge grain of salt
- It originated in UC Berkeley, but now it has its own foundation with a large number of contributors
- Spawns quite a few startups

# Why RISC-V

- A completely open ISA that is freely available to academia and industry
- A real ISA suitable for direct native hardware implementation, not just simulation or binary translation
- An ISA that avoids "over-architecting" for a particular microarchitecture style (e.g., microcoded, in-order, decoupled, out-of-order) or implementation technology (e.g., full-custom, ASIC, FPGA), but which allows efficient implementation in any of these
- An ISA separated into a small base integer ISA, usable by itself as a base for customized accelerators or for educational purposes, and optional standard extensions, to support general-purpose software development
  - Most important part for us in particular, is good software support

# T-Head's XuanTie 910

[Homepage](#)[Products](#)[Industry Applications](#)[Technical Resources](#)[About Us](#)[CN](#)

## C910

Compatible with RISC-V architecture, C910 adopts 12-stage superscalar pipeline and enhances its arithmetic operation, memory access and multi-core synchronization. Meanwhile, it is configured with standard memory management unit and can run Linux and other operating systems; with 3-issue and 8-executive deep out-of-order execution architecture and single/double-precision floating-point unit,

It is applicable to artificial intelligence, 5G, edge server and other areas that have very high performance requirements.

[View Datasheet \(PDF\)](#)

# Why RISC-V

- Lets us explore more layers of the computing stack, mainly compilers and systems
- Can arbitrarily generate test cases, since we can just write in C now!
  - Easier for you to test
  - Easier for the staff to shuffle around test cases
  - Easier to generate large test cases that can actually benefit from additional features and properly reward those who worked on extra features

# ISA Overview - Base ISA

- Base ISA + many extensions including privileges mode
- 32, 64 and 128-bit address space
  - We only use 32-bit for now, the other two only add a few instructions
- 32 integer registers
- Byte level addressing for memory, little endian
- Instructions must align to 32-bit addresses (unless they are compressed)
- No condition codes or carry out bits to detect overflow
  - Intentional, these can be achieved in software
- Comparisons are built in for branches
  - e.g. beq x1, x2, offset

# ISA Overview - Instruction Formats



Figure 2.3: RISC-V base instruction formats showing immediate variants.

# ISA Overview - More Extensions

- V - Has a vector extension as well, if staff in the future wants to spice things up
- A - The atomic extension will be partially used to implement locks in the future
- F, D, Q, L- Floating point extensions can be supported for people's own interest
- C - Compressed extension to increase code density
- E - for embedded systems; reduced number of registers (only 16), can be combined with C to save ROM
- T - RISC-V has plans to support transactional memory in the future (omegalul)
- Z series, basically all future extensions since they ran out of letters

# RV32I Base Integer Instructions

| Inst  | Name                    | FMT | Opcode  | funct3 | funct7         | Description (C)         | Note         |
|-------|-------------------------|-----|---------|--------|----------------|-------------------------|--------------|
| add   | ADD                     | R   | 0110011 | 0x0    | 0x00           | $rd = rs1 + rs2$        |              |
| sub   | SUB                     | R   | 0110011 | 0x0    | 0x20           | $rd = rs1 - rs2$        |              |
| xor   | XOR                     | R   | 0110011 | 0x4    | 0x00           | $rd = rs1 \wedge rs2$   |              |
| or    | OR                      | R   | 0110011 | 0x6    | 0x00           | $rd = rs1 \vee rs2$     |              |
| and   | AND                     | R   | 0110011 | 0x7    | 0x00           | $rd = rs1 \& rs2$       |              |
| sll   | Shift Left Logical      | R   | 0110011 | 0x1    | 0x00           | $rd = rs1 \ll rs2$      |              |
| srl   | Shift Right Logical     | R   | 0110011 | 0x5    | 0x00           | $rd = rs1 \gg rs2$      |              |
| sra   | Shift Right Arith*      | R   | 0110011 | 0x5    | 0x20           | $rd = rs1 \gg rs2$      | msb-extends  |
| slt   | Set Less Than           | R   | 0110011 | 0x2    | 0x00           | $rd = (rs1 < rs2)?1:0$  |              |
| sltu  | Set Less Than (U)       | R   | 0110011 | 0x3    | 0x00           | $rd = (rs1 < rs2)?1:0$  | zero-extends |
| addi  | ADD Immediate           | I   | 0010011 | 0x0    |                | $rd = rs1 + imm$        |              |
| xori  | XOR Immediate           | I   | 0010011 | 0x4    |                | $rd = rs1 \wedge imm$   |              |
| ori   | OR Immediate            | I   | 0010011 | 0x6    |                | $rd = rs1 \vee imm$     |              |
| andi  | AND Immediate           | I   | 0010011 | 0x7    |                | $rd = rs1 \& imm$       |              |
| slli  | Shift Left Logical Imm  | I   | 0010011 | 0x1    | imm[5:11]=0x00 | $rd = rs1 \ll imm[0:4]$ |              |
| srlti | Shift Right Logical Imm | I   | 0010011 | 0x5    | imm[5:11]=0x00 | $rd = rs1 \gg imm[0:4]$ |              |
| srai  | Shift Right Arith Imm   | I   | 0010011 | 0x5    | imm[5:11]=0x20 | $rd = rs1 \gg imm[0:4]$ | msb-extends  |
| slti  | Set Less Than Imm       | I   | 0010011 | 0x2    |                | $rd = (rs1 < imm)?1:0$  |              |
| sltiu | Set Less Than Imm (U)   | I   | 0010011 | 0x3    |                | $rd = (rs1 < imm)?1:0$  | zero-extends |

# RV32I Base Integer Instructions

|        |                     |   |         |     |         |                              |              |
|--------|---------------------|---|---------|-----|---------|------------------------------|--------------|
| lb     | Load Byte           | I | 0000011 | 0x0 |         | rd = M[rs1+imm][0:7]         |              |
| lh     | Load Half           | I | 0000011 | 0x1 |         | rd = M[rs1+imm][0:15]        |              |
| lw     | Load Word           | I | 0000011 | 0x2 |         | rd = M[rs1+imm][0:31]        |              |
| lbu    | Load Byte (U)       | I | 0000011 | 0x4 |         | rd = M[rs1+imm][0:7]         |              |
| lhu    | Load Half (U)       | I | 0000011 | 0x5 |         | rd = M[rs1+imm][0:15]        | zero-extends |
| sb     | Store Byte          | S | 0100011 | 0x0 |         | M[rs1+imm][0:7] = rs2[0:7]   |              |
| sh     | Store Half          | S | 0100011 | 0x1 |         | M[rs1+imm][0:15] = rs2[0:15] |              |
| sw     | Store Word          | S | 0100011 | 0x2 |         | M[rs1+imm][0:31] = rs2[0:31] |              |
| beq    | Branch ==           | B | 1100011 | 0x0 |         | if(rs1 == rs2) PC += imm     |              |
| bne    | Branch !=           | B | 1100011 | 0x1 |         | if(rs1 != rs2) PC += imm     |              |
| blt    | Branch <            | B | 1100011 | 0x4 |         | if(rs1 < rs2) PC += imm      |              |
| bge    | Branch ≥            | B | 1100011 | 0x5 |         | if(rs1 >= rs2) PC += imm     |              |
| bltu   | Branch < (U)        | B | 1100011 | 0x6 |         | if(rs1 < rs2) PC += imm      | zero-extends |
| bgeu   | Branch ≥ (U)        | B | 1100011 | 0x7 |         | if(rs1 >= rs2) PC += imm     | zero-extends |
| jal    | Jump And Link       | J | 1101111 |     |         | rd = PC+4; PC += imm         |              |
| jalr   | Jump And Link Reg   | I | 1100111 | 0x0 |         | rd = PC+4; PC = rs1 + imm    |              |
| lui    | Load Upper Imm      | U | 0110111 |     |         | rd = imm << 12               |              |
| auipc  | Add Upper Imm to PC | U | 0010111 |     |         | rd = PC + (imm << 12)        |              |
| ecall  | Environment Call    | I | 1110011 | 0x0 | imm=0x0 | Transfer control to OS       |              |
| ebreak | Environment Break   | I | 1110011 | 0x0 | imm=0x1 | Transfer control to debugger |              |

# RV32M Multiply Extension

| Inst  | Name             | FMT | Opcode  | funct3 | funct7 | Description (C)           |
|-------|------------------|-----|---------|--------|--------|---------------------------|
| mul   | MUL              | R   | 0110011 | 0x0    | 0x01   | $rd = (rs1 * rs2)[31:0]$  |
| mulh  | MUL High         | R   | 0110011 | 0x1    | 0x01   | $rd = (rs1 * rs2)[63:32]$ |
| mulsu | MUL High (S) (U) | R   | 0110011 | 0x2    | 0x01   | $rd = (rs1 * rs2)[63:32]$ |
| mulu  | MUL High (U)     | R   | 0110011 | 0x3    | 0x01   | $rd = (rs1 * rs2)[63:32]$ |
| div   | DIV              | R   | 0110011 | 0x4    | 0x01   | $rd = rs1 / rs2$          |
| divu  | DIV (U)          | R   | 0110011 | 0x5    | 0x01   | $rd = rs1 / rs2$          |
| rem   | Remainder        | R   | 0110011 | 0x6    | 0x01   | $rd = rs1 \% rs2$         |
| remu  | Remainder (U)    | R   | 0110011 | 0x7    | 0x01   | $rd = rs1 \% rs2$         |

# Software Environment



**temp = v[k];  
v[k] = v[k+1];  
v[k+1] = temp;**

lw \$t0, 0(\$2)  
lw \$t1, 4(\$2)  
sw \$t1, 0(\$2)  
sw \$t0, 4(\$2)

Anything can be represented  
as a *number*,  
i.e., data or instructions

0000 1001 1100 0110 1010 1111 0101 1000  
1010 1111 0101 1000 0000 1001 1100 0110  
1100 0110 1010 1111 0101 1000 0000 1001  
0101 1000 0000 1001 1100 0110 1010 1111



4

# Software Environment

- Why do I remotely even care about software in a hardware class
  - Believe me, it's important
  - Architecture is the bridge between the two (insert preaching)
- For RISC-V, we will have both C programs and assembly programs to test
- At the same time, you also need to have a grasp of how C works at a very low level
  - It doesn't affect your implementation for sure, but knowing this will make your life easier
  - You will also learn A LOT

# Software Environment - GNU Tools

- There's a full suite of GNU tools for RISC-V
  - `gcc` - compiler
  - `as` - assembler
  - `ld` - linker
  - `objdump` - disassembler
  - `objcopy` - don't really need but cool
  - `g++` - don't use this
  - `gdb` - no idea if this actually works
  - a lot more that you can explore yourself...

## Software Environment - ELF

- What happens when you compile a program?
  - You generate an ELF, not Legolas though
  - But rather Executable and Linkable Format
- What is actually inside an ELF?
  - ELF/program header
    - Usually tells what OS it's for
    - Where in memory to put the program in
  - .text: the actual instructions of the program
  - .rodata: read-only data, but we don't enforce that
  - .data: modifiable program data
  - Section header table: where's what



# Software Environment - Program Space

- Flashback to 370 or whatever computer organization class you had
- What does the memory space for a program look like?
- Stack for statically allocated variables, pointer decrements
- Heap for dynamic memory, pointer increments



# Software Environment - Program Space

- Example of program space allocation for arm processors->
  - The linker allocate the memory space
- In general memory addresses around 0x0 are precious
  - Some peripherals on the serial buses can only talk to limited addresses
- In our case, the text section starts at 0x0 to simplify loading
- The stack pointer starts at 0x10000
  - The end of the testbench memory space
  - This means any program that you write, text+data+stack < 64KiB



# Software Environment - Program Space

# Software Environment - Function Calls

- Every time there's a function call, have a frame pointer that saves the previous stack pointer
- Caller/callee save the variables



# Software Environment - ABI

- Registers aren't just registers, each of them has a meaning
- Such concept is called Application Binary Interface (ABI)

| Register | ABI Name | Description                       | Saver  |
|----------|----------|-----------------------------------|--------|
| x0       | zero     | Hard-wired zero                   | —      |
| x1       | ra       | Return address                    | Caller |
| x2       | sp       | Stack pointer                     | Callee |
| x3       | gp       | Global pointer                    | —      |
| x4       | tp       | Thread pointer                    | —      |
| x5       | t0       | Temporary/alternate link register | Caller |
| x6-7     | t1-2     | Temporaries                       | Caller |
| x8       | s0/fp    | Saved register/frame pointer      | Callee |
| x9       | s1       | Saved register                    | Callee |
| x10-11   | a0-1     | Function arguments/return values  | Caller |
| x12-17   | a2-7     | Function arguments                | Caller |
| x18-27   | s2-11    | Saved registers                   | Callee |
| x28-31   | t3-6     | Temporaries                       | Caller |
| f0-7     | ft0-7    | FP temporaries                    | Caller |
| f8-9     | fs0-1    | FP saved registers                | Callee |
| f10-11   | fa0-1    | FP arguments/return values        | Caller |
| f12-17   | fa2-7    | FP arguments                      | Caller |
| f18-27   | fs2-11   | FP saved registers                | Callee |
| f28-31   | ft8-11   | FP temporaries                    | Caller |

## Code Example

```
add    1  2  3 ; reg 3 = reg 1 + reg 2
nand   3  4  5 ; reg 5 = reg 3 ~& reg 4
add    6  3  7 ; reg 7 = reg 6 + reg 3
lw     3  6  10 ; reg 6 = Mem[reg 3 + 10]
sw     6  2  12 ; Mem[reg6+12] = reg 2
```

# Code Example

```
test_progs > ASH rv32_evens.s
1  /*
2   TEST PROGRAM #2: compute even numbers that are less than 16
3
4
5     long output[16];
6
7     void
8     main(void)
9     {
10       long i,j;
11
12       for (i=0,j=0; i < 16; i++)
13       {
14         if ((i & 1) == 0)
15           output[j++] = i;
16       }
17     }
18 */
19     data = 0x1000
20     li x3, 0
21     li x4, data
22     loop1: andi x31, x3, 1
23     bne x31, x0, loop2 #
24     sw x3, 0(x4)
25     addi x4, x4, 0x8 #
26     loop2: addi x3, x3, 0x1 #
27     slti x2, x3, 16 #
28     bne x2, x0, loop1 #
29     wfi
30
31
```

# Lab4 Assignment - VeriSimple Pipeline

- Same basic pipeline as in class
  - More info at Appendix A of the textbook
    - Starting with the 8th Edition of Hennessy and Patterson, all examples should be using RISC-V as well
  - Supports RV32IM minus divide, remainder and system instructions
- Need to add hazard logic to a simple 5 stage pipeline
- Given pipeline without hazard detection or forwarding logic
- Programs still run because only one instruction is allowed in the pipeline at a time

# Lab4 Assignment - VeriSimple Pipeline



# Lab 4 Assignment Overview

- Forwarding
  - Like what we covered in class
  - Need to forward results from later stages to EX
- Structural Hazards
  - Only one memory port for fetching and memory accesses
  - Memory gets priority over fetch
    - You need this to guarantee forward progress
- Control Hazards
  - Predict not taken, resolved in MEM stage
  - Flush IF/ID, ID/EX, EX/MEM if incorrect

# Testing Environment Setup

- This will be very complicated
- After we finish designing the assignment, a Lab4-part2 slide will be released to talk about this part.

# Testcases Writing

- Feel free to write your own test programs, C or assembly
- Most of the glibc library functions should work, as long as they don't use any system calls
- We do have our own version of `malloc()`, `calloc()`, `free()`
  - `tj_malloc()`, `tj_calloc()`, `tj_free()`
- Avoid using the division (“/”) and modulus operator (“%”), since they will generate unsupported instructions like DIV and REM

# Pipeline Specifics - Fetch



# Pipeline Specifics - Decode



# Pipeline Specifics - EX



# Pipeline Specifics - MEM



# Pipeline Specifics - WB



# Pipeline Specifics - Memory Arbitration



## Sample Code

```
add    1  2  3 ; reg 3 = reg 1 + reg 2
nand   3  4  5 ; reg 5 = reg 3 ~& reg 4
add    6  3  7 ; reg 7 = reg 6 + reg 3
lw     3  6  10 ; reg 6 = Mem[reg 3 + 10]
sw     6  2  12 ; Mem[reg6+12] = reg 2
```

# Lab4 Hints



# Lab4 Hints



# Lab4 Hints



# Lab4 Hints



# Lab4 Hints



# Lab4 Hints



# Lab4 Hints



# Lab4 Hints

- Branches should resolve in the same stage they are currently resolved in
- All forwarding must be to the EX stage, even if the data isn't needed until a later stage
- Any stalling due to data hazards must occur in the decode stage. (That is, if stalling is required the dependent instruction should stall in the decode stage.) Obviously, instructions following the stalling instruction in the IF stage will have to stay in the IF stage. Put another way, if you need to insert an invalid instruction, it should be inserted in the EX stage

# Lab4 Hints



# Lab4 Hints



# Lab4 Hints



# Lab4 Hints



## Lab4 Hints



# Lab4 Hints



# Lab4 Hints



# Lab4 Hints

- If you wish to insert a nop you must invalidate the instruction. Otherwise your CPI numbers will be wrong
- If there is a structural hazard in the memory, you should let the load/store go and have the fetch stage wait on getting memory

# Lab4 Hints



# Lab4 Hints



# Implementation Hints

- Try to tackle one thing at a time
- Be careful of register 0!
- Be aware of where operand data is coming from
  - Not all instructions receive source data from ALU output.
- “Forward data into EX stage”
  - Essentially means widen muxes for ALU input, or add muxes for EX/MEM pipeline register inputs.
- Adding signals should be very easy if you use structs wisely

# References

 Jielun Tan, James Connolly (2021)

EECS470 Lab4-part2

 Jon Beaumont (2021)

EECS470 Lab4

# Thanks