

# RISC-V SoC Microarchitecture Design & Optimization

Design Review #4 (Final Defense)

**Group 23**

**Instructor & Sponsor:** Weikang Qian

**Group Members:** Li Shi, Jian Shi, Yichao Yuan, Yiqiu Sun, Zhiyuan Liu



JOINT INSTITUTE  
交大密西根学院

# Team Members



- Zhiyuan Liu
- Li Shi
- Yichao Yuan
- Jian Shi
- Yiqiu Sun

# Overview

- Introduction
- Design Specifications
- Concept Generation & Selection
- Final Design of Microarchitecture
- Prototype Description & Validation
- Discussion & Conclusion

# 1. Introduction

# Introduction

Domain-Specific Optimization: The Future of Computing



Fig 1. Tesla self-driving cars.

Source: [www.businessinsider.com/tesla-autopilot-full-self-driving-subscription-early-2021-elon-musk-2020-12](http://www.businessinsider.com/tesla-autopilot-full-self-driving-subscription-early-2021-elon-musk-2020-12)



# 2. Design Specifications

# Design Specifications

## Customer Requirement (CR)

*General*

- Be compatible with RISC-V apps
- Have detailed docs for reference
- Be inexpensive

*Performance*

- Run fast for normal arithmetic programs
- Run fast for machine learning (ML) apps
- Run fast for memory-bound apps

*Embedded-System*

- Be easy to configure for various parameters
- Have good support for multiple I/O devices
- Save power
- Respond quickly

# Design Specifications

## Engineering Specifications (ES)

| Engineering Specification                        | Unit        | Target Value        |
|--------------------------------------------------|-------------|---------------------|
| Support RV32G instruction set architecture (ISA) | -           | Yes                 |
| Core frequency on FPGA test platform             | MHz         | 100                 |
| Number of pipeline stages                        | -           | 9                   |
| Instructions executed per clock cycle (IPC)      | -           | 0.5                 |
| Support instruction dynamic scheduling           | -           | Yes                 |
| Typical total cache size                         | KB          | 32                  |
| Number of function units                         | -           | 6                   |
| Average response time to a request for service   | ms          | 10                  |
| Usage of look-up tables (LUT) on FPGA            | k           | 120                 |
| Usage of block RAM (BRAM) on FPGA                | -           | 50                  |
| Usage of digital signal processor (DSP) on FPGA  | -           | 30                  |
| Power consumption on target FPGA test platform   | W           | 5                   |
| Operations processed within unit energy          | MOp/J       | 25                  |
| Number of flexibly-configured modules            | -           | 10                  |
| Number of I/O device types                       | -           | 3                   |
| User guide and programmers manual                | -           | Yes                 |

Table 1. Engineering specifications.


# QFD



Fig 2. QFD diagram.

Data source: Alexander Dörflinger, et al. 2021. A comparative survey of open-source application-class RISC-V processor implementations. Proceedings of the 18th ACM International Conference on Computing Frontiers. ACM, New York, NY, USA, 12–20.

# 3. Concept Generation & Selection

# Concept Generation & Decision Making

## Instruction Set Architecture (ISA)



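|         | ARMv7 ISA                                                  | RISC-V 32G ISA                                           |
|---------|------------------------------------------------------------|----------------------------------------------------------|
| **Pro** | Supports complicated instructions<sup>1</sup>              | Custom design space; easy to implement<sup>1</sup>       |
| **Con** | No custom design space; difficult to implement<sup>2</sup> | Supports only a small number of instructions<sup>3</sup> |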

Source:

1. A. Armstrong, et al. "ISA Semantics for ARMv8-A, RISC-V, and CHERI-MIPS." *Proceedings of the ACM on Programming Languages* 3, no. POPL (2019): 1-31.
2. ARM Limited. [www.arm.com](http://www.arm.com)
3. RISC-V International. [www.riscv.org](http://www.riscv.org)

# Concept Generation & Decision Making

## Instruction Set Architecture (ISA) - Decision



RISC-V 32G ISA



Fig 3. Execution stage with approximate computing unit.

# Concept Generation & Decision Making

## Microarchitecture Design: Instruction Parallelism



|         | Scalar              | Superscalar                              |
|---------|---------------------|------------------------------------------|
| **Pro** | Low complexity      | Better performance; higher compatibility |
| **Con** | Limited performance | High complexity                          |

# Concept Generation & Decision Making

## Microarchitecture Design: Instruction Scheduling



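|         | Static (In-order)                 | Dynamic (Out-of-order)            |
|---------|-----------------------------------|-----------------------------------|
| **Pro** | Easy to implement; little overhead | Better performance                |
| **Con** | Limited performance               | Hard to implement; extra overhead |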

# Decision Matrix

| Design Criterion    | Weight Factor | Unit | Scalar Score | Scalar Rating | Superscalar Score | Superscalar Rating | In-order Score | In-order Rating | Out-of-order Score | Out-of-order Rating |
|---------------------|---------------|------|--------------|---------------|-------------------|--------------------|----------------|-----------------|--------------------|---------------------|
| Performance         | 0.45          | %    | 5            | 2.25          | 8                 | 3.6                | 3              | 1.35            | 10                 | 4.5                 |
| Hardware Complexity | 0.20          | %    | 9            | 1.8           | 6                 | 1.2                | 9              | 1.8             | 3                  | 0.6                 |
| Cost                | 0.15          | %    | 8            | 1.2           | 5                 | 0.75               | 9              | 1.35            | 5                  | 0.75                |
| Flexibility         | 0.20          | %    | 6            | 1.2           | 9                 | 1.8                | 5              | 1               | 9                  | 1.8                 |
| Total               |               |      |              | 6.45          |                   | 7.35               |                | 5.5             |                    | 7.65                |

Scalar and Superscalar are the Instruction Parallelism options; In-order and Out-of-order are the Instruction Scheduling options.
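The ratings in the decision matrix are each score multiplied by its weight factor, and the totals are the column sums. A minimal sketch that recomputes the totals from the weights and scores in the matrix (the names `WEIGHTS`, `SCORES`, and `weighted_total` are illustrative, not from the project):

```python
# Weights and 1-10 scores copied from the decision matrix.
WEIGHTS = {
    "Performance": 0.45,
    "Hardware Complexity": 0.20,
    "Cost": 0.15,
    "Flexibility": 0.20,
}

SCORES = {
    "Scalar": {"Performance": 5, "Hardware Complexity": 9, "Cost": 8, "Flexibility": 6},
    "Superscalar": {"Performance": 8, "Hardware Complexity": 6, "Cost": 5, "Flexibility": 9},
    "In-order": {"Performance": 3, "Hardware Complexity": 9, "Cost": 9, "Flexibility": 5},
    "Out-of-order": {"Performance": 10, "Hardware Complexity": 3, "Cost": 5, "Flexibility": 9},
}


def weighted_total(scores, weights=WEIGHTS):
    """Sum of weight * score over all criteria, rounded to two decimals."""
    return round(sum(weights[c] * scores[c] for c in weights), 2)


totals = {option: weighted_total(s) for option, s in SCORES.items()}

for option, total in totals.items():
    print(f"{option:<13} {total}")
```

Superscalar outscores scalar and out-of-order outscores in-order, which matches the selected design.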

# 4. Final Design of Microarchitecture

# Final Design Overview

- RISC-V ISA
- Out-of-order Execution
- 4-way Superscalar
- Built-in Approximate Units



Fig 4. RISC-V core overview.

# Final Design Overview



**RISC-V ISA**



**Out-of-order Execution**



**4-way Superscalar**



**Built-in Approximate Units**



Fig 5. Microarchitecture design.

# Final Design

## Instruction Fetch & Branch Predictor

- 4-way Instruction Fetch
- Adaptive 2-level branch predictor
- 128-entry Branch History Table (BHT)
- 128-entry Pattern History Table (PHT)
- 32-entry Branch Target Buffer (BTB)



Fig 6. Instruction fetch & branch predictor.
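A behavioral sketch of an adaptive two-level predictor with the table sizes listed above. The slide does not say how the tables are indexed, so everything below is an assumption: the BHT is indexed by low PC bits and holds a 7-bit local history, that history indexes a PHT of 2-bit saturating counters, and the BTB is direct-mapped with a full-PC tag. The class and method names are hypothetical.

```python
class TwoLevelPredictor:
    """Software model only; indexing and counter widths are assumptions."""

    def __init__(self, bht_entries=128, pht_entries=128, btb_entries=32):
        self.bht = [0] * bht_entries      # per-branch local history
        self.pht = [1] * pht_entries      # 2-bit counters, start weakly not-taken
        self.btb = [None] * btb_entries   # (tag, target) pairs
        self.hist_mask = pht_entries - 1  # 128 entries -> 7 history bits

    def _bht_index(self, pc):
        return (pc >> 2) % len(self.bht)

    def _btb_index(self, pc):
        return (pc >> 2) % len(self.btb)

    def predict(self, pc):
        """Return (predicted_taken, next_pc)."""
        history = self.bht[self._bht_index(pc)]
        taken = self.pht[history] >= 2
        slot = self.btb[self._btb_index(pc)]
        if taken and slot is not None and slot[0] == pc:
            return True, slot[1]
        # Without a known target the fetch unit can only fall through.
        return False, pc + 4

    def update(self, pc, taken, target):
        """Train with the resolved branch outcome."""
        i = self._bht_index(pc)
        history = self.bht[i]
        counter = self.pht[history]
        self.pht[history] = min(3, counter + 1) if taken else max(0, counter - 1)
        self.bht[i] = ((history << 1) | int(taken)) & self.hist_mask
        if taken:
            self.btb[self._btb_index(pc)] = (pc, target)


predictor = TwoLevelPredictor()
branch_pc, branch_target = 0x80000064, 0x80000038  # illustrative addresses
for _ in range(20):
    predictor.update(branch_pc, True, branch_target)
print(predictor.predict(branch_pc))
```

After repeated taken outcomes the history saturates and the matching counter moves to the taken side, so the branch is predicted taken with the target supplied by the BTB.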

# Final Design

## Instruction Decode

- 4-way Instruction Decode
- Support RV32IFD
- Part of the front end



Fig 7. Instruction decode unit.
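To illustrate what each decode lane does, here is a small sketch that splits a 32-bit RISC-V instruction word into its standard fields, following the base instruction formats in the RISC-V ISA manual. It is not the project's decoder; the function names are hypothetical, and the sample encodings are taken from the spike trace shown later in this deck.

```python
from typing import NamedTuple


class Fields(NamedTuple):
    opcode: int
    rd: int
    funct3: int
    rs1: int
    rs2: int
    funct7: int


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def decode_fields(word):
    """Extract the fixed-position fields shared by the RV32 base formats."""
    return Fields(
        opcode=word & 0x7F,
        rd=(word >> 7) & 0x1F,
        funct3=(word >> 12) & 0x7,
        rs1=(word >> 15) & 0x1F,
        rs2=(word >> 20) & 0x1F,
        funct7=(word >> 25) & 0x7F,
    )


def imm_i(word):
    """I-type immediate: bits 31:20, sign-extended."""
    return sign_extend(word >> 20, 12)


def imm_u(word):
    """U-type immediate: bits 31:12, placed in the upper 20 bits."""
    return word & 0xFFFFF000


# A 4-wide fetch group decoded lane by lane.
group = [0x20000137, 0xFF010113, 0x00112623, 0x01400593]
for word in group:
    print(hex(word), decode_fields(word))
```

For example, `0xff010113` decodes to opcode `0x13` with `rd = rs1 = 2` (`sp`) and an I-type immediate of -16, which agrees with `addi sp, sp, -16` in the trace.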

# Final Design

## Register Renaming

- 4-way Register Renaming
- “Explicit Renaming” Design
- Interleaving Free List
- Retirement Rename Allocation Table
- Part of the front end



Fig 8. Register renaming unit.
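A minimal sketch of explicit renaming, assuming 32 architectural registers and the 128-entry physical register file listed on the Issue & Register File slide. The interleaved organization of the free list is not modeled (a plain FIFO stands in for it), and the class, method, and field names are hypothetical.

```python
from collections import deque


class RenameUnit:
    """Behavioral model: speculative map, retirement map, and free list."""

    def __init__(self, num_arch=32, num_phys=128):
        self.num_phys = num_phys
        self.rat = list(range(num_arch))         # speculative mapping
        self.retire_rat = list(range(num_arch))  # committed mapping
        self.free_list = deque(range(num_arch, num_phys))

    def rename(self, rd, rs1, rs2):
        """Return (pd, ps1, ps2, old_pd) for one instruction."""
        ps1, ps2 = self.rat[rs1], self.rat[rs2]
        if rd == 0:
            # x0 is hardwired to zero, so it never gets a new register.
            return 0, ps1, ps2, None
        if not self.free_list:
            raise RuntimeError("no free physical register: front end must stall")
        pd = self.free_list.popleft()
        old_pd = self.rat[rd]
        self.rat[rd] = pd
        return pd, ps1, ps2, old_pd

    def commit(self, rd, pd, old_pd):
        """At retirement, record the mapping and recycle the previous register."""
        if old_pd is None:
            return
        self.retire_rat[rd] = pd
        self.free_list.append(old_pd)

    def recover(self):
        """After a full flush, rebuild speculative state from the retirement map."""
        self.rat = list(self.retire_rat)
        mapped = set(self.retire_rat)
        self.free_list = deque(p for p in range(self.num_phys) if p not in mapped)


unit = RenameUnit()
first = unit.rename(rd=2, rs1=2, rs2=0)    # addi sp, sp, -16
second = unit.rename(rd=10, rs1=2, rs2=0)  # reads the renamed sp
print(first, second)
```

Renaming a 4-wide group in program order, as in the two calls above, makes a later instruction in the group pick up the new physical register of an earlier one.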

# Final Design

## Instruction Dispatch & Commit



Fig 9. Re-order buffer (ROB) & dispatch unit.

# Final Design

## Issue & Register File

- 32-entry 3-way Integer Issue Queue
- 16-entry 1-way Memory Issue Queue
- 16-entry 2-way FP Issue Queue
- 128-entry Physical Register File (PRF)
- Data Dependency Scoreboard



Fig 10. Issue queue, PRF, scoreboard.
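The interaction between an issue queue and the scoreboard can be sketched as follows, using the integer queue's parameters (32 entries, 3-way). This is a simplified model under assumptions: one ready bit per physical register, oldest-first selection, and no execution latency. The names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IQEntry:
    pd: int    # destination physical register
    ps1: int   # source physical registers
    ps2: int
    age: int   # dispatch order, used only to keep entries distinct


class IssueQueue:
    def __init__(self, capacity=32, width=3, num_phys=128):
        self.capacity = capacity
        self.width = width
        self.entries = []                # kept in dispatch (age) order
        self.ready = [True] * num_phys   # scoreboard: one bit per physical register
        self._next_age = 0

    def dispatch(self, pd, ps1, ps2):
        """Insert a renamed instruction; return False if the queue is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.ready[pd] = False           # result is not available yet
        self.entries.append(IQEntry(pd, ps1, ps2, self._next_age))
        self._next_age += 1
        return True

    def select(self):
        """Issue up to `width` of the oldest entries whose sources are ready."""
        picked = [e for e in self.entries
                  if self.ready[e.ps1] and self.ready[e.ps2]][:self.width]
        for entry in picked:
            self.entries.remove(entry)
        return picked

    def writeback(self, pd):
        """Mark a result as available, waking up dependent entries."""
        self.ready[pd] = True


iq = IssueQueue()
iq.dispatch(pd=32, ps1=1, ps2=2)
iq.dispatch(pd=33, ps1=32, ps2=3)  # depends on the first instruction
iq.dispatch(pd=34, ps1=4, ps2=5)
first_cycle = [e.pd for e in iq.select()]
iq.writeback(32)
second_cycle = [e.pd for e in iq.select()]
print(first_cycle, second_cycle)
```

The dependent instruction is skipped in the first cycle while the younger independent one issues, which is the out-of-order behavior the scoreboard enables.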

# Final Design

## Execution



Fig 11. Execution units.

# Final Design Summary



**RISC-V ISA**



**Out-of-order Execution**



**4-way Superscalar**



**Built-in Approximate Units**



Fig 5. Microarchitecture design.

# 5. Prototype Description & Validation

# Prototype Description



Fig 12. SoC and simulation system integration prototype.

# Prototype Validation Process



# Prototype Validation Result

**Correct results** (`spike.out`)

| Line | Hart   | PC         | Encoding   | Instruction             |
|------|--------|------------|------------|-------------------------|
| 1    | core 0 | 0x80000558 | 0x20000137 | `lui sp, 0x20000`       |
| 2    | core 0 | 0x8000055c | 0xb21ff0ef | `jal pc - 0x4e0`        |
| 3    | core 0 | 0x8000007c | 0xff010113 | `addi sp, sp, -16`      |
| 4    | core 0 | 0x80000080 | 0x00112623 | `sw ra, 12(sp)`         |
| 5    | core 0 | 0x80000084 | 0x01400593 | `li a1, 20`             |
| 6    | core 0 | 0x80000088 | 0x10000537 | `lui a0, 0x10000`       |
| 7    | core 0 | 0x8000008c | 0x40850513 | `addi a0, a0, 1032`     |
| 8    | core 0 | 0x80000090 | 0xf71ff0ef | `jal pc - 0x90`         |
| 9    | core 0 | 0x80000000 | 0x00100793 | `li a5, 1`              |
| 10   | core 0 | 0x80000004 | 0x06b7da63 | `bge a5, a1, pc + 116`  |
| 11   | core 0 | 0x80000008 | 0x00050313 | `mv t1, a0`             |
| 12   | core 0 | 0x8000000c | 0xfff58f13 | `addi t5, a1, -1`       |
| 13   | core 0 | 0x80000010 | 0x00000893 | `li a7, 0`              |
| 14   | core 0 | 0x80000014 | 0x0400006f | `j pc + 0x40`           |
| 15   | core 0 | 0x80000054 | 0x00030e93 | `mv t4, t1`             |
| 16   | core 0 | 0x80000058 | 0x00032e03 | `lw t3, 0(t1)`          |
| 17   | core 0 | 0x8000005c | 0x00088813 | `mv a6, a7`             |
| 18   | core 0 | 0x80000060 | 0x00188893 | `addi a7, a7, 1`        |
| 19   | core 0 | 0x80000064 | 0xfcb8dae3 | `bge a7, a1, pc - 44`   |
| 20   | core 0 | 0x80000068 | 0x00030713 | `mv a4, t1`             |

Fig 13. Correct program counter trace results.

**Verification target results** (`retire.out`)

| Line | Log tag | Retired PC |
|------|---------|------------|
| 1    | [57]    | 80000558   |
| 2    | [57]    | 8000055c   |
| 3    | [77]    | 8000007c   |
| 4    | [83]    | 80000080   |
| 5    | [83]    | 80000084   |
| 6    | [83]    | 80000088   |
| 7    | [85]    | 8000008c   |
| 8    | [85]    | 80000090   |
| 9    | [105]   | 80000000   |
| 10   | [111]   | 80000004   |
| 11   | [111]   | 80000008   |
| 12   | [111]   | 8000000c   |
| 13   | [111]   | 80000010   |
| 14   | [111]   | 80000014   |
| 15   | [131]   | 80000054   |
| 16   | [133]   | 80000058   |
| 17   | [133]   | 8000005c   |
| 18   | [133]   | 80000060   |
| 19   | [139]   | 80000064   |
| 20   | [139]   | 80000068   |

Fig 14. Program counter trace results to be validated.
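The comparison between the two traces can be automated. The sketch below assumes the line formats visible in Fig 13 and Fig 14 (`core   0: 0x<pc> (0x<insn>) ...` for the reference trace and `[<n>] <pc>` for the retire trace); the regular expressions and function names are illustrative and not part of the project's tooling.

```python
import re

# Assumed line formats, inferred from the two figures.
SPIKE_RE = re.compile(r"core\s+\d+:\s+0x([0-9a-fA-F]+)\s+\(0x[0-9a-fA-F]+\)")
RETIRE_RE = re.compile(r"\[\s*\d+\]\s+([0-9a-fA-F]+)")


def extract_pcs(lines, pattern):
    """Collect the program counter from every line that matches the pattern."""
    pcs = []
    for line in lines:
        match = pattern.search(line)
        if match:
            pcs.append(int(match.group(1), 16))
    return pcs


def first_mismatch(expected, actual):
    """Index of the first differing PC, or None if the traces agree."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    return None


spike_lines = [
    "core   0: 0x80000558 (0x20000137) lui     sp, 0x20000",
    "core   0: 0x8000055c (0xb21ff0ef) jal     pc - 0x4e0",
    "core   0: 0x8000007c (0xff010113) addi    sp, sp, -16",
]
retire_lines = [
    "[   57] 80000558",
    "[   57] 8000055c",
    "[   77] 8000007c",
]

expected = extract_pcs(spike_lines, SPIKE_RE)
actual = extract_pcs(retire_lines, RETIRE_RE)
print("traces match" if first_mismatch(expected, actual) is None else "mismatch")
```

In practice the two lists would be read from `spike.out` and `retire.out` instead of the inline samples.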

# Validation of Engineering Specifications

| Engineering Specification                        | Unit  | Target | Actual      | Comment                    |
|--------------------------------------------------|-------|--------|-------------|----------------------------|
| Support RV32G instruction set architecture (ISA) | -     | Yes    | Partial     | RV32IM                     |
| Core frequency on FPGA test platform             | MHz   | 100 ↑  | 74.88       | Synthesis result           |
| Number of pipeline stages                        | -     | 9      | 9           |                            |
| Instructions executed per clock cycle (IPC)      | -     | 0.5 ↑  | 0.601       | Simulation result          |
| Support instruction dynamic scheduling           | -     | Yes    | Yes         |                            |
| Typical total cache size                         | KB    | 32 ↑   | 128 / ideal | Ideal software model       |
| Number of function units                         | -     | 6 ↑    | 6           |                            |
| Average response time to a request for service   | ms    | 10 ↓   | ideal       | Ideal software model       |
| Usage of look-up tables (LUT) on FPGA            | k     | 120 ↓  | 101.7       | Synthesis result           |
| Usage of block RAM (BRAM) on FPGA                | -     | 50 ↓   | 0           | Synthesis result           |
| Usage of digital signal processor (DSP) on FPGA  | -     | 30 ↓   | 8           | Synthesis result           |
| Power consumption on target FPGA test platform   | W     | 5 ↓    | 2.63        | Synthesis result           |
| Operations processed within unit energy          | MOp/J | 25 ↑   | 17.11       | Synthesis result           |
| Number of flexibly-configured modules            | -     | 10 ↑   | 13          |                            |
| Number of I/O device types                       | -     | 3 ↑    | 3           | File r/w & Standard output |
| User guide and programmers manual                | -     | Yes    | Yes         |                            |

Table 3. Validation of engineering specifications (↑ Higher is better / ↓ Lower is better).
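The reported energy efficiency is consistent with the other rows of Table 3 if one operation is taken to be one retired instruction, so that efficiency is IPC times clock frequency divided by power. That interpretation is an assumption; the slides do not state the formula. A quick check:

```python
# Values from Table 3.
IPC = 0.601        # instructions per cycle (simulation)
FREQ_MHZ = 74.88   # core frequency (synthesis)
POWER_W = 2.63     # power consumption (synthesis)


def mops_per_joule(ipc, freq_mhz, power_w):
    """(instr/cycle) * (Mcycle/s) / (J/s) = million instructions per joule."""
    return ipc * freq_mhz / power_w


efficiency = mops_per_joule(IPC, FREQ_MHZ, POWER_W)
print(round(efficiency, 2))
```

Under the same assumption, meeting the 25 MOp/J target at the current power would require a higher IPC, a higher clock frequency, or both.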

# 6. Discussion & Conclusion

# Discussion

## Advantages over Traditional 5-stage In-order Scalar Pipeline



Fig 15. 5-stage in-order scalar pipeline.

Source: Gang Zheng. VE370 *Intro to Computer Organization* lecture notes.

- Pipeline bubbles
- Long critical path

Fig 16. 9-stage out-of-order superscalar pipeline.

- High pipeline utilization
- Short delay & high frequency

# Discussion

## Future work: Pipeline optimization



Fig 17. Further optimized pipeline design.

- Problem: unbalanced pipeline stages
- Proposed fix: combine or separate pipeline stages
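As a rough guide to how much rebalancing is needed, the clock period implied by the synthesized frequency can be compared with the period required by the 100 MHz target. This sketch uses only the two frequencies from the specification tables and assumes the slowest stage alone sets the clock period.

```python
def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / freq_mhz


achieved_ns = period_ns(74.88)   # synthesis result
target_ns = period_ns(100.0)     # engineering specification

# Fraction by which the slowest stage's delay must shrink.
required_reduction = 1.0 - target_ns / achieved_ns

print(f"achieved period: {achieved_ns:.2f} ns")
print(f"target period:   {target_ns:.2f} ns")
print(f"required critical-path reduction: {required_reduction:.1%}")
```

Splitting the slowest stage shortens the critical path, while merging short stages reduces pipeline depth without affecting it.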

# Conclusion



# Thank you!
