

# Towards Accurate Performance Modeling of RISC-V Designs

Odysseas Chatzopoulos  
George Papadimitriou

George-Marios Frangkoulis  
Dimitris Gizopoulos

University of Athens, Greece

# Microarchitectural Simulation

- Computer architects strongly rely on **microarchitectural simulators**
  - Performance-based studies
  - Evaluation and validation of new microarchitectures
  - Reliability evaluation
  - Power consumption
- **Accuracy**
  - Often fail to accurately model the real hardware providing design inconsistencies
- **Speed**
  - High simulation throughput



# RTL vs. Microarchitecture-Level Simulation

- **RTL simulation**

- High accuracy
- Low throughput when running in SW
- Hard to modify and extend

- **Microarchitecture-level simulation**

- Acceptable accuracy
- High throughput
- Extensible and easily configurable
- Model several different architectures and microarchitectures



# Motivation and Contributions

- RISC-V accurate performance modeling is still an ongoing task
- What is **the highest accuracy can be achieved** through microarchitecture-level simulation?
  - Comparison to an RTL simulation baseline
- **Our contributions**
  - Discuss about the **challenges and sources of error** between microarchitecture-level and RTL simulations
  - Modeling validation of two ISAs with **subtle differences**
    - RV32IM and RV64IM



# Sources of Simulation Error

- **Modeling errors**
  - When the simulator developers incorrectly model the desired functionality
  - Usually can be corrected when the simulator is gradually improved through its continuous maintenance
- **Specification errors**
  - When the developers are unaware of the functionality being modeled or must speculate about it
- **Abstraction errors**
  - When the developers abstract or fail to incorporate some details of the modeled design



# RSD: A Reference RTL Model

- Open-source RISC-V OoO microprocessor core optimized for FPGAs
  - <https://github.com/rsd-devel/rsd>
- Provides high performance by supporting advanced microarchitectural features
- The RSD pipeline is structured using three basic blocks
  - **Front-end block**
    - It supports the **gshare branch predictor**
  - **Scheduling block**
  - **Execution block**





# gem5 Microarchitectural Simulator

- Widely used open-source microarchitecture-level simulator
- Employed in both academia and industry
- Supports variety of ISAs, including RISC-V
- Two main simulation modes
  - System-call emulation (SE): does not load OS
    - System calls are emulated by the host system
  - Full-system (FS): executes both user-level and kernel-level instructions by loading an OS into the simulator



# gem5 Utilized Components

- Variety of built-in microprocessor and memory system models
- Microprocessor Model → O3CPU
- The gem5 O3CPU pipeline consists of 5 stages
  - Fetch
  - Decode
  - Rename
  - Issue/Execute/Writeback
  - Commit
- Memory Model → SimpleMemory
- SE simulation mode is used throughout our work
  - RSD does not support a full OS
  - RISC-V FS mode was in early stages of development



# Experimental Methodology

- Extract microarchitectural parameters from RSD source code and configuration files and translate them to gem5 parameters
- To test parameter matching we developed stressmarks targeted at specific system components and run them on both simulated models
- Analyze simulation results
  - Compare IPC and Microarchitectural Event Counters
  - Use Konata pipeline viewer to further investigate execution flow
- Adjust gem5 parameters according to above results
- Run general purpose benchmarks



# Main Modeling Parameters

| Parameter                              | Value                        |
|----------------------------------------|------------------------------|
| ISA                                    | RV32IM (RSD) – RV64IM (gem5) |
| L1 Data / Instruction Cache            | 4KB / 4KB (2-way)            |
| Cache Line Size                        | 8 Bytes                      |
| Replacement Policy                     | Tree-PLRU                    |
| L1 Hit / Miss Latency                  | 1 / 100 clock cycles         |
| Fetch / Decode / Rename / Commit Width | 2                            |
| Issue / Writeback Width                | 5                            |
| Reorder Buffer                         | 64 entries                   |
| MSHR                                   | 2 entries                    |
| Branch Predictor                       | gshare (2048 History Table)  |
| Branch Target Buffer                   | 1024 entries                 |
| Load / Store Queue                     | 16 entries                   |
| Physical Register File                 | 64 registers                 |



# Visual Pipeline Comparison using Konata

| Np | F | Pd | Dc | Rn | Ds | Sc | Is | Rr | X | Rw | Cm |
|----|---|----|----|----|----|----|----|----|---|----|----|
| Np | F | Pd | Dc | Rn | Ds | Sc | Is | Rr | X | Rw | Cm |
| Np | F | Pd | Dc | Rn | Ds | Sc | Is | Rr | X | Rw | Cm |
| Np | F | Pd | Dc | Rn | Ds | Sc | Is | Rr | X | Rw | Cm |
| Np | F | Pd | Dc | Rn | Ds | Sc | Is | Rr | X | Rw | Cm |
| Np | F | Pd | Dc | Rn | Ds | Sc | Is | Rr | X | Rw | Cm |
| Np | F | Pd | Dc | Rn | Ds | Sc | Is | Rr | X | Rw | Cm |
| Np | F | Pd | Dc | Rn | Ds | Sc | Is | Rr | X | Rw | Cm |
| Np | F | Pd | Dc | Rn | Ds | Sc | Is | Rr | X | Rw | Cm |
| Np | F | Pd | Dc | Rn | Ds | Sc | Is | Rr | X | Rw | Cm |

| F | Dc | 1 | 2 | 3 | Rn | 1 | 2 | 3 | Is | Cm | 1 |
|---|----|---|---|---|----|---|---|---|----|----|---|
| F | Dc | 1 | 2 | 3 | Rn | 1 | 2 | 3 | Is | Cm | 1 |
| F | Dc | 1 | 2 | 3 | Rn | 1 | 2 | 3 | Is | Cm | 1 |
| F | Dc | 1 | 2 | 3 | Rn | 1 | 2 | 3 | Is | Cm | 1 |
| F | Dc | 1 | 2 | 3 | Rn | 1 | 2 | 3 | Is | Cm | 1 |
| F | Dc | 1 | 2 | 3 | Rn | 1 | 2 | 3 | Is | Cm | 1 |
| F | Dc | 1 | 2 | 3 | Rn | 1 | 2 | 3 | Is | Cm | 1 |
| F | Dc | 1 | 2 | 3 | Rn | 1 | 2 | 3 | Is | Cm | 1 |
| F | Dc | 1 | 2 | 3 | Rn | 1 | 2 | 3 | Is | Cm | 1 |
| F | Dc | 1 | 2 | 3 | Rn | 1 | 2 | 3 | Is | Cm | 1 |

Same pipeline stages between RSD and gem5

| Np | F | Pd | Dc | Rn | Ds | Sc | Is | Rr | X | Rw | Cm |
|----|---|----|----|----|----|----|----|----|---|----|----|
| Np | F | Pd | Dc | Rn | Ds | Sc | Is | Rr | X | Rw | Cm |
| Np | F | Pd | Dc | Rn | Ds | Sc | Is | Rr | X | Rw | Cm |
| Np | F | Pd | Dc | Rn | Ds | Sc | Is | Rr | X | Rw | Cm |
| Np | F | Pd | Dc | Rn | Ds | Sc | Is | Rr | X | Rw | Cm |
| Np | F | Pd | Dc | Rn | Ds | Sc | Is | Rr | X | Rw | Cm |
| Np | F | Pd | Dc | Rn | Ds | Sc | Is | Rr | X | Rw | Cm |
| Np | F | Pd | Dc | Rn | Ds | Sc | Is | Rr | X | Rw | Cm |
| Np | F | Pd | Dc | Rn | Ds | Sc | Is | Rr | X | Rw | Cm |
| Np | F | Pd | Dc | Rn | Ds | Sc | Is | Rr | X | Rw | Cm |

| F | Dc | 1 | 2 | 3 | Rn | 1 | 2 | 3 | Is | Cm | 1 |
|---|----|---|---|---|----|---|---|---|----|----|---|
| F | Dc | 1 | 2 | 3 | Rn | 1 | 2 | 3 | Is | Cm | 1 |
| F | Dc | 1 | 2 | 3 | Rn | 1 | 2 | 3 | Is | Cm | 1 |
| F | Dc | 1 | 2 | 3 | Rn | 1 | 2 | 3 | Is | Cm | 1 |
| F | Dc | 1 | 2 | 3 | Rn | 1 | 2 | 3 | Is | Cm | 1 |
| F | Dc | 1 | 2 | 3 | Rn | 1 | 2 | 3 | Is | Cm | 1 |
| F | Dc | 1 | 2 | 3 | Rn | 1 | 2 | 3 | Is | Cm | 1 |
| F | Dc | 1 | 2 | 3 | Rn | 1 | 2 | 3 | Is | Cm | 1 |
| F | Dc | 1 | 2 | 3 | Rn | 1 | 2 | 3 | Is | Cm | 1 |
| F | Dc | 1 | 2 | 3 | Rn | 1 | 2 | 3 | Is | Cm | 1 |
| F | Dc | 1 | 2 | 3 | Rn | 1 | 2 | 3 | Is | Cm | 1 |

RSD

gem5



# Utilized Benchmarks

- **BubblesortC & BubblesortAsm:** Sort a 250-entry array with integer values using the Bubblesort algorithm. Assembly version of Bubblesort was written because C version shows considerable difference in committed instructions between the two ISAs
- **MemoryRandom:** Study the behavior of memory upon random accesses
- **FibSlow & Fibfast:** FibSlow computes the 20th term of the Fibonacci sequence. FibFast computes all terms in the range 1 through 45 and reduces exponential complexity to linear
- **IntegerStress:** Stresses the integer functional units
- **BranchMisNever:** Always taken branch in a loop
- **BranchMisRandom:** Specify branch misprediction latency studying gshare's behavior on branches using random conditions
- **StringSearchLarge & StringSearchSmall:** Search for given words in phrases using a case insensitive comparison algorithm
- **Qsort:** Sorts a large array of strings into ascending order using the quick sort algorithm



# Simulation Speedup

- Speedup of microarchitecture-level ranges between 5x and 20x compared to RTL simulation



# Simulation Accuracy: Committed Instructions

Most benchmarks provide similar instruction count between gem5 and RSD



# Simulation Accuracy: Clock Cycles



# Main Sources of Error

- We employ microarchitectural counters for
  - Memory Accesses
  - Cache Misses
  - Branch Prediction Misses
  - MSHR Hits
  - Memory Latency
- **Memory system** and **branch predictor** seem to be **the main sources of error**
  - Branch mispredictions per kilo-instructions and memory accesses per kilo-instructions show **the highest correlation with simulation error**



# Main Sources of Error

High  
misprediction  
rate and the  
highest memory  
accesses



# Conclusion & Future Work

- **Accuracy validation** of performance modeling of microarchitecture-level simulation
  - Present the challenges of performance modeling
- Clock cycles difference can be up to **36%** due to the **abstraction errors**
- Make targeted changes to the **memory system and branch predictor**
- Use an OS capable RISC-V processor along with HW acceleration to validate the FS simulation mode of gem5
- Employing **automated scripts** that provide the most important microarchitectural parameters of black-box processors



**thank you**

danke 謝謝  
спасибо faafetai lava  
спасибо dankie  
спасибо mandi  
спасибо kiitos  
спасибо dhanyavad  
спасибо bayarlaaa  
спасибо gracie  
спасибо maururu  
спасибо köszönöm  
спасибо enkosi  
спасибо bedankt  
спасибо dziekuje  
спасибо obrigado  
спасибо rahmat  
спасибо sagolim  
спасибо mèsi  
спасибо didinadoba  
спасибо sukriya  
спасибо najis tuke  
спасибо kam sah hamnida  
спасибо rahmat  
спасибо terima kasih  
спасибо 감사합니다  
спасибо xièxie  
спасибо euxcharistw  
спасибо merci  
спасибо рахмат  
спасибо mersi  
спасибо vinaka  
спасибо blagodaram  
спасибо kia ora  
спасибо barka  
спасибо welalin  
спасибо tack  
спасибо dank je  
спасибо misaotra  
спасибо matondo  
спасибо paldies  
спасибо grazzi  
спасибо mahalo  
спасибо tapadhi leat  
спасибо xvala  
спасибо asante  
спасибо manana  
спасибо obrigada  
спасибо murakozé  
спасибо tenki  
спасибо djiere dieuf  
спасибо mochchakkeram  
спасибо go raibh  
спасибо maith agat  
спасибо arigatō  
спасибо takk  
спасибо dakujem  
спасибо trugarez  
спасибо shukriya  
спасибо мерси

# Talk on Thursday, June 17, 2021

## Towards Accurate Performance Modeling of RISC-V Designs

Odysseas Chatzopoulos

George Papadimitriou

George-Marios Frangoulis

Dimitris Gizopoulos

University of Athens, Greece  
<http://cal.di.uoa.gr>

Fifth Workshop on Computer Architecture Research with RISC-V (CARRV 2021)