

The **lab\_4.zip** delivery must include:

- Each configuration of the custom architecture (`riscv_o3_custom.py`) that you modify.
- This document, with all the fields filled in, in PDF form.

## Introduction and Background

### Simulating an Out-of-Order (OoO) CPU (O3CPU)



In this laboratory, you will configure an OoO CPU using a script called `riscv_o3_custom.py`. In short, the script configures an Out-of-Order (O3) processor based on the *DerivO3CPU*, a superscalar processor with a reduced number of features.

### Pipeline

The processor pipeline stages can be summarized as:

- **Fetch stage:** Instructions are fetched from the instruction cache. The `fetchWidth` parameter sets the maximum number of instructions fetched per clock cycle. This stage also performs branch prediction and branch target prediction.
- **Decode stage:** This stage decodes instructions and handles the execution of unconditional branches. The `decodeWidth` parameter sets the maximum number of instructions processed per clock cycle.
- **Rename stage:** As suggested by the name, registers are renamed, and the instruction is pushed to the IEW (Issue/Execute/Write Back) stage. It checks that the *Instruction Queue (IQ)/Load and Store Queue (LSQ)* can hold the new instruction. The maximum number of instructions processed per clock cycle is set by the `renameWidth` parameter.



Figure 1: Understanding configurable OoO CPU parameters.

- **Dispatch stage:** Instructions whose renamed operands are available are dispatched to the functional units (**FU**). Loads and stores are dispatched to the Load/Store Queue (**LSQ**). The maximum number of instructions processed per clock cycle is set by the `dispatchWidth` parameter.
- **Issue stage:** The simulated processor has a single instruction queue from which all instructions are issued. Ordinarily, instructions are taken in order from this queue. An instruction is issued once it has no outstanding dependencies.
- **Execute stage:** Each functional unit (**FU**) processes its instruction. Each functional unit can be configured with a different latency. Conditional branch mispredictions are identified here. The maximum number of instructions processed per clock cycle depends on the functional units configured and their latencies.
- **Writeback stage:** Sends the result of the instruction to the reorder buffer (**ROB**). The maximum number of instructions processed per clock cycle is set by the `wbWidth` parameter.
- **Commit stage:** Processes the reorder buffer, freeing up its entries. The maximum number of instructions processed per clock cycle is set by the `commitWidth` parameter. Commit is done in order.
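
Since every committed instruction has to pass through each of the stages above, the narrowest stage bounds the sustained throughput of the whole pipeline. The sketch below illustrates that reasoning in plain Python. The numeric values are hypothetical placeholders, not the defaults of `riscv_o3_custom.py`, and the model deliberately ignores issue and functional-unit limits.

```python
# Hypothetical per-stage widths (instructions per clock cycle).
# The real values live in riscv_o3_custom.py and may differ.
stage_widths = {
    "fetchWidth": 4,
    "decodeWidth": 4,
    "renameWidth": 2,
    "dispatchWidth": 4,
    "wbWidth": 4,
    "commitWidth": 2,
}


def narrowest_stages(widths):
    """Return (min_width, names) for the stage(s) with the smallest width."""
    min_width = min(widths.values())
    names = [name for name, width in widths.items() if width == min_width]
    return min_width, names


# The narrowest stage caps the sustained instructions per cycle (IPC).
ipc_bound, limiting = narrowest_stages(stage_widths)
print(f"IPC upper bound: {ipc_bound}, limited by: {', '.join(limiting)}")
print(f"Best-case CPI: {1 / ipc_bound:.2f}")
```

With these placeholder values, widening only `fetchWidth` would not raise the bound, because `renameWidth` and `commitWidth` remain the limiting stages.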

In the event of a **branch misprediction**, trap, or other speculative execution event, "squashing" can occur at all stages of this pipeline. When a pending instruction is squashed, it is removed from the instruction queues, reorder buffers, requests to the instruction cache, etc.
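
As a rough illustration of squashing after a branch misprediction, the toy model below drops every in-flight instruction younger than the mispredicted branch. It is a simplified sketch, not gem5 code: the `Inst` class, the sequence numbers, and the instruction strings are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Inst:
    seq: int   # sequence number: a larger value means younger in program order
    text: str  # disassembly, for display only


def squash_younger(in_flight, branch_seq):
    """Keep the branch and everything older; squash every younger instruction."""
    kept = [inst for inst in in_flight if inst.seq <= branch_seq]
    squashed = [inst for inst in in_flight if inst.seq > branch_seq]
    return kept, squashed


# Made-up in-flight instructions; seq 11 is the mispredicted branch.
in_flight = [
    Inst(10, "addi a0, a0, 1"),
    Inst(11, "bne a0, a1, loop"),
    Inst(12, "lw a5, 0(sp)"),     # fetched down the wrong path
    Inst(13, "addi a5, a5, 4"),   # fetched down the wrong path
]

kept, squashed = squash_younger(in_flight, branch_seq=11)
print("kept:    ", [inst.seq for inst in kept])
print("squashed:", [inst.seq for inst in squashed])
```

In the real pipeline the squashed instructions are also removed from the instruction queue, the LSQ, and any pending instruction-cache requests, which this list-based sketch does not model.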



Figure 2: Example of a branch misprediction (transparent rows)

## Pipeline Resources

In addition to the pipeline stages, the processor has the following structures:

- Branch predictor (BP)
  - Allows for selection between several branch predictors, including a local predictor, a global predictor, and a tournament predictor. Also has a branch target buffer (BTB) and a return address stack (RAS).
- Reorder buffer (ROB)
  - Holds instructions that have reached the back end. Handles squashing instructions and keeps instructions in program order.
- Instruction queue (IQ)
  - Handles dependencies between instructions and scheduling ready instructions. Uses the **memory dependence predictor** to tell when memory operations are ready.
- Load-store queue (LSQ)
  - Holds loads and stores that have reached the back end. It hooks up to the d-cache and initiates accesses to the memory system once memory operations have been issued and executed. Also handles forwarding from stores to loads, replaying memory operations if the memory system is blocked, and detecting memory ordering violations.
- Functional units (FU)
  - Provides timing for instruction execution. Used to determine the latency of an instruction executing, as well as what instructions can issue each cycle.
  - **Floating point units, floating point registers**, and respective instructions are supported.

|                                                      |                                       |
|------------------------------------------------------|---------------------------------------|
| 560: s561 (t0: r160): 0x00010106: fmv_w_x fa5, zero  | F   Dc   Rn 1   Is 1   2   3   Cm 1   |
| 561: s562 (t0: r161): 0x0001010a: c_addi16sp sp, -64 | F   Dc   Rn 1   Is Cm 1   2   3   4   |
| 562: s563 (t0: r162): 0x0001010c: c_fsdsp fs0, 8(sp) | F 1   Dc   Rn 1   Is Mc 1   2   3   4 |
| 563: s564 (t0: r163): 0x0001010e: c_fsdsp fs1, 0(sp) | F 1   Dc   Rn 1   2   3   Is Mc 1   2 |

Figure 3: Pipeline example of FP instructions and FP registers
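
To make the role of the reorder buffer more concrete, the following toy model computes when each instruction commits, given the cycle in which it finishes writeback. It is a sketch under stated assumptions, not the gem5 implementation: an instruction commits no earlier than the cycle after its writeback, only after all older instructions have committed, and at most `commit_width` instructions commit per cycle. The function name and the cycle numbers are made up.

```python
def commit_schedule(done_cycle, commit_width):
    """Return the commit cycle of each instruction, listed in program order.

    done_cycle[i] is the cycle in which instruction i finishes writeback.
    """
    commit_cycle = []
    cycle, used = 0, 0  # current commit cycle and commit slots used in it
    for done in done_cycle:
        earliest = done + 1  # assumption: commit one cycle after writeback
        if earliest > cycle:
            # The ROB head is not ready yet: commit waits for it.
            cycle, used = earliest, 0
        elif used == commit_width:
            # All commit slots of this cycle are taken: move to the next one.
            cycle, used = cycle + 1, 0
        commit_cycle.append(cycle)
        used += 1
    return commit_cycle


# Instructions 1 and 2 finish before instruction 0 (out-of-order completion).
done = [5, 3, 3, 4]
print(commit_schedule(done, commit_width=2))  # [6, 6, 7, 7]
print(commit_schedule(done, commit_width=1))  # [6, 7, 8, 9]
```

With a commit width of 2, instructions 0 and 1 commit in the same cycle even though instruction 1 finished earlier: execution completes out of order, but commit stays in program order.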

## Laboratory: hands-on

All the required resources are available in a GitHub repository:

[https://github.com/cad-polito-it/ase_riscv_gem5_sim](https://github.com/cad-polito-it/ase_riscv_gem5_sim)

To create your simulation environment:

For HTTPS clone:

```
~/my_gem5Dir$ git clone https://github.com/cad-polito-it/ase_riscv_gem5_sim.git
```

For SSH:

```
~/my_gem5Dir$ git clone git@github.com:cad-polito-it/ase_riscv_gem5_sim.git
```

The environment is configured to be executed on the **LABINF MACHINES**.

Follow the HOWTO instructions available on the GitHub Repository for simulating a program.

### Exercise 1:

Simulate the benchmark *my\_c\_benchmark\_2 (main.c)* by using the gem5 simulator to obtain the *trace.out* file. Then, you can visualize the pipeline (i.e., load the *trace.out* file on Konata).

Based on the CPU architecture described in *riscv\_o3\_custom.py*, inspect the pipeline in Konata to find the following conditions:

1. Out-of-order execution (issue), in-order commit (commit)
2. Two commits in the same clock cycle
3. Flush of the pipeline.

For every condition, fill in the following tables.

| Condition                                                    | Out-of-order execution, in-order commit                                                                                                                                                                                                                                                                   |
|--------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Screenshot from Konata                                       |                                                                                                                                                                                                                       |
| Explain the reason behind the condition                      | To optimize performance, the processor executes instructions that have no data dependencies as soon as it can. In this case, the operation on row 23 finishes before the operation on row 22 because it does not depend on any earlier instruction that has not yet finished. |
| Briefly explain the advantages of the OoO execution in a CPU | The advantage of this kind of execution is that instructions without dependencies can be executed without waiting for earlier instructions to finish, which avoids stalling for several clock cycles. Fewer clock cycles are therefore wasted. |

|                                                |                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <b>Condition</b>                               | Two or more commits in the same clock cycle                                                                                                                                                                                                                                                                                                                                                                                                          |
| <b>Screenshot from Konata</b>                  | <pre>40: s98 (t0: r27): 0x00000200: addiw a4, a5, 0 41: s99 (t0: r28): 0x00000204: lw a5, -1560(\$0) 42: s100 (t0: r29): 0x00000208: addiw a5, a5, 0</pre>                                                                                                                                                                                                                                                                                           |
| <b>Explain the reason behind the condition</b> | These two instructions complete the issue stage at the same time, so both can commit together. With these settings, up to two instructions can commit in the same cycle because `commitWidth` is 2. |
| <b>Briefly explain the Commit functioning</b>  | The commit stage controls the overall stalling of the processor.<br>It processes the reorder buffer, freeing its entries. |
| <b>Condition</b>                               | Flush of the pipeline                                                                                                                                                                                                                                                                                                                                                                                                                                |
| <b>Screenshot from Konata</b>                  | <pre>55: s113 (t0: r42): 0x0000023c: lui a4, 1 56: s114 (t0: r35): 0x00000220: fld fa5, -1568(\$0) 57: s115 (t0: r36): 0x00000224: fcvt_w_d a5, fa5 58: s116 (t0: r37): 0x00000228: addiw a5, a5, 0 59: s117 (t0: r38): 0x0000022c: sw a5, -1556(\$0) 60: s118 (t0: r39): 0x00000230: jal zero, 84 61: s119 (t0: r40): 0x00000234: lw a5, -1560(\$0) 62: s120 (t0: r41): 0x00000238: addiw a5, a5, 0 63: s121 (t0: r42): 0x0000023c: lui a4, 1</pre> |
| <b>Explain the reason behind the condition</b> | This happens because of an unconditional jump: the following instructions are still in the Fetch stage when instruction 60 jumps, so they are squashed ("aborted"). |

### Exercise 2:

Given your benchmark (`main.c` in `my_c_benchmark_2`), optimize the CPU architecture (i.e., modify the `riscv_o3_custom.py` file) and write down the improvements in terms of CPI and speedup.

- To optimize the CPU architecture, open the configuration file of the CPU (i.e., the `riscv_o3_custom.py`), and tune specific hardware-related parameters.

You have to change specific values in **one or more** stages of the pipeline:

- `# - FETCH STAGE`
  - Tune parameters such as the `fetchWidth`, `fetchBufferSize` and so on, and see the effects on your system.

- `# - DECODE STAGE`
- `# - RENAME STAGE`
  - Try changing some values, but don't touch the "Phys" ones.
- `# - DISPATCH/ISSUE STAGE`
- `# - EXECUTE STAGE`
  - Here you can optimize the Functional units of your CPU like the INT ALU, the FP ALU, the FP Multiplier/Divider and so on.
  - Tune the number of units (*count*) that you have in the system, as well as their latency (*opLat*) to see how this affects the execution of your program.
- You can select a different branch predictor; the available ones are defined in *create\_predictor.py*.
- You can also try to change the parameters of the L1 cache. Look for “class L1Cache” in the *riscv\_o3\_custom.py* file. The L1 cache, also referred to as the primary cache, is the smallest and fastest level of the memory hierarchy. It is located directly on the processor and stores data that the CPU accesses frequently, so most accesses are served without paying the much longer latency of main memory.
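
The benefit of the L1 cache can be estimated with the standard average memory access time (AMAT) formula, AMAT = hit time + miss rate × miss penalty. The sketch below applies it to two made-up cache configurations. All numbers are hypothetical and are not taken from `riscv_o3_custom.py` or from any simulation result.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in the same unit as hit_time and miss_penalty."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical values, in clock cycles: a 2-cycle L1 hit and a 100-cycle miss penalty.
small_l1 = amat(hit_time=2, miss_rate=0.10, miss_penalty=100)
large_l1 = amat(hit_time=2, miss_rate=0.02, miss_penalty=100)

print(f"small L1: {small_l1:.1f} cycles per access")
print(f"large L1: {large_l1:.1f} cycles per access")
```

In practice a larger or more associative cache may also have a longer hit time, so the miss rate reported in *stats.txt* is worth checking before and after each change.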

**HINT:** To choose effective hardware optimizations and understand which parameters to change, analyse the *stats.txt* file (in **ase\_riscv\_gem5\_sim/results/my\_c\_benchmark\_2**).

Find information regarding the workload profiling. In other words, look for lines such as “system.cpu.commitStats0.committedInstType::IntAlu”, and the following ones to understand which kind of instructions are executed the most. In this way, you can target a specific functional unit and modify its specifications.
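
The instruction mix can also be extracted with a few lines of Python. The sketch below assumes that each relevant line starts with the statistic name followed by an integer count, as in the line quoted above. The sample text is made up to show the expected shape, and the function name `instruction_mix` is hypothetical.

```python
import re

# Matches lines such as
# "system.cpu.commitStats0.committedInstType::IntAlu   8000 ..."
LINE_RE = re.compile(r"^\S*committedInstType::(\S+)\s+(\d+)")


def instruction_mix(stats_text):
    """Return {op_class: committed_count}, sorted by descending count."""
    counts = {}
    for line in stats_text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group(1) != "total":
            op_class, count = match.group(1), int(match.group(2))
            counts[op_class] = counts.get(op_class, 0) + count
    return dict(sorted(counts.items(), key=lambda item: item[1], reverse=True))


# Made-up excerpt in the general shape of stats.txt; real counts will differ.
sample = """\
system.cpu.commitStats0.committedInstType::IntAlu      8000   80.00%   80.00% # Class of committed instruction
system.cpu.commitStats0.committedInstType::FloatCvt     500    5.00%   85.00% # Class of committed instruction
system.cpu.commitStats0.committedInstType::MemRead     1500   15.00%  100.00% # Class of committed instruction
system.cpu.commitStats0.committedInstType::total      10000
"""

mix = instruction_mix(sample)
for op_class, count in mix.items():
    print(f"{op_class:10s} {count}")
```

To use it on a real run, pass the contents of the *stats.txt* file to `instruction_mix` instead of `sample`; the most frequent classes indicate which functional units are worth tuning first.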

Fill in the following tables with the CPI that you obtain with the old and the new architectures. Also compute the speedup of each new configuration with respect to the original one.

HINT: You can get the CPI and other useful information from the *stats.txt* file.

| Parameters                      | Configuration 1              | Configuration 2             | Configuration 3              | Configuration 4              |
|---------------------------------|------------------------------|-----------------------------|------------------------------|------------------------------|
| <b>First changed parameter</b>  | the_cpu.fetchWidth = 2       | the_cpu.fetchWidth = 6      | the_cpu.fetchBufferSize = 8  | the_cpu.fetchBufferSize = 32 |
| <b>Second changed parameter</b> | the_cpu.numROBEntries = 32   | the_cpu.numROBEntries = 128 | the_cpu.numIQEntries = 2     | the_cpu.numIQEntries = 4     |
| <b>Third changed parameter</b>  | opClass="IntMul", opLat=2    | opClass="IntDiv", opLat=4   | opClass="FloatCmp", opLat=2  | opClass="FloatCvt", opLat=2  |
| <b>Fourth changed parameter</b> | opClass="FloatMult", opLat=3 | opClass="FloatDiv", opLat=3 | opClass="FloatSqrt", opLat=3 | opClass="FloatMisc", opLat=2 |

Original CPI (no hardware optimization): 2.19

|                                   | Configuration 1 | Configuration 2 | Configuration 3 | Configuration 4 |
|-----------------------------------|-----------------|-----------------|-----------------|-----------------|
| <b>CPI</b>                        | 2.14            | 2.14            | 2.69            | 1.82            |
| <b>Speedup (wrt Original CPI)</b> | 2.19/2.14=1.02  | 2.19/2.14=1.02  | 2.19/2.69=0.81  | 2.19/1.82=1.20  |
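
As a cross-check, the speedup can be recomputed with a short script. It uses the usual convention speedup = CPI_original / CPI_new, so a value above 1 means the new configuration is faster. This assumes that the instruction count and the clock period are the same in every run, which should hold when the same binary is simulated and the clock is left unchanged. The baseline of 2.19 and the per-configuration CPIs are the values used in the table.

```python
def speedup(cpi_original, cpi_new):
    """Speedup of the new configuration; a value above 1 means it is faster."""
    return cpi_original / cpi_new


CPI_ORIGINAL = 2.19  # baseline CPI used in the table
cpi_by_config = {1: 2.14, 2: 2.14, 3: 2.69, 4: 1.82}

for config, cpi in cpi_by_config.items():
    ratio = speedup(CPI_ORIGINAL, cpi)
    print(f"Configuration {config}: CPI {cpi:.2f}, speedup {ratio:.2f}")

# The best configuration is the one with the lowest CPI.
best = min(cpi_by_config, key=cpi_by_config.get)
print(f"Best configuration: {best}")
```

Under this convention, Configuration 3 is slower than the original (speedup below 1), while Configuration 4 gives the largest improvement.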

Which is the best optimization in terms of CPI and speedup, and why?

Your answer:

The best configuration was number 4 (CPI 1.82, the lowest of the four), in which the following parameters were changed:

the\_cpu.fetchBufferSize = 32

the\_cpu.numIQEntries = 4

opClass="FloatCvt", opLat=2

opClass="FloatMisc", opLat=2

The improvement is mainly due to the rename stage: with these changes it runs much faster and is stalled less often.