

# Design and Implementation of a Pipelined RISC-V Processor

Weiyuan Du

January 2026

## 1 Analysis of System Architecture and Performance

### 1.1 System Architecture

#### 1.1.1 Hazard Detection and Handling Logic

The Hazard Unit manages pipeline dependencies by monitoring register addresses ( $rf\_a1, rf\_a2, rf\_a3$ ) and control states ( $we\_rf, sel\_result0, pcsrc$ ). According to the implemented Verilog module, RAW hazards are bypassed using a priority-based forwarding multiplexer, while Load-Use and Control hazards are resolved through a combination of stalls and flushes.

Table 1: Hazard Detection and Control Logic

| Signal     | Logic Condition                                                                                                                                        | Description / Result                  |
|------------|--------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------|
| $E\_fd\_A$ | $(E\_rf\_a1 == M\_rf\_a3) \wedge M\_we\_rf \wedge (E\_rf\_a1 \neq 0)$<br>else if $(E\_rf\_a1 == W\_rf\_a3) \wedge W\_we\_rf \wedge (E\_rf\_a1 \neq 0)$ | 10 (Fwd from MEM)<br>01 (Fwd from WB) |
| $E\_fd\_B$ | $(E\_rf\_a2 == M\_rf\_a3) \wedge M\_we\_rf \wedge (E\_rf\_a2 \neq 0)$<br>else if $(E\_rf\_a2 == W\_rf\_a3) \wedge W\_we\_rf \wedge (E\_rf\_a2 \neq 0)$ | 10 (Fwd from MEM)<br>01 (Fwd from WB) |
| $lwStall$  | $((D\_rf\_a1 == E\_rf\_a3) \vee (D\_rf\_a2 == E\_rf\_a3)) \wedge E\_sel\_result0$                                                                      | $F\_stall, D\_stall, E\_flush = 1$    |
| $E\_pcsrc$ | Branch Taken signal from Execute stage                                                                                                                 | $D\_flush, E\_flush = 1$              |

#### 1.1.2 Load-Use Flushing Concept

In the provided example (Figure 3b), the register values ( $x2, x3, x4$ ) remain correct regardless of whether  $PLR2$  is flushed because the  $Stall$  signal freezes the PC and  $PLR1$ . This causes the dependent instruction to be re-dispatched and re-executed in the following cycle with the correct forwarded data, effectively overwriting any temporary incorrect results in the register file.

However, flushing is mandatory to prevent **side effects** during the stall cycle. If an instruction performs a non-idempotent operation, such as writing to memory, a lack of flushing would lead to architectural errors.

**Demonstration Program:** The following sequence demonstrates a scenario where flushing  $PLR2$  is critical:

```
lw x1, 0(x2) # Load value into x1
sw x1, 8(x2) # Store x1 to memory (depends on lw)
```

**Analysis:**

- **With Flushing:** When the  $lwStall$  is detected,  $E\_flush$  is asserted. This clears the  $sw$  instruction's control signals (setting `MemWrite` to 0) in the Execute stage. The memory remains unchanged during the stall cycle.
- **Without Flushing:** The  $sw$  instruction would remain in the Execute stage with its `MemWrite` signal active. It would erroneously write the `stale` value of  $x1$  to memory address  $8(x2)$  during the stall cycle, leading to data corruption before the correct value is eventually written in the next cycle.

### 1.1.3 Data Forwarding, Flushing, and Stalling

- **Data Forwarding:** Bypasses the Register File by routing  $M\_alu\_o$  or  $W\_result$  directly to the ALU inputs via  $E\_fd\_A/B$  muxes.
- **Stalling:** Disables stage updates ( $EN = 0$ ) to hold instructions, specifically used to wait for  $lw$  data.
- **Flushing:** Clears pipeline registers ( $CLR = 1$ ) to invalidate instructions following a taken branch or during a load-use stall.

## 1.2 Performance Analysis

### 1.2.1 CPI Estimation for Test Assembly

Based on the provided assembly code and the implemented Hazard Unit logic, the performance is analyzed as follows:

- **Instruction Count ( $I$ ):** 20 instructions are executed to reach the final loop.
- **Control Hazards:** Two jumps occur (one conditional `beq` and one `jal`). Each jump asserts `E_pcsrc`, leading to 2 flush cycles per jump (Total: 4 cycles).
- **Data Hazards:** One Load-Use hazard occurs between `lw x2` and `add x9`, triggering `lwStall` and adding 1 stall cycle.
- **Total Cycles ( $C$ ):**  $I + \text{Base Overhead} (4) + \text{Flushes} (4) + \text{Stall} (1) = 29$  cycles.

The estimated average CPI for this program is:

$$CPI_{avg} = \frac{29}{20} = 1.45$$

### 1.2.2 Pipelined Processor Performance Comparison

Based on the component delays provided in the lab manual:

- **Single-Cycle Critical Path ( $T_{sc}$ ):**  $T_{clk-Q} + T_{MemRead} + T_{Dec} + T_{RegRead} + T_{ALU} + T_{MemRead} + T_{Mux} = 710$  ps.
- **Multi-Cycle Critical Path ( $T_{mc}$ ):** Bounded by the Memory Read stage:  $T_{clk-Q} + T_{MemRead} + T_{Mux} = 270$  ps.
- **Pipelined Critical Path ( $T_{pl}$ ):** Bounded by the slowest stage plus register overhead: 270 ps.

**Conclusion:** The maximum frequency of the pipelined version is not five times higher than the single-cycle version. This limitation exists because the cycle time is determined by the bottleneck stage (Memory Read) and the inescapable overhead introduced by the setup and propagation delays ( $T_{clk-Q}$ ) of the pipeline registers.

#### Source Code Repository:

[https://github.com/SinYita/riscv-single/tree/RV\\_PL](https://github.com/SinYita/riscv-single/tree/RV_PL)

