

# Processor Architecture

## Pipelined RISC-V Implementation (Part II)



# Module Outline

- Data Hazards
- Control Hazards
- Exceptions
- Module Summary



# Data Hazards

# Recap: Data Hazard

- Caused by data dependencies (arithmetic or load)

OP  $i$     add     $x_5, x_{11}, x_{12}$   
OP  $i+1$     sub     $x_7, x_5, x_7$



Information cannot flow  
backwards in time!

# Data Hazards in ALU Instructions

- Consider this code sequence:

```
sub x2, x1, x3  
and x12, x2, x5  
or x13, x6, x2  
add x14, x2, x2  
sw x15, 100(x2)
```

- Several data dependencies (RAW) on register x2

```
sub x2, x1, x3  
and x12, x2, x5  
or x13, x6, x2  
add x14, x2, x2  
sw x15, 100(x2)
```



# Data Hazards in ALU Instructions



- How do we detect data hazards in the pipeline and how do we fix them?

# Data Hazards in ALU Instructions

- The updated value of  $x_2$  is available as soon as sub has passed the EX stage



# Forwarding Paths and Control

- **Pass source register indices along in pipeline** (destination register already passed on)
  - Pass register source 1 and register source 2 indices from ID to EX stage
  - Naming
    - ▶ ID/EX.RegisterRs1 = register # for Rs1 sitting in ID/EX pipeline register
    - ▶ ID/EX.RegisterRs2 = register # for Rs2 sitting in ID/EX pipeline register

# Forwarding Paths and Control



# Forwarding Control

| MUX control   | Source | Example                                                                        |
|---------------|--------|--------------------------------------------------------------------------------|
| ForwardA = 00 | ID/EX  | The first ALU operand comes from the register file.                            |
| ForwardA = 10 | EX/MEM | The first ALU operand is forwarded from the prior ALU result.                  |
| ForwardA = 01 | MEM/WB | The first ALU operand is forwarded from data memory or an earlier ALU result.  |
| ForwardB = 00 | ID/EX  | The second ALU operand comes from the register file.                           |
| ForwardB = 10 | EX/MEM | The second ALU operand is forwarded from the prior ALU result.                 |
| ForwardB = 01 | MEM/WB | The second ALU operand is forwarded from data memory or an earlier ALU result. |

# Detecting the Need to Forward

## Condition for data hazards

EX/MEM.RegisterRd = ID/EX.RegisterRs1

EX/MEM.RegisterRd = ID/EX.RegisterRs2

MEM/WB.RegisterRd = ID/EX.RegisterRs1

MEM/WB.RegisterRd = ID/EX.RegisterRs2

*Arithmetic data hazard*

Forward from EX/MEM  
pipeline register

Forward from MEM/WB  
pipeline register

*Memory data hazard*

# Detecting the Need to Forward (cont'd)

- But only if forwarding instruction will write to a register

```
EX/MEM.RegWrite = 1,  
MEM/WB.RegWrite = 1
```

- And only if Rd for that instruction is not x0

```
EX/MEM.RegisterRd ≠ 0  
MEM/WB.RegisterRd ≠ 0
```

# Forwarding Conditions : Computing inside Forwarding unit $\Rightarrow (rd == rs1/2) \& (\text{regwrite}) \& (!\text{regrd})$

## ■ EX hazard

```
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)  
    and (EX/MEM.RegisterRd = ID/EX.RegisterRs1))  
    forwardA = 10  
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)  
    and (EX/MEM.RegisterRd = ID/EX.RegisterRs2))  
    forwardB = 10
```

Completed

## ■ MEM hazard

```
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)  
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs1))  
    forwardA = 01  
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)  
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs2))  
    forwardB = 01
```

Incomplete

# Forwarding from EX/MEM



# Forwarding from MEM/WB



# Simultaneous Data Hazards

- Consider this sequence:



|     |     |        |   |
|-----|-----|--------|---|
| add | x1  | x1, x2 | ① |
| add | x1  | x1, x3 | ② |
| add | x1, | x1, x4 | ③ |

→ can refer to

ret of ① in MEM/WB Reg

or

ret of ② in EX/MEM Reg

⇒ have to use recent!

. : Using ② for ② operation

- Both hazards occur simultaneously

- Want to mimic serial execution

→ Use most recent result

- Revise MEM hazard condition

- Only forward if EX hazard condition isn't true

# Revised Forwarding Conditions

## ■ MEM hazard

*Check forwarding*

```
if (MEM/WB.RegWrite  
and (MEM/WB.RegisterRd ≠ 0))  
    and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0))  
        and (EX/MEM.RegisterRd = ID/EX.RegisterRs1))  
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs1))  
forwardA = 01  
Check MEM data hazard occur  
  
if (MEM/WB.RegWrite  
and (MEM/WB.RegisterRd != 0)  
and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0))  
    and (EX/MEM.RegisterRd = ID/EX.RegisterRs2))  
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs2))  
forwardB = 01
```

# Datapath with Forwarding



# Load-Use Hazard (Load Interlock)



# Load-Use Hazard Resolution

## ■ Action

- Impossible to solve with forwarding
- Stall pipeline, insert bubble

## ■ Detection

- Check for dependency when instruction is decoded in ID stage
- ALU operand register numbers in ID stage are given by
  - ▶ IF/ID.RegisterRs1, IF/ID.RegisterRs2
- Load-use hazard when

*Query type*  
ID/EX.MemRead and

$((\text{ID/EX.RegisterRd} = \text{IF/ID.RegisterRs1}) \text{ or } (\text{ID/EX.RegisterRd} = \text{IF/ID.RegisterRs2}))$  *Hazard condition*

# Load-Use Hazard Resolution



# How to Stall the Pipeline

## ■ Force control values in ID/EX register to 0

- Inserts a nop (no operation) into the ID/EX register
  - ▶ By setting the control signals to 0
- The nop bubble will propagate through EX, MEM and WB like a regular instruction

*setup register*

## ■ Prevent update of PC and IF/ID register (*delay*)

- Current instruction is decoded again
- Following instruction is fetched again
- 1-cycle stall allows MEM to read data for 1d that can subsequently be forwarded to EX stage

# Datapath with Hazard Detection



# Illustration of Load Interlock Resolution

and  $x_4, x_2, x_5$  1w  $x_2, 40(x_1)$



# Illustration of Load Interlock Resolution



# Illustration of Load Interlock Resolution



# Illustration of Load Interlock Resolution



# Illustration of Load Interlock Resolution

after<1>



# Stalls and Performance

- Stalls reduce performance
  - But are required to get correct results
- Compiler can arrange code to avoid hazards and stalls
  - Requires knowledge of the pipeline structure



# Control Hazards

# Recap: Control Hazards

## Caused by PC-changing instructions

- conditional branches, function call/return
- may fetch wrong instructions

| Branch Instruction   | IF             | ID    | EX    | MEM | WB |    |    |        |
|----------------------|----------------|-------|-------|-----|----|----|----|--------|
| Branch successor     | — IF           | stall | stall | —   | IF | ID | EX | MEM WB |
| Branch successor + 1 | Cancel process |       |       |     | IF | ID | EX | MEM WB |
| Branch successor + 2 |                |       |       |     |    | IF | ID | EX MEM |
| Branch successor + 3 |                |       |       |     |    | IF | ID | EX     |
| Branch successor + 4 |                |       |       |     |    | IF | ID |        |
| Branch successor + 5 |                |       |       |     |    | IF |    |        |

*run after PC updated*

*Cancel process*

# Control Hazards

- Branch outcome determined in MEM
- Always assume branch not taken *static prediction*



# Reducing the Branch Delay

## ■ Move hardware to determine outcome to ID stage

- Easy

$EX \rightarrow ID$

- ▶ Target address adder
- ▶ Register comparator

- Difficult

- ▶ Branch decision: comparing two registers in ID stage requires additional forwarding and hazard detection logic

## ■ Example: branch taken

```
36:    sub     x10, x4, x8
40:    beq     x1, x3, 16      // PC-relative branch to 40+16*2 = 72
44:    and     x12, x2, x5
48:    or      x13, x2, x6
52:    add     x14, x4, x2
56:    sub     x15, x6, x7
...
72:    ld      x4, 50(x7)
```

# Example: Branch Taken



# Example: Branch Taken

*before, and*



# Cost of Moving Branch Test to ID

⇒ Must forward to 'ID' phase

- Register operands may require forwarding
  - New forwarding logic from EX/MEM or MEM/WB pipeline registers to ID needed
- Stalls due to data hazard *(if data from forward, can't determine)  
⇒ need stall*
  - 1-cycle stall if the preceding instruction is an ALU instruction



- 2-cycle stall if the preceding instruction is a load instruction



# Dynamic Branch Prediction

## Static Branch Prediction

- always taken 50%
- always not taken

- In deeper and superscalar pipelines branch penalty is more significant

+d : Based on previous decision

- Use dynamic prediction

- Branch prediction buffer (or branch history table) : history based predict (Super expensive)
- Indexed by recent branch instruction addresses (lower part of address)
- Stores outcome (taken / not taken)

- To execute a branch

- Check table, expect the same outcome
- Start fetching from fall-through or target
- If wrong, flush pipeline and flip prediction

0x104

0x108

0x112

0x116

0x120

↓ 4 ( $\Rightarrow$  0 or 1, 2 bits)

# Calculating Branch Target

- Even with predictor, still need to calculate the target address
  - 1-cycle penalty for a taken branch
- Branch target buffer (BTB)
  - Cache of target addresses
  - Indexed by PC when instruction fetched
  - If hit and instruction is branch predicted taken, can fetch target immediately



# 1-Bit Predictor Example

BTB

## Code

0x14380: L1: add x3,x3,x5  
0x14384: beq x8,x3,L1  
...  
...

| Address tag | Valid | Branch Target Buffer |
|-------------|-------|----------------------|
| ...0        | 0     |                      |
| ...4        | 0     |                      |
| ...8        | 0     |                      |
| ...c        | 0     |                      |

## Execution

| CC | Address | Instruction  | x8 | x3 | Adr tag | Valid | BTB          |
|----|---------|--------------|----|----|---------|-------|--------------|
| 10 | 14384   | beq x8,x3,L1 | 10 | 7  | 4       | 0     | first meet   |
| 11 | 14380   | add x3,x3,x5 | 8  |    | 4       | 1     | 14380        |
| 12 | 14384   | beq x8,x3,L1 | 10 | 8  | 4       | 1     | 14380        |
| 13 | 14380   | add x3,x3,x5 | 9  |    |         |       |              |
| 14 | 14384   | beq x8,x3,L1 | 10 | 9  | 4       | 1     | 14380        |
| 15 | 14380   | add x3,x3,x5 | 10 |    |         |       |              |
| 16 | 14384   | beq x8,x3,L1 | 10 | 10 | 4       | 1     | 14380        |
| 17 | 14388   | ...          |    |    |         | 0     | invalid now! |

# 1-Bit Predictor: Shortcoming

- Inner loop branches always mispredicted twice!



- Mispredict as taken on last iteration of inner loop
- Then mispredict as not taken on first iteration of inner loop next time around

# 2-Bit Predictor

- Only change prediction after two successive mispredictions



# 2-Bit Predictor Example

11: Predict taken  
 10: Predict taken  
 01: Predict not taken  
 00: Predict not taken

## Execution

| CC          | Address | Instruction        | x8 | x3 | Adr tag | Pred | BTB                |
|-------------|---------|--------------------|----|----|---------|------|--------------------|
| inner Loop1 | 10      | 14384 beq x8,x3,L1 | 10 | 7  | 4       | 00   |                    |
|             | 11      | 14380 add x3,x3,x5 |    | 8  | 4       | 01   | taken +1           |
|             | 12      | 14384 beq x8,x3,L1 | 10 | 8  | 4       | 01   | recording          |
|             | 13      | 14380 add x3,x3,x5 |    | 9  | 4       | 10   | taken +1           |
|             | 14      | 14384 beq x8,x3,L1 | 10 | 9  | 4       | 10   | 14380              |
|             | 15      | 14380 add x3,x3,x5 |    | 10 | 4       | 11   | taken +1 14380     |
|             | 16      | 14384 beq x8,x3,L1 | 10 | 10 | 4       | 11   | 14380              |
|             | 17      | 14388 ...          |    |    | 4       | 10   | not taken -1 14380 |
| ...         |         |                    |    |    |         |      |                    |
| inner Loop2 | 20      | 14384 beq x8,x3,L1 | 10 | 7  | 4       | 10   | 14380              |
|             | 21      | 14380 add x3,x3,x5 |    | 8  | 4       | 11   | 14380              |
|             | 22      | 14384 beq x8,x3,L1 | 10 | 8  | 4       | 11   | 14380              |
|             | 23      | 14380 add x3,x3,x5 |    | 9  | 4       | 11   | 14380              |
|             | 24      | 14384 beq x8,x3,L1 | 10 | 9  | 4       | 11   | 14380              |
|             | 25      | 14380 add x3,x3,x5 |    | 10 | 4       | 11   | 14380              |
|             | 26      | 14384 beq x8,x3,L1 | 10 | 10 | 4       | 11   | 14380              |
|             | 27      | 14388 ...          |    |    | 4       | 10   | 14380              |

# Datapath and Control



# Exceptions



| Accident type       | Source      | RISC-V                |
|---------------------|-------------|-----------------------|
| System reset        | External    | Exception             |
| I/O device request  | External    | Interrupt             |
| OS system call      | Internal    | Exception             |
| Undefined Operation | Internal    | Exception             |
| Hardware Error      | In/External | Exception / Interrupt |

# Exceptions and Interrupts

- “Unexpected” events requiring change in flow of control
    - Different ISAs use the terms differently
  - Exceptions
    - RISC-V: any unexpected change in control flow (either internal or external)
    - Internal exceptions arise within the CPU (e.g., undefined opcode, syscall, ...)
  - Interrupt
    - RISC-V: event from an external I/O controller
    - e.g., hard disks, network adapters, keyboard, ...
  - Dealing with them without sacrificing performance is hard
    - But also not necessary if exceptions do not happen frequently
    - Make the common case fast
- (Non-error phase)*
- divide by zero*
- 정상적인 계산이나 예외 처리*
- handle in OS*

# Handling Exceptions in RISC-V

- Save PC of offending (or interrupted) instruction *(Hardware error)*
  - Supervisor Exception Program Counter (SEPC)  
*Save address for decision or return*
- Save indication of the problem
  - Supervisor Exception Cause Register (SCAUSE); *reason*
  - 64-bits, but most bits unused
  - Exception code field: 2 for undefined opcode, 12 for hardware malfunction, ...  
*detect reason of error by looking for SEPC + state register*
- Jump to handler
  - Assume at 0x0000 0000 1C09 0000 (*+ error's handler address*)

# An Alternate Mechanism

- Vectored interrupts
  - Handler address determined by the cause
- Exception vector address to be added to a vector table base register: *(base address + Vectorize address)*
  - Undefined opcode: 00 0100 0000
  - Hardware malfunction: 01 1000 0000
- Instructions either
  - Deal with the interrupt, or
  - Jump to real handler

| Interrupt Type | Handler Address |
|----------------|-----------------|
|                |                 |
|                |                 |
|                | .               |
|                | .               |
|                | .               |
|                | .               |
|                |                 |

# Handler Actions

*from inflation and compare it with SCAUSE*

- Read cause, and transfer to relevant handler

- Determine action required

- If restartable
  - Take corrective action
  - Use SEPC to return to program

- Otherwise
  - Terminate program
  - Report error using SEPC, SCAUSE, ...



① after 'sys call': next operation

② after 'load': same address

③ general: stop program

# Exception Properties

- Restartable exceptions
  - Pipeline can flush the instruction
  - Handler executes, then returns to the instruction:  
Refetched and executed from scratch
- PC saved in SEPC register
  - Identifies causing instruction

# Exceptions in a Pipeline

- Another form of control hazard
- Consider malfunction on add in EX stage

add x1, x2, x1

- Prevent x1 from being clobbered : *operation fail, so must not override*
- Complete previous instructions
- Flush add and subsequent instructions (*flush register*)
- Set SEPC and SCAUSE register values (*address and reason*)
- Transfer control to handler

- Similar to mispredicted branch: use much of the same hardware

# Pipeline with Exception Control



# Exception Example

- Exception (hardware fault) on **add** in EX stage

```
40:    sub    x11, x2, x4
44:    and    x12, x2, x5
48:    or     x13, x2, x6
→4c:    add    x1, x2, x1 Error!)
50:    sub    x15, x6, x7
54:    lw     x16, 100(x7)
```

- Handler

```
1c090000      sw    x26, 1000(x10)
1c090004      sw    x27, 1008(x10)
...
```

# Exception Example



# Exception Example

*handler*

*flush instruction input after exception*

sw x26, 1000(x0)

bubble (nop)

bubble

bubble

or x13, . . .

IF.Flush



# More on Exceptions

## ■ Multiple exceptions

- Pipelining overlaps multiple instructions - could have multiple exceptions at once!
- Simple approach: deal with exception from earliest instruction
  - ▶ Flush subsequent instructions
  - ▶ “Precise” exceptions
- In complex pipelines
  - ▶ Multiple instructions issued per cycle
  - ▶ Out-of-order completion
  - ▶ Maintaining precise exceptions is difficult!
- “Imprecise” exceptions
  - ▶ Exception not associated with exact instruction that caused the interrupt
- Majority of modern processors, incl. RISC-V, support precise interrupts

# Module Summary



# Module Summary

## ■ Data hazards

- Solved with data forwarding and hazard detection
- Can eliminate stalls completely except for load interlock
- Relatively simple

## ■ Control hazards

- Reduce branch penalty by moving branch decision to the ID stage
  - ▶ At the expense of additional forwarding paths and hazard detection logic
- Speculative execution
  - ▶ Speculate, flush pipeline if incorrect
  - ▶ Branch prediction with 1 and 2 bit buffers

## ■ Exceptions

- Unforeseen internal or external events trigger a jump to an interrupt handler
- Address of affected instruction and cause of exception recorded in registers
- Exceptions in pipelines are passed on, similar to other data, and triggered when the instruction retires

Storage Technology and Trend

SATA, M2, NVMe, PCIe

Multi-Level Cells