

# Lecture 8: CPU – data hazard

Hunjun Lee  
[hunjunlee@hanyang.ac.kr](mailto:hunjunlee@hanyang.ac.kr)

# Illustrating 5-stage pipeline operation



*Simultaneously execute  
multiple stages  
→ We cannot share  
resources*

# Pipelined Control



# In Reality

- ◆ No identical operations e.g., R-type, I-type & J-type
    - ⇒ Unify instruction types
      - Combine instruction types to flow through “multi-function” pipe
  - ◆ No uniform sub-operations e.g., RF read VS. Memory write?
    - ⇒ Balance pipeline stages
      - Stage-latency calculation to make balanced stages
  - ◆ No independent operations e.g., Waiting data to be produced?
    - ⇒ Remove dependency and/or busy resources
      - Inter-instruction dependency detection and resolution

# There are three types of pipeline hazards

## ◆ Data Hazard

- Data dependency (Read-after-write: RAW)
- Anti dependency (Write-after-read: WAR)
- Output dependency (Write-after-write: WAW)

## ◆ Control Hazard

- Data dependency of program counter

## ◆ Structural Hazard

- Due to the lack of resources
  - Ex1) We need to split the instruction and data memory (to simultaneously execute IF and MEM stage)
  - Ex2) We may execute multiple instructions in the EX stage (#ALU < # instructions ready)

# Data dependence

Today's topic!

## Data dependence



*Read-after-Write  
(RAW)*

## Anti-dependence



*Write-after-Read  
(WAR)*

## Output-dependence



*Write-after-Write  
(WAW)*

Later...

# Dependencies and Pipelined Execution

## Sequential and Atomic instruction semantics



Two instructions may require ordering of (only) **sub-operations**



Defines what is correct  
→ Architecture

Defines what is implemented  
→ Architecture

# RAW Dependency and Hazard



We cannot execute ID stage of the following instructions until the first instruction undergoes WB stage

# Register data hazard analysis

|     | R/I-Type | LW      | SW  | Br  | J | Jr  |
|-----|----------|---------|-----|-----|---|-----|
| IF  |          |         |     |     |   |     |
| ID  | use      | use     | use | use |   | use |
| EX  |          |         |     |     |   |     |
| MEM |          |         |     |     |   |     |
| WB  | produce  | produce |     |     |   |     |

The problem is related to (1) the distance between two instructions  
and (2) the distance between “use” and “produce” stages

Register 遷移  
Reg

Register 遷合  
Reg

→ Two dependent R/I-type instructions should be separated by at least  $\text{dist}(\text{ID}, \text{WB}) + 1$ , which is “4”       $\text{dist} > 3$

Register Data Hazard

# Necessary condition for RAW hazards

If an instruction

writes to  $r_1$

$t_0 \quad t_1 \quad t_2 \quad t_3 \quad t_4 \quad t_5 \quad t_6 \quad t_7 \quad t_8 \quad t_9 \quad t_{10}$



If an instruction  
utilizes  $r_1$  as an  
operand

What about WAW and WAR hazards?

# WAR / WAW hazard

If an instruction

reads r1



If an instruction writes to r1

*Actually, there is no such thing as an WAR / WAW hazard if the CPU processes instructions in-order*

→ 명령어가 순서대로 실행되는 경우, WAR와 WAW는 발생하지 않음.

→ You will learn out-of-order CPU later ...

# A universal hazard resolution: “Pipeline Stall” (= let’s wait) resolve



**Stall:** make younger instructions wait until hazard is resolved  
Stop all up-stream stages & Drain all down-stream stages

# What is stall?

- ◆ We should not modify programmer visible states during stall (bubble)
  - Disable PC and IR latching ← 사용할 명령어를 기억하지 못하는 디자인의 가능한 오류는 Stall!
  - Control should set **RegWrite=0** and **MemWrite=0** ← 소스작업 중지



# When and how long to stall?

- ◆ Instructions  $I_A$  and  $I_B$  (where  $I_A$  comes before  $I_B$ ) have RAW hazard iff
  - $I_B$  (R/I, LW, SW, Br or JR) reads a register written by  $I_A$  (R/I or LW)
  - $\text{Dist}(I_A, I_B) \leq \text{dist}(\text{ID}, \text{WB}) = 3 \leftarrow \text{hazard 가능} \uparrow$
  - Stall cycles:  $\text{dist}(\text{ID}, \text{WB}) - \text{Dist}(I_A, I_B) + 1$   
 $= 4 - \underbrace{\text{Dist}(I_A, I_B)}_{I_A \text{가 } I_B \text{가 } \text{쓰임} \text{될 때}}$
- ◆ In other words, we must stall  $I_B$  in ID stage if it wants to read a register **to be written** by **ALL  $I_A$**  that might exist in EX, MEM or WB stage

$I_A$ 가 쓸 때  $I_B$ 가 읽을 때  
Stall 해야 한다.

# Stall condition

ID 단위 Read hazard을 헤쳐내기 위해 Data Hazard을 예상하는 데  
 → Hazard Detection Unit은 IF/ID latch 단위로 있는  
 \$1s, \$t2s로 Hazard Detection을 한다.

Latch의 write address와  
 address가 같은 경우



IF/ID에 걸쳐는  
 dest INS 정보

Latch의 write address와  
 instruction의  
 write address가 동일한 경우  
 Register가 같은지를 확인

If the two match → Stall  
 → EX, MEM, WB에서 사용하는  
 dest-reg

연속적인 두 단계가 /Data dependency/  
 일정한 n cycle stall이 발생:  
 ↳ 동일하게 처리되는 단계

# Stall condition

## ◆ Helper functions

- $\text{rs}(\text{I})$  returns the  $\text{rs}$  field of  $\text{I}$  (instruction register)
- $\text{use\_rs}(\text{I})$  returns true if  $\text{I}$  requires  $\text{RF}[\text{rs}]$

## ◆ Stall when

- $(\text{rs}(\text{IR}_{\text{ID}}) == \text{dest}_{\text{EX}}) \&\& \text{use\_rs}(\text{IR}_{\text{ID}}) \&\& \text{RegWrite}_{\text{EX}}$  // 3 Cycles
  - $(\text{rs}(\text{IR}_{\text{ID}}) == \text{dest}_{\text{MEM}}) \&\& \text{use\_rs}(\text{IR}_{\text{ID}}) \&\& \text{RegWrite}_{\text{MEM}}$  // 2 Cycles
  - $(\text{rs}(\text{IR}_{\text{ID}}) == \text{dest}_{\text{WB}}) \&\& \text{use\_rs}(\text{IR}_{\text{ID}}) \&\& \text{RegWrite}_{\text{WB}}$  // 1 Cycle
- 
- $(\text{rt}(\text{IR}_{\text{ID}}) == \text{dest}_{\text{EX}}) \&\& \text{use\_rt}(\text{IR}_{\text{ID}}) \&\& \text{RegWrite}_{\text{EX}}$  // 3 Cycles
  - $(\text{rt}(\text{IR}_{\text{ID}}) == \text{dest}_{\text{MEM}}) \&\& \text{use\_rt}(\text{IR}_{\text{ID}}) \&\& \text{RegWrite}_{\text{MEM}}$  // 2 Cycles
  - $(\text{rt}(\text{IR}_{\text{ID}}) == \text{dest}_{\text{WB}}) \&\& \text{use\_rt}(\text{IR}_{\text{ID}}) \&\& \text{RegWrite}_{\text{WB}}$  // 1 Cycle

*It is crucial that the EX, MEM and WB stages continue to advance as normal during these stall cycles*

# Impact of Stall on Performance

- ◆ Without stall

Ideal IPC = 1 ← Because of pipelining

- ◆ With stall

- Each stall cycle corresponds to 1 lost cycle
- For a program with N instructions and S stall cycles,  
Average IPC (with stall) =  $N/(N+S)$       *S Cycle Stall*

- ◆ S depends on

- Frequency of RAW hazards
- Exact distance between the hazard-causing instructions
- Hazard Stall ↓

*Stall ↑ → IPC ↓*

*Stall ↑ ↗ 가능 줄여 하면*

# Data forwarding (= register bypassing)

→ 계산된 Register value 를 다른 Register에 바로 전달하는 기술

- ◆ It is intuitive to think of RF as ‘flow of right data’

- “add rx ry rz” literally means get values from **RF[ry]** and **RF[rz]** respectively and put result in **RF[rx]**
- So, “add rx ry rz” really means

- (1) Get “the computation results” of the previous instructions to define the values of **RF[ry]** and **RF[rz]**, respectively, and
- (2) Until another instruction redefines **RF[rx]**, younger instructions that refers to **RF[rx]** can use this instruction’s result



*Why not by directly use the computation results that are not “yet” written to the register file?*

# Resolving RAW Hazard by Forwarding

- ◆ Instructions  $I_A$  and  $I_B$  (where  $I_A$  comes before  $I_B$ ) have RAW hazard iff
  - $I_B$  (R/I, LW, SW, Br or JR) reads a register written by  $I_A$  (R/I or LW)
  - $\text{dist}(I_A, I_B) \leq \text{dist}(\text{ID}, \text{WB}) = 3$
- ◆ In other words, if  $I_B$  in ID stage reads a register written by  $I_A$  in EX, MEM or WB stage, then the operand required by  $I_B$  is not yet in RF
  - Retrieve operand “**from datapath**” instead of waiting until the RF is updated!
  - Retrieve operand “**from the youngest definition**” if multiple definitions are outstanding!

# Looking closer at forwarding



**We need to transfer the data from the latches  
to the execution stages**

# Internal RF forwarding

- ◆ We may design an RF that internally forwards data

```
// Verilog-style pseudocode
module RF (clk, raddr, rdata, waddr, wdata);
    // port declaration
    ...
    // storage
    reg [31:0] RF_data [0:31];

    always @(*) begin
        if(waddr == raddr1)      rdata = wdata;
        else                      rdata = RF_data[raddr];
    end

    always @(posedge clk) begin
        // write RF
        RF_data[waddr] <= wdata;
    end
endmodule
```

# Forwarding Paths



b. With forwarding

# Forwarding Logic

| Mux control   | Source | Explanation                                                                                               |
|---------------|--------|-----------------------------------------------------------------------------------------------------------|
| ForwardA = 00 | ID/EX  | The first ALU operand comes from the register file. <i>Register file의 2nd slot에서 4th : Raw forwarding</i> |
| ForwardA = 10 | EX/MEM | The first ALU operand is forwarded from the prior ALU result.                                             |
| ForwardA = 01 | MEM/WB | The first ALU operand is forwarded from data memory or an earlier ALU result.                             |
| ForwardB = 00 | ID/EX  | The second ALU operand comes from the register file.                                                      |
| ForwardB = 10 | EX/MEM | The second ALU operand is forwarded from the prior ALU result.                                            |
| ForwardB = 01 | MEM/WB | The second ALU operand is forwarded from data memory or an earlier ALU result.                            |

rs != 0 ?

→ rs == 0 : Zero leg

정수 0인 경우는 Register

→ 0이 Forwarding한 결과

if ( $rs_{EX} \neq 0$ ) && ( $rs_{EX} == dest_{MEM}$ ) &&  $RegWrite_{MEM}$  then

**forward operand from MEM stage** // dist=1

else if ( $rs_{EX} \neq 0$ ) && ( $rs_{EX} == dest_{WB}$ ) &&  $RegWrite_{WB}$  then

**forward operand from WB stage** // dist=2

else

**use the operand from register file** // dist  $\geq 3$

Ordering matters!! Consider the following ...

Raw

|                   |
|-------------------|
| add \$1, \$1, \$2 |
| add \$1, \$1, \$3 |
| add \$1, \$1, \$4 |

이 경우 세 번째 명령어는  
가장 영향력이 큽니다!

# Datapath with forwarding



# Data Hazard Analysis (with Forwarding)

|     | R/I-Type       | LW      | SW  | Br  | J | Jr  |
|-----|----------------|---------|-----|-----|---|-----|
| IF  |                |         |     |     |   |     |
| ID  |                |         |     |     |   | use |
| EX  | use<br>produce | use     | use | use |   |     |
| MEM |                | produce |     |     |   |     |
| WB  |                |         |     |     |   |     |

Load word는 user produce한 동시에 처리됨.  
→ 문제가 발생한다.

주소 계산  
use

produce → write

Forwarding은 still write 카운터는, 그는 produce를 때는 한 번에 바로 사용된다

- Even with data-forwarding, RAW dependence on an immediate preceding **LW instruction produces a hazard**

# Datapath with hazard detection



# Two ways to handle hazards

## ◆ Option 1) Let the software handle the hazards

- The instruction immediately following a load sees **the old value**
- The software (e.g., compiler) should handle the hazard (e.g., by inserting nop or by scheduling instructions)



하드웨어는 이를 전 처리를 하지  
않고, 소프트웨어가 NOP를 삽입하는  
등의 방법으로 처리하도록 한다.

## ◆ Option 2) Use stall on load operation

- A dependent immediate successor to LW **must stall 1 cycle in ID**
- **Stall = ( $rs(IR_{ID}) == dest_{EX}$ ) \&& use\_rs( $IR_{ID}$ ) \&& MemRead<sub>EX</sub>**



1-cycle Stall을 추가한다.  
LH? (Use가 producer)  
1-cycle stall

# Another solution: load delay slot (compiler-level solution)

로드 데이터槽에 load한 후에  
Add(연산)을 하여 놓는다.  
그리고 연산.



- ◆ Some ISA exposes the pipeline stages, so that the compiler may modify the code to remove data hazards
  - Insert *nop* or schedule independent instruction in between

# Load delay slot!

- ◆ Is this possible?



# Terminologies

- ◆ Dependencies
  - Ordering requirement between instructions
- ◆ Pipeline Hazards:
  - (Potential) violations of dependencies due to hazards
- ◆ Hazard Resolution:
  - **Static** → Schedule instructions at compile time to avoid hazards
  - **Dynamic** → Detect hazard and adjust pipeline operation
    - Stall, Flush or Forward
- ◆ Pipeline Interlock: Hazard ~~해결하기 위한~~ pipeline mechanism
  - Hardware mechanisms for dynamic hazard resolution
  - Detect and enforce dependences at run time

# Is 5-stage pipeline sufficient?

5 Stage Pipeline Stage 간에 걸리는 시간은? Hazard 가능성 ↑, 허브

→ Pipeline Hazard 발생 가능성 ↑

- ◆ There are still plenty of combinational delay between registers
- ◆ “**Superpipelining**” increases pipelining degree such that even intrinsic operations (e.g., ALU, RF read/write, memory access) require multiple stages
- ◆ Potential problem: “more data hazards”



*Pipeline stall even w/ forwarding*

# Intel P4's superpipelined Integer ALU

- ◆ Intel P4's ALU mitigates the dependency issues



# But above all, you cannot pipeline till infinity!

- ◆ There are diminishing returns on clock speed



- ◆ Complicates hardware (there should be tons of wires for data forwarding)
- ◆ You cannot always solve the dependency problem w/ forwarding!

**We need to use superscalar!!! (not now ...)**

# Question?

*Announcements:*

*Reading:*      *finish reading P&H Ch.4*

*Handouts:*      *none*