

# Pipelined Architecture

---

Virendra Singh

Computer Architecture and Dependable Systems Lab

Department of Electrical Engineering

Indian Institute of Technology Bombay

<http://www.ee.iitb.ac.in/~viren/>

E-mail: [viren@ee.iitb.ac.in](mailto:viren@ee.iitb.ac.in)

*CS-683: Advanced Computer Architecture*

Lecture 7 (12 Aug 2021)



**CADSL**

# Advantages of Pipeline

---

- After the fifth cycle (CC5), one instruction is completed each cycle; CPI  $\approx 1$ , neglecting the initial pipeline latency of 5 cycles.
  - *Pipeline latency is defined as the number of stages in the pipeline, or*
  - *The number of clock cycles after which the first instruction is completed.*
- The clock cycle time is about four times shorter than that of single-cycle datapath and about the same as that of multicycle datapath.
- For multicycle datapath, CPI = 3. ....
- So, pipelined execution is faster, but . . .



# Program Execution



# Pipeline Hazards

---

- Definition: *Hazard in a pipeline is a situation in which the next instruction cannot complete execution one clock cycle after completion of the present instruction.*
- Three types of hazards:

- Structural hazard (resource conflict)

- Data hazard

- Control hazard

- detect  
- Correct ✓ }

⊗ fetch one instruction per cycle

UNIFIED PIPELINES

⊗ all instructions have to pass through all the pipeline stages



# Data Hazard

- Data hazard means that an instruction cannot be completed because the needed data, to be generated by another instruction in the pipeline, is not available.
- Example: consider two instructions:

❖ add \$s0, \$t0, \$t1 ✓ →

❖ sub \$t2, \$s0, \$t3 ✓ # needs \$s0

10%.

STALL - 3 cycle ] penalty

$$CPI = 1 + \cdot 1 \times 3 = \underline{\underline{1.3}}$$



|    | IF    | ID    | Ex    | Mem   | WB    |
|----|-------|-------|-------|-------|-------|
| 1. | $I_1$ | $I_1$ |       |       |       |
| 2  | $I_2$ | $I_1$ |       |       |       |
| 3. | $I_3$ | $I_2$ |       |       |       |
| 4. | $I_4$ | $I_3$ |       | $I_1$ |       |
| 5  | $I_5$ | $I_4$ | $I_3$ | $I_2$ | $I_1$ |

Diagram illustrating data bypassing:

update state

CORRECTION [Data bypassing/forwarding  $\Rightarrow$  save penalty]



# Example of Data Hazard



# Forwarding or Bypassing

---

- Output of a resource used by an instruction is forwarded to the input of some resource being used by another instruction.
- Forwarding can eliminate some, but not all, data hazards.



# Forwarding for Data Hazard



# Forwarding Unit Hardware



# Forwarding Alone May Not Work



$$\begin{cases} \tau_1 \leftarrow a \\ \tau_3 = \tau_1 + \tau_2 \end{cases}$$

STALL  
for  
one cycle-

|    | IF    | ID    | Ex    | Mem   | WB |
|----|-------|-------|-------|-------|----|
| 1. | $I_1$ |       |       |       |    |
| 2. | $I_2$ | $I_1$ |       |       |    |
| 3. | $I_3$ | $I_2$ | $I_1$ |       |    |
| 4. | $I_4$ | $I_3$ | $I_2$ | $I_1$ |    |



|    |       |       |       |       |       |
|----|-------|-------|-------|-------|-------|
| 5  | $I_4$ | $I_3$ | $I_2$ | -     | $I_1$ |
| 6. | $I_5$ | $I_4$ | $I_3$ | $I_2$ | -     |

7

$$\begin{aligned} CPI &: 1 + 3 \times 4 \times 1 \\ &= 1 + .12 = 1.12 \end{aligned}$$

Is it the bottleneck?



# Use Bubble and Forwarding



# Hazard Detection Unit Hardware



# Resolving Hazards

---

- Hazards are resolved by Hazard detection and forwarding units.
- Compiler's understanding of how these units work can improve performance.



# Avoiding Stall by Code Reorder

C code:

$$\begin{aligned} & \checkmark A = B + E; \\ & C = B + F; \end{aligned}$$

MIPS code:

|     |       |            |
|-----|-------|------------|
| Iw  | \$t1, | 0(\$t0)    |
| Iw  | \$t2, | 4(\$t0)    |
| add | \$t3, | \$t1, \$t2 |
| sw  | \$t3, | 12(\$t0)   |
| Iw  | \$t4, | 8(\$t0)    |
| add | \$t5, | \$t1, \$t4 |
| sw  | \$t5, | 16,\$(t0)  |

$\times \left\{ \begin{array}{l} \text{1 of } S_1 \\ \text{add. } S_2, S_1, S_3 \end{array} \right.$



# Reordered Code

C code:

$$\begin{aligned} A &= B + E; \\ C &= B + F; \end{aligned}$$

MIPS code:

|     |       |            |
|-----|-------|------------|
| lw  | \$t1, | 0(\$t0)    |
| lw  | \$t2, | 4(\$t0)    |
| lw  | \$t4, | 8(\$t0)    |
| add | \$t3, | \$t1, \$t2 |
| sw  | \$t3, | 12(\$t0)   |
| add | \$t5, | \$t1, \$t4 |
| sw  | \$t5, | 16,\$t0)   |

- DATA DEPENDENCY

\* Detect  
\* Correct

Stall      forwarding      Reordering

no hazard

no hazard

CORRECTNESS

Reduce the stalls due to dependency



# Control Hazard

- Instruction to be fetched is not known!
- Example: Instruction being executed is branch-type, which will determine the next instruction:

$I_1$  add \$4, \$5, \$6  
 $I_2$  beq \$1, \$2, 40  
next instruction  
...  
40 and \$7, \$8, \$9

$I_r$   $\Rightarrow$  - detect  
?  $\Rightarrow$  - correct.

Branch  $\Rightarrow$  Conditional.  
Conditional  $\Rightarrow$  Unconditional ✓

$\boxed{f_1 = f_2}$



$I_1: J. loc$  ✓  
 $I_2: add \dots$   
 $I_{20}: Sub$

unconditional  
 every branch  
 $\Rightarrow 1$  cycle  $\Leftarrow$  STALL

↓  
penalty

180%:



$$CPI = 1 + 2 \times 2 \times 1 \quad \checkmark$$

$$CPI = \underline{1.4}$$



$I_1: \text{beg } r_1, r_2, 4D$   
 $I_2: \text{add } r_4, r_5, r_6$   
 $I_3$   
 $I_4: \text{sub } r_7, r_8, r_9$

$\text{if } (i < j)$

$NT \rightarrow$  do not need to  
do anything →

NO CORRECTION



$$\overline{x_0} \rightarrow \overline{x_0}^T$$

~~start~~  
Throw out  
2 instruc

$$CPI = 1 + 2 \times 5 \times 2 \\ = 1.2$$

2 cycles penalty

no  
penalty



How to reduce



Branch. 70%. T & 30%. NT

Stack



Branch History Table <sup>Address</sup>



| Pc  | BTA  |
|-----|------|
| 100 | 5000 |
| 700 | 1000 |
|     |      |
|     |      |
|     |      |
|     |      |

Fetch

PC Add





$I_1$ :  $beg. r_1, r_2, 40$

$I_2$ :  $add\ r_3, r_2, r_4$

$D_{20}$ :

IF ID EX MEAN WS

1.  $I_1$

2.  $D_{20} \rightarrow$



15%

$D_{20r.}$

$$\begin{aligned}
 CPI &= 1 + 15 \times 3 \times 2 \\
 &= 1 + 90 = \underline{\underline{109}} \\
 &\quad \underline{\underline{1.09}}
 \end{aligned}$$



SAFE



# Stall on Branch



# Why Only One Stall?

---

- Extra hardware in ID phase:
  - Additional ALU to compute branch address
  - Comparator to generate zero signal
  - Hazard detection unit writes the branch address in PC



# Ways to Handle Branch

---

- Stall or bubble
- Delayed branch
- Branch prediction:
  - Heuristics
    - Next instruction
    - Prediction based on statistics (dynamic)
    - Hardware decision (dynamic)
  - Prediction error: pipeline flush



# Branch Hazard

---

- Consider heuristic – branch not taken.
- Continue fetching instructions in sequence following the branch instructions.
- If branch is taken (indicated by *zero* output of ALU):
  - Control generates *branch* signal in ID cycle.
  - *branch* activates *PCSource* signal in the MEM cycle to load PC with new branch address.
  - *Three instructions in the pipeline must be flushed if branch is taken – can this penalty be reduced?*



# Branch Not Taken

---



# Branch Taken



# Branch Prediction

---

- Useful for program loops.
- A one-bit prediction scheme: a one-bit buffer carries a “history bit” that tells what happened on the last branch instruction
  - History bit = 1, branch was taken
  - History bit = 0, branch was not taken



# Branch Prediction



# Thank You

