

# Pipelined Architecture

Virendra Singh

Computer Architecture and Dependable Systems Lab

Department of Electrical Engineering

Indian Institute of Technology Bombay

<http://www.ee.iitb.ac.in/~viren/>

E-mail: [viren@ee.iitb.ac.in](mailto:viren@ee.iitb.ac.in)

JSA  
70 - 80's

RISC  
CISC

*CS-683: Advanced Computer Architecture*



Lecture 6 (10 Aug 2021)

**CADSL**

# RISC Architecture

---

- Simple instructions
- Fixed Instruction Encoding
- Limited Addressing Mode
- Instruction count increases ✓
- Simple controller ✓
- Load/Store architecture ✓
- Limited addressing modes



# Overview of MIPS

- ❖ Simple instructions, all 32 bits wide
- ❖ Very structured, no unnecessary baggage
- ❖ Only three instruction formats

~~add  
a < 100~~

- Reg Add.
- Immediate



$$2^6 = \cancel{32} \ 64$$



# Example processor MIPS subset

---

## MIPS Instruction – Subset

- ❖ Arithmetic and Logical Instructions
  - add, sub, or, and, slt ✓
- ❖ Memory reference Instructions
  - lw, sw ✓
- ❖ Branch
  - beq, j ✓



# Processor Architecture



Simple

ADD / SUB / MUL

- ① fetch an instruction  
(PC / IP)
- ② Understand (Decode)
- ③ Read Operands
- ④ Execute instruction
- ⑤ Update the result



Can complete  
in one step

add.  $r_1, r_2 \rightarrow r_3$

$$T = t_{inc} = t_f + t_d + t_{or} + t_e + t_u$$

$$r_1 = \frac{r_2 + r_3}{r}$$

Single cycle implementation CPI=1



Time ,  $T = \underline{\underline{C}} \times \overbrace{CPI \times \underline{\underline{C}}}$



## Combined Datapaths



# Time for Jump (J-Type)

---

- ALU (R-type) 6ns
- Load word (I-type) 8ns
- Store word (I-type) 7ns
- Branch on equal (I-type) 5ns
- Jump (J-type)
  - Fetch (memory read) 2ns
  - Total 2ns



- 1. fetch
- 2. understand (decode)
- 3. operand read
- 4. Execute ↗
- 5. update the result

One step → only one sub task.

Multiple steps ↗

$$\text{Time} = \max\{t_f + t_d, t_{or}, t_e, t_u\}$$

$$\underline{\underline{CPI}} > 1$$

Mult-cycle Implementation



AL

1. fetch 2. decode 3. Op. read. 4. Ex- 5. Update- } 5

Load

1. fetch, 2. Decode 3. Op. read. 4. Add computation, } 6  
5. Read memmey 6. Update }



# Multicycle Instruction Execution

| Step                                                      | R-type<br>(4 cycles)                                                                                                 | Mem. Ref.<br>(4 or 5 cycles)                              | Branch type<br>(3 cycles)                    | J-type<br>(3 cycles)                                 |
|-----------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------|----------------------------------------------|------------------------------------------------------|
| Instruction fetch                                         | IR $\leftarrow$ Memory[PC]; PC $\leftarrow$ PC+4                                                                     |                                                           |                                              |                                                      |
| Instr. decode/<br>Reg. fetch                              | A $\leftarrow$ Reg(IR[21-25]); B $\leftarrow$ Reg(IR[16-20])<br>ALUOut $\leftarrow$ PC + (sign extend IR[0-15]) << 2 |                                                           |                                              |                                                      |
| Execution, addr.<br>Comp., branch &<br>jump<br>completion | ALUOut $\leftarrow$<br>A op B                                                                                        | ALUOut $\leftarrow$<br>A+sign extend<br>(IR[0-15])        | If (A = B)<br>then<br>PC $\leftarrow$ ALUOut | PC $\leftarrow$ PC[28-<br>31]<br>  <br>(IR[0-25]<<2) |
| Mem. Access or<br>R-type<br>completion                    | Reg(IR[11-<br>15]) $\leftarrow$<br>ALUOut                                                                            | MDR $\leftarrow$ M[ALUout]<br>or M[ALUOut] $\leftarrow$ B |                                              |                                                      |
| Memory read<br>completion                                 |                                                                                                                      | Reg(IR[16-20]) $\leftarrow$<br>MDR                        |                                              |                                                      |



# Multicycle Datapath



# CPI of a Computer

$$\text{CPI} = \frac{\sum_k (\text{Instructions of type } k) \times \text{CPI}_k}{\sum_k (\text{instructions of type } k)}$$

where

$\text{CPI}_k$  = Cycles for instruction of type k

**Note:** *CPI is dependent on the instruction mix of the program being run. Standard benchmark programs are used for specifying the performance of CPUs.*



50% ALU  
CPI = 5  
- 5x5 +  
20% Load.  
• 2x6.

## What Next

Simple cycle.

$$\text{CPI} = 1$$

$T \uparrow$

Multicycle

$$\text{CPI} \uparrow$$

$C \downarrow$

$$\text{CPI} = 1$$



# Traffic Flow



# Single Lane Traffic

---



© BNPS.CO.UK



10 Aug 2021

CS683@IITB

17

CADSL

# ILP: Instruction Level Parallelism

---

- Single-cycle and multi-cycle datapaths execute one instruction at a time.
- How can we get better performance?
- Answer: Chaining of instructions
- Takes advantage of
  - CPI of single cycle ✓
  - Cycle time of multicycle implementation



# Automobile Team Assembly



1 car assembled every four hours  
6 cars per day  
180 cars per month  
2,040 cars per year

# Automobile Assembly Line



First car assembled in 4 hours (pipeline latency)

thereafter 1 car per hour

21 cars on first day, thereafter 24 cars per day

717 cars per month ✓

8,637 cars per year ✓



# Throughput: Team Assembly



Time of assembling one car =  $n$  hours

where  $n$  is the number of nearly equal subtasks,  
each requiring 1 unit of time

Throughput =  $\frac{1}{n}$  cars per unit time



# Throughput: Assembly Line



Time to complete first car =  $n$  time units (latency)

Cars completed in time  $T$  =  $T - n + 1$

Throughput =  $1 - (n - 1)/ T$  car per unit time

$$\frac{\text{Throughput (assembly line)}}{\text{Throughput (team assembly)}} = \frac{1 - (n - 1)/ T}{1/n} = n - \frac{n(n - 1)}{T} \rightarrow n \quad \text{as } T \rightarrow \infty$$



# Some Features of Assembly Line

---



# Pipelining in a Computer

- Divide datapath into nearly equal tasks, to be performed serially and requiring non-overlapping resources.
- Insert registers at task boundaries in the datapath; registers pass the output data from one task as input data to the next task.
- Synchronize tasks with a clock having a cycle time that just exceeds the time required by the longest task.
- Break each instruction down into a fixed number of tasks so that instructions can be executed in a staggered fashion.



AL

fetch  
Decode &  
Operando  
read  
Execute  
Update

(load)

fetch  
Decode &  
OR  
Execute  
(add.  
computatio)  
Mem.  
Access  
(read)  
Update

Store

fetch  
Decode  
Execute  
Memory  
Write



# Pipelined Datapath

| Instruction class                | Instr. fetch (IF) | Instr. Decode (also reg. file read) (ID) | Execution (ALU Operation) (EX) | Data access (MEM) | Write Back (Reg. file write) (WB) | Total time |
|----------------------------------|-------------------|------------------------------------------|--------------------------------|-------------------|-----------------------------------|------------|
| lw                               | 2ns               | 1ns<br>2ns                               | 2ns                            | 2ns               | 1ns<br>2ns                        | 10ns       |
| sw                               | 2ns               | 1ns<br>2ns                               | 2ns                            | 2ns               | 1ns<br>2ns                        | 10ns       |
| R-format: add, sub, and, or, slt | 2ns               | 1ns<br>2ns                               | 2ns                            | 2ns               | 1ns<br>2ns                        | 10ns       |
| B-format: beq                    | 2ns               | 1ns<br>2ns                               | 2ns                            | 2ns               | 1ns<br>2ns                        | 10ns       |

No operation on data; idle time inserted to equalize instruction lengths.



# Pipelining of RISC Instructions

---



*Although an instruction takes five clock cycles,  
one instruction is completed every cycle.*



# Pipelined Datapath



# Program Execution



|    | IF    | ID    | Ex    | Mem.  | WB/Wdcl - |
|----|-------|-------|-------|-------|-----------|
| 1. | $I_1$ |       |       |       | G         |
| 2  | $D_2$ | $I_1$ |       |       |           |
| 3  | $I_3$ | $D_2$ | $I_4$ |       |           |
| 4. | $I_4$ | $D_3$ | $D_2$ | $I_1$ |           |
| 5  | $D_5$ | $D_4$ | $D_3$ | $I_2$ | $I_1$     |
| 6. | $I_6$ | $D_5$ | $I_4$ | $I_3$ | $D_2$     |
| 7  | $I_7$ | $I_6$ | $I_5$ | $I_4$ | $D_3$ .   |

latency



## Assumption ✓

- 1. All the instructions are independent
- 2. Follow straight code.

↓  
No branch. [No change in  
control flow)

## REACTIVE

- Detect ✓ }
- Correct ✓ }



$$\frac{CPI = 1}{C = \downarrow}$$

T<sub>1</sub>  
T<sub>2</sub>  
T<sub>3</sub>  
f<sub>1</sub>  
f<sub>2</sub>  
f<sub>3</sub>  
f<sub>4</sub>  
f<sub>5</sub>  
f<sub>6</sub>  
f<sub>7</sub>



# Advantages of Pipeline

---

- After the fifth cycle (CC5), one instruction is completed each cycle; CPI  $\approx 1$ , neglecting the initial pipeline latency of 5 cycles.
  - *Pipeline latency is defined as the number of stages in the pipeline, or*
  - *The number of clock cycles after which the first instruction is completed.*
- The clock cycle time is about four times shorter than that of single-cycle datapath and about the same as that of multicycle datapath.
- For multicycle datapath, CPI = 3. ....
- So, pipelined execution is faster, but . . .



# Pipeline Hazards

---

- Definition: *Hazard in a pipeline is a situation in which the next instruction cannot complete execution one clock cycle after completion of the present instruction.*
- Three types of hazards:
  - Structural hazard (resource conflict)
  - Data hazard ✓
  - Control hazard ✓



# Data Hazard

---

- Data hazard means that an instruction cannot be completed because the needed data, to be generated by another instruction in the pipeline, is not available.
- Example: consider two instructions:

❖ add     $\$s0, \$t0, \$t1$  }  
❖ sub     $\$t2, \$s0, \$t3$  }      # needs  $\$s0$

1, add  $r_1, r_2, r_3$  )  
2, sub  $r_4, r_1, r_5$  )

↓  
Decide .





# Example of Data Hazard



# Forwarding or Bypassing

---

- Output of a resource used by an instruction is forwarded to the input of some resource being used by another instruction.
- Forwarding can eliminate some, but not all, data hazards.



# Forwarding for Data Hazard



# Forwarding Unit Hardware



# Thank You

