

5118007-02 Computer Architecture

# Ch. 4 The Processor

21 May 2024

Shin Hong

# Pipelining of Instruction Executions

- K instruction executions can be overlapped over K steps



# Pipeline Registers

- Transfer data and control via pipeline registers: all instructions advance during each clock cycle from one pipeline register to the next
- Load data at each clock cycle



# Ex. Load Instruction



# Ex. Load Instruction



# Ex. Load Instruction



# Ex. Load Instruction



# Ex. Store Instruction



# Ex. Store Instruction

SW  
Write-back



# Multi-clock-cycle Pipeline Diagram



# Correct Arguments?

1. Allowing jumps, branches, and ALU instructions to take fewer stages than the five required by the load instruction will increase pipeline performance under all circumstances.
2. Trying to allow some instructions to take fewer cycles does not help, since the throughput is determined by the clock cycle; the number of pipe stages per instruction affects latency, not throughput.
3. You cannot make ALU instructions take fewer cycles because of the write-back of the result, but branches and jumps can take fewer cycles, so there is some opportunity for improvement.
4. Instead of trying to make instructions take fewer cycles, we should explore making the pipeline longer, so that instructions take more cycles, but the cycles are shorter. This could improve performance.

# Pipelined Control

- Control signals can be classified by the affecting stages

| Instruction | Execution/address calculation stage control lines |        |        |        | Memory access stage control lines |          |           | Write-back stage control lines |           |
|-------------|---------------------------------------------------|--------|--------|--------|-----------------------------------|----------|-----------|--------------------------------|-----------|
|             | RegDst                                            | ALUOp1 | ALUOp0 | ALUSrc | Branch                            | Mem-Read | Mem-Write | Reg-Write                      | Memto-Reg |
| R-format    | 1                                                 | 1      | 0      | 0      | 0                                 | 0        | 0         | 1                              | 0         |
| lw          | 0                                                 | 0      | 0      | 1      | 0                                 | 1        | 0         | 1                              | 1         |
| sw          | X                                                 | 0      | 0      | 1      | 0                                 | 0        | 1         | 0                              | X         |
| beq         | X                                                 | 0      | 1      | 0      | 1                                 | 0        | 0         | 0                              | X         |



# Pipelined Datapath with Control Signal



# Example

```
lw      $10, 20($1)  
sub    $11, $2, $3  
and    $12, $4, $5  
or     $13, $6, $7  
add    $14, $8, $9
```

# Example



# Example



# Example



# Example















# Hazard

The situation when the next instruction cannot be executed in the following clock cycle.

1. Structural Hazard : Hardware cannot support combination of instruction.

ex) The instruction and data should be fetched from the memory at the same time.

2. Data Hazard : An instruction depends on the results of a previous instruction still in pipeline.

ex) add \$s0, \$t0, \$t1  
sub \$t2, \$s0, \$t3

# Dependencies

- Problem with starting next instruction before first is finished
  - dependencies that go backward in time are data hazards



# Hazard

3. Control Hazard : Control may be transferred to another instruction  
based on the results of an instruction

Ex)    beq \$t0, \$t1, L1

add .....

sub .....

L1 and .....

→ All the Hazards can be resolved by waiting

- pipeline control must detect the hazard
- take action (or delay action) to resolve hazards

# Control (Branch) Hazard Example

- When we decide to branch, other instructions are in the pipeline!
- We are predicting branch not taken
  - need to add hardware for **flushing instructions** if we are wrong



# Data hazard (RAW hazard)

ADD \$1, \$2, \$3  
SUB \$4, \$1, \$5



Time



# Solutions to Data Hazard

- Freezing the pipeline
- (Internal) Forwarding
- Compiler scheduling

# Forwarding

| Time (in clock cycles)     |      |      |      |      |         |      |      |      |      |
|----------------------------|------|------|------|------|---------|------|------|------|------|
| CC 1                       | CC 2 | CC 3 | CC 4 | CC 5 | CC 6    | CC 7 | CC 8 | CC 9 |      |
| Value of register \$2 : 10 | 10   | 10   | 10   | 10   | 10/? 20 | ? 20 | ? 20 | ? 20 | ? 20 |
| Value of EX/MEM : X        | X    | X    | ? 20 | X    | X       | X    | X    | X    | X    |
| Value of MEM/WB : X        | X    | X    | X    | ? 20 | X       | X    | X    | X    | X    |

Program  
execution order  
(in instructions)

sub \$2, \$1, \$3



and \$12, \$2, \$5



or \$13, \$6, \$2



add \$14, \$2, \$2



sw \$15, 100(\$2)



what if this \$2 was \$13?

# Forwarding Paths



b. With forwarding

# EX hazard

```
sub    $2, $1,$3      # Register $2 written by sub  
and    $12,$2,$5      # 1st operand($2) depends on sub
```

$$\text{EX/MEM.RegisterRd} = \text{ID/EX.RegisterRs} = \$2$$

```
if (EX/MEM.RegWrite  
and (EX/MEM.RegisterRd ≠ 0)  
and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10  
  
if (EX/MEM.RegWrite  
and (EX/MEM.RegisterRd ≠ 0)  
and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10
```

# MEM Hazard

```
if (MEM/WB.RegWrite  
and (MEM/WB.RegisterRd ≠ 0)  
and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01  
  
if (MEM/WB.RegWrite  
and (MEM/WB.RegisterRd ≠ 0)  
and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01  
  
if (MEM/WB.RegWrite  
and (MEM/WB.RegisterRd ≠ 0)  
and not(EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)  
       and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRs))  
and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01  
  
if (MEM/WB.RegWrite  
and (MEM/WB.RegisterRd ≠ 0)  
and not(EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)  
       and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRt))  
and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01
```

# Stalls

```
if (ID/EX.MemRead and  
((ID/EX.RegisterRt = IF/ID.RegisterRs) or  
(ID/EX.RegisterRt = IF/ID.RegisterRt)))  
    stall the pipeline
```



# Datapath with Hazard Detection



# Data Hazard : Compiler scheduling

ADD \$1, \$2, \$3

SUB \$4, \$1, \$5

Iw \$6, 100(\$7)

AND \$8, \$8, \$10

- Add ‘nop’ instruction

ADD \$1, \$2, \$3

NOP

NOP

SUB \$4, \$1, \$5

- Or rearrange the order of the instruction

ADD \$1, \$2, \$3

Iw \$6, 100(\$7)

AND \$8, \$8, \$10

SUB \$4, \$1, \$5

# Solutions to Control Hazard

- Stall
- Optimized branch processing
- Branch prediction
- Delayed branch

# Control Hazard : Stall

- Caused by PC-changing instructions  
(Branch, Jump, Call/Return)

| Branch instruction   | IF | ID | EX    | MEM   | WB |    |    |     |    |
|----------------------|----|----|-------|-------|----|----|----|-----|----|
| Branch successor     |    | IF | stall | stall | IF | ID | EX | MEM | WB |
| Branch successor + 1 |    |    |       |       | IF | ID | EX | MEM | WB |
| Branch successor + 2 |    |    |       |       | IF | ID | EX | MEM |    |
| Branch successor + 3 |    |    |       |       | IF | ID | EX |     |    |
| Branch successor + 4 |    |    |       |       | IF | ID |    |     |    |
| Branch successor + 5 |    |    |       |       |    |    |    |     | IF |

For 5-stage pipeline, 3 cycle penalty  
15% branch frequency. CPI =  $1 + 0.15 * 3 = 1.45$

# Control Hazard : Optimized Branch Processing

1. Find out branch taken or not early

→ simplified branch condition

2. Compute branch target address early

→ extra hardware

# Control Hazard : Optimized Branch Processing

- Reducing Delay by moving Branch execution to ID
  - There is only one instruction to flush if the branch is taken (the currently being fetched instruction)
  - Move up the Branch Address Calculation
    - duplicate adder
  - Moving up the Branch Test itself (use XORs and OR for equality test)
  - IF.Flush control line : clearing IF/ID registers transforms the fetch instruction into nop (instruction that does no operation to change state)

# Control Hazard : Optimized Branch Processing



# Control Hazard : Optimized Branch Processing



# Control Hazard : Branch Prediction

- One simple approach is to always predict branch will fail.
- If the prediction is wrong insert a bubble. : flush the instruction in the pipe and fetch the instruction at branch target address.



# Control Hazard : Branch Prediction

- Predict: guess one direction then back up if wrong
- Predict-not-taken

| Untaken branch instruction | IF | ID | EX | MEM | WB  |     |     |    |  |
|----------------------------|----|----|----|-----|-----|-----|-----|----|--|
| Instruction $i+1$          | IF | ID | EX | MEM | WB  |     |     |    |  |
| Instruction $i+2$          |    | IF | ID | EX  | MEM | WB  |     |    |  |
| Instruction $i+3$          |    |    | IF | ID  | EX  | MEM | WB  |    |  |
| Instruction $i+4$          |    |    |    | IF  | ID  | EX  | MEM | WB |  |

| Taken branch instruction | IF | ID   | EX   | MEM  | WB   |     |     |    |  |
|--------------------------|----|------|------|------|------|-----|-----|----|--|
| Instruction $i+1$        | IF | Idle | Idle | idle | Idle |     |     |    |  |
| Branch target            |    | IF   | ID   | EX   | MEM  | WB  |     |    |  |
| Branch target +1         |    |      | IF   | ID   | EX   | MEM | WB  |    |  |
| Branch target +2         |    |      |      | IF   | ID   | EX  | MEM | WB |  |

# Static Branch Prediction

- (1)Never branch – Assume the branch never take place.
- (2)Always branch – Assume the branch always take place.
- (3)Predict by op-code – Decision is made based on the opcode.

ex) beq ~ never branch

bne ~ always branch

# Dynamic Branch Prediction

- Use a [Branch Prediction Buffer](#) or [Branch History Table](#)
  - The memory contains a bit saying the branch was recently taken or not.
- Prediction is just a Hint assumed to be correct. Fetch begins in the predicted direction. If Hint turns out to be wrong, the prediction bits are changed.
- If an instruction is predicted to be taken, fetching begins from the target as soon as the PC is known. (It can be as early as ID stage)

# Dynamic Branch Prediction

## Simple 1-bit prediction scheme



# Dynamic Branch Prediction



A Prediction must be Wrong Twice before it is changed.

# Exceptions and Interrupts)

- events other than branches or jumps that change the normal flow of instruction executions
  - to handle unexpected events from within the processor (exceptions), or external I/O devices (interrupt)

| Type of event                                 | From where? | MIPS terminology       |
|-----------------------------------------------|-------------|------------------------|
| I/O device request                            | External    | Interrupt              |
| Invoke the operating system from user program | Internal    | Exception              |
| Arithmetic overflow                           | Internal    | Exception              |
| Using an undefined instruction                | Internal    | Exception              |
| Hardware malfunctions                         | Either      | Exception or interrupt |

# Exceptions

- Two cases
  - Undefined instruction
  - Arithmetic overflow
- Response
  - Save the address of the offending instruction in the Exception Program Counter (EPC)
  - Transfer control to the operating system at a specified address

# Exception Handling

- How to identify the cause
  - via status register (Cause register)
    - MIPS32
    - Five-bit fields encodes the two possible source of exceptions
  - via vectored interrupt
    - Different interrupt handlers are executed depending on the cause

# Exception in Pipeline Implementation

- An exception is treated as another form of control hazard
  - An overflow exception is detected in the EX stage
  - Prevent the instruction in EX from writing its result in WB, and flush the ID and IF stages.



# RISC vs. CISC

- Instruction set design determines the way that machine language programs are constructed. It also determines structure of CPU.
- Early computer
  - small and simple instruction set  
(to minimize the hardware & lack of technology)
    - ➔ Digital hardware became cheaper (VLSI technology)
    - ➔ CISC (Complex Instruction Set Computer)
- In the early 80's, new concept has emerged.
  - : Fewer instructions without having to use memory as often  
RISC (Reduced Instruction Set Computer)

# CISC

- The essential goal of a CISC architecture is to attempt to provide a single machine instruction for each statement that is written in a high-level language.
  - to simplify the compilation & improve the overall computer performance

# CISC

- Major characteristics of CISC

1. A large number of instructions (typically 100 to 250 instructions)
2. A large variety of addressing modes (5 to 20)
  - 1 & 2 cause more complex hardware
  - cause the computation to slow down
3. Variable-length instruction formats

ex) short register addressing mode inst.

& long direct addressing mode instruction.



→ the instruction may not be aligned

→ may need special decoding circuits

4. Instruction that manipulate operands in memory



# RISC

- Reduce execution time by simplifying the instruction set
- Major characteristics of RISC
  - 1. Relatively fewer instructions
  - 2. Relatively fewer addressing mode
  - 3. Memory access limited to load and store instruction
    - Large number of registers in CPU
    - Overlapped register window to speed up procedure call & return

# RISC

## # Overlapped Register Windows

- Subroutine call is a time-consuming job due to stack operation  
(We need to save register values & return address and restore them when subroutine terminates.)

→ use overlapped register windows

ex) Suppose the computer has 74 registers

R0 ~ R9 : common to all procedures

64 registers are divided into 4 windows

# RISC



# RISC

4. Most operations are done within registers

- Faster execution time
- Almost all instructions have simple register addressing mode.

5. Fixed-length, easily decoded instruction format

- Aligned on word boundaries
- Simple control logic

6. Hardwired rather than micro-programmed control : faster

7. One instruction per clock cycle

- Due to pipelining