

# **CS305**

# **Computer Architecture**

## **Introduction to Pipelining**

Bhaskaran Raman  
Room 406, KR Building  
Department of CSE, IIT Bombay

<http://www.cse.iitb.ac.in/~br>

# Some Fun Videos to Watch

- Charlie Chaplin's critique of industrialization
  - [https://www.youtube.com/watch?v=a-FbVF1x1\\_U](https://www.youtube.com/watch?v=a-FbVF1x1_U)
- Modern Times, Assembly Line with Charlie Chaplin
  - <https://www.youtube.com/watch?v=QdwH84AT5fU>
- I Love Lucy – The Candy Wrapping Job
  - <https://www.youtube.com/watch?v=Clr6Zyjj7iI>

# Lessons Learnt

- Pipelining is natural (at least for machines)
- Overall speed only as good as the slowest unit
- Non-uniformity is bad for pipeline
- Exceptions are really bad

# Example Task: Bicycle Manufacture

Make frame → Paint frame → Fit wheels → Fit accessories



- Time to make 1 bicycle?
- Time to make 10 bicycles?
  - With pipelining? Without pipelining?
  - Pipelining:
    - “Make frame” for 2<sup>nd</sup> bicycle, when “Paint frame” for 1<sup>st</sup> bicycle
    - “Make frame” for 3<sup>rd</sup> bicycle, “Paint frame” for 2<sup>nd</sup> bicycle, when “Fit wheels” for 1<sup>st</sup> cycle
    - And so on...

# Pipeline Timing Diagram



# Pipelining in Abstract



# Other Examples of Pipelining

- Water pipeline (origin of word)
- Network packets
- Course project evaluation
- Many others; pipelining is natural!



# Pipeline Speedup

- Time taken to manufacture  $N$  bicycles
  - Without pipelining:  $5N$
  - With pipelining:  $5 + (N-1) \times 2$
  - Speedup =  $5/2$  for large  $N$
- Suppose “Fit wheels” also takes 1 hour?
  - Speedup =  $4N/(4+N-1) = 4$  for large  $N$
- Suppose “Fit wheels” takes 2 hours, but is broken into “Fit front wheel”, “Fit rear wheel” each of 1 hour?
  - Speedup =  $5N/(5+N-1) = 5$  for large  $N$

# Pipeline Speedup (continued)

- In general,  $k$  stages, each of time  $t$
- Time for  $N$  units:
  - Without pipeline:  $N \times kt$
  - With pipelining:  $kt + (N-1)t$ 
    - Pipeline setup time = time to fill pipeline = time to first output
    - Is a significant component, for small  $N$
  - Speedup =  $k$  for large  $N$
  - This is the *ideal* speedup

# Ideal Pipeline Speedup

- Ideal pipeline speedup = number of pipeline stages
- Necessary conditions:
  - All stages are of equal length in time
  - Enough hardware/hands: e.g. same spanner/person/machine cannot be used for “fit wheels” and “fit accessories”
  - Many, many more conditions (stated later)

# Latency versus Throughput

- Latency: from perspective of single unit of work
- Throughput: rate at which work completes

# Pipelining in MIPS

Stage-1:

Instruction fetch, PC=PC+4 (all instructions)

IF

Stage-2:

Reg read, branch target computation (all instructions)

ID

Stage-3:

ALU operation (reg-reg)

Memory address computation (lw, sw)

Branch condition computation (beq)

EX

Stage-4:

Memory access (lw, sw)

Reg write back (reg-reg)

MEM

Stage-5:

Reg write back (lw)

WB



# Pipelining for Sequence of lw

lw \$t0, 0(\$sp)



lw \$t1, 4(\$sp)



lw \$t2, 8(\$sp)



Such pipelined execution of MIPS instructions is *potentially* possible

# Pipelining for Sequence with Other Instructions

add \$t0, \$t1, \$t2



ori \$t1, \$s0, 15



lw \$t2, 8(\$sp)



Effect of moving WB to stage-5 of reg-reg ?  
Latency of those instructions increases  
Instruction throughput not affected!

# Summary

- Pipelining a natural idea for speed-up
- Latency versus throughput
- MIPS: uniformity of instructions allows pipelining
- Issues with pipelining: hazards
  - Charlie Chaplin wants to scratch under his arm, take a break
  - Gets slowed down by buzzing bee
  - Pipeline stops → startup delays

# **CS305**

# **Computer Architecture**

## **Structural Hazards, Pipelined Datapath**

Bhaskaran Raman  
Room 406, KR Building  
Department of CSE, IIT Bombay

<http://www.cse.iitb.ac.in/~br>

# Issues with Pipelining: Hazards

- Hazard: condition or situation which does not allow the pipeline to operate “normally”
- Three kinds:
  - Structural hazards
  - Data hazards
  - Control hazards

# Structural Hazard: Hardware Limitation

- Ideally, no hardware limitation
  - Separate spanners for “fit wheels” and “fit accessories”
  - IM and DM must be separate
  - Separate adders for PC+4, branch target computation
- If hardware limitation: structural hazard
  - E.g. same memory for IM and DM
  - Can still use pipelining, as much as possible

# Structural Hazard → Stall Pipeline

lw \$t0, 0(\$sp)



lw \$t1, 4(\$sp)



lw \$t2, 8(\$sp)



lw \$t3, 0(\$fp)



Bubble in pipeline = 1 Stall = 1 cycle wasted

# Performance Evaluation in Pipelined Implementation

- CPI in ideal pipeline = 1
  - CPI is cycles per instruction completion
- CPI in pipeline with stalls =  $1 + F_{\text{stall}}$

## Example:

Multi-cycle implementation has 9ns cycle time

Pipelined implementation has 10ns cycle time (due to extra datapath overheads)

Instruction mix:

Reg-reg 65%, beq 15%, lw 15%, sw 5%       $\frac{\text{CPI} = 4}{\text{CPI} = 1}$  (ideal)       $\frac{4 \times 9 \text{ ns}}{1 \times 10 \text{ ns}} = 3.6$

What is the ideal speed-up?

What is the speed-up with structural hazard in memory?  $\frac{4 \times 9}{1.2 \times 10} = 3$

# Handling Structural Hazard in the Register File

- Too expensive to stall
- A large fraction of instructions write registers
- Solution:
  - Write RegFile on rising edge (first half)
  - Read RegFile on falling edge (second half)
  - Why this order? Has to do with handling of data hazards

# Pipelined Datapath, Control

- Modifications from multi-cycle
  - Simple modifications to datapath
  - Control becomes much more complex
- Same pattern as: single-cycle → multi-cycle

# Pipelined Datapath



# Summary

- Pipeline hazards:
  - Structural hazard
  - Data hazard
  - Control hazard
- Dealing with hazards: pipeline stall, affects ideal CPI
- Pipelined datapath: minor modifications
- Pipelined control: much more complex!

# CS305

# Computer Architecture

## Data Hazards in the Pipeline

Bhaskaran Raman  
Room 406, KR Building  
Department of CSE, IIT Bombay

*Structural*  
*→ Data*  
*control*

<http://www.cse.iitb.ac.in/~br>

# Data Hazards

- Data dependence across instructions → data hazard
  - Register dependence
  - Memory dependence
- Example (DLX instruction set):
  - **ADD R1, R2, R3**
  - **SUB R4, R1, R5**
  - **AND R6, R1, R7**
  - **OR R8, R1, R9**
  - **XOR R10, R1, R11**
- All instructions after ADD depend on **R1**

# Pipeline With Data Hazards



Data hazard: normal pipelined execution will produce **WRONG** result!  
Solution? Can stall. Can we do better?

# Register File: Reads after Writes



# Minimizing Stalls via Data Forwarding



# Data Forwarding to MEM Stage



# Data Forwarding to MEM Stage (continued)



# Memory Data Hazards

- Not possible in MIPS
  - Memory operations always happen in order

$\rightarrow SW \quad R1, \cancel{X}$   
 $\rightarrow LW \quad R2, \cancel{X}$

# Data Hazard Classification

- **Read after Write (RAW):** use data forwarding to overcome
- **Write after Write (WAW):** arises only when writes can happen in different pipeline stages

|                        | CC1 | CC2 | CC3 | CC4  | CC5  | CC6 |
|------------------------|-----|-----|-----|------|------|-----|
| LW <u>R1</u> , 0 (R2)  | IF  | ID  | EX  | MEM1 | MEM2 | WB  |
| ADD <u>R1</u> , R2, R3 |     | IF  | ID  | EX   | WB   |     |

- Has other problems as well: structural hazards
- **Write after Read (WAR):** rare

|                        | CC1 | CC2 | CC3 | CC4  | CC5  | CC6 |
|------------------------|-----|-----|-----|------|------|-----|
| SW R1, 0 ( <u>R2</u> ) | IF  | ID  | EX  | MEM1 | MEM2 | WB  |
| ADD <u>R2</u> , R4, R3 |     | IF  | ID  | EX   | WB   |     |

# Data Dependence Requiring a Stall



All data forwarding, stalling require control logic  
(reason for complexity in control)

# Compiler's Role in Avoiding Stalls

```
a = b + c;  
d = e + f;
```

## Naïve compiler

```
LW R1, b  
LW R2, c  
ADD R4, R1, R2  
SW a, R4  
LW R10, e  
LW R11, f  
ADD R12, R10, R11  
SW d, R12
```



## Clever compiler

```
LW R1, b  
LW R2, c  
LW R10, e  
ADD R4, R1, R2  
LW R11, f  
SW a, R4  
ADD R12, R10, R11  
SW d, R12
```



CPI without clever code scheduling =  $1 + F_{\text{loads-causing-stalls}}$

# Summary

- Data hazard: data dependence
  - MIPS: only RAW dependence in registers
  - Still, can have stall
- Need control logic for:
  - Data forwarding
  - Stalling

# **CS305**

# **Computer Architecture**

## **Control Hazards in the Pipeline**

Bhaskaran Raman  
Room 406, KR Building  
Department of CSE, IIT Bombay

<http://www.cse.iitb.ac.in/~br>

# Control Hazards

- Control hazard: pipeline cannot operate normally due to ((possibility of) non-sequential) control flow



Execution of SUB and LW should depend on branch condition: known only in CC4

# Stalling “Solution” for Control Hazard



Q: Suppose 10% instructions are branches, what is the performance implication?

A: CPI will increase from 1 to 1.2 due to control stalls: **EXPENSIVE**

# Techniques to Reduce Branch Penalty

- 2-stage branch completion
- Assume branch not taken
- Branch prediction
- Delayed branches
- Many advanced techniques (not in this course):
  - Correlating predictors
  - Branch target buffer
  - Special instructions for branch delay slot

# Techniques to Reduce Branch Penalty

- 2-stage branch completion
- Assume branch not taken
- Branch prediction
- Delayed branches
- Many advanced techniques (not in this course):
  - Correlating predictors
  - Branch target buffer
  - Special instructions for branch delay slot

# Reducing Stalls: 2-Stage Branch Completion

- Extra hardware required:
  - Comparator in stage-2
  - Needs to complete in half a cycle!
- Data hazard implications:
  - Extra forwarding required: forwarding to ID stage, for branch instruction
  - **beq** itself may stall due to dependence on earlier instruction!

# Implication-1 of 2-Stage beq completion



Q: Suppose 10% instructions are branches, what is the performance implication?

A: CPI will increase from 1 to 1.1 due to control stalls: still not so good

# Implications-2,3 of 2-Stage beq completion



# Techniques to Reduce Branch Penalty

- 2-stage branch completion
- Assume branch not taken
- Branch prediction
- Delayed branches
- Many advanced techniques (not in this course):
  - Correlating predictors
  - Branch target buffer
  - Special instructions for branch delay slot

# Reducing Stalls: Assume Branch Not Taken



In CC4, cancel out the instruction in ID stage, if branch taken

Does not help much for loops: branch taken in most cases

Price paid in this approach: increased control complexity

# Techniques to Reduce Branch Penalty

- 2-stage branch completion
- Assume branch not taken
- Branch prediction
- Delayed branches
- Many advanced techniques (not in this course):
  - Correlating predictors
  - Branch target buffer
  - Special instructions for branch delay slot

# Reducing Stalls: Branch Prediction

- Idea: remember whether branch was taken last time

| Last few bits of PC | Prediction<br>(taken=1, not taken=0) |
|---------------------|--------------------------------------|
| ...                 | 1                                    |
| ...                 | 0                                    |

- Only last few bits of PC needed: need to deal with branch mis-prediction anyway
- Single-bit predictor: loops are mis-predicted twice
- 2-bit predictor: predict based on last two bits

# Techniques to Reduce Branch Penalty

- 2-stage branch completion
- Assume branch not taken
- Branch prediction
- **Delayed branches**
- Many advanced techniques (not in this course):
  - Correlating predictors
  - Branch target buffer
  - Special instructions for branch delay slot

# Reducing Branch Penalty: Delayed Branch, Branch Delay Slot

- Change semantic of branch in the ISA
  - Instruction after branch WILL BE executed, even if branch is taken
  - Such an instruction is said to be in the branch delay slot



Q: Who will fill the branch delay slot?

A: Compiler has to do it: yet another important role for the compiler

# Filling the Branch Delay Slot

```
AND R1, R2, R3  
SLL R5, R4, 2  
OR  R2, R3, R4  
BEQ R2, R10, LBL
```

```
AND R1, R2, R3  
OR  R2, R3, R4  
BEQ R2, R10, LBL
```

```
ADDI R5, R2, 4  
SUB R6, R2, R7  
...  
LBL:  
ADDI R2, R5, 4  
ADD R6, R5, R2
```

```
AND R1, R2, R3  
OR  R2, R3, R4  
BEQ R2, R10, LBL
```

```
ADDI R5, R2, 4  
SUB R6, R2, R5  
...  
LBL:  
ADDI R2, R5, 4  
ADD R6, R5, R7
```

From before branch

From branch fall through

From branch target

- Desirable to fill delay slot from before branch: why? A: always executed
- It may not always be possible to fill delay slot: fill nop

# Techniques to Reduce Branch Penalty

- 2-stage branch completion
- Assume branch not taken
- Branch prediction
- Delayed branches
- Many advanced techniques (not in this course):
  - Correlating predictors
  - Branch target buffer
  - Special instructions for branch delay slot

Q: Which of these techniques are complementary? That is, can be used together with one another?

# Summary

- Control hazard: non-sequential control flow in program causes disruption in normal pipeline
- Stalling is expensive
- A variety of solutions to reduce branch penalty
  - 2-stage branch
  - Branch prediction
  - Delay slot: filled with compiler's help
  - Many, many advanced techniques!

# **CS305**

# **Computer Architecture**

## **Pipeline Control**

Bhaskaran Raman  
Room 406, KR Building  
Department of CSE, IIT Bombay

<http://www.cse.iitb.ac.in/~br>

# Controlling the Pipeline

- Without hazards: simple ☺
- Principle: carry your work with you!

# Carry Your Work With You

**Single cycle control lines:**

ALUSrc, ALUIP4: for EX stage    branch

MemRd, MemWr: for MEM stage

RegWr, RegDst, Mem2Reg: for WB stage



# Pipeline Control to Handle Hazards

- Single-cycle control: combinational circuit
- Multi-cycle control: state machine
- Pipeline control: micro-code (Turing machine)
- Recall: micro-code is code executed by a small/simple processor within the main processor (to generate control lines for the pipelined main processor)

# Pipeline Control for Data Forwarding to EX



# Forwarding to EX: Data-path Changes



# Micro-coded Control for Forwarding to EX

Q: What control lines to generate?

A: FwdA, FwdB

Q: When should these be generated?

A: In ID stage of dependent instrn.

Q: What variables to use?

A: Latches, parts of latches

Q: What is the logic to implement?

A: Cases of no fwd, fwd from K-1, K-2

|     |                    |
|-----|--------------------|
| ADD | <u>R1</u> , R2, R3 |
| SUB | <u>R1</u> , R4, R5 |
| SLT | R6, <u>R1</u> , R0 |

```
// FwdA: micro-coded control  
// Case-1: no forwarding  
FwdA = 00;  
  
// Case-2: fwd from K-1 ←  
if( (IF/ID.Rs==ID/EX.Rd) &&  
    (ID/EX.RegWr==1) &&  
    (ID/EX.MemRd==0) )  
    FwdA = 10; //frm EX/MEM  
  
// Case-3: fwd from K-2 ←  
if( (IF/ID.Rs==EX/MEM.Rd) &&  
    (EX/MEM.RegWr==1) &&  
    (EX/MEM.MemRd==0) )  
    FwdA = 01; //frm MEM/WB
```

Swap  
order

Q: Change for FwdB?

A: Rs → Rt

# Why is Pipeline Control Logic Difficult?

- A lot of subtle possibilities → need attention to detail
- Things to keep in mind:
  - Logic runs in ID stage of dependent instruction
  - Actual forwarding happens later
  - Beware of multiple dependences
  - Beware of assumptions
- Exercise: add support for forwarding to EX from **lw**

# Forwarding to MEM: Data-path Changes

|     |                    | CC1 | CC2 | CC3 | CC4 | CC5 |
|-----|--------------------|-----|-----|-----|-----|-----|
| OR  | <u>R1</u> , R4, R5 | MEM | WB  |     |     |     |
| SLL | <u>R1</u> , R2, 3  | EX  | MEM | WB  |     |     |
| ADD | <u>R1</u> , R2, R3 | ID  | EX  | MEM | WB  |     |
| SW  | R1, 4 (R20)        | IF  | ID  | EX  | MEM | WB  |



# Forwarding to MEM: Micro-coded Control

```
// FwdMem: micro-coded control
// Case-1: no forwarding
FwdMem = 00;
// Case-2: fwd from K-2
if((IF/ID.Rt==EX/MEM.Rd) && (CtlUnit.MemWr==1)
   (EX/MEM.RegWr==1) && (IF/ID.opCode == SW)
   (EX/MEM.MemRd==0))
    FwdMem = 01; //frm Post-WB
// Case-3: fwd from K-1
if((IF/ID.Rt==ID/EX.Rd) && (CtlUnit.MemWr==1)
   (ID/EX.RegWr==1) &&
   (ID/EX.MemRd==0))
    FwdMem = 10; //frm MEM/WB
```

(IF/ID.opCode == SW)

# So Far...

- Pipeline control without hazards
- Pipeline control for data forwarding
  - Pseduo-micro-code
- Next: pipeline control for stalling

# Stalling Logic: Dependent reg-reg After lw

```
// Stalling: micro-coded control
// Case-1: dependent Rs
if((IF/ID.Rs==ID/EX.Rt) &&
(ID/EX.MemRd==1))
    STALL // What does this mean?
// Case-2: dependent Rt
if((IF/ID.Rt==ID/EX.Rt) &&
(ID/EX.MemRd==1))
    STALL // What does this mean?
```

# What does STALL mean?

- Do nothing → write nothing
  - **nop** proceeds in the pipeline
  - Zero out control signals, specifically RegWr, MemWr, MemRd
    - . No change to machine state
- IF and ID stages must repeat
  - Disable PCWr
  - Disable writing of IF/ID latch

```
// STALL pseudo-micro-code
PCWr = 0;
IF/ID.Wr = 0;
ID/EX latch = 0; // nop (bubble) in pipeline
```

# Stalling Logic for Control Hazard

```
if (CtlUnit.branch == 1)
    IF/ID latch = 0; // 1st nop follows branch
if (ID/EX.branch == 1)
    IF/ID latch = 0; // 2nd nop follows branch
```

**Q:** Changes for 2-stage branch?

**A:** Second if condition unnecessary

**Q:** Diff from data hazard stall?

**A:** nop **FOLLOWS** branch

# Summary

- Pipeline control: subtle logic, involving many details
  - Data forwarding logic
    - Control lines generated in ID stage, actual forwarding may happen later
  - Stalling logic: **nop** before stalling instruction, **nop** after branch
- We've seen only bits and pieces
- Very difficult to write control logic without micro-code
- Next: exceptions, the bane of all pipelines

# CS305

# Computer Architecture

## Exceptions in the Pipeline

Bhaskaran Raman  
Room 406, KR Building  
Department of CSE, IIT Bombay

<http://www.cse.iitb.ac.in/~br>

# How to Handle Exceptions in the Pipeline?

|                | CC1 | CC2 | CC3 | CC4 | CC5 | CC6 | CC7 | CC8 |
|----------------|-----|-----|-----|-----|-----|-----|-----|-----|
| SLL R2, R2, 3  | IF  | ID  | EX  | MEM | WB  |     |     |     |
| LW R4, 4 (R5)  |     | IF  | ID  | EX  | MEM | WB  |     |     |
| ADD R1, R2, R3 |     |     | IF  | ID  | EX  | MEM | WB  |     |
| SW R4, 4 (R20) |     |     |     | IF  | ID  | EX  | MEM | WB  |

Suppose LW has a misaligned memory address exception

Q: When is the exception detected?

A: End of CC4

Q: What should happen?

A: Flush pipeline, but SLL must be allowed to complete!

# Steps in Exception Handling

|                | CC1 | CC2 | CC3 | CC4 | CC5 | CC6 | CC7 | CC8 |
|----------------|-----|-----|-----|-----|-----|-----|-----|-----|
| SLL R2, R2, 3  | IF  | ID  | EX  | MEM | WB  |     |     |     |
| LW R4, 4(R5)   |     | IF  | ID  | EX  | MEM | WB  |     |     |
| ADD R1, R2, R3 |     |     | IF  | ID  | EX  | MEM | WB  |     |
| SW R4, 4(R20)  |     |     |     | IF  | ID  | EX  | MEM | WB  |

- Flush → zero out latches
  - IF/ID, ID/EX, EX/MEM; **do not** flush MEM/WB (SLL)
- Load “appropriate” PC value onto EPC
  - PC-8 in this case
  - What if BEQ instead of ADD ?
- Load appropriate value onto Cause register
- PC = 0x8000 0180

# Complexity-1: Multiple Exceptions

|                | CC1 | CC2 | CC3 | CC4 | CC5 | CC6 | CC7 | CC8 |
|----------------|-----|-----|-----|-----|-----|-----|-----|-----|
| SLL R2, R2, 3  | IF  | ID  | EX  | MEM | WB  |     |     |     |
| LW R4, 4(R5)   |     | IF  | ID  | EX  | MEM | WB  |     |     |
| Invalid opcode |     |     | IF  | ID  | EX  | MEM | WB  |     |
|                |     |     |     | IF  | ID  | EX  | MEM | WB  |

Q: Example of multiple exceptions?

A: Invalid opcode following LW

Q: What should happen?

A: Instruction causing earlier exception takes precedence

Complexity arises due to sheer number of such possibilities

# Complexity-2: Out-of-Order Completion

|                | CC1 | CC2 | CC3  | CC4  | CC5  | CC6  | CC7  | CC8 |
|----------------|-----|-----|------|------|------|------|------|-----|
| MUL R1, R2, R4 | IF  | ID  | MUL1 | MUL2 | MUL3 | MUL4 | MUL5 | MEM |
| ADD R4, R5, R6 |     | IF  | ID   | EX   | MEM  | WB   |      |     |
|                |     |     | IF   | ID   | EX   | MEM  | WB   |     |
|                |     |     |      | IF   | ID   | EX   | MEM  | WB  |

- MUL exception in CC7
- ADD is done by then!
- Need to “rollback” (Charlie Chaplin in pipeline)
- Solution is well beyond scope of this course

# Complexity-3: Partially Changed Machine State

- Classic example: string copy instruction in Intel
- Say, half way through string copy, some other instruction causes exception
- What a mess!

# Conclusion

- Exceptions and non-uniformity: enemies of pipelines
- Advanced topics:
  - Out-of-order completion
  - Super-scalar processors: multiple instruction issue per cycle
  - Data driven architectures
- Next: memory systems

# **CS305**

# **Computer Architecture**

**One System to Know Them All**

Bhaskaran Raman  
Room 406, KR Building  
Department of CSE, IIT Bombay

<http://www.cse.iitb.ac.in/~br>

# The World is Changed



# Much That Once Was...



# The Memory System



# One System to Know Them All

The world is changed. I feel it in the phones. I feel it in the phablets. I sense it in my apps.

Much that once was is now too slow, and none now live who have use for it.

It all began with the design of the memory system.

The first idea is that of **caching**, for the illusion of memory very vast, yet very fast.

The second is that of **hierarchy**, a great technique to magnify the benefits manifold.

The third is of adding a **level of indirection**, which above all else, provides flexibility.

The three cornerstone ideas of all of Computer Science Systems, they were all of them used, in the design of the memory system. Deep inside every computer, the memory system rules, and dictates the performance of every app you run, and every snap you take. And in this system is steeped, all the three great ideas of Computer Science Systems, their power, their prowess, and their versatility.

**One system to know them all.**

-- Bhaskaran Raman, 02 Oct 2014

Based on: Galadriel's words in the introduction to the movie "Lord of the Rings: Fellowship of the Ring"

# **CS305**

# **Computer Architecture**

## **The Memory System: A Hierarchy of Caches**

Bhaskaran Raman  
Room 406, KR Building  
Department of CSE, IIT Bombay

<http://www.cse.iitb.ac.in/~br>

# Memory Systems: Why Important?

- Memory: the second crucial part of a computer
- Today: memory systems dictate performance
  - Processor performance well above memory performance
  - Cannot throw more gates to get faster memory
- Some numbers:
  - Memory access latency: 20+ ns
  - Compare: processor cycle < 1 ns

# What Programmer Wants vs Reality

- What programmer wants: large memory, fast, cheap
- Reality: large  $\times$  fast
  - Large memory  $\rightarrow$  slow
  - Large memory  $\rightarrow$  cost per byte is smaller
- Memory system: create **illusion** of large & fast memory
  - Cache memory, main memory, virtual memory, secondary memory (I/O)

# What is a Cache?

- **Cache (English):** a safe place to store something
- **Cache (CS):** a temporary place for a copy (usually) of something, for fast, easy, efficient access
- Examples of caching you are aware of ?

# Cache in a Computer System



# DRAM versus SRAM

## Dynamic RAM (DRAM)

- Uses less transistors 1
- Needs to refresh periodically
- More power consumption
- Slower: access latency 20+ ns
- Cycle time can be > access time
  - DRAM needs time to refresh
- Cheaper
- Used for main memory

## Static RAM (SRAM)

- Uses more transistors 6
- No need to refresh
- Less power consumption ↙
- Faster: access latency ~2 ns
- Cycle time = access time
  - Access one locn. after another
- More expensive
- Used for cache memory

Three reasons why cache is faster: SRAM, smaller, closer to CPU

# Why Caches Work: The Principle of Locality

- **Temporal locality:** if X is accessed now, it will likely be accessed again in the near future
- **Spatial locality:** if X is accessed now, locations  $X \pm \delta$  will likely be accessed in the near future
- For instructions: sequential execution, loops
- For data: arrays, structures, variables in a function

# The Memory Hierarchy



# Summary

- Memory system: a hierarchy of caches
- Caching: principle of locality
- Next: cache design

# **CS305**

# **Computer Architecture**

## **Cache Design: A Beginning**

Bhaskaran Raman  
Room 406, KR Building  
Department of CSE, IIT Bombay

<http://www.cse.iitb.ac.in/~br>

# Terminology

**Hit-ratio/hit-rate:** #hits/#accesses

**Miss-ratio/miss-rate:** #misses/#accesses   Cache



**Hit-time:** time to access memory during hit

**Miss-penalty:** time spent on a miss (to fetch from main memory)

Main memory



# Simplest Possible Cache Design: Single Word per Block, Direct-Mapped



Single word per block; block size = 1 word

Given memory word,  
block # in cache =  
(memory block #) MOD (# blocks in cache)

Only one cache location for a given  
memory word → **Direct Mapped**

If # blocks in cache is power of 2,  
MOD == last few bits of memory word addr

# Direct Mapped Cache: A Numeric Example

Memory =  $2^6$  words, Cache =  $2^3$  words

Number of bits for memory address = 8



Number of possible states of a cache block =  $2^3 + 1$  (can be invalid)

Each cache block, in addition to data (or instruction) has:

- 1 valid bit
- 3 bits of tag (which memory block is in cache currently)

# What Happens on a Memory Read? (by Processor)

Given memory address, find cache block:

$$\text{Cache block \#} = (\text{Memory block \#}) \text{ MOD } (\#\text{ cache blocks})$$

Implemented by cache controller

For that cache block:

(a) If (valid and cache tag matches tag from memory address)

Cache hit → read from cache

(b) If (invalid or cache tag mismatch)

Cache miss → load appropriate block from memory onto cache, then case (a)

Processor stalls in the meantime!

What happens on memory write? Discussed later...

# Direct Mapped Cache with Multiple Words Per Block

Cache



To take advantage of spatial locality: block size > 1, say block size = 2 words

Recall: block is unit of transfer between cache and memory → on read of any word in a block, entire block is loaded onto cache



# Multiple Words/Block: A Numeric Example

Main memory = 32 MB, Cache = 512 KB,

Block size = 16 words

Draw fields of memory address



$$\frac{512 \text{ KB}}{16 \text{ words}} = \frac{2^{19} \text{ bytes}}{2^6 \text{ bytes}} = 2^{13}$$

$$\# \text{ bits in cache} = 512 \text{ KB} + (2^{13} \times (6+1))$$

# Cache in Action: An Example

Cache

|   |
|---|
| 7 |
| 6 |
| 5 |
| 4 |
| 3 |
| 2 |
| 1 |
| 0 |

Block size = 2 words

Cache size = 16 words

Main memory word access sequence:

0, 1, 20, 19, 17, 0

Walk through cache state...

20, 21 in cache                    0 : Miss  
19 in cache                        1 : Hit  
0, 1 in cache X                    17, 16 in cache X  
17, 16 in cache X                (1000) 20 : Miss  
0, 1 in cache                      (1001) 19 : Miss  
                                      (10001) 17 : Miss  
                                      0 : Miss

# Handling Writes: What Happens on Write Hit?

**Write-back:** write (block) to memory “later”

Need “dirty” bit to know if block needs to be written back, at the time of replacement

**Write-through:** write (word) to memory now

Can be slow!

Can have a write-buffer to mask delay

Sequence of writes close to one another →

delay will surface

# Write Hit: Some Remarks on Performance

In write-back, even write hit can take two cycles!  
One to read the tag, one to do the actual write  
Can use [store-buffer](#) to mask latency

**Note:** can have write-buffer for “write-back” too, to reduce miss penalty  
**Beware:** on read-miss, be sure to check store-buffer and write-buffer as well!

# Handling Writes: What Happens on Write Miss?

**Write-allocate:** fetch block from memory

**Write-no-allocate:** do not fetch block from memory

# Handling Writes: The Four Design Choices

|             | Write-back                                | Write-through                             |
|-------------|-------------------------------------------|-------------------------------------------|
| Allocate    |                                           | <i>Possible,<br/>makes less<br/>sense</i> |
| No-Allocate | <i>Possible,<br/>makes less<br/>sense</i> | Write-<br>around                          |

# Summary

- Direct mapped: many-to-one mapping
- Memory address =  
**tag::block#incache::word#inblock::00**
- Cache has: block (data), tag, valid bit
- Write policy: four design possibilities

# **CS305**

# **Computer Architecture**

## **Associative Caches**

Bhaskaran Raman  
Room 406, KR Building  
Department of CSE, IIT Bombay

<http://www.cse.iitb.ac.in/~br>

# Recall Direct Mapped Cache Example

Cache



Block size = 2 words

Cache size = 16 words

Main memory word access sequence:

0, 1, 20, 19, 17, 0

17 evicts 0&1 even though there is plenty of empty space in the cache! **CONFLICT** miss  
This causes 0 to miss next. 

Suppose 17 & 0 accessed in a loop: hit-ratio=0 !

**Main culprit:**

Many-to-one mapping of Direct Mapped cache

# Associative Cache: The Idea



A block of main memory can be in ANY cache block: **fully associative** cache

Main memory block # → Cache block #  
Now a many-to-many mapping!

Memory address



tag

Block# in cache

Price to pay: search for block in cache, given memory address → parallel comparator

Memory address



tag

Word# in block

- (+) Reduced conflict misses
- (-) Comparator cost
- (-) Increased hit time!

# Set Associative Cache: The Idea

**Direct Mapped**

A block of main memory can be in exactly  
**ONE** cache block

**Fully Associative**

A block of main memory can be in **ANY** cache block

**Set Associative**

A block of main memory can be in **ANY of a SET** of cache blocks

# How a Set Associative Cache Works



**Direct Mapped**

**Fully Associative**

**Set Associative**

Given memory address, find set#

#parallel comparators = #blocks in set



# The Design Space Continuum

## Direct Mapped

Set associative cache with set size = 1 block  
Called **1-way set associative**

## Fully Associative

Set associative cache with set size = K  
 $K = (\# \text{ blocks in cache})$ , **K-way set associative**

## Set Associative

Set associative cache with set size S:  $1 < S < K$   
**S-way set associative**

# Set Associative Cache: A Numeric Example

Main memory = 32MB, Cache size = 512 KB, Block size = 16 words

Show main memory address fields for 8-way set associative cache



Number of comparators required = 8

Number of bits to be compared in each comparator = tag size = 9 bits

# Replacement Policy

Tag

Cache



What to do when set is full? Which cache block to replace?

Set 11

Not relevant for direct mapped cache: no choice

Set 10

Some possibilities:

- a) Random
- b) First-In-First-Out (FIFO)
- c) Least frequently used (LFU)
- d) Least recently used (LRU)

**LRU** found most effective in practice: difficult to implement in hardware

What is implemented: **clock algorithm**, an approximation of LRU, works well

More comprehensive treatment of replacement algorithms: in OS course

# Summary

- Associativity: helps reduce cache conflict misses
- Fully associative: too expensive (cost, hit time)
- Set associative: get benefits of fully associative and direct mapped
- Block replacement policy: LRU
- Next: facets of cache performance

# **CS305**

# **Computer Architecture**

## **Cache Performance Analysis**

Bhaskaran Raman  
Room 406, KR Building  
Department of CSE, IIT Bombay

<http://www.cse.iitb.ac.in/~br>

# Outline

- The three kinds of cache misses
- Performance implications of block size
- Joint I+D cache vs separate I, D caches
- Performance implications of associativity
- Design of cache  $\leftrightarrow$  main memory interface
- Multi-level caches

# The Three Kinds of Cache Misses

**Compulsory**

The miss caused the first time a block is accessed  
Since the cache starts “cold”, “empty”...

**Conflict**

A miss caused due to insufficient set size  
Cannot happen in a fully associative cache (by defn.)

**Capacity**

A miss caused due to insufficient cache size  
All misses after compulsory misses, in fully assoc. caches

# Effect of Increase in Block Size

- (+) Increased spatial locality
  - ➔ Lesser compulsory misses
- (-) More conflict misses: for same cache size
- (-) Higher miss penalty

# Miss Rate vs Block Size



# Increased Block Size: Techniques to Reduce Miss Penalty

- Early restart
  - Processor can proceed as soon as required word is in cache
- Critical word first
  - Get the word required by the processor first, then the rest of the block
- Implications: increased complexity in cache controller and/or memory system

# **Joint I+D Cache or Separate I, D Caches**

## **Joint I+D Cache**

- (+) Better hit rate
- (-) Lower instruction throughput (pipeline stalls)

## **Separate I, D Caches**

- (-) Slightly lower hit rate
- (+) Better instruction throughput

# Performance Implications of Associativity

- (+) Reduced conflict misses
- (-) Increased hit time!

miss  
rate

associativity

Note: decreasing benefits of higher associativity.

Reason: only a certain number of conflict misses to remove

# **Design of Cache-to-Main-Memory Interface**

# What Happens on a Cache Miss?

Miss penalty: time to load a block from main memory to cache

- a) Send address to memory → Say 1 cycle
- b) DRAM access initiation latency → Say 15 cycles
- c) Read 1 word of data from memory to cache → Say 1 cycle



Miss penalty for 4-word block =  $4 \times (1+15+1) = 68$  cycles

# Reducing Miss Penalty: Wide Memory (Naïve Solution)



- (+) Reduced miss penalty:  
 $1+15+1=17$  cycles
- (-) Wider memory bus: expensive
- (-) Increased hit time!

# Reducing Miss Penalty: Interleaved Memory (Smart Solution)



- a) Send address to memory
- b) Each DRAM reads independently
- c) Read data from memory to cache 1 word after another

Miss penalty for 4-word block =  $1 + 15 + 4 \times 1 = 20$  cycles

# Multi-Level Caches

# Reducing Miss Penalty: Multi-Level Caches



L1 thinks L2 to be main memory, L2 thinks L1 is processor  
Miss in L1 → see in L2, Miss in L2 → see in main memory

# Reducing Miss Penalty: Multi-Level Caches



Optimize for hit-time:  
(miss penalty low anyway)  
Smaller size, smaller blocks  
Direct mapped or low associativity

Optimize for miss-rate:  
(hit time does not matter anyway)  
Larger size, larger blocks  
2, 4, or 8-way associative

# L2 Miss Rate: Local vs Global

L2 **local** miss rate: with respect to L2 accesses

L2 **global** miss rate: with respect to memory references by processor

# Summary

- The three C's: compulsory, conflict, capacity
- Performance implications of design options: block size, separate vs unified, associativity
- Interleaved memory
- Multi-level caches
- Next: program performance in presence of caches

# **CS305**

# **Computer Architecture**

## **Program Performance Analysis in Presence of Cache**

Bhaskaran Raman  
Room 406, KR Building  
Department of CSE, IIT Bombay

<http://www.cse.iitb.ac.in/~br>

# Reworking the Performance Equation in Presence of Cache

Cache hit → normal operation

CPU time = CPU time without misses + stalls due to misses

Read stalls: # read misses x read miss penalty

Write stalls: depends on write access scheme

For write-back: read misses will potentially have to write dirty blocks

For write-through: write stalls = # write-buffer stalls

For write-through + write-allocate:

write miss penalty = read miss penalty + % write-buffer stalls

Note: miss rate is given in terms of # memory accesses

# Program Performance Analysis in Presence of Cache: An Example

I-cache miss rate: 1%, D-cache miss rate: 5%, % memory instructions = 40%

Miss penalty = 100 cycles

CPI without memory stalls = 3, CPI with memory stalls = ?

CPI with memory stalls =  $3 + 1\% \times 100 \text{ cycles} + 40\% \times 5\% \times 100 \text{ cycles} = 6$

Suppose original CPI halved due to better pipelining and data forwarding

New CPI =  $1.5 + 3 = 4.5$ , effective speedup =  $6/4.5 = 4/3$  only, not 2

Suppose CPI further halved (e.g. superscalar architecture), new speedup?

New CPI =  $0.75 + 3 = 3.75$ , effective speedup =  $4.5/3.75 = 6/5$  only, not 2

Amdahl's Law

# Program Performance with Multi-Level Caches: An Example

L1 miss rate = 2%, L2 miss rate (global) = 0.5% (both w.r.t. executed instructions)

Miss penalty into L2 = 25 cycles, miss penalty into main memory = 500 cycles

CPI without misses = 2.5, what is the performance improvement due to L2 ?

$$\text{CPI without L2} = 2.5 + 2\% \times 500 = 12.5$$

$$\text{CPI with L2} = 2.5 + 2\% \times 25 + 0.5\% \times 500 = 5.5$$

$$\text{Performance improvement due to L2} = 12.5/5.5$$

# Summary

- Program performance with cache: **add to CPI** corresponding to miss rate and miss penalty
- Instance of **Amdahl's law**: improvements to processor almost useless beyond a point
- Octa-core processor versus quad-core processor: what do you think is the performance difference?
- Questions most relevant to program performance: **cache configuration** (cache size, number of levels, on-chip versus off-chip)

# **CS305**

# **Computer Architecture**

## **Virtual Memory**

Bhaskaran Raman  
Room 406, KR Building  
Department of CSE, IIT Bombay

<http://www.cse.iitb.ac.in/~br>

# Programmer's Burden: Program Size > Main Memory Size



Initially Part1, Part2



Then Part1, Part3



Then Part1, Part4



Program had to deal with bringing in relevant parts of the program, into main memory!

Not a problem today: most programs < memory size

Illusion required: memory larger than main memory

# Loader's Burden: Program Relocation



Program, or parts of program need to be constantly rewritten, if loaded to different parts of memory

Illusion required: program at same memory location

# OS's Burden: Program Safety



Different user's programs must be protected from one another: malicious or buggy behaviour

Illusion required: each program thinks it has the entire memory to itself, it is the only program

# Virtual Memory: Illusions Galore

Illusion-1: Larger memory than main memory

Illusion-2: Program at same memory location

Illusion-3: The current program is the only program, with entire memory to itself

Illusion-N... E.g. with shared dynamic libraries, shared memory, etc.

# Virtual Memory: The Cache Perspective



Main memory is a cache for virtual memory

Removes programmer's burden

Incomplete perspective:  
e.g. main memory can be  
8GB, virtual memory 4GB

# Virtual Memory: The Level-of-Indirection Perspective



OS, with hardware support, maintains VA → PA mapping

# Level-of-Indirection: The Basis for All Illusions

Virtual address (VA) → physical address (PA) mapping:  
Like main memory address → cache location mapping  
Can support much larger virtual address space than physical (main) memory

Program relocation: change VA → PA mapping!  
Program (in terms of VA) remains unchanged

Program safety: separate VA space for each program  
No chance of inadvertent access by another program



# The Cache Perspective

## L1 as Cache for MM

Unit of transfer = block

Not in cache → miss

Mapping: using MM addr, tags

Miss penalty: a few 100 cycles

## MM as Cache for VM

Unit of transfer = page

Not in cache → page fault

Mapping: using page table, other schemes

Miss penalty: to disk → a few ms!  
Millions or billions of cycles!

# VM Design Decisions in the Cache Perspective

## Strategy-1: Reduce miss rate

Page size: large (e.g. 32KB)

Associativity: full

Write-back scheme

Write-allocate scheme

Page fault: handled in software (OS), sophisticated replacement policy

## Strategy-2: Multi-tasking

When a program has a miss (page fault), run some other program!

# The Swap Space



# The VA-to-PA Mapping: Page Table



# Page Table: A Numeric Example

32-bit VA space, 8KB pages

Say, 4 bytes per page table entry

Page table size = ?

# pages in VM =  $2^{32}/2^{13} = 2^{19}$

Page table size =  $2^{21}$  bytes = 2MB !

This is per process !

64-bit VA space → page table much larger!

What can we do?

# Dealing with Large Mapping Size: Solution-1: Page Table Limits



Maintain mapping only for usable  
parts of virtual memory

# Dealing with Large Mapping Size: Solution-2: Inverted Page Table

| Virtual page number | Physical page number |
|---------------------|----------------------|
|                     |                      |
|                     |                      |
|                     |                      |
|                     |                      |
|                     |                      |
|                     |                      |
|                     |                      |

- One entry per physical page
- Entries hashed on virtual page number

# Dealing with Large Mapping Size: Solution-3: Multi-level Page Table



- Main idea: hierarchy
- Idea works since most entries in first level page table will be invalid

Page tables

| Page table entry number | Physical page number |
|-------------------------|----------------------|
| Physical address 1      | Physical address 1   |
| Physical address 2      | Physical address 2   |
| Physical address 3      | Physical address 3   |
| Physical address 4      | Physical address 4   |

# Multi-Level Page Table: An Example



# Dealing with Large Mapping Size: Solution-4: Paged Page Tables

| Virtual page number | Physical page number |
|---------------------|----------------------|
|                     |                      |
|                     |                      |
|                     |                      |
|                     |                      |
|                     |                      |
|                     |                      |
|                     |                      |
|                     |                      |
|                     |                      |
|                     |                      |
|                     |                      |
|                     |                      |

- Main ideas: level of indirection, caching
  - Earlier: page table was in physical memory
  - Now: page table in virtual memory (of a special OS process)
  - Can manage even if page table size is large
  - Page table of OS process pinned to physical memory

# Paged Page Tables: An Example



# Dealing with Large Mapping Size: Idea Behind the Solutions

- Page table limits: works since most of the page table is empty
- Inverted page table: works since most of the page table is empty
- Multi-level page table: works since most of the page table is empty
- Paged page table: works since there is locality of reference in virtual page references

# Summary

- Virtual memory: VA-to-PA mapping is a level of indirection
  - Provides immense flexibility
- Segments: unequal sized blocks
- Managing the mapping: page table
- Ideas to deal with mapping size: page table limits, inverted page table, multi-level page table, paged page table
- Next: putting together VM, caches

# **CS305**

# **Computer Architecture**

## **Virtual Memory and Caches**

Bhaskaran Raman  
Room 406, KR Building  
Department of CSE, IIT Bombay

<http://www.cse.iitb.ac.in/~br>

# What Happens on a Memory Reference?



# Solution: Caching!



# Translation Look-aside Buffer (TLB)



- TLB is on-chip
- Unified or separate
- Fully associative and small, or larger with lower associativity
- Cache replacement policy: can't be as sophisticated as for pages
- Some typical parameters:
  - TLB size: 16-512 entries
  - Block size: 1-2 entries (4-8 bytes each)
  - Hit time  $\leq$  1 cycle, TLB access = 1 pipeline stage, if hit time = 1 cycle
  - Miss penalty = 10-100 cycles, miss rate = 0.01-1%
- TLB miss may be handled in hardware or software

# Possible Sequence of Events



# Table of Possibilities

| TLB  | PT      | L1   | Remarks                   |
|------|---------|------|---------------------------|
| Hit  | Valid   | Hit  | Best case                 |
| Hit  | Valid   | Miss | Only cache miss           |
| Miss | Valid   | Hit  | Only TLB miss             |
| Miss | Valid   | Miss | TLB miss, then cache miss |
| Miss | Invalid | Miss | Worst case                |
| x    | Invalid | Hit  | Not possible              |
| Hit  | Invalid | x    | Not possible              |

# Putting it all Together: A Numeric Example

64-bit VA space, 16GB main memory, 32KB page size

Page table entry: mapping, 16-bit disk addr, bits: valid, used, dirty, write protection

TLB has 16 entries

Q: Compute page table size, indicate fields of VA, PA

A: # virtual pages =  $2^{64}/2^{15} = 2^{49}$ , # physical pages =  $2^{34}/2^{15} = 2^{19}$

Page table entry size = 19 bits for physical page # + 16 + 4 = 39

Page table size =  $2^{49} \times 39$  bits

Q: Compute TLB size

A: TLB has 49 bit virt page #, 19 bit phys page #, use bit (for page), use bit (for TLB), valid bit (for TLB), write protect bit (for page), dirty bit (for page), dirty bit (for TLB)

Total =  $(49+19+6) \times 16 = 74 \times 16 = 1184$  bits

# Numeric Example (Continued)

TLB is 2-way set associative

L1 cache: 2MB, 16-word block, 4-way set associative

Show bit manipulations during a particular memory access



# Shared Memory Using Virtual Memory



# Copy-on-Write Using Virtual Memory



# Virtually Addressed Caches

(+) cache hit  $\Rightarrow$  VA  $\rightarrow$  PA translation  
unnecessary

(-) shared mem  
has problems



Anti-aliasing



# Summary

- TLB: cache for PT entries
- Caches in the memory system: TLB, L1, L2, L3, MM
  - Others too: disk is cache for network access, web proxy is cache for server content
- Features using VM: shared memory, copy-on-write
- Next: hardware and OS interaction for VM

# **CS305**

# **Computer Architecture**

## **Hardware and OS Interaction for Virtual Memory**

Bhaskaran Raman  
Room 406, KR Building  
Department of CSE, IIT Bombay

<http://www.cse.iitb.ac.in/~br>

# Necessity of Hardware-OS Interaction



# TLB, PT Management

- PT miss (page fault) handled in software: exception handler
- TLB miss can be handled in hardware or software
  - MIPS handles TLB miss in software: exception handler
  - Need special instructions for TLB access
- What if regular programs write TLB or PT?
  - Need (at least) two processor modes: kernel or supervisor mode, regular user mode
  - TLB, PT writing allowed only in kernel mode

# Switching Processor Modes

- Switching modes needs to be controlled
- User-to-kernel:
  - On exception, enter kernel mode automatically
  - `syscall` or `trap` instructions: also called software exceptions
- Kernel-to-user:
  - `eret` (exception return)
- While triggering exception, the hardware:
  - Switches to kernel mode
  - Disables further exceptions (will be enabled at a safe stage)

# The MIPS TLB Miss Handler

```
TLB miss exception handler at 0x80000000
mfc0 $k1, Context # spl reg with addr of relevant PT entry
lw   $k1, 0($k1)  # load PT entry (1 word) into reg
mtc0 $k1, EntryLo # prepare to load TLB
tlbwr                # EntryLo --> random locn in TLB
eret                 # done handling TLB miss
```

- Invalid PT entries may be loaded onto TLB too!
- TLB miss considered more common than page fault
- Common case is made fast (~ a dozen cycles)

# TLB and Multiple Processes

- VA space is per process →
- VA-to-PA mapping is per process →
- TLB entries (cache of this mapping) is per process
- What to do on context switch?
  - Option-1: flush TLB
  - Option-2: have a PID (process ID) tag field in TLB, and process has a PID register (filled by OS)

# Page Fault Handler

- At 0x8000 0180, separate from TLB miss handler
  - Optimize common case of TLB miss
- Page fault handler has to:
  - Save process state: all GPRs, Hi, Lo onto exception stack
  - Re-enable exceptions
  - Read PT from HD (may need to write dirty page first)
  - Typically, switch to another process which is ready to run

# Some Remarks

- Restarting exceptions is NOT easy (e.g. string copy instruction)
- Exception handling itself is not virtual
  - Unmapped memory
  - In MIPS: 0x8000 0000 to 0x8000 FFFF mapped statically to lower portion of physical memory

# Thrashing, Working Set



- **Working set:** the set of pages a program or set of “active” programs need in main memory
- **Thrashing:** when working set size exceeds the main memory size
- High page fault rate
- Processor stalls, waiting for disk: terrible performance

# Summary

- VM: hardware and OS need to work together
  - Special instructions for TLB access
  - Processor modes
  - Special instructions for switching modes
- Next: Input/Output systems

# **CS305**

# **Computer Architecture**

## **Input/Output Systems: An Introduction**

Bhaskaran Raman  
Room 406, KR Building  
Department of CSE, IIT Bombay

<http://www.cse.iitb.ac.in/~br>

# Recall: Von Neumann Architecture



# Connecting I/O Devices



# IO Performance Metrics

- Latency sensitive:
  - Examples: keyboard, mouse
  - Any device with which user interacts directly
- Throughput sensitive:
  - Hard disk, network, graphics
  - Large number of small transactions (e.g. web proxy)
  - Large transactions (e.g. graphics)

# Topics in IO Systems

- Magnetic hard disk technology
- RAID: Redundant Array of Inexpensive Disks
- Hamming Codes
- Buses
  - Bus interfacing
  - Bus protocols
  - Bus arbitration

# **CS305**

# **Computer Architecture**

## **Introduction to Buses**

Bhaskaran Raman  
Room 406, KR Building  
Department of CSE, IIT Bombay

<http://www.cse.iitb.ac.in/~br>

# Reference

- “Computer Architecture and Organization”, John P Hayes, McGraw Hill, 3<sup>rd</sup> Edition

# What is a Bus?



# Buses: Merits and Demerits

- Advantages:
  - Versatility or flexibility: Same IO device can be used in different computers
  - Low cost (compared to individual lines betn. proc. & device)
- Disadvantage:
  - Communication bottleneck
  - Max bus speed: limited by length, #devices

# System Bus vs IO Bus



# Bus Lines



- Control lines: for bus control
- Address and data lines may be shared in some buses
- Data: up to 4 words in width

# Bus Lines in the Power-PC Processor



# Bus Standards

- Standards: interoperability
- SCSI: Small Computer System Interconnect
- PCI: Peripheral Component Interconnect
- USB: Universal Serial Bus (hot pluggable)
- IEEE 1394: Firewire (hot pluggable)
- Standards specify: length, frequency, protocol, number of devices, etc
- Our focus: concepts, not standards

# Going Forward

- Three bus related concepts:
  - Bus interfacing: how do devices connect to the bus?
  - Bus protocols: how do devices exchange data on the bus?
  - Bus arbitration: who gets access to the bus?