

# 6.823 Computer System Architecture

Instructor: *Daniel Sanchez*  
TA: *Hyun Ryong (Ryan) Lee*



The processor you  
built in 6.004\*

What you'll  
understand after  
taking 6.823

February 16, 2021

MIT 6.823 Spring 2021

L01-1 February 16, 2021

MIT 6.823 Spring 2021

L01-2

## Computing devices then...



## Computing devices now



February 16, 2021

MIT 6.823 Spring 2021

L01-3 February 16, 2021

MIT 6.823 Spring 2021

L01-4

## A journey through this space

- What do computer architects actually do?
- Illustrate via historical examples
  - Early days: ENIAC, EDVAC, and EDSAC
  - Arrival of IBM 650 and then IBM 360
  - Seymour Cray – CDC 6600, Cray 1
  - Microprocessors and PCs
  - Multicores
  - Cell phones
- Focus on ideas, mechanisms, and principles, especially those that have withstood the test of time

## Abstraction layers



February 16, 2021

MIT 6.823 Spring 2021

L01-5 February 16, 2021

MIT 6.823 Spring 2021

L01-6

## Computer Architecture is the design of abstraction layers

- What do abstraction layers provide?
  - Environmental stability within generation
  - Environmental stability across generations
  - Consistency across a large number of units
- What are the consequences?
  - *Encouragement to create reusable foundations:*
    - Toolchains, operating systems, libraries
  - Enticement for application innovation

# Technology is the dominant factor in computer design

Technology  
Transistors  
Integrated circuits  
VLSI (initially)  
Flash memories, ...



Technology  
Core memories  
Magnetic tapes  
Disks



Technology  
ROMs, RAMs  
VLSI  
Packaging  
Low Power



February 16, 2021

MIT 6.823 Spring 2021

L01-7 February 16, 2021

MIT 6.823 Spring 2021

L01-8

# But Software...

As people write programs and use computers, our understanding of *programming* and *program behavior* improves.

*This has profound though slower impact on computer architecture*

Modern architects must pay attention to software and compilation issues.



# Architecture is engineering design under constraints

Factors to consider:

- Performance of whole system on target applications
  - Average case & worst case
- Cost of manufacturing chips and supporting system
- Power to run system
  - Peak power & energy per operation
- Reliability of system
  - Soft errors & hard errors
- Cost to design chips (engineers, computers, CAD tools)
  - Becoming a limiting factor in many situations, fewer unique chips can be justified
- Cost to develop applications and system software
  - Often the dominant constraint for any programmable device

*At different times, and for different applications at the same point in time, the relative balance of these factors can result in widely varying architectural choices*

February 16, 2021

MIT 6.823 Spring 2021

L01-9 February 16, 2021

MIT 6.823 Spring 2021

L01-10

# Course Information

All info kept up to date on the website:

<http://www.csg.csail.mit.edu/6.823>

# Contact times

- Lectures Tuesdays and Thursdays
  - 1:00pm to 2:30pm
- Tutorial on Fridays
  - 1:00pm to 2:00pm
  - Attendance is optional
  - Additional tutorials will be held in evenings before quizzes
- Quizzes on Friday (except last quiz)
  - 1:00pm to 2:30pm
  - Attendance is NOT optional
- Instructor office hours
  - After class or by email appointment
- TA office hours
  - Wednesday 4-5:30pm

February 16, 2021

MIT 6.823 Spring 2021

L01-11 February 16, 2021

MIT 6.823 Spring 2021

L01-12

# Lectures and tutorials

- Lectures/tutorials are synchronous, through Zoom
  - Video recordings will be available on the website
- Two ways of asking questions:
  - Unmute yourself and ask – for questions specific to lecture
  - Zoom chat – for relevant but less direct questions
- If you can, please enable video 😊
  - Helps you stay engaged, helps us get to know you and get nonverbal feedback like in an in-person lecture
  - Your video won't appear on recordings, and recordings won't be publicly available

MIT 6.823 Spring 2021

L01-12

## Online resources & help

- We use Piazza extensively
  - Fastest way to get your questions answered
  - Links to lecture & tutorial videos will be posted on Piazza
  - All course announcements are made on Piazza
- This is not a normal term; if you need help, let us know!
  - We can be accommodating

February 16, 2021

MIT 6.823 Spring 2021

L01-13 February 16, 2021

MIT 6.823 Spring 2021

L01-14

## The course has four modules

- Module 1

  - Instruction Set Architecture (ISA)
  - Caches and Virtual Memory
  - Simple Pipelining and Hazards

- Module 2

  - Complex Pipelining and Out of Order Execution
  - Branch Prediction and Speculative Execution

- Module 3

  - Multithreading and Multiprocessors
  - Coherence and consistency
  - On-chip networks

- Module 4

  - VLIW, EPIC
  - Vector machines and GPUs

## Textbook and readings

- "Computer Architecture: A Quantitative Approach", Hennessy & Patterson, 5<sup>th</sup> / 6<sup>th</sup> ed.
  - 5<sup>th</sup> edition available online through MIT Libraries
  - Recommended, but not necessary
- Course website lists H&P reading material for each lecture, and optional readings that provide more in-depth coverage

February 16, 2021

MIT 6.823 Spring 2021

L01-15 February 16, 2021

MIT 6.823 Spring 2021

L01-16

## Grading

- Grades are not assigned based on a predetermined curve
  - Most of you are capable of getting an A
- 75% of the grade is based on four closed book 1.5 hour quizzes
  - The first three quizzes will be held during the tutorials; the last one during the last lecture (dates on web syllabus)
  - We'll have distant-timezone quizzes and makeups if needed
- 25% of the grade is based on four laboratory exercises
- No final exam
- No final project

## Problem sets & labs

- Problem sets
  - One problem set per module, not graded
  - Intended for private study and for tutorials to help prepare for quizzes
  - Quizzes assume you are very familiar with the content of problem sets
- Labs
  - Four graded labs (Lab 0 is introductory)
  - Based on widely-used PIN tool
  - Labs 2 and 4 are open-ended challenges
- You must complete labs & quizzes individually
  - Please review the collaboration & academic honesty policy

February 16, 2021

MIT 6.823 Spring 2021

L01-17 February 16, 2021

MIT 6.823 Spring 2021

L01-18

## Self evaluation take-home quiz

- Goal is to help you judge for yourself whether you have prerequisites for this class, and to help refresh your memory
- We assume that you understand digital logic, a simple 5-stage pipeline, and simple caches
- Please work by yourself on this quiz – not in groups
- Remember to complete self-evaluation section at end of the quiz
- Due by Friday (on recitation or send answers to TA mailing list)

*Please email us if you have concerns about your ability to take the class*

## Prehistory

- 1800s: Charles Babbage
  - Difference Engine (conceived in 1823, first implemented in 1855 by Scheutz)
  - Analytic Engine, the first conception of a general purpose computer (1833, never implemented)
- 1890: Tabulating machines
- Early 1900s: Analog computers
- 1930s: Early electronic (fixed-function) digital computers

February 16, 2021

MIT 6.823 Spring 2021

L01-19 February 16, 2021

MIT 6.823 Spring 2021

L01-20

## Electronic Numerical Integrator and Computer (ENIAC)

- Designed and built by Eckert and Mauchly at the University of Pennsylvania during 1943-45
- The first, completely electronic, operational, general-purpose analytical calculator!
  - 30 tons, 72 square meters, 200KW
- Performance
  - Read in 120 cards per minute
  - Addition took 200  $\mu$ s, Division 6 ms
- Not very reliable!



*Application:* Ballistic calculations

angle = f (location, tail wind, cross wind, air density, temperature, weight of shell, propellant charge, ... )

February 16, 2021

MIT 6.823 Spring 2021

L01-21 February 16, 2021

MIT 6.823 Spring 2021

L01-22

## Electronic Discrete Variable Automatic Computer (EDVAC)

- ENIAC's programming system was external
  - Sequences of instructions were executed independently of the results of the calculation
  - Human intervention required to take instructions "out of order"
- EDVAC was designed by Eckert, Mauchly, and von Neumann in 1944 to solve this problem
  - Solution was the *stored program computer*
  - ⇒ "*program can be manipulated as data*"
- First Draft of a report on EDVAC was published in 1945, but just had von Neumann's signature!
  - Without a doubt the most influential paper in computer architecture

## Stored Program Computer

Program = A sequence of instructions

How to control instruction sequencing?

manual control calculators

automatic control  
external (paper tape)  
Harvard Mark I, 1944  
Zuse's Z1, WW2

internal  
plug board  
read-only memory  
read-write memory

|       |                |
|-------|----------------|
| ENIAC | 1946           |
| ENIAC | 1948           |
| EDVAC | 1947 (concept) |

- The same storage can be used to store program and data

|       |      |                |
|-------|------|----------------|
| EDSAC | 1950 | Maurice Wilkes |
|-------|------|----------------|

## The Spread of Ideas

ENIAC & EDVAC had immediate impact

*brilliant engineering:* Eckert & Mauchly  
*lucid paper:* Burks, Goldstein & von Neumann

|         |            |       |            |
|---------|------------|-------|------------|
| IAS     | Princeton  | 46-52 | Bigelow    |
| EDSAC   | Cambridge  | 46-50 | Wilkes     |
| MANIAC  | Los Alamos | 49-52 | Metropolis |
| JOHNIAC | Rand       | 50-53 |            |
| ILLIAC  | Illinois   | 49-52 |            |
|         | Argonne    | 49-53 |            |
| SWAC    | UCLA-NBS   |       |            |

UNIVAC - the first commercial computer, 1951

Alan Turing's direct influence on these developments is often debated by historians.

February 16, 2021

MIT 6.823 Spring 2021

L01-23 February 16, 2021

MIT 6.823 Spring 2021

L01-24

# Dominant Technology Issue: Reliability

ENIAC  $\Rightarrow$  EDVAC  
 18,000 tubes  $\Rightarrow$  4,000 tubes  
 20 10-digit numbers 2000 word storage  
 mercury delay lines

Mean time between failures (MTBF)  
*MIT's Whirlwind with an MTBF of 20 min. was perhaps the most reliable machine!*

Reasons for unreliability:

1. Vacuum tubes
2. Storage medium
  - Acoustic delay lines
  - Mercury delay lines
  - Williams tubes
  - Selections

CORE J. Forrester 1954

February 16, 2021

MIT 6.823 Spring 2021

L01-25 February 16, 2021

MIT 6.823 Spring 2021

L01-26

## Accumulator-based computing



- Single Accumulator
  - Calculator design carried over to computers

Why?

Registers expensive

February 16, 2021

MIT 6.823 Spring 2021

L01-27 February 16, 2021

MIT 6.823 Spring 2021

L01-28

## Programming: Single Accumulator Machine

$$C_i \leftarrow A_i + B_i, \quad 1 \leq i \leq n$$

|      |       |      |
|------|-------|------|
| LOOP | LOAD  | N    |
|      | JGE   | DONE |
|      | ADD   | ONE  |
|      | STORE | N    |
| F1   | LOAD  | A    |
| F2   | ADD   | B    |
| F3   | STORE | C    |
|      | JUMP  | LOOP |
| DONE | HLT   |      |



Problem?

How to modify the addresses A, B and C ?

February 16, 2021

MIT 6.823 Spring 2021

L01-29 February 16, 2021

MIT 6.823 Spring 2021

L01-30

## Computers in the mid 50's

- Hardware was expensive
- Stores were small (1000 words)
  - $\Rightarrow$  No resident system-software!
- Memory access time was 10 to 50 times slower than the processor cycle
  - $\Rightarrow$  Instruction execution time was totally dominated by the memory reference time
- The ability to design complex control circuits to execute an instruction was the central design concern as opposed to the speed of decoding or an ALU operation
- Programmer's view of the machine was inseparable from the actual hardware implementation

L01-25 February 16, 2021

MIT 6.823 Spring 2021

L01-26

## The Earliest Instruction Sets

Burks, Goldstein & von Neumann ~1946

|             |   |                                                    |
|-------------|---|----------------------------------------------------|
| LOAD        | x | $AC \leftarrow M[x]$                               |
| STORE       | x | $M[x] \leftarrow (AC)$                             |
| ADD         | x | $AC \leftarrow (AC) + M[x]$                        |
| SUB         | x |                                                    |
| MUL         | x | Involved a quotient register                       |
| DIV         | x |                                                    |
| SHIFT LEFT  |   |                                                    |
| SHIFT RIGHT |   | $AC \leftarrow 2 \times (AC)$                      |
| JUMP        | x |                                                    |
| JGE         | x | if $(AC) \geq 0$ then $PC \leftarrow x$            |
| LOAD ADR    | x | $AC \leftarrow \text{Extract address field}(M[x])$ |
| STORE ADR   | x |                                                    |

Typically less than 2 dozen instructions!

L01-27 February 16, 2021

MIT 6.823 Spring 2021

L01-28

## Self-Modifying Code

|      |           |      |                                                   |
|------|-----------|------|---------------------------------------------------|
| LOOP | LOAD      | N    | $C_i \leftarrow A_i + B_i, \quad 1 \leq i \leq n$ |
|      | JGE       | DONE |                                                   |
|      | ADD       | ONE  |                                                   |
|      | STORE     | N    |                                                   |
| F1   | LOAD      | A    |                                                   |
| F2   | ADD       | B    |                                                   |
| F3   | STORE     | C    |                                                   |
|      | LOAD ADR  | F1   |                                                   |
|      | ADD       | ONE  |                                                   |
|      | STORE ADR | F1   |                                                   |
|      | LOAD ADR  | F2   |                                                   |
|      | ADD       | ONE  |                                                   |
|      | STORE ADR | F2   |                                                   |
|      | LOAD ADR  | F3   |                                                   |
|      | ADD       | ONE  |                                                   |
|      | STORE ADR | F3   |                                                   |
|      | JUMP      | LOOP |                                                   |
| DONE | HLT       |      |                                                   |

modify the program for the next iteration

Each iteration involves total book-keeping  
 instruction fetches  
 operand fetches  
 stores

Most of the executed instructions are for bookkeeping!

February 16, 2021

MIT 6.823 Spring 2021

L01-29 February 16, 2021

MIT 6.823 Spring 2021

L01-30



## More Instruction Formats

- **One address formats:** Accumulator machines
  - Accumulator is always other implicit operand
- **Zero address formats:** operands on a stack

add     $M[sp-1] \leftarrow M[sp] + M[sp-1]$   
load     $M[sp] \leftarrow M[M[sp]]$



- Stack can be in registers or in memory
  - usually top of stack cached in registers

Many different formats are possible!

February 16, 2021

MIT 6.823 Spring 2021

L01-37 February 16, 2021

MIT 6.823 Spring 2021

L01-38

## Instruction Set Architecture & Hardwired, Non-pipelined ISA Implementation

Daniel Sanchez  
Computer Science & Artificial Intelligence Lab  
M.I.T.

February 18, 2021

MIT 6.823 Spring 2021

L02-1 February 18, 2021

MIT 6.823 Spring 2021

L02-2

## Programmer's view of a machine: IBM 650

A drum machine with 44 instructions

Instruction: 60 1234 1009

- "Load the contents of location 1234 into the distribution; put it also into the upper accumulator; set lower accumulator to zero; and then go to location 1009 for the next instruction."

- Programmer's view of the machine was inseparable from the actual hardware implementation
- Good programmers optimized the placement of instructions on the drum to reduce latency!

## Instruction sets in the mid 50's

- Great variety of instruction sets, but all intimately tied to implementation details
- Programmer's view of the machine was inseparable from the actual hardware implementation!

Next Lecture:  
**Instruction Set Architectures:  
Decoupling Interface and  
Implementation**

## The IBM 650 (1953-4)



MIT 6.823 Spring 2021

L02-3

## Compatibility Problem at IBM

By early 60's, IBM had 4 incompatible lines of computers!

|      |   |      |
|------|---|------|
| 701  | → | 7094 |
| 650  | → | 7074 |
| 702  | → | 7080 |
| 1401 | → | 7010 |

Each system had its own

- Instruction set
- I/O system and Secondary Storage: magnetic tapes, drums and disks
- Assemblers, compilers, libraries,...
- Market niche business, scientific, real time, ...

⇒ IBM 360

February 18, 2021

MIT 6.823 Spring 2021

L02-3 February 18, 2021

MIT 6.823 Spring 2021

L02-4

# IBM 360: Design Premises

Amdahl, Blaauw, and Brooks, 1964

The design must lend itself to *growth and successor machines*

- General method for connecting I/O devices
- Total performance - answers per month rather than bits per microsecond  $\Rightarrow$  *programming aids*
- Machine must be capable of *supervising itself* without manual intervention
- Built-in *hardware fault checking* and locating aids to reduce down time
- Simple to assemble systems with redundant I/O devices, memories, etc. for *fault tolerance*
- Some problems required floating point words larger than 36 bits

February 18, 2021

MIT 6.823 Spring 2021

L02-5 February 18, 2021

MIT 6.823 Spring 2021

L02-6

## Instruction Set

*The control for changing the information held in the processor are specified by the instructions available in the instruction set architecture or ISA.*

Some things an ISA must specify:

- *A way to reference registers and memory*
- *The computational operations available*
- *How to control the sequence of instructions*
- *A binary representation for all of the above*

*ISA must satisfy the needs of the software:*  
- assembler, compiler, OS, VM

February 18, 2021

MIT 6.823 Spring 2021

L02-7 February 18, 2021

MIT 6.823 Spring 2021

L02-8

## IBM 360: Initial Implementations (1964)

|                 | Model 30              | ... | Model 70           |
|-----------------|-----------------------|-----|--------------------|
| Memory Capacity | 8K - 64 KB            | ... | 256K - 512 KB      |
| Memory Cycle    | 2.0 $\mu$ s           | ... | 1.0 $\mu$ s        |
| Datapath        | 8-bit                 |     | 64-bit             |
| Circuit Delay   | 30 nsec/level         |     | 5 nsec/level       |
| Registers       | in Main Store         |     | in Transistor      |
| Control Store   | Read only 1 $\mu$ sec |     | Dedicated circuits |

- Six implementations (Models, 30, 40, 50, 60, 62, 70)
- 50x performance difference across models
- *ISA completely hid the underlying technological differences between various models*

With minor modifications, IBM 360 ISA is still in use

February 18, 2021

MIT 6.823 Spring 2021

L02-9 February 18, 2021

MIT 6.823 Spring 2021

L02-10

## Processor State and Data Types

*The information held in the processor at the end of an instruction to provide the processing context for the next instruction.*

Program Counter, Accumulator, ...

- The information held in the processor will be interpreted as having data types manipulated by the instructions.
- If the processing of an instruction can be interrupted then the *hardware* must save and restore the state in a transparent manner

*Programmer's machine model* is a *contract* between the hardware and software

## IBM 360: A General-Purpose Register (GPR) Machine

- Processor State
  - 16 General-Purpose 32-bit Registers
  - 4 Floating Point 64-bit Registers
  - A Program Status Word (PSW)
    - PC, Condition codes, Control flags
- Data Formats
  - 8-bit bytes, 16-bit half-words, 32-bit words, 64-bit double-words
  - 24-bit addresses
- A 32-bit machine with 24-bit addresses
  - *No instruction contains a 24-bit address!*
- Precise interrupts



September 2019  
Image credit: IBM

## IBM 360: Fifty-five years later... z15 Microprocessor

- 9.2 billion transistors, 12-core design
- Up to 190 cores (2 spare) per system
- 5.2 GHz, 14nm CMOS technology
- 64-bit virtual addressing
  - Original 360 was 24-bit; 370 was a 31-bit extension
- Superscalar, out-of-order
  - 12-wide issue
  - Up to 180 instructions in flight
- 16K-entry Branch Target Buffer
  - Very large buffer to support commercial workloads
- Four Levels of caches
  - 128KB L1 I-cache, 128KB L1 D-cache
  - 4MB L2 cache per core
  - 256MB shared on-chip L3 cache
  - 960MB shared off-chip L4 cache
- Up to 40TB of main memory per system

# Instruction Set Architecture (ISA) versus Implementation

- ISA is the hardware/software interface
  - Defines set of programmer visible state
  - Defines data types
  - Defines instruction semantics (operations, sequencing)
  - Defines instruction format (bit encoding)
  - Examples: *MIPS, RISC-V, Alpha, x86, IBM 360, VAX, ARM, JVM*
- Many possible implementations of one ISA
  - 360 implementations: model 30 (c. 1964), z15 (c. 2019)
  - x86 implementations: 8086 (c. 1978), 80186, 286, 386, 486, Pentium, Pentium Pro, Pentium-4, Core i7, AMD Athlon, AMD Opteron, Transmeta Crusoe, SoftPC
  - MIPS implementations: R2000, R4000, R10000, ...
  - JVM: HotSpot, PicoJava, ARM Jazelle, ...

February 18, 2021

MIT 6.823 Spring 2021

L02-11 February 18, 2021

MIT 6.823 Spring 2021

L02-12

## Hardware Elements

### • Combinational circuits

- Mux, Demux, Decoder, ALU, ...



### • Synchronous state elements

- Flipflop, Register, Register file, SRAM, DRAM



*Edge-triggered: Data is sampled at the rising edge*

February 18, 2021

MIT 6.823 Spring 2021

L02-13 February 18, 2021

MIT 6.823 Spring 2021

L02-14

## Register File Implementation



### • Register files with a large number of ports are difficult to design

- Area scales with ports<sup>2</sup>
- Almost all Alpha instructions have exactly 2 register source operands
- Intel's Itanium GPR File has 128 registers with 8 read ports and 4 write ports!!

February 18, 2021

MIT 6.823 Spring 2021

L02-15 February 18, 2021

MIT 6.823 Spring 2021

L02-16

## Processor Performance

$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} * \frac{\text{Cycles}}{\text{Instruction}} * \frac{\text{Time}}{\text{Cycle}}$$

- Instructions per program depends on source code, compiler technology and ISA
- Cycles per instructions (CPI) depends upon the ISA and the microarchitecture
- Time per cycle depends upon the microarchitecture and the base technology

rest of  
this lecture

| Microarchitecture        | CPI | cycle time |
|--------------------------|-----|------------|
| Microcoded               | >1  | short      |
| Single-cycle unpipelined | 1   | long       |
| Pipelined                | 1   | short      |

## Register Files



No timing issues when reading and writing the same register (writes happen at the end of the cycle)

MIT 6.823 Spring 2021

L02-15

## A Simple Memory Model



### • Reads and writes are always completed in one cycle

- A Read can be done any time (i.e., combinational)
- If enabled, a Write is performed at the rising clock edge (the write address and data must be stable at the clock edge)

Later in the course we will present a more realistic model of memory

February 18, 2021

MIT 6.823 Spring 2021

L02-15 February 18, 2021

MIT 6.823 Spring 2021

L02-16

# The MIPS ISA

## Implementing MIPS: Single-cycle per instruction datapath & control logic

### Processor State

32 32-bit GPRs, R0 always contains a 0  
 32 single precision FPRs, may also be viewed as  
 16 double precision FPRs  
 FP status register, used for FP compares & exceptions  
 PC, the program counter  
 Some other special registers

### Data types

8-bit byte, 16-bit half word  
 32-bit word for integers  
 32-bit word for single precision floating point  
 64-bit word for double precision floating point

### Load/Store style instruction set

Data addressing modes: immediate & indexed  
 Branch addressing modes: PC relative & register indirect  
 Byte-addressable memory, big-endian mode

All instructions are 32 bits

February 18, 2021

MIT 6.823 Spring 2021

L02-17 February 18, 2021

MIT 6.823 Spring 2021

L02-18

## Instruction Execution

Execution of an instruction involves

1. Instruction fetch
2. Decode
3. Register fetch
4. ALU operation
5. Memory operation (optional)
6. Write back

And computing the address of the  
*next instruction (next PC)*

## Datapath: Reg-Reg ALU Instructions



February 18, 2021

MIT 6.823 Spring 2021

L02-19 February 18, 2021

MIT 6.823 Spring 2021

L02-20

## Datapath: Reg-Imm ALU Instructions

## Conflicts in Merging Datapath



February 18, 2021

MIT 6.823 Spring 2021

L02-21 February 18, 2021

MIT 6.823 Spring 2021

L02-22



February 18, 2021

MIT 6.823 Spring 2021

L02-21 February 18, 2021

MIT 6.823 Spring 2021

L02-22

## Datapath for ALU Instructions



February 18, 2021

MIT 6.823 Spring 2021

L02-23 February 18, 2021

MIT 6.823 Spring 2021

L02-24

## Datapath for Memory Instructions

Should program and data memory be separate?

*Harvard style: separate* (Aiken and Mark 1 influence)

- read-only program memory
- read/write data memory

- Note:

There must be a way to load the program memory

*Princeton style: the same* (von Neumann's influence)

- single read/write memory for program and data

- Note:

Executing a Load or Store instruction requires accessing the memory more than once

## Load/Store Instructions

*Harvard Datapath*



31 26 25 21 20 16 15 0

rs is the base register

rt is the destination of a Load or the source for a Store

February 18, 2021

MIT 6.823 Spring 2021

L02-25 February 18, 2021

MIT 6.823 Spring 2021

L02-26

## MIPS Control Instructions

Conditional (on GPR) PC-relative branch

|        |    |   |        |
|--------|----|---|--------|
| 6      | 5  | 5 | 16     |
| opcode | rs |   | offset |

BEQZ, BNEZ

Unconditional register-indirect jumps

|        |    |   |    |
|--------|----|---|----|
| 6      | 5  | 5 | 16 |
| opcode | rs |   |    |

JR, JALR

Unconditional absolute jumps

|        |        |
|--------|--------|
| 6      | 26     |
| opcode | target |

J, JAL

Condition

BEQZ

BNEZ

JR, JALR

J, JAL

- Jump-&-link stores PC+4 into the link register (R31)

- Control transfers are not delayed  
*we will worry about the branch delay slot later*

## Conditional Branches (BEQZ, BNEZ)



February 18, 2021

MIT 6.823 Spring 2021

L02-27 February 18, 2021

MIT 6.823 Spring 2021

L02-27

## Register-Indirect Jumps (JR)



February 18, 2021

MIT 6.823 Spring 2021

L02-28 February 18, 2021

MIT 6.823 Spring 2021

L02-28

## Register-Indirect Jump-&-Link (JALR)



February 18, 2021

MIT 6.823 Spring 2021

L02-29 February 18, 2021

## Harvard-Style Datapath for MIPS



February 18, 2021

MIT 6.823 Spring 2021

L02-31 February 18, 2021

MIT 6.823 Spring 2021

L02-32

## Absolute Jumps (J, JAL)



February 18, 2021

MIT 6.823 Spring 2021

L02-30

## Hardwired Control is pure Combinational Logic



## ALU Control & Immediate Extension



February 18, 2021

MIT 6.823 Spring 2021

L02-33 February 18, 2021

MIT 6.823 Spring 2021

L02-34

## Hardwired Control Table

| Opcode              | ExtSel | BSrc | OpSel | MemW | RegW | WBSrc | RegDst | PCSrc |
|---------------------|--------|------|-------|------|------|-------|--------|-------|
| ALU                 |        |      |       |      |      |       |        |       |
| ALUi                |        |      |       |      |      |       |        |       |
| ALUiu               |        |      |       |      |      |       |        |       |
| LW                  |        |      |       |      |      |       |        |       |
| SW                  |        |      |       |      |      |       |        |       |
| BEQZ <sub>Z=0</sub> |        |      |       |      |      |       |        |       |
| BEQZ <sub>Z=1</sub> |        |      |       |      |      |       |        |       |
| J                   |        |      |       |      |      |       |        |       |
| JAL                 |        |      |       |      |      |       |        |       |
| JR                  |        |      |       |      |      |       |        |       |
| JALR                |        |      |       |      |      |       |        |       |

BSrc = Reg / Imm  
RegDst = rt / rd / R31

WBSrc = ALU / Mem / PC  
PCSrc = pc+4 / br / rind / jabs

MIT 6.823 Spring 2021

L02-35

## Single-Cycle Hardwired Control: Harvard architecture

We will assume

- Clock period is sufficiently long for all of the following steps to be "completed":

1. instruction fetch
2. decode and register fetch
3. ALU operation
4. data fetch if required
5. register write-back setup time

$$\Rightarrow t_C > t_{IFetch} + t_{RFetch} + t_{ALU} + t_{DMem} + t_{RWB}$$

- At the rising edge of the following clock, the PC, the register file and the memory are updated

February 18, 2021

MIT 6.823 Spring 2021

L02-35 February 18, 2021

MIT 6.823 Spring 2021

L02-36

## Princeton Microarchitecture Datapath & Control



February 18, 2021

MIT 6.823 Spring 2021

L02-37 February 18, 2021

L02-38

## Two-State Controller: Princeton Architecture



A flipflop can be used to remember the phase

## Hardwired Controller: Princeton Architecture



February 18, 2021

MIT 6.823 Spring 2021

L02-39 February 18, 2021

MIT 6.823 Spring 2021

L02-40

## Clock Rate vs CPI

$$t_{C-Princeton} > \max \{ t_M, t_{RF} + t_{ALU} + t_M + t_{WB} \}$$

$$t_{C-Princeton} > t_{RF} + t_{ALU} + t_M + t_{WB}$$

$$t_{C-Harvard} > t_M + t_{RF} + t_{ALU} + t_M + t_{WB}$$

Suppose  $t_M \gg t_{RF} + t_{ALU} + t_{WB}$

$$t_{C-Princeton} = 0.5 * t_{C-Harvard}$$

$$CPI_{Princeton} = 2$$

$$CPI_{Harvard} = 1$$

No difference in performance!

Is it possible to design a controller for the Princeton architecture with  $CPI < 2$ ?

$CPI = \text{Clock cycles Per Instruction}$

Stay tuned!

MIT 6.823 Spring 2021

L02-41

## CPU-Memory Bottleneck



Performance of high-speed computers is usually limited by memory *bandwidth & latency*

- Latency (time for a single access)  
Memory access time  $>>$  Processor cycle time
- Bandwidth (number of accesses per unit time)  
if fraction  $m$  of instructions access memory,  
 $\Rightarrow 1+m$  memory references / instruction  
 $\Rightarrow$  CPI = 1 requires  $1+m$  memory refs / cycle

February 23, 2021

MIT 6.823 Spring 2021

L03-1 February 23, 2021

MIT 6.823 Spring 2021

L03-2

## Memory Technology

- Early machines used a variety of memory technologies
  - Manchester Mark I used CRT Memory Storage
  - EDVAC used a mercury delay line
- Core memory was first large scale reliable main memory
  - Invented by Forrester in late 40s at MIT for Whirlwind project
  - Bits stored as magnetization polarity on small ferrite cores threaded onto 2 dimensional grid of wires
- First commercial DRAM was Intel 1103
  - 1Kbit of storage on single chip
  - charge on a capacitor used to hold value
- Semiconductor memory quickly replaced core in 1970s
  - Intel formed to exploit market for semiconductor memory
- Flash memory
  - Slower, but denser than DRAM. Also non-volatile, but with wearout issues
- Phase change memory (PCM, 3D XPoint)
  - Slightly slower, but much denser than DRAM and non-volatile

February 23, 2021

MIT 6.823 Spring 2021

L03-3 February 23, 2021

MIT 6.823 Spring 2021

L03-4

## DRAM Architecture



- Bits stored in 2-dimensional arrays on chip
- Modern chips have around 8 logical banks on each chip
  - Each logical bank physically implemented as many smaller arrays

## DRAM timing



February 23, 2021

MIT 6.823 Spring 2021

L03-5 February 23, 2021

MIT 6.823 Spring 2021

L03-6

## Processor-DRAM Gap (latency)



Four-issue 2GHz superscalar accessing 100ns DRAM could execute 800 instructions during time for one memory access!

## Little's Law

$$\text{Throughput } (T) = \frac{\text{Number in Flight } (N)}{\text{Latency } (L)}$$



Example:

- Assume infinite-bandwidth memory
- 100 cycles / memory reference
- 1 + 0.2 memory references / instruction

$$\Rightarrow \text{Table size} = 1.2 * 100 = 120 \text{ entries}$$

120 independent memory operations in flight!

February 23, 2021

MIT 6.823 Spring 2021

L03-7 February 23, 2021

MIT 6.823 Spring 2021

L03-8

## Basic Static RAM Cell

6-Transistor SRAM Cell



• Write:

1. Drive bit lines ( $\text{bit}=1, \overline{\text{bit}}=0$ )
2. Select word line

• Read:

1. Precharge bit and  $\overline{\text{bit}}$  to Vdd
2. Select word line
3. Cell pulls one bit line low
4. Column sense amp detects difference between  $\text{bit}$  &  $\overline{\text{bit}}$

## Multilevel Memory

Strategy: Reduce average latency using small, fast memories called caches.

Caches are a mechanism to reduce memory latency based on the empirical observation that the patterns of memory references made by a processor are often highly predictable:

|                                                                                            |                                               |
|--------------------------------------------------------------------------------------------|-----------------------------------------------|
| $\text{Loop: add r2, r1, r1}$<br>$\text{subi r3, r3, #1}$<br>$\text{bneq r3, loop}$<br>... | $\text{PC}$<br>96<br>100<br>104<br>108<br>112 |
|--------------------------------------------------------------------------------------------|-----------------------------------------------|

February 23, 2021

MIT 6.823 Spring 2021

L03-9 February 23, 2021

MIT 6.823 Spring 2021

L03-10

## Typical Memory Reference Patterns



## Common Predictable Patterns

Two predictable properties of memory references:

- **Temporal Locality:** If a location is referenced, it is likely to be referenced again in the near future
- **Spatial Locality:** If a location is referenced, it is likely that locations near it will be referenced in the near future

## Memory Hierarchy



- **size:** Register << SRAM << DRAM why?
- **latency:** Register << SRAM << DRAM why?
- **bandwidth:** on-chip >> off-chip why?

On a data access:

- hit* (data  $\in$  fast memory)  $\Rightarrow$  low latency access
- miss* (data  $\notin$  fast memory)  $\Rightarrow$  long latency access (DRAM)

February 23, 2021

MIT 6.823 Spring 2021

L03-11 February 23, 2021

MIT 6.823 Spring 2021

L03-12

# Management of Memory Hierarchy

- Small/fast storage, e.g., registers
  - Address usually specified in instruction
  - Generally implemented directly as a register file
    - but hardware might do things behind software's back, e.g., stack management, register renaming
- Large/slower storage, e.g., memory
  - Address usually computed from values in register
  - Generally implemented as a cache hierarchy
    - hardware decides what is kept in fast memory
    - but software may provide "hints", e.g., don't cache or prefetch

February 23, 2021

MIT 6.823 Spring 2021

L03-13 February 23, 2021

MIT 6.823 Spring 2021

L03-14

## Inside a Cache



Q: How many bits needed in tag? \_\_\_\_\_

## Cache Algorithm (Read)

Look at Processor Address, search cache tags to find match.  
Then either



February 23, 2021

MIT 6.823 Spring 2021

L03-15 February 23, 2021

MIT 6.823 Spring 2021

L03-15

## Direct-Mapped Cache



Q: What is a bad reference pattern? \_\_\_\_\_

## Direct Map Address Selection *higher-order vs. lower-order address bits*



Q: Why might this be undesirable? \_\_\_\_\_

February 23, 2021

MIT 6.823 Spring 2021

L03-17 February 23, 2021

MIT 6.823 Spring 2021

L03-18

## Hashed Address Mapping



Q: What are the tradeoffs of hashing? \_\_\_\_\_

## 2-Way Set-Associative Cache



February 23, 2021

MIT 6.823 Spring 2021

L03-19 February 23, 2021

## Placement Policy



February 23, 2021

MIT 6.823 Spring 2021

L03-21 February 23, 2021

MIT 6.823 Spring 2021

L03-22

## Improving Cache Performance

$$\text{Average memory access time} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}$$

To improve performance:

- reduce the hit time
- reduce the miss rate (e.g., larger, better policy)
- reduce the miss penalty (e.g., L2 cache)

*What is the simplest design strategy?*

## Causes for Cache Misses

### • Compulsory:

First reference to a block *a.k.a.* cold start misses  
- misses that would occur even with infinite cache

### • Capacity:

cache is too small to hold all data the program needs  
- misses that would occur even under perfect placement & replacement policy

### • Conflict:

misses from collisions due to block-placement strategy  
- misses that would not occur with full associativity

## Effect of Cache Parameters on Performance

|                   | Larger capacity cache | Higher associativity cache | Larger block size cache * |
|-------------------|-----------------------|----------------------------|---------------------------|
| Compulsory misses |                       |                            |                           |
| Capacity misses   |                       |                            |                           |
| Conflict misses   |                       |                            |                           |
| Hit latency       |                       |                            |                           |
| Miss latency      |                       |                            |                           |

\* Assume substantial spatial locality

February 23, 2021

MIT 6.823 Spring 2021

L03-23 February 23, 2021

MIT 6.823 Spring 2021

L03-24

# Block-level Optimizations

- Tags are too large, i.e., too much overhead
  - Simple solution: Larger blocks, but miss penalty could be large.
- Sub-block placement (aka sector cache)
  - A valid bit added to units smaller than the full block, called sub-blocks
  - Only read a sub-block on a miss
  - If a tag matches, is the sub-block in the cache?

|     |   |   |   |
|-----|---|---|---|
| 100 | 1 | 1 | 1 |
| 300 | 1 | 1 | 0 |
| 204 | 0 | 1 | 0 |

February 23, 2021

MIT 6.823 Spring 2021

L03-25 February 23, 2021

MIT 6.823 Spring 2021

L03-26

# Multiple replacement policies

Use the best replacement policy for a program  
Cache



February 23, 2021

MIT 6.823 Spring 2021

L03-27 February 23, 2021

MIT 6.823 Spring 2021

L03-28

# Inclusion Policy

- Inclusive multilevel cache:
  - Inner cache holds copies of data in outer cache
  - On miss, line inserted in inner and outer cache; replacement in outer cache invalidates line in inner cache
  - External accesses need only check outer cache
  - Commonly used (e.g., Intel CPUs up to Broadwell)
- Non-inclusive multilevel caches:
  - Inner cache may hold data not in outer cache
  - Replacement in outer cache doesn't invalidate line in inner cache
  - Used in Intel Skylake, ARM
- Exclusive multilevel caches:
  - Inner cache and outer cache hold different data
  - Swap lines between inner/outer caches on miss
  - Used in AMD processors

Why choose one type or the other?

February 23, 2021

MIT 6.823 Spring 2021

L03-29 February 23, 2021

MIT 6.823 Spring 2021

L03-30

# Replacement Policy

Which block from a set should be evicted?

- Random
- Least Recently Used (LRU)
  - LRU cache state must be updated on every access
  - true implementation only feasible for small sets (2-way)
  - pseudo-LRU binary tree was often used for 4-8 way
- First In, First Out (FIFO) a.k.a. Round-Robin
  - used in highly associative caches
- Not Least Recently Used (NLRU)
  - FIFO with exception for most recently used block or blocks
- One-bit LRU
  - Each way represented by a bit. Set on use, replace first unused.

L03-25 February 23, 2021

MIT 6.823 Spring 2021

L03-26

# Multilevel Caches

- A memory cannot be large and fast
- Add level of cache to reduce miss penalty
  - Each level can have longer latency than level above
  - So, increase sizes of cache at each level



Metrics:

Local miss rate = misses in cache / accesses to cache

Global miss rate = misses in cache / CPU memory accesses

Misses per instruction (MPI) = misses in cache / number of instructions

# Victim Caches (HP 7200)



Victim cache is a small associative back up cache, added to a direct mapped cache, which holds recently evicted lines

- First look up in direct mapped cache
  - If miss, look in victim cache
  - If hit in victim cache, swap hit line with line now evicted from L1
  - If miss in victim cache, L1 victim -> VC, VC victim->?
- Fast hit time of direct mapped but with reduced conflict misses

MIT 6.823 Spring 2021

L03-30

## Typical memory hierarchies



(a) Memory hierarchy for a server



(b) Memory hierarchy for a personal mobile device

February 23, 2021

MIT 6.823 Spring 2021

L03-31 February 23, 2021

MIT 6.823 Spring 2021

L03-32

## HBM DRAM or MCDRAM



## Mixed technology caching (Intel Knights Landing)



February 23, 2021

MIT 6.823 Spring 2021

L03-33 February 23, 2021

MIT 6.823 Spring 2021

L03-34

Thank you!

Next lecture:  
Virtual memory

## Memory Management: From Absolute Addresses to Demand Paging

Daniel Sanchez

Computer Science and Artificial Intelligence Laboratory  
M.I.T.

## Recap: Cache Organization

- Caches are small and fast memories that transparently retain recently accessed data
- Cache organizations
  - Direct-mapped
  - Set-associative
  - Fully associative
- Cache performance
  - $AMAT = HitLatency + MissRate * MissLatency$
  - Minimizing AMAT requires balancing competing tradeoffs

## Multilevel Caches

- A memory cannot be large and fast
  - Add level of cache to reduce miss penalty
    - Each level can have longer latency than level above
    - So, increase sizes of cache at each level



## Metrics:

Local miss rate = misses in cache / accesses to cache

Global miss rate = misses in cache / CPU memory accesses

Misses per instruction = misses in cache / number of instructions

February 25, 2021

MIT 6.823 Spring 2021

L04-3 February 23, 2021

MIT 6.823 Spring 2021

L03-4

## Victim Caches (HP 7200)



Victim cache is a small associative back up cache, added to a direct mapped cache, which holds recently evicted lines

- First look up in direct-mapped cache
  - If miss, look in victim cache
  - If hit in victim cache, swap hit line with line now evicted from L1
  - If miss in victim cache, L1 victim  $\rightarrow$  VC, VC victim  $\rightarrow$ ?

Fast hit time of direct-mapped but with reduced conflict misses

Table Two

[View Details](#)

①) Memory hierarchy for a person

## HBM DRAM or MCDRAM



Source: AMD

## Inclusion Policy

- Inclusive multilevel cache:
    - Inner cache holds copies of data in outer cache
    - On miss, line inserted in inner and outer cache; replacement in outer cache invalidates line in inner cache
    - External accesses need only check outer cache
    - Commonly used (e.g., Intel CPUs up to Broadwell)
  - Non-inclusive multilevel caches:
    - Inner cache may hold data not in outer cache
    - Replacement in outer cache doesn't invalidate line in inner cache
    - Used in Intel Skylake, ARM
  - Exclusive multilevel caches:
    - Inner cache and outer cache hold different data
    - Swap lines between inner/outer caches on miss
    - Used in AMD processors

Why choose one type or the other?

## Typical memory hierarchies



#### Memory hierarchy for a person

# Memory Management

- The Fifties
    - Absolute Addresses
    - Dynamic address translation
  - The Sixties
    - Atlas and Demand Paging
    - Paged memory systems and TLBs
  - Modern Virtual Memory Systems

## Names for Memory Locations



- Machine language address
  - as specified in machine code
- Virtual address
  - ISA specifies translation of machine code address into virtual address of program variable (sometimes called effective address)
- Physical address
  - ⇒ operating system specifies mapping of virtual address into name for a physical memory location

February 25, 2021

MIT 6.823 Spring 2021

L04-9 February 25, 2021

MIT 6.823 Spring 2021

L04-10

## Absolute Addresses

*EDSAC, early 50's*

**virtual address = physical memory address**

- Only one program ran at a time, with unrestricted access to entire machine (RAM + I/O devices)
- Addresses in a program depended upon where the program was to be loaded in memory
- *But it was more convenient for programmers to write location-independent subroutines*

*How could location independence be achieved?*

*Linker and/or loader modify addresses of subroutines and callers when building a program memory image*

## Multiprogramming

### Motivation

In the early machines, I/O operations were slow and each word transferred involved the CPU

Higher throughput if CPU and I/O of 2 or more programs were overlapped. How?  
⇒ *multiprogramming*

### Location-independent programs

Programming and storage management ease  
⇒ need for a *base register*

### Protection

Independent programs should not affect each other inadvertently  
⇒ need for a *bound register*



February 25, 2021

MIT 6.823 Spring 2021

L04-11 February 25, 2021

MIT 6.823 Spring 2021

L04-12

## Simple Base and Bound Translation



Base and bounds registers are visible/accessible only when processor is running in *supervisor mode*

## Separate Areas for Code and Data



*What is an advantage of this separation?*

(Scheme used on all Cray vector supercomputers prior to X1, 2002)

February 25, 2021

MIT 6.823 Spring 2021

L04-13 February 25, 2021

## Memory Fragmentation



As users come and go, the storage is "fragmented". Therefore, at some stage programs have to be moved around to compact the storage.

February 25, 2021

MIT 6.823 Spring 2021

L04-14

# Paged Memory Systems

- Processor-generated address can be interpreted as a pair <page number, offset>
- page number      offset
- A page table contains the physical address of the base of each page



*Page tables make it possible to store the pages of a program non-contiguously.*

February 25, 2021

MIT 6.823 Spring 2021

L04-15 February 25, 2021

MIT 6.823 Spring 2021

L04-16

## Where Should Page Tables Reside?

- Space required by the page tables (PT) is proportional to the address space, number of users, ...
  - Space requirement is large
  - Too expensive to keep in registers
- Idea: Keep PT of the current user in special registers
  - may not be feasible for large page tables
  - Increases the cost of context swap
- Idea: Keep PTs in the main memory
  - needs one reference to retrieve the page base address and another to access the data word
    - doubles the number of memory references!*

February 25, 2021

MIT 6.823 Spring 2021

L04-17 February 25, 2021

MIT 6.823 Spring 2021

L04-18

## A Problem in Early Sixties

- There were many applications whose data could not fit in the main memory, e.g., payroll
  - Paged memory system reduced fragmentation but still required the whole program to be resident in the main memory*
- Programmers moved the data back and forth from the secondary store by *overlaying* it repeatedly on the primary store

*tricky programming!*

February 25, 2021

MIT 6.823 Spring 2021

L04-19 February 25, 2021

MIT 6.823 Spring 2021

L04-20

## Private Address Space per User



- Each user has a page table
- Page table contains an entry for each user page

MIT 6.823 Spring 2021

L04-17

## Page Tables in Physical Memory



MIT 6.823 Spring 2021

L04-18

## Manual Overlays

- Assume an instruction can address all the storage on the drum
- Method 1:* programmer keeps track of addresses in the main memory and initiates an I/O transfer when required
- Method 2:* automatic initiation of I/O transfers by software address translation

*Brooker's interpretive coding, 1960*



Ferranti Mercury  
1956

Problems?

Method1: Difficult, error prone  
Method2: Inefficient

MIT 6.823 Spring 2021

L04-20

# Demand Paging in Atlas (1962)

"A page from secondary storage is brought into the primary storage whenever it is (implicitly) demanded by the processor."

*Tom Kilburn*

Primary memory as a cache for secondary memory

User sees  $32 \times 6 \times 512$  words of storage



# Hardware Organization of Atlas



Compare the effective page address against all 32 PARs

match  $\Rightarrow$  normal access

no match  $\Rightarrow$  page fault

save the state of the partially executed instruction

February 25, 2021

MIT 6.823 Spring 2021

L04-21 February 25, 2021

MIT 6.823 Spring 2021

L04-22

## Atlas Demand Paging Scheme

- On a page fault:
  - Input transfer into a free page is initiated
  - The Page Address Register (PAR) is updated
  - If no free page is left, a page is selected to be replaced (based on usage)
  - The replaced page is written on the drum
    - to minimize the drum latency effect, the first empty page on the drum was selected
  - The page table is updated to point to the new location of the page on the drum

February 25, 2021

MIT 6.823 Spring 2021

L04-23 February 25, 2021

MIT 6.823 Spring 2021

L04-24

## Caching vs. Demand Paging



## Modern Virtual Memory Systems

*Illusion of a large, private, uniform store*

### Protection & Privacy

several users, each with their private address space and one or more shared address spaces  
page table = name space

### Demand Paging

Provides the ability to run programs larger than the primary memory

Hides differences in machine configurations

*The price is address translation on each memory reference*



February 25, 2021

MIT 6.823 Spring 2021

L04-25 February 25, 2021

## Linear Page Table

- Page Table Entry (PTE) contains:
  - A bit to indicate if a page exists
  - PPN (physical page number) for a memory-resident page
  - DPN (disk page number) for a page on the disk
  - Status bits for protection and usage
- OS sets the Page Table Base Register whenever active user process changes



MIT 6.823 Spring 2021

VIRTUAL address

L04-26

# Size of Linear Page Table

With 32-bit addresses, 4 KB pages & 4-byte PTEs:  
 ⇒  $2^{20}$  PTEs, i.e., 4 MB page table per user  
 ⇒ 4 GB of swap space needed to back up the full virtual address space

Larger pages?

- Internal fragmentation (Not all memory in a page is used)
- Larger page fault penalty (more time to read from disk)

What about 64-bit virtual address space???

- Even 1MB pages would require  $2^{44}$  8-byte PTEs (35 TB!)  
*What is the "saving grace"?*

February 25, 2021

MIT 6.823 Spring 2021

L04-27 February 25, 2021

MIT 6.823 Spring 2021

L04-28

# Address Translation & Protection



- Every instruction and data access needs address translation and protection checks

*A good VM design needs to be fast (~ one cycle) and space-efficient*

February 25, 2021

MIT 6.823 Spring 2021

L04-29 February 25, 2021

MIT 6.823 Spring 2021

L04-29

# TLB Designs

- Typically 32-128 entries, usually highly associative
  - Each entry maps a large page, hence less spatial locality across pages → more likely that two entries conflict
  - Sometimes larger TLBs (256-512 entries) are 4-8 way set-associative
- Random or FIFO replacement policy
- No process information in TLB?
- TLB Reach: Size of largest virtual address space that can be simultaneously mapped by TLB

Example: 64 TLB entries, 4KB pages, one page per entry

TLB Reach = \_\_\_\_\_ ?

February 25, 2021

MIT 6.823 Spring 2021

L04-31 February 25, 2021

MIT 6.823 Spring 2021

L04-32

# Hierarchical Page Table



February 25, 2021

MIT 6.823 Spring 2021

L04-28

# Translation Lookaside Buffers

Address translation is very expensive!  
 In a two-level page table, each reference becomes several memory accesses

Solution: Cache translations in TLB

TLB hit      ⇒ Single-cycle Translation  
 TLB miss      ⇒ Page Table Walk to refill



February 25, 2021

MIT 6.823 Spring 2021

L04-30

# Variable-Sized Page Support



February 25, 2021

MIT 6.823 Spring 2021

L04-32

## Variable-Size Page TLB



Alternatively, have a separate TLB for each page size (pros/cons?)

February 25, 2021

MIT 6.823 Spring 2021

L04-33 February 25, 2021

MIT 6.823 Spring 2021

L04-34

## Handling a TLB Miss

### Software (MIPS, Alpha)

TLB miss causes an exception and the operating system walks the page tables and reloads TLB. A privileged "untranslated" addressing mode used for walk

### Hardware (SPARC v8, x86, PowerPC)

A memory management unit (MMU) walks the page tables and reloads the TLB

If a missing (data or PT) page is encountered during the TLB reloading, MMU gives up and signals a Page-Fault exception for the original instruction

## Hierarchical Page Table Walk: SPARC v8



MMU does this table walk in hardware on a TLB miss

February 25, 2021

MIT 6.823 Spring 2021

L04-35 February 25, 2021

MIT 6.823 Spring 2021

L04-35

## Address Translation: putting it all together



Next lecture:

Modern Virtual Memory Systems

## Modern Virtual Memory Systems

Daniel Sanchez  
Computer Science and Artificial Intelligence Laboratory  
M.I.T.

February 25, 2021

MIT 6.823 Spring 2021

L04-37 March 2, 2021

MIT 6.823 Spring 2021

L05-1

# Recap: Virtual Memory Systems

*Illusion of a large, private, uniform store*

## Protection & Privacy

- several users, each with their private address space and one or more shared address spaces
  - page table = name space

## Demand Paging

- Provides the ability to run programs larger than the primary memory
- Hides differences in machine configurations

*The price is address translation on each memory reference*



March 2, 2021

MIT 6.823 Spring 2021

L05-2 March 2, 2021

MIT 6.823 Spring 2021

L05-3

## Reminder: Translation Lookaside Buffers

Address translation is very expensive!

In a hierarchical page table, each reference becomes several memory accesses

Solution: Cache translations in TLB



## Reminder: TLB Designs

- Typically 32-128 entries, usually highly associative
- Keep process information in TLB?
  - No process id → Must flush on context switch
  - Tag each entry with process id → No flush, but costlier
- TLB Reach: Size of largest virtual address space that can be simultaneously mapped by TLB  
Example: 64 TLB entries, 4KB pages, one page per entry  
TLB Reach = \_\_\_\_\_?
- Ways to increase TLB reach
  - Multi-level TLBs (e.g., Intel Skylake: 64-entry L1 data TLB, 128-entry L1 instruction TLB, 1.5K-entry L2 TLB)
  - Multiple page sizes (e.g., x86-64: 4KB, 2MB, 1GB)

March 2, 2021

MIT 6.823 Spring 2021

L05-4 March 2, 2021

MIT 6.823 Spring 2021

L05-5

## Variable-Sized Page Support



## Variable-Size Page TLB



Alternatively, have a separate TLB for each page size (pros/cons?)

March 2, 2021

MIT 6.823 Spring 2021

L05-6 March 2, 2021

MIT 6.823 Spring 2021

L05-7

## Handling a TLB Miss

### Software (MIPS, Alpha)

TLB miss causes an exception and the operating system walks the page tables and reloads TLB. A privileged "untranslated" addressing mode used for walk

### Hardware (SPARC v8, x86, PowerPC)

A memory management unit (MMU) walks the page tables and reloads the TLB

If a missing (data or PT) page is encountered during the TLB reloading, MMU gives up and signals a Page-Fault exception for the original instruction

# Hierarchical Page Table Walk: SPARC v8



MMU does this table walk in hardware on a TLB miss

March 2, 2021

MIT 6.823 Spring 2021

L05-8 March 2, 2021

# Address Translation: putting it all together



## Topics

- Interrupts
- Speeding up the common case:
  - TLB & Cache organization
- Speeding up page table walks
- Modern Usage

March 2, 2021

MIT 6.823 Spring 2021

L05-10 March 2, 2021

MIT 6.823 Spring 2021

L05-11

## Interrupts: altering the normal flow of control



An *external or internal event* that needs to be processed by another (system) program. The event is usually unexpected or rare from program's point of view.

## Causes of Interrupts

Interrupt: an event that requests the attention of the processor

- Asynchronous: an *external event*
  - input/output device service-request
  - timer expiration
  - power disruptions, hardware failure
- Synchronous: an *internal event (a.k.a. exception)*
  - undefined opcode, privileged instruction
  - arithmetic overflow, FPU exception
  - misaligned memory access
  - *virtual memory exceptions*: page faults, TLB misses, protection violations
  - *traps*: system calls, e.g., jumps into kernel

March 2, 2021

MIT 6.823 Spring 2021

L05-12 March 2, 2021

MIT 6.823 Spring 2021

L05-13

## Asynchronous Interrupts Invoking the interrupt handler

- An I/O device requests attention by asserting one of the *prioritized interrupt request lines*
- When the processor decides to process interrupt
  - It stops the current program at instruction I<sub>i</sub>, completing all the instructions up to I<sub>i-1</sub> (*precise interrupt*)
  - It saves the PC of instruction I<sub>i</sub> in a special register (EPC)
  - It disables interrupts and transfers control to a designated interrupt handler running in kernel mode

# Interrupt Handler

- Saves EPC before enabling interrupts to allow nested interrupts ⇒
  - need an instruction to move EPC into GPRs
  - need a way to mask further interrupts at least until EPC can be saved
- Needs to read a *status register* that indicates the cause of the interrupt
- Uses a special indirect jump instruction RFE (*return-from-exception*) that
  - enables interrupts
  - restores the processor to the user mode
  - restores hardware status and control state

March 2, 2021

MIT 6.823 Spring 2021

L05-14 March 2, 2021

MIT 6.823 Spring 2021

L05-15

# Synchronous Interrupts

- A synchronous interrupt (exception) is caused by a *particular instruction*
- In general, the instruction cannot be completed and needs to be *restarted* after the exception has been handled
  - With pipelining, requires undoing the effect of one or more partially executed instructions
- In case of a trap (system call), the instruction is considered to have been completed
  - A special jump instruction involving a change to privileged kernel mode

# Topics

- Interrupts
- Speeding up the common case:
  - TLB & Cache organization
- Speeding up page table walks
- Modern Usage

March 2, 2021

MIT 6.823 Spring 2021

L05-16 March 2, 2021

MIT 6.823 Spring 2021

L05-17

# Address Translation in CPU



- Software handlers need a *restartable* exception on page fault or protection violation
- Handling a TLB miss needs a *hardware* or *software* mechanism to refill TLB
- Need mechanisms to cope with the additional latency of TLB:
  - slow down the clock
  - pipeline the TLB and cache access
  - virtual-address caches
  - parallel TLB/cache access

# Virtual-Address Caches



Alternative: place the cache before the TLB



- one-step process in case of a hit (+)
- cache needs to be flushed on a context switch unless address space identifiers (ASIDs) included in tags (-)
- aliasing problems due to the sharing of pages (-)

# Aliasing in Virtual-Address Caches



Two virtual pages share one physical page

| Page Table      | Tag             | Data                   |
|-----------------|-----------------|------------------------|
| VA <sub>1</sub> | VA <sub>1</sub> | 1st Copy of Data at PA |
|                 |                 |                        |
| VA <sub>2</sub> | VA <sub>2</sub> | 2nd Copy of Data at PA |

Virtual cache can have two copies of same physical data. Writes to one copy not visible to reads of other!

General Solution: *Disallow aliases to coexist in cache*

Software (i.e., OS) solution for direct-mapped cache

VAs of shared pages must agree in cache index bits; this ensures all VAs accessing same PA will conflict in direct-mapped cache (early SPARCcs)

March 2, 2021

MIT 6.823 Spring 2021

L05-18 March 2, 2021

MIT 6.823 Spring 2021

L05-19

## Concurrent Access to TLB & Cache



Index L is available without consulting the TLB

$\Rightarrow$  cache and TLB accesses can begin simultaneously

Tag comparison is made after both accesses are completed

When does this work?  $L + b < k \_\_ L + b = k \_\_ L + b > k \_\_$

March 2, 2021

MIT 6.823 Spring 2021

L05-20 March 2, 2021

MIT 6.823 Spring 2021

L05-21

## Virtual-Index Physical-Tag Caches: Associative Organization



After the PPN is known,  $2^a$  physical tags are compared

Is this scheme realistic?

March 2, 2021

MIT 6.823 Spring 2021

L05-22 March 2, 2021

MIT 6.823 Spring 2021

L05-23

## Anti-Aliasing Using L2: MIPS R10000



March 2, 2021

MIT 6.823 Spring 2021

L05-24 March 2, 2021

MIT 6.823 Spring 2021

L05-25

## Concurrent Access to TLB & Large L1

The problem with L1 > Page size



Can VA<sub>1</sub> and VA<sub>2</sub> both map to PA?

## A solution via Second-Level Cache



Usually a common L2 cache backs up both Instruction and Data L1 caches

L2 is "inclusive" of both Instruction and Data caches

## Virtually Addressed L1: Anti-Aliasing using L2



March 2, 2021

MIT 6.823 Spring 2021

L05-24 March 2, 2021

MIT 6.823 Spring 2021

L05-25

# Topics

- Interrupts
- Speeding up the common case:
  - TLB & Cache organization
- Speeding up page table walks
- Modern Usage

March 2, 2021

MIT 6.823 Spring 2021

L05-26 March 2, 2021

MIT 6.823 Spring 2021

L05-27

## Translation for Page Tables

- Can references to page tables cause TLB misses?
- Can this go on forever?



A program that traverses the page table needs a "no translation" addressing mode.

March 2, 2021

MIT 6.823 Spring 2021

L05-28 March 2, 2021

MIT 6.823 Spring 2021

L05-29

## Atlas Revisited

- One PAR for each physical page
- PAR's contain the VPN's of the pages *resident in primary memory*
- *Advantage:* The size is proportional to the size of the primary memory
- *What is the disadvantage?*



March 2, 2021

MIT 6.823 Spring 2021

L05-30 March 2, 2021

MIT 6.823 Spring 2021

L05-31

## Page Fault Handler

- When the referenced page is not in DRAM:
  - The missing page is located (or created)
  - It is brought in from disk, and page table is updated
    - Another job may be run on the CPU while the first job waits for the requested page to be read from disk*
  - If no free pages are left, a page is swapped out
    - Pseudo-LRU replacement policy*
- Since it takes a long time to transfer a page (msecs), page faults are handled completely in software by the OS
  - Untranslated addressing mode is essential to allow kernel to access page tables

March 2, 2021

MIT 6.823 Spring 2021

L05-26 March 2, 2021

MIT 6.823 Spring 2021

L05-27

## Swapping a Page of a Page Table

- A PTE in primary memory contains primary or secondary memory addresses
  - A PTE in secondary memory contains *only* secondary memory addresses
- ⇒ a page of a PT can be swapped out only if none of its PTE's point to pages in the primary memory

Why? \_\_\_\_\_

March 2, 2021

MIT 6.823 Spring 2021

L05-28 March 2, 2021

MIT 6.823 Spring 2021

L05-29

## Hashed Page Table: Approximating Associative Addressing



March 2, 2021

MIT 6.823 Spring 2021

L05-30 March 2, 2021

MIT 6.823 Spring 2021

L05-31

## Virtual Memory Use Today - 1

- Desktop/server/cellphone processors have full demand-paged virtual memory
  - Portability between machines with different memory sizes
  - Protection between multiple users or multiple tasks
  - Share small physical memory among active tasks
  - Simplifies implementation of some OS features
- Vector supercomputers and GPUs have translation and protection but not demand paging  
(Older Crays: base&bound, Japanese & Cray X1: pages)
  - Don't waste expensive processor time thrashing to disk (make jobs fit in memory)
  - Mostly run in batch mode (run set of jobs that fits in memory)
  - Difficult to implement restartable vector instructions

March 2, 2021

MIT 6.823 Spring 2021

L05-32 March 2, 2021

MIT 6.823 Spring 2021

L05-33

## Virtual Memory Use Today - 2

- Most embedded processors and DSPs provide physical addressing only
  - Can't afford area/speed/power budget for virtual memory support
  - Often there is no secondary storage to swap to!
  - Programs custom-written for particular memory configuration in product
  - Difficult to implement restartable instructions for exposed architectures

*Next lecture: Pipelining!*

March 2, 2021

MIT 6.823 Spring 2021

L05-34 March 2, 2021

MIT 6.823 Spring 2021

L05-35

## Global System Address Space



- Level A maps users' address spaces into the global space providing privacy, protection, sharing etc.
- Level B provides demand paging for the large global system address space
- Level A and Level B translations may be kept in separate TLB's

## Hashed Page Table Walk: PowerPC Two-level, Segmented Addressing



March 2, 2021

MIT 6.823 Spring 2021

L05-36 March 2, 2021

## Power PC: Hashed Page Table



MIT 6.823 Spring 2021

L05-37

## Reminder: Harvard-Style Single-Cycle Datapath for MIPS

# Instruction Pipelining and Hazards

Daniel Sanchez

Computer Science and Artificial Intelligence Laboratory  
M.I.T.



March 4, 2021

MIT 6.823 Spring 2021

L06-1 March 4, 2021

MIT 6.823 Spring 2021

L06-2

## Reminder: Princeton Microarchitecture Datapath & Control for 2 cycles-per-instruction



March 4, 2021

MIT 6.823 Spring 2021

L06-3 March 4, 2021

MIT 6.823 Spring 2021

L06-4

## Princeton Microarchitecture (redrawn)



Only one of the phases is active in any cycle  
⇒ a lot of datapath not used at any given time

## Princeton Microarchitecture Overlapped execution



Can we overlap instruction fetch and execute?

Which action should be prioritized?

What do we do with Fetch?

March 4, 2021

MIT 6.823 Spring 2021

L06-5 March 4, 2021

MIT 6.823 Spring 2021

L06-6

## Stalling the instruction fetch Princeton Microarchitecture



What if IR contains a jump or branch instruction?

# Need to stall on branches

Princeton Microarchitecture



When IR contains a jump or taken branch

- no "structural conflict" for the memory
- but we do not have the correct PC value in the PC
- memory cannot be used – Address Mux setting is irrelevant
- insert a nop in the IR
- insert the nextPC (branch-target) address in the PC

March 4, 2021

MIT 6.823 Spring 2021

L06-7

# Pipelined Princeton Microarchitecture



March 4, 2021

MIT 6.823 Spring 2021

L06-8

# Pipelined Princeton: Control Table

| Opcode              | Stall | Ext Sel          | B Src | Op Sel | Mem W | WB W | Reg W | WB Src | Reg Dst | PC Src1 | PC Src2 | IR Src | MAddr Src |
|---------------------|-------|------------------|-------|--------|-------|------|-------|--------|---------|---------|---------|--------|-----------|
| ALU                 | no    | *                | Reg   | Func   | no    | yes  | ALU   | rd     | pc+4    | npc     | mem     | pc     |           |
| ALUi                | no    | sE <sub>16</sub> | Imm   | Op     | no    | yes  | ALU   | rt     | pc+4    | npc     | mem     | pc     |           |
| ALUiu               | no    | uE <sub>16</sub> | Imm   | Op     | no    | yes  | ALU   | rt     | pc+4    | npc     | mem     | pc     |           |
| LW                  | yes   | sE <sub>16</sub> | Imm   | +      | no    | yes  | Mem   | rt     | pc+4    | pc      | nop     | ALU    |           |
| SW                  | yes   | sE <sub>16</sub> | Imm   | +      | yes   | no   | *     | *      | pc+4    | pc      | nop     | ALU    |           |
| BEQZ <sub>z=1</sub> | yes   | sE <sub>16</sub> | *     | 0?     | no    | no   | *     | *      | br      | npc     | nop     | *      |           |
| BEQZ <sub>z=0</sub> | no    | sE <sub>16</sub> | *     | 0?     | no    | no   | *     | *      | pc+4    | npc     | mem     | pc     |           |
| J                   | yes   | *                | *     | *      | no    | no   | *     | *      | jabs    | npc     | nop     | *      |           |
| JAL                 | yes   | *                | *     | *      | no    | yes  | PC    | R31    | jabs    | npc     | nop     | *      |           |
| JR                  | yes   | *                | *     | *      | no    | no   | *     | *      | rind    | npc     | nop     | *      |           |
| JALR                | yes   | *                | *     | *      | no    | yes  | PC    | R31    | rind    | npc     | nop     | *      |           |
| NOP                 | no    | *                | *     | *      | no    | no   | *     | *      | pc+4    | npc     | mem     | pc     |           |

BSrc = Reg / Imm ; WBSrc = ALU / Mem / PC; IRSrc = nop/mem; MAddSrc = pc/ALU

RegDst = rt / rd / R31; PCSrc1 = pc+4 / br / rind / jabs; PCSrc2 = pc/nPC

stall & IRSrc columns are identical

March 4, 2021

MIT 6.823 Spring 2021

L06-9

MIT 6.823 Spring 2021

L06-10

# Pipelined Princeton Architecture

Clock:  $t_{C-Princeton} > t_{RF} + t_{ALU} + t_M + t_{WB}$

CPI:  $(1 - f) + 2f$  cycles per instruction  
where  $f$  is the fraction of instructions that cause a stall

What is a likely value of  $f$ ?

# An Ideal Pipeline



- All objects go through the same stages
- No sharing of resources between any two stages
- Propagation delay through all pipeline stages is equal
- The scheduling of an object entering the pipeline is not affected by the objects in other stages

These conditions generally hold for industrial assembly lines.

But what about an instruction pipeline?

# Pipelined Datapath



Clock period can be reduced by dividing the execution of an instruction into multiple cycles

$t_C > \max \{t_{IM}, t_{RF}, t_{ALU}, t_{DM}, t_{RW}\}$  ( $= t_{DM}$  probably)

However, CPI will increase unless instructions are pipelined

March 4, 2021

MIT 6.823 Spring 2021

L06-11

MIT 6.823 Spring 2021

L06-12

## How to divide datapath into stages

Suppose memory is significantly slower than other stages. For example, suppose

$$\begin{aligned} t_{IM} &= 10 \text{ units} \\ t_{DM} &= 10 \text{ units} \\ t_{ALU} &= 5 \text{ units} \\ t_{RF} &= 1 \text{ unit} \\ t_{RW} &= 1 \text{ unit} \end{aligned}$$

Since the slowest stage determines the clock, it may be possible to combine some stages without any loss of performance

## Alternative Pipelining



Write-back stage takes much less time than other stages.  
Suppose we combined it with the memory phase

March 4, 2021

MIT 6.823 Spring 2021

L06-13 March 4, 2021

MIT 6.823 Spring 2021

L06-14

## Maximum Speedup by Pipelining

| Assumptions                                                                                 | Unpipelined $t_c$ | Pipelined Speedup $t_c$ |
|---------------------------------------------------------------------------------------------|-------------------|-------------------------|
| 1. $t_{IM} = t_{DM} = 10$ ,<br>$t_{ALU} = 5$ ,<br>$t_{RF} = t_{RW} = 1$<br>4-stage pipeline |                   |                         |
| 2. $t_{IM} = t_{DM} = t_{ALU} = t_{RF} = t_{RW} = 5$<br>4-stage pipeline                    |                   |                         |
| 3. $t_{IM} = t_{DM} = t_{ALU} = t_{RF} = t_{RW} = 5$<br>5-stage pipeline                    |                   |                         |

What seems to be the message here?

## 5-Stage Pipelined Execution Instruction Flow Diagram



March 4, 2021

MIT 6.823 Spring 2021

L06-15 March 4, 2021

MIT 6.823 Spring 2021

L06-15

## 5-Stage Pipelined Execution Resource Usage Diagram



March 4, 2021

MIT 6.823 Spring 2021

L06-17 March 4, 2021

MIT 6.823 Spring 2021

L06-17

## Pipelined Execution ALU Instructions



Not quite correct!

March 4, 2021

MIT 6.823 Spring 2021

L06-18 March 4, 2021

MIT 6.823 Spring 2021

L06-18

## Pipelined MIPS Datapath without jumps



What else is needed?

March 4, 2021

MIT 6.823 Spring 2021

L06-19 March 4, 2021

MIT 6.823 Spring 2021

L06-20

## Data Hazards



March 4, 2021

MIT 6.823 Spring 2021

L06-21 March 4, 2021

MIT 6.823 Spring 2021

L06-22

## Resolving Data Hazards

Strategy 1: Wait for the result to be available by freezing earlier pipeline stages → stall

Strategy 2: Route data as soon as possible after it is calculated to the earlier pipeline stage → bypass

Strategy 3: Speculate on the dependence  
Two cases:  
Guessed correctly → do nothing  
Guessed incorrectly → kill and restart

## Resolving Data Hazards (1)

### Strategy 1:

Wait for the result to be available by freezing earlier pipeline stages → stall

## Feedback to Resolve Hazards



- Later stages provide dependence information to earlier stages which can stall (or kill) instructions
- Controlling a pipeline in this manner works provided the instruction at stage  $i+1$  can complete without any interference from instructions in stages 1 to  $i$  (otherwise deadlocks may occur)

March 4, 2021

MIT 6.823 Spring 2021

L06-23 March 4, 2021

MIT 6.823 Spring 2021

L06-24

# Resolving Data Hazards by Stalling

## Stall Condition



March 4, 2021

MIT 6.823 Spring 2021

L06-25 March 4, 2021

## Stall Control Logic



Compare the *source registers* of the instruction in the decode stage with the *destination register* of the *uncommitted instructions*.

March 4, 2021

MIT 6.823 Spring 2021

L06-27 March 4, 2021

MIT 6.823 Spring 2021

L06-26

## Stall Control Logic ignoring jumps & branches



Should we always stall if the rs field matches some rd?

March 4, 2021

MIT 6.823 Spring 2021

MIT 6.823 Spring 2021

L07-28

## Source & Destination Registers

R-type: 

|    |    |    |    |      |
|----|----|----|----|------|
| op | rs | rt | rd | func |
|----|----|----|----|------|

I-type: 

|    |    |    |             |
|----|----|----|-------------|
| op | rs | rt | immediate16 |
|----|----|----|-------------|

J-type: 

|    |             |  |  |
|----|-------------|--|--|
| op | immediate26 |  |  |
|----|-------------|--|--|

source(s) destination

|      |                                                 |        |    |
|------|-------------------------------------------------|--------|----|
| ALU  | $rd \leftarrow (rs) \text{ func } (rt)$         | rs, rt | rd |
| ALUi | $rt \leftarrow (rs) \text{ op imm}$             | rs     | rt |
| LW   | $rt \leftarrow M [(rs) + imm]$                  | rs     | rt |
| SW   | $M [(rs) + imm] \leftarrow (rt)$                | rs, rt |    |
| BZ   | cond (rs)                                       |        |    |
|      | true: $PC \leftarrow (PC) + imm$                | rs     |    |
|      | false: $PC \leftarrow (PC) + 4$                 | rs     |    |
| J    | $PC \leftarrow (PC) + imm$                      | rs     | 31 |
| JAL  | $r31 \leftarrow (PC), PC \leftarrow (PC) + imm$ | rs     | 31 |
| JR   | $PC \leftarrow (rs)$                            | rs     |    |
| JALR | $r31 \leftarrow (PC), PC \leftarrow (rs)$       | rs     | 31 |

March 4, 2021

MIT 6.823 Spring 2021

L07-29 March 4, 2021

MIT 6.823 Spring 2021

L07-30

## Stalled Stages and Pipeline Bubbles



Resource Usage

$\text{nop} \Rightarrow \text{pipeline bubble}$

L06-27

## Deriving the Stall Signal

$C_{\text{dest}}$

$ws = \text{Case opcode}$

|           |                   |
|-----------|-------------------|
| ALU       | $\Rightarrow rd$  |
| ALUi, LW  | $\Rightarrow rt$  |
| JAL, JALR | $\Rightarrow R31$ |

$we = \text{Case opcode}$

|               |                           |
|---------------|---------------------------|
| ALU, ALUi, LW | $\Rightarrow (ws \neq 0)$ |
| JAL, JALR     | $\Rightarrow on$          |
| ...           | $\Rightarrow off$         |

$C_{\text{re}}$

$re1 = \text{Case opcode}$

|                                                 |                   |
|-------------------------------------------------|-------------------|
| ALU, ALUi,<br>LW, SW, BZ,<br>JR, JALR<br>J, JAL | $\Rightarrow on$  |
|                                                 | $\Rightarrow off$ |

$re2 = \text{Case opcode}$

|                |                   |
|----------------|-------------------|
| ALU, SW<br>... | $\Rightarrow on$  |
|                | $\Rightarrow off$ |

$C_{\text{stall}}$

$$\text{stall} = ((rs_D == ws_E) \cdot we_E + (rs_D == ws_M) \cdot we_M + (rs_D == ws_W) \cdot we_W) \cdot re1_D + ((rt_D == ws_E) \cdot we_E + (rt_D == ws_M) \cdot we_M + (rt_D == ws_W) \cdot we_W) \cdot re2_D$$

This is not the full story!

## Hazards due to Loads & Stores



## Load & Store Hazards

$\dots$   
 $M[(r1)+7] \leftarrow (r2)$   
 $r4 \leftarrow M[(r3)+5]$   
 $\dots$

$(r1)+7 = (r3)+5 \Rightarrow \text{data hazard}$

However, the hazard is avoided because our memory system completes writes in one cycle!

Load/Store hazards are sometimes resolved in the pipeline and sometimes in the memory system itself.

More on this later in the course.

Next lecture:  
 Control Hazards,  
 Bypassing,  
 and Speculation

## Instruction Pipelining: Hazard Resolution, Timing Constraints

Daniel Sanchez

Computer Science and Artificial Intelligence Laboratory  
 M.I.T.

March 4, 2021

MIT 6.823 Spring 2021

L06-33   March 11, 2021

MIT 6.823 Spring 2021

L07-1

## Resolving Data Hazards

Strategy 1: Wait for the result to be available by freezing earlier pipeline stages  $\rightarrow$  stall

Strategy 2: Route data as soon as possible after it is calculated to the earlier pipeline stage  $\rightarrow$  bypass

Strategy 3: Speculate on the dependence  
 Two cases:  
 Guessed correctly  $\rightarrow$  no special action required  
 Guessed incorrectly  $\rightarrow$  kill and restart

## Reminder: Stall Control Logic *ignoring jumps & branches*



Stall DEC & IF when instruction in DEC reads a register that is written by any earlier in-flight instruction (in EXE, MEM, or WB)

March 11, 2021

MIT 6.823 Spring 2021

L07-2   March 11, 2021

MIT 6.823 Spring 2021

L07-3

## Reminder: Load & Store Hazards



March 11, 2021

MIT 6.823 Spring 2021

L07-4 March 11, 2021

MIT 6.823 Spring 2021

L07-5

## Bypassing



Each stall or kill introduces a bubble  $\Rightarrow CPI > 1$

When is data actually available?



A new datapath, i.e., a bypass, can get the data from the output of the ALU to its input

March 11, 2021

MIT 6.823 Spring 2021

L07-6 March 11, 2021

MIT 6.823 Spring 2021

L07-7

## The Bypass Signal

Deriving it from the Stall Signal

$$\text{stall} = ((rs_D == ws_E) \cdot we_E + (rs_D == ws_M) \cdot we_M + (rs_D == ws_W) \cdot we_W) \cdot re1_D + ((rt_D == ws_E) \cdot we_E + (rt_D == ws_M) \cdot we_M + (rt_D == ws_W) \cdot we_W) \cdot re2_D$$

ws = Case opcode  
ALU  $\Rightarrow$  rd  
ALUi, LW  $\Rightarrow$  rt  
JAL, JALR  $\Rightarrow$  R31

we = Case opcode  
ALU, ALUi, LW  $\Rightarrow$  (ws  $\neq$  0)  
JAL, JALR  $\Rightarrow$  on  
 $\dots$   $\Rightarrow$  off

$$ASrc = (rs_D == ws_E) \cdot we_E \cdot re1_D$$

Is this correct?

How might we address this?

March 11, 2021

MIT 6.823 Spring 2021

L07-8 March 11, 2021

MIT 6.823 Spring 2021

L07-9

## Resolving Data Hazards (2)

### Strategy 2:

Route data as soon as possible after it is calculated to the earlier pipeline stage  $\rightarrow$  bypass

## Adding a Bypass



MIT 6.823 Spring 2021

MIT 6.823 Spring 2021

L07-7

## Bypass and Stall Signals

Split we<sub>E</sub> into two components: we-bypass, we-stall

we-bypass<sub>E</sub> = Case opcode<sub>E</sub>  
ALU, ALUi  $\Rightarrow$  (ws  $\neq$  0)  
...  $\Rightarrow$  off

we-stall<sub>E</sub> = Case opcode<sub>E</sub>  
LW  $\Rightarrow$  (ws  $\neq$  0)  
JAL, JALR  $\Rightarrow$  on  
...  $\Rightarrow$  off

$$ASrc = (rs_D == ws_E) \cdot we-bypass_E \cdot re1_D$$

$$\text{stall} = ((rs_D == ws_E) \cdot we-stall_E + (rs_D == ws_M) \cdot we_M + (rs_D == ws_W) \cdot we_W) \cdot re1_D + ((rt_D == ws_E) \cdot we_E + (rt_D == ws_M) \cdot we_M + (rt_D == ws_W) \cdot we_W) \cdot re2_D$$

## Fully Bypassed Datapath



Is there still a need for the stall signal?

## Resolving Data Hazards (3)

### Strategy 3:

*Speculate on the dependence. Two cases:*

Guessed correctly → no special action required

Guessed incorrectly → kill and restart

March 11, 2021

MIT 6.823 Spring 2021

L07-10 March 11, 2021

MIT 6.823 Spring 2021

L07-11

## Instruction to Instruction Dependence

- What do we need to calculate next PC?
  - For Jumps
    - Opcode, offset, and PC
  - For Jump Register
    - Opcode and register value
  - For Conditional Branches
    - Opcode, offset, PC, and register (for condition)
  - For all others
    - Opcode and PC
- In what stage do we know these?
  - PC → Fetch
  - Opcode, offset → Decode (or Fetch?)
  - Register value → Decode
  - Branch condition ( $(rs) == 0$ ) → Execute (or Decode?)

March 11, 2021

MIT 6.823 Spring 2021

L07-12 March 11, 2021

MIT 6.823 Spring 2021

L07-13

## Speculate NextPC is PC+4



I<sub>1</sub> 096 ADD  
I<sub>2</sub> 100 J 200  
I<sub>3</sub> 104 ADD  
I<sub>4</sub> 304 ADD

What happens on mis-speculation,  
i.e., when next instruction is not PC+4?

How?

March 11, 2021

MIT 6.823 Spring 2021

L07-14 March 11, 2021

MIT 6.823 Spring 2021

L07-15

## NextPC Calculation Bubbles



What's a good guess for next PC?

## Pipelining Jumps



To kill a fetched instruction -- Insert a nop in IR  
 $IR_{Src_D} = \begin{cases} Case \text{ opcode}_D \\ J, JAL & \Rightarrow \text{nop} \\ \dots & \Rightarrow \text{IM} \end{cases}$

Any interaction between stall and jump?

March 11, 2021

MIT 6.823 Spring 2021

L07-14 March 11, 2021

MIT 6.823 Spring 2021

L07-15

## Jump Pipeline Diagrams



March 11, 2021

MIT 6.823 Spring 2021

L07-16 March 11, 2021

MIT 6.823 Spring 2021

L07-17

## Pipelining Conditional Branches



I<sub>1</sub> 096 ADD  
 I<sub>2</sub> 100 BEQZ r1 200  
 I<sub>3</sub> 104 ADD  
 I<sub>4</sub> 304 ADD

Branch condition is not known until the execute stage  
*what action should be taken in the decode stage?*

## Pipelining Conditional Branches



If the branch is taken  
 - kill the two following instructions  
 - the instruction at the decode stage is not valid  
 $\Rightarrow$  stall signal is not valid

MIT 6.823 Spring 2021

I<sub>1</sub> 096 ADD  
 I<sub>2</sub> 100 BEQZ r1 200  
 I<sub>3</sub> 104 ADD  
 I<sub>4</sub> 304 ADD

## Pipelining Conditional Branches



I<sub>1</sub> 096 ADD  
 I<sub>2</sub> 100 BEQZ r1 200  
 I<sub>3</sub> 104 ADD  
 I<sub>4</sub> 304 ADD

If the branch is taken  
 - kill the two following instructions  
 - the instruction at the decode stage is not valid  
 $\Rightarrow$  stall signal is not valid

MIT 6.823 Spring 2021

L07-19

## New Stall Signal

```
stall = ( ((rsD==wsE)·weE + (rsD==wsM)·weM + (rsD==wsW)·weW)·re1D
  + ((rtD==wsE)·weE + (rtD==wsM)·weM + (rtD==wsW)·weW)·re2D
) · !(opcodeE==BEQZ)·z + (opcodeE==BNEZ)·lz
```

Don't stall if the branch is taken. Why?

## Control Equations for PC and IR Muxes

IRSrc<sub>D</sub> = Case opcode<sub>E</sub>  
 BEQZ·z, BNEZ·lz  $\Rightarrow$  nop  
 ...  
 Case opcode<sub>D</sub>  
 J, JAL, JR, JALR  $\Rightarrow$  nop  
 ...  $\Rightarrow$  IM

Give priority to the older instruction, i.e., execute stage instruction over decode stage instruction

IRSrc<sub>E</sub> = Case opcode<sub>E</sub>  
 BEQZ·z, BNEZ·lz  $\Rightarrow$  nop  
 ...  $\Rightarrow$  stall·nop + !stall·IR<sub>D</sub>

PCSrc = Case opcode<sub>E</sub>  
 BEQZ·z, BNEZ·!z  $\Rightarrow$  br  
 ...  
 Case opcode<sub>D</sub>  
 J, JAL  
 JR, JALR  $\Rightarrow$  jabs  
 ...  $\Rightarrow$  rind  
 ...  $\Rightarrow$  pc+4

pc+4 is a speculative guess

nop  $\Rightarrow$  Kill  
 br/jabs/rind  $\Rightarrow$  Restart  
 pc+4  $\Rightarrow$  Speculate

March 11, 2021

MIT 6.823 Spring 2021

L07-20 March 11, 2021

MIT 6.823 Spring 2021

L07-21

## Branch Pipeline Diagrams (resolved in execute stage)



March 11, 2021

MIT 6.823 Spring 2021

L07-22 March 11, 2021

MIT 6.823 Spring 2021

L07-23

## Branch Delay Slots (expose control hazard to software)

- Change the ISA semantics so that the instruction that follows a jump or branch is always executed
  - gives compiler the flexibility to put in a useful instruction where normally a pipeline bubble would have resulted.



- Other techniques include branch prediction, which can dramatically reduce the branch penalty... *to come later*

March 11, 2021

MIT 6.823 Spring 2021

L07-24 March 11, 2021

MIT 6.823 Spring 2021

L07-25

## Handling Control Hazards due to Exceptions



- Typical strategy: Record exceptions, process the first one to reach commit point (i.e., the point where architectural state is modified)
  - Pros/cons vs handling exceptions eagerly, like branches?

March 11, 2021

MIT 6.823 Spring 2021

L07-26 March 11, 2021

MIT 6.823 Spring 2021

L07-27

## Reducing Branch Penalty (resolve in decode stage)

- One pipeline bubble can be removed if an extra comparator is used in the Decode stage



MIT 6.823 Spring 2021

L07-28

## Handling Control Hazards due to Exceptions



- Instructions may suffer exceptions in different pipeline stages
- Must prioritize exceptions from earlier instructions

MIT 6.823 Spring 2021

L07-29

## Why an instruction may not be dispatched every cycle (CPI>1)

- Full bypassing may be too expensive to implement
  - Typically all frequently used paths are provided
  - Some infrequently used bypass paths may increase cycle time and counteract the benefit of reducing CPI
- Loads have two-cycle latency
  - Instruction after load cannot use load result
  - MIPS-I ISA defined *load delay slots*, a software-visible pipeline hazard (compiler schedules independent instruction or inserts NOP to avoid hazard). Removed in MIPS-II.
- Conditional branches, jumps, and exceptions may cause bubbles
  - Kill instruction(s) following branch if no delay slots

*Machines with software-visible delay slots may execute significant number of NOP instructions inserted by the compiler.*

MIT 6.823 Spring 2021

L07-30

## Next lecture: Superscalar & Scoreboarded Pipelines

# Complex Pipelining

Daniel Sanchez

Computer Science and Artificial Intelligence Laboratory  
M.I.T.

March 11, 2021

MIT 6.823 Spring 2021

L07-28 March 16, 2021

MIT 6.823 Spring 2021

L08-1

## Complex Pipelining: Motivation

Instruction pipelining becomes complex when we want high performance in the presence of

- Multi-cycle operations, for example:
  - Full or partially pipelined floating-point units, or
  - Long-latency operations, e.g., divides
- Variable-latency operations, for example:
  - Memory systems with variable access time
- Replicated function units, for example:
  - Multiple floating-point or memory units

## CDC 6600 Seymour Cray, 1963



- A fast pipelined machine with 60-bit words
  - 128 Kword main memory capacity, 32 banks
- Ten functional units (parallel, unpipelined)
  - Floating Point: adder, 2 multipliers, divider
  - Integer: adder, 2 incrementers, ...
- Hardwired control
- Dynamic scheduling of instructions using a scoreboard
- Ten Peripheral Processors for Input/Output
  - A fast multi-threaded 12-bit integer ALU
- Very fast clock, 10 MHz (FP add in 4 clocks)
- >400,000 transistors, 750 sq. ft., 5 tons, 150 kW, new freon-based cooling technology
- Fastest machine in world for 5 years (until CDC 7600)
  - Over 100 sold (\$7-10M each)

March 16, 2021

MIT 6.823 Spring 2021

L08-2 March 16, 2021

MIT 6.823 Spring 2021

L08-3

## CDC 6600: Datapath



## CDC 6600: A Load/Store Architecture

- Separate instructions to manipulate three types of reg.
  - 8 60-bit data registers (X)
  - 8 18-bit address registers (A)
  - 8 18-bit index registers (B)
- All arithmetic and logic instructions are reg-to-reg
 

|        |   |   |   |
|--------|---|---|---|
| 6      | 3 | 3 | 3 |
| opcode | i | j | k |

 $Ri \leftarrow (Rj) op (Rk)$
- Only Load and Store instructions refer to memory!
 

|        |   |   |      |
|--------|---|---|------|
| 6      | 3 | 3 | 18   |
| opcode | i | j | disp |

 $Ri \leftarrow M[(Rj) + disp]$ 
  - Touching address registers 1 to 5 initiates a load
  - 6 to 7 initiates a store
  - very useful for vector operations

March 16, 2021

MIT 6.823 Spring 2021

L08-4 March 16, 2021

MIT 6.823 Spring 2021

L08-5

# CDC6600: Vector Addition

```
B1 ← -n  
loop: JZE B1, exit  
      A1 ← B1 + a1      load into X1  
      A2 ← B1 + b1      load into X2  
      X6 ← X1 + X2  
      A6 ← B1 + c1      store X6  
      B1 ← B1 + 1  
      jump loop
```

We will present complex  
pipelining issues more  
abstractly ...

A<sub>i</sub> = address register  
B<sub>i</sub> = index register  
X<sub>i</sub> = data register

more on vector processing later...

March 16, 2021

MIT 6.823 Spring 2021

L08-6 March 16, 2021

MIT 6.823 Spring 2021

L08-7

## Floating Point ISA

Interaction between the Floating point datapath and the Integer datapath is determined largely by the ISA

### MIPS ISA

- separate register files for FP and Integer instructions  
*the only interaction is via a set of move instructions (some ISAs don't even permit this)*
- separate load/store for FPR's and GPR's but both use GPR's for address calculation
- separate conditions for branches  
FP branches are defined in terms of condition codes

## Floating Point Unit

Much more hardware than an integer unit

Single-cycle floating point unit is a bad idea - *why?*

- it is common to have several floating point units
- it is common to have different types of FPUs  
*Fadd, Fmul, Fdiv, ...*
- an FPU may be pipelined, partially pipelined or not pipelined

*To operate several FPUs concurrently the register file needs to have more read and write ports*

March 16, 2021

MIT 6.823 Spring 2021

L08-8 March 16, 2021

MIT 6.823 Spring 2021

L08-9

## Functional Unit Characteristics



Functional units have internal pipeline registers

- ⇒ operands are latched when an instruction enters a functional unit
- ⇒ inputs to a functional unit (e.g., register file) can change during a long latency operation

## Realistic Memory Systems

Latency of access to the main memory is usually much higher than one cycle and often unpredictable

*Solving this problem is a central issue in computer architecture*

Common approaches to improving memory performance

- separate instruction and data memory ports  
⇒ no self-modifying code
- caches  
*single cycle except in case of a miss ⇒ stall*
- interleaved memory  
*multiple memory accesses ⇒ bank conflicts*
- split-phase memory operations  
⇒ out-of-order responses

March 16, 2021

MIT 6.823 Spring 2021

L08-10 March 16, 2021

MIT 6.823 Spring 2021

L08-11

## Complex Pipeline Structure



March 16, 2021

MIT 6.823 Spring 2021

L08-12 March 16, 2021

MIT 6.823 Spring 2021

L08-13

## Complex Pipeline Control Issues

- Structural hazards at the execution stage if some FPU or memory unit is not pipelined and takes more than one cycle
- Structural hazards at the write-back stage due to variable latencies of different function units
- Out-of-order write hazards due to variable latencies of different function units
- How to handle exceptions?

## Complex In-Order Pipeline



- Delay writeback so all operations have same latency to W stage
  - Write ports never oversubscribed (one inst. in & one inst. out every cycle)

*How to prevent increased writeback latency from slowing down single-cycle integer operations?*

March 16, 2021

MIT 6.823 Spring 2021

L08-14 March 16, 2021

MIT 6.823 Spring 2021

L08-15

## Complex In-Order Pipeline



*How should we handle data hazards for long-latency operations?*

*Exceptions?*

## Superscalar In-Order Pipeline



- Fetch two instructions per cycle; issue both simultaneously if one is integer/memory and other is floating-point
- Inexpensive way of increasing throughput
  - Examples: Alpha 21064 (1992) MIPS R5000 series (1996)
- Can be extended to wider issue but register file ports and bypassing costs grow quickly
  - E.g., 4-issue UltraSPARC

March 16, 2021

MIT 6.823 Spring 2021

L08-16 March 16, 2021

MIT 6.823 Spring 2021

L08-17

## Dependence Analysis

Needed to Exploit Instruction-level Parallelism

## Types of Data Hazards

Consider executing a sequence of

$$r_k \leftarrow (r_i) \text{ op } (r_j)$$

type of instructions

Data-dependence

$$\begin{array}{l} r_3 \leftarrow (r_1) \text{ op } (r_2) \\ r_5 \leftarrow (r_3) \text{ op } (r_4) \end{array}$$

Read-after-Write (RAW) hazard

Anti-dependence

$$\begin{array}{l} r_3 \leftarrow (r_1) \text{ op } (r_2) \\ r_1 \leftarrow (r_4) \text{ op } (r_5) \end{array}$$

Write-after-Read (WAR) hazard

Output-dependence

$$\begin{array}{l} r_3 \leftarrow (r_1) \text{ op } (r_2) \\ r_3 \leftarrow (r_6) \text{ op } (r_7) \end{array}$$

Write-after-Write (WAW) hazard

March 16, 2021

MIT 6.823 Spring 2021

L08-18 March 16, 2021

MIT 6.823 Spring 2021

L08-19

## Register vs. Memory Data Dependences

- Data hazards due to register operands can be determined at the decode stage **but**
- Data hazards due to memory operands can be determined only after computing the effective address

$$\begin{array}{ll} \text{store} & M[(r1) + \text{disp1}] \leftarrow (r2) \\ \text{load} & r3 \leftarrow M[(r4) + \text{disp2}] \end{array}$$

Does  $(r1) + \text{disp1} == (r4) + \text{disp2}$  ?

March 16, 2021

MIT 6.823 Spring 2021

L08-20 March 16, 2021

MIT 6.823 Spring 2021

L08-21

## Instruction Scheduling

|       |       |                    |
|-------|-------|--------------------|
| $I_1$ | DIVD  | $f_6, f_6, f_4$    |
| $I_2$ | LD    | $f_2, 45(r3), f_4$ |
| $I_3$ | MULTD | $f_0, f_2, f_4$    |
| $I_4$ | DIVD  | $f_8, f_6, f_2$    |
| $I_5$ | SUBD  | $f_{10}, f_0, f_6$ |
| $I_6$ | ADDD  | $f_6, f_8, f_2$    |



Valid orderings:

in-order  $I_1, I_2, I_3, I_4, I_5, I_6$

out-of-order  $I_2, I_1, I_3, I_4, I_5, I_6$

out-of-order  $I_1, I_2, I_3, I_5, I_4, I_6$



March 16, 2021

MIT 6.823 Spring 2021

L08-22 March 16, 2021

MIT 6.823 Spring 2021

L08-23

## Detecting Data Hazards

Range and Domain of instruction  $i$

$R(i)$  = Registers (or other storage) modified by instruction  $i$

$D(i)$  = Registers (or other storage) read by instruction  $i$

Suppose instruction  $j$  follows instruction  $i$  in the program order. Executing instruction  $j$  before the effect of instruction  $i$  has taken place can cause a

RAW hazard if  $R(i) \cap D(j) \neq \emptyset$

WAR hazard if  $D(i) \cap R(j) \neq \emptyset$

WAW hazard if  $R(i) \cap R(j) \neq \emptyset$

L08-18 March 16, 2021

MIT 6.823 Spring 2021

L08-19

## Data Hazards: An Example

|       |       |
|-------|-------|
| $I_1$ | DIVD  |
| $I_2$ | LD    |
| $I_3$ | MULTD |
| $I_4$ | DIVD  |
| $I_5$ | SUBD  |
| $I_6$ | ADDD  |



RAW Hazards

WAR Hazards

WAW Hazards

L08-20 March 16, 2021

MIT 6.823 Spring 2021

L08-21

## Out-of-order Completion

In-order Issue

|       |       |                    |  |  |  |  |  |  |  |  |  |  |  | Latency |
|-------|-------|--------------------|--|--|--|--|--|--|--|--|--|--|--|---------|
| $I_1$ | DIVD  | $f_6, f_6, f_4$    |  |  |  |  |  |  |  |  |  |  |  | 4       |
| $I_2$ | LD    | $f_2, 45(r3), f_4$ |  |  |  |  |  |  |  |  |  |  |  | 1       |
| $I_3$ | MULTD | $f_0, f_2, f_4$    |  |  |  |  |  |  |  |  |  |  |  | 3       |
| $I_4$ | DIVD  | $f_8, f_6, f_2$    |  |  |  |  |  |  |  |  |  |  |  | 4       |
| $I_5$ | SUBD  | $f_{10}, f_0, f_6$ |  |  |  |  |  |  |  |  |  |  |  | 1       |
| $I_6$ | ADDD  | $f_6, f_8, f_2$    |  |  |  |  |  |  |  |  |  |  |  | 1       |

| cycle         | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---------------|---|---|---|---|---|---|---|---|---|----|----|----|----|----|----|
| in-order comp | 1 | 2 |   |   | 1 | 2 | 3 | 4 |   | 3  | 5  | 4  | 6  | 5  | 6  |

| out-of-order comp | 1 | 2 | 2 | 3 | 1 | 4 | 3 | 5 | 5 | 4 | 6 | 6 |
|-------------------|---|---|---|---|---|---|---|---|---|---|---|---|
|                   | 1 | 2 | 2 | 3 | 1 | 4 | 3 | 5 | 5 | 4 | 6 | 6 |

What problems can out-of-order comp cause?

March 16, 2021

MIT 6.823 Spring 2021

L08-22 March 16, 2021

MIT 6.823 Spring 2021

L08-23

## Scoreboard: A Hardware Data Structure to Detect Hazards Dynamically

March 16, 2021

MIT 6.823 Spring 2021

L08-24 March 16, 2021

MIT 6.823 Spring 2021

L08-25

## Complex Pipeline



## When is it Safe to Issue an Instruction?

- Approach: Stall issue until sure that issuing will cause no dependence problems...
- Suppose a data structure keeps track of all the instructions in all the functional units
- The following checks need to be made before the Issue stage can dispatch an instruction
  - Is the required function unit available?
  - Is the input data available?  $\Rightarrow$  RAW?
  - Is it safe to write the destination?  $\Rightarrow$  WAR? WAW?
  - Is there a structural conflict at the WB stage?

March 16, 2021

MIT 6.823 Spring 2021

L08-26 March 16, 2021

MIT 6.823 Spring 2021

L08-27

## A Data Structure for Correct Issues *Keeps track of the status of Functional Units*

| Name  | Busy | Op | Dest | Src1 | Src2 |
|-------|------|----|------|------|------|
| Int   |      |    |      |      |      |
| Mem   |      |    |      |      |      |
| Add1  |      |    |      |      |      |
| Add2  |      |    |      |      |      |
| Add3  |      |    |      |      |      |
| Mult1 |      |    |      |      |      |
| Mult2 |      |    |      |      |      |
| Div   |      |    |      |      |      |

*The instruction  $i$  at the Issue stage consults this table*

- FU available?
- RAW?
- WAR?
- WAW?

*An entry is added to the table if no hazard is detected;  
An entry is removed from the table after Write-Back*

## Simplifying the Data Structure Assuming In-order Issue

- Suppose the instruction is not dispatched by the Issue stage
  - If a RAW hazard exists
  - or if the required FU is busy
- Suppose operands are latched by the functional unit on issue

Can the dispatched instruction cause a  
WAR hazard?

WAW hazard?

March 16, 2021

MIT 6.823 Spring 2021

L08-28 March 16, 2021

MIT 6.823 Spring 2021

L08-29

## Simplifying the Data Structure

- No WAR hazard  
 $\Rightarrow$  no need to keep *src1* and *src2*
- The Issue stage does not dispatch an instruction in case of a WAW hazard  
 $\Rightarrow$  a register name can occur at most once in the *dest* column
- WP[reg#]: a bit-vector to record the registers for which writes are pending
  - These bits are set to true by the Issue stage and set to false by the WB stage
  - $\Rightarrow$  Each pipeline stage in the FU's must carry the *dest* field and a flag to indicate if it is valid “the (we, ws) pair”

March 16, 2021

MIT 6.823 Spring 2021

L08-28 March 16, 2021

MIT 6.823 Spring 2021

L08-29

## Scoreboard for In-order Issues

**Busy[FU#]** : a bit-vector to indicate FU's availability.  
(FU = Int, Add, Mult, Div)  
These bits are hardwired to FU's.

**WP[reg#]** : a bit-vector to record the registers for which writes are pending.  
These bits are set to true by the Issue stage and set to false by the WB stage

Issue checks the instruction (opcode dest src1 src2) against the scoreboard (Busy & WP) to dispatch

- FU available?
- RAW?
- WAR?
- WAW?

March 16, 2021

MIT 6.823 Spring 2021

L08-30 March 16, 2021

MIT 6.823 Spring 2021

L08-31

## Preview: Anatomy of a Modern Out-of-Order Superscalar Core



- L08 (Today): Complex pipes w/ in-order issue
- L09: Out-of-order exec & renaming
- L10: Branch prediction
- L11: Speculative execution and recovery
- L12: Advanced Memory Ops

March 16, 2021

MIT 6.823 Spring 2021

L08-32 March 18, 2021

MIT 6.823 Spring 2021

L09-1

## Complex Pipelining: Out-of-Order Execution, Register Renaming, and Exceptions

Daniel Sanchez  
Computer Science and Artificial Intelligence Laboratory  
M.I.T.

## CDC 6600-style Scoreboard

Instructions are issued in order.  
An instruction is issued only if

- It cannot cause a RAW hazard
- It cannot cause a WAW hazard

 $\Rightarrow$  There can be at most one instruction in the execute phase that can write to a particular register

WAR hazards are not possible

- Due to in-order issue + operands read immediately



Scoreboard:  
Two bit-vectors

**Busy[FU#]**: Indicates FU's availability  
These bits are hardwired to FU's.

**WP[reg#]**: Records if a write is pending for a register  
Set to true by the Issue stage and set to false by the WB stage

March 18, 2021

MIT 6.823 Spring 2021

L09-2 March 18, 2021

MIT 6.823 Spring 2021

L09-3

## Scoreboard Dynamics

|                          | Functional Unit Status |        |         |        |     | Registers Reserved for Writes |
|--------------------------|------------------------|--------|---------|--------|-----|-------------------------------|
|                          | Int(1)                 | Add(1) | Mult(3) | Div(4) | WB  |                               |
| t0 <i>I<sub>1</sub></i>  |                        |        |         | f6     |     | f6                            |
| t1 <i>I<sub>2</sub></i>  | f2                     |        |         | f6     |     | f6, f2                        |
| t2                       |                        |        |         | f6     | f2  | f6, f2                        |
| t3 <i>I<sub>3</sub></i>  |                        | f0     |         | f6     |     | f6, f0                        |
| t4                       |                        | f0     |         |        | f6  | f6, f0                        |
| t5 <i>I<sub>4</sub></i>  |                        | f0     | f8      |        |     | f0, f8                        |
| t6                       |                        |        |         | f8     |     | f0, f8                        |
| t7 <i>I<sub>5</sub></i>  | f10                    |        |         | f8     |     | f8, f10                       |
| t8                       |                        |        |         | f8     | f10 | f8, f10                       |
| t9                       |                        |        |         | f8     |     | f8                            |
| t10 <i>I<sub>6</sub></i> | f6                     |        |         |        |     | f6                            |
| t11                      |                        |        |         |        | f6  | f6                            |

|                      |       |      |        |    |
|----------------------|-------|------|--------|----|
| <i>I<sub>1</sub></i> | DIVD  | f6,  | f6,    | f4 |
| <i>I<sub>2</sub></i> | LD    | f2,  | 45(r3) | f4 |
| <i>I<sub>3</sub></i> | MULTD | f0,  | f2,    | f2 |
| <i>I<sub>4</sub></i> | DIVD  | f8,  | f6,    | f2 |
| <i>I<sub>5</sub></i> | SUBD  | f10, | f0,    | f6 |
| <i>I<sub>6</sub></i> | ADD   | f6,  | f8,    | f2 |

MIT 6.823 Spring 2021 L08-31

MIT 6.823 Spring 2021

## Reminder: Scoreboard Dynamics

| Issue time               | Functional Unit Status |        |         |        |     | WP      |
|--------------------------|------------------------|--------|---------|--------|-----|---------|
|                          | Int(1)                 | Add(1) | Mult(3) | Div(4) | WB  | WP      |
| t0 <i>I<sub>1</sub></i>  |                        |        |         | f6     |     | f6      |
| t1 <i>I<sub>2</sub></i>  | f2                     |        |         | f6     |     | f6, f2  |
| t2                       |                        |        |         | f6     | f2  | f6, f2  |
| t3 <i>I<sub>3</sub></i>  |                        | f0     |         | f6     |     | f6, f0  |
| t4                       |                        | f0     |         |        | f6  | f6, f0  |
| t5 <i>I<sub>4</sub></i>  |                        |        | f8      |        |     | f0, f8  |
| t6                       |                        |        |         | f8     |     | f0, f8  |
| t7 <i>I<sub>5</sub></i>  | f10                    |        |         | f8     |     | f8, f10 |
| t8                       |                        |        |         | f8     | f10 | f8, f10 |
| t9                       |                        |        |         | f8     |     | f8      |
| t10 <i>I<sub>6</sub></i> | f6                     |        |         |        |     | f6      |
| t11                      |                        |        |         |        | f6  | f6      |

|                      |       |      |        |    |
|----------------------|-------|------|--------|----|
| <i>I<sub>1</sub></i> | DIVD  | f6,  | f6,    | f4 |
| <i>I<sub>2</sub></i> | LD    | f2,  | 45(r3) | f4 |
| <i>I<sub>3</sub></i> | MULTD | f0,  | f2,    | f2 |
| <i>I<sub>4</sub></i> | DIVD  | f8,  | f6,    | f2 |
| <i>I<sub>5</sub></i> | SUBD  | f10, | f0,    | f6 |
| <i>I<sub>6</sub></i> | ADD   | f6,  | f8,    | f2 |

MIT 6.823 Spring 2021 L09-3

## In-Order Issue Limitations

An example

|   |       |      |        | latency |
|---|-------|------|--------|---------|
| 1 | LD    | F2,  | 34(R2) | 1       |
| 2 | LD    | F4,  | 45(R3) | long    |
| 3 | MULTD | F6,  | F4, F2 | 3       |
| 4 | SUBD  | F8,  | F2, F2 | 1       |
| 5 | DIVD  | F4,  | F2, F8 | 4       |
| 6 | ADDD  | F10, | F6, F4 | 1       |



In-order: 1 (2,1) . . . . . 2 3 4 4 3 5 . . . 5 6 6  
In-order restriction prevents instruction 4 from being dispatched

March 18, 2021

MIT 6.823 Spring 2021

L09-4 March 18, 2021

MIT 6.823 Spring 2021

L09-5

## Out-of-Order Issue

How can we address the delay caused by a RAW dependence associated with the next in-order instruction?



Find something else to do!

- Issue stage buffer holds multiple instructions waiting to issue.
  - Decode adds next instruction to buffer if there is space and the instruction does not cause a WAR or WAW hazard.
  - Can issue any instruction in buffer whose RAW hazards are satisfied (*for now at most one dispatch per cycle*).
- Note:* A writeback (WB) may enable more instructions.

## In-Order Issue Limitations

An example

|   |       |      |        | latency |
|---|-------|------|--------|---------|
| 1 | LD    | F2,  | 34(R2) | 1       |
| 2 | LD    | F4,  | 45(R3) | long    |
| 3 | MULTD | F6,  | F4, F2 | 3       |
| 4 | SUBD  | F8,  | F2, F2 | 1       |
| 5 | DIVD  | F4,  | F2, F8 | 4       |
| 6 | ADDD  | F10, | F6, F4 | 1       |



In-order: 1 (2,1) . . . . . 2 3 4 4 3 5 . . . 5 6 6  
Out-of-order: 1 (2,1) 4 4 . . . . 2 3 5 . 3 . 5 6 6  
WAR/WAW hazards prevent instruction 5 from being dispatched

*Out-of-order execution did not produce a significant improvement!*

March 18, 2021

MIT 6.823 Spring 2021

L09-6 March 18, 2021

MIT 6.823 Spring 2021

L09-7

## How many Instructions can be in the pipeline

Throughput is limited by number of instructions in flight, but which feature of an ISA limits the number of instructions in the pipeline?

\_\_\_\_\_

Out-of-order dispatch by itself does not provide a significant performance improvement!

How can we better understand the impact of number of registers on throughput?

## Little's Law

$$\text{Throughput } (\bar{T}) = \frac{\text{Number in Flight } (\bar{N})}{\text{Latency } (\bar{L})}$$



Example:

4 floating point registers  
8 cycles per floating point operation

⇒

## Overcoming the Lack of Register Names

Floating Point pipelines often cannot be kept filled with small number of registers.

IBM 360 had only 4 Floating Point Registers

Can a microarchitecture use more registers than specified by the ISA without loss of ISA compatibility?

Yes, Robert Tomasulo of IBM suggested an ingenious solution in 1967 based on on-the-fly register renaming

March 18, 2021

MIT 6.823 Spring 2021

L09-8 March 18, 2021

MIT 6.823 Spring 2021

L09-9





## Reminder: Exception Handling (*In-Order Five-Stage Pipeline*)



Hold exception flags in pipeline until commit point (M stage)  
If exception at commit

- If exception at commit:
    - update Cause/EPC registers
    - kill all stages
    - fetch at handler PC

- fetch at Handler PC

Inject external interrupts at commit point

March 18, 2021

MIT 6.823 Spring 2021

## Phases of Instruction Execution



MIT 6.823 Spring 2021

L09-23

# In-Order Commit for Precise Exceptions



- Instructions fetched and decoded into instruction reorder buffer in-order
  - Execution is out-of-order ( $\Rightarrow$  out-of-order completion)
  - *Commit* (write-back to architectural state, i.e., regfile & memory) is in-order

*Temporary storage needed to hold results before commit  
(shadow registers and store buffers)*

March 18, 2021

---

MIT 6.S23 Spring 2021

109-34 March 18, 2021

MIT 6.S23 Spring 2021

09-25

# Extensions for Precise Exceptions

| Inst#                      | use | exec | op | p1 | src1 | p2 | src2 | pd | dest | data | cause |
|----------------------------|-----|------|----|----|------|----|------|----|------|------|-------|
|                            |     |      |    |    |      |    |      |    |      |      |       |
| $\text{ptr}_2 \rightarrow$ |     |      |    |    |      |    |      |    |      |      |       |
| next to<br>commit          |     |      |    |    |      |    |      |    |      |      |       |
|                            |     |      |    |    |      |    |      |    |      |      |       |
| $\text{ptr}_1 \rightarrow$ |     |      |    |    |      |    |      |    |      |      |       |
| next<br>available          |     |      |    |    |      |    |      |    |      |      |       |

*Reorder buffer*

- add  $\langle \text{pd}, \text{dest}, \text{data}, \text{cause} \rangle$  fields in the instruction template
  - commit instructions to reg file and memory in program order  $\Rightarrow$  buffers can be maintained circularly
  - on exception, clear reorder buffer by resetting  $\text{ptr}_1 = \text{ptr}_2$   
*(stores must wait for commit before updating memory)*

## Rollback and Renaming



How does the decode stage find the tag of a source register?

March 18, 2021

MIT 6.823 Spring 2021

L09-26 March 18, 2021

## Renaming Table



Renaming table is a cache to speed up register name lookup.  
It needs to be cleared after each exception taken.

When else are valid bits cleared?

09-27

## Physical Register Files

- Reorder buffers are space inefficient – a data value may be stored in multiple places in the reorder buffer
- Idea: Keep all data values in a physical register file
  - Tag represents the name of the data value and name of the physical register that holds it
  - Reorder buffer contains only tags

Thus, 64-bit data values may be replaced by 8-bit tags for a 256-element physical register file

More on this in later lectures ...

March 18, 2021

MIT 6.823 Spring 2021

L09-28 March 18, 2021

MIT 6.823 Spring 2021

L09-29

## Branch Penalty

Next fetch started

How many instructions need to be killed on a misprediction?

Modern processors may have > 10 pipeline stages between nextPC calculation and branch resolution !

Branch executed

Next lecture:  
Branch prediction & Speculative execution



## Branch Prediction

Daniel Sanchez

Computer Science and Artificial Intelligence Laboratory  
M.I.T.

March 25, 2021

MIT 6.823 Spring 2021

L10-1 March 25, 2021

MIT 6.823 Spring 2021

L10-2

## Reminder: Phases of Instruction Execution



## Control Flow Penalty

Modern processors may have > 10 pipeline stages between next PC calculation and branch resolution!

How much work is lost if pipeline doesn't follow correct instruction flow?



March 25, 2021

MIT 6.823 Spring 2021

L10-3 March 25, 2021

MIT 6.823 Spring 2021

L10-4

## Average Run-Length between Branches

Average dynamic instruction mix of SPEC CPU 2017 [Limaye and Adegbija, ISPASS'18]:

|          | SPECint | SPECfp |
|----------|---------|--------|
| Branches | 19 %    | 11 %   |
| Loads    | 24 %    | 26 %   |
| Stores   | 10 %    | 7 %    |
| Other    | 47 %    | 56 %   |

SPECint17: perlbench, gcc, mcf, omnetpp, xalancbmk, x264, deepsjeng, leela, exchange2, xz

SPECfp17: bwaves, cactus, lbm, wrf, pop2, imagick, nab, fotonik3d, roms

What is the average run length between branches?

# MIPS Branches and Jumps

Each instruction fetch depends on one or two pieces of information from the preceding instruction:

- 1) Is the preceding instruction a taken branch?
- 2) If so, what is the target address?

| Instruction | Taken known? | Target known? |
|-------------|--------------|---------------|
| J           |              |               |
| JR          |              |               |
| BEQZ/BNEZ   |              |               |

\*Assuming zero detect on register read

March 25, 2021

MIT 6.823 Spring 2021

L10-5 March 25, 2021

MIT 6.823 Spring 2021

L10-6

# Reducing Control Flow Penalty

- Software solutions
  - Eliminate branches – loop unrolling  
Increases run length between branches
  - Reduce resolution time – instruction scheduling  
Compute the branch condition as early as possible (of limited value)
- Hardware solutions
  - Bypass – usually results are used immediately
  - Change architecture – find something else to do  
*Delay slots* – replace pipeline bubbles with useful work (requires software cooperation)
  - Speculate – branch prediction  
*Speculative execution* of instructions beyond the branch

March 25, 2021

MIT 6.823 Spring 2021

L10-7 March 25, 2021

MIT 6.823 Spring 2021

L10-8

# Static Branch Prediction

Overall probability a branch is taken is ~60-70% but:



ISA can attach preferred direction semantics to branches,  
e.g., Motorola MC88110  
bne0 (*preferred taken*) beq0 (*not taken*)

ISA can allow arbitrary choice of statically predicted direction,  
e.g., HP PA-RISC, Intel IA-64  
typically reported as ~80% accurate

# Example Branch Penalties

UltraSPARC-III instruction fetch pipeline stages  
(in-order issue, 4-way superscalar, 750MHz, 2000)



MIT 6.823 Spring 2021

L10-6

# Branch Prediction

## Motivation:

Branch penalties limit performance of deeply pipelined processors

Modern branch predictors have high accuracy (>95%) and can reduce branch penalties significantly

## Required hardware support:

### Prediction structures:

- Branch history tables, branch target buffers, etc.

### Mispredict recovery mechanisms:

- Keep result computation separate from commit
- Kill instructions following branch in pipeline
- Restore state to state following branch

MIT 6.823 Spring 2021

L10-8

# Dynamic Prediction



Prediction as a feedback control process

March 25, 2021

MIT 6.823 Spring 2021

L10-9 March 25, 2021

MIT 6.823 Spring 2021

L10-10

# Dynamic Branch Prediction

Learning based on past behavior

## Temporal correlation

The way a branch resolves may be a good predictor of the way it will resolve at the next execution

## Spatial correlation

Several branches may resolve in a highly correlated manner (*a preferred path of execution*)

## Predictor Primitive

Emer & Gloy, 1997

- Indexed table holding values

- Operations
  - Predict
  - Update



- Algebraic notation

$$\text{Prediction} = P[\text{Width}, \text{Depth}](\text{Index}; \text{Update})$$

March 25, 2021

MIT 6.823 Spring 2021

L10-11 March 25, 2021

MIT 6.823 Spring 2021

L10-12

## One-bit Predictor aka Branch History Table (BHT)

Simple temporal prediction



$$A21064(\text{PC}; T) = P[1, 2K](\text{PC}; T)$$

What happens on loop branches?

March 25, 2021

MIT 6.823 Spring 2021

L10-13 March 25, 2021

MIT 6.823 Spring 2021

L22-14

## Two-bit Predictor

Smith, 1981



$$\text{Counter}[W,D](I; T) = P[W, D](I; \text{if } T \text{ then } P+1 \text{ else } P-1)$$

$$A21164(\text{PC}; T) = \text{MSB}(\text{Counter}[2, 2K](\text{PC}; T))$$

## Two-bit Predictor

Smith, 1981

- Use two bits per entry instead of one bit
- Manage them as a saturating counter:

| On not-taken | On taken | 1 | 1 | Strongly taken     |
|--------------|----------|---|---|--------------------|
|              | ↑        | 1 | 0 | Weakly taken       |
|              | ↓        | 0 | 1 | Weakly not-taken   |
|              |          | 0 | 0 | Strongly not-taken |

- Direction prediction changes only after two wrong predictions

How many mispredictions per loop?

March 25, 2021

MIT 6.823 Spring 2021

L10-13 March 25, 2021

MIT 6.823 Spring 2021

L22-14

## Branch History Table



4K-entry BHT, 2 bits/entry, ~80-90% correct predictions

March 25, 2021

MIT 6.823 Spring 2021

L10-15 March 25, 2021

MIT 6.823 Spring 2021

L10-16

# Exploiting Spatial Correlation

Yeh and Patt, 1992

```
if (x[i] < 7) then
    y += 1;
if (x[i] < 5) then
    c -= 4;
```

If first condition false, second condition also false

*History register* records the direction of the last N branches executed by the processor

## History Registers aka Pattern History Table (PHT)



$$\text{History}(\text{PC}; \text{T}) = \text{P}(\text{PC}; \text{P} \parallel \text{T})$$

March 25, 2021

MIT 6.823 Spring 2021

L10-17 March 25, 2021

MIT 6.823 Spring 2021

L10-18

## Global-History Predictor



$$\text{GHist}(\cdot; \text{T}) = \text{MSB}(\text{Counter}(\text{History}(0, \text{T}); \text{T}))$$

Can we take advantage of a pattern at a particular PC?

## Local-History Predictor



$$\text{LHist}(\text{PC}; \text{T}) = \text{MSB}(\text{Counter}(\text{History}(\text{PC}; \text{T}); \text{T}))$$

March 25, 2021

MIT 6.823 Spring 2021

L10-19 March 25, 2021

MIT 6.823 Spring 2021

L10-19

## Global-History Predictor with Per-PC Counters



$$\text{GHistPA}(\text{PC}; \text{T}) = \text{MSB}(\text{Counter}(\text{History}(0; \text{T}) \parallel \text{PC}; \text{T}))$$

Can we take advantage of the global pattern at a particular PC?

## Two-Level Branch Predictor (Pentium Pro, 1995)

Pentium Pro uses the result from the last two branches to select one of the four sets of BHT bits (~95% correct)



March 25, 2021

MIT 6.823 Spring 2021

L10-21 March 25, 2021

MIT 6.823 Spring 2021

L10-22

## Choosing Predictors



$$\begin{aligned} \text{Chooser} &= \text{MSB}(P(\text{PC}; P + (A==T) - (B==T)) \\ &\quad \text{or} \\ \text{Chooser} &= \text{MSB}(P(\text{GHist}(\text{PC}; T); P + (A==T) - (B==T))) \end{aligned}$$

March 25, 2021

MIT 6.823 Spring 2021

L10-23 March 25, 2021

MIT 6.823 Spring 2021

L10-24

## TAGE predictor Seznec & Michaud, 2006



$$\begin{aligned} \text{TAGE\_TREE}[L1, L2, L3](\text{PC}; T) = \\ \text{TAGE}[L3](\text{PC}, \text{TAGE}[L2](\text{PC}, \text{TAGE}[L1](\text{PC}, \text{Bimodal}(\text{PC}; T); T); T); T) \end{aligned}$$

March 25, 2021

MIT 6.823 Spring 2021

L10-25 March 25, 2021

MIT 6.823 Spring 2021

L10-25

## TAGE predictor component

$$\text{TAGE}[L](\text{PC}, \text{NEXT}; T) =$$

```
idx = hash(PC, GHIST[L](;T))
tag = hash'(PC, GHIST[L](;T))
```

```
TAGE.U = SA(idx, tag; ((TAGE == T) && (NEXT != T))?1:SA)
TAGE.Counter = SA(idx, tag; T?SA+1:SA-1)
```

```
use_me = TAGE.U && isStrong(TAGE.Counter)
TAGE = use_me?MSB(TAGE.Counter):NEXT
```

Notes:

SA is a set-associative structure  
SA allocation occurs on mispredict (not shown)  
TAGE.U cleared on global counter saturation

## Tournament Branch Predictor (Alpha 21264, 1996)



- Choice predictor learns whether best to use local or global branch history in predicting next branch
- Global history is speculatively updated but restored on mispredict
- Claim 90-100% success on range of applications

March 25, 2021

MIT 6.823 Spring 2021

L10-27 March 25, 2021

MIT 6.823 Spring 2021

L10-28

## TAGE component



MIT 6.823 Spring 2021

L10-26

## Limitations of branch predictors

Only predicts branch direction. Therefore, cannot redirect fetch stream until after branch target is determined.

*Correctly predicted taken branch penalty*

*Jump Register penalty*

|                                                           |                                        |
|-----------------------------------------------------------|----------------------------------------|
| A                                                         | PC Generation/Mux                      |
| P                                                         | Instruction Fetch Stage 1              |
| F                                                         | Instruction Fetch Stage 2              |
| B                                                         | Branch Address Calc/Begin Decode       |
| I                                                         | Complete Decode                        |
| J                                                         | Steer Instructions to Functional units |
| R                                                         | Register File Read                     |
| E                                                         | Integer Execute                        |
| <i>Remainder of execute pipeline (+ another 6 stages)</i> |                                        |

*UltraSPARC-III fetch pipeline*

March 25, 2021

MIT 6.823 Spring 2021

L10-27 March 25, 2021

MIT 6.823 Spring 2021

L10-28

## Branch Target Buffer (un>tagged)



BP bits are stored with the predicted target address.

IF stage: If (BP=taken) then  $nPC=target$  else  $nPC=PC+4$   
 later: check prediction, if wrong then kill the instruction and update BTB & BPb, else update BPb

March 25, 2021

MIT 6.823 Spring 2021

L10-29 March 25, 2021

MIT 6.823 Spring 2021

L10-30

## Address Collisions

Assume a  
128-entry  
BTB



What will be fetched after the instruction at 1028?

BTB prediction =  
Correct target =

⇒

*Is this a common occurrence?  
Can we avoid these mispredictions?*

## BTB is only for Control Instructions

BTB contains useful information for branch and jump instructions only

→ Do not update it for other instructions

For all other instructions the next PC is  $(PC)+4$  !

How to achieve this effect without decoding the instruction?

March 25, 2021

MIT 6.823 Spring 2021

L10-31 March 25, 2021

MIT 6.823 Spring 2021

L10-32

## Branch Target Buffer (tagged)



- Keep both the branch PC and target PC in the BTB
- PC+4 is fetched if match fails
- Only *taken* branches and jumps held in BTB
- Next PC determined before branch fetched and decoded

## Consulting BTB Before Decoding



- The match for PC=1028 fails and  $1028+4$  is fetched  
 $\Rightarrow$  eliminates false predictions after ALU instructions
- BTB contains entries only for control transfer instructions  
 $\Rightarrow$  more room to store branch targets

March 25, 2021

MIT 6.823 Spring 2021

L10-33 March 25, 2021

MIT 6.823 Spring 2021

L10-34

## Combining BTB and BHT

- BTB entries are considerably more expensive than BHT, but can redirect fetches at earlier stage in pipeline and can accelerate indirect branches (JR)
- BHT can hold many more entries and is more accurate



*BTB/BHT only updated after branch resolves in E stage*

## Uses of Jump Register (JR)

- Switch statements (jump to address of matching case)
- Dynamic function call (jump to run-time function address)
- Subroutine returns (jump to return address)

How well does BTB work for each of these cases?

March 25, 2021

MIT 6.823 Spring 2021

L10-35 March 25, 2021

MIT 6.823 Spring 2021

L10-36

## Subroutine Return Stack

Small structure to accelerate JR for subroutine returns, typically much more accurate than BTBs.

fa() { fb(); }

fb() { fc(); }

fc() { fd(); }



MIT 6.823 Spring 2021

L10-37

## Line Prediction (Alpha 21[234]64)

- For superscalar, useful to predict next cache line(s) to fetch



- Line Predictor predicts line to fetch each cycle (tight loop)
  - Untagged BTB structure – Why?
  - 21464 was to predict 2 lines per cycle
- Icache fetches block, and predictors improve target prediction
- PC Calc checks accuracy of line prediction(s)

March 25, 2021

MIT 6.823 Spring 2021

L10-37 March 25, 2021

MIT 6.823 Spring 2021

L10-38

## Overview of Branch Prediction



MIT 6.823 Spring 2021

L10-39

*Next Lecture:  
Speculative Execution  
& Value Management*

## Speculative Execution

*Daniel Sanchez*  
Computer Science and Artificial Intelligence Laboratory  
M.I.T.

March 25, 2021

MIT 6.823 Spring 2021

L10-39 March 30, 2021

MIT 6.823 Spring 2021

L11-1

What does this WW2 poster have in common with pipelined processors?



March 30, 2021

## Overview of branch prediction



Must speculation check always be correct?

MIT 6.823 Spring 2021

L11-3

## Speculative Execution Recipe



Why might one use old values?

O-O-O WAR hazards

March 30, 2021

MIT 6.823 Spring 2021

L11-4 March 30, 2021

MIT 6.823 Spring 2021

L11-5

## Value Management Strategies

### Greedy (or Eager) Update:

- Update value in place, and
- Provide means to reconstruct old values for recovery
  - often this is a log of old values

### Lazy Update:

- Buffer new value, leaving old value in place
- Replace old value only at 'commit' time

Why leave an old value in place?

## Exception Handling (In-Order Five-Stage Pipeline)



Strategy for Registers?

Strategy for PC?

March 30, 2021

MIT 6.823 Spring 2021

L11-6 March 30, 2021

MIT 6.823 Spring 2021

L11-7

## Misprediction Recovery

### In-order execution machines:

- Guarantee no instruction issued after branch can write-back before branch resolves by keeping values in the pipeline
- Kill all values from all instructions in pipeline behind mispredicted branch

### Out-of-order execution?

- Multiple instructions following branch in program order can generate new values before branch resolves

## Data-Driven Execution



### Basic Operation:

Enter op and tag or data (if known) for each source  
Replace tag with data as it becomes available  
Issue instruction when all sources are available  
Save dest data when operation finishes

### Update strategy?

March 30, 2021

MIT 6.823 Spring 2021

L11-8 March 30, 2021

MIT 6.823 Spring 2021

L11-9

## Rollback and Renaming



Convert to lazy by holding data in ROB.

But how do we find values before they are committed?

## Renaming Table

Micro-architectural speculative cache to speed up tag look up.



What is the update policy of rename table?

What events cause mis-speculation?

How can we respond to mis-speculation?

After being cleared, when can instructions be added to ROB?

March 30, 2021

MIT 6.823 Spring 2021

L11-10 March 30, 2021

MIT 6.823 Spring 2021

L11-11

## Recovering ROB/Renaming Table



Take snapshot of register rename table at each predicted branch, recover earlier snapshot if branch mispredicted

## Map Table Recovery - Snapshots

Speculative value management of microarchitectural state

|     | Reg Map V | Snap Map V | Snap Map V |
|-----|-----------|------------|------------|
| R0  | T20   X   | T20   X    | T20   X    |
| R1  | T73   X   | T73   X    | T08        |
| R2  | T45   X   | T45   X    | T45   X    |
| R3  | T128      | T128       | T128   X   |
| ⋮   | ⋮         | ⋮          | ⋮          |
| R30 | T54       | T54        | T54        |
| R31 | T88   X   | T88   X    | T88   X    |

What kind of value management is this?

March 30, 2021

MIT 6.823 Spring 2021

L11-12 March 30, 2021

MIT 6.823 Spring 2021

L11-13

## Branch Predictor Recovery

### 1-Bit Counter Recovery

|    |   |   |
|----|---|---|
| PC | ↓ | 0 |
|    |   | 1 |
|    |   | 0 |
|    |   | 1 |

### 2-Bit Counter Recovery

|    |   |    |
|----|---|----|
| PC | ↓ | 00 |
|    |   | 11 |
|    |   | 01 |
|    |   | 10 |

### Global History Recovery

10101010

### Local History Recovery

10101010  
01010101

# O-o-O Execution with ROB Data-in-ROB design



## Basic Operation:

- Enter op and tag or data (if known) for each source
- Replace tag with data as it becomes available
- Issue instruction when all sources are available
- Save dest data when operation finishes
- Commit saved dest data when instruction commits

March 30, 2021

MIT 6.823 Spring 2021

L11-14 March 30, 2021

MIT 6.823 Spring 2021

L11-15

# Unified Physical Register File (MIPS R10K, Alpha 21264, Pentium 4)



- One regfile for both *committed* and *speculative* values (no data in ROB)
- During decode, instruction result allocated new physical register, source regs translated to physical regs through rename table
- Instruction reads data from regfile at start of execute (not in decode)
- Write-back updates reg. busy bits on instructions in ROB (assoc. search)
- Snapshots of rename table taken at every branch to recover mispredicts
- On exception, renaming undone in reverse order of issue (MIPS R10000)

March 30, 2021

MIT 6.823 Spring 2021

L11-14 March 30, 2021

MIT 6.823 Spring 2021

L11-15

# Speculative & Out-of-Order Execution



March 30, 2021

MIT 6.823 Spring 2021

L11-16 March 30, 2021

MIT 6.823 Spring 2021

L11-17

# Lifetime of Physical Registers

- Physical regfile holds committed and speculative values
- Physical registers decoupled from ROB entries (no data in ROB)

- |                   |                |
|-------------------|----------------|
| a) Id r1, (r3)    | Id P1, (Px)    |
| b) add r3, r1, #4 | add P2, P1, #4 |
| c) sub r1, r3, r9 | sub P3, P2, Py |
| d) add r3, r1, r7 | add P4, P3, Pz |
| e) Id r6, (r1)    | Id P5, (P3)    |
| f) add r8, r6, r3 | add P6, P5, P4 |
| g) st r8, (r1)    | st P6, (P3)    |
| h) Id r3, (r11)   | Id P7, (Pw)    |
- Rename** →

When can we reuse a physical register?

# Physical Register Management



March 30, 2021

MIT 6.823 Spring 2021

L11-18 March 30, 2021

MIT 6.823 Spring 2021

L11-19

# Physical Register Management



March 30, 2021

MIT 6.823 Spring 2021

L11-18 March 30, 2021

MIT 6.823 Spring 2021

L11-19

## Physical Register Management



March 30, 2021

MIT 6.823 Spring 2021

L11-20

March 30, 2021

MIT 6.823 Spring 2021

L11-21

## Physical Register Management



March 30, 2021

MIT 6.823 Spring 2021

L11-22

March 30, 2021

MIT 6.823 Spring 2021

L11-23

## Physical Register Management



March 30, 2021

MIT 6.823 Spring 2021

L11-23

## Physical Register Management



March 30, 2021

MIT 6.823 Spring 2021

L11-24

March 30, 2021

MIT 6.823 Spring 2021

L11-25

## Physical Register Management



March 30, 2021

MIT 6.823 Spring 2021

L11-25

Execute & Commit

## Reorder Buffer Holds Active Instruction Window



Key: predecode, decoded, issued, executed, committed

March 30, 2021

MIT 6.823 Spring 2021

L11-26 March 30, 2021

MIT 6.823 Spring 2021

L11-27

## Issue Timing

|    |              |                    |                      |                    |                      |
|----|--------------|--------------------|----------------------|--------------------|----------------------|
| i1 | Add R1,R1,#1 | Issue <sub>1</sub> | Execute <sub>1</sub> |                    |                      |
| i2 | Sub R1,R1,#1 |                    |                      | Issue <sub>2</sub> | Execute <sub>2</sub> |

How can we issue earlier?

Using knowledge of execution latency (bypass)

|    |              |                    |                      |                    |                      |
|----|--------------|--------------------|----------------------|--------------------|----------------------|
| i1 | LD R1, (R3)  | Issue <sub>1</sub> | Execute <sub>1</sub> |                    |                      |
| i2 | Sub R1,R1,#1 |                    |                      | Issue <sub>2</sub> | Execute <sub>2</sub> |

What might make this schedule fail?

If execution latency wasn't as expected

## Issue Queue with latency prediction



- Fixed latency: latency included in queue entry ('bypassed')
- Predicted latency: latency included in queue entry (speculated)
- Variable latency: wait for completion signal (stall)

March 30, 2021

MIT 6.823 Spring 2021

L11-28 March 30, 2021

MIT 6.823 Spring 2021

L11-29

## Data-in-ROB vs. Unified RegFile

### Data-in-ROB style



### Unified-register-file style



How does issue speculation differ, e.g., on cache miss?

## Superscalar Register Renaming

- During decode, instructions allocated new physical destination register
- Source operands renamed to physical register with newest value
- Execution unit only sees physical register numbers



Does this work?

March 30, 2021

MIT 6.823 Spring 2021

L11-30 March 30, 2021

MIT 6.823 Spring 2021

L11-31

## Superscalar Register Renaming

Inst 1 Op Dest Src1 Src2      Inst 2 Op Dest Src1 Src2



(MIPS R10K renames 4 serially-RAW-dependent insts/cycle)

L11-32

## Split Issue and Commit Queues

- How large should the ROB be?
  - Think Little's Law...
- Can split ROB into issue and commit queues



- Commit queue: Allocate on decode, free on commit
- Issue queue: Allocate on decode, free on dispatch
- Pros: Smaller issue queue → simpler dispatch logic
- Cons: More complex mis-speculation recovery

March 30, 2021

MIT 6.823 Spring 2021

L11-32 March 30, 2021

MIT 6.823 Spring 2021

L11-33

March 30, 2021

MIT 6.823 Spring 2021

L11-32 March 30, 2021

MIT 6.823 Spring 2021

L11-33

*Thank you!*

**Advanced Memory Operations**

*Daniel Sanchez*

Computer Science and Artificial Intelligence Laboratory  
M.I.T.

March 30, 2021

MIT 6.823 Spring 2021

L11-34 April 1, 2021

MIT 6.823 Spring 2021

L12-1

## Reminder: Direct-Mapped Cache



April 1, 2021

MIT 6.823 Spring 2021

L12-2 April 1, 2021

MIT 6.823 Spring 2021

L12-3

## Write Performance



*How does write timing compare to read timing?*

## Reducing Write Hit Time

Problem: Writes take two cycles in memory stage, one cycle for tag check plus one cycle for data write if hit

View: Treat as data dependence on micro-architectural value 'hit/miss'

Solutions:

- Wait – delivering data as fast as possible:
  - Fully associative (CAM Tag) caches: Word line only enabled if hit
- Speculate predicting hit with greedy data update:
  - Design data RAM that can perform read and write in one cycle
  - Restore old value after tag miss (abort)
- Speculate predicting miss with lazy data update:
  - Hold write data for store in single buffer ahead of cache
  - Write cache data during next idle data access cycle (commit)

April 1, 2021

MIT 6.823 Spring 2021

L12-4 April 1, 2021

MIT 6.823 Spring 2021

L12-5

## Pipelined/Delayed Write Timing

Problem: Need to commit lazily saved write data

Solution: Write data during idle data cycle of next store's tag check



## Pipelining Cache Writes

What if instruction needs data in delayed write buffer?



April 1, 2021

MIT 6.823 Spring 2021

L12-6 April 1, 2021

MIT 6.823 Spring 2021

L12-7

## Write Policy Choices

### • Cache hit:

- **Write-through:** write both cache & memory
  - generally higher traffic but simplifies multi-processor design
- **Write-back:** write cache only
  - (memory is written only when the entry is evicted)
  - a dirty bit per block can further reduce the traffic

### • Cache miss:

- **No-write-allocate:** only write to main memory
- **Write-allocate** (aka *fetch on write*): fetch into cache

### • Common combinations:

- write-through and no-write-allocate
- write-back with write-allocate

## Reducing Read Miss Penalty



**Problem:** Write buffer may hold updated value of location needed by a read miss – RAW data hazard

**Stall:** On a read miss, wait for the write buffer to go empty

**Bypass:** Check write buffer addresses against read miss addresses, if no match, allow read miss to go ahead of writes, else, return value in write buffer

April 1, 2021

MIT 6.823 Spring 2021

L12-8 April 1, 2021

MIT 6.823 Spring 2021

L12-9

## O-o-O With Physical Register File (MIPS R10K, Alpha 21264, Pentium 4)



We've handled the register dependencies, but what about memory operations?

## Speculative Loads / Stores

- Problem: Just like register updates, stores should not permanently change the architectural memory state until after the instruction is committed
- Choice: Data update policy: greedy or lazy?  
*Lazy*: Add a speculative store buffer, a structure to lazily hold speculative store data.
- Choice: Handling of store-to-load data hazards:  
stall, bypass, speculate...?  
*Bypass*: ...

April 1, 2021

MIT 6.823 Spring 2021

L12-10 April 1, 2021

MIT 6.823 Spring 2021

L12-11

## Store Buffer – Lazy data management



- On store execute:
  - mark valid and speculative; save tag, data, and instruction number
- On store commit:
  - clear speculative bit and eventually move data to cache
- On store abort:
  - clear valid bit

April 1, 2021

MIT 6.823 Spring 2021

L12-12 April 1, 2021

MIT 6.823 Spring 2021

L12-13

## Memory Dependencies

For registers, we used tags or physical register numbers to determine dependencies. What about memory operations?

**st r1, (r2)**  
**ld r3, (r4)**

*When is the load dependent on the store?*

*Does our ROB know this at issue time?*

## Store Buffer Responsibilities

- **Lazy store of data:** Buffer new data values for stores
- **Commit/abort:** The data from the oldest instructions must either be committed to memory or forgotten
- **Bypass:** Data from older instructions must be provided (or forwarded) to younger instructions before the older instruction is committed

*Commits are generally done in order – why?*

April 1, 2021

MIT 6.823 Spring 2021

L12-14 April 1, 2021

MIT 6.823 Spring 2021

L12-15

## Store Buffer - Bypassing



- If data in both store buffer and cache, which should we use?
- If same address in store buffer twice, which should we use?
- Calculating entry needed in the store buffer can be considered a dependence on the index needed to access the store buffer. So store buffer bypassing can be managed speculatively by building a simple predictor that guesses that the specific entry in the store buffer the load needs. So what happens if we guessed the wrong entry?

MIT 6.823 Spring 2021

L12-16

## In-Order Memory Queue

**st r1, (r2)**  
**ld r3, (r4)**

Stall naively:

- Execute all loads and stores in program order  
=> Load and store cannot start execution until all previous loads and stores have completed execution
- Can still execute loads and stores speculatively, and out-of-order with respect to other instructions

April 1, 2021

MIT 6.823 Spring 2021

L12-15 April 1, 2021

MIT 6.823 Spring 2021

L12-16

## Conservative O-o-O Load Execution

**st r1, (r2)  
ld r3, (r4)**

Stall intelligently:

- Split execution of store instruction into two phases: address calculation and data write
- Can execute load before store, if addresses known and  $r4 \neq r2$
- Each load address compared with addresses of all previous uncommitted stores (*can use partial conservative check, e.g., bottom 12 bits of address*)
- Don't execute load if any previous store address not known

(MIPS R10K, 16 entry address queue)

April 1, 2021

MIT 6.823 Spring 2021

L12-16 April 1, 2021

MIT 6.823 Spring 2021

L12-17

## Speculative Load Buffer

**Speculation check:**  
Detect if a load has executed before an earlier store to the same address – missed RAW hazard



- On load execute:
  - mark entry valid, and instruction number and tag of data.
- On load commit:
  - clear valid bit
- On load abort:
  - clear valid bit

April 1, 2021

MIT 6.823 Spring 2021

L12-18 April 1, 2021

MIT 6.823 Spring 2021

L12-19

## Memory Dependence Prediction (Alpha 21264)

**st r1, (r2)  
ld r3, (r4)**

1. Guess that  $r4 \neq r2$  and execute load before store
2. If later find  $r4 == r2$ , squash load and all following instructions, but mark load instruction as *store-wait*
- Subsequent executions of the same load instruction will wait for all previous stores to complete
- Periodically clear *store-wait* bits

Notice the general problem of predictors that learn something but can't unlearn it

## Address Speculation

**st r1, (r2)  
ld r3, (r4)**

1. Guess that  $r4 \neq r2$ , and execute load before store address known
2. If  $r4 \neq r2$  commit...
3. But if  $r4 == r2$ , squash load and *all* following instructions
  - To support squash we need to hold all completed but uncommitted load/store addresses/data in program order

How do we resolve the speculation, i.e., detect when we need to squash?

April 1, 2021

MIT 6.823 Spring 2021

L12-16 April 1, 2021

MIT 6.823 Spring 2021

L12-17

## Speculative Load Buffer



- If data in load buffer with instruction younger than store:
  - Speculative violation – abort!

=> Large penalty for inaccurate address speculation

Does tag match have to be perfect?

April 1, 2021

MIT 6.823 Spring 2021

L12-18 April 1, 2021

MIT 6.823 Spring 2021

L12-19

## Store Sets (Alpha 21464)



April 1, 2021

MIT 6.823 Spring 2021

L12-20 April 1, 2021

MIT 6.823 Spring 2021

L12-21

# Memory Dependence Prediction using Store Sets

- A load must wait for any stores in its *store set* that have not yet executed
- The processor approximates each load's *store set* by initially allowing naïve speculation and recording memory-order violations

## The Store Set Map Table



April 1, 2021

MIT 6.823 Spring 2021

L12-22 April 1, 2021

MIT 6.823 Spring 2021

L12-23

## Store Set Sharing for Multiple Readers



April 1, 2021

MIT 6.823 Spring 2021

L12-24 April 1, 2021

MIT 6.823 Spring 2021

L12-25

## Store Set Map Table, cont.



## Prefetching

- Execution of a load 'depends' on the data it needs being in the cache...
- Speculate on future instruction and data accesses and fetch them into cache(s)
  - Instruction accesses easier to predict than data accesses
- Varieties of prefetching
  - Hardware prefetching
  - Software prefetching
  - Mixed schemes
- How does prefetching affect cache misses?*

Compulsory

Conflict

Capacity

## Issues in Prefetching

- Usefulness – should produce hits
- Timeliness – not late and not too early
- Cache and bandwidth pollution



April 1, 2021

MIT 6.823 Spring 2021

L12-26 April 1, 2021

MIT 6.823 Spring 2021

L12-27

# Hardware Instruction Prefetching

## Instruction prefetch in Alpha AXP 21064

- Fetch two blocks on a miss; the requested block (i) and the next consecutive block (i+1)
- Requested block placed in cache, and next block in instruction stream buffer
- If miss in cache but hit in stream buffer, move stream buffer block into cache and prefetch next block (i+2)



April 1, 2021

MIT 6.823 Spring 2021

L12-28 April 1, 2021

MIT 6.823 Spring 2021

L12-29

Thank you!

Next lecture:  
Cache Coherence

Cache Coherence

Daniel Sanchez

Computer Science & Artificial Intelligence Lab  
M.I.T.

April 1, 2021

MIT 6.823 Spring 2021

L12-30 April 6, 2021

MIT 6.823 Spring 2021

L13-1

# The Shift to Multicore



- Since 2005, improvements in system performance mainly due to increasing cores per chip
- Why?

April 6, 2021

MIT 6.823 Spring 2021

L13-2 April 6, 2021

MIT 6.823 Spring 2021

L13-3

# Hardware Data Prefetching

## Prefetch-on-miss:

- Prefetch b + 1 upon miss on b

## One Block Lookahead (OBL) scheme

- Initiate prefetch for block b + 1 when block b is accessed
- Why is this different from doubling block size?
- Can extend to N-block lookahead (called *stream prefetching*)

## Strided prefetch

- If observe sequence of accesses to block b, b+N, b+2N etc.

**Example:** IBM Power 5 [2003] supports eight independent streams of strided prefetch per processor, prefetching 12 lines ahead of current access

April 1, 2021

MIT 6.823 Spring 2021

L12-28 April 1, 2021

MIT 6.823 Spring 2021

L12-29

# Multicore Performance



What factors may limit multicore performance?

April 6, 2021

MIT 6.823 Spring 2021

L13-2 April 6, 2021

MIT 6.823 Spring 2021

L13-3

## Amdahl's Law

- Speedup = time<sub>without enhancement</sub> / time<sub>with enhancement</sub>
- Suppose an enhancement speeds up a fraction  $f$  of a task by a factor of  $S$

$$\text{time}_{\text{new}} = \text{time}_{\text{old}} \cdot ((1-f) + f/S)$$

$$S_{\text{overall}} = 1 / ((1-f) + f/S)$$



### Corollary: Make the common case fast

April 6, 2021

MIT 6.823 Spring 2021

L13-4 April 6, 2021

MIT 6.823 Spring 2021

L13-5

What  $f$  do you need to use a 1000-core machine well?

## Communication Models

- Shared memory:
  - Single address space
  - Implicit communication by reading/writing memory
    - Data
    - Control (semaphores, locks, barriers, ...)
  - Low-level programming model: threads
- Message passing:
  - Separate address spaces
  - Explicit communication by send/rcv messages
    - Data
    - Control (blocking msgs, barriers, ...)
  - Low-level programming model: processes + inter-process communication (e.g., MPI)
- Pros/cons of each model?



April 6, 2021

MIT 6.823 Spring 2021

L13-6 April 6, 2021

MIT 6.823 Spring 2021

L13-7

## Cache Coherence Avoids Stale Data



- A **cache coherence protocol** controls cache contents to avoid stale cache lines

## Amdahl's Law and Parallelism

- Say you write a program that can do 90% of the work in parallel, but the other 10% is sequential
- What is the maximum speedup you can get by running on a multicore machine?

$$S_{\text{overall}} = 1 / ((1-f) + f/S)$$

$$f = 0.9, S=\infty \rightarrow S_{\text{overall}} = 10$$

What  $f$  do you need to use a 1000-core machine well?

## Coherence & Consistency

- Shared memory systems:
  - Have **multiple private caches** for performance reasons
  - Need to provide the illusion of a single shared memory
- Intuition: A read should return the most recently written value
  - What is "most recent"?
- Formally:
  - Coherence: What values can a read return?
    - Concerns reads/writes to a single memory location
  - Consistency: When do writes become visible to reads?
    - Concerns reads/writes to multiple memory locations

MIT 6.823 Spring 2021

L13-8

## Implementing Cache Coherence

- Coherence protocols must enforce two rules:
  - *Write propagation*: Writes eventually become visible to all processors
  - *Write serialization*: Writes to the same location are serialized (all processors see them in the same order)
- How to ensure write propagation?
  - *Write-invalidate protocols*: Invalidate all other cached copies before performing the write
  - *Write-update protocols*: Update all other cached copies after performing the write
- How to track sharing state of cached data and serialize requests to the same address?
  - *Snooping-based protocols*: All caches observe each other's actions through a shared bus (bus is the serialization point)
  - *Directory-based protocols*: A coherence directory tracks contents of private caches and serializes requests (directory is the serialization point)

April 6, 2021

MIT 6.823 Spring 2021

L13-8 April 6, 2021

MIT 6.823 Spring 2021

L13-9

# Snooping-Based Coherence (Goodman, 1983)



Caches watch (snoop on) bus to keep all processors' view of memory coherent

April 6, 2021

MIT 6.823 Spring 2021

L13-10 April 6, 2021

MIT 6.823 Spring 2021

L13-11

# Snooping-Based Coherence

- Bus provides serialization point
  - Broadcast, totally ordered
- Controller
  - One cache controller for each core "snoops" all bus transactions
  - Controller
    - Responds to requests from core and the bus
    - changes state of the selected cache block
    - generates bus transactions to access data or invalidate
- Snoopy protocol (FSM)
  - State-transition diagram
  - Actions
- Handling writes:
  - Write-invalidate
  - Write-update



## A Simple Protocol: Valid/Invalid (VI)



April 6, 2021

MIT 6.823 Spring 2021

L13-12 April 6, 2021

MIT 6.823 Spring 2021

L13-13

## Valid/Invalid Example



## Valid/Invalid Example



Additional loads satisfied locally, without BusRd

April 6, 2021

MIT 6.823 Spring 2021

L13-14 April 6, 2021

MIT 6.823 Spring 2021

L13-15

## Valid/Invalid Example



## Valid/Invalid Example



VI Problems?

April 6, 2021

MIT 6.823 Spring 2021

L13-16 April 6, 2021

MIT 6.823 Spring 2021

L13-17

## MSI Example



April 6, 2021

MIT 6.823 Spring 2021

L13-18 April 6, 2021

MIT 6.823 Spring 2021

L13-19

## MSI Example



Additional loads satisfied locally, without BusRd  
(like in VI)

## MSI Example



Additional loads *and stores* from core 0 satisfied locally,  
without bus transactions (unlike in VI)

April 6, 2021

MIT 6.823 Spring 2021

L13-20 April 6, 2021

MIT 6.823 Spring 2021

L13-21

## Modified/Shared/Invalid (MSI) Protocol

- Allows writeback caches + satisfying writes locally



## MSI Example



Additional loads satisfied locally, without BusRd  
(like in VI)

## MSI Example





## Non-Atomicity → Transient States

- Protocol must handle lack of atomicity
- Two types of states
  - Stable (e.g. MSI)
  - Transient
- Split + race transitions
- More complex

| Actions                 |               |
|-------------------------|---------------|
| Bus Request<br>(BusReq) | PrWr / BusReq |
| Bus Grant<br>(BusGnt)   | PrWr / BusReq |



April 6, 2021

MIT 6.823 Spring 2021

L13-28 April 6, 2021

MIT 6.823 Spring 2021

L13-29

## Directory-Based Coherence



- Route all coherence transactions through a directory
  - Tracks contents of private caches → No broadcasts
  - Serves as ordering point for conflicting requests → Unordered networks

(more on next lecture)

April 6, 2021

MIT 6.823 Spring 2021

L13-30 April 6, 2021

MIT 6.823 Spring 2021

L13-31

## Coherence and Synchronization Performance Issue #2



Cache coherence protocols will cause **mutex** to *ping-pong* between P1's and P2's caches.

Ping-ponging can be reduced by first reading the **mutex** location (*non-atomically*) and executing a swap only if it is found to be zero (*test&test&set*).

April 6, 2021

MIT 6.823 Spring 2021

L13-32 April 6, 2021

MIT 6.823 Spring 2021

L13-33

## Scaling Cache Coherence

- Can implement ordered interconnects that scale better than buses...



Starfire E10000 (drawn with only eight processors for clarity). A coherence request is *unicast* up to the root, where it is serialized, before being *broadcast* down to all processors

- ... but broadcast is fundamentally unscaleable
  - Bandwidth, energy of transactions with 100s of cache snoops?

## Coherence and False Sharing Performance Issue #1

state blk addr data0 data1 ... dataN

A cache block contains more than one word and cache coherence is done at the block-level and not word-level

Suppose  $P_1$  writes  $word_i$  and  $P_2$  writes  $word_k$  and both words have the same block address.

What can happen?

How to address this problem?

April 6, 2021

MIT 6.823 Spring 2021

L13-30 April 6, 2021

MIT 6.823 Spring 2021

L13-31

## Coherence and Bus Occupancy Performance Issue #3

- In general, an *atomic read-modify-write* instruction requires two memory (bus) operations without intervening memory operations by other processors
- In a multiprocessor setting, bus needs to be locked for the entire duration of the atomic read and write operation
  - ⇒ expensive for simple buses
  - ⇒ very expensive for split-transaction buses
- modern processors use
  - load-reserve*
  - store-conditional*

April 6, 2021

MIT 6.823 Spring 2021

L13-32 April 6, 2021

MIT 6.823 Spring 2021

L13-33

## Load-reserve & Store-conditional

Special register(s) to hold reservation flag and address, and the outcome of store-conditional

Load-reserve R, (a):  
 $\langle \text{flag}, \text{adr} \rangle \leftarrow \langle 1, a \rangle;$   
 $R \leftarrow M[a];$

Store-conditional (a), R:  
if  $\langle \text{flag}, \text{adr} \rangle == \langle 1, a \rangle$   
then cancel other procs'  
reservation on a;  
 $M[a] \leftarrow \langle R \rangle;$   
status  $\leftarrow$  succeed;  
else status  $\leftarrow$  fail;

If the snooper sees a store transaction to the address in the reserve register, the reserve bit is set to 0

- Several processors may reserve 'a' simultaneously
- These instructions are like ordinary loads and stores with respect to the bus traffic

Thank you!

Next lecture: Directory-based Cache Coherence

## Directory-Based Cache Coherence

Daniel Sanchez

Computer Science and Artificial Intelligence Lab  
M.I.T.

## Maintaining Cache Coherence

It is sufficient to have hardware such that

- only one processor at a time has write permission for a location
- no processor can load a stale copy of the location after a write

⇒ A correct approach could be:

write request:

The address is *invalidated* in all other caches *before* the write is performed

read request:

If a dirty copy is found in some cache, a write-back is performed before the memory is read

## Performance: Load-reserve & Store-conditional

The total number of memory (bus) transactions is not necessarily reduced, but splitting an atomic instruction into load-reserve & store-conditional:

- increases bus utilization (and reduces processor stall time), especially in split-transaction buses
- reduces cache ping-pong effect because processors trying to acquire a mutex do not have to perform stores each time

## Directory-Based Cache Coherence (Censier and Feautrier, 1978)

### Snoopy Protocols



### Directory Protocols



- Snoopy schemes broadcast requests over memory bus
- Difficult to scale to large numbers of processors
- Requires additional bandwidth to cache tags for snoop requests
- Directory schemes send messages to only those caches that might have the line
- Can scale to large numbers of processors
- Requires extra directory storage to track possible sharers

## An MSI Directory Protocol



- Cache states: Modified (M) / Shared (S) / Invalid (I)
- Directory states:
  - Uncached (Un): No sharers
  - Shared (Sh): One or more sharers with read permission (S)
  - Exclusive (Ex): A single sharer with read & write permissions (M)
- Transient states not drawn for clarity; for now, assume no racing requests

April 8, 2021

MIT 6.823 Spring 2021

L14-4 April 8, 2021

MIT 6.823 Spring 2021

L14-5

## MSI Protocol: Caches (1/3)

Transitions initiated by processor accesses:



| Actions                   |
|---------------------------|
| Processor Read (PrRd)     |
| Processor Write (PrWr)    |
| Shared Request (ShReq)    |
| Exclusive Request (ExReq) |

## MSI Protocol: Caches (2/3)

Transitions initiated by directory requests:



April 8, 2021

MIT 6.823 Spring 2021

L14-6 April 8, 2021

MIT 6.823 Spring 2021

L14-7

## MSI Protocol: Caches (3/3)

Transitions initiated by evictions:



## MSI Protocol: Caches

- Transitions initiated by processor accesses  
 → Transitions initiated by directory requests  
 → Transitions initiated by evictions



April 8, 2021

MIT 6.823 Spring 2021

L14-8 April 8, 2021

MIT 6.823 Spring 2021

L14-9

## MSI Protocol: Directory (1/2)

Transitions initiated by data requests:



April 8, 2021

MIT 6.823 Spring 2021

L14-9 April 8, 2021

MIT 6.823 Spring 2021

L14-9

## MSI Protocol: Directory (2/2)

Transitions initiated by writeback requests:



WbReq / Sharers = {}; WbResp

WbReq && |Sharers| > 1 / Sharers = Sharers - {P}; WbResp

WbReq && |Sharers| == 1 / Sharers = {}; WbResp

April 8, 2021

MIT 6.823 Spring 2021

L14-10 April 8, 2021

MIT 6.823 Spring 2021

L14-11

## MSI Directory Protocol Example



April 8, 2021

MIT 6.823 Spring 2021

L14-12 April 8, 2021

MIT 6.823 Spring 2021

L14-12

## MSI Directory Protocol Example



April 8, 2021

MIT 6.823 Spring 2021

L14-12 April 8, 2021

MIT 6.823 Spring 2021

L14-13

## MSI Directory Protocol Example



Why are 0xA's wb and 0xB's req serialized?

Possible solutions?

April 8, 2021

MIT 6.823 Spring 2021

L14-14 April 8, 2021

MIT 6.823 Spring 2021

L14-15

## MSI Directory Protocol Example



April 8, 2021

MIT 6.823 Spring 2021

L14-10 April 8, 2021

MIT 6.823 Spring 2021

L14-11

## MSI Directory Protocol Example



April 8, 2021

MIT 6.823 Spring 2021

L14-12 April 8, 2021

MIT 6.823 Spring 2021

L14-13

## Miss Status Handling Register

MSHR – Holds load misses and writes outside of cache

MSHR entry

|   |   |      |      |
|---|---|------|------|
| V | X | Addr | Data |
|---|---|------|------|

- On eviction/writeback

- No free MSHR entry: stall
- Allocate new MSHR entry
- When channel available send WBReq and data
- Deallocate entry on WBResp

April 8, 2021

MIT 6.823 Spring 2021

L14-14 April 8, 2021

MIT 6.823 Spring 2021

L14-15

## Miss Status Handling Register

MSHR – Holds load misses and writes outside of cache



- On cache load miss

- No free MSHR entry: stall
- Allocate new MSHR entry
- Send ShReq (or ExReq)
- On \*Resp forward data to CPU and cache
- Deallocate MSHR

## Miss Status Handling Register

MSHR – Holds load misses and writes outside of cache



- On cache load miss

- Look for matching address in MSHR
  - If not found
    - If no free MSHR entry: stall
    - Allocate new MSHR entry and fill in
  - If found, just fill in per Id/st slot
- Send ShReq (or ExReq)
- On \*Resp forward data to CPU and cache
- Deallocate MSHR

Per Id/st slots allow servicing multiple requests with one entry

## Directory Organization

- Requirement: Directory needs to keep track of all the cores that are sharing a cache block
- Challenge: For each block, the space needed to hold the list of sharers grows with number of possible sharers...

## Flat, Memory-based Directories

- Dedicate a few bits of main memory to store the state and sharers of every line
- Encode sharers using a bit-vector



- ✓ Simple
- ✗ Slow
- ✗ Very inefficient with many processors (~P bits/line)

## Sparse Full-Map Directories

- Not every line in the system needs to be tracked – only those in private caches!
- Idea: Organize directory as a cache



- ✓ Low latency, energy-efficient
- ✗ Bit-vectors grow with # cores → Area scales poorly
- ✗ Limited associativity → Directory-induced invalidations

## Directory-Induced Invalidations

- To retain inclusion, must invalidate all sharers of an entry before reusing it for another address
- Example: 2-way set-associative sparse directory



How many entries should the directory have?

## Inexact Representations of Sharer Sets

- Coarse-grain bit-vectors (e.g., 1 bit per 4 cores)



- Limited pointers: Maintain a few sharer pointers, on overflow mark 'all' and broadcast (or invalidate another sharer)



- Allow false positives (e.g., Bloom filters)

- ✓ Reduced area & energy
- ✗ Overheads still not scalable (these techniques simply play with constant factors)
- ✗ Inexact sharers → Broadcasts, invalidations or spurious invalidations and downgrades

April 8, 2021

MIT 6.823 Spring 2021

L14-22 April 8, 2021

MIT 6.823 Spring 2021

L14-23

## Extra Hops and 3-Hop Protocols

Reducing Protocol Latency

- Problem: Data in another cache needs to pass through the directory, adding latency
- Optimization: Forward data to requester directly



April 8, 2021

MIT 6.823 Spring 2021

L14-24 April 8, 2021

MIT 6.823 Spring 2021

L14-25

## In-Cache Directories

- Common multicore memory hierarchy:

- 1+ levels of private caches
- A shared last-level cache
- Need to enforce coherence among private caches

- Idea: Embed the directory information in shared cache tags

- Shared cache must be inclusive



- ✓ Avoids tag overheads & separate lookups
- ✗ Can be inefficient if shared cache size >> sum(private cache sizes)

April 8, 2021

MIT 6.823 Spring 2021

L14-26 April 8, 2021

MIT 6.823 Spring 2021

L14-27

## Protocol Races

- Directory serializes multiple requests for the same address
  - Same-address requests are queued or NACKed and retried
- But races still exist due to conflicting requests
- Example: Upgrade race



Caches 0 and 1 issue simultaneous ExReqs  
Directory starts serving cache 0's ExReq, queues cache 1's

Cache 1 expected ExResp, but got InvReq!  
Cache 1 should transition from S->M to I->M and send InvResp

MIT 6.823 Spring 2021

L14-26

## Coherence in Multi-Level Hierarchies

- Can use the same or different protocols to keep coherence across multiple levels
- Key invariant: Ensure sufficient permissions in all intermediate levels
- Example: 8-socket Xeon E7 (8 cores/socket)



MESIF protocol

Snooping (QPI)

MESI protocol

L3 in-cache directory

MIT 6.823 Spring 2021

L14-27

## Avoiding Protocol Deadlock

- Protocols can cause deadlocks even if network is deadlock-free! (more on this later)



Example: Both nodes saturate all intermediate buffers with requests to each other, blocking responses from entering the network

- Solution: Separate virtual networks

- Different sets of virtual channels and endpoint buffers
- Same physical routers and links

- Most protocols require at least 2 virtual networks (for requests and replies), often >2 needed

# Implementing Atomic Instructions

- In general, an *atomic read-modify-write* instruction requires two memory operations without intervening memory operations by other processors
- Implementation options:
  - With snoopy coherence, lock the bus -> expensive
  - With directory-based coherence, lock the line in the cache (prevent invalidations or evictions until atomic op finishes)  
-> complex
- Modern processors often use
  - load-reserve*
  - store-conditional*

April 6, 2021

MIT 6.823 Spring 2021

L13-28 April 8, 2021

MIT 6.823 Spring 2021

L14-29

# Load-reserve & Store-conditional

Special register(s) to hold reservation flag and address, and the outcome of store-conditional

Load-reserve R, (a):  
 $\langle \text{flag}, \text{adr} \rangle \leftarrow \langle 1, a \rangle;$   
 $R \leftarrow M[a];$

Store-conditional (a), R:  
*if*  $\langle \text{flag}, \text{adr} \rangle == \langle 1, a \rangle$   
*then* cancel other procs' reservation on a;  
 $M[a] \leftarrow \langle R \rangle;$   
status  $\leftarrow$  succeed;  
*else* status  $\leftarrow$  fail;

If the cache receives an invalidation to the address in the reserve register, the reserve bit is set to 0

- Several processors may reserve 'a' simultaneously
- These instructions are like ordinary loads and stores with respect to the bus traffic

## Load-Reserve/Store-Conditional

Swap implemented with Ld-Reserve/St-Conditional

# Swap(R1, mutex):

L: Ld-Reserve R2, (mutex)  
St-Conditional (mutex), R1  
if (status == fail) goto L  
R1  $\leftarrow$  R2

April 8, 2021

MIT 6.823 Spring 2021

L14-30 April 8, 2021

MIT 6.823 Spring 2021

L14-31

## Performance: Load-reserve & Store-conditional

The total number of coherence transactions is not necessarily reduced, but splitting an atomic instruction into load-reserve & store-conditional:

- increases utilization (and reduces processor stall time), especially in split-transaction buses and directories
- reduces cache ping-pong effect because processors trying to acquire a semaphore do not have to perform stores each time

Thank you!

Next Lecture:  
Consistency and  
Relaxed Memory Models

## Memory Consistency Models

Daniel Sanchez  
Computer Science and Artificial Intelligence Lab  
M.I.T.

April 8, 2021

MIT 6.823 Spring 2021

L14-32 April 13, 2021

MIT 6.823 Spring 2021

L15-1

## Coherence vs Consistency

- Cache coherence makes private caches invisible to software
  - Concerns reads/writes to a single memory location
- Memory consistency models precisely specify how memory behaves with respect to read and write operations from multiple processors
  - Concerns reads/writes to multiple memory locations

April 13, 2021

MIT 6.823 Spring 2021

L15-2 April 13, 2021

MIT 6.823 Spring 2021

L15-3

## Why Consistency Matters

### Initial memory contents

a: 0  
flag: 0

#### Processor 1

Store (a), 10;  
Store (flag), 1;

#### Processor 2

L: Load r1, (flag);  
if r1 == 0 goto L;  
Load r2, (a);

- What value does r2 hold after both processors finish running this code?

## Sequential Consistency

A Straightforward Memory Model



"A system is *sequentially consistent* if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program"

Leslie Lamport

Sequential Consistency =  
arbitrary order-preserving interleaving  
of memory references of sequential programs

April 13, 2021

MIT 6.823 Spring 2021

L15-4 April 13, 2021

MIT 6.823 Spring 2021

L15-5

## Sequential Consistency

#### Processor 1

Store (a), 10;  
Store (flag), 1;

#### Processor 2

L: Load r1, (flag);  
if r1 == 0 goto L;  
Load r2, (a);

- In-order instruction execution
- Atomic loads and stores

*SC is easy to understand, but architects and compiler writers want to violate it for performance*

## Memory Model Issues

*Architectural optimizations that are correct for uniprocessors often violate sequential consistency and result in a new memory model for multiprocessors*

April 13, 2021

MIT 6.823 Spring 2021

L15-6 April 13, 2021

MIT 6.823 Spring 2021

L15-7

## Consistency Models

- Sequential Consistency
  - All reads and writes in order

- Relaxed Consistency (one or more of the following)
  - Loads may be reordered after loads
    - e.g., PA-RISC, Power, Alpha
  - Loads may be reordered after stores
    - e.g., PA-RISC, Power, Alpha
  - Stores may be reordered after stores
    - e.g., PA-RISC, Power, Alpha, PSO
  - Stores may be reordered after loads
    - e.g., PA-RISC, Power, Alpha, PSO, TSO
  - Other more esoteric characteristics
    - e.g., Alpha

MIT 6.823 Spring 2021

L15-7

## Committed Store Buffers

- CPU can continue execution while earlier committed stores are still propagating through memory system
  - Processor can commit other instructions (including loads and stores) while first store is committing to memory
  - Committed store buffer can be combined with speculative store buffer in an out-of-order CPU
- Local loads can bypass values from buffered stores to same address



April 13, 2021

MIT 6.823 Spring 2021

L15-8 April 13, 2021

MIT 6.823 Spring 2021

L15-9

## Example 1: Store Buffers

| Process 1                                                                                                                    | Process 2                                                                                                                    |
|------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|
| Store (flag <sub>1</sub> ), 1;<br>Load r <sub>1</sub> , (flag <sub>2</sub> );<br>Load r <sub>2</sub> , (flag <sub>1</sub> ); | Store (flag <sub>2</sub> ), 1;<br>Load r <sub>2</sub> , (flag <sub>1</sub> );<br>Load r <sub>1</sub> , (flag <sub>2</sub> ); |

Question: Is it possible that  $r_1=0$  and  $r_2=0$ ?

- Sequential consistency:
- Suppose Loads can go ahead of Stores waiting in the store buffer:

Total Store Order (TSO):  
Sun SPARC, IBM 370

Initially, all memory locations contain zeros

## Example 2: Store-Load Bypassing

| Process 1                                                                                                                    | Process 2                                                                                                                    |
|------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|
| Store (flag <sub>1</sub> ), 1;<br>Load r <sub>3</sub> , (flag <sub>1</sub> );<br>Load r <sub>1</sub> , (flag <sub>2</sub> ); | Store (flag <sub>2</sub> ), 1;<br>Load r <sub>4</sub> , (flag <sub>2</sub> );<br>Load r <sub>2</sub> , (flag <sub>1</sub> ); |

Question: Do extra Loads have any effect?

- Sequential consistency:
- Suppose Store-Load bypassing is permitted in the store buffer
  - No effect in Sparc's TSO model, still not SC
  - In IBM 370, a load cannot return a written value until it is visible to other processors => implicitly adds a memory fence, looks like SC

April 13, 2021

MIT 6.823 Spring 2021

L15-10 April 13, 2021

MIT 6.823 Spring 2021

L15-11

## Interleaved Memory System

- Achieve greater throughput by spreading memory addresses across two or more parallel memory subsystems
  - In snooping system, can have two or more snoops in progress at same time (e.g., Sun UE10K system has four interleaved snooping busses)
  - Greater bandwidth from main memory system as two memory modules can be accessed in parallel



## Example 3: Non-FIFO Store buffers

| Process 1                         | Process 2                                                   |
|-----------------------------------|-------------------------------------------------------------|
| Store (a), 1;<br>Store (flag), 1; | Load r <sub>1</sub> , (flag);<br>Load r <sub>2</sub> , (a); |

Question: Is it possible that  $r_1=1$  but  $r_2=0$ ?

- Sequential consistency:
- With non-FIFO store buffers:

Sparc's PSO memory model

April 13, 2021

MIT 6.823 Spring 2021

L15-12 April 13, 2021

MIT 6.823 Spring 2021

L15-13

## Example 4: Non-Blocking Caches

| Process 1                         | Process 2                                                   |
|-----------------------------------|-------------------------------------------------------------|
| Store (a), 1;<br>Store (flag), 1; | Load r <sub>1</sub> , (flag);<br>Load r <sub>2</sub> , (a); |

Question: Is it possible that  $r_1=1$  but  $r_2=0$ ?

- Sequential consistency:
- Assuming stores are ordered:

Alpha, Sparc's RMO, PowerPC's WO

## Example 5: Register Renaming



## Example 6: Speculative Execution



Question: Is it possible that  $r_1=1$  but  $r_2=0$ ?

- Sequential consistency: No
- With speculative loads:

## Example 7: Address Speculation



## Example 8: Store Atomicity



Question: Is it possible that  $r_1=1$  and  $r_2=2$  but  $r_3=2$  and  $r_4=1$  ?

- Sequential consistency:
- Even if Loads on a processor are ordered, the different ordering of stores can be observed if the Store operation is not atomic.

## Example 9: Causality



Question: Is it possible that  $r_1=1$  and  $r_2=1$  but  $r_3=0$  ?

- Sequential consistency:
- With load/load reordering:

Alpha

## Weaker Memory Models & Memory Fence Instructions

- Architectures with weaker memory models provide memory fence instructions to prevent otherwise permitted reorderings of loads and stores

Store (a<sub>1</sub>), r2;  
**Fence<sub>wr</sub>**  
Load r1, (a<sub>2</sub>);

The Load and Store can be reordered if a<sub>1</sub> /= a<sub>2</sub>. Insertion of Fence<sub>wr</sub> will disallow this reordering

Similarly: Fence<sub>rr</sub>; Fence<sub>rw</sub>; Fence<sub>ww</sub>;

SUN's Sparc: MEMBAR;  
MEMBARRR; MEMBARRW; MEMBARWR; MEMBARWW  
PowerPC: Sync; EIEIO

## Enforcing Ordering using Fences

Processor 1  
Store (a),10;  
Store (flag),1;

Processor 2  
L: Load r<sub>1</sub>, (flag);  
if r<sub>1</sub> == 0 goto L;  
Load r<sub>2</sub>, (a);

Processor 1  
Store (a),10;  
**Fence<sub>ww</sub>**;  
Store (flag),1;

Processor 2  
L: Load r<sub>1</sub>, (flag);  
if r<sub>1</sub> == 0 goto L;  
**Fence<sub>rr</sub>**;  
Load r<sub>2</sub>, (a);

Weak ordering

April 13, 2021

MIT 6.823 Spring 2021

L15-20 April 13, 2021

MIT 6.823 Spring 2021

L15-21

## Weaker (Relaxed) Memory Models



- Hard to understand and remember
- Unstable - *Modèle de l'année*
- Abandon weaker memory models in favor of implementing SC

## Implementing SC

1. The memory operations of each individual processor appear to all processors in the order the requests are made to the memory

- Provided by cache coherence, which ensures that all processors observe the same order of loads and stores to an address

2. Any execution is the same as if the operations of all the processors were executed in some sequential order

- Provided by enforcing a dependence between each memory operation and the following one

April 13, 2021

MIT 6.823 Spring 2021

L15-22 April 13, 2021

MIT 6.823 Spring 2021

L15-23

## SC Data Dependence

- *Stall*
  - Use in-order execution and blocking caches
  - Cache coherence plus allowing a processor to have only one request in flight at a time will provide SC
- *Change architecture  $\Rightarrow$  Relaxed memory models*
  - Use OOO and non-blocking caches
    - Cache coherence and allowing multiple concurrent requests (to different addresses) gives high performance
    - Add fence operations to force ordering when needed
- *Speculate...*

## Sequential Consistency Speculation

- Local load-store ordering uses standard OOO mechanism
- Globally non-speculative stores
  - Stores execute at commit  $\rightarrow$  stores are in-order!
- Globally speculative loads
  - **Guess** at issue that the memory location used by a load will not change between issue and commit of the instruction
    - this is equivalent to loads happening in-order at commit
  - **Check** at commit by remembering all loads addresses starting at issue and watching for writes to that location.
  - **Data Management** for rollback relies on the basic out-of-order speculative data management used for uni-processor rollback and instruction re-execution.

## SC Speculative Behavior



April 13, 2021

MIT 6.823 Spring 2021

L15-24 April 13, 2021

MIT 6.823 Spring 2021

L15-25

## Properly Synchronized Programs

- Very few programmers do programming that relies on SC; instead, they use higher-level synchronization primitives
  - locks, semaphores, monitors, atomic transactions
- A “properly synchronized program” is one where each shared writable variable is protected (say, by a lock) so that there is no race in updating the variable
  - There is still race to get the lock
  - There is no way to check if a program is properly synchronized
- For properly synchronized programs, instruction reordering does not matter as long as updated values are committed before leaving a locked region

April 13, 2021

MIT 6.823 Spring 2021

L15-26 April 13, 2021

MIT 6.823 Spring 2021

L15-27

## Takeaways

- SC is too low level a programming model. High-level programming should be based on critical sections & locks, atomic transactions, monitors, ...
- High-level parallel programming should be oblivious of memory model issues
  - Programmer should not be affected by changes in the memory model
- ISA definition for Load, Store, Memory Fence, synchronization instructions should
  - Be precise
  - Permit maximum flexibility in hardware implementation
  - Permit efficient implementation of high-level parallel constructs

April 13, 2021

MIT 6.823 Spring 2021

L15-28 April 13, 2021

MIT 6.823 Spring 2021

L15-29

*Thank you!*

*Next Lecture:  
On-Chip Networks*

## On-Chip Networks I: Topology/Flow Control

*Daniel Sanchez*  
Computer Science & Artificial Intelligence Lab  
M.I.T.

## Release Consistency [Garachorloo 1990]

- Only care about inter-processor memory ordering at thread synchronization points, not in between
- Can treat all synchronization instructions as the only ordering points

...  
Acquire(lock) // All following loads get most recent written values  
... Read and write shared data ..  
Release(lock) // All preceding writes are globally visible before  
// lock is freed.  
...

...

MIT 6.823 Spring 2021

L15-27

**History: From interconnection networks to on-chip networks**

Box-to-box  
networks



Board-to-board  
networks



Chip-to-chip  
networks



On-chip  
networks



**Focus on on-chip networks connecting caches in shared-memory processors**

Multi-Chip: Supercomputers, Data Centers, Internet Routers, Servers  
On-Chip: Servers, Laptops, Phones, HDTVs, Access routers

April 15, 2021

MIT 6.823 Spring 2021

L16-1 April 15, 2021

MIT 6.823 Spring 2021

L16-2

## What's an on-chip network?

E.g. Cache-coherent chip multiprocessor



April 15, 2021

MIT 6.823 Spring 2021

L16-3 April 15, 2021

MIT 6.823 Spring 2021

L16-4

## Designing an on-chip network



- Topology
- Flow control
- Router microarchitecture
- Routing

## Interconnection Network Architecture

- *Topology*: How to connect the nodes up? (processors, memories, router line cards, ...)
- *Routing*: Which path should a message take?
- *Flow control*: How is the message actually forwarded from source to destination?
- *Router microarchitecture*: How to build the routers?
- *Link microarchitecture*: How to build the links?

## Topology

April 15, 2021

MIT 6.823 Spring 2021

L16-5 April 15, 2021

MIT 6.823 Spring 2021

L16-6

## Topological Properties

- *Diameter*
- *Average Distance*
- *Bisection Bandwidth*

## Topological Properties

- *Routing Distance* - number of links on route
- *Diameter* - maximum routing distance
- *Average Distance*
- A network is *partitioned* by a set of links if their removal disconnects the graph
- *Bisection Bandwidth* is the bandwidth crossing a minimal cut that divides the network in half

April 15, 2021

MIT 6.823 Spring 2021

L16-7 April 15, 2021

MIT 6.823 Spring 2021

L16-8

## Linear Arrays and Rings



Linear Array

Torus

Torus arranged to use short wires

Route A -> B given by relative address R = B-A

Linear Array    Ring (1-D Torus)

Diameter?

Average distance?

Bisection bandwidth?

- **Torus Examples:**

- FDDI, SCI, FiberChannel Arbitrated Loop, Intel Xeon

April 15, 2021

MIT 6.823 Spring 2021

L16-9 April 15, 2021

MIT 6.823 Spring 2021

L16-10

## Multidimensional Meshes and Tori



- $d$ -dimensional array

- $n = k_{d-1} \times \dots \times k_0$  nodes

- described by  $d$ -vector of coordinates  $(i_{d-1}, \dots, i_0)$

- $d$ -dimensional  $k$ -ary mesh:  $N = k^d$

- $k = \sqrt[d]{N}$

- described by  $d$ -vector of radix  $k$  coordinate

- $d$ -dimensional  $k$ -ary torus (or  $k$ -ary  $d$ -cube)

April 15, 2021

MIT 6.823 Spring 2021

L16-9 April 15, 2021

MIT 6.823 Spring 2021

L16-10

## Routing & Flow Control Overview

April 15, 2021

MIT 6.823 Spring 2021

L16-11 April 15, 2021

MIT 6.823 Spring 2021

L16-12

## Routing vs Flow Control

- Routing algorithm chooses path that packets should follow to get from source to destination
- Flow control schemes allocate resources (buffers, links, control state) to packets traversing the network
- Our approach: Bottom-up
  - Today: Flow control, assuming routes are set
  - Next lecture: Routing algorithms

April 15, 2021

MIT 6.823 Spring 2021

L16-13 April 15, 2021

MIT 6.823 Spring 2021

L16-14

## Properties of Routing Algorithms

- Deterministic/Oblivious
  - Route determined by (source, dest), not intermediate state (i.e. traffic)
- Adaptive
  - Route influenced by traffic along the way
- Minimal
  - Only selects shortest paths
- Deadlock-free
  - No traffic pattern can lead to a situation where no packets move forward

(more in next lecture)

## Contention



## Flow Control

- Two packets trying to use the same link at the same time
  - Limited or no buffering
- Problem arises because we are sharing resources
  - Sharing bandwidth and buffers

April 15, 2021

MIT 6.823 Spring 2021

L16-15 April 15, 2021

MIT 6.823 Spring 2021

L16-16

## Flow Control Protocols

- Bufferless
  - Circuit switching
  - Dropping
  - Misrouting
- Buffered
  - Store-and-forward
  - Virtual cut-through
  - Wormhole
  - Virtual-channel



## Circuit Switching

- Form a circuit from source to dest
- Probe to set up path through network
- Reserve all links
- Data sent through links
- Bufferless

April 15, 2021

MIT 6.823 Spring 2021

L16-17 April 15, 2021

MIT 6.823 Spring 2021

L16-18

## Time-space View: Circuit Switching



- Why is this good?
- Why is it not?

## Speculative Flow Control: Dropping

- If two things arrive and I don't have resources, drop one of them
- Flow control protocol on the Internet



April 15, 2021

MIT 6.823 Spring 2021

L16-19 April 15, 2021

MIT 6.823 Spring 2021

L16-20

## Time-space Diagram: Dropping



## Less Simple Flow Control: Misrouting

- If only one message can enter the network at each node, and one message can exit the network at each node, the network can never be congested. Right?



- Philosophy behind misrouting: intentionally route away from congestion
- No need for buffering
- Problems?

April 15, 2021

MIT 6.823 Spring 2021

L16-21 April 15, 2021

MIT 6.823 Spring 2021

L16-22

## Buffered Routing



- Link-level flow control:

- Given that you can't drop packets, how to manage the buffers?  
When can you send stuff forward, when not?

- Metrics of interest:

- Throughput/Latency
- Buffer utilization (turnaround time)

April 15, 2021

MIT 6.823 Spring 2021

L16-23 April 15, 2021

MIT 6.823 Spring 2021

L16-24

## Store-and-Forward (packet-based, no flits)

- Strategy:

- Make intermediate stops and wait until the entire packet has arrived before you move on

- Advantage:

- Other packets can use intermediate links

## Techniques for link backpressure

- Naïve stall-based (on/off):
  - Can source send or not?
- Sophisticated stall-based (credit-based):
  - How many flits can be sent to the next node?
- Speculative (ack/nack):
  - Guess can always send, but keep copy
  - Resolve if send was successful (ack/nack)
    - On ack – drop copy
    - On nack - resend

April 15, 2021

MIT 6.823 Spring 2021

L16-23 April 15, 2021

MIT 6.823 Spring 2021

L16-24

## Time-space View: Store-and-Forward



Could be allocated at a much later time without packet dropping

- Buffering allows packet to wait for channel
- Drawback?

April 15, 2021

MIT 6.823 Spring 2021

L16-25 April 15, 2021

MIT 6.823 Spring 2021

L16-26

## Virtual Cut-through (packet-based)

- Why wait till entire message has arrived at each intermediate stop?
- The head flit of the packet can dash off first
- When the head gets blocked, whole packet gets blocked at one intermediate node
- Used in Alpha 21364

## Time-space View: Virtual Cut-through



April 15, 2021

MIT 6.823 Spring 2021

L16-27 April 15, 2021

MIT 6.823 Spring 2021

L16-28

## Flit-Buffer Flow Control: Wormhole

- When a packet blocks, just block wherever the pieces (flits) of the message are at that time.
- Operates like cut-through but with channel and buffers allocated to flits rather than packets
  - Channel state (virtual channel) allocated to **packet** so body flits can follow head flit

## Time-space View: Wormhole



- Advantages?

- Disadvantages?

April 15, 2021

MIT 6.823 Spring 2021

L16-29 April 15, 2021

MIT 6.823 Spring 2021

L16-29

## Virtual-Channel (VC) Flow Control

- When a message blocks, instead of holding on to links so others can't use them, hold on to **virtual** links
- Multiple queues in buffer storage
  - Like lanes on the highway
- Virtual channel can be thought of as channel state and flit buffers

## Time-space View: Virtual-Channel



- Advantages?

- Disadvantages?

April 15, 2021

MIT 6.823 Spring 2021

L16-31 April 15, 2021

MIT 6.823 Spring 2021

L16-32

Thank you!

Next Lecture:  
Router (Switch) Microarchitecture  
Routing Algorithms

## On-Chip Networks II: Router Microarchitecture & Routing

Daniel Sanchez

Computer Science & Artificial Intelligence Lab  
M.I.T.

April 15, 2021

MIT 6.823 Spring 2021

L16-33 April 22, 2021

MIT 6.823 Spring 2021

L17-1

## Recap: Wormhole Flow Control

- Each router manages buffers in flits
- Each packet is sent through output link as soon as possible (without waiting for all its flits to arrive)
- Router buffers are not large enough to hold full packet → on congestion, packet's flits often buffered across routers
- Problem: On congestion, links assigned to a blocked packet cannot be used by other packets

### Wormhole



April 22, 2021

MIT 6.823 Spring 2021

L17-2 April 22, 2021

MIT 6.823 Spring 2021

L17-3

## Time-Space View: Virtual-Channel Flow Control



- Advantages?
- Disadvantages?

April 22, 2021

MIT 6.823 Spring 2021

L17-4 April 22, 2021

MIT 6.823 Spring 2021

L17-5

## Recap: Virtual-Channel Flow Control

- When a packet blocks, instead of holding on to channel, hold on to **virtual channel**
- Virtual channel (VC) = channel state + flit buffers
- Multiple virtual channels reduce blocking
- Ex: Wormhole (=1 VC/channel) vs 2 VCs/channel



MIT 6.823 Spring 2021

L17-3

## Interconnection Network Architecture

- **Topology:** How to connect the nodes up? (processors, memories, router line cards, ...)
- **Routing:** Which path should a message take?
- **Flow control:** How is the message actually forwarded from source to destination?
- **Router microarchitecture:** How to build the routers?
- **Link microarchitecture:** How to build the links?

April 22, 2021

MIT 6.823 Spring 2021

L17-4 April 22, 2021

MIT 6.823 Spring 2021

L17-5

## Ring-based Interconnect

### Router Microarchitecture



April 22, 2021

MIT 6.823 Spring 2021

L17-6 April 22, 2021

MIT 6.823 Spring 2021

L17-7

### Ring Stop



April 22, 2021

MIT 6.823 Spring 2021

L17-8 April 22, 2021

MIT 6.823 Spring 2021

L17-9

### Ring Flow Control: Priorities



Rotary Rule – traffic in ring has priority

### Ring Flow Control: Bounces

- What if traffic on the ring cannot get delivered, e.g., if output FIFO is full?
- One alternative: Continue on ring (bounce)
- *What are the consequences of such bounces?*

### General Interconnect Tilera, Knights Landing...



April 22, 2021

MIT 6.823 Spring 2021

L17-10 April 22, 2021

MIT 6.823 Spring 2021

L17-11

# What's In A Router?

- It's a system as well
    - Logic – State machines, Arbiters, Allocators
      - Control data movement through router
      - Idle, Routing, Waiting for resources, Active
    - Memory – Buffers
      - Store flits before forwarding them
      - SRAMs, registers, processor memory
    - Communication – Switches
      - Transfer flits from input to output ports
      - Crossbars, multiple crossbars, fully-connected, bus

April 22, 2021

MIT 6.823 Spring 2021

117-12 April 22, 2021

## Virtual-channel Router



MIT 6.S23 Spring 2021

17-13

# Router Pipeline vs. Processor Pipeline

- Logical stages:
    - BW
    - RC
    - VA
    - SA
    - BR
    - ST
    - LT
  - Different flits go through different stages
  - Different routers have different variants
    - E.g. speculation, lookahead heads, bypassing
  - Different implementations of each pipeline stage
  - Logical stages:
    - IF
    - ID
    - EX
    - MEM
    - WB
  - Different instructions go through different stages
  - Different processors have different variants
    - E.g. speculation, ISA
  - Different implementations of each pipeline stage

April 22, 2021

MIT 6.S23 Spring 2021

| 17-14 April 22 2021

## Baseline Router Pipeline



- Route computation performed once per packet
  - Virtual channel allocated once per packet
  - Body and tail flits inherit this info from head flit

## Allocators In Routers

- VC Allocator
    - Input VCs requesting for a range of output VCs
    - Example: A packet of VC0 arrives at East input port. It's destined for west output port, and would like to get any of the VCs of that output port.
  - Switch Allocator
    - Input VCs of an input port request for different output ports (e.g., One's going North, another's going West)
  - “Greedy” algorithms used for efficiency
  - What happens if allocation fails on a given cycle?

April 22, 2021

MIT 6.823 Spring 2021

L17-16 April 22, 2021

## VC & Switch Allocation Stalls



## Pipeline Optimizations: Lookahead Routing [Galles, SGI Spider Chip]

- At current router, perform route computation for next router



- Head flit already carries output port for next router
- RC just has to read output → fast, can be overlapped with BW
- Precomputing route allows flits to compete for VCs immediately after BW
- Routing computation for the next hop (NRC) can be computed in parallel with VA

- Or simplify RC (e.g., X-Y routing is very fast)

April 22, 2021

MIT 6.823 Spring 2021

L17-18 April 22, 2021

MIT 6.823 Spring 2021

L17-19

## DOR – Turns allowed

- One way of looking at whether a routing algorithm is deadlock free is to look at the turns allowed.
- Deadlocks may occur if turns can form a cycle



XY Model



YX Model

## Allowing more turns

- Allowing more turns may allow adaptive routing, but also **deadlock**



April 22, 2021

MIT 6.823 Spring 2021

L17-24 April 22, 2021

MIT 6.823 Spring 2021

L17-25

## Turn Model [Glass and Ni, 1994]

- A systematic way of generating **deadlock-free routes** with small number of prohibited turns
- Deadlock-free if routes conform to at least **ONE** of the turn models (acyclic channel dependence graph)



West-First Turn Model



North-Last Turn Model

## 2-D Mesh and CDG

Can create a *channel dependency graph (CDG)* of the network.

Disallowing  
180° turns, e.g.,  
 $AB \rightarrow BA$



MIT 6.823 Spring 2021 L17-27

## Cycles in CDG

The channel dependency graph D derived from the network topology may contain many cycles



Flow routed through links AB, BE, EF  
Flow routed through links EF, FA, AB  
Deadlock!

## Key Insight

If routes of flows conform to acyclic CDG, then there will be no possibility of deadlock!



Disallow/Delete certain edges in CDG  
Edges in CDG correspond to turns in network!

April 22, 2021

MIT 6.823 Spring 2021

L17-28 April 22, 2021

MIT 6.823 Spring 2021

L17-29

## Acyclic CDG → Deadlock-free routes



April 22, 2021

MIT 6.823 Spring 2021

L17-30 April 22, 2021

L17-31

## Resource Conflicts → Deadlock



Routing deadlocks in wormhole routing result from Structural hazard at router resources, e.g., buffers.

How can structural hazards be avoided?

April 22, 2021

MIT 6.823 Spring 2021

L17-32 April 22, 2021

MIT 6.823 Spring 2021

L17-33

## CDG and Virtual Channels



April 22, 2021

MIT 6.823 Spring 2021

L17-34 April 22, 2021

L17-35

## West-first → Deadlock-free routes



April 22, 2021

MIT 6.823 Spring 2021

L17-30 April 22, 2021

L17-31

## Virtual Channels

- Virtual channels can be used to avoid deadlock by restricting VC allocation



MIT 6.823 Spring 2021

L17-33

## Randomized Routing: Valiant

- Route each packet through a randomly chosen intermediate node



A packet, going from node SA to node DA, is first routed from SA to a randomly chosen intermediate node IA, before going from IA to final destination DA.

It helps load-balance the network and has a good worst-case performance at the expense of locality.

MIT 6.823 Spring 2021

L17-35

# ROMM: Randomized, Oblivious Multi-phase Minimal Routing



Thank you!

Next Lecture: Multithreading

April 22, 2021

MIT 6.823 Spring 2021

L17-36 April 22, 2021

MIT 6.823 Spring 2021

L17-37

## Multithreading Architectures

Daniel Sanchez  
Computer Science & Artificial Intelligence Lab  
M.I.T.

## Pipeline Hazards

|                  | t0 | t1 | t2 | t3 | t4 | t5 | t6 | t7 | t8 | t9 | t10 | t11 | t12 | t13 | t14 |
|------------------|----|----|----|----|----|----|----|----|----|----|-----|-----|-----|-----|-----|
| LW r1, 0(r2)     | F  | D  | X  | M  | W  |    |    |    |    |    |     |     |     |     |     |
| LW r5, 12(r1)    |    | F  | D  | D  | D  | D  | X  | M  | W  |    |     |     |     |     |     |
| ADDI r5, r5, #12 |    |    | F  | F  | F  | F  | D  | D  | D  | D  | X   | M   | W   |     |     |
| SW 12(r1), r5    |    |    |    | F  | F  | F  | F  | D  | D  | D  | D   | D   | D   | D   | D   |

- Each instruction may depend on the previous one
- What can be done to cope with this?
- Even bypassing, speculation and finding something else to do (via O-O-O) does not eliminate all delays

April 27, 2021

MIT 6.823 Spring 2021

L18-1 April 27, 2021

MIT 6.823 Spring 2021

L18-2

## Multithreading

How can we guarantee no dependencies between instructions in a pipeline?

Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe

|                      | t0 | t1 | t2 | t3 | t4 | t5 | t6 | t7 | t8 | t9 |
|----------------------|----|----|----|----|----|----|----|----|----|----|
| T1: LW r1, 0(r2)     | F  | D  | X  | M  | W  |    |    |    |    |    |
| T2: ADD r7, r1, r4   |    | F  | D  | X  | M  | W  |    |    |    |    |
| T3: XORI r5, r4, #12 |    |    | F  | D  | X  | M  | W  |    |    |    |
| T4: SW 0(r7), r5     |    |    |    | F  | D  | X  | M  | W  |    |    |
| T1: LW r5, 12(r1)    |    |    |    |    | F  | D  | X  | M  | W  |    |

Prior instruction in a thread always completes write-back before next instruction in same thread reads register file

## CDC 6600 Peripheral Processors (Cray, 1964)



- First commercial multithreaded hardware
- 10 "virtual" I/O processors
- Fixed interleave on simple pipeline
- Pipeline has 100ns cycle time
- Each virtual processor executes one instruction every 1000ns

April 27, 2021

MIT 6.823 Spring 2021

L18-3 April 27, 2021

MIT 6.823 Spring 2021

L18-4

## Simple Multithreaded Pipeline



Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stage

## Multithreading Costs

- Each thread needs its own user architectural state
  - PC
  - GPRs (CDC6600 PPUs – accumulator-based architecture)
- Also, needs its own system architectural state
  - Virtual memory page table base register
  - Exception handling registers
- *Other costs?*
- Appears to software (including OS) as multiple, albeit slower, CPUs

April 27, 2021

MIT 6.823 Spring 2021

L18-5 April 27, 2021

MIT 6.823 Spring 2021

L18-6

## Thread Scheduling Policies

- Fixed interleave (*CDC 6600 PPUs, 1965*)
  - Each of N threads executes one instruction every N cycles
  - If thread not ready to go in its slot, insert pipeline bubble
- Software-controlled interleave (*TI ASC PPUs, 1971*)
  - OS allocates S pipeline slots among N threads
  - Hardware performs fixed interleave over S slots, executing whichever thread is in that slot
- 
- Hardware-controlled thread scheduling (*HEP, 1982*)
  - Hardware keeps track of which threads are ready to go
  - Picks next thread to execute based on hardware priority scheme

April 27, 2021

MIT 6.823 Spring 2021

L18-7 April 27, 2021

MIT 6.823 Spring 2021

L18-8

## Denelcor HEP (Burton Smith, 1982)



First commercial machine to use hardware threading in main CPU

- 120 threads per processor
- 10 MHz clock rate
- Up to 8 processors
- Precursor to Tera MTA (Multithreaded Architecture)

## Tera MTA (1990-97)



- Up to 256 processors
- Up to 128 active threads per processor
- Processors and memory modules populate a sparse 3D torus interconnection fabric
- Flat, shared main memory
  - No data cache
  - Sustains one main memory access per cycle per processor
- GaAs logic in prototype, 1KW/processor @ 260MHz
  - CMOS version, MTA-2, 50W/processor

April 27, 2021

MIT 6.823 Spring 2021

L18-9 April 27, 2021

MIT 6.823 Spring 2021

L18-10

## MTA Architecture

- Each processor supports 128 active hardware threads
  - $1 \times 128 = 128$  stream status word (SSW) registers,
  - $8 \times 128 = 1024$  branch-target registers,
  - $32 \times 128 = 4096$  general-purpose registers
- Three operations packed into 64-bit instruction (short VLIW)
  - One memory operation,
  - One arithmetic operation, plus
  - One arithmetic or branch operation
- Thread creation and termination instructions
- Explicit 3-bit "lookahead" field in instruction gives number of subsequent instructions (0-7) that are independent of this one
  - Allows fewer threads to fill machine pipeline
  - Used for variable-sized branch delay slots

## MTA Pipeline



- Every cycle, one instruction from one active thread is launched into pipeline
- Instruction pipeline is 21 cycles long
- Memory operations incur ~150 cycles of latency

Assuming a single thread issues one instruction every 21 cycles, and clock rate is 260 MHz...

What is single thread performance?

April 27, 2021

MIT 6.823 Spring 2021

L18-11 April 27, 2021

MIT 6.823 Spring 2021

L18-12

## Multithreading Design Choices

- Fine-grained multithreading
  - Context switch among threads every cycle
- Coarse-grained multithreading
  - Context switch among threads every few cycles, e.g., on:
    - Function unit data hazard,
    - L1 miss,
    - L2 miss...
- Why choose one style over another?
- Choice depends on
  - Context-switch overhead
  - Number of threads supported (due to per-thread state)
  - Expected application-level parallelism...

April 27, 2021

MIT 6.823 Spring 2021

L18-13 April 27, 2021

MIT 6.823 Spring 2021

L18-14

## MIT Alewife (1990)



- Modified SPARC chips
  - Register windows hold different thread contexts
- Up to four threads per node
- Thread switch on local cache miss

April 27, 2021

MIT 6.823 Spring 2021

L18-15 April 27, 2021

MIT 6.823 Spring 2021

L18-16

## Coarse-Grain Multithreading

Tera MTA designed for supercomputing applications with large data sets and low locality

- No data cache
- Many parallel threads needed to hide large memory latency

Other applications are more cache friendly

- Few pipeline bubbles when cache getting hits
- Just add a few threads to hide occasional cache miss latencies
- Swap threads on cache misses

April 27, 2021

L18-11 April 27, 2021

MIT 6.823 Spring 2021

L18-12

## TX-2: Multi-sequence computer (Wes Clark, Lincoln Labs, 1956)

32 Instruction sequences (threads) with

- a fixed priority order among the threads, and
- executes many instructions in a thread - switches mediated by:
  - Instruction "break"/"dismiss" bits
  - Attention request from I/O



- Start-Over
- In-out alarms
- Arithmetic alarms (overflows, etc.)
- Magnetic tape units (multiple)
- High-speed printer
- Analog-to-digital converter
- Paper tape readers (multiple)
- Light pen
- Display (multiple)
- Memory Test Computer
- TX-0
- Digital-to-analog converter
- Paper tape punch
- Flexowriters (multiple)
- \*Main sequences (three)

April 27, 2021

L18-13 April 27, 2021

MIT 6.823 Spring 2021

L18-14

## IBM PowerPC RS64-IV (2000)

- Commercial coarse-grain multithreading CPU
- Based on PowerPC with quad-issue in-order five-stage pipeline
- Each physical CPU supports two virtual CPUs
- On L2 cache miss, pipeline is flushed and execution switches to second thread
  - Short pipeline minimizes flush penalty (4 cycles), small compared to memory access latency
  - Flush pipeline to simplify exception handling

April 27, 2021

MIT 6.823 Spring 2021

L18-15 April 27, 2021

MIT 6.823 Spring 2021

L18-16

## Superscalar Machine Efficiency



- Why horizontal waste?
- Why vertical waste?

April 27, 2021

MIT 6.823 Spring 2021

L18-17 April 27, 2021

MIT 6.823 Spring 2021

L18-18

## Chip Multiprocessing



- What is the effect of splitting into multiple processors?

April 27, 2021

MIT 6.823 Spring 2021

L18-19 April 27, 2021

MIT 6.823 Spring 2021

L18-19

## O-o-O Simultaneous Multithreading [Tullsen, Eggers, Emer, Levy, Stamm, Lo, DEC/UW, 1996]

- Add multiple contexts and fetch engines and allow instructions fetched from different threads to issue simultaneously
- Utilize wide out-of-order superscalar processor issue queue to find instructions to issue from multiple threads
- OOO instruction window already has most of the circuitry required to schedule from multiple threads
- Any single thread can utilize whole machine

## Vertical Multithreading



- What is the effect of cycle-by-cycle interleaving?

April 27, 2021

MIT 6.823 Spring 2021

L18-17 April 27, 2021

MIT 6.823 Spring 2021

L18-18

## Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995]



- Interleave multiple threads to multiple issue slots with no restrictions

April 27, 2021

MIT 6.823 Spring 2021

L18-19 April 27, 2021

MIT 6.823 Spring 2021

L18-20

## Basic Out-of-order Pipeline



[EV8 – Microprocessor Forum, Oct 1999]

L18-21 April 27, 2021

MIT 6.823 Spring 2021

L18-22

April 27, 2021

MIT 6.823 Spring 2021

# SMT Pipeline



[EV8 – Microprocessor Forum, Oct 1999]

April 27, 2021

MIT 6.823 Spring 2021

L18-23 April 27, 2021

MIT 6.823 Spring 2021

L18-24

# Icount Choosing Policy

Fetch from thread with the least instructions in flight.



*Why does this enhance throughput?*

## Why Does Icount Make Sense?

$$T = \frac{N}{L}$$

Assuming latency (L) is unchanged with the addition of threading.  
For each thread i with original throughput  $T_i$  (and 4 threads):

$$T_i/4 = \frac{N/4}{L}$$

April 27, 2021

MIT 6.823 Spring 2021

L18-25 April 27, 2021

MIT 6.823 Spring 2021

L18-26

## SMT Fetch Policies (Locks)

- Problem:  
Spin looping thread consumes resources
- Solution:  
Provide quiescing operation that allows a thread to sleep until a memory location changes



## Adaptation to parallelism type

For regions with high thread level parallelism (TLP) entire machine width is shared by all threads



April 27, 2021

MIT 6.823 Spring 2021

L18-27 April 27, 2021

MIT 6.823 Spring 2021

L18-27

## Pentium-4 Hyperthreading (2002)

- First commercial SMT design (2-way SMT)
  - Hyperthreading == SMT
- Logical processors share nearly all resources of the physical processor
  - Caches, execution units, branch predictors
- Die area overhead of hyperthreading ~ 5%
- When one logical processor is stalled, the other can make progress
  - No logical processor can use all entries in queues when two threads are active
- Processor running only one active software thread runs at approximately same speed with or without hyperthreading

## Pentium-4 Hyperthreading Front End



[ Intel Technology Journal, Q1 2002 ]

April 27, 2021

MIT 6.823 Spring 2021

L18-29 April 27, 2021

MIT 6.823 Spring 2021

L18-30

## Pentium-4 Branch Predictor

- Separate return address stacks per thread  
*Why?*
- Separate first-level global branch history table  
*Why?*
- Shared second-level branch history table, tagged with logical processor IDs

## Pentium-4 Hyperthreading Execution Pipeline



[ Intel Technology Journal, Q1 2002 ]

April 27, 2021

MIT 6.823 Spring 2021

L18-31 April 27, 2021

MIT 6.823 Spring 2021

L18-32

## Summary: Multithreading Styles



Thank you!

## Microcoded and VLIW Processors

Daniel Sanchez  
Computer Science & Artificial Intelligence Lab  
M.I.T.

## Hardwired vs Microcoded Processors

- All processors we have seen so far are hardwired: The microarchitecture directly implements all the instructions in the ISA
- Microcoded processors add a layer of interpretation: Each ISA instruction is executed as a sequence of simpler *microinstructions*
  - Simpler implementation*
  - Lower performance than hardwired ( $CPI > 1$ )*
- Microcoding common until the 80s, still in use today (e.g., complex x86 instructions are decoded into multiple “micro-ops”)

April 29, 2021

MIT 6.823 Spring 2021

L19-2 April 29, 2021

MIT 6.823 Spring 2021

L19-3

## Microcontrol Unit [Maurice Wilkes, 1954]

Embed the control logic state table in a read-only memory array



## Microcoded Microarchitecture



April 29, 2021

MIT 6.823 Spring 2021

L19-4 April 29, 2021

MIT 6.823 Spring 2021

L19-5

## A Bus-based Datapath for MIPS



## Memory Module



- Assumption: Memory operates asynchronously and is slow compared to Reg-to-Reg transfers

April 29, 2021

MIT 6.823 Spring 2021

L19-6 April 29, 2021

MIT 6.823 Spring 2021

L19-7

## Microcode Controller



# Jump Logic

$\mu\text{PCSrc} = \text{Case } \mu\text{JumpTypes}$

|          |                                                                                  |
|----------|----------------------------------------------------------------------------------|
| next     | $\Rightarrow \mu\text{PC} + 1$                                                   |
| spin     | $\Rightarrow \text{if (busy) then } \mu\text{PC} \text{ else } \mu\text{PC} + 1$ |
| fetch    | $\Rightarrow \text{absolute}$                                                    |
| dispatch | $\Rightarrow \text{op-group}$                                                    |
| feqz     | $\Rightarrow \text{if (zero) then absolute else } \mu\text{PC} + 1$              |
| fnez     | $\Rightarrow \text{if (zero) then } \mu\text{PC} + 1 \text{ else absolute}$      |

April 29, 2021

MIT 6.823 Spring 2021

L19-8 April 29, 2021

MIT 6.823 Spring 2021

L19-9

# Instruction Execution

Execution of a MIPS instruction involves

1. instruction fetch
2. decode and register fetch
3. ALU operation
4. memory operation (optional)
5. write back to register file (optional)  
+ the computation of the  
*next instruction address*

# Instruction Fetch

State Control points next-state

|                    |                                             |  |
|--------------------|---------------------------------------------|--|
| fetch <sub>0</sub> | MA $\leftarrow$ PC                          |  |
| fetch <sub>1</sub> | IR $\leftarrow$ Memory                      |  |
| fetch <sub>2</sub> | A $\leftarrow$ PC                           |  |
| fetch <sub>3</sub> | PC $\leftarrow A + 4$                       |  |
| ...                |                                             |  |
| ALU <sub>0</sub>   | A $\leftarrow \text{Reg}[rs]$               |  |
| ALU <sub>1</sub>   | B $\leftarrow \text{Reg}[rt]$               |  |
| ALU <sub>2</sub>   | Reg[rd] $\leftarrow \text{func}(A, B)$      |  |
| ALUi <sub>0</sub>  | A $\leftarrow \text{Reg}[rs]$               |  |
| ALUi <sub>1</sub>  | B $\leftarrow \text{sExt}_{16}(\text{Imm})$ |  |
| ALUi <sub>2</sub>  | Reg[rd] $\leftarrow \text{Op}(A, B)$        |  |



April 29, 2021

MIT 6.823 Spring 2021

L19-10 April 29, 2021

MIT 6.823 Spring 2021

L19-11

# Load & Store

| State           | Control points                              | next-state |
|-----------------|---------------------------------------------|------------|
| LW <sub>0</sub> | A $\leftarrow \text{Reg}[rs]$               | next       |
| LW <sub>1</sub> | B $\leftarrow \text{sExt}_{16}(\text{Imm})$ | next       |
| LW <sub>2</sub> | MA $\leftarrow A+B$                         | next       |
| LW <sub>3</sub> | Reg[rt] $\leftarrow \text{Memory}$          | spin       |
| LW <sub>4</sub> |                                             | fetch      |
| SW <sub>0</sub> | A $\leftarrow \text{Reg}[rs]$               | next       |
| SW <sub>1</sub> | B $\leftarrow \text{sExt}_{16}(\text{Imm})$ | next       |
| SW <sub>2</sub> | MA $\leftarrow A+B$                         | next       |
| SW <sub>3</sub> | Memory $\leftarrow \text{Reg}[rt]$          | spin       |
| SW <sub>4</sub> |                                             | fetch      |

# Branches

| State             | Control points                                    | next-state |
|-------------------|---------------------------------------------------|------------|
| BEQZ <sub>0</sub> | A $\leftarrow \text{Reg}[rs]$                     | next       |
| BEQZ <sub>1</sub> |                                                   | fnez       |
| BEQZ <sub>2</sub> | A $\leftarrow \text{PC}$                          | next       |
| BEQZ <sub>3</sub> | B $\leftarrow \text{sExt}_{16}(\text{Imm} \ll 2)$ | next       |
| BEQZ <sub>4</sub> | PC $\leftarrow A+B$                               | fetch      |
| BNEZ <sub>0</sub> | A $\leftarrow \text{Reg}[rs]$                     | next       |
| BNEZ <sub>1</sub> |                                                   | feqz       |
| BNEZ <sub>2</sub> | A $\leftarrow \text{PC}$                          | next       |
| BNEZ <sub>3</sub> | B $\leftarrow \text{sExt}_{16}(\text{Imm} \ll 2)$ | next       |
| BNEZ <sub>4</sub> | PC $\leftarrow A+B$                               | fetch      |

# Jumps

| State             | Control points                        | next-state |
|-------------------|---------------------------------------|------------|
| J <sub>0</sub>    | A $\leftarrow \text{PC}$              | next       |
| J <sub>1</sub>    | B $\leftarrow \text{IR}$              | next       |
| J <sub>2</sub>    | PC $\leftarrow \text{JumpTarg}(A, B)$ | fetch      |
| JR <sub>0</sub>   | A $\leftarrow \text{Reg}[rs]$         | next       |
| JR <sub>1</sub>   | PC $\leftarrow A$                     | fetch      |
| JAL <sub>0</sub>  | A $\leftarrow \text{PC}$              | next       |
| JAL <sub>1</sub>  | Reg[31] $\leftarrow A$                | next       |
| JAL <sub>2</sub>  | B $\leftarrow \text{IR}$              | next       |
| JAL <sub>3</sub>  | PC $\leftarrow \text{JumpTarg}(A, B)$ | fetch      |
| JALR <sub>0</sub> | A $\leftarrow \text{PC}$              | next       |
| JALR <sub>1</sub> | B $\leftarrow \text{Reg}[rs]$         | next       |
| JALR <sub>2</sub> | Reg[31] $\leftarrow A$                | next       |
| JALR <sub>3</sub> | PC $\leftarrow B$                     | fetch      |

April 29, 2021

MIT 6.823 Spring 2021

L19-12 April 29, 2021

MIT 6.823 Spring 2021

L19-13

## VAX 11-780 Microcode (1978)

```

P1WUD_I [1601,1205] MICKO2_1F[125] 26-May-91 14:58:51 VAX11/780 Microcode 1 PCB 01, FPLA 00, WCS122 Page 771
J CALL2_M, [16,1205] Procedure call I CALLG, CALLS

;-----+
; J29744 JHERE FOR CALLG OR CALLS, AFTER PROBING THE EXTENT OF THE STACK
;-----+
; J29745 JCALL 0
;-----+
; J29747 CALL,7,0 D,AND,RC(72) ;CALL SITE FOR PUSH
;-----+
; J29748 CALL,J/PUSH ;ISTRIP MASK TO BITS 11-0
;-----+
; J29749 CALL,J/PUSHREGS ;PUSH REGISTERS
;-----+
; J29750 CALL,J/PUSHSP ;PUSH SP
;-----+
; J29751 CACHE_DILONGC ;RETURN FROM PUSH
;-----+
; J29752 LAB_R,[SP] ;I = SP
;-----+
; J29753 JCALL,7,0 D,AND,RC(72) ;CALL SITE FOR PUSHPC
;-----+
; J29754 CALL,R1,RESPVAL,A,LAA-K,[8] ;UPDATE SP FOR PUSH OF PC &
;-----+
; J29755 JCALL,7,0 D,FFP,RC(72) ;READY TO PUSH FRAME POINTER
;-----+
; J29756 JCALL,7,0 D,FFP,RC(72) ;CALL SITE FOR PSHSP
;-----+
; J29761 CACHE_DILONGC ;STORE FP
;-----+
; J29762 LAB_R,[SP] ;GET SP AGAIN
;-----+
; J29763 RC,K1,FFP01 ;I=16 TO SC
;-----+
; J29764 CALL,J/PUSHSP ;CALL J/PUSHSP
;-----+
; J29765 JCALL,7,0 D,AND,RC(72) ;READY TO PUSH AP
;-----+
; J29766 CALL,J/DIPS1,RC(72) ;READY TO PUSH AP
;-----+
; J29767 JCALL,7,0 D,AND,RC(72) ;READY TO PUSH AP
;-----+
; J29771 CACHE_DILONGC ;STORE GL_AP
;-----+
; J29772 LAB_R,[SP] ;CLEAR PSHAP INTO V.CD
;-----+
; J29773 JCALL,7,0 D,AND,RC(72) ;CLEAR PSHAP INTO LATCHES AGAIN
;-----+
; J29775 JCALL,7,0 D,AND,RC(72) ;LOAD NEW PC AND CLEAR OUT
;-----+
; J29776 PCWCA,RC(71),FLUSH,IR ;LOAD PC
;-----+
; J29778 JCALL,7,0 D,AND,RC(72) ;PUSH TO D31H>IR
;-----+
; J29779 JCALL,7,0 D,AND,RC(72) ;RECOVER MASK
;-----+
; J29780 SC,BCKC(1) ;PUT -13 IN SC
;-----+
; J29781 SC,DALC(2) ;START FETCHING SUBROUTINE I
;-----+
; J29782 SC,DALC(2) ;PUT -13 IN SC
;-----+
; J29783 SC,DALC(2) ;MASK AND PMS IN D31H>IR
;-----+
; J29784 SC,BCKC(1) ;GET LOAD PMS IN D31H>IR
;-----+
; J29785 SC,BCKC(1) ;CLEAR D31H>IR
;-----+
; J29786 SC,BCKC(1) ;PUT -17 IN SC
;-----+
; J29787 SC,BCKC(1) ;PUT -17 IN SC
;-----+
; J29788 SC,BCKC(1) ;PUT -17 IN SC

```

# Very Long Instruction Word (VLIW) Processors

April 29, 2021

MIT 6.823 Spring 2021

119-14 April 29, 2021

MIT 6.823 Spring 2021

19-15

## Sequential ISA Bottleneck



April 29 2021

MIT 6.823 Spring 2021

119-16 April 29 2021

MIT 6.823 Spring 2021

| 19-17

# VLIW: Very Long Instruction Word



- Multiple operations packed into one instruction
  - Each operation slot is for a fixed function
  - Constant operation latencies are specified

# VLIW Design Principles

## The architecture:

- Allows operation parallelism within an instruction
    - No cross-operation RAW check
  - Provides deterministic latency for all operations
    - Latency measured in ‘instructions’
    - No data use allowed before specified latency with no data interlocks

## The compiler:

- Schedules (reorders) to maximize parallel execution
  - Guarantees intra-instruction parallelism
  - Schedules to avoid data hazards (no interlocks)
    - Typically separates operations with explicit NOPs

## Early VLIW Machines

- FPS AP120B (1976)
    - scientific attached array processor
    - first commercial wide instruction machine
    - hand-coded vector math libraries using software pipelining and loop unrolling
  - Multiflow Trace (1987)
    - commercialization of ideas from Fisher's Yale group including "trace scheduling"
    - available in configurations with 7, 14, or 28 operations/instruction
    - 28 operations packed into a 1024-bit instruction word
  - Cydrome Cydra-5 (1987)
    - 7 operations encoded in 256-bit instruction word
    - rotating register file

April 29, 2021

MIT 6.823 Spring 2021

L19-18 April 29, 2021

MIT 6.823 Spring 2021

L19-19



# Trace Scheduling

[Fisher, Ellis]



- Pick string of basic blocks, a trace, that represents most frequent branch path
- Schedule whole "trace" at once
- Add fixup code to cope with branches jumping out of trace

*How do we know which trace to pick?*

April 29, 2021

MIT 6.823 Spring 2021

L19-26 April 29, 2021

MIT 6.823 Spring 2021

L19-27

# VLIW Instruction Encoding

- Schemes to reduce effect of unused fields
  - Compressed format in memory, expand on I-cache refill
    - used in Multiflow Trace
    - introduces instruction addressing challenge
  - Provide a single-op VLIW instruction
    - Cydra-5 UniOp instructions
  - Mark parallel groups
    - used in TMS320C6x DSPs, Intel IA-64



April 29, 2021

MIT 6.823 Spring 2021

L19-28 April 29, 2021

MIT 6.823 Spring 2021

L19-29

# IA-64 Predicated Execution

Problem: Mispredicted branches limit ILP

Solution: Eliminate hard-to-predict branches with predicated execution

- Almost all IA-64 instructions can be executed conditionally under predicate
- Instruction becomes NOP if predicate register false



April 29, 2021

MIT 6.823 Spring 2021

L19-30 April 29, 2021

MIT 6.823 Spring 2021

L19-31

# Problems with "Classic" VLIW

- Knowing branch probabilities
  - Profiling requires significant extra step in build process
- Scheduling for statically unpredictable branches
  - Optimal schedule varies with branch path
- Object code size
  - Instruction padding wastes instruction memory/cache
  - Loop unrolling/software pipelining replicates code
- Scheduling memory operations
  - Caches and/or memory bank conflicts impose statically unpredictable variability
  - Uncertainty about addresses limit code reordering
- Object-code compatibility
  - Have to recompile all code for every machine, even for two machines in same generation

# Cydra-5: Memory Latency Register (MLR)

- Problem: Loads have variable latency
- Solution: Let software choose desired memory latency
- Compiler schedules code for maximum load-use distance
- Software sets MLR to latency that matches code schedule
- Hardware ensures that loads take exactly MLR cycles to return values into processor pipeline
  - Hardware buffers loads that return early
  - Hardware stalls processor if loads return late

# Fully Bypassed Datapath



Where does predication fit in?

April 29, 2021

MIT 6.823 Spring 2021

L19-31 April 29, 2021

MIT 6.823 Spring 2021

L19-32

## IA-64 Speculative Execution

Problem: Branches restrict compiler code motion

Solution: Speculative operations that don't cause exceptions



Particularly useful for scheduling long latency loads early

April 29, 2021

MIT 6.823 Spring 2021

L19-32 April 29, 2021

MIT 6.823 Spring 2021

L19-33

## Clustered VLIW



- Divide machine into clusters of local register files and local functional units
- Lower bandwidth/higher latency interconnect between clusters
- Software responsible for mapping computations to minimize communication overhead
- Common in commercial embedded processors, examples include TI C6x series DSPs, and HP Lx processor
- Exists in some superscalar processors, e.g., Alpha 21264

April 29, 2021

MIT 6.823 Spring 2021

L19-34 April 29, 2021

MIT 6.823 Spring 2021

L19-35

Thank you!

Next Lecture: Vector Processors

## Vector Processors

Daniel Sanchez  
Computer Science & Artificial Intelligence Lab  
M.I.T.

April 29, 2021

MIT 6.823 Spring 2021

L19-36 May 4, 2021

MIT 6.823 Spring 2021

L20-1

## IA-64 Data Speculation

Problem: Possible memory hazards limit code scheduling

Solution: Instruction-based speculation with hardware monitor to check for pointer hazards



Can't move load above store because store might be to same address



Requires associative hardware in address check table

April 29, 2021

MIT 6.823 Spring 2021

L19-32 April 29, 2021

MIT 6.823 Spring 2021

L19-33

# Supercomputers

Definition of a supercomputer:

- Fastest machine in the world at given task
- A device to turn a compute-bound problem into an I/O bound problem
- Any machine costing \$30M+
- Any machine designed by Seymour Cray

CDC6600 (Cray, 1964) regarded as first supercomputer

# Supercomputer Applications

Typical application areas:

- Military research (nuclear weapons, cryptography)
- Scientific research
- Weather forecasting
- Oil exploration
- Industrial design (car crash simulation)
- Bioinformatics

All involve huge computations on large data sets

*In 70s-80s, Supercomputer = Vector Machine*

May 4, 2021

MIT 6.823 Spring 2021

L20-2 May 4, 2021

MIT 6.823 Spring 2021

L20-3

## Loop Unrolled Code Schedule

```
for (i=0; i<N; i++)  
    B[i] = A[i] + C;  
  
loop:  
    Id f1, 0(r1)  
    Id f2, 8(r1)  
    Id f3, 16(r1)  
    Id f4, 24(r1)  
    add r1, 32  
    fadd f5, f0, f1  
    fadd f6, f0, f2  
    fadd f7, f0, f3  
    fadd f8, f0, f4  
    sd f5, 0(r2)  
    sd f6, 8(r2)  
    sd f7, 16(r2)  
    sd f8, 24(r2)  
    add r2, 32  
    bne r1, r3, loop
```

|       | Int1   | Int 2 | M1      | M2 | FP+ | FPx |
|-------|--------|-------|---------|----|-----|-----|
| loop: |        | Id f1 |         |    |     |     |
|       |        | Id f2 |         |    |     |     |
|       |        | Id f3 |         |    |     |     |
|       | add r1 | Id f4 | fadd f5 |    |     |     |
|       |        |       | fadd f6 |    |     |     |
|       |        |       | fadd f7 |    |     |     |
|       |        |       | fadd f8 |    |     |     |
|       |        | sd f5 |         |    |     |     |
|       |        | sd f6 |         |    |     |     |
|       |        | sd f7 |         |    |     |     |
|       | add r2 | bne   | sd f8   |    |     |     |
|       |        |       |         |    |     |     |
|       |        |       |         |    |     |     |

May 4, 2021

MIT 6.823 Spring 2021

L20-4 May 4, 2021

MIT 6.823 Spring 2021

L20-5

## Cray-1 (1976)



May 4, 2021

MIT 6.823 Spring 2021

L20-6 May 4, 2021

## Vector Supercomputers

Epitomized by Cray-1, 1976:

- Scalar Unit
  - Load/Store Architecture
- Vector Extension
  - Vector Registers
  - Vector Instructions
- Implementation
  - Hardwired Control
  - Highly Pipelined Functional Units
  - No Data Caches
  - Interleaved Memory System
  - No Virtual Memory

MIT 6.823 Spring 2021

L20-6

## Cray-1 (1976)



MIT 6.823 Spring 2021

L20-7

# Vector Programming Model



May 4, 2021

MIT 6.823 Spring 2021

L20-8 May 4, 2021

MIT 6.823 Spring 2021

L20-9

# Vector Programming Model



May 4, 2021

MIT 6.823 Spring 2021

L20-9

# Compiler-based Vectorization



May 4, 2021

MIT 6.823 Spring 2021

L20-10 May 4, 2021

MIT 6.823 Spring 2021

L20-10

# Vector Code Example

| # C code                                                    | # Scalar code                                                                                                                                                                 | # Vector code                                                             |
|-------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------|
| <pre> for (i=0; i&lt;N; i++)   C[i] = A[i] + B[i];   </pre> | <pre> LI R4, 64 loop:   L.D F0, 0(R1)   L.D F2, 0(R2)   ADD.D F4, F2, F0   S.D F4, 0(R3)   DADDIU R1, 8   DADDIU R2, 8   DADDIU R3, 8   DSUBIU R4, 1   BNEZ R4, loop   </pre> | <pre> LI VLR, 64 LV V1, R1 LV V2, R2 ADD.VD V3, V1, V2 SV V3, R3   </pre> |

# Vector ISA Attributes

- Compact
  - One short instruction encodes N operations
- Expressive, tells hardware that these N operations:
  - Are independent
  - Use the same functional unit
  - Access disjoint elements in vector registers
  - Access registers in same pattern as previous instructions
  - Access a contiguous block of memory (unit-stride load/store)
  - Access memory in a known pattern (strided load/store)

May 4, 2021

MIT 6.823 Spring 2021

L20-12 May 4, 2021

MIT 6.823 Spring 2021

L20-13

# Vector ISA Hardware Implications

- Large amount of work per instruction
  - Less instruction fetch bandwidth requirements
  - Allows simplified instruction fetch design
- No data dependence within a vector
  - Amenable to deeply pipelined/parallel designs
- Disjoint vector element accesses
  - Banked rather than multi-ported register files
- Known regular memory access pattern
  - Allows for banked memory for higher bandwidth

May 4, 2021

MIT 6.823 Spring 2021

L20-12 May 4, 2021

MIT 6.823 Spring 2021

L20-13

## Vector Arithmetic Execution

- Use deep pipeline ( $\Rightarrow$  fast clock) to execute element operations
- Simplifies control of deep pipeline because elements in vector are independent ( $\Rightarrow$  no hazards!)



*Given 64-element registers, how long does it take to compute V3?*

May 4, 2021

MIT 6.823 Spring 2021

L20-14 May 4, 2021

ADD C,A,B

Execution using one pipelined functional unit

Execution using four pipelined functional units



MIT 6.823 Spring 2021

L20-15

## Vector Unit Structure



May 4, 2021

MIT 6.823 Spring 2021

L20-16 May 4, 2021

## Vector Instruction Parallelism

Can overlap execution of multiple vector instructions

- example machine has 32 elements per vector register and 8 lanes



MIT 6.823 Spring 2021

L20-17

## Vector Chaining

Problem: Long latency for RAW register dependencies



- Vector version of register bypassing
  - introduced with Cray-1

May 4, 2021

MIT 6.823 Spring 2021

L20-18 May 4, 2021

MIT 6.823 Spring 2021

L20-19

## Vector Chaining Advantage

- Without chaining, must wait for last element of result to be written before starting dependent instruction



- With chaining, can start dependent instruction as soon as first result appears



L20-20

# Vector Memory System

Cray-1: 16 banks, 4 cycle bank busy time, 12 cycle latency

- **Bank busy time:** Cycles between accesses to same bank
- Allows 16 parallel accesses (if data in different banks)



May 4, 2021

MIT 6.823 Spring 2021

L20-20 May 4, 2021

MIT 6.823 Spring 2021

L20-21

# Vector Stripmining

Problem: Vector registers have finite length

Solution: Break loops into pieces that fit in registers, "Strip mining"



# Vector Conditional Execution

Problem: Want to vectorize loops with conditional code:

```

for (i = 0; i < N; i++)
    if (A[i] > 0) then
        A[i] = B[i];
    
```

Solution: Add vector *mask* (or *flag*) registers

- vector version of predicate registers, 1 bit per element
- ...and *maskable* vector instructions
- vector operation becomes NOP at elements where mask bit is clear

Code example:

```

CVM          # Turn on all elements
LV vA, rA    # Load entire A vector
SGTFS.D vA, F0 # Set bits in mask register where A>0
LV vA, rB    # Load B vector into A under mask
SV vA, rA    # Store A back to memory under mask
    
```

May 4, 2021

MIT 6.823 Spring 2021

L20-22 May 4, 2021

MIT 6.823 Spring 2021

L20-23

# Masked Vector Instructions

Simple implementation

- execute all N operations, turn off result writeback according to mask



Density-time implementation

- scan mask vector and only execute elements with non-zero masks



$$C[i] = A[i] + B[i]$$

# Vector Scatter/Gather

Want to vectorize loops with indirect accesses:

```

for (i = 0; i < N; i++)
    A[i] = B[i] + C[D[i]];
    
```

Indexed load instruction (Gather)

```

LV vD, rD          # Load indices in D vector
LVI vC, rC, vD     # Load indirect from rC base
LV vB, rB          # Load B vector
ADDV.D vA, vB, vC # Do add
SV vA, rA          # Store result
    
```

*Is this a correct translation?*

May 4, 2021

MIT 6.823 Spring 2021

L20-24 May 4, 2021

MIT 6.823 Spring 2021

L20-25

*Is the following a correct translation?*

```

LV vB, rB          # Load indices in B vector
LVI vA, rA, vB     # Gather initial A values
ADDV vA, vA, 1     # Increment
SVI vA, rA, vB    # Scatter incremented values
    
```

# A Later-Generation Vector Super: NEC SX-6 (2003)

- CMOS Technology
  - 500 MHz CPU, fits on single chip
  - SDRAM main memory (up to 64GB)
- Scalar unit
  - 4-way superscalar
  - with out-of-order and speculative execution
  - 64KB I-cache and 64KB data cache
- Vector unit
  - 8 foreground VRegs + 64 background VRegs (256x64-bit elements/VReg)
  - 1 multiply unit, 1 divide unit, 1 addshift unit, 1 logical unit, 1 mask unit
  - 8 lanes (8 GFLOPS peak, 16 FLOPS/cycle)
  - 1 load & store unit (32x8 byte accesses/cycle)
  - 32 GB/s memory bandwidth per processor
- SMP structure
  - 8 CPUs connected to memory through crossbar
  - 256 GB/s shared memory bandwidth (4096 interleaved banks)



May 4, 2021

MIT 6.823 Spring 2021

L20-26 May 4, 2021

MIT 6.823 Spring 2021

L20-27

## Multimedia/SIMD Extensions

- Short vectors added to existing general-purpose ISAs
- Idea first used on Lincoln Labs TX-2 computer in 1957, with 36b datapath split into 2x18b or 4x9b
- Recent incarnations initially reused existing registers
  - e.g., 64-bit registers split into 2x32bits or 4x16bits or 8x8bits
- Trend towards larger vector support in microprocessors
  - e.g. x86:
    - MMX (64 bits)
    - SSE (128 bits)
    - AVX (256 bits)
    - AVX-512 (512 bits/masks)



Figure 1. MMX technology data types: packed byte (a), packed word (b), packed doubleword (c), and quadword (d).

## Intel SIMD Evolution

### Implementations:

- Intel MMX (1996) – 64bits
  - Eight 8-bit integer ops, or
  - Four 16-bit integer ops
  - Two 32-bit integer ops
- Streaming SIMD Extensions (SSE) (1999) – 128bits
  - Four 32-bit integer ops (and smaller integer types)
  - Four 32-bit integer/fp ops, or
  - Two 64-bit integer/fp ops
- Advanced Vector Extensions (2010) – 256bits
  - Four 64-bit integer/fp ops (and smaller fp types)
- AVX-512 (2017) – 512bits
  - New instructions: scatter/gather, mask registers

May 4, 2021

MIT 6.823 Spring 2021

L20-28 May 4, 2021

MIT 6.823 Spring 2021

L20-29

## Multimedia Extensions vs Vectors

- Limited instruction set
  - No vector length control
  - Usually no masks
  - Up until recently, no strided load/store or scatter/gather
  - Unit-stride loads must be aligned to 64/128-bits
- Limited vector register length
  - requires superscalar dispatch to keep units busy
  - loop unrolling to hide latencies increases register pressure
- Trend towards fuller vector support
  - Better support for misaligned memory accesses
  - Support of double-precision (64-bit floating-point)
  - Support for masked operations

## Knights Landing (KNL) CPU



- 2-wide decode/retire
- 6-wide execute
- 72-entry ROB
- 64B cache ports
- 2 load/1 store ports
- Fast unaligned access
- Fast scatter/gather
- OoO int/fp RS
- In-order mem RS
- 4 thread SMT
- Many shared resources
  - ROB, rename buffer, RS: dynamically partitioned
- Several thread choosers

Source: IEEE Micro, 2016

MIT 6.823 Spring 2021

L20-30 May 4, 2021

## Knights Landing (KNL) Mesh



Source: IEEE Micro, 2016

- Mesh of Rings
  - Rows/columns (half) ring
  - YX routing
  - Message arbitration on:
    - Injection
    - Turns

- Cache Coherent Interconnect
  - MESIF protocol
  - Distributed directory
    - to filter snoops

### Partitioning modes

- All-to-all
- Quadrant
- Sub-NUMA

MIT 6.823 Spring 2021

L20-31

Thank you!

# Graphics Processing Units (GPUs)

Next Lecture: GPUs

Daniel Sanchez

Computer Science & Artificial Intelligence Lab  
M.I.T.

May 4, 2021

MIT 6.823 Spring 2021

L20-32 May 6, 2021

MIT 6.823 Spring 2021

L21-1

## Why Study GPUs?

- Very successful commodity accelerator/co-processor
- GPUs combine two strategies to increase efficiency
  - Massive parallelism
  - Specialization
- Illustrates tension between performance and programmability in accelerators

## Graphics Processors Timeline

- Until mid-90s
  - Most graphics processing in CPU
  - VGA controllers used to accelerate some display functions
- Mid-90s to mid-2000s
  - Fixed-function accelerators for 2D and 3D graphics
    - triangle setup & rasterization, texture mapping & shading
  - Programming:
    - OpenGL and DirectX APIs

May 6, 2021

MIT 6.823 Spring 2021

L21-2 May 6, 2021

MIT 6.823 Spring 2021

L21-3

## Contemporary GPUs



### Modern GPUs

- Some fixed-function hardware (texture, raster ops, ...)
- Plus programmable data-parallel multiprocessors
- Programming:
  - OpenGL/DirectX
  - Plus more general purpose languages (CUDA, OpenCL, ...)

## GPUs in Modern Systems

- Discrete GPUs
  - PCIe-based accelerator
  - Separate GPU memory



- Integrated GPUs
  - CPU and GPU on same die
  - Shared main memory and last-level cache



- Pros/cons?



May 6, 2021

MIT 6.823 Spring 2021

L21-4 May 6, 2021

MIT 6.823 Spring 2021

L21-5

## Single Instruction Multiple Thread



May 6, 2021

MIT 6.823 Spring 2021

L21-6 May 6, 2021

MIT 6.823 Spring 2021

L21-7

## Multithreading + Single Instruction Multiple Thread



## Streaming Multiprocessor Overview



- Each SM supports 10s of warps (e.g., 64 in Kepler) with 32 threads/warp
- Fetch 1 instr/cycle
- Issue 1 ready instr/cycle
  - Simple scoreboarding: all warp elements must be ready
- Instruction broadcast to all lanes
- Multithreading is the main latency-hiding mechanism

What average latency is needed to keep busy?

May 6, 2021

MIT 6.823 Spring 2021

L21-8 May 6, 2021

MIT 6.823 Spring 2021

L21-9

## Context Size vs Number of Contexts

- SMs support a variable number of contexts based on required registers (and shared memory)
  - Few large contexts → Fewer register spills
  - Many small contexts → More latency tolerance
  - Choice left to the compiler
- Example: Kepler supports up to 64 warps
  - Max: 64 warps @ <=32 registers/thread
  - Min: 8 warps @ 256 registers/thread

## Many Memory Types



May 6, 2021

MIT 6.823 Spring 2021

L21-10 May 6, 2021

## Private Per Thread Memory



- Private memory
  - No cross-thread sharing
  - Small, fixed size memory
    - Can be used for constants
  - Multi-bank implementation (can be in global memory)

MIT 6.823 Spring 2021

L21-11

## Shared Scratchpad Memory



- Shared scratchpad memory (threads share data)
  - Small, fixed size memory (16K-64K per SM = 'core')
  - Banked for high bandwidth
  - Fed with address coalescing unit (ACU) + crossbar
    - ACU can buffer/coalesce requests

May 6, 2021

MIT 6.823 Spring 2021

L21-12 May 6, 2021

MIT 6.823 Spring 2021

L21-13

## Shared Global Memory



- Shared global memory
  - Large shared memory
  - Will suffer also from memory divergence

May 6, 2021

MIT 6.823 Spring 2021

L21-14 May 6, 2021

MIT 6.823 Spring 2021

L21-15

## Serialized cache access



- Trade latency for power/flexibility
  - Only access data bank that contains data
  - Facilitate more sophisticated cache organizations
    - e.g., greater associativity

May 6, 2021

MIT 6.823 Spring 2021

L21-16 May 6, 2021

MIT 6.823 Spring 2021

L21-17

## Memory Access Divergence

- All loads are gathers, all stores are scatters
- Address coalescing unit detects sequential and strided patterns, coalesces memory requests, but complex patterns can result in multiple lower bandwidth requests (memory divergence)
- Writing efficient GPU code requires most accesses to not conflict, even though programming model allows arbitrary patterns!

May 6, 2021

MIT 6.823 Spring 2021

L21-18

## Shared Global Memory



- Memory hierarchy with caches
  - Cache to save memory bandwidth
  - Caches also enable compression/decompression of data

May 6, 2021

MIT 6.823 Spring 2021

L21-19

## Handling Branch Divergence

- Similar to vector processors, but masks are handled internally
  - Per-warp stack stores PCs and masks of non-taken paths
- On a conditional branch
  - Push the current mask onto the stack
  - Push the mask and PC for the non-taken path
  - Set the mask for the taken path
- At the end of the taken path
  - Pop mask and PC for the non-taken path and execute
- At the end of the non-taken path
  - Pop the original mask before the branch instruction
- If a mask is all zeros, skip the block

May 6, 2021

MIT 6.823 Spring 2021

L21-20

## Example: Branch Divergence

Assume 4 threads/warp,  
initial mask 1111

```
if (m[i] != 0) {
    if (a[i] > b[i]) {
        y[i] = a[i] - b[i];
    } else {
        y[i] = b[i] - a[i];
    }
} else {
    y[i] = 0;
}
```

- ① Push mask 1111  
Push mask 0011  
Set mask 1100
- ② Push mask 1100  
Push mask 0100  
Set mask 1000
- ③ Pop mask 0100
- ④ Pop mask 1100
- ⑤ Pop mask 0011
- ⑥ Pop mask 1111

*Optimization for branches that all go same way?*

May 6, 2021

MIT 6.823 Spring 2021

L21-18 May 6, 2021

MIT 6.823 Spring 2021

L21-19

## CUDA GPU Thread Model



- Single-program multiple data (SPMD) model
- Each context is a thread
  - Threads have registers
  - Threads have local memory
- Parallel threads packed in blocks
  - Blocks have shared memory
  - Threads synchronize with barrier
  - Blocks run to completion (or abort)
- Grids include independent blocks
  - May execute concurrently
  - Share global memory, but
  - Have limited inter-block synchronization

May 6, 2021

MIT 6.823 Spring 2021

L21-20 May 6, 2021

MIT 6.823 Spring 2021

L21-21

## GPU Kernel Execution



- Data transfers can dominate execution time
- Integrated GPUs with unified address space  
→ no copies, but CPU & GPU contend for memory

May 6, 2021

MIT 6.823 Spring 2021

L21-22 May 6, 2021

MIT 6.823 Spring 2021

L21-23

## Branch divergence and locking

- Consider the following executing in multiple threads in a warp:

```
if (condition[i]) {
    while (locked(map0[i])){}
    lock(locks[map0[i]]);
} else {
    unlock(locks[map1[i]]);
}
```

where *i* is a thread id and *map0[]*, *map1[]* are permutations of thread ids.

*What can go wrong here?*

MIT 6.823 Spring 2021

L21-19

## Code Example: DAXPY

### C Code

```
// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
```

### CUDA Code

```
// Invoke DAXPY with 256 threads per block
__host__
int nblocks = (n+255) / 256;
__DAXPY in CUDA
// DAXPY in CUDA
__device__
void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
```

- CUDA code launches 256 threads per block
- CUDA vs vector terminology:
  - Thread = 1 iteration of scalar loop (1 element in vector loop)
  - Block = Body of vectorized loop (VL=256 in this example)
  - Grid = Vectorizable loop

MIT 6.823 Spring 2021

L21-21

## Hardware Scheduling



- Grids can be launched by CPU or GPU
  - Work from multiple CPU threads and processes
- HW unit schedules grids on SMs
  - Priority-based scheduling
- Multi-level scheduling
  - Limited number of active grids
  - More queued/paused

MIT 6.823 Spring 2021

L21-23

# Synchronization

- Barrier synchronization within a thread block (`__syncthreads()`)
  - Tracking simplified by grouping threads into warps
  - Counter tracks number of warps that have arrived to barrier
- Atomic operations to global memory
  - Read-modify-write operations (add, exchange, compare-and-swap, ...)
  - Performed at the memory controller or at the L2
- Limited inter-block synchronization!
  - Can't wait for other blocks to finish

May 6, 2021

MIT 6.823 Spring 2021

L21-24 May 6, 2021

MIT 6.823 Spring 2021

L21-25

# System-Level Issues

- Instruction semantics
  - Exceptions
- Scheduling
  - Each kernel is non-preemptive (but can be aborted)
  - Resource management and scheduling left to GPU driver, opaque to OS
- Memory management
  - First GPUs had no virtual memory
  - Recent support for basic virtual memory (protection among grids, no paging)
  - Host-to-device copies with separate memories (discrete GPUs)

May 6, 2021

MIT 6.823 Spring 2021

L21-26 May 6, 2021

MIT 6.823 Spring 2021

L21-27

# Pascal Streaming Multiprocessor (SM)



- Execution units
  - 64 FUs (int and FP)
  - 16 load-store FUs
  - 16 special FUs (e.g., sqrt, sin, cos, ...)
- Memory structures
  - 64K 32-bit registers
  - 64KB shared memory
- Contexts
  - 2048 threads
  - 32 blocks

May 6, 2021

MIT 6.823 Spring 2021

L21-28 May 6, 2021

MIT 6.823 Spring 2021

L21-29

# GPU ISA and Compilation

- GPU microarchitecture and instruction set change very frequently
- To achieve compatibility:
  - Compiler produces intermediate pseudo-assembler language (e.g., Nvidia PTX)
  - GPU driver JITs kernel, tailoring it to specific microarchitecture
- In practice, little performance portability
  - Code is often tuned to specific GPU architecture

MIT 6.823 Spring 2021

L21-25

# GPU: Multithreaded Multicore Chip

- Example: Nvidia Pascal GP100 (2016)
  - 60 streaming multiprocessors (SMs)
  - 4MB Shared L2 cache
  - 8 memory controllers
    - 720 GB/s (HBM2)
  - Fixed-function logic for graphics (texture units, raster ops, ...)
  - Scalability → change number of cores and memory channels
  - Scheduling mostly controlled by hardware



MIT 6.823 Spring 2021

L21-27

# Vector vs GPU Terminology

| Type                 | More descriptive name            | Closest old term outside of GPU         | Official CUDA/ NVIDIA GPU term | Book definition                                                                                                                                                    |
|----------------------|----------------------------------|-----------------------------------------|--------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Program abstractions | Vectorizable Loop                | Vectorizable Loop                       | Grid                           | A vectorizable loop, executed on the GPU, made up of one or more Thread Blocks (bodies of vectorizable loops) and a grid of threads.                               |
|                      | Body of Vectorized Loop          | Body of a (Strip-Mined) Vectorized Loop | Thread Block                   | A vectorized loop executed on a multithreaded SIMD Processor, made up of one or more threads of SIMD instructions. They can communicate via Local Memory.          |
|                      | Sequence of SIMD Lane Operations | One iteration of a Scalar Loop          | CUDA Thread                    | A SIMD lane is a thread of SIMD instructions corresponding to one element executed by one SIMD Lane. Result is stored depending on mask and predicate register.    |
| Machine objects      | A Thread of SIMD Instructions    | Thread of Vector Instructions           | Warp                           | A SIMD lane is a thread, but it contains just SIMD instructions that are executed on a multithreaded SIMD Processor. Results stored depending on per-element mask. |
|                      | SIMD Instruction                 | Vector Instruction                      | PTX Instruction                | A SIMD lane is a SIMD instruction executed across SIMD Lanes.                                                                                                      |
| Processing hardware  | Multibladed SIMD Processor       | (Multibladed) Vector Processor          | Streaming Multiprocessor       | A multibladed SIMD Processor executes threads of SIMD instructions, independent of other SIMD Processors.                                                          |
|                      | Thread Block Scheduler           | Scalar Processor                        | Giga Thread Engine             | Assigns multiple Thread Blocks (bodies of vectorizable loop) to multithreaded SIMD Thread execution.                                                               |
|                      | SIMD Thread Scheduler            | Thread scheduler in a Multithreaded CPU | Warp Scheduler                 | Hardware unit that schedules and issues threads of SIMD instructions when they are ready to execute, including a scoreboard to track SIMD Thread execution.        |
|                      | SIMD Lane                        | Vector Lane                             | Thread Processor               | A SIMD Lane executes the operations in a thread of SIMD instructions on a single element. Results stored depending on mask.                                        |
| Memory hardware      | GPU Memory                       | Main Memory                             | Global Memory                  | DRAM memory accessible by all multithreaded SIMD Processors in a GPU.                                                                                              |
|                      | Private Memory                   | Stack or Thread Local Storage (OS)      | Local Memory                   | Portion of DRAM memory private to each SIMD Lane.                                                                                                                  |
|                      | Local Memory                     | Local Memory                            | Shared Memory                  | Fast local SRAM for one multithreaded SIMD Processor, unavailable to other SIMD Processors.                                                                        |
|                      | SIMD Lane Registers              | Vector Lane Registers                   | Thread Processor Registers     | Registers in a single SIMD Lane allocated across a full thread block (body of vectorized loop).                                                                    |

[H&P5, Fig 4.25]

MIT 6.823 Spring 2021

L21-29

*Thank you!*

*Next Lecture:*  
*Transactional Memory*