

## Pipeline Stages of LC2K



## Miscellaneous

- Start at blue and look for the reds above it for DH
- writeData for ADD, NOR; ALUres, LW: Mem[ALU]
- fill X will get decoded to add 0 0 X in a pipeline
- 1 instr away means top's at ID/EX and bottom's at IF/ID
- Modified DNF(MEM×2) for LW (H1: 2, H2: 1 noop)



## Data Hazards

1. Avoid Insert noops by compiler
2. DetectNStall [MEM/WB → ID, H1: 2noop, H2: H1-1]

Add,Nor,LW: IF ID EX MEM WB  
H1: IF ID ID EX MEM WB  
H2: HF HF IF ID EX MEM WB



3. DetectNForward [?/2→EX, 1 noop after LW]

LW: IF ID EX MEM WB  
H1: IF ID ID EX MEM WB



- After LW finish ID (at ID/EX) and H1 finish IF (at IF/ID), hazard is detected at the start of next cycle, so 1 noop is passed ontop of H1 at IF/ID. During the remaining cycle, LW finish EX(flow to EX/MEM) while the noop finish ID(flow to ID/EX). EndofCycle: [IF/ID:H1, ID/EX:noop, EX/MEM: LW]

(All three just proceed to next pipeline register in next cycle)

- At start of next cycle, wdata from LW at MEM/WB gets routed back to input MUX at EX, where stale value at ID/EX is another input (control to select prev. one)

ADD, NOR: (0 noop) [H1: EX/MEM→EX, H2: MEM/WB→EX]

LW: (1 noop) [H1: MEM/WB→EX]

| Data H              | Avoid                                                                   | DNS                               | DNF                                      |
|---------------------|-------------------------------------------------------------------------|-----------------------------------|------------------------------------------|
| Pros                | Less Hardware Implement.                                                | Backward Compatibility            |                                          |
| CPI very close to 1 | , fetching more/MemAccesses, produce smaller executables                |                                   | Best Perf. among 3                       |
| Cons                | Backward Incompatibility, Previous 1 as noops are inserted (S HitRate.) | CPI ↑ whenever hazard is detected | Higher Hardware Cost (New datapath, MUX) |

## Control Hazards(MEM/WB→IF)

1. DetectNStall [MEM/WB→IF, 3 noop whenever BEQ is detected]
  - BEQ at IF/ID and nextinstr. is ready to be fetched. At the start of cycle, hazard is detected, so 1 noop is inserted at IF/ID while BEQ flow to ID/EX, more new noop is inserted at IF/ID
  - TargetAddr is figured out and get routed from EX/MEM back to PC; BEQ flow to MEM/WB; two old noops keeps propagating; 1 more new noop is inserted at IF/ID
  - [IF/ID, ID/EX, EX/MEM]: noop, MEM/WB; BEQ] and PC finally can start fetching the correct TargetAddr
2. SpeculateNSquash [3 noops if incorrect]
  - Keep fetching and passing instrs that are below BEQ in pipe trace until BEQ reached EX/MEM and is actually taken
  - Then at the start of cycle, targetAddr get routed back to PC; BEQ flow to MEM/WB; 3 noops are simultaneously inserted at IF/ID, ID/EX, EX/MEM. (Fetch taken path in next cycle)

## Speculate & Squash Branch Taken

| beq 6 7 1 | IF | ID | EX | MEM | WB  |
|-----------|----|----|----|-----|-----|
| 1         | IF | IF | IF |     |     |
| 2         |    |    | IF | ID  | EX  |
| 3         |    |    |    | EX  | MEM |
| 4         |    |    |    |     | WB  |

  

| beq 6 7 2 | IF | ID | EX | MEM | WB  |
|-----------|----|----|----|-----|-----|
| 1         | IF | IF | IF |     |     |
| 2         |    |    | IF | ID  | EX  |
| 3         |    |    |    | EX  | MEM |
| 4         |    |    |    |     | WB  |

  

| beq 6 7 3 | IF | ID | EX | MEM | WB  |
|-----------|----|----|----|-----|-----|
| 1         | IF | IF | IF |     |     |
| 2         |    |    | IF | ID  | EX  |
| 3         |    |    |    | EX  | MEM |
| 4         |    |    |    |     | WB  |

  

| beq 6 7 4 | IF | ID | EX | MEM | WB  |
|-----------|----|----|----|-----|-----|
| 1         | IF | IF | IF |     |     |
| 2         |    |    | IF | ID  | EX  |
| 3         |    |    |    | EX  | MEM |
| 4         |    |    |    |     | WB  |

## Detect & Stall: Branch Not Taken

| beq 6 7 1 | IF | ID | EX | MEM | WB  |
|-----------|----|----|----|-----|-----|
| 1         | IF | IF | IF | ID  | EX  |
| 2         |    |    | IF | ID  | MEM |
| 3         |    |    |    | MEM | WB  |
| 4         |    |    |    |     |     |

## Branch Prediction (Local/Global)

1-bit Last-Time Predictor: Start with given T/NT, copy the Global/Local

Sequence → I to the rest

2-bit Counter Predictor: Start with given T/NT, check upward to determine next T/NT (ST/WT/WNT/SNT) and write it as the next (↑ ↓)

Summary on Regular/Lecture Pipeline:

| Data Hazards (RAW within 2 instr. window) | DNS: Insert noops so that WB→ID on same col.<br>DNF: 1 noop ONLY when LW-Use case |
|-------------------------------------------|-----------------------------------------------------------------------------------|
| Control Hazards (every branch instr.)     | DNS: 3 noops whenever see BEQ<br>SNS: 0 if correct, else 3 noops                  |

## High Level Performance Calculation

+ #Stages-1 IS NEEDED when instr. is FINITE

If didn't say #instructions, assume it's infinite, so no nd +4

#noop=FinishingStage - StartStage - #instr\_avay

- For DNSdata, DNScon, SNScon the FINISHING STAGE (stage that can actually route back) is I after the Textbook/Q-given stage, DNSdata doesn't nd this bc register file uses internal forwarding.

- But, Finishing Stage ≠ FwdStage, then FinishingStage is the first stage after that which allows data-fwding

- Another way to check #noop

1. Look at the PathAG: X → Y

2. Put left finger on Y, right on H1: Y+1, H2: Y+2

3. Count #noops needed for right finger to reach X

4. Equivalent to #stages in between

Exc time = #Instrs × CPI(cycle) × Clock Period(T/cycle) = #Cycles × Clock Period

Clock Frequency f(cycle) = 10 MHz = 10^7 = 100 ns; 100 MHz = 10^8 = 10ns;

1 kHz = 1GHz = 10^9 = 1ns

#instr. (IPS):  $f_{IPS} = \frac{1}{CPI} = \frac{1}{CPI \times T}$

#Cycle = (#Stage - 1) + #Instr. + #load\_stall + #branch\_stall

#Instr. includes halt, and is diff. for Taken/NT branch

#stalls\_per\_memory\_access(MA) =  $\frac{MA \text{ latency}}{\text{Clock Period}}$

CPI =  $1 + \frac{\#Stage-1 + \#DH\_stall + \#branch\_stall}{\#instr.}$

CNSdata: %ADDNORLW × Σ[%dep. × #noop(2)]

DNSdata: %ADDNORLW × Σ[%dep. × #noop(1)]

DNScon: %branch × #noop(3)

CNScon: %branch × [MissRate × #noop(3)] + %Taken × (1 - MissRate) × #noop

## Cache(SRAM)

### Facts

- MEM of LC2K: 2<sup>18</sup>; MIPS: 2<sup>32</sup>; x86/ARM64: 2<sup>64</sup> DRAM

- SRAM(within processor) & DRAM(outside) are volatile

- DRAM is also a cache as it stores data from hard disk

- Register is physically closest to ALU and has highest access frequency

- FA cache size increase, #tagbits never increase

SRAM fast cheap main mem

1KB  $2^{10} = 1024$   $10^3 = 1,000$

1MB  $2^{20} = 1,000,000$

1GB  $2^{30} = 10^9$

1TB  $2^{40} = 10^{12}$

### Terms

- Temporal Locality(TL): Reaccess the SAME memory

- Spatial Locality(SL): Re-access NEARBY memory

- Track LRU: Set new accessed blk = 0; ++Remaining; Then replace cache line with highest LRU

- LRU needs update even it's a hit

- Cache line has tag(Addr,CAM) & block(Data,SRAM)

- Cache size is the total amount of data it can store

- Byte-addressable Memory size indicates how many possible memory locations(lines) there are

SRAM fast cheap main mem

1KB  $2^{10} = 1024$   $10^3 = 1,000$

1MB  $2^{20} = 1,000,000$

1GB  $2^{30} = 10^9$

1TB  $2^{40} = 10^{12}$

SRAM fast cheap main mem

1KB  $2^{10} = 1024$   $10^3 = 1,000$

1MB  $2^{20} = 1,000,000$

1GB  $2^{30} = 10^9$

1TB  $2^{40} = 10^{12}$

SRAM fast cheap main mem

1KB  $2^{10} = 1024$   $10^3 = 1,000$

1MB  $2^{20} = 1,000,000$

1GB  $2^{30} = 10^9$

1TB  $2^{40} = 10^{12}$

SRAM fast cheap main mem

1KB  $2^{10} = 1024$   $10^3 = 1,000$

1MB  $2^{20} = 1,000,000$

1GB  $2^{30} = 10^9$

1TB  $2^{40} = 10^{12}$

SRAM fast cheap main mem

1KB  $2^{10} = 1024$   $10^3 = 1,000$

1MB  $2^{20} = 1,000,000$

1GB  $2^{30} = 10^9$

1TB  $2^{40} = 10^{12}$

SRAM fast cheap main mem

1KB  $2^{10} = 1024$   $10^3 = 1,000$

1MB  $2^{20} = 1,000,000$

1GB  $2^{30} = 10^9$

1TB  $2^{40} = 10^{12}$

SRAM fast cheap main mem

1KB  $2^{10} = 1024$   $10^3 = 1,000$

1MB  $2^{20} = 1,000,000$

1GB  $2^{30} = 10^9$

1TB  $2^{40} = 10^{12}$

SRAM fast cheap main mem

1KB  $2^{10} = 1024$   $10^3 = 1,000$

1MB  $2^{20} = 1,000,000$

1GB  $2^{30} = 10^9$

1TB  $2^{40} = 10^{12}$

SRAM fast cheap main mem

1KB  $2^{10} = 1024$   $10^3 = 1,000$

1MB  $2^{20} = 1,000,000$

1GB  $2^{30} = 10^9$

1TB  $2^{40} = 10^{12}$

SRAM fast cheap main mem

1KB  $2^{10} = 1024$   $10^3 = 1,000$

1MB  $2^{20} = 1,000,000$

1GB  $2^{30} = 10^9$

1TB  $2^{40} = 10^{12}$

SRAM fast cheap main mem

1KB  $2^{10} = 1024$   $10^3 = 1,000$

1MB  $2^{20} = 1,000,000$

1GB  $2^{30} = 10^9$

1TB  $2^{40} = 10^{12}$

SRAM fast cheap main mem

1KB  $2^{10} = 1024$   $10^3 = 1,000$

1MB  $2^{20} = 1,000,000$

1GB  $2^{30} = 10^9$

1TB  $2^{40} = 10^{12}$

SRAM fast cheap main mem

1KB  $2^{10} = 1024$   $10^3 = 1,000$

1MB  $2^{20} = 1,000,000$

1GB  $2^{30} = 10^9$

1TB  $2^{40} = 10^{12}$

SRAM fast cheap main mem

1KB  $2^{10} = 1024$   $10^3 = 1,000$

1MB  $2^{20} = 1,000,000$

1GB  $2^{30} = 10^9$

1TB  $2^{40} = 10^{12}$

SRAM fast cheap main mem

## Virtual vs Physical Addressed Cache

Virtual: Fast, Complex, Clear when switching process

Physical: Slow, Simple, No need cleared b/w process



| Virtual Memory (ISA) |                       |
|----------------------|-----------------------|
| Virtual Page 0       | PageOfs 0   PageOfs 1 |
| VP 1                 |                       |
| VP 2                 |                       |
| VP 3                 |                       |

  

| Physical Memory (DRAM) |                       |
|------------------------|-----------------------|
| Physical Page 0        | PageOfs 0   PageOfs 1 |
| PP 1                   |                       |

#Virtual Pages = 4

Page Size = 2 Byte

Virtual Memory Size = 4 \* 2 = 8 Bytes

=  $2^X$ , where X is X-bit byte-addressable

```

while (opcode(newState, MEMWB.instr) != HALT) {
    printState(newState);
    newState.pc += state.pc;
    newState.FID, pcPlus1 = state.FID, pcPlus1;
    newState.state = state;
    newState.cycle += 1;

    /* ID stage */
    newState.IDEX.instr = state.instrMem(newState.pc);
    newState.pc = state.pc;
    newState.FID, pcPlus1 = state.FID, pcPlus1;
    // stall
    if (opcode(newState, IDEX.instr) == NOP) {
        newState.IDEX.readOp = state.readOp(newState.IDEX.instr);
        newState.IDEX.readRegB = state.readReg(newState.IDEX.instr);
        if (opcode(newState, IDEX.instr) == ADD || opcode(newState, IDEX.instr) == NOR) {
            newState.IDEX.readRegA = convertNum(newState.IDEX.instr);
        }
    }

    /* dependency level ID is 10 */
    if(opcode(newState, IDEX.instr) == LW) {
        if(opcode(newState, IDEX.instr) == LW) {
            newState.IDEX.readOp = state.readOp(newState.IDEX.instr);
            newState.IDEX.readRegB = state.readReg(newState.IDEX.instr);
            if(newState.IDEX.readOp == field(newState.IDEX.instr)) {
                newState.IDEX.instr = NOPINSTRUCTION;
                newState.pc = state.pc;
                newState.FID = state.FID;
            }
        }
    }

    else if(opcode(newState, IDEX.instr) == NOOP) {
        if(opcode(newState, IDEX.instr) == NOOP) {
            if(opcode(newState, IDEX.instr) == field(newState.IDEX.instr)) {
                newState.IDEX.instr = NOPINSTRUCTION;
                newState.pc = state.pc;
                newState.FID = state.FID;
            }
        }
    }

    int regB = newState.IDEX.readOp;
    int regA = newState.IDEX.readRegB;
}

/* dependency */
if(newState.MEMWB.instr == LW) {
    if(opcode(newState, EXMEM.instr) == NOP) {
        if(newState.MEMWB.readOp == field(newState.MEMWB.instr)) {
            newState.MEMWB.readData = state.dataMem(newState.MEMWB.aluResult);
        }
        if(newState.MEMWB.readOp == field(newState.MEMWB.instr) & field(newState.MEMWB.instr) == field(newState.MEMWB.readOp)) {
            regB = state.WBEND.writeData;
        }
    }
}

/* immediate */
if(newState.MEMWB.instr == LW) {
    if(opcode(newState, EXMEM.instr) == NOP) {
        if(newState.MEMWB.readOp == field(newState.MEMWB.instr)) {
            if(newState.MEMWB.readOp == field(newState.MEMWB.instr)) {
                newState.MEMWB.readData = state.dataMem(newState.MEMWB.aluResult);
            }
            if(newState.MEMWB.readOp == field(newState.MEMWB.instr) & field(newState.MEMWB.instr) == field(newState.MEMWB.readOp)) {
                regB = state.WBEND.writeData;
            }
        }
    }
}

/* 2-way */
if(opcode(newState, EXMEM.instr) == LW) {
    if(opcode(newState, EXMEM.instr) == NOOP) {
        if(newState.MEMWB.readOp == field(newState.MEMWB.instr)) {
            if(newState.MEMWB.readOp == field(newState.MEMWB.instr)) {
                newState.MEMWB.readData = state.dataMem(newState.MEMWB.aluResult);
            }
            if(newState.MEMWB.readOp == field(newState.MEMWB.instr) & field(newState.MEMWB.instr) == field(newState.MEMWB.readOp)) {
                regB = state.WBEND.writeData;
            }
        }
    }
}

/* 2-way */
if(opcode(newState, EXMEM.instr) == LW) {
    if(opcode(newState, EXMEM.instr) == NOOP) {
        if(newState.MEMWB.readOp == field(newState.MEMWB.instr)) {
            if(newState.MEMWB.readOp == field(newState.MEMWB.instr)) {
                newState.MEMWB.readData = state.dataMem(newState.MEMWB.aluResult);
            }
            if(newState.MEMWB.readOp == field(newState.MEMWB.instr) & field(newState.MEMWB.instr) == field(newState.MEMWB.readOp)) {
                regB = state.WBEND.writeData;
            }
        }
    }
}

```

```

/* dependency */
if(newState.MEMWB.instr == ADD || opcode(newState, MEMWB.instr) == NOR) {
    if(opcode(newState, EXMEM.instr) == ADD) {
        if(newState.MEMWB.readOp == field(newState.MEMWB.instr)) {
            newState.MEMWB.readData = state.dataMem(newState.MEMWB.aluResult);
        }
        if(newState.MEMWB.readOp == field(newState.MEMWB.instr) & field(newState.MEMWB.instr) == field(newState.MEMWB.readOp)) {
            regB = state.WBEND.writeData;
        }
    }
}

/* immediate */
if(newState.MEMWB.instr == ADD || opcode(newState, MEMWB.instr) == NOR) {
    if(newState.MEMWB.readOp == field(newState.MEMWB.instr)) {
        if(newState.MEMWB.readOp == field(newState.MEMWB.instr)) {
            newState.MEMWB.readData = state.dataMem(newState.MEMWB.aluResult);
        }
        if(newState.MEMWB.readOp == field(newState.MEMWB.instr) & field(newState.MEMWB.instr) == field(newState.MEMWB.readOp)) {
            regB = state.WBEND.writeData;
        }
    }
}

/* 2-way */
if(opcode(newState, EXMEM.instr) == ADD || opcode(newState, MEMWB.instr) == NOR) {
    if(opcode(newState, EXMEM.instr) == ADD) {
        if(newState.MEMWB.readOp == field(newState.MEMWB.instr)) {
            if(newState.MEMWB.readOp == field(newState.MEMWB.instr)) {
                newState.MEMWB.readData = state.dataMem(newState.MEMWB.aluResult);
            }
            if(newState.MEMWB.readOp == field(newState.MEMWB.instr) & field(newState.MEMWB.instr) == field(newState.MEMWB.readOp)) {
                regB = state.WBEND.writeData;
            }
        }
    }
}

/* 2-way */
if(opcode(newState, EXMEM.instr) == ADD || opcode(newState, MEMWB.instr) == NOR) {
    if(newState.MEMWB.readOp == field(newState.MEMWB.instr)) {
        if(newState.MEMWB.readOp == field(newState.MEMWB.instr)) {
            newState.MEMWB.readData = state.dataMem(newState.MEMWB.aluResult);
        }
        if(newState.MEMWB.readOp == field(newState.MEMWB.instr) & field(newState.MEMWB.instr) == field(newState.MEMWB.readOp)) {
            regB = state.WBEND.writeData;
        }
    }
}

```

overhead of cache:  
+  
bytes  
( $l + k + n + \text{tag}$ ) bits / block \* #blocks

mem of cache  $\rightarrow$  bytes  
associativity  $\rightarrow$  bytes  
 $x$  blocks/ $set$

a. When data is in the data-cache?

Ans: 1 cycle

b. When data must be fetched from memory?

Ans: 31 and 91 cycles

Hi TLB: 1 (dcache/tlb) + 30 (memory)

Miss TLB, hit page table: 1 (dcache/tlb) + 30 (1st level pt) + 30 (2nd level pt) + 30 (memory)

c. When data must be fetched from disk?

Ans: 100,031/100,061 cycles

miss in 1st level page table: 1 (dcache/tlb) + 30 (1st level pt) + 100,000 (disk)

miss in 2nd level page table: 1 (dcache/tlb) + 30 (1st level pt) + 30 (2nd level pt) + 100,000 (disk)

## Leftover from Midterm

- X-bit Two's Compl.  $[-2^{X-1}, 2^{X-1}]$  [1: 0001, 0: 0000, -1: 1111, -2: 1110]
- $A = \sim A + 1 = A$  nor  $A + 1 = 0$  nor  $A + 1 = 1$ ;  $A$  nor  $A = -A - 1$
- Check even/odd: use mask -2 (11...0)
- Byte Addressable: each address refers to one byte in memory
- Word Addressable: each address refers to one word in memory. It can't access individual bytes in a word (32 bit ISA: 4 bytes & 64 bit ISA: 8 bytes)
- The PC increment is instr. length instead of 1 (LC2K is bc it's 1 word; if 64 byte then jump to i+64 block)

| Optimization                                                                                       | Decrease CPI | Decrease # instructions executed | Decrease clock period |
|----------------------------------------------------------------------------------------------------|--------------|----------------------------------|-----------------------|
| Introducing a cache into a previously cache-less system                                            | ○            | ○                                | ○                     |
| Switching from a static "predict-not-taken" branch predictor to a more accurate, dynamic predictor | ○            | ○                                | ○                     |
| Converting a single-cycle machine into a 5-stage pipeline                                          | ○            | ○                                | ○                     |
| Switching from avoidance to detect-and-forward to handle data hazards                              | ○            | ○                                | ○                     |



| Virtually Addressed Cache                                     | Physically Addressed Cache                          |
|---------------------------------------------------------------|-----------------------------------------------------|
| 缓存 (LCache)                                                   | L(Cache) + L(Cache)                                 |
| 内存 (LCache) + L(TLB)                                          | L(TLB) + L(Cache) + L(Memory)                       |
| TLB H 内存 (L(Cache) + L(TLB)) * (levels + 1) * L(Memory)       | L(TLB) * #levels * L(Memory) + L(Cache) + L(Memory) |
| TLB M Disk (L(Cache) + L(TLB) + [1..#lvs] * L(Mem) + L(Disk)) | L(TLB) + [1..#lvs] * L(Mem) + L(Disk)               |

Page Fault Rate = TLB Miss Rate \* Page Table Miss Rate

Page Fault Rate = Cache Miss Rate \* Page Table Miss Rate



虚拟内存推演

Virtual Address VPN P.O. Page Fault PPN P.O. Physical Address

缓存命中

Address Breakdown:

| #sets =        | 1 | 2 | 4 | stride = | 1 block | 2 block | 4 block | Address Breakdown:                                                                                               |
|----------------|---|---|---|----------|---------|---------|---------|------------------------------------------------------------------------------------------------------------------|
| 0              | 0 | 0 | 0 | size =   | —       | —       | —       | • Tag Bits - Bits in our address that we use to refer to the memory block that address corresponds to            |
| 1 block size   | 0 | 0 | 1 | —        | —       | —       | —       | • Set Bits - Bits in our address that we use to determine what set this address belongs to                       |
| 2 block size   | 0 | 0 | 2 | —        | —       | —       | —       | • Block Offset Bits - Bits in our address that we use to index the value of the memory address from a data block |
| 3 block size   | 0 | 0 | 3 | —        | —       | —       | —       | • Data Block - The block of data from memory, indexed using the address's block offset bits                      |
| ...            |   |   |   |          |         |         |         |                                                                                                                  |
| Last block - 3 | 0 | 0 | 0 | —        | —       | —       | —       |                                                                                                                  |
| Last block - 2 | 0 | 0 | 1 | —        | —       | —       | —       |                                                                                                                  |
| Last block - 1 | 0 | 0 | 2 | —        | —       | —       | —       |                                                                                                                  |
| Last block     | 0 | 0 | 3 | —        | —       | —       | —       |                                                                                                                  |

```

if(opcode(newState, EXMEM.instr) != NOP) {
    newState.EXMEM.readRegB = regB;
}

if (opcode(newState, EXMEM.instr) == ADD) {
    newState.EXMEM.aluResult = regA + regB;
}

else if (opcode(newState, EXMEM.instr) == NOR) {
    newState.EXMEM.aluResult = ~regA | regB;
}

else if (opcode(newState, EXMEM.instr) == BEQ) {
    newState.EXMEM.eq = (regA == regB);
    newState.EXMEM.branchTarget = state.IDEX.pcPlus1 + state.IDEX.offset;
}

else if (opcode(newState, EXMEM.instr) == NOOP) {
    newState.EXMEM.aluResult = regA + state.IDEX.offset;
}

```

缓存与编程

— improves spatial locality since struct members are more likely to be in the same block.  
— improves temporal locality since fewer evictions will occur and so data will stay in the cache longer.

$$\text{Stride Miss Rate} = \frac{\#\text{missed block accesses}}{\#\text{total block accesses}} + \min\left(\frac{\text{Cache Block Size}}{\text{stride}}, 1\right)$$

physically



virtually



hardware component

TLB

Data Cache

Main Memory

Disk

Access Time

1 cycle

1 cycle

30 cycles

100,000 cycles

misses

100,000,000 cycles

idle

idle