

- Latency : time for a job.
  - Throughput : Jobs in unit time.



# ECE437: Introduction to Digital Computer Design

## Mithuna Thottethodi Chapter 2

Spring 2016

- Instructions are the "words" of a computer
  - ISA is its "vocabulary"
  - With a few other things, this defines the interface of computers
    - But implementations vary
  - We will study MIPS™ ISA
    - but the next ISA is not too hard after the first
  - We won't write programs - EE362

## Instructions

Updating code for  
this freq is  
hard

EECE433, Spring 2016

(2)



## Computer

- What does it do?
    - Fetch instruction from address in Program Counter (PC)
    - Execute instruction
    - Update PC to point to next instruction



ECE437, Spring 2016

(3)

- # Outline

  - Instructions
    - Basics
    - Registers and ALU ops
    - Memory and load/store
    - Branches and jumps
    - And more ...
  - Read Chapter 2 and skim through Appendix B for MIPS ISA

14

## Basics

- C statement
  - $f = (g + h) - (i + j)$
  - MIPS instructions\*
    - add t0, g, h //  $t0 = g + h$
    - add t1, i, j //  $t1 = i + j$
    - sub f, t0, t1 //  $f = t0 - t1$
  - Opcodes/mnemonic, operands, source/destination

ECE437, Spring 2016

(5)

- Opcode: Specifies the kind of operation (mnemonic)
  - Operands: input and output data (source/destination)
  - Operands t0, t1 are temps
  - One operation, two inputs, one output
  - Multiple instructions for one C statement

ECE437, Spring 2016

(6)

In MIPS, first is destination as convention.

→ compilers became super good  
that it was used everywhere.  
BUT, they use simple instructions.  
Cannot use complex f<sup>n</sup> like FFT.  
⇒ not used.  
They are clock and power/area kills.

## Why not have bigger instructions?

- Why not have "f = (g + h) - (i + j)" as one instruction?
- Church's thesis: A very primitive computer can compute anything that a fancy computer can compute
  - need only logical functions, read and write memory and data-dependent decisions
- Therefore, ISA selected for practical reasons: performance and cost, not computability
- Regularity tends to improve both
  - E.g., H/W to handle arbitrary number of operands
  - complex, slow and rarely usable and NOT NECESSARY

ECE437, Spring 2016

(7)

## Registers and ALU ops

- Ok, I lied!
- Operands must be registers, not variables
  - add \$8, \$17, \$18
  - add \$9, \$19, \$20
  - sub \$16, \$8, \$9
- MIPS has 32 registers \$0-\$31 (figure next slide)
- Registers \$8, \$9 are temps, \$16 - f, \$17 - g, \$18 - h, \$19 - i, \$20 - j
- MIPS also allows one constant called "immediate"
  - later we will see immediate is 16 bits [-32768, 32767]

ECE437, Spring 2016

(8)

## Registers and ALU



ECE437, Spring 2016

(9)

## ALU ops

- Some ALU ops: *i*   *u*
  - add, addi, addu, addiu (immediate, unsigned)
  - sub ...
  - mul, div
  - and, andi
  - or, ori
  - sll, srl, ...
- Why registers? fit in instructions, smaller is faster
- Are registers enough?

ECE437, Spring 2016

(10)

## Memory and load/store

- But need more than 32 words of storage
- An array of locations M[addr], indexed by addr (figure next slide)
- Data movement (on words or integers)
  - load word for reg ←-- memory
  - lw \$17, 1002 // get input g
  - store word for reg →-- memory
  - sw \$16, 1001 // save output f
    - Note: destination LAST for stores!

load/store  
memory  
to/from  
registers

ECE437, Spring 2016

(11)

## Memory and load/store



ECE437, Spring 2016

(12)

## Byte vs. Word addresses

- I lied again!
  - we need to address bytes for character strings
  - words take 4 bytes
  - therefore, word addresses must be multiples of 4
- figure next slide

ECE437, Spring 2016

(13)

## Memory and load/store



ECE437, Spring 2016

(14)

## Memory and load/store

- Important for arrays
  - $A[i] = A[i] + h$  (figure next slide) *imp*
  - $\# \$8 \rightarrow \text{temp}, \$18 \rightarrow h, \$21 \rightarrow (i \times 4)$
  - Astart is 8000, A is an array of words NOT bytes
- $lw \$8, Astart(\$21)$  // or  $8000(\$21)$   
 $add \$8, \$18, \$8$   
 $sw \$8, Astart(\$21)$  *addr = Val<sub>\$21</sub> + 0x8000*
- MIPS has other load/store for bytes and halfwords

ECE437, Spring 2016

(15)

## Memory and load/store



ECE437, Spring 2016

(16)

## Endian

- Word storage
- Word = bytes (0x400,0x401,0x402,0x403)?
- Word = bytes (0x403,0x402,0x401,0x400)?
  - It depends...
- Big endian: MSB at address xxxxxxx0
  - e.g., IBM, SPARC
- Little endian: MSB at address xxxxxxx11
  - e.g., Intel x86
- Mode selectable
  - e.g., PowerPC, MIPS

ECE437, Spring 2016

(17)

In case they must talk,  
a network endian is set.  
One sends raw & other converts.

## Branches and Jumps

- ```
While ( i != j ) {
    j = j + i;
    i = i + 1;
}
• HLL-> Assembly,  $\$8$  is  $i$ ,  $\$9$  is  $j$ 
Loop: beq $8, $9, Exit
// note that the condition is reversed
add $9, $9, $8
addi $8, $8, 1
j Loop
Exit:
```

ECE437, Spring 2016

(18)

## Branches and Jumps

- Better yet:

```
beq $8, $9, Exit      // not !=  
Loop: add $9, $9, $8  
        addi $8, $8, 1  
        bne $8, $9, Loop
```

Exit:

- Let compilers worry about such optimizations

ECE437, Spring 2016

(19)

## Branches and Jumps

- What does bne do really?

- read \$8; read \$9, compare  
- set PC = PC + 4 or PC = Target

- To do compares other than "equal" or "not equal"

- e.g., blt \$8, \$9, Target // assembler pseudo-instruction

- expands to

{ slt \$1, \$8, \$9 // \$1 = (\$8 < \$9) = (\$8 - \$9 < 0)  
 bne \$1, \$0, Target // \$0 is always 0

ECE437, Spring 2016

(20)

assembler reserves this.

*not implemented in hardware.*

*converts to this*

*assembler*

*you could instead not use assembler and do what you wish.*

*That means no shortcut like blt.*

## Branches and Jumps

- Other MIPS branches/jumps

- beq \$8, \$9, imm  
- // if (\$8 == \$9) PC = PC + imm<<2 else PC += 4

- bne ...

- slt, sle, sgt, sge

- with immediate, unsigned

- j addr // PC = addr

- jr \$12 // PC = \$12

- jal addr // \$31 = PC+4; PC = addr; used for procedure calls

*done by hardware itself. Not software convention like blt.*

ECE437, Spring 2016

(21)

*jr \$31 can return from a jump.*

## Assembly Exercise

- HLL → assembly

if (a==b)

  a = a+b

else

  b = a+b

ECE437, Spring 2016

(23)

*why branch better than jump?*

## Assembly Exercise

|                    |                  |
|--------------------|------------------|
| Init:              | mov \$8,0        |
| Test:              | slt \$5,\$8,\$9  |
| for(i=0;i<n;i++) { | beq \$5,\$0,Exit |
| blah blah          | blahblah         |
| }                  | blahblah         |
| Loop:              | addi \$8,\$8,1   |
| Ind:               | b Test           |
| Exit:              |                  |

ECE437, Spring 2016

(22)

→ stack : software construct.

*say fn calls a fn.*

*\$31 gets overwritten.*

*This is bad.*

*hence store value in memory* [the stack]

Can we \$4 & \$5 to pass param to f.  
\$2 is return addr.

(Always good to make space before writing. ↴)

## Procedure Calls

- See section 2.7 for more details
  - save registers
  - set up parameters
  - call procedure
  - get results
  - restore registers
- jal is the only special instruction: the rest is software convention
- jal addr // \$31 = PC+4; PC = addr; used for procedure calls

ECE437, Spring 2016 (25)

## Procedure Example

- HLL code for swap
 

```
{ temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
    return (temp); }
```
- Corresponding assembly code
 

```
swap: addi $29, $29, -8 // save registers
          sw $15, 4($29)
          sw $16, 8($29)
```

ECE437, Spring 2016 (27)

## Procedure Call Example

- Now, calling the procedure
    - HLL code
- ```
return_value = swap(arr_A, i); /* call swap */

        mov $4, $18 // first parameter
        mov $5, $17 // second parameter
        jal swap
        mov $20, $2 // copy return value
```

ECE437, Spring 2016 (28)

## Procedure Calls

- An important data structure is the stack
- Stack grows from larger to smaller addresses
- \$29 is the stack pointer, it points just beyond valid data. Why?
- Push \$2:
  - ↑ word in memory
  - lw \$2, 4(\$29)
  - addi \$29, \$29, 4
- The order cannot be changed. Why? interrupts

ECE437, Spring 2016 (26)

if we switch to have sw first.  
 → seg sw finished. To add yet.  
 → interrupt comes now.  
 → will overwrite memory ⇒ lose data  
 hence this order

## Procedure Example

```
$4 is v
$5 is k
for memory word size
multi* $2, $5($2) // procedure body: $2 == k*4
add $2, $4, $2 // $2 == v + k*4 == address of v[k]
lw $15, 0($2) // $15 == v[k]
lw $16, 4($2) // $16 == v[k+1]
sw $16, 0($2) // v[k] == $16
sw $15, 4($2) // v[k+1] == $15
mov* $2, $15 // return value
lw $15, 4($29) // restore registers
lw $16, 8($29)
addi $29, $29, 8
jr $31 // return
```

ECE437, Spring 2016 (28)

## Layers of Software

- Notation - program: input data → output data
  - executable: input data → output data
  - loader: executable file → executable in memory
  - linker: object files → executable file
  - assembler: assembly file → object file
  - compiler: HLL file → assembly file
  - editor: editor commands → HLL file
- Only possible because programs can be manipulated as data
  - "stored program computer", "Von Neumann architecture".

ECE437, Spring 2016 (30)

↳ allows to create exec by user.

## MIPS Machine Language

- All 32-bit instructions
- Assembly: add \$1, \$2, \$3
  - 14 ASCII (or Unicode) codes
- Machine code: 1 word

|                                                                                                                                                                       |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 000000 00010 00001 00000 010000                                                                                                                                       |
| alu-rr 2 3 1 zero add/signed<br>(6) (5) (5) (5) (5) (6)<br>alu operation register to register<br>ECE437, Spring 2016 dest 2 8 bits<br>shift amt. here is a signed add |

## Instruction Format

- R-format:
 

|        |    |    |    |       |       |
|--------|----|----|----|-------|-------|
| opcode | rs | rt | rd | shamt | funct |
| 6      | 5  | 5  | 5  | 5     | 6     |
- Digression:
  - How do you store the number 4,392,976?
  - Same as add \$1, \$2, \$3
- Stored program: instructions are represented as numbers
  - programs can be read/written in memory like numbers

ECE437, Spring 2016 (32)

## Instruction Format

- Other R-format: addu, sub\*, and, or, etc
- Assembly: lw \$1, 100(\$2)  $\rightarrow$  limits to  $2^{16}$
- Machine:

100011 00010 00001 0000000001100100  
lw 2 1 100 (in binary)

### I-format:

|        |    |    |                   |
|--------|----|----|-------------------|
| opcode | rs | rt | address/immediate |
| 6      | 5  | 5  | 16                |

Source dest.

ECE437, Spring 2016

(33)

## Instruction Format

- I-format also used for ALU ops with immediates
  - addi \$1, \$2, 100
  - 001000 00010 00001 0000000001100100
- What about constants larger than 16 bits = [-32768, 32767]?
  - e.g., 0000 0000 0000 1100 0000 0000 0000 1111? lui \$4, 12 // \$4  $\leftarrow$  0x000C0000
  - ori \$4,\$4,15 // \$4  $\leftarrow$  0x000C000F
- All loads and stores use I-format

ECE437, Spring 2016

(34)

is done for first 16, then does ori

## Instruction Format

- beq \$1, \$2, 7  $\rightarrow$  no words left  $\rightarrow$  offset
  - 000100 00001 00010 0000 0000 0000 0111
  - PC = PC + (0000 0111 << 2) // word offset
  - Finally, J-format:
- |        |      |
|--------|------|
| opcode | addr |
| 6      | 26   |
- addr is weird in MIPS:
    - 4 MSB of PC:addr 00 (4+26+2 = 32)  $\rightarrow$  concat
    - o word size is 4.

ECE437, Spring 2016

(35)

## MIPS: "branch" vs "jump"

### Branch: 16 bit displacement

- Current word +/-  $2^{15}$
- Range is symmetric
  - $\text{Range}_{\text{forward}} = \text{Range}_{\text{backward}} = 2^{15}$

### Jump: 26 lower-bit concatenation in word address

- Jump anywhere within a  $2^{26}$  segment
- Range may be asymmetric
  - $\text{Range}_{\text{forward}} + \text{Range}_{\text{backward}} = 2^{26}$

ECE437, Spring 2016

(36)

branch allows jump in  $2^{15} \pm$  above/below current location.



Jump allows movement anywhere provided the top 4 bits are fixed by PC.

If PC = 0xA::::::::

↳ then can jump in a region only

2

## Summary: Instruction Formats

- R-format: opcode rs rt rd shamt funct

|   |   |   |   |   |   |
|---|---|---|---|---|---|
| 6 | 5 | 5 | 5 | 5 | 6 |
|---|---|---|---|---|---|

- I-format: opcode rs rt address/immediate

|   |   |   |    |
|---|---|---|----|
| 6 | 5 | 5 | 16 |
|---|---|---|----|

- J-format: opcode addr

|   |    |
|---|----|
| 6 | 26 |
|---|----|

- Instr decode

- Theory: Inst bits → identify instrs → control signals
- Practice: Instruction bits → control signals

ECE437, Spring 2016

(37)

## Memory

- While one could do both I fetch and D-access done one after the other in one (slow) CPU clock, our specific FPGA memory needs two clocks -one each for I-fetch and D-access
  - So our lab design is "mostly single-cycle CPU": one cycle for all instructions except ld and st, and two cycles for ld/st
  - Still you need arbitration so memory does either I-fetch or D-access at a time

ECE437, Spring 2016

(39)

## Memory arbitration

- How is this timing realized? Via request-ready handshake + arbitration
- The CPU has a hardware block called "request block" where the CPU generates I-request and D-request signals - non ld/st instructions generate only I-request and ld/st generate both I-request (for I-fetch) and D-request (for ld/st of data)
- The requests go to the "Arbiter block"

ECE437, Spring 2016

(41)

## Memory (Hardware detour)

- Used for instruction fetch and data access ld/st
- Memory is big and slow
  - takes 300 CPU clocks today! (ch5)
- (1) Memory can allow only one of instruction fetch or ld/st data access at any given time → need to arbitrate between I-fetch and D-access
  - We will alleviate this in ch5 but will still need arbitration

ECE437, Spring 2016

(38)

## Lab memory timing

- No ld/st among instructions



- With a ld or st



ECE437, Spring 2016

(40)

## Memory arbitration

- The arbiter block chooses one of I-request or D-request if both are active (ld/st) OR I-request if D-request is not active (non-ld/st)
  - If both active,
    - the D-request is for current instruction and I-request is for the next instruction
    - D-request gets priority to complete the current instruction before going to next
- The chosen request accesses memory

ECE437, Spring 2016

(42)

## CPU-Memory Interface



- Request is in CPU and arbiter is in memory system
- Do not break this interface by going around it - else you will suffer later

ECE437, Spring 2016

(44)

## Memory arbitration

- When access complete (I or D), the arbiter/controller asserts the corresponding ready (I-ready or D-ready)
  - This is how CPU knows access is complete - this is important in 3 slides
- **VERY IMPORTANT:** once a D-access is complete, de-assert the D-request
  - Else arbiter will keep choosing that request
  - Changes behavior for pipeline.

ECE437, Spring 2016

(44)

## Memory arbitration - Detail 1

- For ld/st, the instruction is latched within memory system WHILE data access occurs (else instruction would vanish while data access occurs!)
  - During the middle clock of previous timing diagram
  - This latching is done for you in the code given to you

ECE437, Spring 2016

(45)

## Memory arbitration- Detail 2

- Lab has two clocks - CPU clock and RAM clock
  - CPU clock is the main clock
  - RAM clock an internal detail to be ignored
  - RAM clock happens to be 2x CPU clock
    - Does not match reality
    - Done to make our specific FPGA work for mostly single-cycle CPU
    - Fixed in the next slide

ECE437, Spring 2016

(46)

## Memory arbitration

- In lab, memory is variable latency - so your design should work at different latencies (important in real world)
  - Mostly single-cycle is at the lowest latency
  - Longer latencies → multiple clocks per I-fetch or D-access
  - You MAY NOT assume memory will complete within one cycle → you MUST wait for I-ready or D-ready
  - I-ready, D-ready in one cycle at lowest latency & in more cycles at longer latencies

ECE437, Spring 2016

(47)

## Why?

- Variable latency checks for timing independence of design - important skill
- Single memory, so we don't change from two memories (in single cycle) to one memory (in pipelining) - easier for you
- **GENERAL-** you should test and debug every block BOTH separately AND after merging with rest of design BEFORE going to the next block else you will have a mess that can't be debugged

ECE437, Spring 2016

(48)

## Addressing modes

- There are many ways of accessing data
- 1. Register addressing  
- add \$1, \$2, \$3



ECE437, Spring 2016

(49)

## Addressing Modes

- 2. Base addressing (aka displacement)  
- lw \$1, 100(\$2) // \$2 = 400, M[500] = 42, \$1 <= 42



ECE437, Spring 2016

(50)

## Addressing Modes

- 3. Immediate addressing  
- addi \$1, \$2, 100



ECE437, Spring 2016

(51)

## Addressing Modes

- 4. PC relative addressing  
- beq \$1, \$2, 100 // if (\$1 == \$2) PC = PC + 100



ECE437, Spring 2016

(52)

## Addressing Modes

- Pseudodirect addressing  
- j Loop // instruction contains address\*



ECE437, Spring 2016

(53)

## Addressing Modes

- Not found in MIPS
- 5. Indexed: add two registers - base + index
- 6. Indirect: M[M[addr]] - two memory references
- 7. Autoincrement/decrement: add data size
- 8. Autoupdate - found in IBM PowerPC, HP PA-RISC  
- like displacement but update register

ECE437, Spring 2016

(54)

## Addressing Modes

- Autoupdate

```
lwupdate $1, 4[$2] // $1 == M[4+$2]; $2 == $2 + 4
```



ECE437, Spring 2016

(55)

## Addressing Modes

```
for (i = 0, I < N, i += 1)  
    sum += A[i];
```

- \$7 - sum, \$8 - address of a[i], \$2 - tmp

inner:

```
lw $2, 0($8)  
addi $8, $8, 4  
add $7, $7, $2
```

new inner:

```
lwupdate $2, 4($8)
```

```
add $7, $7, $2
```

- any problems with new inner ?

ECE437, Spring 2016

(56)

## Choosing ISA:

- very application specific.
- might have system with less memory. Might want to have more & powerful instructions.

# Processor Implementation

## ECE437: Introduction to Digital Computer Design

Mithuna Thottethodi

Chapter 4

Spring 2016



ECE437, Spring 2016

(2)

## Outline

- Datapath - single cycle
  - single instruction, 2's complement, unsigned
- Control

instruction.  
data flows here.

ECE437, Spring 2016

(3)

## Datapath for Instructions

- Single cycle datapath
- Compose using well-understood pieces
  - Mux, flip-flops and gates
  - ALU
  - Register file

ECE437, Spring 2016

(4)

## Comb. Logic Elements

- Adder



- ALU



- Mux



ECE437, Spring 2016

(5)

## Storage Elements

- Register
  - for PC



- Register file
  - 32 registers
  - 2 read ports/buses
  - 1 write port/bus



- Memory
  - 1 input bus
  - 1 output bus
  - Not bidirectional



ECE437, Spring 2016

(6)

## Computer as State Machine



- Storage elements
  - Memory, Register file, PC
- Combinational elements
  - ALUs, Adders, Muxes

ECE437, Spring 2016

(7)

## Processor Implementation

- This Lecture: Datapath



ECE437, Spring 2016

(8)

## Datapath - 1 CPI

- Assumption: Get one whole instruction done in one long cycle
  - fetch, decode/read operands, execute, memory, writeback
  - useful way to represent steps and identify required datapath elements: RTL
- For single instruction
- Put it together

ECE437, Spring 2016

(9)



ECE437, Spring 2016

(10)

## Outline

- Datapath - single cycle
  - single instruction, 2's complement, unsigned
- Control

ECE437, Spring 2016

(11)

## Register Transfer Language

- RTL gives the meaning of the instructions
- All start by fetching the instruction

$op | rs | rt | rd | shamt | funct = MEM[PC]$

$op | rs | rt | Imm16 = MEM[PC]$

### inst Register Transfers

|       |                                                                                       |                        |
|-------|---------------------------------------------------------------------------------------|------------------------|
| ADDU  | $R[rd] \leftarrow R[rs] + R[rt];$                                                     | $PC \leftarrow PC + 4$ |
| SUBU  | $R[rd] \leftarrow R[rs] - R[rt];$                                                     | $PC \leftarrow PC + 4$ |
| ORI   | $R[rt] \leftarrow R[rs] OR zero\_ext(Imm16);$                                         | $PC \leftarrow PC + 4$ |
| LOAD  | $R[rt] \leftarrow MEM[R[rs] sign\_ext(Imm16)];$                                       | $PC \leftarrow PC + 4$ |
| STORE | $MEM[R[rs] + sign\_ext(Imm16)] \leftarrow R[rt];$                                     | $PC \leftarrow PC + 4$ |
| BEQ   | $\text{if } (R[rs] == R[rt]) \text{ then } PC \leftarrow PC + sign\_ext(Imm16)    00$ | $PC \leftarrow PC + 4$ |
|       | $\text{else } PC \leftarrow PC + 4$                                                   |                        |

ECE437, Spring 2016

(12)

pad with 1 or 0 according to Imm16 [15]  
 pad end with 2 zeroes.

## A Simple Implementation

- ADD and SUB**
  - addU rd, rs, rt
  - subU rd, rs, rt
- OR Immediate:**
  - ori rt, rs, imm16
- LOAD and STORE Word**
  - lw rt, rs, imm16
  - sw rt, rs, imm16
- BRANCH:**
  - beq rs, rt, imm16



ECE437, Spring 2016

(13)

## Fetch Instructions

- Fetch instruction, then update PC
- PC updated (at the end of) every cycle
  - What if no branches or jumps?



ECE437, Spring 2016

(14)

## ALU Instructions

- $R[rd] \leftarrow R[rs] \text{ op } R[rt]$  Example: addU rd, rs, rt
  - Ra, Rb, and Rw come from instruction's rs, rt, and rd fields
  - ALUctr and RegWr: control logic after decoding the instruction



ECE437, Spring 2016

(15)

## Logical operation with Immediate

- $R[rt] \leftarrow R[rs] \text{ op ZeroExt[imm16]}$



ECE437, Spring 2016

(16)

## Load Instruction

- $R[rt] \leftarrow \text{Mem}[R[rs] + \text{SignExt}[imm16]]$  Example: lw rt, imm16(rs)



ECE437, Spring 2016

(17)

## Store Instruction

- $\text{Mem}[R[rs] + \text{SignExt}[imm16]] \leftarrow R[rt]$  Example: sw rt, imm16(rs)



ECE437, Spring 2016

(18)

## Conditional Branch Instruction



- beq rs, rt, imm16

- IR = Mem[PC] // Fetch the instruction from memory
- COND  $\leftarrow R[rs] == R[rt]$  // Calculate the branch condition
- if (COND eq 0) // Calculate the next instruction's address
  - PC  $\leftarrow PC + 4 + (\text{SignExt}(imm16) \times 4)$
- else
  - PC  $\leftarrow PC + 4$

What is this?

ECE437, Spring 2016

(19)

## Datapath for 'beq'

- beq rs, rt, imm16
- Datapath generates condition (equal)



ECE437, Spring 2016

(20)



ECE437, Spring 2016

(21)

## ORI

- ORI is not in the book
- ORI shows that some instructions need zero extension instead of sign extension
  - Logical vs. arithmetic ops
- What will change in the datapath if ORI is absent?

ECE437, Spring 2016

(22)

## Exercise

- See worksheet (Fig 4.15)
- Highlight active datapath for
  - Add
  - Beq
  - Sw
  - Lw

ECE437, Spring 2016

(23)

jal  
jr etc  
not done in this  
circuit diag.  
Must modify accordingly.

Add

# Worksheet



Beg

# Worksheet



# Summary

- For a given instruction
  - Describe operation in RTL
  - Use ALUs, Registers, Memory, adders to achieve reqd. functionality
- To add instructions
  - Rinse and repeat
  - Reuse components by using muxes
- Controls : later
  - Selection controls for muxes
  - ALU controls for ALU ops
  - Register address controls
  - Write enables for registers/memory

ECE437, Spring 2016

(25)



ECE437, Spring 2016

(26)

# Exercise

- Add jump instruction to single cycle datapath
  - j Addr
  - RTL
    - $PC \leftarrow (PC+4)[31:28] // Addr // 00$



- See worksheet #1

ECE437, Spring 2016

(27)

# Control for Datapath



ECE437, Spring 2016

(28)

# Textbook version



ECE437, Spring 2016

(29)

# Controls for Add Operation

- $R[rd] = R[rs] + R[rt]$



ECE437, Spring 2016

(30)

## Meaning of Control Signals

- rs, rt, rd and imm16 hardwired in datapath
- PCSrc:  $0 \Rightarrow PC \leftarrow PC + 4; 1 \Rightarrow PC \leftarrow PC + 4 + \text{SignExt}(Imm16) \mid\mid 00$



ECE437, Spring 2016

(31)

## Meaning of Control Signals

- ExtOp: "zero", "sign"
- ALUSrc:  $0 \Rightarrow \text{regB}; 1 \Rightarrow \text{immmed}$
- ALUctr: "add", "sub", "or"
- MemWr: write memory
- MemtoReg:  $1 \Rightarrow \text{Mem}$
- RegDst:  $0 \Rightarrow \text{"rt"}; 1 \Rightarrow \text{"rd"}$
- RegWr: write dest register



ECE437, Spring 2016

(32)

## ORI Controls: Worksheet

- $R[rt] \leftarrow R[rs] \text{ or } \text{ZeroExt}[Imm16]$



ECE437, Spring 2016

(33)

## LW Controls

- $R[rt] \leftarrow \text{Data Memory} \{R[rs] + \text{SignExt}[imm16]\}$



ECE437, Spring 2016

(35)

## SW Controls: Worksheet

- $R[rt] \rightarrow \text{Data Memory} \{R[rs] + \text{SignExt}[imm16]\}$



## BEQ Controls

- if  $(R[rs] - R[rt] == 0)$  then Zero  $\leftarrow 1$ ; else Zero  $\leftarrow 0$



ECE437, Spring 2016

(36)

## Control Logic

- Logic must generate appropriate signals for all instructions
- Summary slide (previous)
  - A way of representing the truth table
- First:
  - Equations in terms of opcodes
- Next
  - Equations in terms of instruction bits

ECE437, Spring 2016

(40)

## Controls: Logic equations

- PCSrc  $\Leftarrow$  if (OP == BEQ) then ZERO else 0
- ALUSrc  $\Leftarrow$  if (OP == "R-type") then "regB"  
elseif (OP == BEQ) then regB, else "imm"
- ALUctr  $\Leftarrow$  if (OP == "R-type") then funct  
elseif (OP == ORi) then "OR"  
elseif (OP == BEQ) then "sub"  
else "add"
- ExtOp  $\Leftarrow$  \_\_\_\_\_
- MemWr  $\Leftarrow$  op == store
- MemtoReg  $\Leftarrow$  \_\_\_\_\_
- RegWr:  $\Leftarrow$  \_\_\_\_\_
- RegDst:  $\Leftarrow$  \_\_\_\_\_

ECE437, Spring 2016

(41)

## Controls: Logic equations

- PCSrc  $\Leftarrow$  if (OP == BEQ) then EQUAL else 0
- ALUSrc  $\Leftarrow$  if (OP == "R-type") then "regB"  
elseif (OP == BEQ) then regB, else "imm"
- ALUctr  $\Leftarrow$  if (OP == "R-type") then funct  
elseif (OP == ORi) then "OR"  
elseif (OP == BEQ) then "sub"  
else "add"
- ExtOp  $\Leftarrow$  if (OP == ORi) then "zero" else "sign"
- MemWr  $\Leftarrow$  (OP == Store)
- MemtoReg  $\Leftarrow$  (OP == Load)
- RegWr:  $\Leftarrow$  if ((OP == Store) || (OP == BEQ)) then 0 else 1
- RegDst:  $\Leftarrow$  if ((OP == Load) || (OP == ORi)) then 0 else 1

ECE437, Spring 2016

(42)

## Truth Table summary

| See<br>Appendix A | func<br>op | We Don't Care :-) |         |         |         |          |         |
|-------------------|------------|-------------------|---------|---------|---------|----------|---------|
|                   |            | 10 0000           | 10 0010 | 00 1101 | 10 0011 | 10 1011  | 00 0100 |
|                   | add        | sub               | ori     | lw      | sw      | beq      | jump    |
| RegDst            | 1          | 1                 | 0       | 0       | x       | x        | x       |
| ALUSrc            | 0          | 0                 | 1       | 1       | 1       | 0        | x       |
| MemtoReg          | 0          | 0                 | 0       | 1       | x       | x        | x       |
| RegWrite          | 1          | 1                 | 1       | 1       | 0       | 0        | 0       |
| MemWrite          | 0          | 0                 | 0       | 0       | 1       | 0        | 0       |
| PCSrc             | 0          | 0                 | 0       | 0       | 0       | 1        | 0       |
| Jump              | 0          | 0                 | 0       | 0       | 0       | 0        | 1       |
| ExtOp             | x          | x                 | 0       | 1       | 1       | x        | x       |
| ALUctr<2:0>       | Add        | Subtract          | Or      | Add     | Add     | Subtract | xxx     |

| 31     | 26 | 21 | 16 | 11             | 6     | 0         |                  |
|--------|----|----|----|----------------|-------|-----------|------------------|
| R-type | op | rs | rt | rd             | shamt | funct     | add, sub         |
| I-type | op | rs | rt |                |       | immediate | ori, lw, sw, beq |
| J-type | op |    |    | target address |       |           | Jump             |

ECE437, Spring 2016

(43)

## Local vs Global Control

- One more layer of abstraction
  - ALUctr  $\Leftarrow$  if (OP == "R-type") then funct  
elseif (OP == ORi) then "OR"  
elseif (OP == BEQ) then "sub"  
else "add"



ECE437, Spring 2016

(44)

## Global Control: Truth Table

| op         | 00 0000  | 00 1101 | 10 0011 | 10 1011 | 00 0100  | 00 0010 |
|------------|----------|---------|---------|---------|----------|---------|
| R-type     | ori      | lw      | sw      | beq     | jump     |         |
| ALUSrc     | 1        | 0       | 0       | x       | x        | x       |
| MemtoReg   | 0        | 1       | 1       | 1       | 0        | x       |
| RegWrite   | 1        | 0       | 1       | x       | x        | x       |
| MemWrite   | 0        | 0       | 0       | 1       | 0        | 0       |
| Branch     | 0        | 0       | 0       | 0       | 1        | 0       |
| Jump       | 0        | 0       | 0       | 0       | 0        | 1       |
| ExtOp      | x        | 0       | 1       | 1       | x        | x       |
| ALUop<N:0> | "R-type" | Or      | Add     | Add     | Subtract | xxx     |



ECE437, Spring 2016

(45)

## Encoding



- In this exercise, ALUOp has to be 2 bits wide to represent:
  - (1) "R-type" instructions
  - "I-type" instructions that require the ALU to perform:
    - (2) Or, (3) Add, and (4) Subtract
- To implement the full MIPS ISA, ALUOp has to be 3 bits to represent:
  - (1) "R-type" instructions
  - "I-type" instructions that require the ALU to perform:
    - (2) Or, (3) Add, (4) Subtract, and (5) And (Example: andi)

|                  | R-type   | ori  | lw   | sw   | beq      | jump |
|------------------|----------|------|------|------|----------|------|
| ALUOp (Symbolic) | "R-type" | Or   | Add  | Add  | Subtract | XXX  |
| ALUOp<2:0>       | 1 00     | 0 10 | 0 00 | 0 00 | 0 01     | XXX  |

ECE437, Spring 2016

(46)

## Global Control: Truth Table

| op         | 00 0000  | 00 1101 | 10 0011 | 10 1011 | 00 0100  | 00 0010 |
|------------|----------|---------|---------|---------|----------|---------|
| R-type     | ori      | lw      | sw      | beq     | jump     |         |
| RegDst     | 1        | 0       | 0       | x       | x        | x       |
| ALUSrc     | 0        | 1       | 1       | 1       | 0        | x       |
| MemtoReg   | 0        | 0       | 1       | x       | x        | x       |
| RegWrite   | 1        | 1       | 1       | 0       | 0        | 0       |
| MemWrite   | 0        | 0       | 0       | 1       | 0        | 0       |
| Branch     | 0        | 0       | 0       | 0       | 1        | 0       |
| Jump       | 0        | 0       | 0       | 0       | 0        | 1       |
| ExtOp      | x        | 0       | 1       | 1       | x        | x       |
| ALUOp<2:0> | "R-type" | Or      | Add     | Add     | Subtract | XXX     |
|            | (100)    | (010)   | (000)   | (000)   | (001)    |         |



ECE437, Spring 2016

(47)

## Truth Table for RegWrite

| op       | 00 0000 | 00 1101 | 10 0011 | 10 1011 | 00 0100 | 00 0010 |
|----------|---------|---------|---------|---------|---------|---------|
| RegWrite | 1       | 1       | 1       | 0       | 0       | 0       |

- RegWrite = R-type + ori + lw
  - =  $\text{lop}^5 \& \text{lop}^4 \& \text{lop}^3 \& \text{lop}^2 \& \text{lop}^1 \& \text{lop}^0$  (R-type)
  - +  $\text{lop}^5 \& \text{lop}^4 \& \text{op}^3 \& \text{op}^2 \& \text{lop}^1 \& \text{op}^0$  (ori)
  - +  $\text{op}^5 \& \text{lop}^4 \& \text{lop}^3 \& \text{lop}^2 \& \text{op}^1 \& \text{op}^0$  (lw)



ECE437, Spring 2016

(48)

## PLA implementation



ECE437, Spring 2016

(49)

## PLA Representation



ECE437, Spring 2016

(50)

## Putting it all together



ECE437 Spring 2016

ExtOp

We have something  
150 to 200 x slower



## ECE437: Introduction to Digital Computer Design

Mithuna Thottethodi

Back to Chapter 4

(Done with "Make it correct".

Now: "Make it fast")

Spring 2016

## Outline

### Pipelining

- What? Basic concepts
  - Overlapping execution
  - Latency vs. throughput

### - Why? Performance implications

- Speedup
- CPI, cycletime

### - How? Implementation challenges

ECE437, Spring 2016

(2)

## Pipelining

- Laundry Example
- Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
- Washer takes 30 minutes
- Dryer takes 30 minutes
- "Folder" takes 30 minutes
- "Stasher" takes 30 minutes to put clothes into drawers



ECE437, Spring 2016

(3)

## Sequential Laundry



- Sequential laundry takes 8 hours for 4 loads
- If they learned pipelining, how long would laundry take?

ECE437, Spring 2016

(4)

## Pipelined Laundry



## Pipelining lessons

- Pipelined laundry takes 3.5 hours for 4 loads!
- Not esoteric computer architecture concept
- Natural, Intuitive

ECE437, Spring 2016

(5)



ECE437, Spring 2016

(6)

- Pipelining doesn't help latency of single task, it helps throughput of entire workload
- Multiple tasks operating simultaneously using different resources
- Potential speedup = Number pipe stages
- Pipeline rate limited by slowest pipeline stage
- Unbalanced lengths of pipe stages reduces speedup
- Time to "fill" pipeline and time to "drain" it reduces speedup
- Stall for Dependencies

max efficiency point  
at  $n \rightarrow \infty$

loads / operations / instructions

$$\text{improvement factor} = \frac{2N}{2 + (N-1)} = \frac{2N}{1.5 + N/2} \geq \frac{4N}{3+N}$$

$\approx 4$

### The Stages of Load



- Ifetch: Instruction Fetch
  - Fetch the instruction from the Instruction Memory
- Reg/Dec: Registers Fetch and Instruction Decode
- Exec: Calculate the memory address
- Mem! Read the data from the Data Memory
- Wr: Write the data back to the register file

ECE437, Spring 2016

(7)

$$\begin{aligned} \text{old time} &= CPI \times n \times t \\ \text{new time} &= CPI \times (n-1) \times \frac{t}{P} + CPI \times 1 \times \frac{t}{P} \\ \text{Speedup} &= P \end{aligned}$$

### Ideal speedup

- All instructions are executed in P pipeline stages in a multicycle path (i.e. CPI = P)
- Cycles time = t ns (say)
- Instr. Count = n
- Old time = P × t × n
- New time = n × t + (P-1) × t
- Speedup =  $P / (1 + (P-1)/n)$  =  $\frac{\text{old time}}{\text{new time}}$
- P is some constant, n is large  $\Rightarrow$  Speedup = P

ECE437, Spring 2016

(9)

### Pipelined Execution Representation



- Ideal speedup = 5

ECE437, Spring 2016

(8)

when pipeline stages are uniform

can say depends  
on biggest block  
that defines clk

### Non-uniform pipeline stages



ECE437, Spring 2016

(10)

after some cycles, we'll have 5 instructions same time.  
but overlay will be 200 ns. old instr is 800 ns.  
factor =  $800/200 = 4$ .

If ideal, factor = 5  $\Rightarrow$  overlay = 160 ns. [Not Happens].

### Non-uniform stages

Maximum Speedup  $\leq$  Number of stages  
Speedup  $\leq$  Time for unpipelined operation  
Time for longest stage

ECE437, Spring 2016

(11)

### Exercise

- A single cycle processor implementation can be pipelined in two ways
- Pipeline A uses a 5-stage pipeline
  - the 5 stages account for 15%, 10%, 20%, 20%, 35% of the delays respectively
- Pipeline B uses a 3-stage pipeline
  - the stages are balanced
- If instructions are all independent, which pipeline implementation is the better option

clock speed goes from 100% to 35%.  
here clock speed goes from 100% to 33.3%.

$\Rightarrow$  B better.

ECE437, Spring 2016

(12)



## Pipelined Processor Design

- Designing a pipelined processor
  - Associate resources with states
  - Resources not necessarily atomic
    - Register reads and writes can happen in the same cycle
    - Writes in first half of cycle and reads in the second half
  - Assert appropriate controls in each stage
    - Make sure all necessary information is carried through inter-pipestage flip-flops

ECE437, Spring 2016

(10)

## Pipelining the Load Instruction



- The five independent functional units in the pipeline datapath are:
  - Instruction Memory for the Ifetch stage
  - Register File's Read ports (bus A and busB) for the Reg/Dec stage
  - ALU for the Exec stage
  - Data Memory for the Mem stage
  - Register File's Write port (bus W) for the Wr stage

ECE437, Spring 2016

(20)

## Pipelined Datapath

- Pipeline datapath with registers

ECE437, Spring 2016

(21)

1st half write | 2nd half read

## Instruction Fetch

ECE437, Spring 2016

(22)

## Instruction Decode/Reg Read

ECE437, Spring 2016

(23)

## Execute (Address calculation)

ECE437, Spring 2016

(24)

## Memory Access



ECE437, Spring 2016

(25)

## Writeback



ECE437, Spring 2016

(26)

## Pipelined Datapath



- Pipeline datapath with registers

ECE437, Spring 2016

(27)

## Pipeline registers

- Length of Pipeline registers
- Book says (Fig 4.35)
  - IF/ID : IR (32), PC+4 (32) : 64 bits
  - ID/EX: IR (32), PC+4 (32) + RegA + RegB : 128 bits
  - EX/MEM: ALUout(32) + zero(1) + PC+4+SX(imm) (32) : 97
  - MEM/WB: ALUout (32) + MemData(32) : 64
- To be refined:
  - (see next slide)
  - Other control bits (IR not going through)

ECE437, Spring 2016

(28)

## Corrected Datapath



- Carry destination register through pipeline registers to WB stage

ECE437, Spring 2016

(29)

## The Four Stages of R-type



- Ifetch: Instruction Fetch
  - Fetch the instruction from the Instruction Memory
- Reg/Dec: Registers Fetch and Instruction Decode
- Exec:
  - ALU operates on the two register operands
  - Update PC
- Wr: Write the ALU output back to the register file

ECE437, Spring 2016

(30)

## Pipelining R-type and Loads



- We have pipeline conflict or structural hazard:
  - Two instructions try to write to the register file at the same time!
  - Only one write port

ECE437, Spring 2016 (31)

## Key observation

- Each functional unit can only be used once per instruction
  - Each functional unit must be used at the same stage for all instructions:
    - Load uses Register File's Write Port during its 5th stage
- | 1    | 2      | 3       | 4    | 5      |
|------|--------|---------|------|--------|
| Load | Ifetch | Reg/Dec | Exec | Mem Wr |
- R-type uses Register File's Write Port during its 4th stage
- | 1 | 2      | 3      | 4               |
|---|--------|--------|-----------------|
|   | R-type | Ifetch | Reg/Dec Exec Wr |
- 2 ways to solve this pipeline hazard.

ECE437, Spring 2016 (32)

## Soln.1: Insert "Bubble"



- Insert a "bubble" into the pipeline to prevent 2 writes at the same cycle
  - The control logic can be complex.
  - Lose instruction fetch and issue opportunity.
- No instruction is started in Cycle 6!

ECE437, Spring 2016 (33)

Values being clashed

## Soln.2: Delay R-type's Write

- Delay R-type's register write by one cycle:
  - Now R-type instructions also use Reg File's write port at Stage 5
  - Mem stage is a NOOP stage: nothing is being done.



ECE437, Spring 2016 (34)

## Does soln. 2 lose performance?

- Does it increase R-type's latency?
- Does it worsen R-type's CPI?
- What is the bottomline performance concern in pipelining?

ECE437, Spring 2018 (35)

bubbles hurt ☹ it kills a -fetch.  
2 seems worse but actually  
doesn't affect ~~operations~~ too much.  
throughput

## Four Stages of Store



- Ifetch: Instruction Fetch**
  - Fetch the instruction from the Instruction Memory
- Reg/Dec: Registers Fetch and Instruction Decode**
- Exec: Calculate the memory address**
- Mem: Write the data into the Data Memory**

ECE437, Spring 2016

(36)

## Exec Stage of Store



- Reg B and SX(Imm) are both needed

ECE437, Spring 2016

(37)

## Mem stage of Store



ECE437, Spring 2016

(38)

## Three\*\* Stages of Beq



- Ifetch: Instruction Fetch**
  - Fetch the instruction from the Instruction Memory
- Reg/Dec:**
  - Registers Fetch and Instruction Decode
- Exec:**
  - compares the two register operand,
  - select correct branch target address
  - latch into PC
- Four stages as in book**
  - Assume one delay slot (shadow instruction)
  - One "predict not taken" instruction

ECE437, Spring 2016

(39)

## Recap: Datapath



ECE437, Spring 2016

(40)

## Outline

- Pipeline Datapath**
- Control**
  - Relatively simple for pipelined processors
  - Similar to single cycle control

ECE437, Spring 2016

(41)

## Pipelined Datapath with Controls



## Pipeline Control vs. Single cycle control

- Similar (generation)
- What about
  - PCWrite? IR-write?
  - Write enable for the pipeline registers?
- Pipelined processor implementation
  - Common Stages
    - Instruction fetch stage (IF)
    - Decode and Register read (ID)
  - Instruction specific stages
    - Execute (Exec)
    - Memory access (Mem)
    - Write back (WB)

ECE437, Spring 2018 (43)

## Generating controls

- Simplify the problem
  - Reduced pipeline control to single-cycle control (Almost)
  - Generate controls once
  - Consume (i.e., use and discard) signals as you proceed along the pipeline stages
- Identify Stage of consumption for all control signals

ECE437, Spring 2016 (44)

## Focus on Control



- The Main Control generates the control signals during Reg/Dec
  - Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later
  - Control signals for Mem (MemWr Branch) are used 2 cycles later
  - Control signals for Wr (MemtoReg MemWr) are used 3 cycles later

ECE437, Spring 2016 (45)

## Meaning of controls

- RegWr : 1-> write, 0-> no write
- MemToReg : 1-> MDR, 0-> ALUOut
- RegDst : 1->rd, 0->rt
- ALUOp<1:0> : 00->Add, 01->Sub, 10->funct'
- ALUSrc : 0->RegB, 1->SX(Imm)
- Branch: 0->non-branch inst, 1-> branch
- MemRead: 1->memread, 0-> no memread
- MemWrite: 1->memwrite, 0->no memwrite
- ALU control abstracted away (as before)
  - Inputs: ALUop (2 bits), 6 "funct" bits from IR

ECE437, Spring 2018 (46)

## Extended Pipeline Registers



- Simple extensions to carry control state

ECE437, Spring 2018 (47)





ECE437, Spring 2016

(53)



ECE437, Spring 2016

(54)



## Implementing Pipeline control

- How do we design the control logic block?
  - Similar to single-cycle implementation
  - Derive logic expressions
    - E.g. MemtoReg = lw
    - ALUSrc = lw OR sw
    - RegWrite = R-type OR lw
  - Implement Combinational logic
    - PLA implementation
    - ROM implementation

ECE437, Spring 2016

(58)

## Pipeline registers

- Preliminary estimates
  - IF/ID : IR (32), PC+4 (32) : 64 bits
  - ID/EX: IR (32), PC+4 (32) + RegA + RegB : 128 bits
  - EX/MEM: ALUout(32) + zero(1) + PC+4+SX(imm) (32) : 97
  - MEM/WB: ALUout (32) + MemData(32) : 64
- Corrections:
  - ALUout and MemData
  - Destination register (5 bits)
  - Other control bits (IR not going through)

ECE437, Spring 2016

(59)

Do not have control out  
in each state.

## Implementing Pipeline Control

- Exercise
  - Compute required bit-width of pipeline registers
  - Before the next lecture

ECE437, Spring 2016

(60)

## Summary

- Only slightly more complicated than single cycle
  - not really, only because we:
    - ignored complications of forwarding, branch prediction, pipeline bubbles, squashing mispredicted instructions (after branches)
  - Need deeper understanding of hazards
    - need to modify datapath as well

ECE437, Spring 2016

(61)

## Today

- Pipeline datapath and control assuming independent instructions (no hazards)
- Data hazards
  - Types
  - Detecting RAW hazards
  - Handling RAW hazards (Partial)
    - Datapath
    - Control behavior

ECE437, Spring 2016

(62)

## Any complications

- Definitely:
  - Need to maintain "illusion" of sequential execution
  - Execution is actually overlapped.
- Pipeline Hazards
  - structural hazards: attempt to use the same resource two different ways at the same time
    - E.g., combined washer/dryer would be a structural hazard or folder busy doing something else (watching TV)
  - data hazards: attempt to use item before it is ready
    - E.g., one sock of pair in dryer and one in washer; can't fold until get sock from washer through dryer
    - instruction depends on result of prior instruction still in the pipeline
  - control hazards: attempt to make a decision before condition is evaluated
    - E.g., washing football uniforms and need to get proper detergent level; need to see after dryer before next load in
    - branch instructions

ECE437, Spring 2016

(63)



## Hazards

- Structural hazards

- Two instructions need the same hardware  
- should not overlap.

- Data Hazards

- Data not ready

- Control Hazards

- Which instruction to fetch? Not known.  
eg. BRA

ECE437, Spring 2016

(64)

## Hazards

- Can always resolve hazards by waiting

- pipeline control must detect the hazard
- take action (or delay action) to resolve hazards

- Delays

- Pipeline stalls/bubbles
- Reduce speedup

ECE437, Spring 2016

(65)

### Single Memory: Structural Hazard



ECE437, Spring 2016

(66)

## Data Hazards

add r1, r2, r3

sub r4, r1, r3

and r6, r1, r7

or r8, r1, r9

xor r10, r1, r11

ECE437, Spring 2016

(68)

## Structural Hazards

- Single memory (suppose)

If 1.3 memory accesses per instruction

- How?
- 1 per instruction for instruction fetch
- Fraction for data load/store
  - Depends on instruction mix
  - 20% load + 10% store
  - 15% load + 15% store

• CPI is at least 1.3 (otherwise memory is used more than 100%)

int.  
1 to get 0.3  
(30%)  
for 1 and store

ECE437, Spring 2016

(67)

## Hazards on r1

- Dependencies backwards in time



ECE437, Spring 2016

(69)

If we read the egg & write -ve, we can access one block early

2

where data created at first.

### Data Hazard Solution



### Forwarding (a.k.a. bypassing)



- Can't solve with forwarding:
  - Must delay/stall instruction dependent on loads

### Control Hazard: Solutions

- Stall: wait until decision is clear
  - It's possible to move up decision to 2nd stage by adding hardware to check registers as being read



- Impact: 2 clock cycles per branch instruction  
=> slow

### Control Hazard: Solutions

- Predict: guess one direction then back up if wrong
  - Predict not taken



- Impact: 1 clock cycles per branch instruction if right, 2 if wrong (right 50% of time say)
- More dynamic scheme: history of 1 branch (90%)

### Control Hazard: Solutions

- Redefine branch behavior (takes place after next instruction) "delayed branch"



- Impact: 0 clock cycles per branch instruction if can find instruction to put in "slot" (50% of time)
- As launch more instruction per clock cycle, less useful

### Summary: Hazards

- Structural hazards
  - Two instructions need the same hardware
  - Delay (pipeline bubble)
- Data Hazards
  - Data not ready
  - Forward/bypass (not for loads)
- Control Hazards
  - Which instruction to fetch? Not known.
  - Delayed branch, Predict not taken

ECE437, Spring 2016

(75)

delayed branch (shadow intr)  
→ let compiler deal with empty delay.

## Control Hazard: Solutions

- Redefine branch behavior (takes place after next instruction) "delayed branch"



- Impact: 0 clock cycles per branch instruction if can find instruction to put in "slot" (50% of time)
- As launch more instruction per clock cycle, less useful

ECE437, Spring 2016

(74)

## Summary: Hazards

### Structural hazards

- Two instructions need the same hardware
- Delay (pipeline bubble)

### Data Hazards

- Data not ready
- Forward/bypass (not for loads)

### Control Hazards

- Which instruction to fetch? Not known.
- Delayed branch, Predict not taken

ECE437, Spring 2016

(75)

## Data Hazards

- Challenge: maintain illusion of sequential execution
- Types of data hazards
  - RAW, WAR, WAW



ECE437, Spring 2016

(76)

## Data Hazards



- Avoid some "by design"
  - eliminate WAR by always fetching operands early (DCD) in pipe
  - eliminate WAW by doing all WBs in order (last stage, static)
- Detect and resolve remaining ones
  - stall or forward (if possible)

ECE437, Spring 2016

(77)

## Hazards on r1

### Dependencies backwards in time



ECE437, Spring 2016

(78)

## Data Hazard Solution



ECE437, Spring 2016

(79)

## Handling RAW Hazards

- Pre-requisite for handling RAW hazard
  - Detection!
  - Need to know:
    - Pending writes
      - available results that haven't been written back to registers
    - Operand Reads
      - Later instructions that potentially use these values
  - Instructions may not write to register file (store, branch)

ECE437, Spring 2016

(80)

## Recap : Pipeline Register Widths



ECE437, Spring 2016

(81)

## Logic equations for Hazard Detection

- Restatement of equations
- Text book version
  - WB stage is not really a hazard
    - Data is written in first half of cycle, read in 2nd half
  - EX/MEM.RegisterRd = ID/EX.RegisterRs
  - EX/MEM.RegisterRd = ID/EX.RegisterRt
  - MEM/WB.RegisterRd = ID/EX.RegisterRs
  - MEM/WB.RegisterRd = ID/EX.RegisterRt

ECE437, Spring 2016

(82)

## Lookahead: Forwarding datapath

- We know how to detect RAW hazards
- Now,
  - Modify Datapath to enable forwarding
  - Desired control behavior

ECE437, Spring 2016

(83)

## Base Pipelined Datapath

- Simplified representation of pipelined datapath
  - To avoid clutter



a. No forwarding

ECE437, Spring 2016

(84)

## Datapath w/Forwarding Unit



b. With forwarding

- ForwardA/ForwardB: 01→Mem, 10→EX

ECE437, Spring 2016

(85)

## Data Hazards and Forwarding: Walkthrough

### Code snippet

- identify hazards
- identify forwarding paths

```
sub $2, $1, $3
and $4, $2, $5
or $4, $4, $2
add $9, $4, $2
```

ECE437, Spring 2016

(86)



### True dependence : Forward in time



# Walkthrough



- Skip the boring stuff, jump to cycle 3



## • Forward ALUOut to Operand 1



- Forward ALUout to Op1, Mem to Op2



- Two candidates match, forward the latest

## Final Datapath



- "Imm" can be 2<sup>nd</sup> operand (Fig 4.57)

ECE437, Spring 2016

(93)

## Forwarding Control Behavior

- MEM hazard

If (MEM/WB.RegWrite AND  
MEM/WB.RegisterRd != 0 AND  
MEM/WB.RegisterRd = ID/EX.RegisterRs)  
ForwardA = 01

If (MEM/WB.RegWrite AND  
MEM/WB.RegisterRd != 0 AND  
MEM/WB.RegisterRd = ID/EX.RegisterRt)  
ForwardB = 01

- Does this fully meet our requirements ?

ECE437, Spring 2016

(95)

## Forwarding Control Behavior

- EX hazard

If (EX/MEM.RegWrite AND // not store or branch  
EX/MEM.RegisterRd != 0 AND // Result is used  
EX/MEM.RegisterRd = ID/EX.RegisterRs)  
ForwardA = 10

If (EX/MEM.RegWrite AND  
EX/MEM.RegisterRd != 0 AND  
EX/MEM.RegisterRd = ID/EX.RegisterRt)  
ForwardB = 10

ECE437, Spring 2016

(94)

## Lookahead: RAW hazard with load inst



- Forwarding as solution to RAW hazard
  - possible if no (true) dependence going backwards in time
  - True for R-type instructions
    - Data available after EX stage (i.e., at ALUOut)
  - Not true for load instruction

ECE437, Spring 2016

(96)

## Load instruction



- Replaced "sub" with "lw" in previous code-example

ECE437, Spring 2016

(96)

## Lookahead: RAW hazard with load inst



- Forwarding as solution to RAW hazard
  - possible if no (true) dependence going backwards in time
  - True for R-type instructions
    - Data available after EX stage (i.e., at ALUOut)
  - Not true for load instruction

ECE437, Spring 2016

(97)

## Load instruction



ECE437, Spring 2016

(98)

## Solution

- Catch-all solution for hazards
  - Stall
    - always works, but hurts performance
    - Use as last resort
- Challenge:
  - Modify pipeline implementation to support stalls when hazards are detected

ECE437, Spring 2016

(99)

## Load instruction



ECE437, Spring 2016

(100)

## Hazards with load instruction

- True dependencies: backward in time
- Stall the pipeline
- Minor change in terminology
  - If forwarding can solve it, it is not a hazard!
  - "Hazard" refers only to true backward dependencies in time.

ECE437, Spring 2016

(101)

## Handling the hazard

- As before
  - Detection
    - Logic equations to detect hazard
  - Actual stalling
    - Datapath/control modifications to achieve stalling

ECE437, Spring 2016

(102)

## Detection

- Conditions
  - Preceding instruction must read memory
    - MemRead must be asserted
  - Destination of preceding instruction ( $rt$ ) must be one of operands of current instruction
- Logic equations- restate above conditions formally
  - If( ID/EX.MemRead AND ((ID/EX.RegRt = IF/ID.RegRs) OR (ID/EX.RegRt = IF/ID.RegRt)) )  
STALL

lw \$2, 20(\$1)  
and \$4, \$2, \$5  
or \$4, \$4, \$2  
add \$9, \$4, \$2

## Stalling the pipeline

- Instruction cannot proceed
  - Following instruction must be stalled too.
  - Otherwise state in pipeline registers is overwritten
- Preceding instructions may proceed as usual
- Solution
  - inject NOP into EX/Mem pipeline
  - Prevent writes to PC and IF/ID register

# Datapath



## Walk-through (2 of 6)

or \$4, \$4, \$2



## Walk-through (3 of 6)



## Walk-through (4 of 6)



- Load value forwarded from MEM/WB register

## Walk-through (5 of 6)

after<1>



- \$4 value forwarded from EX/MEM register

## RAW Hazard with Loads: Summary

- True backward dependencies in time
  - Need to stall
- Stall achieved by
  - Detecting hazard (remember logic equation)
  - Inserting NOP (all EX/MEM/WB controls set to 0)
  - Preventing IF/ID register and PC from being overwritten
- Next Branch/Control Hazards

ECE437, Spring 2016

(112)

## When are conditional branches resolved?



ECE437, Spring 2016

(113)

## Branch Hazards



- Branch resolved in the MEM stage
- If taken,
  - $PC \leftarrow PC + 4 + SX(Imm * 4)$
  - $40 + 4 + 7 * 4 = 72$

ECE437, Spring 2016

(114)

## Control/Branch Hazards

- Branch resolved in the MEM stage
  - But next instruction has to be fetched in the next cycle
  - Reduce the penalty by moving decision earlier in pipeline
    - Need additional comparator ( $r1=r2?$ ) and adder ( $PC+4+SX(Imm*4)$ )
    - Value needed in earlier stage
      - what if  $r1/r2$  write is pending?
      - Forwarding and/or stalling
  - Reduced penalty from 3 cycles to 1 cycle

ECE437, Spring 2016

(115)

## Datapath for branch hazards:



ECE437, Spring 2016

(116)

## Can we do anything about the 1cycle stall?

### Two solutions

- Predict branch is always not taken
  - More sophisticated prediction schemes
- Delay slots
  - Compiler's problem
- Walkthrough example for solution #1
  - Predict not taken

ECE437, Spring 2016

(117)

# Walkthrough (1 of 2)



## Walkthrough (2 of 2)



# SEE FULL SIZE

## Walkthrough (1 of 2)



## Walkthrough (2 of 2)



## Dynamic Branch Prediction

- Better than static prediction
  - Branches are predictable
  - ~90% of program execution time is spent in ~10% of code (inner loops)
- Think of a program loop of N iterations
  - Taken N-1 times
  - Not taken last time

ECE437, Spring 2016

(120)

## Dynamic Branch Prediction



## Branch Prediction



- Store each branch's history \*\*\*
  - Not really
- Keep a small table indexed by program counter
- PC is large (32 bit number)
- Mapping to number of table entries
  - E.g. 16-entry branch prediction table
  - Mapping: use last 4 bits of PC
- Problem: Multiple branches may map to same entry in table -- Aliasing

ECE437, Spring 2016

(122)

## Branch Prediction: Branch Target Buffer



- Unlike static prediction of always not-taken, dynamic prediction will predict both taken and not-taken
- For taken prediction, we have a problem (what?)
- There is something about branch targets that helps us solve the problem (what?)
- Keep a small table of br targets indexed by PC (BTB)
  - WHILE fetching (potentially) a branch, use branch PC to look up prediction AND target of branch so NEXT cycle you can fetch taken instruction if predict taken else fetch not-taken instruction - back-to-back fetch for both taken or not-taken
- Many-to-few mapping from PC to BTB

ECE437, Spring 2016

(123)

# Dynamic Branch Prediction



1-bit branch predictor



2-bit branch predictor

- How does hardware "learn" branch behavior?
- Store each branch instruction's history \*\*\*
  - If a branch was taken "recently", predict taken
    - One bit saturating counter
    - Two bit counters

## Recap

- Branch instructions
  - Control flow hazard
  - Static branch prediction
    - Predict not taken
    - Squash instruction if prediction incorrect
- Dynamic Branch prediction
  - 1-bit and 2-bit state machines to track history of branches
  - Finite table
    - Potential for "aliasing"
    - Multiple branches map to the same predictor

ECE437, Spring 2016

(124)

## Delayed Branch

|     |      |               |     |      |               |
|-----|------|---------------|-----|------|---------------|
| 10  | lw   | r1, r2(35)    | 10  | lw   | r1, r2(35)    |
| 14  | addl | r2, r2, 3     | 14  | addl | r2, r2, 3     |
| 20  | sub  | r3, r4, r5    | 20  | sub  | r3, r4, r5    |
| 24  | orl  | r8, r9, 17    | 24  | beq  | r6, r7, 100   |
| 20  | beq  | r6, r7, 100   | 30  | orl  | r8, r9, 17    |
| 34  | add  | r10, r11, r12 | 34  | add  | r10, r11, r12 |
| 100 | and  | r13, r14, 15  | 100 | and  | r13, r14, 15  |

- Delayed branch: inst after branch always executed
  - Invisible to programmer
  - Compiler and/or assembler transforms code

ECE437, Spring 2016

(125)

## Easy way\*\*\* to hide branch hazard delay

- Delayed branch
  - Instruction after branch always executes
- **①** Find an independent instruction from before the branch
- **②** Find instructions from Taken (target) OR from Not Taken (fall-through) code section
- \*\*\* For Architects



ECE437, Spring 2016

(126)

## Ideal delay slot operation



ECE437, Spring 2016

(127)

## Target/Fall-through instructions



ECE437, Spring 2016

(128)

## What next?

- Exceptions
  - Multiple instructions in flight
  - PC has changed
- Advanced topics
  - Superscalar, dynamically scheduled processors, etc
- Real machines
  - Core-2-pipeline, Niagara Pipeline, Pentium 4

ECE437, Spring 2016

(129)

Exception  $\rightarrow$  program transfer (internal)

(Traps) e.g. On illegal op. code, divide 0

in VM.      |  
                → Precise :  
                |  
                → Imprecise :

Interrupts  $\rightarrow$  external event triggered.  
Asynchronous.

disk: milli sec.  
solid state: nano sec.

or illegal op code



### Interrupt, Exception, Trap?

- Interruptions**
    - caused by external events
    - asynchronous to program execution
    - may be handled between instructions
    - simply suspend and resume user program
  - Traps**
    - caused by internal events
      - exceptional conditions (overflow)
      - errors (parity)
      - faults (non-resident page)
    - synchronous to program execution
    - condition must be remedied by the handler
    - instruction may be retried or simulated and program continued or program may be aborted
  - MIPS convention:**
    - External: Interrupts
    - Internal: Exception
- examples

ECE437, Spring 2016

(132)

### Exception Semantics

- MIPS architecture defines the instruction as having no effect if the instruction causes an exception.
- When get to virtual memory we will see that certain classes of exceptions must prevent the instruction from changing the machine state.
- This aspect of handling exceptions becomes complex and potentially limits performance => why it is hard
  - Precise interrupts vs Imprecise interrupts ] refer to exceptions.

ECE437, Spring 2016

(133)

### Exceptions

- Pipeline Semantics**
  - No instruction after the exception causing instruction may execute
  - Every instruction preceding the exception causing instruction must complete execution

### Virtual memory

- 2 programs run together. They have total addr space.
- Still do not coincide with each other.
- Virtual memory (Translation mechanism) gets different addr. in diff places.
- Is also manages local memory with disk.

overly simplified ex.  
Say I get [lw]  
instr. but the  
value to load  
is in disk & hence  
unavailable.

The exception is  
thrown, program  
will come back  
later. In this time  
exception handler  
gets ready for  
translation

### MIPS Exceptions

- All exceptions jump to same handler code
  - "Cause" register
- We consider
  - Illegal instructions
  - Arithmetic overflows
- Handler behavior
  - Save PC of offending instruction (How? PC+4 has already been written to PC)
  - Use special register EPC (why not use \$31 like jal?)
  - Set cause register appropriately (0=ILL; 1=OVF)
  - Jump to handler at fixed address

ECE437, Spring 2016

(135)

(can be  
overwritten  
(risk))

## Datapath modifications

- Pipeline complications
- What stage is exception detected?
  - Overflow?
    - In EX stage, Also squash (convert to nop) EX stage
  - Illegal Instruction?
    - In ID stage, squash (convert to nop) ID stage
    - Similar to RAW hazard
  - What about external interrupts?
- Overflow in instruction i, illegal instruction in instruction i+1
  - Simultaneous exceptions
  - Hardware sorting

ECE437, Spring 2016

(136)



## Walk-through: Code snippet

### Main Code

```
40 sub $11,$2,$4
44 and $12,$2,$5
48 or $13,$2,$1
→ 4C add $1,$2,$1
50 slt $15,$6,$7
```

### Exception Code

[EPC] sw \$25, 1000(\$0)  
 $\downarrow$   
 $0x40000040$   
 (Same PC  
for exception)

ECE437, Spring 2016

(137)

## Walkthrough (1 of 2)



ECE437, Spring 2016

(138)

- All three instructions converted to nop

Note: except PC  
 saves PC + 4  
 but we want PC,  
 so, must have  
 subtraction circuit.

## Walkthrough (2 of 2)



ECE437, Spring 2016

(139)

- Fetch next instruction from handler PC (MIPS)

## Pipelined Processor



• Voila!

ECE437, Spring 2016

(140)

## Understanding Performance

- Iron law: Insts/prog \* CPI \* cycletime
- With pipelining:
  - CPI ~ 1 (with ideal memory, good branch prediction and few data hazards)
  - Cycletime : determined by critical path of one stage

ECE437, Spring 2016

(141)

*defacto standard  
[even in phones]*

## Superscalar Processor

- What does it mean?
  - Scalar processors (operate on scalar quantities)
  - Vector (operate on vectors) [e.g. add 2 vectors of arrays]
- Superscalar: multiple scalar operations in one cycle
- More than one instruction per cycle

ECE437, Spring 2016

(142)



## Dynamic Scheduling

- No need to suffer hazards if other useful work can be achieved
- Load Hazard results in pipeline stall
  - But other instructions are ready
  - "Oh! But we cannot execute instructions out of order" - Not really

lw \$t0, 20(\$s2)  
addu \$t1, \$t0, \$t2  
sub \$s4, \$s4, \$t3  
slli \$t5, \$s4, \$t3

ECE437, Spring 2016

(144)

## Dynamic Scheduling

- Instructions can execute when operands are ready
- Instructions can "commit" when all preceding instructions have committed

ECE437, Spring 2016

(145)

*enter in order  
function ops when values are ready*

*exit in order.*

## Real machines

- Let's examine Core 2
  - Microarchitecture more or less stable
  - Technology has improved

ECE437, Spring 2016

(146)

## Core 2

- 45nm
- Multiple decodes
- 14 stage pipeline
  - (went as high as ~31 in Pentium 4 line)
- Many other considerations
  - Pipelining for yield
- Source: Wikipedia



ECE437, Spring 2016

(147)

kills many hazards by picking different threads.

Operating system selects running threads.

○ bubble pipeline

similar but 6 stages.



Came as 24 core server  
when intel was on 2 core

### Pipelining Performance

- Start with ideal assumption
- Gradually introduce realism
  - Delay through all stages not equal
  - Structural hazards
  - Data (RAW) hazards
  - Control Hazards
  - Speedup

ECE437, Spring 2016 (149)



### Review: Ideal speedup

- All instructions are executed in P pipeline stages in a multicycle path (i.e. CPI = P)
- Cycletime = t ns (say)
- Instr. Count = n
- Old time = P x t x n
- New time = n x t + (P-1) x t
- Speedup = P/(1 + (P-1)/n) ≤ P
- P is some constant, n is large => Speedup ≈ P

ECE437, Spring 2016 (151)



$$\text{SpeedUp} = \frac{N \times t \times (\text{CPI} \times P)}{(N-1) \times t \times \text{CPI} + 1 \times t \times (\text{CPI} \times P)} = \frac{PN}{P+N-1} \Big|_{N \rightarrow \infty} = P$$

### Non-uniform stages

Maximum Speedup  $\leq$  Number of stages  
 Speedup  $\leq \frac{\text{Time for unpipelined operation}}{\text{Time for longest stage}}$

ECE437, Spring 2016

(154)

### Recap exercise

- A single cycle processor implementation can be pipelined in two ways
- Pipeline A uses a 5-stage pipeline
  - the 5 stages account for 15%, 10%, 20%, 20%, 35% of the delays respectively
- Pipeline B uses a 3-stage pipeline
  - the stages are balanced  $\Rightarrow \text{speedup} = \frac{100}{33.3} = 3$
- If instructions are all independent, which pipeline implementation is the better option

$$\begin{aligned} \text{speedup} &= \frac{100}{33.3} \\ &= 3 \\ &= 3 \end{aligned}$$

ECE437, Spring 2016

(155)

### Third Dose of Reality

- Structural hazards:**
  - E.g. single memory
  - Say 30% in instructions are memory operations
  - 1.3 memory accesses/instruction
  - CPI is atleast 1.3 (otherwise memory is used more than 100%)
  - State of the art: Two memories (caches) to eliminate structural hazards

any inst. &  
data fetch  
only one at  
time.

ECE437, Spring 2016

(156)

### Fourth Dose of Reality

#### Data hazards

- We can handle R-type RAW hazards with zero penalty (forwarding)
- Loads require stalls
  - Instruction mix: 20% loads, 80% other
  - Hazards: 60% of load values are used by the immediate next instruction
  - $\Rightarrow \text{CPI} = 0.8*1 + 0.2 * (0.6*2 + 0.4*1) = 1.12$
- What about WAR and WAW hazards?

ECE437, Spring 2016

(157)

here we tie  
the NOP to  
the load  
instruction.  
 $= 1 + \text{stall-cycles}$

CPI is throughput  
dependent.

### Alternate view (Easier to compute)

- $\text{CPI} = \text{CPI}_{\text{ideal}} + \text{CPI}_{\text{stall}}$
- $\text{CPI}_{\text{ideal}} = 1$ 
  - Includes cost of executing all instructions
  - Reason about stalls separately
- $\text{CPI}_{\text{stall}} = 0.2 * 0.6 * 1 = 0.12$ 
  - Load fraction = 0.2
  - Fraction of loads that cause stall = 0.6
  - Cycles of stall each time = 1

ECE437, Spring 2016

(158)

### Fifth Dose of Reality

#### Branch Hazards

- Stall depends on where branch is resolved
- Assume ID stage (with extra hardware)
  - 1 cycle penalty
    - Can fill delay slot with useful instructions
    - Can predict branch outcome
- Branches constitute 20%, delay slot can be filled 90% of the time
  - $\text{CPI} = 0.8*1 + 0.2 * (0.9*1 + 0.1*2) = 1.02$
- Branches constitute 20%, prediction accuracy is 90%
  - $\text{CPI} = 0.8*1 + 0.2 * (0.9*1 + 0.1*2) = 1.02$

ECE437, Spring 2016

(159)

when  
prediction  
was wrong.

here we don't  
tie NOP with  
operation (1e).  
Simply look at  
how often we  
stall.

### Alternate view (Easier to compute)

- $CPI = CPI_{ideal} + CPI_{stall}$
- $CPI_{ideal} = 1$ 
  - Includes cost of executing all instructions
  - Reason about stalls separately
- $CPI_{stall} = 0.2 * 0.1 * 1 = 0.02$ 
  - Branch fraction = 0.2
  - Fraction of branches that cause stall = 0.1
  - Cycles of stall = 1

ECE437, Spring 2016

(160)

### Mix and match

- Detailed instruction mix
  - Load frequency and hazard frequency
  - Branch frequency and branch misprediction ratio
  - Deal with each term separately

ECE437, Spring 2016

(161)

### Develop ability to correlate concepts

- 5-stage pipeline with no other hazards
  - Inst mix: 20% branches, 80% other
- Branch prediction in ID stage
  - Scheme A: 70% accuracy
  - Scheme B: 90% accuracy, but 10% increase in cycle time
  - Which is better?

ECE437, Spring 2016

(162)

### Develop ability to correlate concepts

- Requires CPI computation, iron law

$$\begin{aligned} CPI(A) &= 0.8 * 1 + 0.2 * (2 * 0.3 + 1 * 0.7) = 1.06 \\ CPI(B) &= 0.8 * 1 + 0.2 * (2 * 0.1 + 1 * 0.9) = 1.02 \\ \text{Cycletime (A)} &= t \\ \text{Cycletime (B)} &= 1.1 * t \\ \text{Insts/prog is the same for both} \\ \text{Iron law:} \\ - CPI(A) * \text{cycletime}(A) &= 1.06 * t \\ - CPI(B) * \text{cycletime}(B) &= 1.02 * 1.1 * t = 1.122 * t \end{aligned}$$

ECE437, Spring 2016

(163)

### Alternate View

- Count CPI for instructions and stalls separately
- $CPI = CPI_{ideal} + CPI_{stall}$
- $CPI_{ideal}(A) = 1$
- $CPI_{stall}(A) = 0.2 * 0.3 * 1 = 0.06$ 
  - Branch fraction = 0.2
  - Fraction of branches that cause stall = 0.3
  - Cycles of stall = 1
- $CPI_{ideal}(B) = 1$
- $CPI_{stall}(B) = 0.2 * 0.1 * 1 = 0.02$ 
  - Branch fraction = 0.2
  - Fraction of branches that cause stall = 0.1
  - Cycles of stall = 1

Think:  
Why is this view  
equivalent to the view  
on the previous  
slide?

ECE437, Spring 2016

(164)

Finally multiply by  
1.1 for  
comparison.

### Summary

- Exceptions
  - Know how to handle the easy cases
    - What to squash, what not to
  - Know how complicated exceptions can be
- Read Chapter 4 NOW
  - Maximize impact
  - Study while lecture material is "warm".
    - 2-3 hours now vs. 6-8 hours later

ECE437, Spring 2016

(165)

## Outline



- Memory

- Technology, organization, motivation for hierarchical organization

ECE437, Spring 2016

(2)

## Memory

- Storage elements

- registers, latches.

Small

In processor

- Expensive to add (?)

→ SRAM (Caches)

Medium

Onchip or board, close to processor

- Costly

→ DRAM (Main memory)

Large, 50ns access time

Today 4c/MB (0.36c/MB 4GB/\$30)

Spring 2016 0.62c/MB, Spring 2008,

SDRAM 2006 0.1c/MB (0.75c/MB)

→ In 2006 \$0.05/MB

Disk/Tape etc.

Large, 50ms access time

Slow (~ms)

Today 4c/GB (2TB for \$80)

Spring 2016 (2.4c/GB 4TB/\$100)

Spring 2010, 7.5c/GB (1TB for \$75\*) Fall

2008, \$0.11/GB (1TB for \$108.99), Spring 2008, \$0.22/GB (200GB for \$112\*)

→ In 2006 \$0.35-0.40/GB (200GB for 70\$\*)

Processor Datapath

Memory subsystem

I/O subsystem

ECE437, Spring 2016

(3)

## Memory Hierarchy Technology

- Random Access: *can access all/any memory directly (with similar time)*
- "Random" is good: access time is the same for all locations
- DRAM: Dynamic Random Access Memory
  - High density, low power, cheap, slow
  - DYNAMIC: need to be "refreshed" regularly
- SRAM: Static Random Access Memory
  - Low density, high power, expensive, fast
  - Static: content will last "forever" (until lose power)
- "Not-so-random" Access Technology:
  - Access time varies from location to location and from time to time
  - Examples: Disk, CDROM
- Sequential Access Technology: access time linear in location (e.g., Tape)
- The Main Memory: DRAMs + Caches: SRAMs

ECE437, Spring 2016

(4)

## DRAM



- Dynamic RAM

- Dense, 1T/bit-cell
- Forgets after a while
- 16Mb: 4K x 4K cell-array
- 24 bit address
  - 12 bit for row, 12 for column—reflected in the interface

- Implementation

- Word/byte DRAM built as DIMM/SIMMs

ECE437, Spring 2016

(5)

## 1T DRAM cell

- Charge on capacitor
- Write:
  - 1. Drive bit line
  - 2. Select row
- Read:
  - 1. Precharge bit line to Vdd
  - 2.. Select row
  - 3. Cell and bit line share charges
    - Very small voltage changes on the bit line
    - Can detect changes of ~1 million electrons
  - 4. Sense sense amp *sense changes*
    - Can detect changes of ~1 million electrons
  - 5. Write: restore the value
- Refresh
  - 1. Just do a dummy read to every cell.



ECE437, Spring 2016

(6)

write op:

set row high.  
set bit low or high.

The value is written.

→ If we want to change/write sub-row, must read whole row first.

bit line is plain metal!  
So on Hi-Z,  
it acts as capacitor  
& saves old value

reading destroys  
value & so  
must read  
again.

## Classical DRAM Organization



ECE437, Spring 2016

(7)

## DRAM Optimizations

- Fast Page Mode:
  - Row once, vary column address
  - Now universal
- EDO DRAM: Extended data out
  - Fast page mode
  - FPM plus pipelining → many reads on same row, does all together
- Synchronous DRAM
  - Tied to system clock, increasing bus-speed
  - SDRAM-DDR, DDR-2?
- RDRAM (Rambus)
  - transfer data on bus changes.
  - Analog tech still same.

ECE437, Spring 2016

(10)

*was a  
middleman  
to so get  
cut off by  
DDR.*

*company that  
made design  
& making money  
out of it.*

## DRAM organizations



- DRAM core unchanged
- Organization/data transfer optimizations
- Compare SDRAM vs. DDR vs DDR2

Picture source:  
<http://www.lostcircuits.com/> via xbitlabs.com

## DDR 3

- Same trajectory
  - Improved transfer rates (Digital part)
    - 8 transfers vs. 4 transfers
- Other changes
  - Voltage reduction (low power)
  - Technology change

ECE437, Spring 2016

(11)

## SRAM

- Data is static (as long as power is applied) (static != non-volatile)
- Logically, two cross-connected inverters with switches
  - CMOS inverter, MOS switch
  - 6-transistor implementation

ECE437, Spring 2016

(12)



ECE437, Spring 2016

(12)

*Read is  
not destructive.  
→ no need to  
write after read.*

*differential sense amp  
since one of bit & bit  
will change more than  
the other.*

*Also, NO  
refresh*

## SRAM Organization (16x4)



ECE437, Spring 2016

(13)

## SRAM Organization



- Internal arrays may be different
  - 32Kx8 array realized with 8 512x64 arrays

ECE437, Spring 2016

(14)

## Memory Technology Summary

- Performance of Main Memory:**
  - Latency:** Cache Miss Penalty
    - Access Time: time between request and word arrives
    - Cycle Time: time between requests
    - Bandwidth: I/O & Large Block Miss Penalty (L2)
- Main Memory is DRAM: Dynamic Random Access Memory**
  - Dynamic since needs to be refreshed periodically (8 ms)
  - Addresses divided into 2 halves (Memory as a 2D matrix)
    - RAS or Row Access Strobe
    - CAS or Column Access Strobe
- Cache uses SRAM: Static Random Access Memory**
  - No refresh (6 transistors/bit vs. 1 transistor /bit)
  - Address not divided
- Size:** DRAM/SRAM 256-512, Cost/Cycle time: SRAM/DRAM 32-64

ECE437, Spring 2016

(15)

## Technology Trends

| DRAM |        |            |
|------|--------|------------|
| Year | Size   | Cycle Time |
| 1983 | 256 Kb | 220 ns     |
| 1986 | 1 Mb   | 190 ns     |
| 1989 | 4 Mb   | 165 ns     |
| 1992 | 16 Mb  | 145 ns     |
| 1995 | 64 Mb  | 120 ns     |
| 1998 | 256 Mb |            |
| 2001 | 1 Gb   | 60ns       |
| 2004 | 4 Gb   | 50ns**     |
| 2008 | 16 Gb  | 45ns       |

Capacity                      Speed (latency)

Logic: 2x in 3 years    2x in 3 years - Nope!  
 DRAM: 4x in 3 years    2x in 10 years  
 Disk: 4x in 3 years    2x in 10 years

ECE437, Spring 2016

(16)

3D Ram is new & coming out.

## Problem

- Why do we care about the memory system?
  - CPU only as fast as mem-system can supply
- Understand SRAM/DRAM technology
- Exploit program characteristics to overcome processor-memory gap

ECE437, Spring 2016

(17)

## Challenge: Proc-Mem Gap

- Fact: Large memories are slow (and cheap), fast memories are small (and expensive)
- How do we create a memory that is large, cheap and fast (most of the time)?
  - Hierarchy
  - Parallelism

ECE437, Spring 2016

(18)

## The Memory Hierarchy

- By taking advantage of the principle of locality:
  - Present the user with as much memory as is available in the cheapest technology.
  - Provide access at the speed offered by the fastest technology.
- Flip-flops/latches->SRAM->DRAM->magnetic disks



ECE437, Spring 2016

(19)

'S' is not second.  
 it is like one-s of ns...  
 hundreds of bytes...

# Locality



- The Principle of Locality:**
  - Program access a relatively small portion of the address space at any instant of time.
- A library**
  - Finding the few books you want: Slow
  - One you found the books
    - Reading various chapters: Fast
    - Switching between books: Fast
  - Library  $\rightarrow$  Memory: Larger the better
  - Books at table  $\rightarrow$  Cache: Size is limited but access is faster

ECE437, Spring 2016

(20)

# Two flavors of locality



Temporal locality – reuse of data  
Spatial locality – use of nearby data

- Temporal Locality (Locality in Time):**
  - $\Rightarrow$  Keep most recently accessed data items closer to the processor
  - $\Rightarrow$  Odds are you'll refer to books on your table more than once
- Spatial Locality (Locality in Space):**
  - $\Rightarrow$  Move blocks consists of contiguous words to the upper levels
  - $\Rightarrow$  Odds are you'll read contiguous pages/chapters

ECE437, Spring 2016

(21)

# Tradeoffs



ECE437, Spring 2016

(22)

# Illusion of Speed and Capacity



- Hit:** data appears in some block in the upper level (example: Block X)
  - Hit Rate: the fraction of memory access found in the upper level
  - Hit Time: Time to access the upper level which consists of RAM access time + Time to determine hit/miss
- Miss:** data needs to be retrieved from a block in the lower level (Block Y)
  - Miss Rate = 1 - (Hit Rate)
  - Miss Penalty: Time to replace a block in the upper level + Time to deliver the block to the processor
- Hit Time << Miss Penalty**

ECE437, Spring 2016

(23)

# Why caches work

- Ahmdahl's law and caches**
  - Big win if: cache-hits are common case
  - Property of programs: "Locality"
- Average memory access time**
  - $F(\text{hit-time}, \text{miss-rate}, \text{miss-penalty})$
  - $5\text{ns} + 10\% * 50\text{ns} = 5 + 5 = 10\text{ns}$

# Lookahead: Caches

- Several issues:**
  - Determine hit/miss: How do we know if a data item is in the cache?
  - If it is, how do we find it?
  - If it isn't, where do I place it?
  - Replacement: What do we do with data that was present?
  - Who manages this? Compiler? Hardware? Software/OS/Programmer?

ECE437, Spring 2016

(25)

# Lookahead: Caches

- Several issues:
  - Determine hit/miss: How do we know if a data item is in the cache?
  - If it is, how do we find it?
  - If it isn't, where do I place it?
  - Replacement: What do we do with data that was present?
  - Who manages this? Compiler? Hardware? Software/OS/Programmer?

## Managing the memory hierarchy



- Whose responsibility is it?
  - Short answer: it depends on the level

# Register<->Main memory



- Managed explicitly by compiler/programmer
  - "Word" granularity
  - Load/store ties memory locations to registers (allocation)
  - Register temporaries ("spill" to memory when needed)
- Complexity!

*lw, sw, ll, sc  
given by code.*

*granularity : "Word"*

# Disk<->Memory



- Programmer: Explicit file read/write
- Disk-block/Page granularity
- OS: Automatic transparent to user
  - Virtual memory
  - Illusion of large memory, protection
  - More later

*make system call  
like fopen*

*brings pages into  
RAM*

*granularity : "Page"*

# Cache<->Memory



- **Hardware managed;** needs to be fast *for speed*
- Automatic: to avoid complexity of explicit management
- "Block" granularity to exploit spatial locality
- Retain recently accessed blocks to exploit temporal locality

ECE437, Spring 2016

(30)

granularity: "Block"

## Cache Operation

- Tag, data, valid
- Tag:
  - Mapping larger space (all addresses) to a smaller space (cache)
  - To identify which block (address) is resident
- Data:
  - Block: more than one word
- Valid:
  - Not everything in cache is meaningful
- Frame (block-frame/cache-frame)



# Cache Operation

- Hit/Miss detection
  - If (incoming tag == stored tag)
    - Hit // i.e. block is resident in cache
    - Return word to processor
  - Else
    - Miss
      - Make space : replace some other block
      - Get block from memory
      - Put block in "data" part, set tag using new address tag

```
if (tag match)
    return word
else
    make space
    get memory
    return word
```

to fill space

## Example of cache operation

- 16-frame cache
- 16 bit address-space
- Use\*\*\*:
  - Lower four bits of address as index of frame
  - All other bits of address as tag



# Cache Operation

- Cache operation for the following address-stream
- Assumes one word cacheblocks



# Cache Operation

- Tag comparators
- Hit detection
- How many frames?
- What is the cache size?
- What if block size is more than one word?



Fig 7.7

## Multi-word Cache Blocks



- Use block-offset lines to select word

ECE437, Spring 2016

(36)

## Big picture

Workload or Benchmark programs

Processor

reference stream  
 $\langle op,addr \rangle, \langle op,addr \rangle, \langle op,addr \rangle, \langle op,addr \rangle, \dots$   
op: i-fetch, read, write

Memory  
\$  
MEM

Optimize the memory system organization to minimize the average memory access time for typical workloads

avg. memory access time.

- Why do we care about AMAT? AMAT affects CPI
  - Remember there is 1.x memory ops per instruction

ECE437, Spring 2016

(37)

# Impact on Performance

- Suppose a processor executes at
  - Clock Rate = 2 GHz (0.5 ns per cycle)
  - CPI = 1.1
  - 50% arith/logic, 30% ld/st, 20% control
- Suppose that 5% of memory operations get 100 cycle (50ns) miss penalty
- CPI = ideal CPI + average stalls per instruction  
$$\begin{aligned} &= 1.1(\text{cyc}) + (0.30 \text{ (datamops/ins)} \\ &\quad \times 0.05 \text{ (miss/datamop)} \times 100 \text{ (cycle/miss)}) \\ &= 1.1 \text{ cycle} + 1.5 \text{ cycle} \\ &= 2.6 \end{aligned}$$

*95% memory access goes to cache.*

*datamemory operations*
- ~58 % of the time the processor is stalled waiting for memory!
- A 0.5% instruction miss rate would add an additional 0.5 cycles to the CPI

# Block Size Tradeoff

- In general, larger block size take advantage of spatial locality BUT:
  - Larger block size means larger miss penalty:
    - Takes longer time to fill up the block
  - If block size is too big relative to cache size, miss rate will go up
    - Too few cache blocks
- In general, Average Access Time:
  - = Hit Time + Miss Penalty x Miss Rate



- Measurements from real programs
- Bottomline: Block size chosen by experiment, typically 16-128 bytes

Fig 7.7

# Block Size Tradeoff

- In general, larger block size take advantage of spatial locality BUT:
  - Larger block size means larger miss penalty:
    - Takes longer time to fill up the block
  - If block size is too big relative to cache size, miss rate will go up
    - Too few cache blocks
- In general, Average Access Time:
  - = Hit Time + Miss Penalty x Miss Rate



## Block Size



- Measurements from real programs
- Bottomline: Block size chosen by experiment, typically 16-128 bytes

Fig 7.7

→ 4 words = 32 words

# Checkpoint

- Summary:
  - Cache management in hardware
  - Caches terminology and organization
    - Frames
    - Blocks
    - Tags
  - Example of Cache operation
- Next: 4 questions
  - Where is a block placed?
  - How is a block found?
  - Which block is replaced?
  - What happens on a write?

---

## 4 Questions for Memory Hierarchy

- Q1: Where can a block be placed in the upper level? (*Block placement*)
- Q2: How is a block found if it is in the upper level?  
(*Block identification*)
- Q3: Which block should be replaced on a miss?  
(*Block replacement*)
- Q4: What happens on a write?  
(*Write strategy*)

4.10 A & B, a scenario where B is better is if we access a no. that falls in same place, eg. Addr 12 & 4. In (A), there will always conflict & give 100% miss. In (B), both stored, so no miss after first read.



### Example: 1 KB Direct Mapped Cache with 32 B Blocks

- [32 - (N-way)]  
bits are tag bits
- For a  $2^{**} N$  byte cache:
    - The uppermost  $(32 - N)$  bits are always the Cache Tag
    - The lowest  $M$  bits are the Byte Select (Block Size =  $2^{**} M$ )



## Another Extreme Example: Fully Associative

- Fully Associative Cache
  - Forget about the Cache Index
  - Compare the Cache Tags of all cache entries in parallel
  - Example: Block Size = 32 B blocks, we need N 27-bit comparators
- By definition: Conflict Miss = 0 for a fully associative cache



ECE437, Spring 2016

(45)

## A Two-way Set Associative Cache

- N-way set associative: N entries for each Cache Index
  - N direct mapped caches operate in parallel
- Example: Two-way set associative cache
  - Cache Index selects a "set" from the cache
  - The two tags in the set are compared in parallel
  - Data is selected based on the tag result



ECE437, Spring 2016

(46)

# Disadvantage of Set Associative Cache

- N-way Set Associative Cache versus Direct Mapped Cache:
  - N comparators vs. 1
  - Extra MUX delay for the data
  - Data comes AFTER Hit/Miss decision and set selection
- In a direct mapped cache, Cache Block is available BEFORE Hit/Miss:
  - Possible to assume a hit and continue. Recover later if miss.



ECE437, Spring 2016

(47)

## 4-way set associative cache



ECE437, Spring 2016

(48)

# Associativity spectrum



- Conflict misses reduced with higher associativity
- Associative search is complex, suited for smaller caches
- Increasing associativity:
  - Increases tag bits, shrinks index bits
  - Increases comparator size (~ tag bits)

## Performance



- A little associativity goes a long way

# Organization Methodology



- How to determine:
  - Number of bits for
    - Index, tag and block offset
- Walkthrough example(s)
  - 32KB, 32B block, 2-way associative cache

ECE437, Spring 2016

(51)

## Cache Organization

- Cache size = 32 KB (CS)
- Block size = 32 B (BS)
  - Frames (F) = CS/BS = 1024 (= 1K)
- Associativity = 2-way (A)
  - Number of frames/way = F/A = 512

|        |       |       |
|--------|-------|-------|
| t = 18 | i = 9 | b = 5 |
|--------|-------|-------|

- Address-bits = 32 bits (Ad)
  - Block-offset bits (b) =  $\lg(\text{BS}) = \lg(32) = 5$
  - Index bits (i) =  $\lg(F_{\text{pW}}) = \lg(512) = 9$
  - Tag bits (t) = Ad - i - b = 32 - 9 - 5 = 18

ECE437, Spring 2016

(52)

# Survey Results



## Cache Block Diagrams

- 96KB, 3-way set associative, 64Byte block cache
- Direct-mapped, 16KB, 128 byte block cache

March 3<sup>rd</sup>

Can solve by prefetching values.  
Can also solve by increasing block size.

- ### Miss Classification
- Compulsory** (cold start or process migration, first reference): first access to a block
    - "Cold" fact of life: not a whole lot you can do about it
    - Note: If you are going to run "billions" of instruction, Compulsory Misses are insignificant
  - Conflict (collision):**  $\Rightarrow$  not enough set associativity
    - Multiple memory locations mapped to the same cache location
    - Solution 1: increase cache size
    - Solution 2: increase associativity
  - Capacity:**
    - Cache cannot contain all blocks accessed by the program
    - Solution: increase cache size
  - Invalidation:** other process (e.g., I/O) updates memory

ECE437, Spring 2016

(15)

eg. Cache is 8KB  
size but  
program runs  
16KB serial again  
& again. Here  
program has locality  
but cache not adequate

In fully associative cache,  
conflict miss = 0 (always).  
Hence, a miss must be  
of another kind.

### Source of Cache Misses Quiz

|                                   | Direct Mapped | N-way Set Associative | Fully Associative |
|-----------------------------------|---------------|-----------------------|-------------------|
| Cache Size:<br>Small, Medium, Big | big           | medium                | small             |
| Compulsory Miss:                  | same          | same                  | same              |
| Conflict Miss                     | high          | med                   | zero (0)          |
| Capacity Miss                     | low           | med                   | high              |
| Invalidation Miss                 | same          | same                  | same              |

Choices: Zero, Low, Medium, High, Same

ECE437, Spring 2016

(16)

cache size  
(if all has same  
Capacity, then  
these could also be  
Same)

### Q3. Block Replacement

- Q3: Which block should be replaced on a miss?

(Block replacement)

- Easy for Direct Mapped
- Set Associative or Fully Associative:
  - Random
  - FIFO (Not a good idea, in general)
  - LRU (Least Recently Used)
  - NRU (Not Recently Used)

OPT (Ideal)

Simple heuristics capture large fraction of opportunity.  
Sweat for the last 10%

ECE437, Spring 2016

(17)

look at freq of usage.  
least freq. one is replaced.

track most freq used. Replace any of the others.  
 $P(\text{replace 2nd most freq}) = \frac{1}{\# \text{ slots}}$   
still good.

v. imp.  
Solve once atleast.

### Exercise

- Give an example of an address stream where

- 2-way associative cache is better than direct-mapped cache
- Direct mapped cache is better than 2-way cache.

- Use 16 entry caches, assume LRU replacement

### Q4. Write strategy

- Q4: What happens on a write? (Write strategy)
- Write through** — The information is written to both the block in the cache and to the block in the lower-level memory.
- Write back** — The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
  - is block clean or dirty?
- Pros and Cons of each?
  - WT: read misses cannot result in writes
  - WB: no writes of repeated writes
- WT always combined with write buffers so that don't wait for lower level memory

ECE437, Spring 2016

(18)

ERU implem.  
each index has a rank from 0 to n-1.

When touching an address, that index's rank goes to 0 and all ranks before it increase by 1.

eg  
0 1 2  
1 0 0  
2 3 3  
3 4 4  
(4) 0 1  
5 5 5  
6 6 6  
7 7 7

### Write Buffer for Write Through



here data goes to WB & DRAM.  
The WB slowly writes into DRAM.

- A Write Buffer is needed between the Cache and Memory
  - Processor: writes data into the cache and the write buffer
  - Memory controller: writes contents of the buffer to memory
- Write buffer is just a FIFO:
  - Typical number of entries: 4-8
  - Works fine if: Store frequency (w.r.t. time)  $\ll$  1 / DRAM write cycle
  - Can tolerate bursty behavior
- Memory system designer's nightmare:
  - Store frequency (w.r.t. time)  $\rightarrow$  1 / DRAM write cycle
  - Write buffer saturation

if too many stores from processor against DRAM write, then will have saturation problem

ECE437, Spring 2016

(19)

keeps data in cache.  
It checks if addr is clean or dirty. If dirty, memory to be updated.

this again means no space for reads.

write  
allocate

If write miss, instead of getting block into cache & writing to it, can just write to memory. Mostly, it will be final write. If it is a hot block, v. likely that it will be read. The read miss will get the block in cache anyways.

## Write Buffer Saturation



- Store frequency (w.r.t. time)  $> 1 / \text{DRAM write cycle}$ 
  - If this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row):
    - Store buffer will overflow no matter how big you make it
    - CPU Cycle Time  $\leq \text{DRAM Write Cycle Time}$
- Solution for write buffer saturation:
  - Use a write back cache
  - Install a second level (L2) cache:



ECE437, Spring 2016

write through model

## Write-miss Policy: Write Allocate versus Not Allocate

- Assume: a 16-bit write to memory location 0x00 and causes a miss
  - Do we read in the block?

Yes: Write Allocate  
No: Write No Allocate



ECE437, Spring 2016

(67)

## AMAT Impact quiz

|                                    | Cache hit time | Miss penalty     | Missrate                   |
|------------------------------------|----------------|------------------|----------------------------|
| Cache Size:<br>Small, Medium, Big? | ↑              | —                | ↓                          |
| Associativity:<br>Low-to-high      | ↑              | —                | ↓                          |
| Block size:<br>Low to high         | small ↑        | ↑ more to bring. | ↓ compulsory miss reduces. |
| Replacement Policy:<br>FIFO to LRU | —              | —                | ↓ *                        |

Conflict miss  
reduces.  
Compulsory miss  
reduces.

- Average Memory Access time
- AMAT = hit time + miss-rate \* miss-penalty
- Choices: same, increasing or decreasing
- Ignore last row for now

ECE437, Spring 2016

(68)

## Improving Cache Performance

- Reduce Hit time
  - small and simple  $\rightarrow$  direct mapped
- Reduce miss rate
  - Large cache, large blocksize, associative,
- Reduce miss penalty
  - Reduce block-size
- Remember Amdahl's law
  - Common case: hit
  - Reduce miss-rate at the cost of hit time

ECE437, Spring 2016

(70)

## Cache design space

- Several interacting dimensions
  - cache size
  - block size
  - associativity
  - replacement policy
  - write-through vs write-back
  - write allocation
- The optimal choice is a compromise
  - depends on access characteristics
    - workload
    - use (I-cache, D-cache, TLB)
  - depends on technology / cost
- Simplicity often wins



ECE437, Spring 2016

(71)

## Practical design issues

- Multi-level Caches
- Split Cache vs. unified cache

ECE437, Spring 2016

(72)

## Practical design issues

- Multi-level Caches
- Split Cache vs. unified cache

ECE437, Spring 2016

(72)



ECE437, Spring 2016

(73)

## Multilevel Caches

- Cycle time = 0.5ns ( $\sim 2\text{GHz}$  clock)
- Main memory access = 50ns = 100 cycles
- L1 miss rate = 5%
- Without 2<sup>nd</sup> level cache
  - $\text{AMAT}_{L1} = 1 + 5\% * 100 = 6$  cycles
- With 2<sup>nd</sup> level cache
  - L2 miss-rate = 2% (local miss-rate)
  - L2 hit time = 10 cycles *this means it can be a big cache since 10 cycles used to get/set data.*
  - $\text{AMAT}_{L2} = 10 + 2\% * 100 = 12$  cycles
  - $\text{AMAT}_{L1} = 1 + 5\% * 12 = 1.6$

ECE437, Spring 2016

(74)

## AMAT Impact: Answers

|                                    | Cache hit time | Miss penalty | Missrate |
|------------------------------------|----------------|--------------|----------|
| Cache Size:<br>Small, Medium, Big? | inc            | same         | dec      |
| Associativity<br>:Low-to-high      | inc            | same         | de       |
| Block size:<br>Low to high         | same           | inc          | dec**    |
| Replacement Policy<br>FIFO to LRU  | same           | same         | dec*     |
| Multi-level Caches                 | same           | dec*         | same     |

ECE437, Spring 2016

(75)

*having L2 will help reduce miss penalty.  
look at eg. in prev slide.*

## Split caches

- One for instruction, one for data
- Split cache
  - Instructions account for 75% of mem accesses
  - I-missrate = 5%, D-missrate = 6%
  - $\text{AMAT} = (1 + 0.05 * 10) * 0.75 + (1 + 0.06 * 10) * 0.25$
  - = 1.525
- Unified Cache
  - Aggregate missrate = 4%
  - $\text{AMAT} = (1 + 0.04 * 10) = 1.4$ ???
  - For modern pipelined processor:
    - single-memory structural hazard

ECE437, Spring 2016

(76)

*This is worse, but allows avoiding of pipeline hazards.*

Today, L1 split to prevent pipeline hazards. L2 & beyond, we use unified for DYNAMIC MEMORY SHARING.

ECE437, Spring 2016

(77)



## Summary

- Memory technology (Capacity/cost/speed)
- Need for hierarchy
- Performance
  - AMAT, ideal vs. real CPI
- Cache management:
  - Associativity, indexing, write handling, multi-word blocks etc.
- Diagrams of arbitrary cache organizations
- Next:
  - Cache-friendly coding techniques
  - Virtual Memory

ECE437, Spring 2016

(78)

## Software interaction

- RAM model of computation
  - All memory accesses take the same amount of time
  - Theoretical Model - has nothing to do with DRAM
- Reality:
  - Caches introduce non-uniformity
  - Hits take less time than misses
- Quicksort
  - fastest comparison based sorting algorithm when all keys fit in memory:  $\Theta(n \lg(n))$
- Radixsort
  - also called "linear time" sort because for keys of fixed length and fixed radix a constant number of passes over the data is sufficient independent of the number of keys:  $\Theta(n)$

for r.v. big  
r.v., cache  
don't  
really  
help.

ECE437, Spring 2016

(79)

## QS vs. RS : Instructions



ECE437, Spring 2016

(80)

## QS vs. RS : Time, Instructions



ECE437, Spring 2016

(81)

Quick (Instr/key)  
Radix (Instr/key)

as time progresses,  
radix goes flat and  
quick will cross it.  
But, if cache  
friendliness, quick is better.  
Radix is cache unfriendly  
since it must pass through  
all keys again &  
again for each digit.

## QS vs. RS : Cache misses



- RAM model results are still valid... but at much larger input sizes
- How does one create practical, fast algorithms?
- Cache-aware programming/compilation

ECE437, Spring 2016

(82)

## Exercise 1

- What is the miss-rate of the following code segment for
  - 16 frame, direct mapped cache with 4-word cache blocks?
  - 16-frame, 2-way set-associative cache with 4-word cache blocks?
- Assume cold start (i.e., no valid data in the cache to begin with)
- Assume the following start addresses (bytes) for the three arrays
  - A : 0xC000
  - B : 0xC100
  - C : 0xC200
- Ignore writes for now

```
for(i=0;i<64;i++){
    C[i] = min(A[i], B[i]);
}
for(i=0;i<64;i++){
    D[i] = max(A[i], B[i]);
}
```

What if start  
addresses are:  
A : C000  
B : D110  
C : E220  
Instead?

each array  
256 bytes

ECE437, Spring 2016

(83)



2

In  
Direct mapping  
we have B blocks  
out of k  
so on.  
Merging (Fusing)  
loops gives  
no adv.  
Original 2 loops  
also no adv.  
100% miss  
rate.

# Worksheet 1

ECE437, Spring 2018

## Cache-aware programming

- **Instruction Sequencing**
    - *Loop Interchange*: change nesting of loops to access data in order stored in memory
    - *Loop Fusion*: Combine 2 independent loops that have same looping and some variables overlap
    - *Blocking*: Improve temporal locality by accessing "blocks" of data repeatedly vs. going down entire columns or rows
  - **Data Layout**
    - *Merging Arrays*: Improve spatial locality by single array of compound elements vs. 2 separate arrays
    - *Nonlinear Array Layout*: Mapping 2 dimensional arrays to the linear address space
    - *Pointer-based Data Structures*: node-allocation
  - Example walkthrough: Loop fusion, Blocking, Merging Arrays

22222  
22222

ECE437 Spring 2016

(B)

## Worksheet for example

|                       | Base | Loop Fusion | Array Merging | Tiling |
|-----------------------|------|-------------|---------------|--------|
| Direct Mapped         |      |             |               |        |
| 2-way Set associative |      |             |               |        |

- Count read misses

ECE437, Spring 2016

186

## Miss categorization

- $64+64 = 128$  words
  - Each miss brings in 4 words
    - Minimum of 32 ( $=128/4$ ) misses
    - Cannot do better
  - Exercise: Identify conflict and capacity misses

ECE437, Spring 2016

(B)

## Walk through Exercise 1

- Count read misses for the following code segment for
    - 16 entry, direct mapped cache with 4-word cache blocks?
    - 16-entry, 2-way set-associative cache with 4-word cache blocks?
  - Assume cold start (i.e. no valid data in the cache to begin with)
  - Assume the following start addresses for the three arrays
    - A : 0xC000
    - B : 0xC100
    - C : 0xC200
    - D : 0xC300
  - Ignore writes for now
  - USE WORKSHEET

ECE437, Spring 2016

1P

## #1 : Loop Fusion

- Coverts distant reuse to near reuse
  - Enhances temporal locality
  - Code Transformation

```

for(i=0;i<64;i++) {
    C[i] = min( A[i], B[i] );
}
for(i=0;i<64;i++) {
    D[i] = max( A[i], B[i] );
}

```

---

ECE437 Spring 2016

100

## #2: Array merging

- Eliminates conflicts
  - Array of compound structure vs.
  - multiple arrays of simple data
- Enhances spatial and temporal locality
- Data layout transformation

```
for(i=0;i<64;i++) {  
    C[i] = min( A[i] , B[i] );  
}  
for(i=0;i<64;i++) {  
    D[i] = max( A[i] , B[i] );  
}  
  
-----  
  
Struct merge {  
    int A;  
    int B;  
};  
Struct merge M[64];  
for(i=0;i<64;i++) {  
    C[i] = min( M[i].A , M[i].B );  
}  
for(i=0;i<64;i++) {  
    D[i] = max( M[i].A , M[i].B );  
}
```

ECE437, Spring 2016

(01)

## #3: Blocking (Tiling)

- Exploits re-use across loops
  - Divide into pieces that fit in the cache vs.
  - Marching through whole array
- Capacity misses
- Code Transformation

```
for(i=0;i<64;i++) {  
    C[i] = min( A[i] , B[i] );  
}  
for(i=0;i<64;i++) {  
    D[i] = max( A[i] , B[i] );  
}
```

```
for (j=0; j<2;j++)  
{  
    for(i=0;i<32;i++) {  
        C[32*j + i] = min( A[32*j + i] , B[32*j + i] );  
    }  
    for(i=0;i<32;i++) {  
        D[32*j + i] = max( A[32*j + i] , B[32*j + i] );  
    }  
}
```

ECE437, Spring 2016

(02)

## State of the practice

- Cache friendly programming challenges
  - No global view of application
  - Different cache sizes
- Analyze programs after they're written
  - Find bad access patterns
  - Fix them
  - Lather, Rinse and Repeat

ECE437, Spring 2016

(03)

# Worksheet for example

*Solve conflict  
misses*

solves capacity  
issues).

|                       | Base                                 | Loop Fusion                 | Array Merging                   | Tiling                                                   |
|-----------------------|--------------------------------------|-----------------------------|---------------------------------|----------------------------------------------------------|
| Direct Mapped         | miss A<br>64 + miss B<br>64          | miss A<br>64 + miss B<br>64 | miss A<br>32 + miss B<br>32 + 0 | miss A<br>64 + miss B<br>64                              |
| 2-way Set associative | miss A + miss B<br>16 + miss B<br>16 | miss A + miss B<br>16 + 16  | miss A + miss B<br>32 + 0       | miss A + miss B<br>$(8 + 8) \times 2$<br>for outer loops |

- Count read misses

best we  
can do  
 $\therefore$  128 words  
by undetectable.  
 $= 32 \cancel{out}$   
miss.

here we have a capacity miss.  
So, 2-ways do not help

# ECE437: Introduction to Digital Computer Design

Mithuna Thottethodi  
Chapter 7 (texbook + CD)

Spring 2016

## Outline

- Multiprocessors
  - Why?
  - Power changes everything
  - What about performance
    - Two processors of speed X instead of one processor of speed 2X
    - Challenges in harnessing parallelism
  - Is there a single vendor with uniprocessor products? (non-embedded)
- Key concepts
  - Programming models
  - Cache coherence
  - Synchronization
  - Consistency

ECE437, Spring 2016

(2)

## Various Programming Views



ECE437, Spring 2016

(3)

bits slower: switch ↑ critical path.

## Shared address space MPs



ECE437, Spring 2016

(4)

## Process vs. Thread

- "Heavyweight" processes
  - Different Thread of control
    - PC, registers, stack *It has its own values*
  - Different Address space (page table, heap)
- "Lightweight" processes, a.k.a. "threads"
  - Different PC, register values, and stack allocation
    - "Context"
  - Same address space thus same page table, same heap*
- Shared regions across heavyweight processes possible.

ECE437, Spring 2016

(5)

## AMD and Intel quad-core designs



Figure © Embedded.com

ECE437, Spring 2016

(6)

# How do threads communicate?

- Our focus:

- Shared memory, bus-based systems
  - Shared address space
  - Memory does not have to physically shared
- Communication via loads and stores

ECE437, Spring 2016

(7)

# Writing Parallel Programs

- Example: Monte Carlo simulation - Embarrassingly parallel
- Determine value of Pi ( $\pi$ )
- Area of square =  $1 \times 1 = 1$
- Area of  $\frac{1}{4}$  circle =  $(\pi r^2)/4 = \pi/4$
- Algorithm:
  - Pick a random point in the unit square
  - If inside quarter circle, Cin++
  - Repeat N times (large N)
- probability of randomly picked point being inside the circle
  - From first principles:  $(\pi/4)/1 = \pi/4$
  - From experiment = Cin/N
  - $\pi = 4 * Cin/N$



ECE437, Spring 2016

(8)

## Writing parallel programs

- Create "P" (P=4) threads
- Each thread does  $N/4$  experiments
- "Join" (Back to one thread)
- What happens to Cin?
- Can each thread update Cin?
- For now, assume they can
  - synchronization problem later
- Almost linear speedup
  - P processors  $\Rightarrow$  P speedup



ECE437, Spring 2016

(9)

## Merge-sort

- Basic Sequential (Recursive) algorithm
  - Recursively Mergesort top-half and bottom-half of data
    - Terminate at single element
    - Merge the two halves
- Parallel algorithm (Assume fixed two-way parallelism)
  - One thread does Sequential merge-sort of top-half
  - Another thread does the same with bottom half
  - Merge on one thread. (Must wait till both threads are done.)

ECE437, Spring 2016

(10)

## Parallel merge

- Previous version did merging on one thread
- Possible to split merge work across two threads.
  - The first thread merges from the top till it has  $N/2$  elements
  - The second thread merges from the bottom till it has  $N/2$  elements.
- Done.

ECE437, Spring 2016

(11)

## Common Parallelization Strategies

- Think of parallelism at the algorithm level
- Possibly different
  - Best parallel algorithm
  - Best parallelization of best sequential algorithm
- Think of partitioning
  - Tasks? (do webpage serving and data-base lookup for different requests in parallel.)
  - Data? (Do similar operations on different data in parallel.)

ECE437, Spring 2016

(12)

## Programming models

- Pthreads - Shared memory processors \*\*
- MPI - clusters \*\*
- OpenMP - Shared memory
  - Marking loops as parallel loops
- Higher level primitives
  - Streaming, MapReduce, Parallel Recursion (Cilk C)
  - Customized to architecture
    - CUDA for nVidia GPUs

ECE437, Spring 2016

117

## Several Interesting Problems

- Does everything work intuitively\*\*\*  
 - No.  
 - Reason about coherence/consistency
- Sequencing  
 - Synchronization
- After the break: Cache Coherence

ECE437, Spring 2016

118

## Example Cache Coherence Problem



- Is there a problem?  
 - Replication + writes = coherence problem  
 P<sub>2</sub> wants u=7 but Memory has u=5.  
 Also, P<sub>1</sub> has u=5, a stale value.

ECE437, Spring 2016

119

## Coherence

- Fuzzy idea
  - Necessary for correctness
- Precise definition
  - For any memory location X
    - Operations issued by any process occur in the order in which they issued
    - Value returned by read is the value written by last write
- Derivative properties
  - Write propagation: writes become visible to everyone (when?)
  - Write serialization: writes to a single location seen in same order by every processor

ECE437, Spring 2016

120

## How to enforce coherence

- Correct actions:
  - On write, invalidate other copies
  - On write, update other copies
- Snooping:
  - Every node can "see" the bus
  - If all transactions can be observed
    - Each can maintain coherence

ECE437, Spring 2016

121

## Snoopy coherence



- Controller updates state of blocks in response to processor and snoop events and generates bus actions
- Often have duplicate cache tags

ECE437, Spring 2016

122

3/23

ideally we want to get  
the most recent value.

after step (5),  
there is a  
stale value  
in P1's cache  
& memory.

Also, who  
gets to  
give (5)  
the value.

Latent in  
#3, even  
memory is stale.

### Example Cache Coherence Problem



- Is there a problem?
  - Replication + writes = coherence problem

ECE437, Spring 2016

(15)

### Coherence

- Fuzzy idea
  - Necessary for correctness
- Precise definition
  - For any memory location X
    - Operations issued by any process occur in the order in which they issued
    - Value returned by read is the value written by last write
- Derivative properties
  - Write propagation: writes become visible to everyone (when?)
  - Write serialization: writes to a single location seen in same order by every processor

ECE437, Spring 2016

(16)

### How to enforce coherence

- Correct actions:
  - On write, invalidate other copies ] can do either
  - On write, update other copies ] maybe unnecessary
- Snooping:
  - Every node can "see" the bus ] imp
  - If all transactions can be observed
    - Each can maintain coherence

ECE437, Spring 2016

(17)

Snoop tag: duplicate copy of all tags.

### Snoopy coherence



- Controller updates state of blocks in response to processor and snoop events and generates bus actions
- Often have duplicate cache tags

ECE437, Spring 2016

(18)

Snoop controller sees the bus action and if it sees action, it updates itself.

when you snoop a write, you lose your read/write permissions.

At any time we can have one writer and one reader (given two writers).

### Snoopy Design Choices

- Snoopy protocol
  - set of states
  - state-transition diagram
  - actions
- Basic Choices
  - write-through vs. write-back
  - invalidate vs. update

ECE437, Spring 2016

(19)

### A 3-State Write-Back Invalidiation Protocol

- 3-State Protocol (MSI)
  - Modified
    - one cache has valid/latest copy
    - memory is stale [RAM]
  - Shared
    - multiple caches have valid copy
  - Invalid
- Must invalidate all other copies before entering modified state
- Requires bus transaction (order and invalidate)

3 states for the blocks of data in cache.

ECE437, Spring 2016

(20)

All blocks initially in Invalid State.

M : have read & write permission.

S : have read permission.

I : no permissions.

### MSI Processor and Bus Actions

- Processor:
  - PrRd (load)
  - PrWr
  - Writeback on replacement of modified block
- Bus
  - Bus Read (BusRd) Read without intent to modify, data could come from memory or another cache
  - Bus Read-Exclusive (BusRdX) Read with intent to modify, must invalidate all other caches copies
  - Writeback (BusWB) cache controller puts contents on bus and memory is updated
  - Definition: cache-to-cache transfer occurs when another cache satisfies BusRd or BusRdX request
- Let's draw it!

ECE437, Spring 2016

(21)

on a tag level

### MSI State Diagram



ECE437, Spring 2016

(22)

Memory inhibit is given to cache to cache transfer of data is done

## Update vs Invalidate

1. P1 updates P2: 10 & they all use value.  
great to update.
2. P1 updates 1 value 100 times.  
better to invalidate.  
why? Bus to implement update takes Energy -

### MSI State Diagram



ECE437, Spring 2016

(22)

## An example

| Proc Action   | P1 State | P2 state | P3 state | Bus Act | Data from  |
|---------------|----------|----------|----------|---------|------------|
| 1. P1 read u  | S        | -        | -        | BusRd   | Memory     |
| 2. P3 read u  | S        | -        | S        | BusRd   | Memory     |
| 3. P3 write u | I        | -        | M        | BusRdX  | Memory (?) |
| 4. P1 read u  | S        | -        | S        | BusRd   | P3's cache |
| 5. P2 read u  | S        | S        | S        | BusRd   | Memory     |

- Single writer, multiple reader protocol
- Why Modified to Shared?
  - Why not to Invalid?
  - Give up as little as possible. Retain option to read.
  - Why throw away write permission?
  - Ownership
- What if not in any cache?
  - Read, Write produces 2 bus transactions!

ECE437, Spring 2016

(23)

## Invalidate vs. Update

better for update

Pattern 1:  
for i = 1 to k  
  P1(write, x); // one write before many reads  
  P2--Pn-1(read, x);  
end for i

better for invalidate.

Pattern 2:  
for i = 1 to k  
  for j = 1 to m  
    P1(write, x); // many writes before one read  
  end for j  
  P2(read, x);  
end for i

ECE437, Spring 2016

(24)

## Invalidate vs. Update, cont.

- Pattern 1 (one write before reads)

~~N=16~~ M=10, K=10

- Update

• Iteration 1: N regular cache misses (70 bytes)

• Remaining iterations: update per iteration (14 bytes)

~~center, 8 data~~

- Total Update Traffic =  $16 \times 70 + 9 \times 14 = 1246$  bytes

• book assumes 10 updates instead of 9 -

- Invalidate

• Iteration 1: N regular cache misses (70 bytes)

• Remaining: P1 generates upgrade (6), 15 others Read miss (70)

• Total Invalidate Traffic =  $16 \times 70 + 9 \times 6 + 15 \times 70 = 10,624$  bytes

~~bytes~~

- Pattern 2 (many writes before reads)

- Update = 1400 bytes

- Invalidate = 824 bytes

Only 1 with Read Snarfing

becomes 1 if snarfing

PL writes go to 2:N-1 in invalidate when, say PG reads value, the others SNARF & get the valid.

Otherwise, 2:N-1 must all get a read.

SNARFING makes sense only if there are more than 2 processors.

## Coherence: Performance Issues

- Access Pattern specific optimizations
- Update vs. Invalidate
- Cache Block size
  - Remember 3C classification
  - 4<sup>th</sup> C: "Coherence miss" You read, someone wrote to this location
  - True vs. False Sharing
- Read Snarfing
- Non-Atomic Bus

ECE437, Spring 2016

(25)

in reality this happens.  
we have super simplified it.

## Coherence vs. Consistency

- Intuition says loads should return latest value
  - what is latest?
- Coherence concerns only one memory location
- Consistency concerns apparent ordering for all locations
- A Memory System is Coherent if
  - can serialize all operations to that location such that,
  - operations performed by any processor appear in program order
    - program order = order defined by program text or assembly code
  - value returned a read is value written by last store to that location

ECE437, Spring 2016

(26)

says each read  
happens to pull  
latest written  
value.

## Why Coherence != Consistency

B3UE

```
/* initial A = B = flag = 0 */
P1           P2
A = 1;       while (flag == 0); /*spin*/
B = 1;       print A;
flag = 1;    print B;
```

Intuition says printed A = B = 1  
Coherence doesn't say anything, why?  
Consider coalescing write buffer

ECE437, Spring 2016

we expect  
11 to  
be printed.

But coherence  
does NOT guarantee  
that assign  
happens in  
order.

## The Memory Model



ECE437, Spring 2016

## Sequential Consistency

- Register Allocation can break SC

```
a = f(x)
while (a == 0);
      b = g(a);
      lw $arg1, x
      jal f
      sw a, $retval
loop: beqz $1, $0, loop
      lw $arg1, $1
      jal g
```

- Use "volatile" declaration to prevent register allocation

- volatile int a;

prevents 'a' in

a register.

We want to keep  
reading 'a' from  
memory, not register.

ECE437, Spring 2016

here, \$1 gets 'a'  
and \$1 never  
updates. So  
changing 'a'  
does not break  
loop.  
To fix

This error may  
work 10 times,  
fail once then work again.

## Sequential Consistency

- Programmers think in SC
  - i.e. preserve memory order
- Performance maximized when memory order is flexible
  - i.e. do not preserve memory order
- Architect's solution: Don't preserve order
  - Our problem is now the programmer's problem
  - Won't fly
- Can we get intuitive programming, correctness and fast performance?
- Several interesting approaches proposed
  - Parallel programming remains harder than sequential programming.
- Next: Say, we have SC
  - Can we co-ordinate activities of threads?

ECE437, Spring 2016

## Why Synchronize?

lw \$r1, ac-balance  
add \$r1, \$r1, 100  
sw \$r1, ac-balance

lw \$r1, ac-balance  
add \$r1, \$r1, -50  
sw \$r1, ac-balance

- Two concurrent operations on the same bank account
  - Deposit \$100 check
  - Withdraw \$50 cash
- Correct result
  - $ac\text{-balance}(new) = ac\text{-balance}(old) + 50$
- What can go wrong?

Interleaving of operations is bad.  
Must prevent this interleaving.

ECE437, Spring 2016

## Synchronization

- Solution: Locks/Mutex
  - "critical section"
  - Only one person can have a lock
- Other synchronization primitives
- Barriers
  - Wait for everyone before proceeding
  - Can we just use a counter?
- Event notification
  - Producer consumer
  - Just use flags?
  - Consistency issues

LOCK ac-lock  
lw \$r1, ac-balance  
add \$r1, \$r1, 100  
sw \$r1, ac-balance  
UNLOCK ac-lock

LOCK ac-lock  
lw \$r1, ac-balance  
add \$r1, \$r1, -50  
sw \$r1, ac-balance  
UNLOCK ac-lock

ECE437, Spring 2016

Mar 30.

Any interleaving  
will cause  
issues.

## Why Synchronize?

lw \$r1, ac-balance  
add \$r1, \$r1, 100  
sw \$r1, ac-balance

lw \$r1, ac-balance  
add \$r1, \$r1, -50  
sw \$r1, ac-balance

- Two concurrent operations on the same bank account
  - Deposit \$100 check
  - Withdraw \$50 cash
- Correct result
  - $ac\text{-balance}(new) = ac\text{-balance}(old) + 50$
- What can go wrong?

ECE437, Spring 2016

(32)

## Synchronization

- Solution: Locks/Mutex
  - "critical section"
  - Only one person can have a lock
- Other synchronization primitives
- Barriers
  - Wait for everyone before proceeding
  - Can we just use a counter?
- Event notification
  - Producer consumer
  - Just use flags? ↗
  - Consistency issues

LOCK ac-lock  
lw \$r1, ac-balance  
add \$r1, \$r1, 100  
sw \$r1, ac-balance  
UNLOCK ac-lock

LOCK ac-lock  
lw \$r1, ac-balance  
add \$r1, \$r1, -50  
sw \$r1, ac-balance  
UNLOCK ac-lock

ECE437, Spring 2016

(33)

## How Not To Implement Locks

- LOCK  
while(lock\_variable == 1);  
lock\_variable = 1;  
0: unlocked  
1: locked &
- UNLOCK  
lock\_variable = 0;
- Implementation requires Mutual Exclusion!
  - Can have two processes successfully acquire the lock say P1 & P2 loads lock-variable.  
It is 0. So both lock the value  
K process code when only one guy should process it.

ECE437, Spring 2016

(34)

## Performance of Test & Set

### LOCK

while (test&set(x) == 1);

### UNLOCK

x = 0;

- High contention (many processes want lock)
- Remember the CACHE!
- Each test&set is a read miss and a write miss
- Several Optimizations

ECE437, Spring 2016

(35)

BUT, test&set is not RISK type. The op. does 2 things together, un-RISK like.

## Atomic Read-Modify-Write Operations

- int i = 1;
- Test&Set(r,x)  
r = m[x]  
m[x] = 1
  - Swap(r,x)  
r = m[x], m[x] = r
  - Compare&Swap(r1,r2,x)  
if (r1 == m[x]) then  
r2 = m[x], m[x] = r2
  - Fetch&Op(r,x,op)  
r = m[x], m[x] = op(m[x])

• r is register  
• m[x] is memory location x

ECE437, Spring 2016

(36)

## Better Lock Implementations

- Two choices:
  - Don't execute test&set so much
  - Spin without generating bus traffic
- Test&Set with Backoff
  - Insert delay between test&set operations (not too long)
  - Exponential seems good ( $k \times c$ ) chance that for some time, no ops happen.
  - Not fair
- Test-and-Test&Set
  - Spin (test) on local cached copy until it gets invalidated, then issue test&set
  - Intuition: No point in trying to set the location until we know that it's not set, which we can detect when it gets invalidated...
  - Still contention after invalidate
  - Still not fair

ECE437, Spring 2016

(37)

load flag and check. If locked, you wait until UNLOCKED. At this point, your copy invalidates & so do a TEST&SET.

## LL-SC based locks

- Different approach
  - Do not offer atomic Read-modify-write primitive
  - Offer mechanisms to detect non-atomicity
  - Load linked sets flag + address (link register)
    - Flag cleared if any other bus write observed
  - Store-conditional

```

task:    ll $s1, X   X is lock addr. $s1 gets value that
         bne $s1, $0 Lock // held lock
         sc X,$s2           // $s2 contains 1
         bne $s2, $0 Lock // non atomic
         jr $s1
         sw $0, X
         jr $s1

```

LL puts X in register & sets a flag.  
SC checks flag & ensures addr is same.

## Synchronization: Performance Issues

- Granularity
  - Lock single element vs. Lock whole array
  - Locking overhead vs. loss of concurrency
- Contention
  - TAS good at low contention
- Fairness
  - TAS offers no guarantees of fairness

## Synchronization

- Deadlocks: Huge Correctness Issue
- Races
- May not be visible to the programmer
  - Hard to debug (Non-reproducible)
- "Locks don't compose with themselves"
  - Other synchronization mechanisms have other problems
- We live with this problem
  - Find a solution, become famous, rich.

P<sub>1</sub>

P<sub>2</sub>



① LL X

④ SC X (returns 0)



② LL X

③ SC X (returns 1)

Apr 1.

## Synchronization.

Say we lock an array/hash table.

Then the other thread cannot operate.

This kills parallelism.

Now say we have a lock on bucket.

Need more memory but gives great concurrency (parallel programming).

TAS (test & set) do not guarantee fairness in operation orders.

TAS has good contentions though.

↳ less requirement of locks  
for program operations.

Contention: # of  
PIDs fighting for  
resource.  
good cont.  $\Rightarrow$  less contention

### Deadlock.

Bank ex. Say there must be data transfer from A to B.

to send from A to B, need to lock both accounts.

Now, say P1 gets lock to A ( $A \rightarrow B$  transfer)

& P2 gets lock to B ( $B \rightarrow A$  transfer)

Now, they are stuck. DEADLOCK.

To fix, we can have processor back off after 'n' fails.

- this causes LIVE LOCK. (both back & come in),

To fix, we can have global lock. But this means single transaction at the ~~same~~ time. Too inefficient.

To fix, let ID of smaller then ID of larger is locked.

So, in  $A \rightarrow B$  &  $B \rightarrow A$ , both try lock A first, then B.

In this case no issue happens (fixed).

↳ debugging  
↳ reproducible  
↳ hard to  
fix

## Outline

- Basic arithmetic (Ch 3.1-3.5)

- Representing numbers
- 2's Complement, unsigned
- Addition and subtraction
- Add/Sub ALU
  - full adder, ripple carry, subtraction, together
- Logical operations
  - and, or, xor, nor, shifts + barrel shifter
- Carry lookahead, overflow
- Integer multiplication, division
- floating point representation/arithmetic

ECE437, Spring 2016

## Unsigned Integers

- Recall:

- $n$  bits give rise to  $2^n$  combinations
- let us call a string of 32 bits as " $b_{31} b_{30} \dots b_3 b_2 b_1 b_0$ "
- $f(b_{31} \dots b_0) = b_{31} \times 2^{31} + \dots + b_1 \times 2^1 + b_0 \times 2^0$
- Treat as normal binary number
- e.g.,  $0 \dots 011010101$   
 $= 1 \times 2^7 + 1 \times 2^6 + 0 \times 2^5 + 1 \times 2^4 + 0 \times 2^3 + 1 \times 2^2 + 0 \times 2^1 + 1 \times 2^0$   
 $= 128 + 64 + 16 + 4 + 1 = 213$
- max  $f(111 \dots 11) = 2^{32} - 1 = 4,294,967,295$
- min  $f(000 \dots 00) = 0$
- range  $[0, 2^{32}-1] \Rightarrow \# \text{ values } (2^{32}-1) = 0 + 1 = 2^{32}$

ECE437, Spring 2016

## Numbers

- Bits are just bits (no inherent meaning)
  - conventions define relationship between bits and numbers
- Binary numbers (base 2)  
 $0000\ 0001\ 0010\ 0011\ 0100\ 0101\ 0110\ 0111\ 1000\ 1001\dots$   
 decimal:  $0 \dots 2^{32}-1$
- Of course it gets more complicated:  
 numbers are finite (overflow)  
 fractions and real numbers \*  
 negative numbers  
 e.g., no MIPS subi instruction; addi can add a negative number
- How do we represent negative numbers?  
 i.e., which bit patterns will represent which numbers?

ECE437, Spring 2016

no. of  
numbers  
is 0.  
can't have  
2 types of  
zeroes.

## Number Representation

| Sign Magnitude: | One's Complement | Two's Complement |
|-----------------|------------------|------------------|
| 000 = +0        | 000 = +0         | 000 = +0         |
| 001 = +1        | 001 = +1         | 001 = +1         |
| 010 = +2        | 010 = +2         | 010 = +2         |
| 011 = +3        | 011 = +3         | 011 = +3         |
| 100 = -0        | 100 = -3         | 100 = -4         |
| 101 = -1        | 101 = -2         | 101 = -3         |
| 110 = -2        | 110 = -1         | 110 = -2         |
| 111 = -3        | 111 = -0         | 111 = -1         |

- Balance, number of zeros, ease of arithmetic

ECE437, Spring 2016

## Signed Integers

- 2's complement
- $f(b_{31} b_{30} \dots b_1 b_0) = b_{31} \times 2^{31} + \dots + b_1 \times 2^1 + b_0 \times 2^0$
- max  $f(0111 \dots 11) = 2^{31} - 1 = 2147483647$
- min  $f(100 \dots 00) = -2^{31} = -2147483648$  (asymmetric)
- range  $[-2^{31}, 2^{31}-1] \Rightarrow \# \text{ values } (2^{31}-1 - -2^{31} + 1) = 2^{32}$
- E.g., -6
- $000 \dots 0110 \rightarrow 111 \dots 1001 + 1 \rightarrow 111 \dots 1010$

Scanned by CamScanner

## Two's Complement Operations

- Negating a two's complement number: Invert all bits and add 1
  - remember: "negate" and "invert" are quite different!
- Converting n bit numbers into numbers with more than n bits:
  - MIPS 16 bit immediate gets converted to 32 bits for arithmetic
  - copy the most significant bit (the sign bit) into the other bits
 

|      |               |           |
|------|---------------|-----------|
| 0010 | $\rightarrow$ | 0000 0010 |
| 1010 | $\rightarrow$ | 1111 1010 |
  - "sign extension"
    - (remember ORI vs. LW, zero extend vs sign extend)

## Negation in 2's complement

- Negation: Invert all bits, add 1
  - Why?
- If the  $(k+1)$  bit 2's complement representation of a number  $N$  is  $\langle b_k b_{k-1} \dots b_1 b_0 \rangle$   

$$N_k = -b_k \times 2^k + b_{(k-1)} \times 2^{(k-1)} + \dots + 2^1 \times b_1 + 2^0 \times b_0$$
- Show that the negation procedure is sound.

ECE437, Spring 2016

## Negation in 2's complement

- Key trick:
  - Complement of bit  $b$  can be written as  $(1-b)$
- Inversion gives us  

$$-(1-b_k) \times 2^k + (1-b_{(k-1)}) \times 2^{(k-1)} + \dots + 2^1 \times (1-b_1) + 2^0 \times (1-b_0)$$
- Separating the red and blue terms and adding 1 from flipping over addition  

$$-2^k + 2^{(k-1)} + \dots + 2^1 + 2^0 + 1 - N_k$$
- Blue terms plus 1 goes to zero. Q.E.D.  

$$= 2^k$$

ECE437, Spring 2016

$$\text{hence } -2^k + 2^k - N_k = -N_k$$

Proof: flip bit & add 1 causes negation

## Sign extension

- Consider representation of -2:

| 3bit (decimal) | 2-bit (decimal) |
|----------------|-----------------|
| 011 (+3)       |                 |
| 010 (+2)       |                 |
| 001 (+1)       | 01 (+1)         |
| 000 (0)        | 00 (0)          |
| 111 (-1)       | 11 (-1)         |
| 110 (-2)       | 10 (-2)         |
| 101 (-3)       |                 |
| 100 (-4)       |                 |

ECE437, Spring 2016

## Mathematical basis

### Inductive proof

- if the  $(k+1)$ -bit 2's complement representation of a number  $N$  is  $\langle b_k b_{k-1} \dots b_1 b_0 \rangle$   

$$N_k = -b_k \times 2^k + b_{(k-1)} \times 2^{(k-1)} + \dots + 2^1 \times b_1 + 2^0 \times b_0$$
- Then the  $(k+2)$ -bit 2's complement representation  $\langle b_k b_{k-1} \dots b_1 b_0 \rangle$  also represents  $N$ 
  - $N = -b_k \times 2^{k+1} + b_{(k-1)} \times 2^{(k-1)} + \dots + 2^1 \times b_1 + 2^0 \times b_0$
  - $= (-2 \times b_k + b_0) \times 2^k + b_{(k-1)} \times 2^{(k-1)} + \dots + 2^1 \times b_1 + 2^0 \times b_0$
  - $= -b_k \times 2^k + b_{(k-1)} \times 2^{(k-1)} + \dots + 2^1 \times b_1 + 2^0 \times b_0$

$$\therefore N_{k+1} = N_k$$

ECE437, Spring 2016

## Addition and Subtraction

- Similar to decimal (carry/borrow twos instead of tens)
- Identical operation for signed and unsigned

- E.g. Unsigned vs signed

|      |       |    |
|------|-------|----|
| 0011 | 3     | 3  |
| 1010 | 10    | -6 |
|      | <hr/> |    |
| 1101 | 13    | -3 |

ECE437, Spring 2016

## Interesting cases

- Show computation in 4-bit 2's complement representation  
 $4+4$

$$(-4) + (-4)$$

- Overflow: later

ECE437, Spring 2016

## ALU bit-slice

- Bit-wise operation

- and, or, add

- Full adder?

- Sub?



ECE437, Spring 2016

## Full adder

- Three inputs and two outputs

$Cout, s = F(a, b, Cin)$

-  $Cout$ : only if at least two inputs are set

-  $s$ : only if exactly one input or all three inputs are set

- Logic?



ECE437, Spring 2016

odd no of 1's  
give output of 1.

## Subtract

$$A - B = A + (-B)$$

- form two complement by invert and add one



ECE437, Spring 2016

## Exercise

- How do I convert 237 to binary?

- What is the 2's complement representation (3-bit) of:

-3:

+5:

- If a number X is represented as  $b_3 b_2 b_1 b_0$  if 4 bit 2's complement arithmetic, what is the 8-bit 2's complement representation of X?

- Show the computation in 4-bit 2's complement arithmetic:

$$5 - (-3)$$

ECE437, Spring 2016

## Overflow

| Decimal | Binary | Decimal | 2's Complement |
|---------|--------|---------|----------------|
| 0       | 0000   | 0       | 0000           |
| 1       | 0001   | -1      | 1111           |
| 2       | 0010   | -2      | 1110           |
| 3       | 0011   | -3      | 1101           |
| 4       | 0100   | -4      | 1100           |
| 5       | 0101   | -5      | 1011           |
| 6       | 0110   | -6      | 1010           |
| 7       | 0111   | -7      | 1001           |
|         |        | -8      | 1000           |

- Examples:  $7 + 3 = 10$  but ...  
 $-4 - 5 = -9$  but ...



ECE437, Spring 2016

## Overflow

- Overflow: the result is too large (or too small) to represent properly

- Example:  $-8 \leftarrow 4\text{-bit binary number} \leftarrow 7$

- When adding operands with different signs, overflow cannot occur

- Overflow occurs when adding:

- 2 positive numbers and the sum is negative

- 2 negative numbers and the sum is positive

- On your own: Prove you can detect overflow by:

- Carry into MSB ⊕ Carry out of MSB



ECE437, Spring 2016

## Overflow detection

- Carry into MSB  $\oplus$  Carry out of MSB
  - For N-bit ALU: Overflow =  $\text{CarryIn}[N-1] \text{ XOR } \text{CarryOut}[N-1]$



for overflow.

## Negative, Zero

- Required for conditional branches
- Zero
  - How?
  - NOR all 32 bits
  - Avoid 33rd bit (carry out)
- Negative may be required on overflow
  - If  $(a-b)$  jump: jump taken if  $a-b$  is negative
- Tempting to consider MSB
  - E.g. if  $(-5 < 4)$  branch
  - Branch should be taken, but  $(-5-4)$  computation results in overflow... so MSB is 0
  - E.g. if  $(7 < -3)$  branch
  - Branch should not be taken but  $(7 - (-3))$  results in overflow... so MSB is 1.
  - Negative = ??

ECE437, Spring 2016

## Outline

- Advanced addition techniques
  - The problem with ripple carry - ~~linear delay~~
  - Blocked ripple carry - ~~ripple~~  
~~in chunks / blocks~~  
~~to avoid too much delay.~~
  - Carry Look-ahead adder
  - Carry select adder
  - Barrel shifter

ECE437, Spring 2016

## Ripple-carry adder



ECE437, Spring 2016

In 32 bit adder,  
there would be  
horrible  
gate delays.

## Problem : Slow

- Is a 32-bit ALU as fast as a 1-bit ALU?
  - Delay =  $32 \times \text{Critical-path(Fast adder)} + \text{XOR}$
- Is there more than one way to do addition?
  - Two extremes: ripple carry and sum-of-products
  - Flatten expressions to two levels

Can you see the ripple? How could you get rid of it?

$$\begin{aligned}
 c_1 &= b_0c_0 + a_0c_0 + a_0b_0 \\
 c_2 &= b_1c_1 + a_1c_1 + a_1b_1 \\
 c_3 &= b_2c_2 + a_2c_2 + a_2b_2 \\
 c_4 &= b_3c_3 + a_3c_3 + a_3b_3
 \end{aligned}
 \quad
 \begin{aligned}
 c_2 &= <7 \text{ min-terms}> \\
 c_3 &= <15 \text{ min-terms}> \\
 c_4 &= <31 \text{ min-terms}>
 \end{aligned}
 \quad
 \begin{aligned}
 &\text{When we} \\
 &\text{replace} \\
 &\text{C}_n \text{ with} \\
 &\text{string.}
 \end{aligned}$$

Not feasible! Why? Exponential fanin

Now, we use more  
gates but or gates

need more & more  
inputs.

ECE437, Spring 2016

## Blocked Ripple Carry

- Flatten Logic in blocks of "k" say 4
  - But not the naive version
  - Block of 4 requires 31-input OR gate
- Ripple carry from one block to the next
- Delay through N-bit addition
  - $\xrightarrow{(N/4) * 2 + 1 (\text{XOR})}$   $\leftarrow$  factor improvement
- Reduction by a constant factor
  - Still linear in number of inputs
  - Can do better: Logarithmic delay

ECE437, Spring 2016

No bits  $\times$  critical path + XOR delay  
block size path

## Carry Lookahead Adder

- Reformulate addition
  - Facilitates block computation
  - Facilitates hierarchical, parallel computation
  - Key concepts: Generate and Propagate
- An approach in-between our two extremes
  - Ripple carry
  - Flattened 2-level

ECE437, Spring 2016

## Carry look-ahead

- Motivation:
  - If we didn't know the value of carry-in, what could we do?
  - When would we always generate a carry?
    - $\bullet g_i = a_i \cdot b_i$
  - When would we propagate the carry?
    - $\bullet p_i = a_i \oplus b_i$
- Did we get rid of the ripple?

ECE437, Spring 2016

## CLA: Plumbing Analogy



ECE437, Spring 2016

## Carry-lookahead adder



ECE437, Spring 2016

## Carry-Lookahead Adder

- Waitaminute!
  - Nothing has changed
  - Fanin problems if you flatten!
    - Not really, Linear fanin, not exponential
  - Ripple problem if you don't!
- Enables divide-and-conquer
- Figure out Generate and Propagate for k-bits together
- Compute hierarchically

ECE437, Spring 2016

## 2-level 16-bit CLA



ECE437, Spring 2016

## Leaf Node



- Zero level  $g_i$ , and  $p_i$  generation
  - One gate delay
- Part of Full adder (only sum bit)
  - Two gate delays (ignoring inverters)

ECE437, Spring 2016

## Intermediate nodes

- We know how carry's are generated



## Block level signals



ECE437, Spring 2016

## CLA: Delay Analysis



- Height of tree =  $O(\log_4(n))$
- 16 bit addition :  $k \cdot \log_4(16) = k \cdot 2$

ECE437, Spring 2016

## CLA Logic



block generate & propagate

## Computing G's and Ps



- Parallel algorithm in hardware
  - g's and p's : 1
  - First level Gs and Ps :  $1 + 2 = 3$
  - Second level Gs and Ps :  $1 + 2 + 2 = 5$ 
    - (not needed for 16 bit adder example)

ECE437, Spring 2016

## Computing C's



- Different tree.
- Note propagation order
- Worth spending some time to think about the intricacies

ECE437, Spring 2016

## Delays

- $c1-c3 : 1 + 2$
- $c4 : 1 + 2 + 2$  \*\*
- $c5-c7 : 1 + 2 + 2 + 2 = 7$
- $c8 : 1 + 2 + 2$
- $c9-c11 : 1 + 2 + 2 + 2 = 7$
- $c12 : 1 + 2 + 2$
- $c13-c15 : 1 + 2 + 2 + 2 = 7$

\*\* Can be 1 + 2 if computed in first level CLA block.  
No overall improvement by doing this

ECE437, Spring 2016

## Carry-selection: Guess



ECE437, Spring 2016

costs hardware & energy.

o or 1.  
Pick from mux.

→ GATE DELAYS CALCULATION

# 2-level 16-bit CLA

Assume we have  
4 input gate  
that has some delay  
at 2 input gate.

For 32-bit,  
just add another level. [takes only extra 2 gate delays]



ECE437, Spring 2016

At high bit nos,  
CLA needs high no. of wires  
that slow it a bit.

## Shift

- E.g., Shift left logical for  $d<7:0>$  and  $\text{shamt}<2:0>$ 
  - Using 2-1 muxes called Mux(select, in0, in1)
  - $\text{stage}0<7:0> = \text{Mux}(\text{shamt}<0>, d<7:0>, d<6:0> \parallel 0)$
  - $\text{stage}1<7:0> = \text{Mux}(\text{shamt}<1>, \text{stage}0<7:0>, \text{stage}0<5:0> \parallel 00)$
  - $\text{dout}<7:0> = \text{Mux}(\text{shamt}<2>, \text{stage}1<7:0>, \text{stage}1<3:0> \parallel 0000)$
- Other operations
  - Right shift
  - Arithmetic shifts
  - Rotate

ECE437, Spring 2016

## Barrel Shifter



3 bits allows 0-7 shifts  
Logarithmic efficiency

32 bit no has 5 shift bits.

8 bit no has 3 shift bits  
 $\hookrightarrow \log_2 8$ .

## Extensions

- Design a barrel shifter unit that can do
  - Right shift
  - Left shift
- Design a shifter/rotator combo that can do
  - Right rotate
  - Left rotate
  - In addition to shifts

ECE437, Spring 2016

## Grade School Multiplication

- Multiplicand ( $1000_{10}$ ) and Multiplier ( $1001_{10}$ )
- Two approaches
  - Purely combinational
  - Sequential

ECE437, Spring 2016

## Adding more than two operands



- Add four numbers ( $A, B, E, F$ )
- "Traditional adder": ripple carry
- Basic block: Full adder
- $N-1$  adders for  $N$  numbers

ECE437, Spring 2016

## Carry-save adders



- Same basic block
  - Adds three numbers per stage (using  $Cin$  and  $S$ )
- Number of stages reduced to  $\sim 2/3N$  for  $N$ -numbers

ECE437, Spring 2016

Say we had 9 no.  
111 can do 3 trees.  
[Wallace tree]

## Purely combinational

- Uses carry-save adders (in a tree organization)
- Partial products
- AND multiplicand with multiplier bits

ECE437, Spring 2016

## Grade School Multiplication

- Sequential approach
- Reuse hardware
  - Partial products generated, accumulated into product each cycle

ECE437, Spring 2016

## Grade-School Version



- Naïve implementation
  - 2x 64-bit registers, 1x 32 bit register
  - 1x 64-bit ALU

$$326 \times 326 \Rightarrow 646$$

result

ECE437, Spring 2016

## Flow chart

- Repeated shifts and adds
  - E.g.  $13 \times 11 = 143$



ECE437, Spring 2016

shift has wasted zero  
An optimization is to merge them

## How does it work?



- Faithful reproduction of grade-school algorithm

ECE437, Spring 2016

## Version #1



- Additional zeroes (padding)

ECE437, Spring 2016

## Optimization #1



- Insight: LSB bits are frozen after each iteration
- Reduced 64-bit ALU to 32-bit ALU

ECE437, Spring 2016

## Flow chart

- Product register shifted right after LSB frozen
- 64-bit product computed with 32-bit adder



ECE437, Spring 2016

## Flow of computation



- Shift product right

ECE437, Spring 2016

## Optimization #2



- Insight: bits of multiplier are consumed
  - 1x 32-bit register
  - 1x 64-bit register
  - 32-bit ALU

ECE437, Spring 2016

## Flow chart

- Product register manipulates both multiplier and product



ECE437, Spring 2016

## Signed Multiplication

- $\{ \langle +,+ \rangle, \langle -,- \rangle \} : +$
- $\{ \langle +,- \rangle, \langle -,+ \rangle \} : -$
- If multiplier negative,
  - Multiplier = - Multiplier
  - Negative ++
- If multiplicand negative,
  - Multiplicand = -Multiplicand
  - Negative ++
- Product = Multiplicand \* Multiplier
- If (Negative) sign = '-' , else sign = '+'

ECE437, Spring 2016

## Booth's Algorithm

- Signed/Unsigned integer multiplication
  - Multiple bits at a time
    - We won't go into this topic
  - Like CLA for addition
    - sophisticated
  - Intuition : exploit run-lengths

ECE437, Spring 2016

## Booth's Algorithm Intuition

- How would you compute the decimal product:
  - $5345 * 999 ?$
  - $5345 * (1000 - 1)$
  - $5345000 - 5345 = 5339655$
- What about
  - $5345 * 9990$
  - $5345 * (10000 - 10)$
  - $53450000 - 53450 = 53396550$
- Now, what about
  - $5345 * 9900990$
  - $5345 * (1000000 - 100000 + 1000 - 10)$
  - $52915500000 + 5291550 = 52920791550$
- Works only when multiplier has only '9's in the decimal system

ECE437, Spring 2016

## Application to Binary

- "1 is the new 9"
  - $9 = 10-1$  (10 is base of decimal system)
  - $1 = 2-1$  (2 is the base of binary system)
- For example

  - $0010 * 0011$
  - $0010 * (0100 - 0001)$

ECE437, Spring 2016

## Run of '1's identification

end of run      middle of run      beginning of run  
 $0(1|1|1|1)0$

| Current bit | Bit to right | Explanation           | Example  | Action |
|-------------|--------------|-----------------------|----------|--------|
| 1           | 0            | Start of run of '1's  | 11001110 | sub    |
| 1           | 1            | Middle of run of '1's | 11001110 | nop    |
| 0           | 1            | End of run of '1's    | 11001110 | add    |
| 0           | 0            | Middle of run of '0's | 11001110 | nop    |

ECE437, Spring 2016

## Works for Signed numbers

- Consider 2's complement number 10110
    - Normal 2's complement view
      - $10110 = -16 + 0 + 4 + 2 + 0 = -10$
    - Booth encoding view
      - $10110 = -16 + 8 - 0 - 2 + 0 = -10$
      - $10110 :: \underline{11010}$  in Booth encoding
- 10110(0)
  - 10110(0)
  - 10110(0)
  - 10110(0)
  - 10110(0)

ECE437, Spring 2016

## Booth's Algorithm

Example:  $-4 * 6$

- 4 bit 2's complement 1100 \* 0110
- Extend multiplicand width
  - $-4 = 1111\ 1100$
- Booth encoding of Multiplier
  - $6 = 1010$  (corresponds to  $+8 - 2$ )

|           |   |           |
|-----------|---|-----------|
| 1111 1100 | 0 | 0000 0000 |
| 1111 1000 | 1 | 0000 1000 |
| 1111 0000 | 0 | 0000 0000 |
| 1110 0000 | 1 | 1110 0000 |
|           |   | 1110 1000 |

ECE437, Spring 2016

$$\begin{array}{r} -4 \times 6 \\ \hline 1111\ 1100 \\ 1111\ 1000 \text{ shift} \\ 1111\ 0000 \text{ shift} \\ 1110\ 0000 \text{ shift} \end{array}$$

| Partial prod |  |
|--------------|--|
| 0000 0000    |  |
| 0000 1000    |  |
| 0000 1000    |  |
| 1110 1000    |  |

add 1110 1000

$\xrightarrow{\text{L = -2}}$

[sub  $-4 \times 2$ ]

[add  $-4 \times 8$ ]

# In class

## The Mathematical Basis

- Recall the table
  - Multiplier =  $a_{j-1} - a_j$
- 3-bit numbers
  - A ( $a_2, a_1, a_0$ )
  - B ( $b_2, b_1, b_0$ )
- Compute  $B \times A$ 
  - $B * ((a_1 - a_2) \cdot 2^2 + (a_0 - a_1) \cdot 2^1 + (0 - a_0) \cdot 2^0)$
  - $B * (-a_2 \cdot (2^2) + a_1 \cdot (2^2 - 2^1) + a_0 \cdot (2^1 - 2^0))$
  - $B * (-a_2 \cdot (2^2) + a_1 \cdot (2^1) + a_0 \cdot (2^0))$
  - $B * A$

| Current bit | Bit to right | Multiplier |
|-------------|--------------|------------|
| 1           | 0            | -1         |
| 1           | 1            | 0          |
| 0           | 1            | +1         |
| 0           | 0            | 0          |

ECE437, Spring 2016

## Summary

- Integer multiplication:
  - Grade school version with minor optimizations
- Sophisticated optimizations
  - Booth's algorithms
    - More involved
    - Scope for handling multiple bits at a time
  - Remember CLA vs. ripple-carry (grade-school version)

ECE437, Spring 2016

## Recap

- Booth's Multiply Algorithm
  - Identify and exploit runs of '1's
  - Remember analogy of multiplication by 999
    - 0110 normal encoding ( $4+2=6$ )
    - 1010 Booth encoding ( $8-2=6$ )

ECE437, Spring 2016

## Outline

- Integer Division
  - Restoring/Non-restoring
- Floating Point Arithmetic

ECE437, Spring 2016

## Integer Division

- Remember progressive optimization of multiplication hardware
  - Similar Story
- Start with naïve hardware
  - Faithful implementation of grade-school method (long division)
  - Optimize

ECE437, Spring 2016

## Grade School Division

- Consider decimal division
  - Subtract largest possible multiple
  - Shift down one digit
  - "Lather, Rinse and repeat"
  - Binary
    - Exactly two choices: 0 or 1



$$\text{Dividend} = \text{Quotient} * \text{Divisor} + \text{Remainder}$$

$$|\text{Dividend}| = |\text{Quotient}| * |\text{Divisor}| + |\text{Remainder}|$$

ECE437, Spring 2016

## Version #1



- Dividend initially in Remainder reg
- 64-bit Divisor reg, 64-bit ALU, 64-bit Remainder reg, 32-bit Quotient reg

ECE437, Spring 2016

## Flow Chart

- Divisor shifts right
- Quotient bits shifted left



ECE437, Spring 2016

## Example

- $n+1$  steps for  $n$ -bit quotient
- $7 / 2 = (3,1)$
- Use 8(4) bit 2's complement representation
- Register size/ALU size optimizations possible

| Remainder | Divisor   | Quotient |
|-----------|-----------|----------|
| 0000 0111 | 0010 0000 | 0000     |
| 1110 0111 |           |          |
| 0000 0111 | 0001 0000 | 0000     |
| 1111 0111 |           |          |
| 0000 0111 | 0000 1000 | 0000     |
| 1111 1111 |           |          |
| 0000 0111 | 0000 0100 | 0000     |
| 0000 0011 | 0000 0010 | 0001     |
| 0000 0001 | 0000 0001 | 0011     |

ECE437, Spring 2016

## Recap

- Booth's Multiply Algorithm
  - Identify and exploit runs of '1's
  - Remember analogy of multiplication by 999
    - 0110 normal encoding ( $4+2=6$ )
    - 1010 Booth encoding ( $8-2=6$ )

ECE437, Spring 2016

## Outline

- Integer Division
  - Restoring/Non-restoring
- Floating Point Arithmetic

ECE437, Spring 2016

## Integer Division

- Remember progressive optimization of multiplication hardware
  - Similar Story
- Start with naïve hardware
  - Faithful implementation of grade-school method (long division)
  - Optimize

ECE437, Spring 2016

## Version #1



- Dividend initially in Remainder reg
- 64-bit Divisor reg, 64-bit ALU, 64-bit Remainder reg, 32-bit Quotient reg

ECE437, Spring 2016

## Grade School Division

- Consider decimal division
  - Subtract largest possible multiple
  - Shift down one digit
  - "Lather, Rinse and repeat"
- Binary
  - Exactly two choices: 0 or 1



$$\text{Dividend} = \text{Quotient} * \text{Divisor} + \text{Remainder}$$

$$|\text{Dividend}| = |\text{Quotient}| * |\text{Divisor}|$$

ECE437, Spring 2016

hardware doesn't know tables.  
but we have either 0 or 1. So we  
guess 1 and see if subtraction results  
in the no. If not, we go up a level  
& use 0.

## Flow Chart

- Divisor shifts right
- Quotient bits shifted left



ECE437, Spring 2016

## Example

- n+1 steps for n-bit quotient
- $7 / 2 = (3,1)$
- Use 8(4) bit 2's complement representation
- Register size/ALU size optimizations possible

| Remainder | Divisor   | Quotient |
|-----------|-----------|----------|
| 0000 0111 | 0010 0000 | 0000     |
| 1110 0111 | 0001 0000 | 0000     |
| 0000 0111 | 0000 1000 | 0000     |
| 1111 0111 | 0000 0100 | 0000     |
| 0000 0011 | 0000 0010 | 0001     |
| 0000 0001 | 0000 0001 | 0011     |



ECE437, Spring 2016

always do

remainder - Divisor.

If -ve, restore.

shift divisor right

shift Quotient left.

If +ve. Shift divisor right.

get 1 in quotient on left shift.

not  
in  
exam.

## Non-restoring division

### Restoring Division

- Subtract largest possible multiple
  - Try '1'
  - If too large (i.e. remainder negative), restore

### Alternative

- Non-restoring division
- Introduce  $\bar{1}$  (Like Booth's)

ECE437, Spring 2016

## Non-Restoring Division

- With +ve divisor
  - Subtract if positive remainder
  - Add if negative remainder
- Vice versa with -ve divisor

| Remainder   | Divisor   | Quotient |
|-------------|-----------|----------|
| - 0000 0111 | 0010 0000 | 00000    |
| + 1110 0111 | 0001 0000 | 00001    |
| + 1111 0111 | 0000 1000 | 00011    |
| + 1111 1111 | 0000 0100 | 00111    |
| - 0000 0011 | 0000 0010 | 01111    |
| 0000 0001   | 0000 0001 | 11111    |

$$\text{Quotient} = 0011 \Rightarrow 1 \times 2^1 + 1 \times 2^0$$

$$\text{Quotient} = 1111 \Rightarrow 1 \times 2^4 - 1 \times 2^3 - 1 \times 2^2 - 1 \times 2^1 + 1 \times 2^0$$

ECE437, Spring 2016

## Conversion to standard form

- Non-restoring division uses two unique symbols
  - Can use "0" for  $\bar{1}$
- Issues
  - Perfect division before all bits are computed
  - Re-conversion to standard form

ECE437, Spring 2016

## Outline

- Integer Division
  - Restoring and Non-restoring
- Floating Point Arithmetic

## Floating Point

- Integer arithmetic till now
- How to represent real numbers:
  - Uncountably infinite
  - Realistically, can operate with limited number of significant digits (precision)
  - Remember exponent separately
- Remember scientific norms
  - Planck's constant  $6.626068 \times 10^{-34} \text{ m}^2 \text{ kg/s}$
  - Avogadro's constant  $6.022 \times 10^{23}$
  - Normalization:
    - Not  $0.6022 \times 10^{24}$ , Not  $60.22 \times 10^{22}$
    - MSD in [1,9] range except for 0.0

ECE437, Spring 2016

mantissa  
↓  
 $a \cdot e^b$  ← exponent

## FP in hardware

- Binary instead of decimal
- MSB = 1 (equivalent to [1,9] in decimal)
  - $(-1)^s \times f \times 2^e$
  - s: Sign stored separately  $f$  is  $1.0110\ldots$
  - f: is of the form 1.F
  - e: is the (signed integer) exponent
- Optimizations:
  - Base not stored (implicit 2)
  - MSB not stored (always 1)

ECE437, Spring 2016

If all are zero, it has value 0. That +1 is forgotten then.

still need to have a way to represent zeroes.

This is stored.  
∴ we know that 1 will always be there.

## IEEE 754

float : 32 bits      E (8 bit)  
double : 64 bits      E (11 bit)

- Floating point representation standard
- Floating points stored as a packed triple
  - $\langle S, E, F \rangle$
  - S : 1 bit
    - 0 → positive
    - 1 → negative
  - E : Biased representation for signed numbers
    - 8 bit (single precision—float) (Bias = +127)
    - 11 bit (double precision—double) (Bias = +1023)
  - F : (remember the 1 in 1.F is not stored)
    - 23 bit (single precision—float)
    - 52 bit (double precision—double)

ECE437, Spring 2016

## Examples

(For balance).

- 3.125 in double precision
  - 11.001
  - $(-1)^0 \times 1.1001 \times 2^1$  → to shift by 1 for 11.
  - Exponent 1 with bias of 1023 = 1024
  - 0100 0000 0000 1001 0000 0000 ...
  - 0x4009 0000 0000 0000
- 0.6 in single precision
  - 0.100110011001...
  - $(-1)^0 \times 1.00110011001 \dots \times 2^{-1}$
  - Exponent -1 with bias of 127 = 126
  - 0011 1111 0001 1001 1001...
  - 0x3F19 9999

ECE437, Spring 2016

suggests recurring value.

## Binary Fractions Recap

- How do you represent  $3.125_{(10)}$  in binary?

$$- 11.001_{(2)}$$

- For integer conversion to binary
  - Repeated division by 2 till quotient goes to zero

- For fraction conversion to binary
  - Repeated multiplication by 2 till fraction goes to zero

- Recurring binary fraction

$$- 0.6_{(10)} = 0.1001 1001 1001 \dots_{(2)}$$

$$\begin{array}{r} 0.125 \\ \times 2 \\ \hline 0.25 \\ \times 2 \\ \hline 0.5 \\ \times 2 \\ \hline 1.0 \end{array}$$

001

make 0.6 to binary.

Do by multp.

$$0.6 \times 2 = 1.2$$

$$0.2 \times 2 = 0.4$$

$$0.4 \times 2 = 0.8$$

$$0.8 \times 2 = 1.6$$

$$0.6 \times 2 = 1.2$$

ECE437, Spring 2016

11.1001...

$$0.4 \times 2 = 0.8$$

$$0.8 \times 2 = 1.6$$

$$0.6 \times 2 = :$$

## Exceptions

- Specified in great detail in the document IEEE 754-1985

- 20 pages

SC E 100 not covered.

| S | E   | F        | Number |
|---|-----|----------|--------|
| 0 | 0   | 0        | 0      |
| 0 | Max | 0        | +inf   |
| 1 | Max | 0        | -inf   |
| X | Max | $\neq 0$ | NaN    |
| X | 0   | $\neq 0$ | Denorm |

ECE437, Spring 2016

denormalized.  
No. too small to represent.

a no.  
=  $x/0$   
=  $\log(-x)$

## FP Representation Worksheet

- Consider two floating point numbers (represented in single precision, IEEE 754 format). A = 0xBFC00000 and B = 0x41600000. If C = A × B, what is the hexadecimal representation of C in IEEE 754 format.

$$A = \underbrace{1011}_{127} \underbrace{1111}_{1024} \underbrace{1100}_{1024} = (-1) \times 1.5 \times 2^{127-127} = -1.5.$$

$$B = 0 \times 416 = \underbrace{0100}_{127} \underbrace{0001}_{1024} \underbrace{0110}_{1024} = (+1) \times 2^3 \times 1.75 = 14$$

$$C = -21 = -10101.000 = -1.0101 \times 2^4$$

ECE437, Spring 2016

S = 1

$$f = 0101_2$$

$$e = 131_d = 10000011_2$$

$$= \underbrace{11000001}_{f} \underbrace{10101}_{e} \underbrace{0000}_{f}$$

$$C = 0x C1A8 000...$$

## Solution

- float : 32 bits      E (8 bit)
- $A = 0xBFC00000$   
 $= 1011\ 1111\ 1100\ 0000\ 0000\ 0000\ 0000\ 0000$   
 $= -1 * 1.5 * 2^0$
  - $B = 0x41600000$   
 $= 0100\ 0001\ 0110\ 0000\ 0000\ 0000\ 0000\ 0000$   
 $= +1 * 1.75 * 2^3$
  - $C = -21 = -1 * 1.0101 \text{ (binary)} * 2^4$   
 $= 1100\ 0001\ 1010\ 1000\ 0000\ 0000\ 0000\ 0000$   
 $= 0xC1A80000$

ECE437, Spring 2016

## Biased Exponent

### • Why?

- 2's complement arithmetic was elegant
- Larger numbers appear to have larger magnitude with biased representation
  - Negative numbers appear large in 2's complement

- Comparing two FP numbers is like comparing two integers with biased exponent

↳ comparing is common.  
Bias helps do that.

2's complement meant that the magnitude must be extracted. This is bad.

## Floating Point Addition

- FP addition
  - Align decimal point  $\xrightarrow{9.997 \times 10^2}$
  - Add  $\xrightarrow{+ 4.6371 \times 10^1}$
  - Normalize  $\xrightarrow{1.00016371 \times 10^3}$
- May involve rounding
  - The LSB may be shifted right beyond the limited precision

ECE437, Spring 2016

## FP Addition



ECE437, Spring 2016

## Floating Point Addition

- FP addition
  - Align decimal point
  - Add
  - Normalize
- May involve rounding
  - The LSB may be shifted right beyond the limited precision

ECE437, Spring 2016

$$\begin{array}{r} 9.997 \times 10^2 \\ + 4.6371 \times 10^1 \\ \hline \end{array}$$

$$\begin{array}{r} 9.997 \quad \times 10^2 \\ + 0.0046371 \times 10^2 \\ \hline 10.0016371 \times 10^2 \\ 1.00016371 \times 10^3 \end{array}$$

## FP Addition



ECE437, Spring 2016

### FP Addition Hardware



ECE437, Spring 2016

## FP Subtraction

- At most one shift for normalization after add
  - Applies when two operands are of same sign
  - Also the case when subtracting two operands of different signs
- Subtraction
  - Subtract two numbers of similar sign
  - Add two numbers of dissimilar sign

ECE437, Spring 2016

## Example

- $1.00010 \times 10^{13} - 1.00001 \times 10^{13}$
- $0.00001 \times 10^{13}$ 
  - Not normalized
- $1.0 \times 10^8$ 
  - Normalize involves multiple digit shifts

ECE437, Spring 2016

## FP multiplication

- Example
  - $A = 3.0 \times 10^1$
  - $B = 5.0 \times 10^2$
  - $A \times B = (3.0 \times 5.0) \times 10^{(1+2)} = 15.0 \times 10^3$
  - $A \times B = 1.5 \times 10^4$
- Multiply mantissas
- Add exponents (no overflow/underflow)
- Normalize (and round)
- Set sign (easy, exor sign bits)

ECE437, Spring 2016

## Adding exponents

- Biased representation
  - Adding biased numbers.
  - Raw number

$$\begin{aligned}
 & \text{Row number} && 1000\ 1011 \\
 & e_1 = 12, e_2 = 13; e_0 = (e_1 + e_2) = 25 && 1000\ 1100 \\
 & E_1 = 12 + 127 = 139, E_2 = 13 + 127 = 140 && \underline{1000\ 0001} \\
 & E_0 = e_0 + 127 = E_1 - 127 + E_2 - 127 + 127 && 1001\ 1000 \\
 & E_0 = E_1 + E_2 - 127 && \\
 & -127 = -(01111111) = (1000\ 0000) + 1 && \text{re-61ae.}
 \end{aligned}$$

100001-00000

## FP Multiplication



ECE437, Spring 2010

## FP Multiplication

- Mantissa multiplication
    - Two non-negative integers
    - Remember combinational approach
      - Carry save adders in tree (Wallace tree)

113401, 2009-07-05

## FP division

- Exponent computation
    - $E = E_1 - E_2 + 127$  → re-bias ←  
their bias cancel each other
  - Raw number
    - $e_1 = 12, e_2 = 13; e = (e_1 - e_2) = -1$
    - $E_1 = 12 + 127 = 139, E_2 = 13 + 127 = 140$
    - $E = e_1 \cdot 127 = (E_1 - 127) - (E_2 - 127) + 127$
    - $E_1 = E_1 - (E_2 - 127)$
    - $-E_2 = 1's C(E_2) + 1$
    - $127 = (01111111)$
    - $E = E_1 + 1's C(E_2) + 1000\ 0000$

ECE437, Spring 2016

$$\bar{E}_2 + 1$$

is  $-E_2$   
in 2<sup>nd</sup> comp.

## Rounding

- Decimal conventions
    - [5+9] → up
    - 5 → round to even ( $4.5 \sim 4$ ,  $5.5 \sim 6$ )
    - [15.] → down
    - 0 → unchanged
  - Binary rounding
    - xxxx.1... → up
    - xxxx.10000 → round to even
    - xxxx.0xx1x → down
    - xxxx.00000 → unchanged

STANLEY, February 1884.

## Guard, Round and Sticky bits



ECE437, Spring 2016

# Rounding

- IEEE 754
  - Error bounds
  - No more than  $\frac{1}{2}$  "units of the last place" (ULP)
  - Errors accumulate
    - Billions of FP ops in a single application
- Effects of limited precision
  - Commutativity, associativity etc. may no longer apply
  - $A = (3.1415 + 6.022 \times 10^{23}) - 6.022 \times 10^{23}$
  - $B = 3.1415 + (6.022 \times 10^{23} - 6.022 \times 10^{23})$
  - Is  $A == B$ ?

No.

ECE437, Spring 2016



In A, π must align to give Avogadro No.

On aligning, π pretty much becomes 0.

We have  $6.022e^{23} - 6.022e^{23} = 0$ .

In B, we get π as expected.

Order is imp.

This will happen :: of low precision.

## Virtual Memory



- Data movement between Disk and Main memory
- We know how layers of hierarchy interact
  - Cache and main memory
- Can we apply all the same techniques?
  - Similarities and differences

ECE437, Spring 2016

(95)

## VM : The real reason

- Application's view of memory
  - Large: > 4GB (32b address)
  - Exclusive: Only program in memory
- System view
  - Smaller: 512MB-2GB
  - Multiple programs share memory
  - Run in a protected manner (memory is private \*\*)
  - Address range is fixed (starts at 0x0)
  - Do not bring in entire program
    - Bring in relevant parts as needed
- VM reconciles these conflicting "views"
  - Not just the classical benefits of hierarchy
  - Illusion of
    - speed of expensive level
    - capacity and cost of cheaper level

ECE437, Spring 2016

(96)

## Virtual Memory

- Similarities:
  - Mapping a larger address space to a smaller space
  - Used for data movement between layers of the memory hierarchy
- Differences:
  - Miss-penalty : 10-100 cycles vs. ~1 million cycle
  - Block size : 16-64 bytes vs. 4 KB - 8 KB
  - Full associativity (But no associative search!)
  - Software handling
    - cache
    - disk
    - spatial locality
    - to reduce miss rate
    - no conflict misses
    - cannot do old type of search
    - if too many comparisons.

ECE437, Spring 2016

(97)

## VM Operation



- Programs use virtual address
- The data/code resides elsewhere (physical address)

ECE437, Spring 2016

(98)

## VM Operation

31 30 29 28 27 ..... 15 14 13 12 11 10 9 8 ..... 3 2 1 0

Virtual page number | Page offset

Translation

29 28 27 ..... 15 14 13 12 11 10 3 8 ..... 3 2 1 0

Physical page number | Page offset

Physical address

- Assume 4KB pages
- 32-bit VA and 30-bit PA
- System responsible for translation
- E.g.
  - `lw $r2, 0xfffffc004`
  - `0xfffffc → 0x20000 # VPN → PPN translation`
  - `PA = 0x20000004`

ECE437, Spring 2016

(99)

## VM Terminology

- Blocksize ~ 512B - 8KB+
  - Block : Page
- **Miss : Page-fault**
  - Fetch from disk
- Derivative properties:
  - Fundamental constraint: high access latency

naming convention.

ECE437, Spring 2016

(100)

## Back to 4 questions

### Q1. Where does a block go?

- Fully associative
- Block offset and tag
- NO INDEX
- Why?
  - AMAT = hit time + miss-rate \* miss-penalty
  - Miss-penalty = ~1 million cycles
  - Have to minimize miss-rate

ECE437, Spring 2016

(101)

## Q2. Block Identification

### Q2. Block identification

- Fully associative search??
- 30-bit physical address (16B)
- 4 KB pages
- Number of frames =  $2^{30}/2^{12} = 2^{18}$
- 256 K frames
- Compare 256 K frames in parallel??!!
- Reframe question:
  - Old: Is this VA in any given frame? => Parallel search
  - New: Where is this VA? => Table lookup
    - Page table

ECE437, Spring 2016

(102)

## Exercise: Virtual Memory Design

- Page size = 8KB
- Virtual Memory
  - $32b = 2^{32} = 4GB$
- Physical memory = 512KB
- What is the size of the per-process page table?
  - lookup table / phone book.
- What happens on a context switch?

ECE437, Spring 2016

(103)

## Page Table



ECE437, Spring 2016

(104)

## Page table

- Where does table reside?
  - Main memory
  - 100% overhead? → 1 to lookup addr. 1 to get value.
  - Each memory reference now generates two memory references
  - `lw $r2,0xffff0004`
    - Access page table entry for 0xffff0, get PPN
    - Access PPN:004
- We want to minimize main memory access
  - Page table entries can be cached like ordinary data
  - Tricky \*\*\*!!
  - Special cache for Page table entries

ECE437, Spring 2016

(105)

pid & lookup addr.  
in it.

## Page Table Entries

- What does a page table entry contain
  - Physical page number (18 bits)
  - Access control (read/write permissions)
  - Valid bit (1 bit)
  - Misc
    - Use bit for replacement
    - Dirty bit for write-back
  - ~ 4 bytes (1 word)

ECE437, Spring 2016

(106)

## Size of Page Table

- What is the size of the page table for a system with
  - 32 bit addresses
  - 1 GB physical memory
  - 4KB pagesize
- Virtual pages =  $2^{32}/2^{12} = 2^{20}$
- Physical pages =  $2^{30}/2^{12} = 2^{18}$
- PT size (per process) = (Entry size) \* (# entries)  
 $= 4 \text{ bytes} * 2^{20} = 2^{22} \text{ bytes} = 4 \text{ MB}$
- # of processes? ~60 on my WinXP Pro machine
  - 240 MB for page tables?
  - "Big government": 25% consumed for administration!
  - Techniques to reduce overhead
    - Gradual growing
    - Inverted PT: entries per physical page

ECE437, Spring 2016

(107)

## Replacement

- Q3. Block replacement
  - LRU and/or LRU approximation (LRU with reference/use bits)
  - Sophisticated mechanisms possible (handling in software)
- Page-fault : Exception
  - Save instruction that causes fault
  - OS service the fault, i.e. brings in the relevant page from disk (VA  $\rightarrow$  disk address??)
  - OS knows service is slow; schedule other program
  - When disk access is complete
  - Restart at offending instruction

ECE437, Spring 2016

(108)

## Write handling

- Q4. What happens on a write
  - Write-through or write-back?

ECE437, Spring 2016

(109)



Page-fault throws exception.  
Then software exception handles it.  
↳ Here the disk work is going on.  
Since it takes ~1 mil cycles, the  
OS gets someone else in  
as you are taking time.  
↳ When ready, you come back  
& retry the same instruction.

## Page table

- Where does table reside?
  - Main memory
  - 100% overhead?
    - Each memory reference now generates two memory references
    - lw \$r2,0xffff0004
      - Access page table entry for 0xffff0, get PPN
      - Access PPN:004
- We want to minimize main memory access!
  - Page table entries can be cached like ordinary data
  - Tricky \*\*\*!!
  - Special cache for Page table entries

ECE437, Spring 2016

(105)

## Page Table Entries

- What does a page table entry contain
    - Physical page number (18 bits)
    - Access control (read/write permissions)
    - Valid bit (1 bit)
    - Misc
      - Use bit for replacement ← LRU/NRU types.
      - Dirty bit for write-back
    - ~ 4 bytes (1 word)
- To determine usage timing for replacement.

ECE437, Spring 2016

(106)

## Size of Page Table

- What is the size of the page table for a system with
  - 32 bit addresses ✓
  - 1 GB physical memory ✓  $2^{30}$
  - 4KB pagesize ✓  $2^{12}$
- $\rightarrow$  Virtual pages =  $2^{32}/2^{12} = 2^{20}$
- Physical pages =  $2^{30}/2^{12} = 2^{18}$
- PT size (per process) = (Entry size) \* (# entries)
 
$$= 4 \text{ bytes} * 2^{20} = 2^{22} \text{ bytes} = 4 \text{ MB}$$
- # of processes? ~60 on my WinXP Pro machine
  - 240 MB for page tables?
  - "Big government": 25% consumed for administration!!
  - Techniques to reduce overhead
    - Gradual growing
    - Inverted PT: entries per physical page

ECE437, Spring 2016

(107)

## Replacement

- Q3. Block replacement
  - LRU and/or LRU approximation (NRU with reference/use bits)
  - Sophisticated mechanisms possible (handling in software)
- Page-fault : Exception
  - Save instruction that causes fault
  - OS service the fault, i.e., brings in the relevant page from disk (VA → disk address???)
  - OS knows service is slow; schedule other program
  - When disk access is complete
  - Restart at offending instruction

ECE437, Spring 2016

(108)

## Write handling

- Q4. What happens on a write
  - Write-through or write-back?



If page removed, & physical address used in cache, hard to determine what address to invalidate [need to hence use physical addr in cache].

## Memory Access Critical path

- Why use physical addresses to access caches?
    - Use virtual addresses
    - Faster: no translation
    - Block may have a different (still unique) tag and index
      - Who cares where the block resides in the cache?
  - Synonyms
    - Two virtual pages map to same physical page
    - Should not be replicated in cache
- Want to use physical addrs in the cache.

ECE437, Spring 2016

(111)

## Faster Translation

- 100% overhead with VM system
- Eliminate memory access for translation
  - Caching
    - Translation lookaside buffer (TLB)
    - Also DTB in some literature
    - A cache of translations
  - 64-128 entries
    - Covers 256 KB - 512 KB
  - Organization
    - 64 entry fully associative
    - 256 entry, 16-way set associative
    - May be multi-level

ECE437, Spring 2016

(112)

TLB

Cache only for

## Translation Lookaside Buffer



ECE437, Spring 2016

(112)

virtual addr.

physical addr.  
go to cache  
get data now

## TLB+Cache

- Cache operation
  - With physical addresses
  - Translation on critical path

ECE437, Spring 2016

(113)

there is iTLB & dTLB to avoid structural hazard.

## Pipeline



ECE437, Spring 2016

(114)

## Putting it all together



### Memory access flowchart

See Fig 7.25 for full flow chart

ECE437, Spring 2016

(115)



Get to  
Normal  
Cache

## Full Hierarchy design

- 32b VA, 30 bit PA (i.e. 1GB Phys. Mem)
  - TLB
    - 256 entry, 4-way associative
  - Cache
    - 32KB, direct mapped, 32B blocks

ECE437, Spring 2016

(117)

## Memory Access Critical Path

### Twist in the tale

- Virtual-index
- Physical tags
- Indexing and translation proceeds in parallel
- Tag comparison after translation



ECE437, Spring 2016

(118)

for speeding  
TLB with  
cache

## Summary

- 4Q on VM
  - Placement : fully associative
  - Identification : Page Table lookup
  - Replacement : LRU / LRU-approx
  - Writes : Writeback
- 4Q on TLB
  - Placement: small and fully associative, larger and set-associative
  - Identification: Associative search (CAM)
  - Replacement : random\*\*
  - Writes : ?? Writes to TLB??

ECE437, Spring 2016

(119)

## VM Miscellanea

- TLB: cache of VA-> translations
- On a context switch:
  - Change contents of PTBR for appropriate page table
  - What do we do with TLB contents?
    - Flush all entries
      - Simple, but inefficient
    - Associate Process ID with address
      - Flush required only when process IDs are reused

ECE437, Spring 2016

(120)

## VM Miscellanea

- Memory efficiency of page tables
  - Limit register
    - Only region between PTBR and Limit is valid
    - Grow as needed
  - Segmented
    - Two page tables and two limit registers (Stack and Heap)
  - Inverted Page table
    - Hashing to map VA to a number within PA range
      - Hash 32-bit VA to 28 bit PA
      - Lookup complications
      - Collision
  - Multilevel page tables
  - Paging page tables

ECE437, Spring 2016

(121)

## Real Stuff

| Characteristic | Nehalem                                                                    | Athlon 64x2                                                                                                           |
|----------------|----------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------|
| VA             | 64 bit                                                                     | 48 bit                                                                                                                |
| PA             | 40 bit                                                                     | 40 bit                                                                                                                |
| Page size      | 4KB, 2/4MB                                                                 | 4KB, 2/4 MB                                                                                                           |
| TLB            | Split I&D<br>4-way assoc<br>Pseudo-LRU<br>I-128+7, D-64+32<br>TLB miss H/W | Split, two level I&D<br>L1 fully assoc (24I+40D)<br>L2 4-way assoc (256I+256D)<br>LRU/round-robin-LRU<br>TLB miss H/W |

ECE437, Spring 2016

(122)

## Real Stuff

| Characteristic | Nehalem          | Athlon 64x2 |
|----------------|------------------|-------------|
| L1 Cache       | Split I & D      | Split I & D |
| Size           | 32K + 32K        | 64K + 64K   |
| Associativity  | I 4-way; D 8-way | 2-way       |
| Hit time       | 3 cycles         | 3 cycles    |
| Replacement    | Approx LRU       | LRU         |
| Block          | 64 bytes         | 64 bytes    |
| Write          | Write-back       | Write-back  |

- Pentium D had write-through L1

ECE437, Spring 2016

(123)

## Real Stuff

| Characteristic | Nehalem                 | Athlon 64x2           |
|----------------|-------------------------|-----------------------|
| L2 Cache       | Unified                 | Split I & D           |
| Size           | 256KB (Private)         | 1MB + 1MB             |
| Associativity  | 8-way                   | 16-way                |
| Replacement    | Approx LRU              | Approx LRU            |
| Block          | 64 bytes                | 64 bytes              |
| Write          | Write-back              | Write-back            |
| L3 cache       | Unified (shared)<br>8MB | None (Phenom has it.) |

ECE437, Spring 2016

(124)

e.g.

Say Process B & Process Y have 8 bits of virtual address  
Physical memory has 4 bits ~~of~~ of address.

| P0 asks for addr 0x10.

This value comes into RAM at say address 0x4.

In P0's page table, tag 0x10 has value of 0x4.

| Now P1 asks for addr 0x08.

This value goes to 0xc in RAM.

P1's page table has 0xc at location 0x08.

| P0 asks for addr 0x00.

RAM gets a page at addr 0x~~08~~8.

P0's page table has 0x~~08~~8 at location 0x00.

| P1 asks for addr 0x14.

RAM gets a page at addr 0x0.

P1's page table has 0x0 at location 0x14.

| P0 asks for 0x10

Physical addr of 0x4 given.

Data read & life goes on.

| P1 asks for 0x04.

RAM is full. PAGE FAULT.

OS decides to get page at 0x4.

→ Old page that was evicted is written to hard disk if it was modified.

The P0's page table has valid bit invalidated ~~for~~ for RAM 0x4.

New page comes in.

P1's page table has 0x4 at location 0x04.

Page size comes into RAM (e.g. 4kb data is read).

### Super pages

programmer can define larger page sizes.

Cache has physical addr. So; it cannot be reached since lookup won't get location.

When doing this, the 5 cache with that address has to be invalidated. cannot make shared; lookup won't make have that

dirty bit maintained

Continuing but now with TLB

P1 asks v.addr 0x08.

misses in TLB, get



Now physical addr 0xc is read from page table.

P0 asks v.addr 0x0C.

misses in TLB. Looks in page table [not there].

Say 0x0 [value of 0x14 of P1 is removes]. The Page table of P1 must be cleared.  
P<sub>p</sub> addr. Caches too if need be.

Now 0x0C's value got into memory.

P0's page table gets 0x0 at ~~at~~ 0x0C.

And so, life goes on...

Can avoid tag by clearing whenever processor changes.

permission bits.

TLB also has cache like hierarchy goes down into L1 TLB (i & d).

Apr 18.

PA is constrained by  
TLB.

## ECE437: Introduction to Digital Computer Design

### Chapter 6

Spring 2016

## Inverted Page Table

Use hash table to translate physical addrs to virtual addrs.  
 Collision = open-addrs forwarding  
 Now size matches PA (saved space)  
 Big improvement. But, to allow good hashing performance, space is 2x4x PA size.

### Motivation

- I/O needed for
  - To/from users
  - To/from computers
  - To/from non-volatile storage media
- I/O Performance matters

$$\boxed{\text{Total time} = \text{CPU} + \text{I/O} - \text{overlap}}$$

- $10 + 4 - 4 = 10; 1x$
- $5 + 4 - [0,4] = [9,5]; [1.1x, 2x]$
- $1 + 4 - [0,1] = [5,4]; [2x, 2.5x]$

Slow improving

ECE437, Spring 2016

at best overlap all  
 at worst, see no overlap.

in 64 bit architecture  
 this is a good idea.

## I/O Performance

- What is performance?
  - With CPU: Iron Law
  - With I/O
    - Applications have different I/O needs
- Supercomputers read and write 1G of data
  - High data throughput
- Transactions processing does many independent, small I/O ops.
  - Fast I/O throughput (# of I/Os per sec)
- File systems
  - Fast response time, locality

ECE437, Spring 2016

(3)

## I/O Device Characteristics

See Figure 6.2

| Device           | I or O? | Partner | Data Rate Mb/s |
|------------------|---------|---------|----------------|
| Mouse            | I       | Human   | 0.0038         |
| Graphics Display | O       | Human   | 800-8000       |
| Modem            | I/O     | Machine | 0.016-0.064    |
| LAN              | I/O     | Machine | 100-1000       |
| Tape             | Storage | Machine | 32             |
| Disk             | Storage | Machine | 240-2560       |

ECE437, Spring 2016

(4)

## Disk Trends

- Disk Trends
  - \$/MB decreasing
    - 2010: \$125/2TB = 6.3c/GB
    - 2008: \$149/1.5TB = 10c/GB
    - 2006: \$92/250 GB = 36c/GB
    - 2004: 50c/GB (was 10000c/GB in 1997)
    - OEM prices lower
  - Disk diameter 14" → 1.8" → 1" \*\*
  - Seek time down ~5-10ms
  - Rotation speed unchanged
    - ~7200 rpm for consumer
    - ~15K rpm for high end disks
  - Xfer rates up
  - Move from parallel buses to Serial bus (Serial ATA)

ECE437, Spring 2016

(5)

## Magnetic Disks



ECE437, Spring 2016

pretty constant  
 can improve



- (1)
- ### Disk Access Time
- Disks R' us™ has a GenX disk with the following parameters
    - Average seek time = 10ms
    - Rotation speed = 5400 rpm
    - Xfer rate = 160MB/s
    - Disk-block size = 4KB
    - Controller overhead = 3ms
  - They introduce a GenY disk technology with rotation speed of 7200 rpm which they advertise as being 50% faster
  - If average queue time is 0 ms, what is the true speedup of GenY over GenX when reading a randomly chosen disk-block.
- ECE437, Spring 2016



- (2)
- ### Exercise
- A disk has the following parameters
    - Access size : 8KB
    - Bandwidth: 80MB/s
    - Controller overhead : 0.1 ms
    - Seek time : 6ms
    - Rotation speed: 7200 rpm
  - What is the average access time per transfer?
  - If the disk controller has a 8MB cache that results in a 20% hit rate, what is the average access time? (Hits eliminate seek time and rotational latency)
- ECE437, Spring 2016

Try.



- (3)
- ### Redundant Array of Inexpensive Disks
- MTTF for 1 disk.
- What if we want to store 100 disks
  - MTTF:  $5 \text{ yr} / 100 = 18 \text{ days}$
  - RAID 0 = No redundancy (Striping)
  - RAID 1 = mirror = full data redundancy = 100% overhead
  - RAID 3 = bit interleaved parity = small overhead
    - Dedicated parity disk
  - RAID 4 = block interleaved parity = small overhead + small writes
  - RAID 5 = distributed block wise parity = small overhead + small writes
  - Other raid levels in book
- ECE437, Spring 2016

[Redundant array of inexpensive disks]

Apr 20-

2

## Research impact

- top prof / conference.
- SOP now what interested  
Not what done in UG.

### Redundant Array of Inexpensive Disks

- What if we want to store 100 disks
- MTTF :  $5 \text{ yr}/100 = 18 \text{ days}$ 
  - RAID 0 = No redundancy (Striping)
  - RAID 1 = mirror = full data redundancy = 100% overhead
  - RAID 3 = bit interleaved parity = small overhead
    - Dedicated parity disk
  - RAID 4 = block interleaved parity = small overhead + small writes
  - RAID 5 = distributed block wise parity = small overhead + small writes
- Other raid levels in book

full copy of data  
from 4 disks  
this raid holds  
parity bits.  
it is written

ECE437, Spring 2016

(13)

Improvement  
to raid 5.

### Recap

- Disks
  - Components of latency
  - Organization
  - RAID
  - Availability/Dependability/Reliability
- Buses
  - Types
    - One-bus/two-bus/three-bus architectures
  - Transaction timing
    - Synchronous vs. asynchronous

ECE437, Spring 2016

(14)

### Connecting I/O to the CPU

- Many I/O devices with vastly different datarates
- I/O device economics
  - Need standards
    - Large number of peripheral device producers
    - Multiple business entities involved favors longer lasting standards
- Let's examine several connection strategies

ECE437, Spring 2016

(14)

### Advantages of Buses



- Versatility:
  - New devices can be added easily
  - Peripherals can be moved between computer systems that use the same bus standard
- Low Cost:
  - A single set of wires is shared in multiple ways
  - Manage complexity by partitioning the design

ECE437, Spring 2016

(15)

### Disadvantages of Buses



- It creates a communication bottleneck
  - The bandwidth of that bus can limit the maximum I/O throughput
- The maximum bus speed is largely limited by:
  - The length of the bus
  - The number of devices on the bus
  - The need to support a range of devices with:
    - Widely varying latencies
    - Widely varying data transfer rates

ECE437, Spring 2016

(16)

### Types of buses

- Processor-Memory Bus (design specific/proprietary)
  - Short and high speed
  - Only need to match the memory system
    - Maximize memory-to-processor bandwidth
  - Connects directly to the processor
  - Optimized for cache block transfers
- I/O Bus (industry standard)
  - Usually is lengthy and slower
  - Need to match a wide range of I/O devices
  - Connects to the processor-memory bus or backplane bus
- Backplane Bus (standard or proprietary)
  - Backplane: an interconnection structure within the chassis
  - Allow processors, memory, and I/O devices to coexist
  - Cost advantage: one bus for all components

ECE437, Spring 2016

(17)



### One Bus (Backplane) Architecture



- A single bus (the backplane bus) is used for:
  - Processor to memory communication
  - Communication between I/O devices and memory
- Advantages: Simple and low cost
- Disadvantages: slow and the bus can become a major bottleneck
- Example: IBM PC - AT

ECE437, Spring 2016

(1B)

### Two Bus System



- I/O buses tap into the processor-memory bus via bus adaptors:
  - Processor-memory bus: mainly for processor-memory traffic
  - I/O buses: provide expansion slots for I/O devices
- Apple Macintosh-II
  - NuBus: Processor, memory, and a few selected I/O devices
  - SCCI Bus: the rest of the I/O devices

ECE437, Spring 2016

(1C)

*Northbridge  
High speed bus.*

### Three Bus System



- Now fewer things hanging off the bus -*
- A small number of backplane buses tap into the processor-memory bus
    - Processor-memory bus is used for processor memory traffic
    - I/O buses are connected to the backplane bus
  - Advantage: loading on the processor bus is greatly reduced

ECE437, Spring 2016

(1D)

*AGP: accelerated graphics port.*

### Real Stuff: 925Xe

- Northbridge/Southbridge
- NB for high bandwidth
- SB for lower bandwidth
- PCI-e uses serial links
- Ganged links for higher bandwidth



ECE437, Spring 2016

(1E)

*Northbridge  
High speed bus.  
Southbridge  
Low speed bus.*

### Bus Definition



### Synchronous vs. Asynchronous

- Synchronous Bus:**
  - Includes a clock in the control lines
  - A fixed protocol for communication that is relative to the clock
  - Advantage: Involves very little logic and can run very fast
  - Disadvantages:
    - Every device on the bus must run at the same clock rate
    - To avoid clock skew, they cannot be long if they are fast
- Asynchronous Bus:**
  - It is not clocked
  - It can accommodate a wide range of devices
  - It can be lengthened without worrying about clock skew
  - It requires a handshaking protocol

ECE437, Spring 2016

(1F)

Apr. 22:



→ Need a lot of requests/ready flags & acknowledgements to the same.



## I/O Interfacing

- I/O operation needs to be initiated
  - Special opcodes → used to be done

→ Memory-mapped I/O → this is standard today.
- I/O completion must be known
  - Polling
  - Interrupt-driven

ECE437, Spring 2016

(28)

## I/O Device Notifying the OS

- The OS needs to know when:
  - The I/O device has completed an operation
  - The I/O operation has encountered an error
- This can be accomplished in two different ways:
  - Polling:
    - The I/O device puts information in a status register
    - The OS periodically checks the status register
  - I/O Interrupt:
    - Whenever an I/O device needs attention from the processor, it interrupts the processor from what it is currently doing.

ECE437, Spring 2016

(29)

## I/O commands

- Memory-mapped I/O
  - Special addresses not for memory (\$r9 contains special address for IOC)
  - Commands written (sw) as data (\$r4)
- I/O commands
  - Special opcodes
  - Send over I/O Bus

sw \$r4, 0(\$r9)



ECE437, Spring 2016

(29)

## Polling: Programmed I/O



ECE437, Spring 2016

(30)

## Interrupt Driven Data Transfer



ECE437, Spring 2016

(31)

## I/O Interrupts

- An I/O interrupt is just like the exceptions except:
  - An I/O interrupt is asynchronous
  - Further information needs to be conveyed
- An I/O interrupt is asynchronous with respect to instruction execution:
  - I/O interrupt is not associated with any instruction
  - I/O interrupt does not prevent any instruction from completion
    - You can pick your own convenient point to take an interrupt
- I/O interrupt is more complicated than exception:
  - Needs to convey the identity of the device generating the interrupt
  - Interrupt requests can have different urgencies:
    - Interrupt request needs to be prioritized

ECE437, Spring 2016

(32)



but this is virtual addr. Need to go to TLB [or interrupt CPU for TLB lookup] or read from page table to get physical addr.



if it was physical addr, had to get complete blocks. Also, a page might be moved out from mem. To prevent this, the page table value must be "pinned" [unremovable].



CPU sets instr. in memory & tells IOP to go to run the IOP processor at that instr. set.

The IOP works & tells CPU that it is done.



usually does big writes. Arbitrator can allow bulk read/write permission.



## OS Responsibilities

- Provide protection to shared I/O resources
  - Guarantees that a user's program can only access the portions of an I/O device to which the user has rights
- Provides abstraction for accessing devices:
  - Supply routines that handle low-level device operation
- Handles the interrupts generated by I/O devices
- Provide equitable access to the shared I/O resources
  - All user programs must have equal access to the I/O resources
- Schedule accesses in order to enhance system throughput

ECE437, Spring 2016

(140)

## Life cycle of I/O device

### Interfacing: Summary

- Multiprogramming
  - Program invokes syscall (i.e., invokes OS)
  - OS checks protection [cannot run root w/o password]
  - OS runs device drivers
  - Suspends current process and switches process
  - I/O interrupt fielded by OS
  - OS completes I/O and wakes up suspended process (i.e. make it runnable)
  - Run next ready process

ECE437, Spring 2016

(141)

## Outline

- Disks
- Buses
  - Performance vs. Simplicity tradeoff
  - Timing
- Interfacing
  - OS vs. user mode, virtualized interface
- Networks
- Video
- Optical drives

ECE437, Spring 2016

(142)

## Networking I/O Issues

- Protocol Stack
  - Application: sends a HTML file
  - HTML file sent over TCP/IP
  - TCP breaks up html file into IP packets
  - Sends IP packets (and ensures in-order delivery, retransmitting if necessary)
  - IP packets may further be broken up into Ethernet/802.11b frames
- Checksums/ error handling at several layer

ECE437, Spring 2016

(143)

## Networking I/O issues

- I/O Issues
- Partitioning this stack
  - What gets done on the CPU?
  - What gets done on other hardware?
- TCP offload engines (TOE) hardware cache to ports.
  - "smart NIC" handles more of TCP stack functionality
  - Relevant at GigE speeds
- Other functionality may also be offloaded
  - Firewall/VPN/filtering/IDS

ECE437, Spring 2016

(144)

## Networking



- WAN
  - Backbone
- LAN
  - Ethernet
  - WLAN (802.11b)

ECE437, Spring 2016

(145)

## Ethernet

- One shared medium
  - Distributed arbitration by collision detection
  - Officially : CSMA/CD : carrier-sense multiple access with collision detection
  - Length limits (why?)
  - Shared medium is bottleneck
- "The rumors of my demise have been greatly exaggerated." -- Mark Twain
- But wait... what survived is not the same
- Evolving
  - 100MBit ethernet, Gigabit ethernet, 10Gb Ethernet
  - Now a switched network standard
    - Not shared bus, point-to-point network

ECE437, Spring 2016

(40)

NOT bus network.  
No arbiter

carrier sense : everyone can read off wire.  
multiple access : everyone can write on wire.

collision detect : as you send, you listen.  
If someone else sends too,  
you can detect and act  
based on that.

If a case of collisions,  
exponential back off

wait - rand(10)  
if still collides, rand(100)

compatible

## 802.11b/g/n

- Also distributed arbitration
- CSMA/CA : carrier-sense multiple access with collision avoidance
- To minimize collision - Combination of
  - Handshakes (RTS/CTS) and
  - Backoff  $\xrightarrow{\text{if send}} \xrightarrow{\text{if no signal}}$
- Shared medium is the 802.11b/g/n spectrum (near 2.4GHz)
- Cannot detect collision
  - Limitation of RF - turn off receiver when transmitting
- Destination has to send an acknowledgement
  - No Ack  $\rightarrow$  collision and/or RF interference

ECE437, Spring 2016

(41)

cannot detect collision:

i.e. cannot read &amp; write together.

See collision avoidance.

send RTS.

if you get CTS, then write/send message.  
else, wait & try later.

Apr. 25.

back when  
IBM PC &  
GIGA games



## What is Performance for us?

- For computer architects
  - CPU execution time = time spent running a program
- Because people like faster to be bigger to match intuition
  - performance =  $1/X$  time
  - where X = response, CPU execution, etc.
- Elapsed time = CPU execution time + I/O wait
- We will concentrate mostly on CPU execution time

ECE437, Spring 2016

(7)

## Improve Performance

- Improve (a) response time or (b) throughput?
  - faster CPU
    - both (a) and (b)
  - Add more CPUs
    - (b) but (a) may be improved due to reduced queueing

Give an example of this phenomenon

ECE437, Spring 2016

(8)

## Performance Comparison

- Machine A is n times faster than machine B iff
  - $\frac{\text{perf}(A)}{\text{perf}(B)} = \frac{\text{time}(B)}{\text{time}(A)} = n$
- Machine A is  $x\%$  faster than machine B iff
  - $\frac{\text{perf}(A)}{\text{perf}(B)} = \frac{\text{time}(B)}{\text{time}(A)} = 1 + x/100$
- E.g., A 10s, B 15s
  - $15/10 = 1.5 \Rightarrow A$  is 1.5 times faster than B
  - $15/10 = 1 + 50/100 \Rightarrow A$  is 50% faster than B

ECE437, Spring 2016

(9)

## Breaking down Performance

- A program is broken into instructions
  - H/W is aware of instructions, not programs
- At lower level, H/W breaks instructions into cycles
  - lower level state machines change state every cycle
- E.g., 4 GHz Opteron runs 4 B cycles/sec, 1 cycle = 0.25 ns

ECE437, Spring 2016

(10)

## Iron law

- Time/program =  $\frac{\text{instrs}}{\text{sec}} / \text{program} \times \frac{\text{cycles}}{\text{instr}}$
- sec/cycle (a.k.a. cycle time, clock time) - 'heartbeat' of computer
    - mostly determined by technology and CPU organization
  - cycles/instr (a.k.a. CPI)
    - mostly determined by ISA and CPU organization
    - overlap among instructions makes this smaller
  - instr/program (a.k.a. instruction count)
    - instrs executed NOT static code
    - mostly determined by program, compiler, ISA

ECE437, Spring 2016

(11)

## Our Goal

- Minimize time which is the product, NOT isolated terms
- Common error to miss terms while devising optimizations
  - E.g., ISA change to decrease instruction count
  - BUT leads to CPU organization which makes clock slower

ECE437, Spring 2016

(12)

## Iron Law Example

- Machine A: clock 1 ns, CPI 2.0, for a program
- Machine B: clock 2 ns, CPI 1.2, for same program
- Which is faster and how much?
- Time/program = instrs/program  $\times$  cycles/instr  $\times$  sec/cycle
  - Time(A):  $N \times 2.0 \times 1 = 2N$
  - Time(B):  $N \times 1.2 \times 2 = 2.4N$
- Compare:  $\text{Time}(B)/\text{Time}(A) = 2.4N/2N = 1.2$
- So, Machine A is 20% faster than Machine B for this program

ECE437, Spring 2016

(13)

## Iron Law Example

- Keep clock of A at 1 ns and clock of B at 2 ns
- For equal performance, if CPI of B is 1.2, what is A's CPI?
  - $\text{Time}(B)/\text{Time}(A) = 1 = (N \times 2 \times 1.2)/(N \times 1 \times \text{CPI}(A))$
  - $\text{CPI}(A) = 2.4$

ECE437, Spring 2016

(14)

## Iron Law Example

- Let CPI of A = 2.0 and CPI of B = 1.2
- For equal performance, if clock of B is 2 ns, what is A's clock?

ECE437, Spring 2016

(15)

## Iron Law Example

- Let CPI of A = 2.0 and CPI of B = 1.2
- For equal performance, if clock of B is 2 ns, what is A's clock?
  - $\text{Time}(B)/\text{Time}(A) = 1 = (N \times 2.0 \times \text{clock}(A))/(N \times 1.2 \times 2)$
  - $\text{clock}(A) = 1.2 \text{ ns}$

ECE437, Spring 2016

(16)

## Other Metrics

- MIPS and MFLOPS
- MIPS
  - = instruction count/(execution time  $\times 10^6$ )
  - = clock rate/(CPI  $\times 10^6$ ) (How?)
- BUT MIPS has problems

$$\text{MIPS} = \frac{\text{I}}{10^6 \times \text{clocktime}} = \frac{1}{\text{CPI} \times \text{clocktime} \times 10^6}$$

ECE437, Spring 2016

(17)

*not considers  
instruction count  
directly*

## Problems with MIPS

- E.g., without FP H/W, an FP op may take 50 single-cycle instrs
- with FP H/W only one 2-cycle instr
- Thus adding FP H/W
  - CPI increases (why?) The FP op goes from 50/50 to 2/1
  - but instrs/prog decreases more (why?) each of the FP op reduces from 50 to 1, factor of 50
  - total execution time decreases
- For MIPS
  - instrs/prog ignored
- MIPS gets worse!

ECE437, Spring 2016

(18)

## Problems with MIPS

- Ignore program
- Usually used to quote peak performance
  - ideal conditions => guarantee not to exceed!!
- When is MIPS ok?
  - same compiler and same ISA
  - e.g., same binary running on Pentium Pro and Pentium
  - why? **instrs/prog is constant and may be ignored**

ECE437, Spring 2016

(19)

## Other Metrics

- PFLOPS = FP ops in program/(execution time  $\times 10^{12}$ )
- Assuming FP ops independent of compiler and ISA
  - Assumption not true
  - may not have divide instruction in ISA
  - optimizing compilers can remove some insts
- Relative MIPS and normalized MFLOPS
  - adds to confusion!

ECE437, Spring 2016

(20)

## Rules

- Use ONLY Time
  - Beware when reading, especially if details are omitted
  - Beware of Peak

ECE437, Spring 2016

(21)

## Outline

- Time and performance
- Iron law
- MIPS and MFLOPS
- Which programs
- How to average
- Amdahl's law

ECE437, Spring 2016

(22)

## Which Programs

- Execution time of what
- Best case - you run the same set of programs everyday
  - port them and time the whole "workload"
- In reality, use benchmarks
  - programs chosen to measure performance
  - predict performance of actual workload (hopefully)
  - saves effort and money
  - representative? honest?

ECE437, Spring 2016

(23)

## Benchmarks: SPEC2006

- **SPEC**: System Performance Evaluation Cooperative
- Latest is SPEC2006, before SPEC89, SPEC92, SPEC95, SPEC 2000
- 12 integer and 18 floating point programs
  - normalize run time with a Gold Standard processor (*SPEC ratio*)
  - GM of the normalized times (why GM?)

ECE437, Spring 2016

(24)

## Recap Exercise

- 60% instructions run in 1 ns and 40% instructions need 1.45 ns
  - Machine A: Cycles time = 1.45 ns
    - All instructions execute in 1 cycle
  - Machine B: Cycle time = 1 ns
    - 60% instructions execute in 1 cycle
    - 40% instructions execute in 2 cycles
  - Which is the better architecture? And by what factor/percentage?

ECE437, Spring 2016

(27)

## How to average

### Example

|           | Machine A | Machine B |
|-----------|-----------|-----------|
| Program 1 | 1s        | 10s       |
| Program 2 | 1000s     | 100s      |
| Total     | 1001s     | 110s      |

- One answer: total execution time, then how much faster than A is B?  $1001/110 = 9.1$

ECE437, Spring 2016

(28)

assuming P1 & P2 are equally weighted.

## How to average

- Another: arithmetic mean (same result: B 9.1 times faster than A)
- Arithmetic mean of times:  $\left\{ \sum_{i=1}^n \text{time}_i \right\} / n$  for n programs
- $AM(A) = 1001/2 = 500.5$
- $AM(B) = 110/2 = 55$
- $500.5/55 = 9.1$
- Valid only if programs run equally often, else use "weight" factors
- Weighted arithmetic mean:  $\left\{ \sum_{i=1}^n \text{weight}_i \times \text{time}_i \right\} / n$

ECE437, Spring 2016

(30)

## Other Averages

- E.g., 30 mph for first 10 miles
- 90 mph for next 10 miles. Average speed?
- Average speed =  $(30+90)/2 = 60$  mph? WRONG
- Average speed = total distance / total time  
 $= (20 / (10/30 + 10/90))$   
 $= 45$  mph
- What if it was 10 hours at each speed?  
  - instead of 10 miles

ECE437, Spring 2016

(31)

## Harmonic Mean

- Harmonic mean of rates =  $\frac{1}{\left\{ \sum_{i=1}^n \frac{1}{\text{rate}_i} \right\} / n}$
- Use HM if forced to start and end with rates
- Trick to do arithmetic mean of times but using rates and not times

ECE437, Spring 2016

(32)

## Practice

- Machine A runs 10M instructions at 15 MIPS and the next 10M instructions at 45 MIPS
  - Average MIPS = ?? harmonic mean
- Machine A runs for 10 seconds at 15 MIPS and the next 10 seconds at 45 MIPS
  - Average MIPS = ?? normal mean

ECE437, Spring 2016

(33)

$$\frac{1}{\text{rate}} = \text{time}$$

time can average.  
 the avg. time is reciprocated  
 for avg. rate.

SPEC normalized run time with a Gold Standard processor. (over value)  
Due to this, must make geometric mean.

## Dealing with Ratios

- Absolute times

|           | Machine A | Machine B |
|-----------|-----------|-----------|
| Program 1 | 1s        | 10s       |
| Program 2 | 1000s     | 100s      |

- Now consider ratios (w.r.t. A)

|           | Machine A | Machine B |
|-----------|-----------|-----------|
| Program 1 | 1         | 10        |
| Program 2 | 1         | 0.1       |

- Averages:  $A = 1, B = 5.05$

ECE437, Spring 2016

(34)

## Dealing with Ratios

- Absolute times

|           | Machine A | Machine B |
|-----------|-----------|-----------|
| Program 1 | 1s        | 10s       |
| Program 2 | 1000s     | 100s      |

- Now consider ratios (w.r.t. B)

|           | Machine A | Machine B |
|-----------|-----------|-----------|
| Program 1 | 0.1       | 1         |
| Program 2 | 10        | 1         |

- Averages:  $A = 5.05, B = 1$

- Both cannot be true

ECE437, Spring 2016

(35)

## Geometric Mean

↓ for ratios

- Don't use arithmetic mean on ratios (normalized numbers)
- Use geometric mean for ratios
  - geometric mean of ratios =  $\sqrt[n]{\prod_{i=1}^n \text{ratio}_i}$
  - Use GM if forced to use ratios
- Independent of reference machine (math property)
- In the example, GM for machine A is 1, for machine B is also 1
- normalized with respect to either machine

ECE437, Spring 2016

(36)

## But..

- Geometric mean of ratios is not proportional to total time
- AM in example says machine B is 9.1 times faster
- GM says they are equal
- If we took total execution time, A and B are equal only if
  - program 1 is run 100 times more often than program 2
- Generally, GM will mispredict for three or more machines

ECE437, Spring 2016

(37)

## Previous Midterm Question

- Machine A runs 9 times faster than machine B when running program P.
- Machine B runs 4 times faster than machine A when running program Q.
- Which machine is faster (and by what factor) when averaged across both programs?

ECE437, Spring 2016

(38)

## Summary for Averages

- Use AM for times
- Use HM if forced to use rates
- Use GM if forced to use ratios
- Better yet, use unnormalized numbers to compute time

ECE437, Spring 2016

(39)

~3 times.

A better than

B by:  $\frac{1.5}{1}$   
factor

|   | A | B   |
|---|---|-----|
| P | 1 | 9   |
| Q | 1 | 1/4 |

FM 1 1.5

when a feature is faster by factor ( $f$ ) and others are same:

major pitfall



## Amdahl's Law

- Why does the common case matter the most?
- Let an optimization speed  $f$  fraction of time by a factor of  $s$
- assuming that old time =  $T$ , what is the speedup?
  - $f$  is the "affected" fraction of  $T$
  - $(1-f)$  is the unaffected fraction

$$\begin{aligned} \text{Speedup} &= \frac{\text{time}_{\text{old}}}{\text{time}_{\text{new}}} = \frac{\text{unaffected}_{\text{old}} + \text{affected}_{\text{old}}}{\text{unaffected}_{\text{new}} + \text{affected}_{\text{new}}} \\ &= \frac{(1-f) \times T + f \times T}{(1-f) \times T + \frac{f}{s} \times T} \end{aligned}$$

ECE437, Spring 2016

(40)

## Amdahl's Law Example

- Your boss asks you to improve processor performance
- Two options: What should you do?
  - improve the ALU used 95% of time, by 10%
  - improve the square-root unit used 5%, by a factor of 10

| $f$ | $s$      | Speedup |
|-----|----------|---------|
| 95% | 1.10     | 1.094   |
| 5%  | 10       | 1.047   |
| 5%  | $\infty$ | 1.052   |

ECE437, Spring 2016

(41)

## Amdahl's Law: Limit

- Make common case fast because:

$$\lim_{f \rightarrow 0} \left( \frac{1}{1-f+f/s} \right) = \frac{1}{1-f}$$



ECE437, Spring 2016

(42)

## Amdahl's Law

- "Make common case fast"
  - Heuristic, not commandment
  - Use for intuition verify with numbers
- 60% can be improved by a factor of 2
  - Speedup =  $1/(0.4+0.6/2) = 1/0.7$
- 40% can be improved by a factor of 8
  - Speedup =  $1/(0.6+0.4/8) = 1/0.65$
- Second option is better
  - Less common case, but higher speedup compensates

ECE437, Spring 2016

(43)

## Summary

- Time and performance:
  - Machine A  $n$  times faster than Machine B
  - iff  $\text{Time}(B)/\text{Time}(A) = n$
- Iron Law: Time/prog
  - Instr count  $\times$  CPI  $\times$  Cycle time
- Other Metrics: MIPS and MFLOPS
  - Beware of peak and omitted details
- Benchmarks: SPEC2006 (int + FP)
- Summarize performance:
  - AM for time, HM for rate, GM for ratio
- Amdahl's Law: Speedup =  $\left( \frac{1}{1-f+f/s} \right)$  common case fast

ECE437, Spring 2016

(44)

improvement  
ratio amount