

# NUMBER REPRESENTATION

| TYPE             | RANGE                         | LOWEST  | HIGHEST | # OF REPRESENTABLE NUMBERS                       |
|------------------|-------------------------------|---------|---------|--------------------------------------------------|
| unsigned         | $[0, 2^{n-1}]$                | 00...00 | 11...11 | $2^n$                                            |
| sign & magnitude | $[-(2^{n-1}), 2^{n-1}]$       | 11...11 | 01...11 | $2^n - 1 \leftarrow 2 \text{ nps of } \emptyset$ |
| 2's complement   | $[-2^{n-1}, 2^{n-1}-1]$       | 10...00 | 01...11 | $2^n$                                            |
| Biased           | bias, $2^n - 1 - \text{bias}$ | 00..00  | 11...11 | $2^n$                                            |

TWO'S COMPLEMENT: first bit=1 → negative

to find magnitude: ① invert bits of negative # ② add 1 ③ Negate unsigned result

shift left logical in 2's comp: multiply by 2  
shift right in 2's comp: floor divide by 2

Bias:  $x + b = v$   
 $\uparrow \text{bias} \uparrow$   
 unsigned binary to represent char str1 =  $\star((\text{char}*)p)$   
 # you want char str2 =  $\star((\text{char}*)q)$

C (pass by value)

types: char: 8 bits = 1 byte int 8-t = 8 bit signed  
int: 32 bits = 4 bytes

False values: '\0', 0, NULL

STRUCTS: USE `(*struct).field`  
STRUCT → FIELD

POINTERS: USE `&` to get address  
use `*` to get value

pointer arithmetic: increment n by  
(ex: int = +4) `sizeof(pointer-type)`

arrays: int arr [] = {1, 2, 3};  
char \* hello = "hello";  $a[i] \leftrightarrow (a+i)$

char \* hello = malloc(sizeof(char) \* 6);

strlen does NOT include NULL acct for null terminator

\* strings end w/ null terminator '\0'

MEMORY ALLOCATION:

malloc (size\_t size): returns void \* pointer  
↑ check if return NULL initializes w/ garbage  
free (void \* ptr) : must free any malloc'd pointer once

realloc(void \* ptr, size\_t size): reallocate memory

calloc (num\_items, size): allocates & initializes to zero

void \* memcpy (void \* dest, const void \* src, size\_t n)

void \* memmove (void \* dest, const void \* src, size\_t n)

endianness:



(RISC-V) OPOCOBAD



logical left shift: shift left add zeros = multiply pow2

logical right shift: shift right zero extend = divide unsigned by power of 2

arithmetic right shift: shift right sign extend ≠ divide signed by power of 2

SLII x12 X10 0X10 shifts LEFT by 16 bits 0X000034FF → 0X34FF0000

SRII x12 X12 0X08 shifts RIGHT by 8 bits 0X34FF0000 → 0X0034 FF00

SRAI a0 a0 0X02 shifts RIGHT by 4 bits 0XF000000 → 0XF000000

# MEMORY MANAGEMENT



grows down (LIFO), function / local vars, erase when fn return  
`int * foo = malloc(sizeof(int))` #DEFINE X  
`foo = stack` ⇒ `foo = heap` X in code/text  
  
 dynamically allocated (malloc, calloc, realloc), bottom up, persists over calls  
 global vars outside of funcs, immutable, string literal, no grow/shrink  
 loaded @ start, executable read-only code, functions, machine instructions

# Floating Point

32 bit single precision

BIAS = -127

Normalized:  $\text{Value} = (-1)^{\text{sign}} \times 2^{\text{exp} + \text{bias}} \times 1.\text{significand}_2$

Denormalized:  $\text{Value} = (-1)^{\text{sign}} \times 2^{\text{exp} + \text{bias} + 1} \times 1.\text{significand}_2$

| EXPONENT | SIGNIFICAND | MEANING  |
|----------|-------------|----------|
| 0        | Anything    | DENORM   |
| 1-254    | Anything    | NORMAL   |
| 255      | 0           | INFINITY |
| 255      | Nonzero     | NAN      |

VALUE → BINARY

-47.34375 ① 47 = 10111

② 0.34375 = 0.25 + 0.0625 + 0.03125

=  $2^{-3} + 2^{-4} + 2^{-5} = 0.01011$  floating point closest to 0 = denorms

③ 10111, 01011 =  $1.011101011 \times 2^5$   $\times 2^5 - 1$  pos NaN,  $2^{25} - 1$  neg NaN

④  $X - 127 = 5$

$X = 5 + 127 = 10000100$

1 10000100 0111010111 pad w/ zeros,

RISC V

callee: 80-811, SP

caller: a0-a7

60-6C

ra

Save & restore

PC = program counter  
→ where we are in running code

JUMP instructions MODIFY PC

funct7, funct3, opcode = SPECIFIC operation

RType = arithmetic (2 source registers)

IType = immediate arithmetic (1 source, 1 immediate)

I+T-type = immediate shift (slli, srli, srai)

S-type = store

B-type = branches

U-type = upper immediate

J-type = jump (jal, j)

add/mov  
RISC commands

lb: load byte will go to LSB & MSB smear to fill

\* lb x12 i(x5) = 0x93 → 0b1001... → 0xFFFFFFF93

lbu: unsigned smear w/ 0 → 0x00000093

andi with 000000FF to isolate LSB

lui: writes 8 bits to LSB and fills other 24 bits w/ 0

ori PC: build 32 bit with a lui

→ add to this

j label  
(pseudo for jal x0 label)

j label

jump there, don't come back

jal ra label  
(pseudo for jal x0 ra 0)

jump there, come back

jal ra sl 0

(jump to 0+address in sl)

CALL

gcc = C compiler

C → RISC V outputs labels, pseudo instructions, relative addressing, error checking, reorders code

A → object files replaces pseudo instructions w/ assembly relocation table (external labels)

symbol table (labels, static data, relative addresses) arithmetic, logical shifts

L → object files → single binary executable put tgt data & text from .o files → put on end of text resolve refs w/ reloc table → fill in addresses

L → load program into memory for execution initialize registers run a.out part of operating system



|              |                 |                        |
|--------------|-----------------|------------------------|
| commutative  | AND<br>AB=BA    | OR<br>A+B=B+A          |
| associative  | AB(C)=A(BC)     | A+(B+C)=(A+B)+C        |
| identity     | IA=A            | O+A=A                  |
| null         | OA=0            | I+A=I                  |
| absorption   | A(A+B)=A        | A+AB=A                 |
| distributive | (A+B)(A+C)=A+BC | AC(B+C)=AB+AC          |
| deMorgan's   | A(A')=0         | !A !B !C + !ABC + ABC  |
| inverse      | A(A')=0         | !AB(C+!C) + ABC        |
| DeMorgan's   | AB = A+B        | !AB + ABC              |
|              |                 | B(!A+A C) * Absorption |
|              |                 | B(!A+C)                |

### COMMON GATES



## TIMING

**SETUP**: time before clock edge that input must be stable

**Hold**: time after clock edge that input must be stable

**Clock-to-Q**: time after clock edge that input appears on Q

**Critical path**: longest path delay between 2 state elements

**Path delay**:  $t_{clock-to-Q} + \frac{\text{combinational logic}}{\text{FIRST element}} + \frac{\text{setup time}}{\text{2nd element}}$

$IUI_i : \text{lower 12 bits} = 0$   
upper 20 = instruction type imm

$IUI_i + 0 \text{ } 0xABCDEF : t_0 = 0x123456000$

addi to 10 0x678 :  $t_0 = 0x12345678$

auipc x5 0x0300 :  $x5 = PC + 0x30000000$

$PC = 0x40000000 : x5 = 0x40000000 + 0x03000000$

$= 0x43000000$

jalr x0 0xc00(x5) : jump to x5 + sign extend  
 $0x c00 + \text{jump to address}$



min clock period = critical path

max clock freq =  $\frac{1}{\text{min clock period}}$

**MAX**  $t_{clock-to-Q} + t_{shortest-comb} \geq t_{hold}$

**MIN**  $t_{clock-to-Q} + t_{longest-comb} + t_{setup} \leq t_{Q-period}$

longer hold = value has potential to change  
shorter period = values may not finish computing

## DATAPATH



| INSTR | BREQ | BRLT | PCSEL            | ImmSel | BrUn                 | ASel             | BSel             | ALUSel | MemRW              | RegWEn | WBsel            |
|-------|------|------|------------------|--------|----------------------|------------------|------------------|--------|--------------------|--------|------------------|
| add   | ✓    | ✓    | +4 <sup>0</sup>  | +      | ✓                    | Reg <sup>0</sup> | Reg <sup>0</sup> | Add    | Read <sup>0</sup>  | 1      | ALU <sup>1</sup> |
| sub   | ✓    | ✓    | +4               | ✓      | ✓                    | Reg              | Reg              | Sub    | Read               | 1      | ALU              |
| addi  | ✓    | ✓    | +4               | I      | ✓                    | Reg              | Imm <sup>1</sup> | Add    | Read               | 1      | ALU              |
| lw    | ✓    | ✓    | +4               | I      | ✓                    | Reg              | Imm              | Add    | Write <sup>1</sup> | 0      | →                |
| sw    | ✓    | ✓    | +4               | S      | ✓                    | Reg              | Imm              | Add    | Write              | 0      | →                |
| beq   | 0    | ✓    | +4               | B      | ✓                    | PC <sup>1</sup>  | Imm              | Add    | Read               | 0      | →                |
| beq   | 1    | ✓    | ALU <sup>1</sup> | B      | ✓                    | PC               | Imm              | Add    | Read               | 0      | →                |
| bne   | 0    | ✓    | ALU              | B      | ✓                    | PC               | Imm              | Add    | Read               | 0      | →                |
| bne   | 1    | ✓    | +4               | B      | ✓                    | PC               | Imm              | Add    | Read               | 0      | →                |
| bit   | ✓    | ✓    | ALU              | B      | signed 0<br>unsigned | PC               | Imm              | Add    | Read               | 0      | →                |
| bltu  | ✓    | ✓    | ALU              | B      | ✓                    | PC               | Imm              | Add    | Read               | 0      | →                |
| jalr  | ✓    | ✓    | ALU              | I      | ✓                    | Reg              | Imm              | Add    | Read               | 1      |                  |
| jal   | ✓    | ✓    | ALU              | J      | ✓                    | PC               | Imm              | Add    | Read               | 1      |                  |
| auipc | ✓    | -    | +4               | U      | ✓                    | PC               | Imm              | Add    | Read               | 1      | ALU              |

\* longest instruction = load instructions

## PIPELINING

max efficiency in n-stage pipeline = n instructions in pipeline

\* add register for every wire carrying value across stages

latency = time for CPU to execute 1 instr

throughput = # of instructions executed per unit time

processor performance

$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instruction}}{\text{Program}} \cdot \frac{\text{Cycles}}{\text{Instruction}} \cdot \frac{\text{Time}}{\text{Cycles}}$$

|             | SINGLE                                     | N-STAGE                                      |
|-------------|--------------------------------------------|----------------------------------------------|
| CLOCK CYCLE | full datapath clock per                    | max of n stages                              |
| latency     | clock cycle                                | clock cycle * n                              |
| throughput  | $\frac{1}{\text{clock per full datapath}}$ | $\frac{1}{\text{clock per max of n stages}}$ |

**STRUCTURAL HAZARD**: same piece of hardware being accessed by more instructions than access points

↳ SOLNS: stall / add more hardware, separate MEM & DMEM [can it be solved by adding another regfile?]

**DATA HAZARD**: data dependency → data needed by following instr before its updated

ex reading register before prev finished writing

↳ SOLN: stall / NOP, forwarding

**CONTROL HAZARD**: flow of execution has more than 1 option

\* JUMPS + BRANCHES

↳ SOLN: stall, branch prediction, flushing

**FSM** = represented by states + transition  
transition  $i \rightarrow i$  output 1



**DLP** = data level parallelism

\* optimize instruction program → increase work per instruction

SIMD = single instr, multiple data

vectorized calculations

notation: - mm\_ <instr\_ops> <suffix>



↳ remember edge case for elements that don't fit in vector

## AMDAHL'S LAW

$$\text{SPEEDUP} = \frac{1}{(1-F) + \frac{F}{S}}$$

F = fraction of program that can be optimized

S = speedup factor for how much that fraction can be optimized

## OPENMP

#pragma omp parallel

```

{ 
  // fun stuff
  ...
  // copied & executed across all threads INDEPENDENTLY
  // if around a loop: copies + executes same loop across many threads
}
```

#pragma omp parallel for

```

for (l=0; l<h; l++) {
  // fun stuff
  ...
  // SPLIT across # of threads
  // each thread is independent
  // scheduled by OS
}
```

## PLP : process level parallelism

program on multiple processes @ once  
Manager-worker framework

↳ manager assigns work every other process (workers) ask for work

## OPENMPI

```

int main (int argc, char ** argv) {
  // set up openMPI
  if (manager process) {
    // manager node code
  } else {
    // worker node code
  }
  // terminate openMPI
}
```



## CACHES

= small memory storage w/ FASTER ACCESS than DRAM

Logic:

Need to access memory → check cache  
if not in cache → grab from main mem & load into cache  
if in cache → grab from cache

**locality**: principle for saving/updating data in cache

- ① **temporal**: keep most recently accessed items
- ② **spatial**: move contiguous blocks of memory into cache



### DIRECT MAPPED



distinguish between all main memory address

# of bits

$$\text{INDEX} = \log(\# \text{ blocks}/\text{associativity}) = \log_2(\#\text{of indices})$$

$$\text{offset} = \log_2(\text{size of blocks})$$

$$\text{tag} = (\#\text{bits in mem addr}) - (\#\text{of index bits}) - (\#\text{of offset bits})$$

$$\text{cache size} = \#\text{ of blocks} \times \text{block size}$$

### CACHE EVICTION POLICIES

LRU = least recently used

MRU = most recently used

FIFO = first in first out

LFU = last in first out

### FULLY ASSOCIATIVE



→ NO CONFLICT MISSES

→ need comparator for every entry to check valid

cache = \$ main mem/mm

① write through: on every write

→ data written to \$ + MM

② no write allocate: \$ line NOT

written to cache → directly to MM

③ write back: data written to \$ and

DIRTY BIT = 1 → written to MM in eviction

④ write allocate: \$ line pulled to \$ and written to in specified write-hit policy

to in specified write-hit policy

t0 = 01111

s11 = 0111100000

srl1 = 0111100000000000

t0 = 0x F000 0000

t1 = 0x F000 0000

t0 = 0x DF00 0000

t1 = 0x FF00 0000

→ ml28i \_mm\_seti\_epi32(first)

set 4 signed 32-bit ints to i

→ ml28i \_mm\_loadi\_epi32(...ml28i &p)

load into pointed to by p into vector

→ ml28i \_mm\_andi\_epi32(...ml28i &a, ...ml28i &b)

(a & b, a & b, a & b, a & b)

→ ml28i \_mm\_andi\_epi32(...ml28i &a, ...ml28i &b)

(a & b, a & b, a & b, a & b)

void \_mm\_storei32(...ml28i &p, ...ml28i n)

store 32-bit vector at pointer p

→ ml28i \_mm\_andi\_epi32(...ml28i a, ...ml28i b)

bitwise AND of 32 bits in a and b

→ ml28i \_mm\_cmphi\_epi32(...ml28i a, ...ml28i b)

inv element of return vector set of 0xffffffff

if inv element of a and b = → 0xffffffff

↳ multithreaded CANNOT assume order threads are executed in

fork(): master → child threads

& divides work

join(): wait until threads finish

→ synchronize → terminates all except master



$$kilo = 10^3 \text{ kibi} = 2^{10}$$

$$10^3 \approx 2^{10} \text{ mega} = 10^6 \text{ mibi} = 2^{20}$$

## AMAT = average memory access time

hit time + miss rate (miss penalty)

↓ time to get data if in cache already

↓ # misses / # accesses

↓ time to get data if it's not in the cache

$$L2 \text{ AMAT} = L1 \text{ HT} + L1 \text{ MR} (L2 \text{ HT} + L2 \text{ MR} (\text{main mem access}))$$

### CACHE CODE ANALYSIS

① count access per inner & outer loop

② look for compulsory misses, thrashing (going back and forth between filling cache block) and count hits + misses

$$\text{Hit rate} = \frac{\text{hits}}{\text{hits} + \text{misses}}$$



### CACHE MISSES

① **compulsory miss**: never tried to pull block into cache (ALWAYS happens first time we request)

↳ SOLN: increase cache block size, prefetching

② **conflict miss**: had to evict block bc another block REPLACED it and CACHE NOT FULL

↳ does NOT exist for fully associative

↳ SOLN: increase associativity improve replacement policy

③ **capacity miss**: evicted block bc cache was FULL

↳ increase cache size

↳ request for this exact block before?

$$B = \frac{\text{# total blocks in cache}}{\text{block size}} = \frac{\text{cache size}}{\text{block size}}$$



# Virtual memory

- protection + security
- combine disk + memory into larger space
- allows programs to "see" more memory than exists
- PAGES** = uniformly sized block of data
- ↳ **page table** maps between virtual and physical



offset bits =  $\log_2$  (page byte size)  
div by 1 when write



**TLB** = translation lookaside buffer  
\* cache for page table  
\* usually fully associative  
\* RESET when new process started

|             |
|-------------|
| 0xB055ADE1  |
| 0xC160EFFF  |
| 0x7890ABCD  |
| 0x97543210  |
| <truncated> |
| page table  |

0x 00000445  
VPN offset

index 0 = 0xB055ADE1

PPN

PPN + offset = ADE1445  
MSB = 1 so valid

## STEPS

- ① check if VPN in TLB
    - ↳ check valid bit
    - 1 = hit
    - 0 = check page table
  - ② check page table
    - ↳ check valid bit
    - 1 = hit (NO page fault)
    - 0 = page fault
- ↓  
go to next physical address

## \* FOR RISC + C coding questions :

- ③ can use bitwise AND (&) to check if a digit exists with another digit
- ④ can use slli to multiply
- ⑤ malloc(sizeof(int) \* n)
- ⑥ calloc(num\_elems, sizeof(int))

↑  
1 elem

## MULTITHREADING TASKS

- \* find combo of tasks to maximize parallelism (multiple chips doing work @ same time) and chip specialization (chip that's good @ memory doing tasks more demanding on memory)

