

## Pre-M1 (c)

\*: dereferencer  
&: referencer.

## C / Memory

- `ptr = malloc(size);` Allocates uninitialized block, size size  
(@ pointer `ptr` in memory)
- `ptr = calloc(num_items, size);` allocates num\_items blocks of zeroed memory of size size  
(@ pointer `ptr`)
- `free(ptr);` free memory allocated at `ptr`. do this if you'll never use that mem. again.
- `realloc(ptr, new_size);` Either
  - Expands or contracts memory allocated by `ptr` to `new_size`
  - Allocates a new block, copies the memory it can from old block, frees old block.

## Alignment

- char: 1 byte, no alignment rules
- short: 2 bytes, aligned by halfword
- int: 4 bytes, aligned by word
- void\*

## Stack vs. Heap Space

- Stack space: memory allocated on the call stack
  - most primitive datatypes
- Heap space: memory allocated during execution
  - allocated memory
- Static storage: global vars, static literals
- Code storage: bytecode that comprises the code being run

### Struct:

- multiple blocks of data in one structure.
- info is bunched adjacently & spaced according to alignment rules.

### Unions:

- multiple data types stored in one block of memory.
- allocated to be the largest data type
- set Block to 0 = set everything to 0.

## CALL: Compiler

### Assembler

passes over twice  
for PC-relative  
labels.

translates C into a lower-level asm lang. like RISC-V

translates from Assembly into machine code.

↳ reads & user directives, which give directions to the assembler for how to handle mem, etc.

↳ object file format:

; header - size & pos of other pieces

; text segment - code

; data segment - binary rep of static data in source file

### Linker

resolves all assembled symbols, possibly w/ libraries, etc.

and references to static data.

↳ stores this info in a relocation table & symbol table.

↳ lowest part of CALL

### Loader

Running the executable, putting data into memory.

Is the OS

## Control Logic

- PCSel: If ≠ branch or jump, PC = label, not PC+4.

- ImmSel: diff instruction types have diff ways of making immediates
  - ↳ immSel determines which to use.

- RegWEN: do we write to register? No for store & branch

- BSEL, ASel: picking what to add to ALU. ALU PC & RS1, while B bwn imm & RS2

- ALU Sel: which ALU op

- MemRW: write to memory?

- WB Sel: What to write to dest reg? Memory, ALU, or PC+4?

## Number Rep:

### Unsigned

- representable range:  $[0, 2^n - 1]$
- just ignores signs, best for nonnegative-only metrics like length.

- if we wanted signs..,

### Signed magnitude

- Representable Range:  $[-2^{n-1}, 2^{n-1} - 1]$
- one bit determining sign, rest comprising int's as normal

### bias

- Representable Range:  $[B, 2^n - 1 + B]$ , usually  $B > 0$ .
- allows us to represent neg. #'s as nonneg.
- would need to perform bias externally and it would work like unsigned.

### 2's complement

- Representable Range:  $[2^{n-1}, 2^{n-1} - 1]$
- then, to repr. a negative number, flip all bits in original number & add 1 to the result.

## Floating Point

### Four Components



Change behavior of 0.b00...0 st. range extends to exactly 0.  
↳ instead of this being 1.2 bits, make it  $0.2^{B_{\text{bias}}+1}$  when exponent = 0  
↳ decreases precision but at least we can represent #'s < 1

If exponents are all 1's, then our value is  $\infty$   
only if mantissa = 0. Otherwise, is NaN.

### MTTF / MTTR

mean time to failure: higher = better  
mean time to repair: lower the better

$$MTBF = MTTF + MTTR, \text{ Availability} = \frac{MTTF}{MTBF}$$

if we have smth running for X days out of Y, avail. =  $X/Y$ . Use this to find MTTF / MTTR.  
If we have one, can find the other.

## RAID

- RAID 0: just put all memory inside one
  - not dependable

- RAID 1: copy disk n times
  - restricts amt. of storage we can have

- RAID 4: store info across disks w/ parity bit.
  - can run ECC, and more mem efficient.

- RAID 5: same as 4, but stratified.

### ECC

→ parity = even-odd of sum of the bits, found via XOR.

→ Hamming Distance = # bits different b/w two strings.

→ Hamming ECC: Use library value to determine corrupted bit.

↳ Check if parity bits need to be 1 or 0 after adding infill bits & encoding.

↳ if p's have errors, find bit overlap errors & flip the sf.



## Pre-Midterm (ASM)

### Syntax:

- upper immediate is anything  $\geq 2^8 = 1000$ .
  - this is basically adding values above 12<sup>th</sup> bit to an index.
  - example adds upper immediate of an offset to the PC.
- jalr stores pc+4 in a register. commonly use "ra" to find where we jump back
- calling conventions!!
  - values of a, t registers do not carry over.
  - values of s registers & ra must be stored on stack, done w/ sp.



clock repeatedly turns on & off. this is the clock cycle.  
 Registers only update once per clock cycle, when clk turns on. however, you must wait until after clk-to-q timer to see updates.

- each component in the circuit has a propagation time.
- setup time: amount of time before a clock interval that input must be stable (constant)
- hold time: amount of time after a clock interval that input must be stable. depends on components.
- $t_{clk-to-q} + t_{shortest-combo-path} \leq t_{hold}$
- $t_{clk-to-q} + t_{longest-combo-path} + t_{setup} \leq t_{clock-period}$ .

SHORTEST comb. path is determined by finding least time between two timed components. (NOT necessarily reg's!)

LONGEST comb. path is determined by finding the longest time between two timed components

### Finite States

- detect a bit sequence or state
- model as state graph



## Caches

- intermediate b/w energy cost & time for registers & DRAM.

$$\lg(\text{mem size}) = \text{addr. bit width} = |T| + |I| + |O|$$

$$\text{Cache size} = (\text{blk size})(\# \text{blk}) = N \cdot 2^{101} + |I|$$

- n-way set associativity: we can put n addresses in a certain index
  - fully associative: there is no index, we just put addresses in the cache.
  - more associativity: ↓ hit speed, ↑ hit time, ↑ hit rate.
    - however, cache misses are much more costly than cache hits, so a respectable hit rate is crucial.
  - direct-mapped = one block per index.



### Cache Misses

- 1) compulsory miss — have not yet fetched block.
- 2) capacity miss — req'd data evicted because no space left.
- 3) conflict miss — need to update cache to fit the tag.



### Eviction Policy

- FIFO: first block added is evicted
- LRU: least recently used block is evicted
  - b/c of locality, this is better.

### AMAT = avg. memory access time.

$$(\text{hit time}) + (\text{miss rate}) \cdot (\text{miss penalty})$$

### Multi-level Caches

↳ usually increasing in associativity.

L1 cache → L2 cache → etc.

$$\text{AMAT} = \text{L1 HT} + \text{L1 MR} \cdot (\text{L2 HT} + \text{L2 MR} \cdot (\text{L3 ...}))$$

## Post - MT:

### Pipeline: (Data path in ref sheet)

- latency: delay to process operation
- throughput: # ops done in one op.
- pipelining increases latency, but also throughput.

There are 5 stages:

- IF  $\rightarrow$  instruction fetch. what's the instruction
- ID  $\rightarrow$  instruction decode - find the meaning of instruction
- EX  $\rightarrow$  execute - do stuff w/ branch comp, ALU, etc.
- MEM  $\rightarrow$  memory - put info into memory
- WB  $\rightarrow$  write back - what to reinsert into memory?

Must put registers between stages

Hazards:

- structural hazard
  - hardware doesn't support running these insts. @ same time,
  - solutions are either stalling or + HW, and cur. regfile implementation seems to work.
- data hazard
  - instructions have data dependency,
  - need to wait for memory to behave as intended before using that block.
  - 1) register access — same reg. written & read in 1 cycle.  
solution: write, then read
  - 2) ALU result — inst. depends on WB's register write from previous.  
solution: forwarding: grab info from pipeline stage, requiring add'l hardware.
  - 3) load data hazard — needs stall.  
need to access memory only changed during end of clock cycle, @ beginning of this cycle.  
 $\hookrightarrow$  need load delay, perf. can be salvaged by code scheduling.

### Control Hazard

- how to update the pipeline after branch is taken
  - $\hookrightarrow$  must kill all processes and start @ label if branch taken.
  - Sol. to speedup is to take a branch predictor.
  - Loss of 0 cc if correct, 3CC if wrong.

## Virtual Memory

Purpose: illusion of isolated processes, important for security.

- $\hookrightarrow$  Memory is stored at fixed-size pages.
- $\hookrightarrow$  can spoof addr. of physical memory w/ virtual memory.
- $\hookrightarrow$  can have arbitrary amounts of physical & virtual memory.
  - $\hookrightarrow$  each index can be represented as [P/N]PN  $\rightarrow$  offset.
- $\hookrightarrow$  memory manager translates, i.e.

TLB: translation lookaside buffer.

- $\hookrightarrow$  cache for VPN-PPN mappings.

Parallelism: Can use the fact that CPU has multiple cores to do multiple things @ once

SIMD: does vector addition, in order to save time for array addition. should have a speed increase proportional to vector size.

from Intel AVX library.

data-level parallelism

Amdahl's Law: Speedup bounded by anti. not ||izable.

$$\frac{1}{(1-F) + \frac{F}{S}}, \text{ where } F \text{ is \% code sped up. } S \text{ = sped factor.}$$

Open MP:

- split a looped procedure into a series of threads:

- #pragma omp parallel
  - $\hookrightarrow$  runs n threads in following block
- #pragma omp for
  - $\hookrightarrow$  splits for loop automatically
- #pragma critical
  - $\hookrightarrow$  prevents data races.

Multiple threads receiving + modifying same data @ unsync'd times

Whenever there's C code:  
for example,  $x = x + 1$   
this is a load, add, and store. All 3 actions can happen at different times, with any # of others in between.

- #pragma omp reduction  $\rightarrow$  compute the aggregate of some value in parallel

- run multiple processes = multiple independent instances.
- worker & manager processes.

Manager: assign work & inform user of progress  
Worker: performs task from manager.

Needs:

- Code that sends tasks w/ procID = D.
- Code that recvs tasks, every other process.

$\hookrightarrow$  Want to minimize size of message assigning new work.

- Page faults.



# OpenMPI

## MAP REDUCE & SPARK

- Many workers do small parts of large tasks.

map(*in-key*, *in-value*)

↳ (*int-key*, *int-value*)

reduce(*int-key*, *int-value*)

↳ (*out-value*)

↳ continuation of multiprocessing section above.

• initialize multiprocessing: MPI-Init(*argc*, *argv*)

• get # of nodes / proc ID (*MPI\_Comm\_size(comm, &size)*; *MPI\_Comm\_rank(comm, &rank)*)

• functions taken in pointers & values written to pointers;

• returns # of processes with successful.

• can use *MPI\_COMM\_WORLD* for common value.

• run acc. to worker / manager model?

• MPI-Recv(*buf*, *count*, *datatype*, *source*, *tag*, *comm*, *status*)

• MPI-Send(*buf*, *count*, *datatype*, *dest*, *tag*, *comm*)

• *buf*: array of data to be sent / buffer to receive data

• *datatype*: constants used to specify input type

• *count*: # elements of datatype to receive

• *dest*: forward, the procID of intended msg recipient

• *source*: for recv, the procID of expected msg sender.

• clean up w/ MPI-Finalize()

• *tag*: for when you want to further classify msgs.  
o send/recv only matches when tags identical

• *comm*: communication group; for this class  
set w/ MPI-Comm-World

• *status*: for recv, contains the source/tag, error

msg. of communication.

## HOW MUCH faster?

reg/func  
inlining: < 2x



loop  
unrolling: < 2x



cache opt: ~10x



SIMD: ~8x



openMP: #cores/node



openMPI: #cores.



### Pros?

• easy change

• reduces # accesses

• reduce branching

• surprisingly good

• fairly applicable  
• minimal overhead

• more flexible than SIMD  
or MPI

• can be extended  
arbitrarily large

### Cons?

• minimal effect

• compiler might not do

• minimal effect

• penalty to maintainability

• req's alg changes

• Amdahl  
• restricted by HW.

• concurrency issues  
• high overhead

• expensive comm.  
• high overhead.