

- Abstraction (Layers of representation/interpretation)
- Moore's Law (Designing through trends)
  - [2x transistors / chip every 2 years]
- Principle of Locality (Memory Hierarchy)
- Parallelism [requests / threads] / insts / data]
- Performance Measurement and Improvement
- Dependability via Redundancy

Number representation:  $2^6 = 64$ ,  $2^8 = 256$ ,  $2^{10} = 1024$   
 Aten  $\rightarrow$  7-bit  $\leftarrow$  7-bit Aten, then add 1  $\rightarrow$  8-bit  
 $\leftarrow$  7-bit n bits ( $2^n$ ) two's complement overflow: carry into MSB + carry out MSB  
 $0111111$  1's complement  
 $1000000$  0's complement  
 $111\cdots1$  sign magnitude  
 $000\cdots0$  0's exponent  
 $111\cdots1$  1's exponent  
 $000\cdots0$  0's fraction  
 $111\cdots1$  1's fraction  
 $000\cdots0$  0's bias  
 $111\cdots1$  1's bias  
 $000\cdots0$  0's exponent  
 $111\cdots1$  1's exponent  
 $000\cdots0$  0's fraction  
 $111\cdots1$  1's fraction  
 $000\cdots0$  0's bias

Decimal to Binary:  
 $\oplus$  2's complement:  $x = 2^n \cdot \text{frac} + \text{bias}$   
 $\ominus$  2's complement:  $x = 2^n \cdot \text{frac} - \text{bias}$   
 $\oplus$  2's complement:  $x = 2^n \cdot \text{frac} + \text{bias}$   
 $\ominus$  2's complement:  $x = 2^n \cdot \text{frac} - \text{bias}$

gcc: -Werror: make all warnings into errors.  
 -Werror=xxx: xxx is the specific type of error  
 -Wall: enable all warnings about constructions, some language specific warnings.  
 -Wextra: enable some extra warning flags that are enabled by wall.  
 -Wpedantic/-pedantic: issue all warnings demanded by strict C99/C++11; reject all warnings that use forbidden extensions  
 -g1-ggdb: enable debugging mode; add debug infos.  
 -O0: turn off all the optimizations  
 -m32: compile it for 32-bit machine  
 -fno-strict-aliasing: see result of preprocessing

Macro: #if str == "str" val#123 -> val123  
 Inline: not guaranteed to be inlined, decide by compiler.  
 -Os - ignore inline function; GCC try to inline most function if O2, -O3, recursive doesn't inline static doesn't behave well in macro, macro can't access private/protected variables.  
 #define SQUARE(x) ({ typeof(x) k=(x); k\*k; })

Enum a { RED, GREEN, BLUE }  
 enum a my\_a = RED #Automatic 0, 1, 2, 3, assign  
 typedef enum { ... } a; a my\_a = BLUE #2.

Integer: int should be integer type that target processor works with most efficiently, only guarantee sizeof: long long r, long i, int s, short t; short l, 16 bits, long z, All could be 64bit?  
 32bits machine, pointer, size-t, int size is 4.

Boolean: no explicit. False: 0 | NULL; True: all that is not 0 or 1, all b bits wise; a & b all b logical.  
 order:  $\rightarrow$  !t  $\rightarrow$  \*. In all,  $\rightarrow$ , () function call [Jyoti].

Structure: a = b, not a deep copy, members V things pointed to by pointers X.  
 Pointer: may cause segmentation faults (\*NULL) and bus error. defines that one element past end of array must be a valid address.

int main(int argc, char\* argv[]): argc: number of strings, count the path of executable file; char\* argv[]: an array of string. args: 8bytes -> pointer to 8bytes. dangling pointer: point to memory location that has been deleted. wild pointer: uninitialized pointer point to some arbitrary memory location.

void\* incre\_ptr( int\* h ) { \*h = \*h + 1; }  
 char a[] = "a message"; can't change a. Att X.  
 char b = "a message"; constant string. in static data

### C Memory:

void\* malloc(size\_t n); use headers at start of allocated blocks and space in unallocated memory to hold user's internal data structures.  
 void free(void\* p);

void\* memcpy(void\* dst, const void\* src, size\_t num); copy num bytes from location pointed to by src to dst. binary copy. does not check null.

strlen: size of dst & src (array) < num bytes. Should not overlap. (memmove is option)

void\* memmove(void\* dest, const void\* src, size\_t num); feel like an intermediate exists,

void\* realloc(void\*\* ptr, size\_t size); If ptr==NULL, then realloc behaves like malloc.

If size = 0, then behaves like free. memory place change.

void\* memset(void\* pcr, int value, size\_t num); set the first num bytes of the block of memory pointed by ptr to the specified value (aschar).

char\* strcpy(char\* dest, const char\* src); copy the C string pointed by src to dst, include \0

ASCII hex 30 -'0' hex41 -'A' hex11 -'a'. hex0 -null

RISC: instruction set architecture

RISC: Reduced instruction set computing. keep it simple, build fast hardware; let software do complicated ops by compiling simple ones.

Intel x86, CISC - opposite. (complex ISE) Assembly operands are registers (limited), built directly in hardware.

Standard no-op: add x0, x0, x0. Improve the disassembler and can potentially improve the processor.

Little-endian: Byte > Bytes > Byte > word address

Big-endian: Byte > Byte2 > Byte3 word address

Byte needs sign extended:  $\pm 128$  i.e. 0x80 aligned

Memory align: RISC-V does not require integer to be word aligned i.e. 0x1111, 4x1111, 8x1111... But in practice, aligned access is faster

RISC-V instr:  $\pm 16$  bits  $\pm 16$  inst. sign-extended.

jal rd, offset. rd = pc+4, pc = pc + offset <1>

jal offset(label) =  $\pm 32$  XI, offset pseudo-instr j offset = jal x0 offset

jalr rd, rs, offset. rd = pc+4, pc = rs + offset

If don't want to record where to jump: jr rs = jal x0, rs jrra function return & jal function(label).

ABI: Application Binary Interface. Define calling convention

When you call another function, that function promises not to overwrite certain registers.

Preserved across function call: sp, gp, tp, sa - C11 saver is callee; caller can rely on values being unchanged

Not preserved across function call: Argument / return register go - C7, ra, temporary registers to - t6 restored

ra - caller, callee can change them without saving

C: automatic - stack variables exist across exits from and entries to functions.

static - variables exist across exits from and entries to procedures.

extern - variables exist across exits from and entries to global variables.

Basic structure of a function:

entry-label: add i sp, sp, - frame/420

sw ra, framesize - 4(sp)

sw...

lw ra, framesize - 4(sp)

addi sp, sp, framesize

jr ra

RISC-V convention: global pointer gp points to static

Arguments > 9. > stack.

Same 32 bits instrs used for RV32, RV64, RV128

not to, to == y0, to, to, -

RISC-V features, nx/16-bit instrs. Extensions to RISC-V base 256A support 16-bit compressed instrs and also variable-length instrs that are multiples of 16-bits in length  $\pm 16$  32-bit instrs on either side of PC

LUI X10, 0XDEADB, H 0XDEAD0000  $\rightarrow$  0XDEAD0000

ADDI X10, X10, 0XEEF  $\rightarrow$  X10 = 0XPEAD0EEF

X10, 0x00000000

Branch & jal  $\rightarrow$  273. jal imm original value

quips X10, < 1020 bits > too big for a jal

jalr X0, X1, < 1020 bits >

Stack: local variables move up function, grows downward.

Heap: space requested for dynamic data via malloc();

Reallocate dynamically, grows upward.

Static data: variables declared outside functions, doesn't grow or shrink. Loaded when program starts, can be modified.

Code: load when program starts, does not change

Assembler directives: & that, double, align reserved

.data store subsequent items in the static segment at the next available address

.text store subsequent instructions in the text segment at the next available address (code segment)

.byte store bitvalues as 8-bit bytes

.asciz store subsequent strings in data segment and add null terminator

.word store listed values as unaligned 32-bit words

.globl makes the given label global. label e.g. sym

Environmental calls:

DLlod ID into a0 @ load any arguments into a1-a7.

(@) Any returnvalue is stored in argument registers (a1)

ID NAME Description

1 print-int prints integer in a1 (address is in a1)

2 print-string prints the null-terminated string whose

3 exit exit the program

4 print-character prints ascii character in a1

5 print-hex prints hex in a1. SOURCE:

data the name of the data.

STR: .asciz "Hello world"

6 word 3 successive

7 word 4 memory

8 word 8 block.

9 text

10 a0, 4

11 a1, STR. Load the address where STR is stored.

each

RISV-ISA Specification: does not define assembly syntax/registers

Class: RV32I V characterized by the width of the integer

RV32 RV64 registers and the corresponding size of various extension:

Compressed: some instrs 16 bits. M: integer x, y, V: reg

F: floating D: double Q: quad floating point A: atomic instr

Interpreter: a program that executes other programs.

Directly executes a program in the source language.

Translator: converts a program from the source language (compile) to an equivalent program in another language.

How to choose: Interpreter machine better error message

I: easier to write interpreter; closer to high level; slower but code smaller; provides IS independence

J: efficient; hide source from users.

Compiler: assembler linker

1. preprocessing (CPP) 16-0

2. change source program according to commands begin with H. e.g. include, define, ifendif.

gcc -E foo.c -o foo.o

Let compiler stop after precompiling

1. compiler (CC) may contain pseudo-insts.

High-level language  $\rightarrow$  Assembly language

gcc -S foo.c -o foo.s

stop after compiling and output assembly program

2. assembly (AS) Pe-relative addressing

① Read and use directives

② produce machine language, save in an objectfile.

Pseudo-replacement: mv X1,X2  $\rightarrow$  addi X1,X2,0

nop  $\rightarrow$  addi X0,X0,0

not rd,rs  $\rightarrow$  xor rd,rs,-1

begin, rs, offset  $\rightarrow$  beg rs, X0, offset

bgt rs, rt, offset  $\rightarrow$  blt rt, rs, offset

offset  $\rightarrow$  jal x0, offset | ret  $\rightarrow$  jalr x0,x1,offset

call offset  $\rightarrow$  ampe x0, offset[3:12]; jalr x0,x0,offset[11:12]

tail offset  $\rightarrow$  ampe x0, offset[3:12]; jalr x0,x0,offset[11:12]

too far for ...

Tail Call Optimization: doesn't preserve memory

and jump twice in fun(a){ ... return fool(y)} alike.

It can directly return to where it needs to return to rather than return to where it was called.

Assembly produce machine code

- ① 16 bits RV32I offer shorter 16-bit versions of common 32-bit instructions when a imm/offset is:
- b. one of register is x0/AB1 (link), AB1 stack (x2)
- c. rd=r31 d. used registers are the most 8 popular ones.

Assembly can pattern match and turn 32-bit instructions → 16-bit instructions.

i. Only assembler and RISC-V processor itself needs to know the presence of the 16-bit instructions.

- ② Forward reference e.g. beq ... L2 ... L2 ...

Taking 2 passes over the program.

First pass numbers position of labels; Second pass uses label positions to generate code.

Internal: jump to other file/stack data;

**Symbol Table:** list of items in this file that may be (relative) used by other files. (Label's function; address) Data: anything in the .data; global also, variables used across files.

**Relocation Table:** list of items whose address this file needs (Instr address). External (label jumped to (jnlr, jst)). include lib files / la insts. (jnlr for jalr, base registers)

(Data address) ③ data in static section: (a insts e.g. for lw/sw base registers).

Object File Format:

① Object File Header: size & position of the other pieces of the object file

② Text Segment: machine code

③ Data Segment: binary representation of the static data in the sourcefile

④ Relocation Information: lines of code needs to be fixed up

⑤ symbol table: -

⑥ Debugging infos.

gcc -c test.s → test.o

precompiled.

### 3. Linker (LD)

lib.o

Input: object code files + information tables  
Output: executable code under

- ① Take text segment from each .o, put together

- ② Take data ..., and concatenate them to end of

- ③ **Resolve Reference:** go through relocation table and fill in all absolute addresses [External jump, lw]

\* Address PIC relative addressing (beq, bne, ja) relocate PIC: position independent code. no need to

④ External Function Reference (jal), always

⑤ Static Data Reference (la/lw/lh/lw/lh): relocate

⑥ Absolute addressing [External jump, lw, sw to static data]

[Enable separate compilation of files].

### 4. LOADER (OS)

Input: executable file on disk

Output: load into memory and run start program.

If consider statically, don't choose loader time

All code bits are generated = always X compilation, no machine code generated

We have all info needed = code can also be known after compilation. we know.

Assembly is, internal branch, TAL language

Linker is, static or external jump V

Static linking: done by the linker after compiling.

The library which is static linked doesn't need to be provided when running the program.

Dynamically linking: done by loader before the program starts or by the program itself after running.

Library needs to be provided when running the program.



Law of Boolean Algebra:

$$\bar{x} \bar{z} = 0 \quad x \bar{z} = 1 \quad x z = 0 \quad x z = 1 \quad x \bar{z} = x \quad x z = x$$

$$xx = x \quad x+x = x \quad xy = yx \quad x+y = y+x$$

$$(xy)z = x(yz) \quad (x+y)z = x+z = y+z$$

$$x+y+z = x+y = x+y+z = x+y$$

$$xy = \bar{x} + y \quad \bar{x}y = \bar{y}$$

Combinational logic (CL): ALU → output is f of input

Sequential Logic (SL): state element store info

Max Delay = setup time + CLq + CL delay

Setup Time: when the input must be stable before the edge of the clock

Hold Time: when the input must be stable after the edge of the clock

Clock-to-Q: how long it takes the output to change measured from the edge of CLK

Flip-Flop: one bit of state that samples every rising edge of the CLK

Register: several bits of state that samples on rising edge of CLK or on LOAD

Critical path: CLK → Q + necessary to get the divide by reg output of the reg + worst case a + set up time for next reg

Hold Time Violation:

CLK → Q + best case CL < Hold time enough

Don't hold the input to the flip flop long

solution: add delay on the best case path

FSM: Finite State Machine

Stage 1: Instruction Fetch

① Fetch inst from memory ② PC=PC+4

Stage 2: Instruction decode, gather data from field

① faster than critical path length,

② Read opcode to determine inst type 2-field

time ③ Read in data from all necessary reg.

④ Imm generation controller get the inst

Stage 3: ALU + branch comparison + imm generate

Stage 4: Memory access: only load & store

read/write read/write

Stage 5: Register write

Store & branch skip this stage

asserted (1). negated/deasserted (0).

Register file 16x32 bits.

All 2-type sign-extend the imm.

PC sel & regsel.

(Hardware Description)

RTL: Register Transfer Level ⊂ HDL language

LW uses all 5 stages (exercise the critical path)

Hazard:

(Can always be solved by having more hardware.)

1. Structural hazard: a required resource is

① Regfile hazard solution: having separate ports

② independent read ports + one indep. write port)

③ Memory Access Hazard: use 2 separate memory

RISC-V designed to avoid structural hazards e.g. at most one memory access / inst.

2. Data hazard: data dependency between instrs.

① Register Access Hazard: exploit high speed of regfile: first half write second half read.

② ALU result hazard: 結果被重用導致錯誤。

solution: stalling for 2 insts.

Bubble: effective NOP - compiler can do that

solutions ⑦ Forwarding: grab operand from pipelined stage, rather than register file. Require connections in the datapath.

Forwarding control logic:

if r1s = m.rd, select the alu result of m to be an operand of x's EX stage. Must ignore write to x. \* no stall needed then.

### ⑧ Load Hazard:

Even use forwarding, 1 cycle stall unavoidable slot after a load is called a load delay slot.

if that inst use result of load, hardware stall for one cycle.

solution: code rescheduling, put irrelevant code right after load, compiler / modern cpus hardware

### ⑨ Control Hazard

solution: kill the following 2 insts if branch is taken

solution: Branch prediction. Do that in ZF. only flush pipeline if branch prediction was X.

single cycle data path: t setup + t memread + t pipeline + t wait + t memread + t regfile setup

Pipelined: MEM includes read/write to memory and muxing where to write to

EX includes branch comp, imm generator, mux and ALU. same time happen

Pipelined relative speed < 5x reasons:

① add pipeline registers have cost and set up times ② need to set the clock to the max of the 5 stages, which usually takes different amount of time. ③ hazard require additional logic to resolve, actual speedup may be even.

### Instruction-Level Parallelism (ILP)

#### 1. Hyperthreading

duplicate all elements that hold the state (reg); use the same CL block (ALU, memory)

2 independent processes, 2 threads

Small/no speedup: 在同一stage - 方向是正確的結果

2. Multiple issue "superscalar" ways cycle. replicate pipeline stages, start 3 insts per clock CPI < 1: use IPC (instruction per cycle)

3. Out-of-order execution

reorder insts dynamically in hardware to reduce impact of hazard.

Iron law of processor performance.

Time per program = insts per program × CPI × Time per clock

CPI = Cycles / inst = Time program / (insts prog × Time clock)

Complex pipeline: multiple Functional Units parallel execute

1. static Multiple Issues (Very Long Instruction Word) e.g. 2 insts per cycle. simpler tasks core of hazard

ALU X2, Imm gen X2.

IF → ID → Issue → ALU → MEM → WB

General Purpose Regs FPR Floating-point Regs FPR

Issue: assign insts to functional unit

2. Dynamic Multiple Issues.

Reorder insts. Out of Order OoD

IF → I-cache → Fetch Buffer → Decode/Rename → Issue → Buffer

Functional units → Result Buffer → Commit → Architectural state.

Fetch: inst bits retrieved from inst cache

Decode: inst dispatched to appropriate issue buffer

Execute: inst and operands issued to FU. When

|      | 0  | 0 |
|------|----|---|
| 0000 | 0  | 0 |
| 0001 | 1  | 1 |
| 0010 | 2  | 2 |
| 0011 | 3  | 3 |
| 0100 | 4  | 4 |
| 0101 | 5  | 5 |
| 0110 | 6  | 0 |
| 0111 | 7  | 7 |
| 1000 | 8  | 8 |
| 1001 | 9  | 9 |
| 1010 | 10 | A |
| 1011 | 11 | B |
| 1100 | 12 | C |
| 1101 | 13 | D |
| 1110 | 14 | E |
| 1111 | 15 | F |



VLIW: can issue multiple very  
ALU or branch inst long Inst  
Load or store inst Instropy  
combine 2 to 1/large

### Cache

Tag Set Index Block offset

Size of Block offset =  $\log_2 \# \text{of bytes / block}$

Size of index =  $\log_2 (\text{number of sets}) = \log_2 (\frac{\text{address}}{\text{block size}})$

Size of Tag = Address size -  $\log_2 \# \text{of bytes / block}$

Valid Bit: indicate whether this tag is valid for

Cache Flush: invalidate all entries

Victim Cache: a small fully associative cache

Cache Philosophy: programmer-invisible hardware mechanism to give illusion of speed of fastest memory with size of largest memory.

Use of Principle of Locality: program access small portion of address space at any instant of time (spatial) and repeatedly access that portion

Cache write:

write back: write only to cache and then write cache block back to memory only when evicted. array [i] = array[j] + j

read ① write ② write back ③ write back

Write Through: write cache + write through to memory. If store miss, first load the line

Write allocate:

No write allocate: Just write to the memory.

Cache Replacement Policy: second order effect: only happen on misses

① Random

② Least Recently Used small sets

③ First In First Out highly associative (Round Robin)

④ Not Most Recently Used (FIFO with exception of the most recently used)

Direct Mapped Cache:  $N=1$  = associativity

Line goes 1 place; only need 1 comparator.

If multiple words in a block, then also need a mux to select which word's data to return

Fully Associated:  $N = \text{Number of lines}$

Line can go anywhere; 1 sets; no index field

N-way set Associative:  $N$  places for a line

1 comparator. Nx1 select mux for comparators to check hit or not.

Total cache capacity = Associativity (W)  $\times$  # of

sets (S)  $\times$  blocksize (B) C = NSB

Average Memory Access Time (AMAT) = Hit time + Miss Rate  $\times$  Miss Penalty

Local miss rate  $\# L_1 = L_1 \text{ misses}$

Global miss rate  $\# L_2 = L_2 \text{ misses}$

$\hookrightarrow \# L_1 + \# L_2 = \text{Total access}$

= Local  $\# L_1 \times \text{local } \# L_1$

AMAT =  $L_1 \text{ hit} + L_1 \text{ miss} \times L_1 \text{ hit} + L_2 \text{ miss}$

$\times (L_2 \text{ hit} + L_2 \text{ miss} \times L_2 \text{ miss penalty})$

3C: total defines miss rate.

1. Compulsory (cold start / process migration)

ID: talking + turned + setup

EX: talking + turned + ALU + setup  
 $\#_{\text{ex}} = 1$

MEM: talking + turned + turned + setup

WB: talking + turned + setup + turned

$\#_{\text{wb}} = 1$

$\#_{\text{turn}} = 1$

$\#_{\text{turn}} = 1$

$\#_{\text{turn}} = 1$

Solution: block size ++, increase miss penalty, very large increase miss rate).

Capacity:

solution cache size ++ (may increase access time)

Conflict:

solution: ① cache size ++ (may increase access time)

Bad for conflict: block size ++.

Performance  $\Rightarrow 1/\text{Execution time}$

{Latency (execution time)}

Bandwidth (throughput) Task completed per unit of time

Iron Law of performance:

CPU time  $\frac{\text{Clock cycles}}{\text{Program}} \times \text{Clock cycle time}$

instruction count  $= \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Average clock cycles}}{\text{Instruction}}$

Workload & Benchmark:

workload: set of programs run on a computer

benchmark: standardized programs selected for use in comparing performance

SPEC: (system performance evaluation cooperative)

12 integer programs

17 FP programs

SPEC ratio =  $\frac{\text{old reference execution time}}{\text{new execution time}}$

Execution time ratio  $i$ : Give same relative answer no matter what computer is used as reference

Flynn Taxonomy

Single Instruction / Single Data Stream SISD

e.g. superscalar, RISC processor

Single Instruction / Multiple Data Stream SIMD

Multiple Instruction / Multiple Data Stream MIMD

e.g. Multicore, Warehouse-scale computers

Multiple Instruction / single data Stream MISD

(SIMD: simple program multiplexed data)

single program runs on all processors of a MIMD, use synchronization primitive to do cross processor execution coordination

Amahl's Law

Speedup =  $\frac{1}{1 + \frac{P}{N} - \frac{1}{N}}$  fraction 1

Strong scaling:  $\#_{\text{processors}} > 1$ , speed up

Weak scaling:  $\#_{\text{processors}} > 1$ , speed up  $\#_{\text{processors}} \rightarrow \infty$

Load Balance:  $\#_{\text{processors}} \rightarrow \infty$  processor work  $\rightarrow$  queue

Intrinsics: c functions and procedures for inserting assembly language into C code

Parallelism: the only path to higher performance

① Sequential processor performance not expected to much, might even drop. ② Apps with more capability > 1.

③ In mobile systems, use multiple cores 2xP.

④ WSC, use multiple nodes, and all the MIMD/SIMD capability of each node.

|                                   | out of order superscalar             | inorder superscalar                 | VLIW                                        |
|-----------------------------------|--------------------------------------|-------------------------------------|---------------------------------------------|
| Instruction per cycle             | $\leq N$                             | $\leq N$                            | 1 large inst.<br>Same works<br>Normal insts |
| How do we find independent insts. | Look at $> N$ insts in program order | Look at next insts in program order | Just do next large inst                     |
| Hardware cost                     | Expensive                            | Less Expensive                      | Even less Expensive                         |
| Help from compiler                | Compiler can help                    | Needs help                          | Completely depend on compiler               |

In-order superscalar: CPI  $\geq 1$ . SISD that supports SIMD.

Fetch 2 insts per cycle; issue both simultaneously if one is integer/memory and the other is floating point (FPU → under issue)

Out of order superscalar:

Reservation station, Functional units (ALU, Mem), Commit state (Reg file), waiting for resources

when execution completes, all results and exception flags are available.

Commit: instruction irrevocably updates architectural state (aka "graduation"), or take precise trap / interrupt.

Separating completion from commit:

Re-order buffer (ROB) holds register results from completion until commit.

In-order: IF / ID / rename / dispatch for issue

Out of order: EX put into machine buffers to wait

In-order issue: stalls when dependencies / structural hazard / Read After Write / Write After Read / Write After Write

Out of order: Inorder to reservation station, but what functional units are free & avoid data hazard.

In-order completion: classic RISC integer pipeline.

Out of order completion: all machine FUs (integers & FPs, but desire to dispatch FPs)

Register renaming:

Ideas: Re-name on the fly to resolve fake dependencies

Additional benefit: CPU can have more physical reg than logical

Alternative to Data-in-RDP: Unified Physical Register File

When next writer of same architectural register commits, we can reuse a physical register.

| Exponent | Significant | Object               |
|----------|-------------|----------------------|
| 0        | 0           | 0                    |
| 0        | nonzero     | denom                |
| 1-254    | anything    | +/- floating point # |
| 255      | 0           | +/- 0.0              |
| 255      | nonzero     | NaN                  |

Denom: No leading 1. implicit expo-126. ..smallest pos #: 2<sup>-23</sup> × 2<sup>-126</sup>

**Thread**: a sequence flow of insts that perform some tasks.  
**OS Thread**: time multiplexing software threads onto hardware threads.

Active a software thread / load its reg & PC from memory into a hardware thread's reg & jump to its saved PC.  
**Directive**: interrupt, save its PC & reg.

**Hardware Multithreading**: utilize expensive processor (Hyper-threading) resource

### OPENMP

#include <omp.h>

compile: gcc prog.c -fopenmp

Programming model: Fork/join

1. #pragma: #! shared,

#pragma omp parallel private(x) → make private

#pragma omp parallel for → make loop index private

2. openmp thread: OS thread

    omp\_set\_num\_threads(n) → set manually

    omp\_get\_num\_threads() → # of thread

    omp\_get\_thread\_num() → ID

3. Data race: Different threads try to access a

    same location in memory, and at least 1 is write

RISC-V 2 solutions

<> Read-write Pair

    lr rdwr: Load the word pointed to by rs1 into rd, and add a reservation.

    sc rd,rs1,rs2: Store rs2 to MEM(rs1) only if the reservation of rs1 is valid and set status in rd. Return 0 when rs1 is valid (r); Return nonzero (failure) when rs1 is 0. Store it to rd.

try:

```
(r t1,s1    # read reserved
 sc t0,s1,s4 # store conditional
 bne t0,x0,try
    If another thread runs sc,
    this will fail.
    add s4,x0,t1)
```

Test-and-set

    Test to see if a memory location is set; set it to 1 if it is not, otherwise keep testing.

try:

```
lr t1,s1      Locked:
             # critical section
bne t1,x0,try
sc t0,s1,t2
            unlock:
            sw x0,0(s1)
bne t0,x0,try
```

=> RISC-V Atomic Memory Operations (AMO).

Automatically perform an operation on an operand in memory and set the destination register to the original memory value.

e.g. `lwnadd.w rd,rs2,(rs1):`

$t = M[x[rs1]]$

$x[rd] = t$

$M[x[rs2]] = t + \alpha[x[rs2]]$

4. for (1) for this thread's elements are private.

    ① Divide index region sequentially  $0 \rightarrow \frac{n}{2} \rightarrow n$ .

    ② No premature exits from the loop allowed.

5. Timing double omp\_get\_wtime(void)

6. OpenMP reduction (if no data race)

#pragma omp for reduction (private:变量名)

Var is private to each thread.

Memory Mapped I/O:

- I/O address is I/O registers

<> Polling: processor reads control register of device in loop until device sets Ready Bit in control reg to 1.

Then processor load / write, e.g. mouse

<> Interrupt: device interrupt program when I/O is ready

return when done with data transfer.

interrupt handler: function called when an interrupt exists. CPU 寻找 interrupt table 找到地址。

jal to handler & store any state. Handler run on current stack and returns on finish. Thread 1 会继续运行

Interrupt: external (e.g. key press) Asynchronous interrupt

Exception: internal (e.g. page fault, bus error, illegal

insts). Synchronous, 同步的 must.

Trap: action of servicing interrupt or exception

by hardware jump to "trap handler code"

Supervisor mode: enforce resource constraints to apps.  
 Process enters the supervisor mode by using an interrupt, and change out of it using a special inst.

System call into an OS routine

Set up function arguments in registers, and then raise software interrupt. Then OS does it work and return to user mode.

=> OS mediate access to all resources, including device I/O

/Multiprogramming

Context switch: When jumping into process, set timer interrupt. When it expires, clear PC, registers,

etc; pick a different process to run and load its state.

Set timer, change to user mode. Jump to new PC.

Scheduling: Which process decide.

/Dynamically-linked library & static

DLL advantage: ① program requires less disk space

if DLL is shared between programs ② execution of 2 programs

requires less memory, also better performance again due to better caching ③ a replaced library doesn't

improves the program that uses this library

DLL disadvantage: ① At loading or runtime there is

overhead to dynamically link ② Having the executable

is not enough, we still need the correct major version of the library.

/Single cycle CPU inefficiency:

Not all instructions exercise the critical path. It is not parallelized.

Each component can be active concurrently slow.

/What does OS do?

① One of the 1st things that runs when your computer starts (right after firmware/bootloader)

② Load, run and manage programs:

a. Multiple programs at the same time (time sharing)

b. Isolate programs from each other (isolation)

c. multiplex resources between apps, e.g. apps

d. Services: File system, network stack, pointer, etc.

e. Find and control all the devices in the machine in a general way (using "device driver")

/What does OS core do?

① Provide interaction with outside world (with devices)

② Provide isolation between running programs (processes)

/Handling I/O (HDD, Flash, Network)

① BIOS: find storage device and load first sector

② Bootloader: stored on disk) load OS kernel + ramdisk

of the HDD into memory and jump into it

③ OS: boot: initialize services, drivers, etc.

④ Init: Launch an app that waits for input in loop.

/Basic Input/Output System

/Unified Extensible Firmware Interface: successor of BIOS

/Handling Trap in Inorder Pipeline

① Add Exception flags in pipeline until commit point

② Exceptions in earlier inst / pipe stage override exceptions in later inst / pipe stage

③ Inject external interrupt at commit point

④ If exception/interrupt at commit: update cause and

SEPC registers, kill all stages, inject handler PC into fetch stage.

/Launching Application:

① Created by another process calling into an OS routine using syscall

② Loads exefile from disk using the file system service

and puts instrs & data into memory, prepare stack & heap

③ set argc and argv, jump into main function.

3 Reasons for Virtual Memory:

① Add Disks to Hierarchy

② Simplify memory for apps (give them a virtual view of memory).

③ Protection between processes (OS controls translation mechanism between virtual and physical address).

/Field Programmable Gate Array

④ Functional Block: all can be reprogrammed with

I/O Block: configurations to implement Routing network, app-specific digital circuit.

Configuration File: store a bit stream.

1. Functional Block = Look UP Table type.

consists of different # of LUTs. LUT has different size.

Input: n, Output: 1. → represent  $2^n$  function.

2. Routing: switch block at \*

Hardware Description Language: VHDL & Verilog.

Warehouse Scale Computers

1. Simple Parallelism

① Request-level parallelism e.g. web search

② Data-level parallelism e.g. image classifier training

2. Scale

① Scale of economy: low per-unit cost

High number of risks: disk fail / hour

3. Operation cost count

long lifetime

cost of equipment purchases & cost of ownership

Server rack ... Array cluster

Transfer bottleneck distributed data

Processor FPU ALU ALC (configurable)

Mem: SRAM, DRAM, ROM (configurable)

order: sequential, parallel, pipelined, no wire

power: by switching activity

delay: Vdd

Energy

## Wise Storage Hierarchy:

Lower latency to DRAM in another server than local disk; Higher bandwidth to local disk than to DRAM in another server.

## Wise Power Use

$$\text{Power Usage Effectiveness} = \frac{\text{Total Building Power}}{\text{IT Equipment Power}} \times \text{center/networking efficiency}$$

PEA: consume no power when idle; gradually consume more power as the activity level ↑. PEA diagram

## Request-Level Parallelism RLP

e.g. Web search

### Implementation Strategy:

- ① randomly distribute entries
- ② make many copies of search data (carts)
- ③ load balance requests across replicas.
2. Redundant copies of indices and documents.
3. Breakup hot spots
  - ① increase opportunity for request-level parallelism
  - ② make system more tolerant of failures

## Data Level Parallelism DLP

{ SIMD: 单一机器, 处理多线程文件  
DLP on WS(s): 多台机器, map/reduce & scalable

### Map Reduce:

1. underlying runtime system:
  - ① automatically parallelize the computation across large scale clusters of machines
  - ② handle machine failure
  - ③ schedule inter-machine communication to make efficient use of the networks
2. 运行
  - ① Map( key,  value) → list()
  - e.g. map(key, value) key: doc name  
value: doc content  
for each word w in value:  
emit(w, 1).
  - ② Reduce() key, list() → list()
  - e.g. reduce(key, values) key: word  
result = 0  
for each v in values:  
result += v  
emit(key, result)

- ~~Map Reduce~~:
- ① split inputs, startup programs on a cluster of machines
  - ② assign map & reduce to idle workers
  - ③ perform a map task, generate intermediate key/value pairs.
  - ④ write to the buffers
  - ⑤ read intermediate key/value pairs, sort them by key.
  - ⑥ perform a reduce task for each intermediate key > write the result to the output files.

### Map Reduce

- Coin Pair:
1. it's the number of coins of each denomination that a person has
  2. it's the total amount of value of coins (string coin type)
1. Map(CoinPair, pair):
 

```
emit(pair, 1)
```
  - ② Reduce(CoinPair, Iterable<int> count):
 

```
total = 0
      for num in count:
          total += num
      emit(pair, total)
```
  2. Map(tuple<CoinPair, int> output):
 

```
pairs, amount = output
      emit(pair, person, valueOfCoin(pair.coinType), amount)
```
  - ② Reduce(string person, Iterable<float> values):
 

```
total = 0
      for i in values:
          total += i
      emit(person, total)
```

## Spark & Hadoop

Hadoop: open source Map Reduce framework,

has a Hadoop Distributed File System

Spark: high abstraction of map/reduce

Resilient Distributed Dataset (RDD) are the primary abstraction of a distributed collection of items.

• Transform: RDD → RDD

• map(f): return a new transformed item formed by calling f on a source element

• flatmap(f): return a sequence, rather than a single item, each input item can be mapped to 0 or more output items.

• reduceByKey(f): when called on a dataset of (k, v) pairs, returns a dataset of (k, v) pairs where values for each key are aggregated using f: (v, v) → v.

• Actions: RDD → value

reduce(f). ~~key~~ aggregate.

sc.parallelize(data): parallelize a python collection, data.

Example:  $\rightarrow (\text{person}, \text{coinpairs})$

coinData = sc.parallelize(coinPairs)

out1 = coinData.map(lambda (k, v): ((k, k), v))  
reduceByKey(lambda (k, v): v[0])

out2 = out1.map(lambda (k, v): (k[0], v[0].value))  
reduceByKey(lambda (k, v1, v2): v1 + v2)

WordCount example:

text-RDD, flatmap(lambda x: x.split(''))  
map(lambda x: (x, 1))  
reduceByKey(lambda x, y: x + y)

## Hamming ECC

|     |    |    |    |    |    |    |    |    |     |     |     |     |     |     |     |     |
|-----|----|----|----|----|----|----|----|----|-----|-----|-----|-----|-----|-----|-----|-----|
| 1   | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10  | 11  | 12  | 13  | 14  | 15  | 16  | 17  |
| P1  | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 | P12 | P13 | P14 | P15 | P16 | P17 |
| X   | X  | X  | X  | X  | X  | X  | X  | X  | X   | X   | X   | X   | X   | X   | X   | X   |
| P2  | X  | X  | X  | X  | X  | X  | X  | X  | X   | X   | X   | X   | X   | X   | X   | X   |
| P4  |    | X  | X  | X  | X  | X  | X  | X  | X   | X   | X   | X   | X   | X   | X   | X   |
| P8  |    |    | X  | X  | X  | X  | X  | X  | X   | X   | X   | X   | X   | X   | X   | X   |
| P16 |    |    |    | X  | X  | X  | X  | X  | X   | X   | X   | X   | X   | X   | X   | X   |

Hamming code.

Hamming distance = different in # of bits

Min Hamming distance of valid parity code is 2.

1 bit parity X bit detect odd number of errors.

Hamming code 8 bit → 10 bits error, 1 bit error

Hamming code 7 bit → detect all 3 bit errors.

## Dependability via redundancy

Spatial redundancy: replicated data or check information or hardware to handle hard & soft failures.

Temporal redundancy: redundancy in time to handle soft failures

## Dependability measures:

Dependability: Mean Time To Failure (MTTF)

Service interruption: Mean Time To Repair (MTTR)

MTBF: mean time between failure

$$MTBF = MTTF + MTTR$$

$$\text{Availability} = \frac{MTTF}{MTTF + MTTR}$$

$$\text{Number of 9s: } 3: 91.9\%$$

## Reliability Measures:

AFR: annualized failure rate

e.g. 1000 disks, 10000 hours MTTF

$$1000 \text{ disk} \times 24 \text{ hrs} \times 115 \text{ days} / 10000 \text{ hours} = 87.6 \text{ failed.} \Rightarrow AFR = 8.76\%$$

## Dependability Design Principle:

Design principle: no single point of failure

Amazfit's law #1

soft error:  $\rightarrow \text{1/2} \times 10^{-12}$

hard error: disk

EDC / ECC

RAID: redundant arrays of inexpensive disks  
#1000TB.

## Direct Memory Access DMA

Allow I/O devices to directly read/write main memory; hardware supported by the DMA Engine.

### 1. DMA Engine

Contains registers written by CPU:  
Memory address to place data, # of bytes;  
I/O device #, direction of transfer;  
unit of transfer, amount to transfer per burst.

### 2. DMA: Incoming data

- ① receive interrupt from device
- ② CPU takes interrupt, begins transfer - instruct DMA engine/device to place data at certain address
- ③ Device / DMA engine handle the transfer - CPU free to do other things
- ④ Upon completion, device / DMA Engine interrupt the CPU again

### 3. DMA: Outgoing data

- ① CPU decides to initiate transfer, config that external device is ready
- ② CPU begins transfer - instruct DMA engine/device that data is available at certain address
- ③ Device / DMA engine handle the transfer - CPU free to do other things
- ④ Device / DMA engine interrupt the CPU again to signal completion.

### 4. Arbitrate between CPU and DMA

Engine / Device access to memory.

#### ① Burst mode

start transfer of data block, CPU cannot access memory in the meantime

#### ② Cycle stealing mode

DMA engine transfer a byte, release control, then repeats - interleaves processor / DMA engine access.

#### ③ Transparent mode

DMA transfer only occurs when CPU is not using the system bus.

|                   | Configuration                                         | Pro / Good for                    | Con / Bad for                                                                     | Picture                                                                            | Meltdown                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|-------------------|-------------------------------------------------------|-----------------------------------|-----------------------------------------------------------------------------------|------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| RAID 0 (striping) | split data across multiple disks                      | No overhead, fast read & write    | Reliability                                                                       |  | out of order, leak/dump memory<br>we want to read from OS pages that hold data of other processes. Those are in our VirtualMem space in order for the OS to process them quickly, but it is forbidden for our process to read them. If we try to read them, the OS will check (with an if) if we are allowed to read them and thus return with a page fault. But speculative execution will nevertheless cause those pages to be read, but the process doesn't see the result. |
| RAID 1            | mirrored disks: exact copy of data                    | Fast read/write, fast recovery    | High overhead (redundant writes)                                                  |  | If we try to read them, the OS will check (with an if) if we are allowed to read them and thus return with a page fault. But speculative execution will nevertheless cause those pages to be read, but the process doesn't see the result.                                                                                                                                                                                                                                     |
| RAID 3            | Byte level striping with single parity disk           | smallest overhead to check parity | Need to read all disks, even for read to detect errors                            |                                                                                    | We use this to take the to-be-read value and use it as an offset in an array. We later check all the array if it is in the cache (using timing) - if so we know which range of value the variable that we were not supposed to read had.                                                                                                                                                                                                                                       |
| RAID 4            | Block level striping with single parity disk          | Higher throughput for small reads | Still slow for small writes                                                       |                                                                                    | raise exception();<br>access (probe-array [secret * 4096])                                                                                                                                                                                                                                                                                                                                                                                                                     |
| RAID 5            | Block level striping, parity distributed across disks | Higher throughput of small writes | The time to repair a disk is longer than another disk might fail at the same time |                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |

Disk

$\frac{1}{2} \cdot \text{one cycle (s)}$

Disk access time = seek time + rotation time + transfer time + controller overhead.

$\frac{1}{2} \cdot \text{number of tracks} \times \text{time to move across one track}$

## Network

### 1. Shared vs Switch:

Data frame pairs communicate at same time

Aggregate bandwidth (BW) in switched network is many times that of shared.

2. Links connect switches and/or routers to each other and to computers / devices

3. Layering, redundancy, protocols, encapsulation  $\rightarrow$  abstraction

### 4. Hierarchy of layers

- App (chat client, game, ...)
- Transport (TCP, UDP)
- Network (IP)
- Data Link layer (Ethernet)
- Physical link (Copper, wireless, etc)

5. Protocol: packet structure and control commands to manage communication

Protocol family: a set of cooperating protocols that implement the network stack.

TCP / IP: Transmission Control

Protocol / Internet Protocol.

| Disk | Definition                                                                                                                                                 | Pro / Good for                                                                                                                                                                          | Con                                                                                                                                                                           |
|------|------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|      | Polling: Forces the hardware to wait on ready bit. Alternatively, if timing of device is known, the ready bit can be polled at the frequency of the device | Low Latency; low overhead when data is available.<br>Good for devices that are always busy or when you can't make progress until the device replies                                     | Can't do anything else while polling,<br>Can't sleep while polling (CPU 全部运行)                                                                                                 |
|      | Edge interrupts: Hardware fires an interrupt when it becomes ready. CPU changes PC register to execute code in the interrupt handler when this occurs      | Can do useful work while waiting for response; can wait on many things at once.<br>Good for devices that take a long time to respond, especially if you can do other work while waiting | Nondeterministic when interrupt occurs; interrupt handler has some overhead (e.g. save all registers, flush pipeline, etc).<br>Higher latency per event.<br>Worse throughput. |

|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |                                                                                                                                                                                        |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <p>Spectre: branch prediction<br/> <math>\text{if } (x &lt; \text{array\_size}) \leq \text{cache miss}</math>, so run next<br/> <math>y = \text{array2}[\text{array1}[x] * 4096]</math><br/> <math>\uparrow</math><br/> <math>\text{array1}[x] \text{ cache hit, as } k \text{ is cached, so load } \text{array2}[k * 4096]</math></p> <p>Prerequisite:</p> <ol style="list-style-type: none"> <li>1. <math>\text{array1}[x]</math>, with an out of bound <math>x</math> (larger than <math>\text{array1\_size}</math>), resolves to a secret byte <math>k</math> that is cached.</li> <li>2. <math>\text{array1\_size}</math> and <math>\text{array2}</math> uncached</li> <li>3. previous <math>x</math> value have been valid.</li> </ol> <p>Regarding a misprediction with an illegal <math>x</math>, <math>\text{array2}[k * 4096]</math> will not be used, but <math>k</math> has been loaded into CPU cache. We can use Flush + Reload to guess <math>k</math> with <math>\text{array2}</math>.</p> | <p>Flush + Reload.</p> <ol style="list-style-type: none"> <li>① Permission check for offflush</li> <li>② disallowance of memory sharing</li> <li>③ Tuning software programs</li> </ol> |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |                                                                                                                                                                                        |

### T-SEUDO INSTRUCTIONS

| MNEMONIC | NAME           | DESCRIPTION                                     | USES       |
|----------|----------------|-------------------------------------------------|------------|
| beqz     | Branch = zero  | $iR[rs1] == 0 \rightarrow PC = PC + [imm, 1b0]$ | beq<br>bne |
| branch   | Branch # zero  | $iR[rs1] < 0 \rightarrow PC = PC + [imm, 1b0]$  | bne        |
| absval   | Absolute Value | $F[rd] =  F[rs1]  < 0 \rightarrow F[rs1]$       | fsgnx      |
| tpmove   | TP Move        | $F[rd] = F[rs1]$                                | tpmove     |
| fpnegate | FP negate      | $F[rd] = -F[rs1]$                               | fpnegate   |
| jmp      | Jump           | $PC = [imm, 1b0]$                               | jmp        |
| jx       | Jump register  | $PC = R[rs1]$                                   | jx         |
| la       | Load address   | $R[rd] = address$                               | la         |
| lui      | Load imm       | $R[rd] = imm$                                   | lui        |
| move     | Move           | $R[rd] = R[rs1]$                                | move       |
| neg      | Negative       | $R[rd] = -R[rs1]$                               | neg        |
| nop      | No operation   | $R[rd] = R[0]$                                  | nop        |
| not      | Not            | $R[rd] = ~R[rs1]$                               | not        |
| rmt      | Return         | $PC = R[1]$                                     | rmt        |
| seteqz   | Set = zero     | $R[rd] = (R[rs1] == 0) ? 1 : 0$                 | seteqz     |
| setneqz  | Set ≠ zero     | $R[rd] = (R[rs1] != 0) ? 1 : 0$                 | setneqz    |

### ARITHMETIC CORE INSTRUCTION SET

| MNEMONIC | FNITNAME               | DESCRIPTION (in Verilog)                     | NOTE |
|----------|------------------------|----------------------------------------------|------|
| mult     | MULHigh                | $R[rd] = R[rs1] * R[rs2]$                    | 1)   |
| multlu   | MULHigh Unsigned       | $R[rd] = R[rs1] * R[rs2] \text{ (0:27, 64)}$ | 2)   |
| multlh   | MULHighUpperHalfSigned | $R[rd] = R[rs1] * R[rs2] \text{ (0:27, 64)}$ | 6)   |
| div      | Divide (Signed)        | $R[rd] = (R[rs1] / R[rs2])$                  | 1)   |
| divu     | Divide (Unsigned)      | $R[rd] = (R[rs1] / R[rs2])$                  | 2)   |
| rem      | Remainder (Signed)     | $R[rd] = (R[rs1] % R[rs2])$                  | 1)   |
| remu     | Remainder Unsigned     | $R[rd] = (R[rs1] % R[rs2])$                  | 1,2) |

### RV64A Atomic Extension

| R, ADD                      | 9)                  |                                                                                                    |      |
|-----------------------------|---------------------|----------------------------------------------------------------------------------------------------|------|
| atomicadd, w, atomicadd.d   | R AND               | $R[rd] = M[R[rs1]], R[rd] = M[R[rs1]] + R[rs2]$                                                    | 9)   |
| atomicand, w, atomicand.d   | R AND               | $R[rd] = M[R[rs1]], R[rd] = M[R[rs1]] \& R[rs2]$                                                   | 9)   |
| atomicmax, w, atomicmax.d   | R MAXIMUM           | $R[rd] = M[R[rs1]], R[rd] = M[R[rs1]] \text{ if } (R[rs2] > M[R[rs1]]) \text{ M[R[rs1]] = R[rs2]}$ | 9)   |
| atomicmin, w, atomicmin.d   | R MAXIMUM Unsigned  | $R[rd] = M[R[rs1]], R[rd] = M[R[rs1]] \text{ if } (R[rs2] > M[R[rs1]]) \text{ M[R[rs1]] = R[rs2]}$ | 2,9) |
| atomicmin, w, atomicmin.u.d | R MINIMUM           | $R[rd] = M[R[rs1]], R[rd] = M[R[rs1]] \text{ if } (R[rs2] < M[R[rs1]]) \text{ M[R[rs1]] = R[rs2]}$ | 9)   |
| atomicmin, w, atomicmin.u.d | R MINIMUM Unsigned  | $R[rd] = M[R[rs1]], R[rd] = M[R[rs1]] \text{ if } (R[rs2] < M[R[rs1]]) \text{ M[R[rs1]] = R[rs2]}$ | 2,9) |
| atomicor, w, atomicor.d     | R OR                | $R[rd] = M[R[rs1]], R[rd] = M[R[rs1]] \text{ if } (R[rs2] < M[R[rs1]]) \text{ M[R[rs1]] = R[rs2]}$ | 9)   |
| atomicswap, w, atomicswap.d | R SWAP              | $R[rd] = M[R[rs1]], M[R[rs1]] \text{ if } (R[rs2] & M[R[rs1]] == M[R[rs1]] \wedge R[rs2])$         | 9)   |
| atomicxor, w, atomicxor.d   | R XOR               | $R[rd] = M[R[rs1]], M[R[rs1]] \text{ if } (R[rs2] & M[R[rs1]] == M[R[rs1]] \wedge R[rs2])$         | 9)   |
| lrx, w, lrx.d               | R Load Reserved     | reservation on [R[rs1]]                                                                            |      |
| swc, w, swc.d               | R Store Conditional | if reserved, $M[R[rs1]] = R[rs2]$ ,<br>$R[rd] = 0$ else $R[rd] = 1$                                |      |

### CORE INSTRUCTION FORMATS

| 31 | 27    | 25                  | 24  | 20  | 19    | 15    | 14          | 12 | 11 | 7 | 6 | 0      |
|----|-------|---------------------|-----|-----|-------|-------|-------------|----|----|---|---|--------|
| R  | func7 |                     | rs2 | rs1 | func3 | rd    |             |    |    |   |   | Opcode |
| S  |       | imm[11:0]           |     | rs2 | rs1   | func3 | imm[4:0]    |    |    |   |   | Opcode |
| SB |       | imm[12][0:5]        |     | rs2 | rs1   | func3 | imm[4:1][1] |    |    |   |   | Opcode |
| U  |       | imm[31:12]          |     | rd  |       | rd    | opcode2     |    |    |   |   | Opcode |
| UJ |       | imm[20][0:11][9:12] |     |     |       | rd    | opcode      |    |    |   |   | Opcode |

### REGISTER NAME, USE, CALLING CONVENTION

| REGISTER | NAME                    | USE                                 | SAVER  |
|----------|-------------------------|-------------------------------------|--------|
| x0       | zero                    | The constant value 0                | N.A.   |
| x1       | ra                      | Return address                      | Callee |
| x2       | sp                      | Stack pointer                       | Callee |
| x3       | gp                      | Global pointer                      | -      |
| x4       | tp                      | Thread pointer                      | -      |
| x5,x7    | t0-t2                   | Temporary                           | Callee |
| x8       | s0-s2                   | Saved register/Frame pointer        | Callee |
| x9       | s1                      | Saved register                      | Callee |
| x10-x11  | a0-a1                   | Function arguments/Return values    | Callee |
| x12-x17  | a2-a7                   | Function arguments                  | Callee |
| x18-x27  | s0-s11                  | Saved registers                     | Callee |
| x28-x31  | t3-t6                   | Temporaries                         | Callee |
| x32      | f0-f9                   | FP Temporaries                      | Callee |
| x33      | f10-f11                 | FP Saved registers                  | Callee |
| x34      | f12-f17                 | FP Function arguments/Return values | Callee |
| x35      | f18-f27                 | FP Saved registers                  | Callee |
| x36      | f28-f31                 | FP Function arguments               | Callee |
| x37      | R[rd] = R[rs1] + R[rs2] | R[rd] = R[rs1] + R[rs2]             | Callee |

(3)

### IEEE 754 FLOATING-POINT STANDARD

(-1)<sup>s</sup> × (1 + Fraction) × 2<sup>(Exponent - bias)</sup>

where Half-Precision Bias = 15, Single-Precision Bias = 127,  
Double-Precision Bias = 1023, Quad-Precision Bias = 16383

IEEE Half-, Single-, Double-, and Quad-Precision Formats:

| S   | Exponent | Fraction  |
|-----|----------|-----------|
| 15  | 14       | 10 9 0    |
| 31  | 30       | 23 22 0   |
| 63  | 62       | 52 51 0   |
| S   | Exponent | Fraction  |
| 127 | 126      | 112 111 0 |



Text  $\leftarrow$   $\begin{matrix} SP \\ \downarrow \end{matrix}$   $\leftarrow$   $\begin{matrix} \text{Saved Registers} \\ \text{Stack Grows} \end{matrix}$

SIZE PREFIXES AND SYMBOLS

| SIZE             | PREFIX | SYMBOL | SIZE              | PREFIX | SYMBOL |
|------------------|--------|--------|-------------------|--------|--------|
| 10 <sup>3</sup>  | Kilo-  | K      | 2 <sup>30</sup>   | Kibi-  | Ki     |
| 10 <sup>6</sup>  | Mega-  | M      | 2 <sup>30</sup>   | Mebi-  | Mi     |
| 10 <sup>9</sup>  | Giga-  | G      | 2 <sup>30</sup>   | Gibi-  | Gi     |
| 10 <sup>12</sup> | Tera-  | T      | 2 <sup>30</sup>   | Tebi-  | Ti     |
| 10 <sup>15</sup> | Peta-  | P      | 2 <sup>30</sup>   | Pebi-  | Pi     |
| 10 <sup>18</sup> | Etha-  | E      | 2 <sup>30</sup>   | Ebi-   | Ei     |
| 10 <sup>21</sup> | Zetta- | Z      | 2 <sup>30</sup>   | Zebi-  | Zi     |
| 10 <sup>24</sup> | Yotta- | Y      | 2 <sup>30</sup>   | Yebi-  | Yi     |
| 10 <sup>27</sup> | milli- | m      | 10 <sup>-3</sup>  | femto- | f      |
| 10 <sup>24</sup> | micro- | μ      | 10 <sup>-6</sup>  | atto-  | a      |
| 10 <sup>21</sup> | nano-  | η      | 10 <sup>-9</sup>  | zepto- | z      |
| 10 <sup>18</sup> | pico-  | ρ      | 10 <sup>-12</sup> | yocto- | y      |

7 10

$$\frac{1}{3} \times 10^{10}$$

$$\frac{10^9}{10^3}$$

$$10^{6400}$$

## RV64 BASE INTEGER INSTRUCTIONS, in alphabetical order

| MNEMONIC      | FORMAT | NAME                              | DESCRIPTION (in Verilog)                                                                                                     | NOTE |
|---------------|--------|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------|------|
| R             | I      | ADD (Word)                        | $R[Rd] = R[rs1] + R[rs2]$                                                                                                    | 1)   |
| add, addw     | I      | ADD Immediate (Word)              | $R[Rd] = R[rs1] + imm$                                                                                                       | 1h   |
| and           | R      | AND                               | $R[Rd] = R[rs1] \& R[rs2]$                                                                                                   | 1w   |
| andi          | I      | AND Immediate                     | $R[Rd] = R[rs1] \& imm$                                                                                                      | 1d   |
| auipc         | U      | Add Upper Immediate to PC         | $R[Rd] = PC + [imm, 12'b0]$                                                                                                  | 1m   |
| bsq           | SB     | Branch EQ=Equal                   | $if(R[rs1]==R[rs2])$<br>$PC=PC+[imm, 1b'0]$                                                                                  | 1ma  |
| blt           | SB     | Branch Less Than                  | $if(R[rs1]<R[rs2])$<br>$PC=PC+[imm, 1b'0]$                                                                                   | 1mb  |
| bltu          | SB     | Branch Less Than Unsigned         | $if(R[rs1]<R[rs2])$<br>$PC=PC+[imm, 1b'0]$                                                                                   | 1mc  |
| brne          | SB     | Branch Not Equal                  | $if(R[rs1]!=R[rs2])$<br>$PC=PC+[imm, 1b'0]$                                                                                  | 1md  |
| break         | I      | Environment BREAK                 | Transfer control to debugger                                                                                                 | 1me  |
| ecall         | I      | Environment CALL                  | Transfer control to operating system                                                                                         | 1mf  |
| jal           | UJ     | Jump & Link                       | $R[Rd] = PC+4, PC = R[rs1]+imm$                                                                                              | 1mg  |
| jalr          | I      | Jump & Link Register              | $R[Rd] =$<br>$\{36bM[7], M[R[rs1]+imm][7:0]\}$<br>$R[Rd] = \{56b0,M[R[rs1]+imm][7:0]\}$<br>$R[Rd] = M[R[rs1]+imm][63:0]$     | 1mh  |
| lb            | I      | Load Byte                         | $R[Rd] = \{48bM[15], M[R[rs1]+imm][15:0]\}$<br>$R[Rd] = \{48b0,M[R[rs1]+imm][15:0]\}$<br>$R[Rd] = \{32bimm<1>, imm, 12'b0\}$ | 1mi  |
| lbu           | I      | Load Byte Unsigned                | $R[Rd] = R[rs1]   R[rs2]$                                                                                                    | 1mj  |
| ld            | I      | Load Doubleword                   | $R[Rd] = R[rs1]   imm$                                                                                                       | 1mk  |
| lh            | I      | Load Halfword                     | $M[R[rs1]+imm][63:0] = R[rs2](63:0)$                                                                                         | 1ml  |
| lw            | I      | Load Word                         | $M[R[rs1]+imm][15:0] = R[rs2](15:0)$                                                                                         | 1mm  |
| lwu           | I      | Load Word Unsigned                | $R[Rd] = \{32b0,M[R[rs1]+imm][31:0]\}$                                                                                       | 1mn  |
| or            | R      | OR                                | $R[Rd] = R[rs1]   R[rs2]$                                                                                                    | 1mo  |
| ori           | I      | OR Immediate                      | $R[Rd] = R[rs1]   imm$                                                                                                       | 1mp  |
| sb            | S      | Store Byte                        | $M[R[rs1]+imm][7:0] = R[rs2](7:0)$                                                                                           | 1mq  |
| sd            | S      | Store Doubleword                  | $M[R[rs1]+imm][63:0] = R[rs2](63:0)$                                                                                         | 1mr  |
| sh            | S      | Store Halfword                    | $M[R[rs1]+imm][15:0] = R[rs2](15:0)$                                                                                         | 1ms  |
| sw            | R      | Shift Left (Word)                 | $R[Rd] = R[rs1] << R[rs2]$                                                                                                   | 1mt  |
| slli, sllw    | I      | Shift Left Immediate (Word)       | $R[Rd] = R[rs1] << imm$                                                                                                      | 1mu  |
| slt, sllw     | R      | Set Less Than                     | $R[Rd] = (R[rs1] < R[rs2]) ? 1 : 0$                                                                                          | 1mv  |
| slti          | I      | Set Less Than Immediate           | $R[Rd] = (R[rs1] < imm) ? 1 : 0$                                                                                             | 1mw  |
| sltiu         | I      | Set < Immediate Unsigned          | $R[Rd] = (R[rs1] < imm) ? 1 : 0$                                                                                             | 1mx  |
| sltu          | R      | Shift Right Unsigned              | $R[Rd] = (R[rs1] < R[rs2]) ? 1 : 0$                                                                                          | 1my  |
| srar, sraw    | R      | Shift Right Arithmetic (Word)     | $R[Rd] = R[rs1] >> R[rs2]$                                                                                                   | 1mz  |
| srari, srariw | I      | Shift Right Arithmetic Imm (Word) | $R[Rd] = R[rs1] >> imm$                                                                                                      | 1na  |
| srl, srlw     | R      | Shift Right (Word)                | $R[Rd] = R[rs1] >> R[rs2]$                                                                                                   | 1nb  |
| srlt, srltw   | I      | Shift Right Immediate (Word)      | $R[Rd] = R[rs1] >> imm$                                                                                                      | 1nc  |
| sub, subw     | R      | SUBtract (Word)                   | $R[Rd] = R[rs1] - R[rs2]$                                                                                                    | 1nd  |
| sw            | S      | Store Word                        | $M[R[rs1]+imm][31:0] = R[rs2](31:0)$                                                                                         | 1ne  |
| xor           | R      | XOR                               | $R[Rd] = R[rs1] ^ R[rs2]$                                                                                                    | 1nf  |
| xori          | I      | XOR Immediate                     | $R[Rd] = R[rs1] ^ imm$                                                                                                       | 1ng  |

## OPCODES IN NUMERICAL ORDER BY OPCODE

| FMT | OPCODE  | FUNCTION | CODE     | HEX/DECIMAL |
|-----|---------|----------|----------|-------------|
| 1b  | 0000011 | 000      | 00000000 | 03/1/0      |
| 1h  | 0000011 | 001      | 00000001 | 03/2/0      |
| 1w  | 0000011 | 011      | 00000011 | 03/3/0      |
| 1d  | 0000011 | 100      | 00000100 | 03/4/0      |
| 1m  | 0000011 | 101      | 00000101 | 03/5/0      |
| 1na | 0000011 | 110      | 00000110 | 03/6/0      |
| 1nb | 0000011 | 111      | 00000111 | 03/7/0      |
| 1nc | 0001011 | 010      | 00010010 | 13/2/0      |
| 1nd | 0001011 | 011      | 00010011 | 13/3/0      |
| 1ne | 0010011 | 101      | 00100011 | 13/4/0      |
| 1nf | 0010011 | 110      | 00100010 | 13/5/0      |
| 1ng | 0100000 | 101      | 01000000 | 13/6/0      |
| 2)  | 0010011 | 000      | 00100000 | 18/1/0      |
| 3)  | 0011011 | 011      | 00110011 | 23/1/0      |
| 4)  | 0011011 | 010      | 00110010 | 23/2/0      |
| 4)  | 0010011 | 011      | 00100011 | 23/3/0      |
| 4)  | 0011011 | 100      | 00110010 | 23/4/0      |
| 4)  | 0011011 | 101      | 00110011 | 23/5/0      |
| 4)  | 0010000 | 101      | 00100000 | 33/5/2/0    |
| 4)  | 0010000 | 000      | 00100000 | 33/1/2/0    |
| 4)  | 0010000 | 001      | 00100001 | 33/1/1/0    |
| 4)  | 0010000 | 011      | 00100000 | 33/3/0/0    |
| 4)  | 0010000 | 100      | 00100000 | 33/4/0/0    |
| 4)  | 0010000 | 101      | 00100000 | 33/5/0/0    |
| 4)  | 0010000 | 110      | 00100000 | 33/6/0/0    |
| 4)  | 0010000 | 111      | 00100000 | 33/7/0/0    |
| 3)  | 0011011 | 000      | 00110000 | 38/0/0/0    |
| 3)  | 0011011 | 001      | 00110001 | 38/1/0/0    |
| 3)  | 0011011 | 100      | 00110010 | 38/2/0/0    |
| 3)  | 0011011 | 101      | 00110011 | 38/3/0/0    |
| 3)  | 0011011 | 110      | 00110010 | 38/4/0/0    |
| 3)  | 0011011 | 111      | 00110011 | 38/5/0/0    |
| 3)  | 0011011 | 000      | 00110000 | 38/6/0/0    |
| 3)  | 0011011 | 001      | 00110001 | 38/7/0/0    |
| 3)  | 0011011 | 101      | 00110010 | 38/8/0/0    |
| 3)  | 0011011 | 110      | 00110011 | 38/9/0/0    |
| 3)  | 0011011 | 111      | 00110011 | 38/10/0/0   |
| 3)  | 0011011 | 000      | 00110000 | 38/11/0/0   |
| 3)  | 0011011 | 001      | 00110001 | 38/12/0/0   |
| 3)  | 0011011 | 101      | 00110010 | 38/13/0/0   |
| 3)  | 0011011 | 110      | 00110011 | 38/14/0/0   |
| 3)  | 0011011 | 111      | 00110011 | 38/15/0/0   |
| 3)  | 0011011 | 000      | 00110000 | 38/16/0/0   |
| 3)  | 0011011 | 001      | 00110001 | 38/17/0/0   |
| 3)  | 0011011 | 101      | 00110010 | 38/18/0/0   |
| 3)  | 0011011 | 110      | 00110011 | 38/19/0/0   |
| 3)  | 0011011 | 111      | 00110011 | 38/20/0/0   |
| 3)  | 0011011 | 000      | 00110000 | 38/21/0/0   |
| 3)  | 0011011 | 001      | 00110001 | 38/22/0/0   |
| 3)  | 0011011 | 101      | 00110010 | 38/23/0/0   |
| 3)  | 0011011 | 110      | 00110011 | 38/24/0/0   |
| 3)  | 0011011 | 111      | 00110011 | 38/25/0/0   |
| 3)  | 0011011 | 000      | 00110000 | 38/26/0/0   |
| 3)  | 0011011 | 001      | 00110001 | 38/27/0/0   |
| 3)  | 0011011 | 101      | 00110010 | 38/28/0/0   |
| 3)  | 0011011 | 110      | 00110011 | 38/29/0/0   |
| 3)  | 0011011 | 111      | 00110011 | 38/30/0/0   |
| 3)  | 0011011 | 000      | 00110000 | 38/31/0/0   |
| 3)  | 0011011 | 001      | 00110001 | 38/32/0/0   |
| 3)  | 0011011 | 101      | 00110010 | 38/33/0/0   |
| 3)  | 0011011 | 110      | 00110011 | 38/34/0/0   |
| 3)  | 0011011 | 111      | 00110011 | 38/35/0/0   |
| 3)  | 0011011 | 000      | 00110000 | 38/36/0/0   |
| 3)  | 0011011 | 001      | 00110001 | 38/37/0/0   |
| 3)  | 0011011 | 101      | 00110010 | 38/38/0/0   |
| 3)  | 0011011 | 110      | 00110011 | 38/39/0/0   |
| 3)  | 0011011 | 111      | 00110011 | 38/40/0/0   |
| 3)  | 0011011 | 000      | 00110000 | 38/41/0/0   |
| 3)  | 0011011 | 001      | 00110001 | 38/42/0/0   |
| 3)  | 0011011 | 101      | 00110010 | 38/43/0/0   |
| 3)  | 0011011 | 110      | 00110011 | 38/44/0/0   |
| 3)  | 0011011 | 111      | 00110011 | 38/45/0/0   |
| 3)  | 0011011 | 000      | 00110000 | 38/46/0/0   |
| 3)  | 0011011 | 001      | 00110001 | 38/47/0/0   |
| 3)  | 0011011 | 101      | 00110010 | 38/48/0/0   |
| 3)  | 0011011 | 110      | 00110011 | 38/49/0/0   |
| 3)  | 0011011 | 111      | 00110011 | 38/50/0/0   |
| 3)  | 0011011 | 000      | 00110000 | 38/51/0/0   |
| 3)  | 0011011 | 001      | 00110001 | 38/52/0/0   |
| 3)  | 0011011 | 101      | 00110010 | 38/53/0/0   |
| 3)  | 0011011 | 110      | 00110011 | 38/54/0/0   |
| 3)  | 0011011 | 111      | 00110011 | 38/55/0/0   |
| 3)  | 0011011 | 000      | 00110000 | 38/56/0/0   |
| 3)  | 0011011 | 001      | 00110001 | 38/57/0/0   |
| 3)  | 0011011 | 101      | 00110010 | 38/58/0/0   |
| 3)  | 0011011 | 110      | 00110011 | 38/59/0/0   |
| 3)  | 0011011 | 111      | 00110011 | 38/60/0/0   |
| 3)  | 0011011 | 000      | 00110000 | 38/61/0/0   |
| 3)  | 0011011 | 001      | 00110001 | 38/62/0/0   |
| 3)  | 0011011 | 101      | 00110010 | 38/63/0/0   |
| 3)  | 0011011 | 110      | 00110011 | 38/64/0/0   |

Note: The word version only operates on the rightmost 32 bits of a 64-bit register.

Operation assumes unsigned integers (instead of 2's complement).

The least significant bit of the branch address in jalr is set to 0.

(signed) Load instructions extend the sign bit of data to fill the 64-bit register.

Replicates the sign bit to fill the leftmost bits of the result during right shift.

Multiply with one operand signed and one unsigned.

The single version does a single-precision operation using the rightmost 17 bits of a 64-bit F register.

Classifies a 10-bit mask to show which properties are true (e.g., -inf, -0, +0, +inf, denorm, ...).

Atomic memory operation, nothing else can interpose itself between the read and the write.

The immediate field is sign-extended in RISC-V.