

Moore's law: 布尔管数量每两年翻一番

Dennard Scaling:

SRAM  $\rightarrow$  cache

DRAM  $\rightarrow$  mem

higher bandwidth: interleaved mem (multibank)  $\rightarrow$  independent

pos = block addr % # of blocks

direct-mapped (unique block)

fully-associative (any)

4 Questions:

① block placement

pos = block addr % # of sets

set associative (unique set)

n-way: n blocks in one set

(usually n=4)

valid bit

block address index offset

reference bit

# bits = log<sub>2</sub>(# sets) # bits = log<sub>2</sub>(size of block)

② block identification

block address index offset

dirty bit

# bits = log<sub>2</sub>(# sets) # bits = log<sub>2</sub>(size of block)

③ block replacement

(for set/full associative)

random, LRU, FIFO

$\rightarrow$  write buffer (stall  $\downarrow$ )

approx write through  $\rightarrow$  write around (only mem)

(write to mem as well)

miss write back  $\rightarrow$  write allocate (mem  $\rightarrow$

unified v.s. split cache (write dirty block when replaced) (mem  $\rightarrow$

cache)

IC access miss rate miss penalty

exec time = (CPU cc + mem stall cycle) cc time

misses  $\rightarrow$  3C model: compulsory, capacity, conflict

(widely used) 最常见的模型是LRU替换的LRU多块在同组

Optimizations 小且简单 L1 cache (also power  $\downarrow$ )

1. hit time  $\downarrow$  - way prediction (提高L1命中率的快, 性能高)

2. bandwidth  $\uparrow$  - addr translation X  $\rightarrow$  virtual mem  $\rightarrow$  virtual indexed

3. cache index size: 2<sup>index</sup> = block size  $\cdot$  set associativity

4. cache size: block size  $\cdot$  set associativity

Rules of Thumb very complex (for out-of-order exec)

miss penalty  $\uparrow$  miss  $\rightarrow$  stall

1. Anand/Case Rule: A balanced computer sys need  $\approx$

1MB main mem capacity and 1Mb/s I/O bandwidth per MIPs

2. bandwidth Rule: bandwidth grows by 2 (improvement in latency)  $\downarrow$

3. Bandwidth Rule: bandwidth grows by 2 (improvement in miss rate)  $\downarrow$

4. 2:1 Cache Rule: dm cache(size N) = 2-way so cache  $\frac{N}{2}$

5. Dependability Rule: Design with no single point of failure

$$1. CPU \text{ time} = \text{inst cnt} \times CPI \times cc \text{ time} (\text{or cc rate})$$

$$2. \frac{\text{performance}}{\text{performance}} = \frac{\text{exec time}_\text{old}}{\text{exec time}_\text{new}} = \frac{1}{(1-f_e) + f_\text{e}}$$

$$3. Amdahl's Law: speedup = \frac{\text{exec time}_\text{old}}{\text{exec time}_\text{new} (1-f_e) + f_\text{e}}$$

$$4. \text{energy dynamic} \propto \frac{1}{2} \cdot \text{capacitive load} \cdot \text{Voltage}^2$$

$$5. \text{power dynamic} \propto \frac{1}{2} \cdot \text{capacitive load} \cdot \text{voltage}^2 \cdot \text{frequency}$$

$$6. \text{power static} \propto \text{current static} \cdot \text{voltage}$$

$$7. \text{availability} = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}} \rightarrow \text{倒数为故障率} \in [0.5, 1.5]$$

$$8. \text{Die yield} = \text{wafer yield} \cdot (1 + \text{defects/unit area} \cdot \text{die area})$$

$$9. AM = \frac{1}{n} \sum_{i=1}^n \text{Time}_i; WAM = \sum_{i=1}^n w_i \cdot t_i; GM = \sqrt[n]{\prod_{i=1}^n t_i}$$

$$10. AMAT = \text{hit time} + \text{miss rate} \cdot \text{miss penalty}$$

$$11. \text{misses/inst} = \text{miss rate} \cdot \text{mem access/inst}$$

$$12. \text{cache index size: } 2^{\text{index}} = \frac{\text{cache size}}{\text{block size} \cdot \text{set associativity}}$$

$$\text{Rules of Thumb} \quad \text{very complex (for out-of-order exec)}$$

$$\text{miss penalty} \uparrow \text{miss} \rightarrow \text{stall}$$

$$1. \text{Anand/Case Rule: A balanced computer sys need } \approx$$

$$1. \text{bandwidth } \uparrow \text{same V, diff P diff V, same P}$$

$$2. \text{bandwidth } \uparrow \text{pipelined cache: cc } \uparrow, (\text{hit}) \text{ time } \downarrow \text{ benefit to larger cache}$$

$$3. \text{Bandwidth Rule: bandwidth grows by 2 (improvement in latency) } \downarrow$$

$$4. \text{2:1 Cache Rule: dm cache(size N) = 2-way so cache } \frac{N}{2}$$

$$5. \text{Dependability Rule: Design with no single point of failure}$$

$$1. \text{3 Walls: ILP, mem, power}$$

$$2. \text{Flynn's taxonomy: SISD, SIMD, MISD, MIMD}$$

$$3. \text{miss penalty } \downarrow$$

$$4. \text{compiler optimization - inst: reorder, analyse}$$

$$5. \text{parallelism: parallelism, spatial, temporal}$$

$$6. \text{prefetch: compiler, cache, register}$$

$$7. \text{Virtual memory: physical, mem, disk, page, segment}$$

$$8. \text{segmentation: 1 word addr, invisible, 2 word addr, may visible, global ~ trivial, internal fragmentation, difficult, external ~, not always ~ efficient}$$

$$9. \text{Pipelining classes: static / dynamic (diff) (serial), component / processor / integrated processor}$$

$$10. \text{throughput: # of issue} \rightarrow \# \text{ of stage order / disorder}$$

$$11. \text{solution: scalar / vector} \rightarrow \text{subdivision, repetition}$$

$$12. \text{efficiency } \eta = \frac{n}{m} \text{ bottleneck segment: max}(a_i)$$

$$13. \text{speed up} = \frac{nm}{m+n-1}$$

## Chap 2. Memory Hierarchy Design

avoid conflicts

higher bandwidth: interleaved mem (multibank)  $\rightarrow$  independent

pos = block addr % # of blocks

direct-mapped (unique block)

fully-associative (any)

4 Questions:

① block placement

pos = block addr % # of sets

set associative (unique set)

n-way: n blocks in one set

(usually n=4)

② block identification

block address index offset

dirty bit

# bits = log<sub>2</sub>(# sets) # bits = log<sub>2</sub>(size of block)

③ block replacement

(for set/full associative)

random, LRU, FIFO

$\rightarrow$  write buffer (stall  $\downarrow$ )

approx write through  $\rightarrow$  write around (only mem)

(write to mem as well)

miss write back  $\rightarrow$  write allocate (mem  $\rightarrow$

unified v.s. split cache (write dirty block when replaced) (mem  $\rightarrow$

cache)

IC access miss rate miss penalty

exec time = (CPU cc + mem stall cycle) cc time

misses  $\rightarrow$  3C model: compulsory, capacity, conflict

(widely used) 最常见的模型是LRU替换的LRU多块在同组

Optimizations 小且简单 L1 cache (also power  $\downarrow$ )

1. hit time  $\downarrow$  - way prediction (提高L1命中率的快, 性能高)

2. bandwidth  $\uparrow$  - addr translation X  $\rightarrow$  virtual mem  $\rightarrow$  virtual indexed

3. cache index size: 2<sup>index</sup> = block size  $\cdot$  set associativity

4. cache size: block size  $\cdot$  set associativity

Rules of Thumb very complex (for out-of-order exec)

miss penalty  $\uparrow$  miss  $\rightarrow$  stall

1. Anand/Case Rule: A balanced computer sys need  $\approx$

1. bandwidth Rule: bandwidth grows by 2 (improvement in latency)  $\downarrow$

3. Bandwidth Rule: bandwidth grows by 2 (improvement in miss rate)  $\downarrow$

4. 2:1 Cache Rule: dm cache(size N) = 2-way so cache  $\frac{N}{2}$

5. Dependability Rule: Design with no single point of failure



## Scoreboarding

reg

data buses

full

IF to LS (有 buffer)

operand T/F (F, R/W)

→ full reg from req





SIMD Extension for Multimedia

no vec-len reg, stride, gather-scatter, pred reg cost  $\downarrow$  bandwidth  $\downarrow$

roofline performance model

attainable GFLOPs/s = min(peak mem BW  $\times$  arithmetic intensity, peak fp performance)



loop-level parallelism

- find dependence  $m \in j, k \in n$  index
  - compile time GCD test:  $\text{GCD}(c, a) \% (d-b) = 0$
  - NPC problem
- eliminate dependence scalar expansion reduction

Chap 5 TLP MIMD multiprocessor (multicore)

software model: parallel processing request-level parallelism

memory organization

- SMP (centralized shared-mem arch)
- UMA (时间相等)
  - processor 有 1 个 private cache/mem
  - main mem/I/O sys

challenge: 独立 task  $\rightarrow$  limited parallelism  
 资源间协作  $\rightarrow$  high communication cost  
 完成的并行任务

cache coherence (满足以下条件)

- 处理器 P 写入 X 后立马读 X, 中间无对 X 的操作
- P2 在 P1 写入 X 后读 X, 中间无对 X 的操作
- write serialization: 2xR write 读果待合顺序

shared data  $\rightarrow$  local  $\sim$  migration replication

protocol: directory based: 目录有 block 的 shared status

\* snooping: cache 与 trace block 的 shared status using bus (broadcast media)

\* write invalidate

处理器写入数据项时对其他 processor 有 exclusive access, 其他 processor 将该块 (copy) 会 invalidate, 且 fetch updated one miss

\* write back or write through easy bandwidth  $\uparrow$  to replace old bandwidth  $\downarrow$

\* dirty copy  $\rightarrow$  3d mem owner: 拥有 cache block 的 core

MSI protocol - invalid

shared: block 被多核共享 modified: block updated, exclusive (other copies are invalid)

basic hardware primitives

atomic exchange: 变换 reg 为 mem 的值

lock: 0 表示 free, 1 表示 occupied

test-and-set

fetch-and-increment

memory consistency (涉及不同 mem addrs 的 R/W)

① sequential consistency: 每个 processor 的内存访问

保证程序顺序  $\rightarrow$  性能  $\downarrow$  (R  $\rightarrow$  W, R  $\rightarrow$  R, W  $\rightarrow$  R, W  $\rightarrow$  W)

relaxed consistency model: (total store)

MESI: to ZE (exclusive): 处理某 cache

对 exclusive block  $\rightarrow$  unmodified

对 write miss, P 是修改 modified 来处理

