

# **Machine Learning Systems**

Build efficient, reliable and scalable ML services through the vertical integration of algorithms, system software, and hardware

# What is ML Systems

**Our view** : The process of Machine learning algorithm design, development and deployment with scalability in mind.

**Vertically integrated optimization hierarchy** : Algorithms, system software, hardware architecture, with biggest opportunities in the algorithm layer.

# Computer system design is driven by the workload, so is ML sys.

- Machine intelligence is all about learning representations.
- Learning representations require tons of matrix multiplications.
- Processing is fast, latency is physics, and large-fast storage is impossible.
- Bending the laws of physics, is that possible?
- Who sells memory better, CPU or GPU?

# Machine Learning Systems

ML Sys. is the discipline of designing, implementing, and operating artificially intelligent systems across computing scales—from resource-constrained embedded devices to warehouse-scale computers.

It focuses on algorithm-software-hardware co-design to create systems that are efficient, scalable, and reliable for their deployment context.

It encompasses the complete lifecycle of AI applications: from requirements engineering and data collection through model development, system integration, deployment, monitoring, and maintenance.

# Schedule

# Topics

|       |                         |                                                                                                     |
|-------|-------------------------|-----------------------------------------------------------------------------------------------------|
| 1     | Introduction            | ML Sys Intro.<br>CPU architecture review                                                            |
| 2-3   | CPU-GPU Architecture    | Overview CPU-GPU architecture<br>Advanced features                                                  |
| 4-5   | CUDA Programming        | Fundamentals and architecture<br>Management optimization and debugging<br>Hands-on CUDA programming |
| 6-7   | Compiler & Optimization | DL compiler<br>LLM compiler                                                                         |
| 8-9   | Parallel Training       | LLM basics<br>Parallelism<br>Computation communication memory optimization                          |
| 10    | Inference               | LLM inference infra                                                                                 |
| 11-12 | Model Optimization      | Stability, Quantization<br>Pruning & Sparsification                                                 |
| 13    | Beyond Performance      | Privacy, Safety, Security, Responsibility, Sustainability                                           |
| 14    | Embodied AI             | Automated driving systems<br>Humanoid robotic systems                                               |
| 15    | AI Accelerators         | Customized AI accelerators                                                                          |
| 16    | Project                 | Course project presentation                                                                         |

# **Computer Architecture**

An overview of CPU/GPU from an optimization perspective.

# Performance

$$\begin{aligned}\text{Processor Performance} &= \frac{\text{Time}}{\text{Program}} \\ &= \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instructions}} \times \frac{\text{Time}}{\text{Cycles}} \\ &\quad (\text{code size}) \qquad \qquad (\text{CPI}) \qquad \qquad (\text{cycle time})\end{aligned}$$

## **Assembly Instruction**

ADD r1, r3, #5

## **Machine Code (binary)**

000**1**0000**0001**0011**000000000000000101**



# Instruction Execution



**IF:** Instruction fetch

**ID:** Instruction decode and register read

**EX:** Execute, address generation

**MEM:** Memory access

**WB:** Write back to registers

# Datapath



## How to speed up the code

```
int sum_array(int *A, int n) {  
    int sum = 0;  
    for (int i = 0; i < n; i++) {  
        sum += A[i];  
    }  
    return sum;  
}
```

```
int sum_array(int *A, int n) {  
    int sum = 0;  
    for (int i = 0; i < n; i++) {  
        sum += A[i];  
    }  
    return sum;  
}
```

```
L1: cmp    i, n          ; compare i with n  
      jge    L2          ; if i >= n, exit loop  
      load   r1, [A+i*4]  ; load A[i] from memory  
      add    sum, r1       ; sum += A[i]  
      inc    i             ; i++  
      jmp    L1          ; repeat  
L2: ret           ; return sum
```

# Pipelining

How to improve CPI (cycles per instruction)?

- Instruction Latency = 5 cycles
- Instruction throughput = 1 instr/cycle
- CPI was 5 cycles per instruction
- CPI = 1 cycle per instruction
- CPI = cycle between instruction completion = 1



# Pipelining

1. IF: Instruction Fetch
2. ID: Decode & register fetch
3. EX: Execute (ALU op, branch calc)
4. MEM: Memory access
5. WB: Write-back (to register file)

| Cycle | IF        | ID        | EX        | MEM       | WB       |
|-------|-----------|-----------|-----------|-----------|----------|
| 1     | cmp(k)    |           |           |           |          |
| 2     | jge(k)    | cmp(k)    |           |           |          |
| 3     | load(k)   | jge(k)    | cmp(k)    |           |          |
| 4     | add(k)    | load(k)   | jge(k)    | cmp(k)    |          |
| 5     | inc(k)    | add(k)    | load(k)   | jge(k)    | cmp(k)   |
| 6     | jmp(k)    | inc(k)    | add(k)    | load(k)   | jge(k)   |
| 7     | cmp(k+1)  | jmp(k)    | inc(k)    | add(k)    | load(k)  |
| 8     | jge(k+1)  | cmp(k+1)  | jmp(k)    | inc(k)    | add(k)   |
| 9     | load(k+1) | jge(k+1)  | cmp(k+1)  | jmp(k)    | inc(k)   |
| 10    | add(k+1)  | load(k+1) | jge(k+1)  | cmp(k+1)  | jmp(k)   |
| 11    | inc(k+1)  | add(k+1)  | load(k+1) | jge(k+1)  | cmp(k+1) |
| 12    | jmp(k+1)  | inc(k+1)  | add(k+1)  | load(k+1) | jge(k+1) |

What is the catch?

# Branch Prediction

## Avoid pipeline stalls

- The `jge L2` is a **branch**.

```
L1: cmp    i, n          • If the CPU waits every time to know whether to continue, the pipeline stalls.  
      jge   L2  
      load  r1, [A+i*4]   • Modern CPUs predict:  
      add   sum, r1        • “Most of the time the loop continues” → fetches next iteration’s instructions ahead of time.  
      inc   i  
      jmp   L1  
L2: ret
```

- If prediction is correct (99% of iterations), the pipeline stays full.
- If wrong (last iteration), CPU discards speculative work, small penalty.

# Out-of-Order Execution

Hide memory latency

```
L1: cmp    i, n
     jge    L2
     load   r1, [A+i*4]
     add    sum, r1
     inc    i
     jmp    L1
L2: ret
```

- In this sequence, the **load from memory** (`load r1, [A+i*4]`) is slow compared to the integer add.
- A modern CPU doesn't just sit idle — it looks ahead:
  - While waiting for  $A[i]$  to arrive, it issues the `load [A+(i+1)*4]` early.
  - It may also speculatively start decoding the next `cmp` and `add`.
- **Result:** instructions overlap, latency is hidden, and the loop runs much faster.

# Superscalar

```
L1: cmp    i, n
     jge    L2
     load   r1, [A+i*4]
     add    sum, r1
     inc    i
     jmp    L1
L2: ret
```



# Superscalar

- Superpipelined-Superscalar
  - Issue parallelism = IP = n inst / minor cycle
  - Operation latency = OP = m minor cycles
  - Peak IPC = n x m instr / major cycle



## ◆ Hardware Optimizations (no code change)

| Optimization                 | What CPU Does                                                                                                                               | Benefit                               |
|------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------|
| Branch Prediction            | Predicts loop will continue                                                                                                                 | Avoids pipeline stalls each iteration |
| Out-of-Order Execution (OoO) | While waiting for <code>A[i]</code> to load, speculatively executes <code>inc i, cmp i, n</code> , even pre-issues load <code>A[i+1]</code> | Hides memory latency                  |
| Multiple Issue (Superscalar) | Issues load + add in same cycle (if ready)                                                                                                  | Parallelizes independent instructions |
| Caches (spatial/temporal)    | - Spatial: $A[i \dots i+15]$ come in one cache line- Temporal: sum stays hot in registers                                                   | Reduce memory stalls                  |

# Loop Unrolling

```
int sum_array(int *A, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += A[i];
    }
    return sum;
}
```

```
int sum_array_unroll4(const int *A, int n) {
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    int n4 = n & ~3; // floor to multiple of 4
    for (; i < n4; i += 4) {
        s0 += A[i+0];
        s1 += A[i+1];
        s2 += A[i+2];
        s3 += A[i+3];
    }
    int sum = (s0 + s1) + (s2 + s3); // combine partials
    for (; i < n; ++i) sum += A[i]; // tail (remainder) loop
    return sum;
}
```

# Loop Unrolling

```
int sum_array_unroll4(const int *A, int n) {
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    int n4 = n & ~3; // floor to multiple of 4
    for (; i < n4; i += 4) {
        s0 += A[i+0];
        s1 += A[i+1];
        s2 += A[i+2];
        s3 += A[i+3];
    }
    int sum = (s0 + s1) + (s2 + s3); // combine partials
    for (; i < n; ++i) sum += A[i]; // tail (remainder) loop
    return sum;
}

; Assume rA = base of A, ri = i
; s0..s3 kept in registers (e.g., r4..r7)

L1:
    load r10, [rA + ri*4 + 0] ; A[i+0]
    load r11, [rA + ri*4 + 4] ; A[i+1]
    load r12, [rA + ri*4 + 8] ; A[i+2]
    load r13, [rA + ri*4 + 12] ; A[i+3]

    add r4, r4, r10 ; s0 += A[i+0]
    add r5, r5, r11 ; s1 += A[i+1]
    add r6, r6, r12 ; s2 += A[i+2]
    add r7, r7, r13 ; s3 += A[i+3]

    add ri, ri, 4 ; i += 4
    cmp ri, n4
    jl L1

; combine: r4=r4+r5, r6=r6+r7, then r4=r4+r6
```

- **Fewer branches:** 1 branch per 4 elements (vs 1 per element).
- **More ILP:** four independent load+add pairs. The CPU can issue loads early and overlap them; adds don't wait on a single running dependency (since s0..s3 are independent).
- **Better use of caches/prefetchers:** sequential loads hint the hardware prefetcher; each miss can be overlapped with others.

# Vectorization

Vectorization improves parallelism by using **SIMD registers** to perform multiple loads and additions in one instruction. In the summation loop, AVX2 can sum 8 integers at once, cutting loop iterations by 8x and reducing branch/control overhead, while also aligning with memory prefetching for high throughput.

SIMD: Single instruction multiple data

```
#include <iimmintrin.h> // AVX intrinsics

int sum_array_vectorized(const int *A, int n) {
    __m256i acc = _mm256_setzero_si256(); // vector accumulator
    int i = 0;

    // Process 8 ints per loop
    for (; i <= n-8; i += 8) {
        __m256i v = _mm256_loadu_si256((__m256i*)&A[i]); // load 8 ints
        acc = _mm256_add_epi32(acc, v); // acc += v
    }

    // Horizontal add (sum across vector lanes)
    int tmp[8];
    _mm256_storeu_si256((__m256i*)tmp, acc);
    int sum = tmp[0]+tmp[1]+tmp[2]+tmp[3]+tmp[4]+tmp[5]+tmp[6]+tmp[7];

    // Handle tail elements
    for (; i < n; i++) sum += A[i];

    return sum;
}
```

# Summary

| Processing Side        |                                                                                                                                             |
|------------------------|---------------------------------------------------------------------------------------------------------------------------------------------|
| Compiler<br>(Software) | <ul style="list-style-type: none"><li>  Loop unrolling, vectorization</li><li>  → expose parallel ops</li></ul>                             |
| CPU<br>(Hardware)      | <ul style="list-style-type: none"><li>  Out-of-order execution</li><li>  Multiple-issue (superscalar)</li><li>  Branch prediction</li></ul> |

# Summary

| Dimension               | Software (Compiler)                                                             | Hardware (CPU)                                        | How They Help Each Other                                |
|-------------------------|---------------------------------------------------------------------------------|-------------------------------------------------------|---------------------------------------------------------|
| Instruction Parallelism | Loop unrolling, vectorization → expose more independent ops                     | OoO & multiple-issue → schedule them in parallel      | Compiler exposes → CPU exploits                         |
| Control Hazards         | Loop unrolling reduces branch frequency                                         | Branch predictor speculates next iterations           | Compiler reduces branch count, CPU hides branch penalty |
| Data Locality           | Ensure arrays contiguous; unroll to encourage cache line use; software prefetch | Cache hierarchy (L1/L2/L3), hardware prefetchers      | Compiler arranges memory, CPU hardware delivers fast    |
| Latency Hiding          | Unrolled loops → more outstanding loads                                         | OoO overlaps loads with adds, prefetchers fetch ahead | More outstanding memory ops → more latency hidden       |

# Data Locality

```
int sum_array(int *A, int n) {  
    int sum = 0;  
    for (int i = 0; i < n; i++) {  
        sum += A[i];  
    }  
    return sum;  
}
```

| Locality type | Meaning                | Example in code                         |
|---------------|------------------------|-----------------------------------------|
| Temporal      | Reuse same data soon   | sum += A[i] → sum reused each iteration |
| Spatial       | Reuse nearby data soon | A[i], then A[i+1] in a loop             |

# Memory Hierarchy



# Why Memory Hierarchy?

- Fast and small memories
  - Enable quick access (fast cycle time)
  - Enable lots of bandwidth ( $1 + L/S/I$ -fetch/cycle)
- Slower larger memories
  - Capture larger share of memory
  - Still relatively fast
- Slow huge memories
  - Hold rarely-needed state
  - Needed for correctness
- All together: provide appearance of large, fast memory with cost of cheap, slow memory

# Why Does a Hierarchy Work?

- Locality of reference
  - Temporal locality
    - Reference same memory location repeatedly
  - Spatial locality
    - Reference near neighbors around the same time
- Empirically observed
  - Significant!
  - Even small local storage (8KB) often satisfies >90% of references to multi-MB data set

# Memory Hierarchy



# Cache: Towards ideal hardwired storage

| Criteria                      | Reason                                  |
|-------------------------------|-----------------------------------------|
| Low Hit Time                  | High performance and speed              |
| Low Miss Rate                 | Maximizes cache hit rate                |
| Low Miss Penalty              | Minimizes memory fetch delay            |
| High Bandwidth                | Supports multiple simultaneous requests |
| Effective Replacement Policy  | Improves hit ratio                      |
| Efficient Write Handling      | Optimizes memory bandwidth usage        |
| Coherence (Multiprocessor)    | Ensures data consistency                |
| Energy Efficiency             | Reduces power consumption               |
| Scalability                   | Adapts to diverse workloads             |
| Reliability & Fault Tolerance | Ensures robust operation                |

# Cache: Placement

- Address Range
  - Exceeds cache capacity
- Map address to finite capacity
  - Called a hash
  - Usually just masks high-order bits
- Direct-mapped
  - Block can only exist in one location
  - Hash collisions cause problems



32-bit Address



# Cache: Placement

- Address Range
  - Exceeds cache capacity
- Map address to finite capacity
  - Called a hash
  - Usually just masks high-order bits
- Direct-mapped
  - Block can only exist in one location
  - Hash collisions cause problems



# Cache: Placement

- *Fully-associative*
  - Block can exist anywhere
  - No more hash collisions
- *Identification*
  - How do I know I have the right block?
  - Called a *tag check*
    - Must store address tags
    - Compare against address
- Expensive!
  - Tag & comparator per block



32-bit Address



# Cache: Placement

- Set-associative
  - Block can be in a locations
  - Hash collisions:
    - a still OK
- Identification
  - Still perform *tag check*
  - However, only a few in parallel



- $2^8 = 256$  sets each with four ways (each with one block).



The formula for the total size of the cache is:

$$\text{Total Size} = BS \times B = BS \times S \times \left(\frac{B}{S}\right)$$

Where:

- **BS** = Block Size (the size of each cache line, usually measured in bytes)
- **B** = Number of blocks (total cache lines in the cache)
- **S** = Number of sets (the number of different locations in the cache where data can be placed)
- **B/S** = The number of blocks per set (this is how many blocks are grouped in each set, e.g., 1 block per set in direct-mapped caches or multiple blocks per set in a set-associative cache)

# Summary

|                        | Processing Side                                                                                                                             | Data Side                                                                                                                                                 |
|------------------------|---------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| Compiler<br>(Software) | <ul style="list-style-type: none"><li>  Loop unrolling, vectorization</li><li>  → expose parallel ops</li></ul>                             | <ul style="list-style-type: none"><li>  Ensure contiguous arrays,</li><li>  unroll <b>for</b> cache-line use,</li><li>  software prefetch hints</li></ul> |
| CPU<br>(Hardware)      | <ul style="list-style-type: none"><li>  Out-of-order execution</li><li>  Multiple-issue (superscalar)</li><li>  Branch prediction</li></ul> | <ul style="list-style-type: none"><li>  Cache hierarchy (L1/L2/L3),</li><li>  hardware prefetchers,</li><li>  spatial + temporal locality</li></ul>       |

# CPU



## Overall Pipeline Flow Summary

- Fetch:** Instructions are fetched from instruction cache (I-Cache) with the help of ITLB.
- Pre-decode:** Instructions are pre-decoded and buffered for the decoding stage.
- Decode:** Instructions are translated into μops by simple or complex decoders or micro-code ROM.
- Rename & Allocate:** Registers are renamed (RAT), μops are allocated into the reorder buffer (ROB).
- Issue & Execute:** μops wait in reservation stations until operands are ready, then dispatched to execution units.
- Memory Access:** Load/store instructions handled by MOB, data cache, and DTLB.
- Write-back & Commit:** Execution results are written back via internal result buses, and instructions commit in order via ROB, updating the Retirement Register File.

# CPU



## Instruction Fetch and Decode Stage (Red Boxes)

- **Instruction Fetch Unit:**
  - Fetches instructions from memory.
  - Uses the Instruction Translation Lookaside Buffer (ITLB, 128 entries) to speed up virtual-to-physical address translation for instructions.
- **32 KB Instruction Cache (I-Cache, 8-way set associative):**
  - Holds recently accessed instructions to quickly supply the fetch stage, improving performance by reducing memory access latency.
- **32-byte Pre-decode Fetch Buffer:**
  - Stores prefetched instructions (up to 6 instructions).
  - Performs initial decoding/preparation, easing decoding complexity downstream.
- **18-entry Instruction Queue:**
  - Buffers decoded instructions to manage pipeline flow and maintain throughput.
- **Decoders (Simple & Complex decoders + Micro-code ROM):**
  - **Simple Decoders** handle common, straightforward instructions, translating them into single micro-operations (μops).
  - **Complex Decoder** handles more intricate instructions, decoding them into multiple μops.
  - **Micro-code ROM** handles special or extremely complex instructions by translating them into micro-code sequences.

# CPU



## Micro-operation Handling Stage (Orange Boxes)

- **Micro-operation ( $\mu$ op) Buffer (7+ entries):**
  - Temporarily stores decoded micro-operations before they enter the execution pipeline.
- **Register Alias Table (RAT) and Allocator:**
  - Implements Register Renaming, mapping architectural registers to physical registers.
  - Eliminates false dependencies (WAR and WAW hazards) by dynamically renaming registers, allowing parallel execution.
- **96-entry Reorder Buffer (ROB):**
  - Facilitates out-of-order execution and precise exceptions.
  - Tracks micro-operations, allowing the CPU to commit instructions in program order, ensuring correct execution semantics.
- **Retirement Register File (Program Visible State):**
  - Maintains the committed state visible to software (architectural state), updated as instructions retire from the ROB.

# CPU



## Execution Stage (Yellow Boxes)

- **32-entry Reservation Station:**
  - Buffers µops until operands and execution resources become available.
  - Dispatches ready µops to available execution units.
- **Execution Units (multiple functional units):**
  - **ALUs (Arithmetic Logic Units):** Perform integer arithmetic and logic operations.
  - **SSE/Vector Units (Shuffle, MUL, ADD, FMUL, FDIV):** Handle floating-point and SIMD instructions.
  - **Branch Unit (ALU Branch):** Evaluates branch conditions and computes target addresses.
- Each execution unit receives µops from specific ports (Port 0, Port 1, Port 5, etc.), enabling parallel execution of independent instructions.
- Results from execution units are communicated via the **Internal Results Bus** back to the reservation station, ROB, and other pipeline elements.

# CPU



## Memory Access and Data Handling (Green Boxes)

- Memory Ordering Buffer (MOB):**
  - Ensures correct memory ordering semantics for load/store operations, preserving memory consistency.
  - Coordinates execution of load/store operations in the correct order.
- 32 KB Dual-Ported Data Cache (D-Cache, 8-way set associative):**
  - Holds recently accessed data to reduce latency of data memory accesses.
  - "Dual ported" allows simultaneous load/store operations, improving memory throughput.
- Data Translation Lookaside Buffer (DTLB, 16 entries):**
  - Speeds up virtual-to-physical address translation for data accesses.
- Shared L2 Cache (16-way set associative):**
  - Provides a unified second-level cache for both instruction and data.
  - Interfaces with the external memory system through the **Shared Bus Interface Unit**.

# CPU



## Key Design Features

- Out-of-order execution:** Instructions can be executed out-of-order but commit results strictly in program order.
- Register renaming:** Prevents pipeline stalls caused by register conflicts.
- Multiple execution units and ports:** Enables parallel execution of multiple instructions per cycle.
- Separate instruction and data caches:** Improves cache locality and reduces latency.
- Multi-level caching (L1, L2):** Balances latency, throughput, and hit-rate.
- Translation Lookaside Buffers (ITLB, DTLB):** Accelerates virtual-to-physical address translations.
- Sophisticated branch prediction (implied):** Ensures high instruction throughput and pipeline efficiency.

This pipeline structure illustrates how Intel processors achieve high performance by combining deep pipelines, extensive parallel execution, and advanced memory-hierarchy management.

# CPU



# Why Multicore

- Instruction-level parallelism
  - Reaps performance by finding independent work in a single thread
- Thread-level parallelism
  - Reaps performance by finding independent work across multiple threads
- Historically, requires explicitly parallel workloads
  - Originate from mainframe time-sharing workloads
  - Even then, CPU speed  $\gg$  I/O speed
  - Had to overlap I/O latency with “something else” for the CPU to do
  - Hence, operating system would schedule other tasks/processes/threads that were “time-sharing” the CPU

# Why Multicore

$$\text{Speedup} = \frac{1}{(1-f) + \frac{f}{n}}$$

$$\lim_{n \rightarrow \infty} \frac{1}{1-f + \frac{f}{n}} = \frac{1}{1-f}$$



- Fixed power budget forces slow cores
- Serial code quickly dominates

# Matrix Multiplication

矩阵乘法计算复杂度为  $O(n^3)$ , 典型形式:

$$C = A \times B, \quad C_{ij} = \sum_k A_{ik} \cdot B_{kj}$$

$$\begin{bmatrix} b_{0,0} & b_{0,1} & \cdots & b_{0,n-1} \\ b_{1,0} & b_{1,1} & \cdots & b_{1,n-1} \\ \vdots & \vdots & \ddots & \vdots \\ b_{n-1,0} & b_{n-1,1} & \cdots & b_{n-1,n-1} \end{bmatrix}$$

$$\begin{bmatrix} a_{0,0} & a_{0,1} & \cdots & a_{0,n-1} \\ a_{1,0} & a_{1,1} & \cdots & a_{1,n-1} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n-1,0} & a_{n-1,1} & \cdots & a_{n-1,n-1} \end{bmatrix} \quad \begin{bmatrix} & & & \\ & & & \\ & & & \\ & & & \end{bmatrix}$$

# Matrix Multiplication

矩阵乘法计算复杂度为  $O(n^3)$ , 典型形式:

$$C = A \times B, \quad C_{ij} = \sum_k A_{ik} \cdot B_{kj}$$

```
for (i=0; i<n; i++)
    for (j=0; j<n; j++)
        for (k=0; k<n; k++)
            C[i][j] += A[i][k] * B[k][j];
```

# Matrix Multiplication

## (1) Cache blocking / Loop tiling

- 将矩阵分块 (block)，让计算尽量在 L1/L2 cache 内完成，减少反复从内存加载。
- 思路：把大矩阵拆为  $B \times B$  小块，三重循环改为分块嵌套循环。
- 好处：利用 数据的空间/时间局部性。

# Matrix Multiplication

## (2) Loop unrolling

- 展开循环体，让 CPU 每次处理多条指令，减少分支开销。
- 同时引入多个累加器，消除指令间依赖，提升 指令级并行度 (ILP)。

# Matrix Multiplication

## (3) SIMD 向量化 (Vectorization)

- 利用 CPU 的 SIMD 指令集 (如 AVX/AVX-512) , 一次对多个浮点数做乘加。
- 例如 AVX-512 可一次执行 16 个单精度浮点数的 FMA (Fused Multiply-Add) 。
- 典型实现：在内层循环用 `_mm256_fmadd_ps` 等指令。

# Matrix Multiplication

## (4) 多核并行 (Multithreading)

- 不同线程处理矩阵的不同块。
- OpenMP / TBB 可以快速并行化外层循环。
- 例如把矩阵 C 按行或按块划分给不同核。

# Matrix Multiplication

```
// Blocked + vectorized GEMM (伪代码)
for (ii = 0; ii < n; ii += B)           // ← 外层按 B×B 分块：把 A、B、C 划成能装进 L2/L1 的小块
    for (jj = 0; jj < n; jj += B)       //     这样同一块数据被多次复用，减少反复从内存拿数据（时间/空间局部性）
        for (kk = 0; kk < n; kk += B)     //     三重分块(ii,jj,kk) ≈ 经典 cache blocking (也叫 loop tiling)

            for (i = ii; i < ii + B; i++)   // — 下面在一个 B×B 子块上做“小GEMM”
                for (j = jj; j < jj + B; j++) {
                    __m256 c_vec = _mm256_setzero_ps();           // ← SIMD 向量累加器：8×float 并行累加 (AVX 256-bit)

                    for (k = kk; k < kk + B; k += 8) {           // ← 内层步长=8：一次处理 8 个乘加，匹配 AVX 向量宽度
                        __m256 a_vec = _mm256_loadu_ps(&A[i][k]); //     向量加载：取 A[i, k..k+7]
                        __m256 b_vec = _mm256_loadu_ps(&B[k][j]); //     (示例直取一列不太贴近真实高效实现，真实代码多用
                                            //         pack/转置/寄存器重排或 micro-kernel 以提升 B 的访存局部性)
                        c_vec = _mm256_fmadd_ps(a_vec, b_vec, c_vec); //     FMA：c_vec += a_vec * b_vec (8 路并行，指令级并行/SIMD)
                    }

                    C[i][j] += horizontal_add(c_vec);           // ← 把 8 路向量累加成标量写回（水平求和）
                }
            }
        }
    }
}
```

What are the lessons learned?

*–Tom Jerry*

# GPU Architecture for Machine Learning

Workload is the king, as always

# GPU was born differently

- Throughput matters and single threads do not.
- Hide memory latency through parallelism.
- Let programmer deal with “raw” storage hierarchy.
- Avoid high frequency clock speed, desirable for portable devices, consoles, laptop

# Modern GPU

**Objective** : Flexible, programming graphics and high-performance computing

**Architecture** : Unified graphics and parallel computing architecture

**Scalability** : Parallel array of processors are massively multithreaded

**Programming Flexibility** : CUDA programming model for high-performance GPGPU programming



# CPU vs. GPU



The GeForce RTX 5090 is powered by the NVIDIA Blackwell architecture, and is estimated to have a memory bandwidth of 1,920 GB/s.

100+ TFLOPs Shader, ray tracing, and tensor operations



The Intel Core i9-15900K is a 15th generation processor, also known as the Intel Core Ultra 9 285K. It supports DDR5-8400 RAM in a dual-channel configuration, yielding a system memory bandwidth of about 134.4 GB/s.

Thread-level and ILP optimizations

# CPU vs. GPU









## Instruction Fetch/Decode







NVIDIA P100: 56 “cores” with  
4 32-way SIMT units



Intel E7-8894 v4: 24 hyper-threading  
cores with 256 bit AVX2 instructions

# GPU Architecture

- GPU: From computer graphics to deep learning
- SIMD: single instruction multiple thread computing
- Memory hierarchy: Global/shared/register storage
- Advanced GPU Features and Future Directions

# A 30-year Journey

- **1993:** NVIDIA is founded by Jensen Huang, Chris Malachowsky, and Curtis Priem, focusing on graphics processing technology.
- **1997:** The company gains prominence in the gaming industry with the release of the RIVA series of graphics processors.
- **1999:** NVIDIA introduces the term “GPU” (Graphics Processing Unit) with the launch of the GeForce 256, marking a significant advancement in graphics processing.
- **2000:** NVIDIA acquired assets from 3dfx, a former competitor in the graphics card market, enhancing its technological portfolio and market position.
- **2006:** The company launches CUDA, a parallel computing platform and programming model, enabling GPUs to be used for general-purpose processing.
- **2018:** NVIDIA’s GPUs become the powerhouse to AI research and applications, solidifying its position in the AI industry.
- **2024:** The company’s market capitalization surpasses \$3 trillion, reflecting its leadership in AI and high-performance computing.

# A Bumpy Journey



# Ups and Downs

- **1995:** NVIDIA's first graphics accelerator, the NV1, was designed to process quadrilateral primitives, differing from competitors who preferred triangle primitives. When Microsoft introduced DirectX, it exclusively supported triangles, leading to the NV1's market failure.
- **1996:** NVIDIA partnered with Sega to supply the graphics chip for the Dreamcast console. However, NVIDIA's technology lagged behind competitors, and Sega chose another vendor. Sega's president, Shoichiro Irimajiri, invested \$5 million in NVIDIA, which kept the company afloat during this challenging period.
- **2008:** NVIDIA faced a significant setback due to manufacturing defects in certain mobile chipsets and GPUs, leading to a \$200 million write-down and a class-action lawsuit.
- **2022:** The company announced the termination of its planned \$40 billion acquisition of Arm Holdings, citing regulatory challenges.
- **January 2025:** NVIDIA experienced a substantial market capitalization loss of \$589 billion following the emergence of the Chinese startup DeepSeek, which introduced a low-cost AI model. This event marked the largest single-day loss in stock market history.

# Historical question: why is quadratic texture mapping bad?





ASPA 美中半导体协会

CHINESE AMERICAN SEMICONDUCTOR PROFESSIONAL ASSOCIATION

since 1991

ASPA

Every great technology has a humble origin

*–Tom Jerry*



# GPU V0.1: Barrel Shifter

The Memory Constraints of Early Graphics Systems. Early arcade machines could not afford to store a full framebuffer due to cost and space limitations.

**The Solution:** Instead of pre-storing the entire frame in memory, these systems used real-time compositing where video chips assembled graphics dynamically as the screen was being drawn (a process called raster scan output).

## 1 Alien sprite data is stored in memory

- The CPU loads the **bitmap** from memory.

## 2 The barrel shifter shifts the sprite left or right instantly

- The CPU **doesn't modify the original data**, just controls how much to shift.

## 3 The new shifted data is sent directly to the display

- The arcade hardware **composites the sprite** in real-time onto the video output.



Key Barrel Shifter Operations in Graphics

| Operation         | Purpose                                          | Example in Early Games                              |
|-------------------|--------------------------------------------------|-----------------------------------------------------|
| Bit Shifting      | Moves sprite data horizontally/vertically        | Sliding backgrounds in <b>Space Invaders</b> (1978) |
| Bit Rotation      | Wraps bits around during shifts                  | Looping animations without recomputing data         |
| Masking & Merging | Composites multiple bitmaps together efficiently | Overlaying characters over scrolling backgrounds    |

# GPU: Three Generations

- GPU evolution: From fixed-function accelerator to programmable processor.
- Early adoption in machine learning: The massive parallel nature suitable for linear algebra.
- GPU in the deep learning : Architecture evolution driven by rapid growth of deep learning.

| Phase                                                        | Era                        | Key Features                                                                                                                                    | Examples                                                                                                              |
|--------------------------------------------------------------|----------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------|
| <b>1 Hardwired GPU (Fixed-Function Graphics)</b>             | <b>1980s – Early 2000s</b> | - Specialized <b>fixed-function pipelines</b> for rendering - No programmability, only hardware-based transformations & rasterization           | - <b>1981:</b> IBM CGA (First consumer graphics card) - <b>1999:</b> NVIDIA GeForce 256 (First GPU)                   |
| <b>2 Programmable GPU (Shader-Based Graphics Processing)</b> | <b>Early 2000s – 2010s</b> | - <b>Vertex &amp; Pixel Shaders</b> introduced for <b>custom effects</b> - Unified Shader Architecture allows <b>software-defined rendering</b> | - <b>2001:</b> NVIDIA GeForce 3 (First programmable vertex shader) - <b>2006:</b> NVIDIA Tesla (First Unified Shader) |
| <b>3 GPGPU for AI/ML (General-Purpose GPU Computing)</b>     | <b>2010s – Present</b>     | - <b>CUDA/OpenCL</b> enable <b>parallel computing</b> for AI, HPC - <b>Tensor Cores &amp; AI Accelerators</b> introduced                        | - <b>2007:</b> NVIDIA CUDA (GPGPU programming) - <b>2017:</b> NVIDIA Volta (First Tensor Cores for ML)                |

# Phase 1: Hardwired GPU with fixed function pipeline

- **1980s-2000s:** GPUs were fixed-function, meaning all graphics tasks were hardcoded in hardware.
- **Key Limitations:** No programmability; only predefined transformations, rasterization, and texture mapping.
- **Examples:** IBM CGA (1981), NVIDIA GeForce 256 (1999) (first consumer GPU).

A single-chip processor with integrated transform, lighting, triangle setup/clipping, and rendering engines that is capable of processing a minimum of 10 million polygons per second.



# GPU is a Massive Shader

## 1. 3D Model Representation:

- A complex 3D model, such as a human figure, is represented in a digital environment.

## 2. Mesh Generation:

- The surface of the 3D model is divided into numerous small triangles, forming a mesh.
- Each triangle consists of three vertices, which are points in 3D space.

## 3. Vertex Data Extraction:

- The coordinates and other attributes (like color, normal vectors, texture coordinates) of each vertex are extracted.
- This data is organized into structures known as Vertex Buffer Objects (VBOs).

## 4. Storage in Vertex Buffers:

- The VBOs are stored in the GPU's memory to facilitate efficient rendering during the graphics pipeline process.



# Vertex

1. **Vertex Position:** Specifies the 3D coordinates of the vertex in object space.
2. **Vertex Color:** Defines the color associated with the vertex, allowing for vertex-based color interpolation.
3. **Texture Coordinates:** Indicate how textures are mapped onto the surface of the geometry.
4. **Normal Vectors:** Provide information about the surface orientation at the vertex, essential for lighting calculations.
5. **Tangent and Bitangent Vectors:** Used in advanced shading techniques, such as normal mapping, to handle texture orientation.
6. **Bone Weights and Indices:** Utilized in skeletal animation to determine the influence of bones on the vertex.



# GPU is a Massive Shader

The graphics rendering pipeline consists of several stages, each responsible for specific tasks in transforming 3D models into 2D images.

- **Vertex Transformation Stage:** Transforms individual triangle vertices from 3D world space to 2D screen space.
- **Shape Assembly Stage:** A fixed-function stage that assembles transformed vertices into geometric shapes, typically triangles.
- **Rasterization Stage:** Another fixed-function stage that converts assembled triangles into a set of pixels or fragments.
- **Pixel Shader Stage:** Determines the color of each pixel. Programmed using a pixel shader (also known as a fragment shader in OpenGL and Metal), which is executed once per pixel.



# MVP: Modeling-View-Projection translation

把三维空间中的物体投影到二维平面上展示，需要经过MVP变换，MVP变换指的是：模型变换(model)、视图变换(view)、投影变换(projection)。可以设想一下拍照的过程，通常需要三步，人物摆pose、调整相机、按下快门拍照。把三维物体投影到二维平面则需要经过MVP变换，MVP变换与拍照的类比关系如下：

- 摆Pose-模型变换 (modeling translation)
- 调整相机位置-视图变换 (view translation)
- 拍照-投影变换 (projection translation)



# Modeling Translation

模型变换(Modeling translation)是将物体从局部空间变换到世界空间的一个矩阵，这个矩阵是随着物体的运动而改变的。



$$P_r = M\theta \cdot \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} \quad \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} = M\theta \cdot \begin{bmatrix} x \\ \sin \alpha \\ \cos \alpha \\ 1 \end{bmatrix}$$

$$y = \sin(\theta + \alpha) = \sin \theta \cos \alpha + \cos \theta \sin \alpha \\ z = \cos(\theta + \alpha) = \cos \theta \cos \alpha - \sin \theta \sin \alpha$$

$$y \sin \theta = \sin^2 \theta \cos \alpha + \sin \theta \cos \theta \sin \alpha \\ z \cos \theta = \cos^2 \theta \cos \alpha - \sin \theta \cos \theta \sin \alpha$$

$$\cos \alpha = y \sin \theta + z \cos \theta \\ \sin \alpha = y \cos \theta - z \sin \theta$$



$$Translate = \begin{bmatrix} 1 & 0 & 0 & Tx \\ 0 & 1 & 0 & Ty \\ 0 & 0 & 1 & Tz \\ 0 & 0 & 0 & 1 \end{bmatrix} \quad Scale = \begin{bmatrix} S_x & 0 & 0 & 0 \\ 0 & S_y & 0 & 0 \\ 0 & 0 & S_z & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

$$M\theta x = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos \theta & -\sin \theta & 0 \\ 0 & \sin \theta & \cos \theta & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

$$M\theta z = \begin{bmatrix} \cos \theta & -\sin \theta & 0 & 0 \\ \sin \theta & \cos \theta & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

$$M\theta y = \begin{bmatrix} \cos \theta & 0 & \sin \theta & 0 \\ 0 & 0 & 0 & 0 \\ -\sin \theta & 0 & \cos \theta & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

$$Model = Translate \cdot Rotate \cdot Scale$$

# View Translation

模型变换将物体变换到世界坐标后，经过视图变换(view translation)将物体从世界坐标转换到摄像机视角下的视图空间，从而确定哪些物体需要被看到（渲染）。视图空间变换分为两步：1. 确认摄像机位置 (Translate) 2. 确实摄像机朝向 (Rotate)



$$M_{view} = \begin{bmatrix} x_{\hat{g} \times \hat{t}} & y_{\hat{g} \times \hat{t}} & z_{\hat{g} \times \hat{t}} & 0 \\ x_t & y_t & z_t & 0 \\ x_{-g} & y_{-g} & z_{-g} & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & -x_e \\ 0 & 1 & 0 & -y_e \\ 0 & 0 & 1 & -z_e \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

# 投影变换 (Projection Translation)

经过视图变换以后，物体成功转换到了摄像机视角下了，我们需要一个投影矩阵 (projection translation) 把摄像机看到的画面投影到我们指定大小的屏幕空间里。而投影分为两种，正交投影和透视投影。正交投影是直接把物体坐标从视图空间转换到标准化设备空间 (NDC)，透视投影则需要先进行一轮变换，把透视投影压缩成正交投影，再通过正交投影把坐标变换到NDC空间。



$$M_{persp} = M_{ortho} \cdot M_{persp \rightarrow ortho}$$

$$M_{persp} = \begin{bmatrix} \frac{2}{r-l} & 0 & 0 & \frac{l+r}{l-r} \\ 0 & \frac{2}{t-b} & 0 & \frac{b+t}{b-t} \\ 0 & 0 & \frac{2}{n-f} & \frac{f+n}{f-n} \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} n & 0 & 0 & 0 \\ 0 & n & 0 & 0 \\ 0 & 0 & n+f & -nf \\ 0 & 0 & 1 & 0 \end{bmatrix} = \begin{bmatrix} \frac{2n}{r-l} & 0 & \frac{l+r}{l-r} & 0 \\ 0 & \frac{2n}{t-b} & \frac{b+t}{b-t} & 0 \\ 0 & 0 & \frac{n+f}{n-f} & \frac{2nf}{f-n} \\ 0 & 0 & 1 & 0 \end{bmatrix}$$

# 视口变换 (Viewpoint Translation)

在MVP变换之后需要视口变换(viewpoint translation)从NDC空间变换到屏幕空间上，这里的屏幕空间并不是UV空间，而是范围是 $[0, \text{width}] \times [0, \text{height}]$ 的空间。

- Irrelevant to z
- Transform in xy plane:  $[-1, 1]^2$  to  $[0, \text{width}] \times [0, \text{height}]$



从标准正方形到屏幕空间

视口变换为  $M_{viewport}$

$$M_{viewport} = \begin{bmatrix} \frac{\text{width}}{2} & 0 & 0 & \frac{\text{width}}{2} \\ 0 & \frac{\text{height}}{2} & 0 & \frac{\text{height}}{2} \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

# MVP: Modeling-View-Projection translation

把三维空间中的物体投影到二维平面上展示，需要经过MVP变换，MVP变换指的是：模型变换(model)、视图变换(view)、投影变换(projection)。可以设想一下拍照的过程，通常需要三步，人物摆pose、调整相机、按下快门拍照。把三维物体投影到二维平面则需要经过MVP变换，最终公式如下：

$$\begin{bmatrix} x_{new} \\ y_{new} \\ z_{new} \\ 1 \end{bmatrix} = M_{view} \cdot P_{persp} \cdot View \cdot Model = M_{view} \cdot P_{persp} \cdot (Rotate \cdot Translate) \cdot (Translate \cdot Rotate \cdot Scale) \cdot \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}$$

# GPU is a Massive Shader

The graphics rendering pipeline consists of several stages, each responsible for specific tasks in transforming 3D models into 2D images.



# GPU is a Massive Shader

1. **Vertex Shader:** Processes individual vertices, transforming them from object space to clip space (e.g., applying world, view, and projection transformations).
2. **Tessellation Stage (Optional):** Subdivides geometry into smaller primitives (as explained in a previous response).
3. **Geometry Shader:** Takes entire primitives as input (e.g., a triangle with three vertices), processes them, and outputs zero or more primitives.
4. **Rasterization:** Converts the output primitives into fragments (pixels) for the Fragment Shader.
5. **Fragment Shader:** Computes the final color and other attributes of each pixel.

# GPU is a Massive Shader

## A. Vertex Processing

- **Objective:** Process individual vertices of 3D objects.
- **Tasks Involved:**
  - **Transformation:** Convert 3D vertex coordinates to screen space.
  - **Lighting:** Apply lighting calculations based on normal vectors.
  - **Clipping & Culling:** Remove unseen parts of objects to optimize performance.
  - **Texture Mapping Coordinates:** Assign texture coordinates to vertices.

# GPU is a Massive Shader

## B. Pixel-Fragment Processing

- **Objective:** Compute final color and appearance of pixels.
- **Tasks Involved:**
  - **Texturing:** Map textures onto 3D surfaces.
  - **Shading:** Apply lighting effects at the pixel level.
  - **Blending:** Mix colors from different surfaces (e.g., transparency effects).
  - **Depth Testing:** Determine which pixel is visible (z-buffering).

# Software-Programmable GPU

- **2001-2010s:** GPUs introduced programmable shaders, allowing developers to write custom graphics effects.
- Key Innovations:
  - Vertex & pixel shaders (NVIDIA GeForce 3, 2001).
  - Unified Shader Architecture (NVIDIA Tesla, 2006).
  - DirectX & OpenGL revolutionized GPU programmability.
- Examples:
  - NVIDIA GeForce series
  - AMD/ATI Radeon series



# Software-Programmable GPU

- **2001-2010s:** GPUs introduced programmable shaders, allowing developers to write custom graphics effects.
- Key Innovations:
  - Vertex & pixel shaders (NVIDIA GeForce 3, 2001).
  - Unified Shader Architecture (NVIDIA Tesla, 2006).
  - DirectX & OpenGL revolutionized GPU programmability.
- Examples:
  - NVIDIA GeForce series
  - AMD/ATI Radeon series



In the **second phase** of GPU evolution, characterized by the introduction of **programmable shaders**, several notable GPUs exemplify this transition:

1. **NVIDIA GeForce 3 Series (2001):**

- Introduced the first programmable **vertex shaders**, allowing developers to write custom programs to manipulate vertex data, enabling more dynamic and realistic graphics effects.

2. **ATI Radeon 9700 (2002):**

- Featured both programmable **vertex and pixel shaders**, providing enhanced flexibility in rendering complex visual effects and contributing to more immersive gaming experiences.

3. **NVIDIA GeForce 8 Series (2006):**

- Implemented the **Unified Shader Architecture**, where previously separate vertex and pixel shaders were combined into a single programmable unit, optimizing resource utilization and performance.

4. **AMD Radeon HD 2000 Series (2007):**

- Adopted a similar **unified shader model**, enhancing programmability and efficiency in rendering processes.

These advancements marked a significant shift from fixed-function pipelines to more flexible, programmable GPUs, laying the groundwork for subsequent developments in general-purpose computing on GPUs (GPGPU).



# Software-Programmable GPU

| Shader Stage                                  | Purpose                                           | Executed By                      |
|-----------------------------------------------|---------------------------------------------------|----------------------------------|
| <b>Vertex Shader</b>                          | Transforms vertices to screen space               | <b>CUDA Cores / Shader Cores</b> |
| <b>Tessellation Shader</b>                    | Dynamically subdivides geometry                   | <b>CUDA Cores / Shader Cores</b> |
| <b>Geometry Shader</b>                        | Processes full primitives (triangles, points)     | <b>CUDA Cores / Shader Cores</b> |
| <b>Mesh Shader (New in Vulkan &amp; DX12)</b> | Replaces Vertex & Geometry shaders for efficiency | <b>CUDA Cores / Shader Cores</b> |
| <b>Pixel/Fragment Shader</b>                  | Computes final color of each pixel                | <b>CUDA Cores / Shader Cores</b> |

# GPU: Three Generations

- GPU evolution: From fixed-function accelerator to programmable processor.
- Early adoption in machine learning: The massive parallel nature suitable for linear algebra.
- GPU in the deep learning : Architecture evolution driven by rapid growth of deep learning.

| Phase                                                        | Era                        | Key Features                                                                                                                                    | Examples                                                                                                              |
|--------------------------------------------------------------|----------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------|
| <b>1 Hardwired GPU (Fixed-Function Graphics)</b>             | <b>1980s – Early 2000s</b> | - Specialized <b>fixed-function pipelines</b> for rendering - No programmability, only hardware-based transformations & rasterization           | - <b>1981:</b> IBM CGA (First consumer graphics card) - <b>1999:</b> NVIDIA GeForce 256 (First GPU)                   |
| <b>2 Programmable GPU (Shader-Based Graphics Processing)</b> | <b>Early 2000s – 2010s</b> | - <b>Vertex &amp; Pixel Shaders</b> introduced for <b>custom effects</b> - Unified Shader Architecture allows <b>software-defined rendering</b> | - <b>2001:</b> NVIDIA GeForce 3 (First programmable vertex shader) - <b>2006:</b> NVIDIA Tesla (First Unified Shader) |
| <b>3 GPGPU for AI/ML (General-Purpose GPU Computing)</b>     | <b>2010s – Present</b>     | - <b>CUDA/OpenCL</b> enable <b>parallel computing</b> for AI, HPC - <b>Tensor Cores &amp; AI Accelerators</b> introduced                        | - <b>2007:</b> NVIDIA CUDA (GPGPU programming) - <b>2017:</b> NVIDIA Volta (First Tensor Cores for ML)                |

# Programmable GPU (GeForce 7800@2005)

**Vertex Shader Engine:** Handles per-vertex computations such as geometry transformations, vertex lighting, and vertex displacement.

- Coordinate transformations (model → world → screen space)
- Per-vertex lighting calculations
- Vertex morphing and animation

**Pixel Shader (Fragment Shader) Engine:** Processes individual pixels (fragments), handling color calculations, texture blending, lighting effects, and other pixel-level operations.

- Per-pixel lighting and shading
- Texture mapping, filtering, and blending
- Complex visual effects (reflections, shadows, bump mapping)

**ROP (Raster Operations Pipeline):** Final stage of the rendering pipeline, converting fragment outputs into pixels stored in the framebuffer.

- Depth (Z) testing and stencil operations
- Color blending and transparency
- Anti-aliasing (AA)
- Writing final pixel values to memory



# Single-Chip Inference Performance - 1000X in 10 years



# Tesla 2006: Unified shader architecture

|                       |                                                                                                                                                                                                   |
|-----------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <b>Overall Design</b> | First CUDA-capable GPU (90nm, 2006), unified graphics and compute with five subsystems: control, compute, cache, memory, interconnect. Work distribution units dispatched tasks to scalable TPCs. |
| <b>Key Modules</b>    | Each TPC had a Geometry Controller, SM Controller, 2 SMs, 1 Texture Unit + L1 cache, and 4 ROPs. Shared L2 caches, DRAM partitions, and display interface supported the whole chip.               |
| <b>SM Design</b>      | Each SM: 8 scalar SPs (FP32/INT32), 2 SFUs (math functions), 16 KB shared memory, instruction & constant caches. Scalar SP design simplified CUDA C support.                                      |
| <b>Workflow</b>       | CPU sends tasks → Input Assembler builds primitives → Vertex Shaders run → Rasterization & Z-cull → Pixel Shaders process fragments → ROPs handle blending/depth/AA → output to memory/display.   |



In 2006, NVIDIA introduced the GeForce 8800. This design featured a “unified shader architecture” with 128 processing elements distributed among eight shader cores. Each shader core could be assigned to any shader task, eliminating the need for stage-by-stage balancing and greatly improving overall performance.

# Tesla 2006: Unified shader architecture

|                       |                                                                                                                                                                                                   |
|-----------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <b>Overall Design</b> | First CUDA-capable GPU (90nm, 2006), unified graphics and compute with five subsystems: control, compute, cache, memory, interconnect. Work distribution units dispatched tasks to scalable TPCs. |
| <b>Key Modules</b>    | Each TPC had a Geometry Controller, SM Controller, 2 SMs, 1 Texture Unit + L1 cache, and 4 ROPs. Shared L2 caches, DRAM partitions, and display interface supported the whole chip.               |
| <b>SM Design</b>      | Each SM: 8 scalar SPs (FP32/INT32), 2 SFUs (math functions), 16 KB shared memory, instruction & constant caches. Scalar SP design simplified CUDA C support.                                      |
| <b>Workflow</b>       | CPU sends tasks → Input Assembler builds primitives → Vertex Shaders run → Rasterization & Z-cull → Pixel Shaders process fragments → ROPs handle blending/depth/AA → output to memory/display.   |



In 2006, NVIDIA introduced the GeForce 8800. This design featured a “unified shader architecture” with 128 processing elements distributed among eight shader cores. Each shader core could be assigned to any shader task, eliminating the need for stage-by-stage balancing and greatly improving overall performance.

# The Programming Challenge



# CUDA Compute Unified Device Architecture

- **Programming Model:** It provides a programming model that allows for the development of software that can execute parallel computations efficiently on GPUs.
- **Parallel Computing Platform:** CUDA enables developers to harness the parallel processing power of NVIDIA GPUs for general-purpose computing tasks, facilitating significant performance improvements in various applications.
- **API Support:** CUDA offers an API that allows software developers to utilize NVIDIA GPUs for general-purpose processing, enabling the development of high-performance applications across various domains.
- **Extensive Ecosystem:** Over the years, NVIDIA has built a comprehensive ecosystem around CUDA, including specialized code libraries and AI models, making it a dominant platform in AI and high-performance computing.



# The History of CUDA

SC08

Conference for High Performance Computing  
Austin Texas