

# 5D Parallelism

made with ❤️ for “Little ML book club”

# The 5D Parallelism Landscape

Figure: the left panel shows Activation / Data Tensor cuts (DP, TP, SP/CP); the right panel shows Model Architecture cuts (PP, EP).

## ZeRO Strategies: Memory Reduction per GPU
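The per-GPU savings of each ZeRO stage can be estimated with a little arithmetic. The sketch below is an illustration only: it assumes mixed-precision Adam (2 bytes per parameter for weights, 2 for gradients, and k = 12 bytes per parameter of optimizer state) and ignores activations and temporary buffers. The function name `zero_memory_gb` and the 7B / 64-GPU example are hypothetical.

```python
def zero_memory_gb(num_params: float, n_gpus: int, stage: int, k: int = 12) -> float:
    """Approximate per-GPU memory in GB for weights, gradients and optimizer states.

    Assumption: mixed-precision Adam, i.e. 2 bytes/param for weights,
    2 bytes/param for gradients, k bytes/param for optimizer states.
    Activations and communication buffers are not counted.
    """
    params = 2 * num_params
    grads = 2 * num_params
    optim = k * num_params
    if stage >= 1:  # ZeRO-1 shards optimizer states
        optim /= n_gpus
    if stage >= 2:  # ZeRO-2 also shards gradients
        grads /= n_gpus
    if stage >= 3:  # ZeRO-3 also shards parameters
        params /= n_gpus
    return (params + grads + optim) / 1e9


# Hypothetical example: a 7B-parameter model on 64 GPUs.
for stage in range(4):
    print(f"ZeRO-{stage}: {zero_memory_gb(7e9, 64, stage):.2f} GB per GPU")
```

Under these assumptions the footprint shrinks from roughly 112 GB with no sharding to under 2 GB with ZeRO-3, before activations are counted.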



## ZeRO-3 In Action: The Fetch-Compute-Discard Cycle

Step 1: Forward Pass (Layer 1)  
Need L1 Params -> Broadcast from Owner (GPU 0)



Step 2: Forward Pass (Layer 3)  
Need L3 Params -> Broadcast from Owner (GPU 0)



Step 3: Backward Pass (Layer 3)  
1. Fetch P(L3) -> 2. Compute G(L3) -> 3. Reduce-Scatter G(L3) home



Step 4: Optimizer Step  
Update owned Parameters using owned States & Gradients
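The four steps above can be sketched as a toy single-process simulation. This is not how any real framework implements ZeRO-3: each "layer" is a single scalar weight, the loss is simply the model output, every layer has exactly one owner (assigned round-robin by the hypothetical `owner_of`), and the reduce-scatter therefore degenerates into a mean sent to that owner. All names here are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Gpu:
    rank: int
    owned_params: dict = field(default_factory=dict)  # layer index -> weight (persistent)
    owned_grads: dict = field(default_factory=dict)   # layer index -> reduced gradient


def owner_of(layer: int, n_gpus: int) -> int:
    # Hypothetical round-robin ownership, used only for this toy example.
    return layer % n_gpus


def zero3_step(gpus, inputs, n_layers, lr):
    n = len(gpus)
    acts = [[x] for x in inputs]  # saved activations, one list per GPU

    # Steps 1-2, forward: fetch the layer from its owner, compute, discard.
    for layer in range(n_layers):
        w = gpus[owner_of(layer, n)].owned_params[layer]  # "broadcast" from owner
        for r in range(n):
            acts[r].append(w * acts[r][-1])
        del w  # non-owners drop their temporary copy

    # Step 3, backward: fetch again, compute local grads, reduce them home.
    upstream = [1.0] * n  # d(loss)/d(output), with loss = output
    for layer in reversed(range(n_layers)):
        owner = gpus[owner_of(layer, n)]
        w = owner.owned_params[layer]                             # 1. fetch
        local = [upstream[r] * acts[r][layer] for r in range(n)]  # 2. compute
        upstream = [upstream[r] * w for r in range(n)]
        owner.owned_grads[layer] = sum(local) / n                 # 3. reduce home
        del w

    # Step 4, optimizer: each GPU updates only the parameters it owns (plain SGD).
    for gpu in gpus:
        for layer, grad in gpu.owned_grads.items():
            gpu.owned_params[layer] -= lr * grad
        gpu.owned_grads.clear()

    return [a[-1] for a in acts]


gpus = [Gpu(0), Gpu(1)]
for layer, w in enumerate([2.0, 3.0, 4.0]):
    gpus[owner_of(layer, 2)].owned_params[layer] = w
outputs = zero3_step(gpus, inputs=[1.0, 2.0], n_layers=3, lr=0.125)
print(outputs, gpus[0].owned_params, gpus[1].owned_params)
```

With two GPUs and round-robin ownership, the first and third layers both live on GPU 0, so both of those fetches come from the same owner.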



|                                      | ZeRO-3                                                  | Pipeline Parallelism                     |
|--------------------------------------|---------------------------------------------------------|------------------------------------------|
| Each compute unit stores...          | only a fraction of a layer                              | a full layer                             |
| Communication is used to transfer... | weights                                                 | activations                              |
| Orchestration                        | Model-agnostic                                          | Model-agnostic                           |
| Implementation challenges            | Complex to handle model partitioning and communications | Complex to handle efficient PP schedules |
| Scaling considerations               | Prefers large *mbs* and *seq_len* to hide comms         | Prefers large *grad_acc* to hide bubble  |
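The "large grad_acc hides the bubble" row can be made concrete with the usual bubble estimate for a simple all-forward-all-backward or 1F1B schedule: with p pipeline stages and m micro-batches, the idle time is about (p - 1) / m of the useful compute time. The sketch below assumes equal-cost stages and ignores communication; the function names and the p = 16 example are hypothetical.

```python
def pipeline_bubble_ratio(pp_degree: int, num_microbatches: int) -> float:
    """Idle time divided by useful compute time, assuming equal-cost stages."""
    return (pp_degree - 1) / num_microbatches


def pipeline_idle_fraction(pp_degree: int, num_microbatches: int) -> float:
    """Share of the whole step that each stage spends idle."""
    return (pp_degree - 1) / (num_microbatches + pp_degree - 1)


# Hypothetical example: 16 pipeline stages with growing gradient accumulation.
for m in (16, 64, 256):
    print(m, pipeline_bubble_ratio(16, m), round(pipeline_idle_fraction(16, m), 3))
```

Under this estimate, going from 16 to 256 micro-batches shrinks the bubble ratio from about 0.94 to about 0.06.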

Server Node A  
(e.g., DGX H100)



Server Node B  
(e.g., DGX H100)



THE BOTTLENECK  
(Ethernet: ~50 GB/s)

Must use either:

1. ZeRO-3, or
2. Pipeline Parallelism







## 1. Meta LLaMA Family

| Model     | Date     | Parameters | Hardware            | TP | PP | DP/FSDP    | CP   | EP | Key Innovations                                                        |
|-----------|----------|------------|---------------------|----|----|------------|------|----|------------------------------------------------------------------------|
| LLaMA 1   | Feb 2023 | 7B-65B     | 2,048 A100 80GB     | —  | —  | DP         | —    | —  | Basic data parallelism; RSC cluster                                    |
| LLaMA 2   | Jul 2023 | 7B-70B     | RSC + prod clusters | —  | —  | FSDP       | —    | —  | Introduced FSDP; GQA for 70B; 4K context; 1.72M GPU-hours for 70B      |
| LLaMA 3   | Apr 2024 | 8B-70B     | 16,384 H100         | 8  | 16 | FSDP (128) | 1    | —  | 4D parallelism; 8K context; 126 layers (not 128) for balanced PP       |
| LLaMA 3.1 | Jul 2024 | 8B-405B    | 16,384 H100         | 8  | 16 | FSDP (128) | 1-16 | —  | 128K context via CP=16; all-gather CP (not ring attention); 38-43% MFU |
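A quick sanity check for any row in a table like this is that the parallelism degrees multiply out to the GPU count. The sketch below is plain arithmetic on the TP, PP, FSDP and CP values listed above; the helper names are hypothetical.

```python
from math import prod


def world_size(degrees: dict) -> int:
    """Number of GPUs implied by a set of parallelism degrees."""
    return prod(degrees.values())


def remaining_dp(n_gpus: int, tp: int, pp: int, cp: int) -> int:
    """Data-parallel degree left over once TP, PP and CP are fixed."""
    model_parallel = tp * pp * cp
    if n_gpus % model_parallel != 0:
        raise ValueError("GPU count is not divisible by TP * PP * CP")
    return n_gpus // model_parallel


# Degrees taken from the table row with CP = 1.
row = {"TP": 8, "PP": 16, "FSDP": 128, "CP": 1}
print(world_size(row))                         # 8 * 16 * 128 * 1
print(remaining_dp(16_384, tp=8, pp=16, cp=1))
print(remaining_dp(16_384, tp=8, pp=16, cp=16))
```

By this arithmetic alone, raising CP to 16 on the same 16,384 GPUs leaves a data-parallel degree of 8 rather than 128.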

## 2. Google PaLM/Gemini Family

| Model                   | Date     | Parameters        | Hardware            | TP | PP   | DP            | CP | EP         | Key Innovations                                                   |
|-------------------------|----------|-------------------|---------------------|----|------|---------------|----|------------|-------------------------------------------------------------------|
| <b>PaLM</b>             | Apr 2022 | 540B              | 6,144 TPU v4        | 12 | None | 256 (2D FSDP) | —  | —          | Pipeline-free; 57.8% HW utilization; Pathways system              |
| <b>PaLM 2</b>           | May 2023 | Undisclosed       | TPU v4              | ✓  | —    | ✓             | —  | ✓ (sparse) | MoE architecture; improved compute-optimal scaling                |
| <b>Gemini 1.0 Ultra</b> | Dec 2023 | Undisclosed       | Multi-DC TPU v4/v5e | ✓  | —    | ✓             | —  | ✓          | Multi-datacenter training; 97% goodput; optical circuit switching |
| <b>Gemini 1.5 Pro</b>   | Feb 2024 | Undisclosed (MoE) | TPU v5+             | ✓  | —    | ✓             | ✓  | ✓          | Sparse MoE; up to 1M context; long-context specialization         |

## The Core Tension: Topology Drives Parallelism Strategy

Google TPU Architecture  
"Uniform 3D Torus"



GPU Clusters (Meta/DeepSeek)  
"Hierarchical Topology"



### Why skip PP? (TPU torus)

- All-reduce is cheap everywhere in the torus
- PP bubble overhead outweighs the communication savings
- Avoids extra complexity and HBM stress

### Why use PP? (GPU clusters)

- All-reduce across nodes is inefficient
- PP limits inter-node traffic to activations only
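To get a feel for why sending only activations helps on a hierarchical network, the two traffic volumes can be compared with back-of-the-envelope formulas. The sketch below is a rough estimate under stated assumptions: a ring all-reduce in which each rank sends 2(N - 1)/N times the gradient buffer, bf16 (2 bytes) for both gradients and activations, and one pipeline boundary that carries an activation forward and an activation gradient backward per micro-batch. The model size, batch shape and function names are hypothetical.

```python
def dp_allreduce_bytes_per_gpu(num_params: float, n_ranks: int, bytes_per_elem: int = 2) -> float:
    """Bytes each rank sends in one ring all-reduce of the full gradient buffer."""
    return 2 * (n_ranks - 1) / n_ranks * num_params * bytes_per_elem


def pp_boundary_bytes(micro_batch_size: int, seq_len: int, hidden: int,
                      num_microbatches: int, bytes_per_elem: int = 2) -> int:
    """Bytes crossing one pipeline-stage boundary per training step.

    Each micro-batch sends an activation forward and a gradient of the
    same shape backward, hence the factor of 2.
    """
    per_microbatch = micro_batch_size * seq_len * hidden * bytes_per_elem
    return 2 * per_microbatch * num_microbatches


# Hypothetical example: 70B parameters, 8 data-parallel ranks,
# versus micro-batches of shape (1, 8192, 8192) with 16 micro-batches per step.
dp_bytes = dp_allreduce_bytes_per_gpu(70e9, n_ranks=8)
pp_bytes = pp_boundary_bytes(1, 8192, 8192, num_microbatches=16)
print(f"DP all-reduce: {dp_bytes / 1e9:.0f} GB, PP boundary: {pp_bytes / 1e9:.1f} GB")
```

With these made-up numbers the pipeline boundary moves a few GB per step while the gradient all-reduce moves a few hundred, which is the gap that matters when the link between nodes is the slow one.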

## 3. DeepSeek Family

| Model        | Date     | Total Params | Active Params | Hardware     | TP   | PP              | DP     | EP | Key Innovations                                                        |
|--------------|----------|--------------|---------------|--------------|------|-----------------|--------|----|------------------------------------------------------------------------|
| DeepSeek 67B | Jan 2024 | 67B          | 67B (dense)   | H800 cluster | ✓    | ✓               | ZeRO   | —  | Baseline dense model                                                   |
| DeepSeek-V2  | May 2024 | 236B         | 21B           | H800 cluster | —    | 16 (ZeroBubble) | ZeRO-1 | 8  | MLA attention; DeepSeekMoE; 42.5% cost reduction vs 67B                |
| DeepSeek-V3  | Dec 2024 | 671B         | 37B           | 2,048 H800   | None | 16 (DualPipe)   | ZeRO-1 | 64 | Aux-loss-free balancing; FP8; \$5.6M total cost; 180K GPU-hr/T tokens  |
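Two of the numbers in this table are easier to read after a small conversion: the GPU-hour figure as wall-clock time on the listed cluster, and the active parameters as a share of the total. The sketch below uses only values from the table and assumes all GPUs run continuously; the function names are hypothetical.

```python
def days_per_trillion_tokens(gpu_hours_per_trillion: float, n_gpus: int) -> float:
    """Wall-clock days to process one trillion tokens, assuming every GPU is busy."""
    return gpu_hours_per_trillion / n_gpus / 24


def active_fraction(active_params_b: float, total_params_b: float) -> float:
    """Share of an MoE model's parameters used for each token."""
    return active_params_b / total_params_b


# DeepSeek-V3 row: 180K GPU-hours per trillion tokens on 2,048 H800s.
print(f"{days_per_trillion_tokens(180_000, 2_048):.2f} days per trillion tokens")

# Active share for the two MoE rows.
print(f"V2: {active_fraction(21, 236):.1%}, V3: {active_fraction(37, 671):.1%}")
```

On these figures, one trillion tokens takes a little under four days on the 2,048-GPU cluster, and each token touches well under a tenth of the model's parameters.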

Figure: Standard DP (No Pipeline) vs. With Pipeline Parallelism



see you next time