

2025.01.13



2025.01.15

**multilevel command buffer:  
blit/compute encoders**

can compile from the command line, or  
at runtime (a)synchronously from a source string:  
`newLibraryWithSource:options:error:` (sync),  
`newLibraryWithSource:options:completionHandler:` (async)

Apple Metal  
quick glance

**doesn't support double (?)**

2025.01.19



2025.01.20



2025.01.21

uses cuBLAS, cuBLASLt in code for matmul forward and backward, manual kernel for backward bias term

hyperparams for loading any GPT2 or GPT3

random init scheme from GPT2, init computed on the CPU

judiciously uses size\_t to not overflow int
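A quick arithmetic check of why element counts need a 64-bit type. The sizes below are hypothetical (not taken from llm.c), and `ctypes.c_int32` only imitates the usual two's-complement wraparound; in C, signed overflow is undefined behavior.

```python
import ctypes

def as_int32(x: int) -> int:
    """Wrap x the way a 32-bit two's-complement int typically would."""
    return ctypes.c_int32(x).value

# Hypothetical sizes: batch 64, sequence length 1024, padded vocabulary 50304.
B, T, V = 64, 1024, 50304

num_logits = B * T * V          # Python ints are arbitrary precision, so this is exact
fits_in_int = num_logits <= 2**31 - 1

print(num_logits)               # 3296722944
print(fits_in_int)              # False
print(as_int32(num_logits))     # -998244352, i.e. a garbage index if stored in int
```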

conserves memory by reusing buffers

manually manages memory (no per-tensor alloc/free)
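A minimal sketch of the "one allocation, many views" idea, assuming tensors are laid out back to back in a single buffer. The tensor names and sizes are made up for illustration, and `plan_arena` is a hypothetical helper, not an llm.c function.

```python
from array import array

def plan_arena(sizes):
    """Return ({name: offset}, total) for tensors packed back to back."""
    offsets, cursor = {}, 0
    for name, count in sizes.items():
        offsets[name] = cursor
        cursor += count
    return offsets, cursor

# Hypothetical tensors with tiny element counts.
sizes = {"wte": 32, "wpe": 16, "ln1w": 4}
offsets, total = plan_arena(sizes)

arena = array("f", [0.0]) * total      # the single allocation (FP32 elements)
view = memoryview(arena)
# Each tensor is just a slice (a view) into the arena; nothing is copied.
tensors = {name: view[off:off + sizes[name]] for name, off in offsets.items()}

tensors["wpe"][0] = 1.5                # writes through to the arena
print(offsets, total, arena[offsets["wpe"]])
```

Freeing everything is then a single release of `arena`, which is the point of avoiding per-tensor alloc/free.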

manual kernels for: encoder forward and backward, layernorm forward and backward, softmax forward and (in-place) backward, Cross-Entropy forward fused with backward, AdamW

see slide  
2024.11.20

llm.c

uses ZeRO for multi-GPU

uses MPI to distribute

To Be Continued

two variants: with & w/o cuDNN, used for attention

uses stochastic rounding from FP32 to BF16
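A sketch of stochastic rounding on the raw bits, assuming BF16 is the top 16 bits of an FP32 value: the 16 discarded bits are compared against a random threshold, so the probability of rounding away from zero equals the discarded fraction. This simulates the idea in plain Python; it is not the llm.c kernel, and the function names are mine.

```python
import random
import struct

def f32_bits(x: float) -> int:
    """Bit pattern of x as an FP32 value (x is assumed to fit in FP32)."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b: int) -> float:
    return struct.unpack("<f", struct.pack("<I", b))[0]

def stochastic_round_bf16(x: float, rng: random.Random) -> float:
    """Round x to BF16 precision stochastically; result returned as a float."""
    bits = f32_bits(x)
    low = bits & 0xFFFF                 # the 16 bits BF16 cannot keep
    kept = bits & 0xFFFF0000            # truncation toward zero
    threshold = rng.getrandbits(16)     # uniform in [0, 65535]
    if low > threshold:                 # happens with probability low / 65536
        kept = (kept + 0x10000) & 0xFFFFFFFF   # next BF16 value away from zero
    return bits_f32(kept)

rng = random.Random(0)
x = 1.0 + 2.0 ** -10                    # between BF16 neighbors 1.0 and 1.0078125
samples = [stochastic_round_bf16(x, rng) for _ in range(20000)]
print(sorted(set(samples)), sum(samples) / len(samples))
```

Unlike round-to-nearest, the mean of the rounded samples stays close to `x`, so small updates are not systematically lost.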

2025.01.23

Sharded Data Parallel and FSDP  
not supported yet -- only sharding  
optimizer states aka. ZeRO-1

dedicated CUDA stream  
for NCCL ops

llm.c/zero.cuh here,  
NCCL and MPI overview  
on future maps

llm.c distributed

multi\_gpu\_async\_reduce\_gradient:  
reduce-scatter if ZeRO else all-reduce

different network socket API causes some  
duplication between Windows and Linux

gpt2\_calculate\_grad\_norm: reuses the  
activations buffer; without ZeRO, gradients already  
averaged across all GPUs, sums norms locally;  
with ZeRO, need to all-reduce-sum the norms
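A small simulation of the two cases, with ranks modeled as list entries. It assumes that with ZeRO each rank holds only its own shard of the averaged gradient (after reduce-scatter), so the squared norms are partial and must be summed across ranks. The helper names are hypothetical.

```python
import math

def all_reduce_sum(values):
    """Every rank ends up with the sum of all ranks' values."""
    total = sum(values)
    return [total for _ in values]

def grad_norm_replicated(full_grads_per_rank):
    """No ZeRO: each rank already has the full averaged gradient."""
    return [math.sqrt(sum(g * g for g in grads)) for grads in full_grads_per_rank]

def grad_norm_sharded(shards_per_rank):
    """ZeRO: sum squares locally, all-reduce-sum them, then take the root."""
    local_sq = [sum(g * g for g in shard) for shard in shards_per_rank]
    return [math.sqrt(s) for s in all_reduce_sum(local_sq)]

full = [3.0, 4.0, 0.0, 12.0]                 # averaged gradient
replicated = [list(full), list(full)]        # 2 ranks, same data on each
sharded = [full[:2], full[2:]]               # 2 ranks, half each

print(grad_norm_replicated(replicated))      # [13.0, 13.0]
print(grad_norm_sharded(sharded))            # [13.0, 13.0]
```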

2025.01.24

currently executing in a warp

coalesced\_group active =  
coalesced\_threads()

active.sync() no deadlock

protects from deadlocks  
wrt. \_\_syncthreads()

if size=32, gives warps

first-class thread blocks

group.sync()  
group.thread\_rank(): no. of this thread in this group

tiled\_partition(group, size)

API for thread subsets

CUDA Cooperative Groups

CUDA Warp-level primitives

32-bit (int) masks  
pick threads of a warp

thread\_block\_tile::...

#pragma unroll  
since size is static

synchronized data exchange:

\_\_all\_sync, \_\_any\_sync, \_\_uni\_sync,  
\_\_ballot\_sync;  
\_\_shfl\_sync, \_\_shfl\_up\_sync,  
\_\_shfl\_down\_sync, \_\_shfl\_xor\_sync;  
\_\_match\_any\_sync, \_\_match\_all\_sync

shfl()  
shfl\_down()  
shfl\_up()  
shfl\_xor()  
any()  
all()  
ballot()  
match\_any()  
match\_all()

then, compiler can remove  
synchronizations (unsafe  
when done manually)

Get \_\_activemask;

Sync with memory fence

\_\_syncwarp(mask=FULL\_MASK)

0xffffffff

\_\_ffs,  
\_\_popc

To Be Continued

2025.01.25

`__match_any_sync`: returns  
the mask of threads with the  
same value as the calling thread

`__match_all_sync`: returns  
the given mask if all its threads  
have the same value, otherwise 0

`__all_sync, __any_sync`: returns non-zero iff the value  
is non-zero for all/any of the threads in the mask

each calling thread must be in the mask,  
all masks must be the same

`__ballot_sync`: returns the mask  
of threads with non-zero value
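To pin down the vote and match semantics noted above, here is a CPU-side simulation over a list of 32 per-lane values. It only mimics the results described in these notes (it is not CUDA code); `popc` and `ffs` imitate `__popc` and `__ffs`, which are handy for reading the returned masks.

```python
FULL_MASK = 0xFFFFFFFF

def in_mask(mask, lane):
    return (mask >> lane) & 1 == 1

def ballot(mask, predicates):
    """Bit i is set iff lane i is in the mask and its predicate is non-zero."""
    return sum(1 << lane for lane, p in enumerate(predicates)
               if in_mask(mask, lane) and p)

def match_any(mask, values, lane):
    """Mask of participating lanes holding the same value as `lane`."""
    return sum(1 << other for other, v in enumerate(values)
               if in_mask(mask, other) and v == values[lane])

def match_all(mask, values):
    """The mask itself if all participating lanes agree, otherwise 0."""
    seen = {v for lane, v in enumerate(values) if in_mask(mask, lane)}
    return mask if len(seen) <= 1 else 0

def popc(x):
    """Number of set bits."""
    return bin(x).count("1")

def ffs(x):
    """1-based position of the lowest set bit, 0 if x == 0."""
    return (x & -x).bit_length()

values = [lane % 4 for lane in range(32)]
evens = ballot(FULL_MASK, [lane % 2 == 0 for lane in range(32)])
print(hex(evens), popc(evens), ffs(evens))
print(hex(match_any(FULL_MASK, values, 1)), match_all(FULL_MASK, values))
```

A common use: `ffs(ballot(...)) - 1` picks the lowest lane satisfying a predicate as a leader.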

`__reduce_add/min/max_sync`:  
reduces the int or unsigned values

CUDA warp functions:  
Vote, Match, Reduce, Shuffle

`__reduce_and/or/xor_sync`:  
logical op reduces unsigned values

`shfl`: exchange a variable  
between threads of a warp  
(faster than shared mem)

Warp Matrix Functions leverage  
Tensor Cores for Matrix Multiply Add

unlike reduce functions, works  
with all numeric types, including  
`__half2` and `__nv_bfloat162`

`__shfl_sync` specifies  
source lane explicitly

`__shfl_up/down_sync` specify  
delta, source lane is lower/higher

optional width (a power of 2 up to the warp size,  
e.g. 2, 4, 8, 16) subdivides operation into groups,  
with group-relative addressing
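A simulation of how `width` splits the warp into independent segments for the up/down shuffles. The assumption encoded here is that a lane whose source would fall outside its own segment keeps its own value. This is plain Python over a list of lane values, not device code.

```python
WARP_SIZE = 32

def shfl_down(values, delta, width=WARP_SIZE):
    """Lane i reads lane i + delta, unless that leaves i's width-sized segment."""
    out = []
    for lane, own in enumerate(values):
        src = lane + delta
        same_segment = src // width == lane // width
        out.append(values[src] if same_segment else own)
    return out

def shfl_up(values, delta, width=WARP_SIZE):
    """Lane i reads lane i - delta, unless that leaves i's width-sized segment."""
    out = []
    for lane, own in enumerate(values):
        src = lane - delta
        same_segment = src // width == lane // width
        out.append(values[src] if same_segment else own)
    return out

lanes = list(range(WARP_SIZE))
print(shfl_down(lanes, 1, width=4)[:8])   # [1, 2, 3, 3, 5, 6, 7, 7]
print(shfl_up(lanes, 1, width=4)[:8])     # [0, 0, 1, 2, 4, 4, 5, 6]
```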

`__shfl_xor_sync` bitwise XORs calling  
thread lane ID with the given lane mask

See: butterfly pattern
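The butterfly pattern, simulated on a list of 32 lane values: at each step every lane adds the value of the lane whose ID differs in one bit, so after log2(32) = 5 steps all lanes hold the full sum. `shfl_xor` here is a stand-in for the full-warp case only (no `width`).

```python
WARP_SIZE = 32

def shfl_xor(values, lane_mask):
    """Lane i reads the value held by lane i ^ lane_mask."""
    return [values[lane ^ lane_mask] for lane in range(WARP_SIZE)]

def warp_all_reduce_sum(values):
    """Butterfly all-reduce: offsets 16, 8, 4, 2, 1."""
    vals = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        partner = shfl_xor(vals, offset)
        vals = [a + b for a, b in zip(vals, partner)]
        offset //= 2
    return vals

result = warp_all_reduce_sum(range(WARP_SIZE))
print(result[0], len(set(result)))   # 496 1
```

With `shfl_down` instead of `shfl_xor`, only lane 0 would end up with the complete sum; the XOR variant gives it to every lane.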

2025.01.26

cuDNN graph performs inference: shapes  
for virtual/temp tensors, strides, precisions

configurable defaults for: io,  
intermediate, and compute data type

filtering: numerical, behavior,  
functional properties

via `Graph::validate`

Graph: fusion etc.

autotuning

policy-based selection

multiple heuristic-based execution plans

Backend: C API

opt-in to use Tensor Core  
(in backend, not in frontend?)

Frontend: C++  
and Python APIs

all tensors have from 3 to 8 axes  
(with leading dims 1 if not needed)

matmul broadcasts even  
non-1 batch axes if needed

cuDNN: CUDA Deep Neural Network

Matrix Mult.

multinode graphs do not  
support in-place operations

Convolution: forward,  
data grad, weight grad

Batch Normalization: forward,  
backprop, finalize stats

Attention:  
forward, backprop

Pointwise: add, bias, scale, sub,  
mul, rsqrt, relu, elu, gelu, cmp\_gt

Layernorm:  
forward, backprop

only for FP16, BF16, FP8

Instancenorm:  
forward, backprop

To Be Continued

2025.01.27

virtual tensors can be any type,  
but recommended FP32

mixed precision inputs  
via (pointwise) identity

on Ada Lovelace, FP8  
inputs trigger FP8  
Tensor Cores

compute type FP32 / CUDNN\_DATA\_FLOAT  
(recommended for backward pass) and  
CUDNN\_DATA\_FAST\_FLOAT\_FOR\_FP8

pre-compiled single operation engines:  
convolution and normalization ops

require exactly one batch axis

generic runtime fusion engines: only for  
pointwise ops --> matmul or convolution (or  
none) --> pointwise ops [--> reduction op]

### cuDNN Graphs

specialized pre-compiled engines

specialized runtime fusion patterns

Convolution-BatchNorm  
with ReLU activation

Fused Attention with  
max seq length 512,  
forward and backward,  
e.g. similar to BERT  
and T5

ResNet helpers: BatchNorm forward (with  
optional Add, ReLU, and (> 0) side output)  
and backward (with optional dReLU and  
side grad output for fwd's Add)

FP8 Fused Flash  
Attention max  
sequence length 512

allows optional: scalar key scaling,  
padding or causal masks,  
softmax, dropout

support multi-GPU batches

Fused Flash Attention forward and backward,  
usable for GPT- and BERT-like models

configurable with many scaling, mask and dropout options

2025.01.28

can initialize NCCL with: tcp, mpi, fs  
(fs: file system synchronization)

**MPI\_Bcast** to  
initialize NCCL rank

**MPI\_Allgather** to find the  
GPU's ordinal on a machine

1 GPU = 1 process

llm.c usage

Send, Recv: point-to-point  
communication

Reduce-Scatter: pointwise  
reduces a vector and  
scatters the results

**MPI: Message Passing Interface  
and collective operations**

Broadcast: send from one to all

Reduce: send from all, reduce on the fly  
into a value received by one

Barrier: achieves  
global synchronization

All-Reduce: send from all, receive  
the same reduced value by all

Gather: send from all, all values  
received by one node in a container

Scan: each node receives a partially  
reduced value depending on its rank

All-Gather: send from all, every node  
receives all values in a container

Scatter: send from a container on one  
node, a different value to every node

All-to-all: each node has a container,  
from which it sends a different value  
to every node's receiving container

vectorized versions of these, and dedicated  
versions where values are arrays
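The collectives listed above, sketched as pure functions over a list indexed by rank (entry `r` is what rank `r` holds). This only models the data movement, not MPI itself; `None` marks ranks that receive nothing, and Reduce-Scatter assumes one vector element per rank.

```python
from functools import reduce as fold
from itertools import accumulate
import operator

def broadcast(values, root):
    return [values[root] for _ in values]

def reduce_to(values, op, root):
    total = fold(op, values)
    return [total if r == root else None for r in range(len(values))]

def all_reduce(values, op):
    total = fold(op, values)
    return [total for _ in values]

def gather(values, root):
    return [list(values) if r == root else None for r in range(len(values))]

def all_gather(values):
    return [list(values) for _ in values]

def scatter(container, n_ranks):
    """The root's container holds one value per rank."""
    return [container[r] for r in range(n_ranks)]

def scan(values, op):
    """Inclusive prefix reduction: rank r gets op over ranks 0..r."""
    return list(accumulate(values, op))

def reduce_scatter(vectors, op):
    """Elementwise reduce the per-rank vectors; rank r keeps element r."""
    reduced = [fold(op, column) for column in zip(*vectors)]
    return [reduced[r] for r in range(len(vectors))]

def all_to_all(containers):
    """Rank r sends containers[r][d] to rank d: a transpose."""
    return [list(column) for column in zip(*containers)]

held = [1, 2, 3, 4]
print(scan(held, operator.add))                            # [1, 3, 6, 10]
print(reduce_to(held, operator.add, root=0))               # [10, None, None, None]
print(reduce_scatter([[1, 2], [10, 20]], operator.add))    # [11, 22]
print(all_to_all([[1, 2], [3, 4]]))                        # [[1, 3], [2, 4]]
```

All-Reduce is equivalent to Reduce-Scatter followed by All-Gather, which is why the sharded variants can replace it piecewise.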

2025.01.29

PCIe, NVLINK, InfiniBand Verbs, IP sockets

deals with different GPU and interconnect types

control: single-threaded, multi-threaded, multi-process including MPI

single kernel handling both communication and computation

root rank gets the result of Reduce

All-Reduce, Broadcast, Reduce, All-Gather, Reduce-Scatter, Send, Recv

ncclUniqueId is the root rank, w/ it a given rank communicates. Can be a set of ncclUniqueIds; all nodes must have the same set.

a communicator has nodes (ranks) and issues collective operations; a communicator *object* is one node (rank)

optionally can block, default async via stream queues

Cooperative Thread Arrays (CTAs):  
i.e. thread blocks, not warps

NCCL: NVIDIA Collective Communications Library

but enqueueing can block on other ranks to arrive first

must be used when single thread manages multiple devices

dynamic scope  
ncclGroupStart/End

Using multiple communicators concurrently can cause deadlocks, even with separate streams, e.g. when one uses too many blocks.

aggregating operations might optimize communication

a communicator object of a device can use host pointers, cannot use pointers of a peer device, to avoid programmer errors

to avoid internal copies:

ncclMemAlloc & ncclCommRegister

Don't use streams while NCCL uses them

2025.01.30

## NCCL setup and helpers in zero.cuh

gpt2\_update: All-Gather updated shards of params

gpt2\_backward\_and\_reduce:  
All-Reduce gradients per-layer in the last microbatch step, also All-Reduce accumulated\_mean\_loss

multi\_gpu\_async\_reduce\_gradient: if not sharding then All-Reduce, if ZeRO-1 then Reduce-Scatter
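A simulation of the two paths, to check that they give the same parameters. Plain SGD stands in for AdamW here (an assumption to keep the sketch short; the real benefit of ZeRO-1 is that the optimizer state only exists for the local shard). Ranks are list entries and all helper names are hypothetical.

```python
def mean_grads(grads_per_rank):
    n = len(grads_per_rank)
    return [sum(column) / n for column in zip(*grads_per_rank)]

def all_reduce_mean(grads_per_rank):
    """Every rank receives the full averaged gradient."""
    mean = mean_grads(grads_per_rank)
    return [list(mean) for _ in grads_per_rank]

def reduce_scatter_mean(grads_per_rank):
    """Rank r receives only shard r of the averaged gradient."""
    n = len(grads_per_rank)
    mean = mean_grads(grads_per_rank)
    shard = len(mean) // n
    return [mean[r * shard:(r + 1) * shard] for r in range(n)]

def all_gather(shards_per_rank):
    full = [x for shard in shards_per_rank for x in shard]
    return [list(full) for _ in shards_per_rank]

def sgd(params, grads, lr):
    return [p - lr * g for p, g in zip(params, grads)]

def step_replicated(params, grads_per_rank, lr):
    """No sharding: All-Reduce, then every rank updates all parameters."""
    return [sgd(params, g, lr) for g in all_reduce_mean(grads_per_rank)]

def step_zero1(params, grads_per_rank, lr):
    """ZeRO-1: Reduce-Scatter, update own shard, All-Gather updated shards."""
    n = len(grads_per_rank)
    shard = len(params) // n
    grad_shards = reduce_scatter_mean(grads_per_rank)
    new_shards = [sgd(params[r * shard:(r + 1) * shard], grad_shards[r], lr)
                  for r in range(n)]
    return all_gather(new_shards)

params = [1.0, 2.0, 3.0, 4.0]
grads = [[2.0, 4.0, 6.0, 8.0], [0.0, 0.0, 2.0, 0.0]]   # one gradient per rank
print(step_replicated(params, grads, lr=0.5)[0])        # [0.5, 1.0, 1.0, 2.0]
print(step_zero1(params, grads, lr=0.5)[0])             # [0.5, 1.0, 1.0, 2.0]
```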

gpt2\_calculate\_grad\_norm:  
All-Reduce grad\_norm\_squared

main: Periodically compute and  
All-Reduce: validation loss,  
HellaSwag accuracy

NCCL usage in llm.c

All-Reduce to compute memory stats

2025.01.31

unlike OCANNL  
compilation is optional

lazy like tinygrad, but more control like OCANNL:  
explicitly forced, can be explicitly compiled

API the same as NumPy/PyTorch?

all graphs can be compiled with  
dynamic shapes (unlike tinygrad  
and current OCANNL)

cumbersome, verbose notation

compilation pipeline: shape inference, high-level -> scheduler i.e. kernel demarcation and ordering -> low-level -> polyhedral IR -> rendering kernels

High-level (API) and low-level (AIR)  
graph interfaces, codegen, graph runner

quick first glance at Caten

focus on "NN inference runtime"

verdict: the code is too sprawling  
(e.g. too many files) to be worth  
digging into; better focus on tinygrad

but it's impressive, maybe I'll have  
some luck finding inspiration

interesting to look at:  
source/api/tensor.lisp  
source/codegen/scheduler.lisp  
source/codegen/jit.lisp  
source/codegen/memory-planner.lisp  
(minimizes peak memory usage)  
source/byoc/metal.lisp  
external/llm/layers.lisp