

2024.11.07



2024.11.08



2024.11.09



2024.11.10



2024.11.11

**actually, evacuated\_to :**

[Host | Stream Tn.t] option

**if node is evacuated,  
automatically schedule  
bringing it back when needed**

evacuated **field**

**can evacuation  
be automated?**

to\_host ~evacuate:true

**managing scarce memory**

device\_to\_device ~evacuate:true?

**bringing back across multiple hops**

2024.11.12

but the reverse for  
computational initialization  
e.g. random distr

**default** from\_host  
to initialize hosted  
nodes on device

similar to  
evicted\_to

**data movement automation**

does not need an extra layer!

explicit calls to backend  
work as currently

automates backend calls

available functions and/or  
exposed fields to disable or tune  
the automation

2024.11.13



2024.11.14



2024.11.15



more natural for pipeline parallelism

In OCANNL, easy to manually partition a model into components and explicitly distribute over devices and backends.

awkward/hard for model parallelism

especially for parameter sharding

Tensor parallelism

In OCANNL, activations are typically a special case of non-materialized tensors.

to be continued tomorrow

## Fully Sharded Data Parallel

Paper from Meta AI  
arXiv:2304.11277

model parallelism

we need good design for initialization

parameter prefetching

overlap communication with computation

operation reordering

hard! requires

passing activations

rethinking design:  
non-materialized tensors vs the eviction mechanism

2024.11.18

1. Each sub-module / model layer is a FSDP unit.
2. Each unit's non-shared parameters are flattened, concatenated and sharded across backends/devices.
3. Before the unit's computation, unshards required parameters. Afterward, deletes other shards' parameters.
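A minimal sketch of the three steps above, in plain Python with lists standing in for device buffers. The names `flatten_and_shard` and `unshard` are hypothetical, and the list concatenation only stands in for a real all-gather; this is not how any FSDP implementation is actually written.

```python
import math

def flatten_and_shard(params, num_devices):
    """Flatten and concatenate a unit's parameters, zero-pad, split into equal shards."""
    flat = [x for p in params for x in p]
    shard_len = math.ceil(len(flat) / num_devices)
    padded = flat + [0.0] * (shard_len * num_devices - len(flat))
    shards = [padded[d * shard_len:(d + 1) * shard_len] for d in range(num_devices)]
    return shards, len(flat)

def unshard(shards, total_len):
    """Stand-in for all-gather: concatenate every device's shard and drop the padding."""
    return [x for s in shards for x in s][:total_len]

# A toy FSDP unit with two parameters, already flattened to 1-D lists.
weights = [1.0, 2.0, 3.0, 4.0, 5.0]
bias = [6.0, 7.0]
shards, total_len = flatten_and_shard([weights, bias], num_devices=3)

# Just before the unit's computation, every device reconstructs the full vector...
full = unshard(shards, total_len)
# ...and afterwards it would drop `full` again, keeping only its own shard.
```

With 7 values over 3 devices, each shard holds 3 values and the last shard carries two padding zeros.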

communicates parameters & their gradients on demand, for unsharding & accumulation

**crosses abstraction levels:  
bad fit for OCANNL design**

great for balancing memory, computation, number of devices

### Fully Sharded Data Parallel

parameter sharding

communicates activations: computation boundaries

downside: complicates model design with computational considerations

### Tensor Parallelism

upside: keeps model design clean

manually: a slice operator indexed by backend/device/stream

automatically find good axes

tricky

hard work: find good axes to balance memory, computation, number of devices

fits with OCANNL design

worse fit for OCANNL design

2024.11.19, updated 2024.11.27

design risk: interacts with  
the whole OCANNL, esp.  
shape inference

visually attractive examples

convolutions  
and padding

even LLMs:  
need padding

1/3 done: tinygrad

Pallas: extension of JAX  
for writing custom kernels

done

mutable abstraction  
layer for JAX

what are blocks?

added  
11.27

more  
deep  
dives

keras

keras.core

LLM101n

llm.c

Karpathy's

jackpeck's llama2.ml

llama2.c

nanoGPT

Keller Jordan's  
modded-nanogpt

continues recent theme

pipeline parallelism

mostly done

program search  
in tinygrad

done

super important for design

2024.11.20

papers to read:

ZeRO arXiv:1910.02054

Zero Bubble arXiv:2401.10241

NanoFlow arXiv:2408.12757

more notes coming

pipeline with nano-batches at the granularity of operations

PP most helpful for cross-server connections

optimizer/gradient sharding

pipeline parallelism

asynchronous PP breaks optimizer semantics

ZeRO: focus on minimizing per-GPU memory

Zero Bubble

ZeRO-DP is similar to FSDP, but lossy and much simpler

optimizes microbatch scheduling of Forward, Backward, and Weight-gradient

layer-wise parallelizes the optimizer step by accumulating progressively

partitions optimizer states, discards gradient parts for other partitions

propagates global optimizer state of previous iteration while the next iteration is computing the initial forward steps

redoes the optimizer step if a global check fails for any layer (found INF, NaN, or gradient clip needed)

2024.11.21



2024.11.22

Why is all this in here? schedule.py: things to schedule  
Where is the scheduling? not scheduling.

what is swizzle???

↳ see later

to\_uop for non-const buffers is the ShapeTracker's view of the buffer

via graph rewriting

a UOp is\_scheduled when the op is Ops.VIEW

selects groups to fuse,  
vs. what to materialize

prepares indexing (aka.  
movement ops), integrates  
indexing with computation

similar to OCANNL's inputs and outputs fields in routine

ScheduleItem with disjoint input and output buffers

multiple outputs possible via Ops.SINK AST node,  
otherwise single output

scheduler in tinygrad

PR 7065

upcoming design

Gets rid of LazyBuffer (replaced with UOp) and of engine/lazy.py

gets rid of indexing processing in schedule.py, instead exposes ShapeTracker methods in ops.py

tracks materialized buffers (i.e. realized in tinygrad) in ops.py

graph rewriting to push views below computations, collect buffers and kernels

in schedule.py

2024.11.23

args are already SRAM (copied before running a kernel), then copied explicitly into registers before computation

args are HBM/DRAM mem (global), explicitly copied into SRAM mem (local) before computation

imperative: explicit assignments

Refs were introduced to make JAX stateful even before Pallas, reused

on TPU

on GPU

more control over memory access

Pallas: a JAX kernel language

kernels parallelize over grids

a BlockSpec projects an input or output to a block-slice view and threads/streams, for blockwise parallelism over grids

vmap of pallas\_call: adds extra grid axis

manually specified for each input & output

dynamic slicing and masking

generalization of tiling
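To make the grid / block-slice idea concrete, here is a pure-Python emulation that does not call JAX or Pallas at all. A "spec" is assumed to be a `(block_shape, index_map)` pair, where `index_map` turns grid indices into block indices; the helper names (`block_slices`, `take`, `blocked_matmul`) are made up for this sketch, and the real `BlockSpec` constructor may take its arguments differently.

```python
from itertools import product

def block_slices(block_shape, index_map, grid_idx):
    """Slices selected by a BlockSpec-like (block_shape, index_map) at one grid point."""
    block_idx = index_map(*grid_idx)
    return tuple(slice(b * s, (b + 1) * s) for b, s in zip(block_idx, block_shape))

def take(matrix, slices):
    """Copy the block-slice view out of a nested-list matrix."""
    rows, cols = slices
    return [row[cols] for row in matrix[rows]]

def blocked_matmul(x, y, bm, bn):
    """Matmul where each grid point (i, j) computes one (bm, bn) output block."""
    m, k, n = len(x), len(y), len(y[0])
    grid = (m // bm, n // bn)
    out = [[0] * n for _ in range(m)]
    # One spec per input and output, specified manually.
    x_spec = ((bm, k), lambda i, j: (i, 0))   # row block i, all of k
    y_spec = ((k, bn), lambda i, j: (0, j))   # all of k, column block j
    o_spec = ((bm, bn), lambda i, j: (i, j))
    for gi in product(*(range(g) for g in grid)):  # sequential stand-in for the parallel grid
        xb = take(x, block_slices(*x_spec, gi))
        yb = take(y, block_slices(*y_spec, gi))
        rs, cs = block_slices(*o_spec, gi)
        for r in range(bm):
            for c in range(bn):
                out[rs.start + r][cs.start + c] = sum(xb[r][t] * yb[t][c] for t in range(k))
    return out

x = [[1, 2], [3, 4], [5, 6], [7, 8]]
y = [[1, 0, 2, 0], [0, 1, 0, 2]]
result = blocked_matmul(x, y, bm=2, bn=2)
```

The kernel body only ever sees its own blocks; which block it gets is decided entirely by the index maps.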

doesn't support:

conv\_general etc. -- usually not on hardware

gather / scatter -- backends without noncontiguous memory reads / writes

also needs explicit out\_shape

not implemented yet: alternatives to BlockSpec e.g. overlapping windows for convolutions

GPU and TPU are not entirely interchangeable?

2024.11.24



2024.11.25

counter-based Parallel Random Numbers

manual reduction across blocks in a group by locking at the end of a kernel

computes pseudo-random nums on the device with seed int32
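A sketch of what "counter-based" buys: each random number is a pure function of `(seed, counter)`, so any thread can compute element `i` without shared state. The mixer below is an ad hoc 32-bit integer hash chosen for brevity; it is an assumption of this sketch and not the generator Triton actually uses.

```python
MASK32 = 0xFFFFFFFF

def mix32(x):
    """Bijective 32-bit integer mixer (xor-shifts and odd multiplications)."""
    x &= MASK32
    x ^= x >> 16
    x = (x * 0x7FEB352D) & MASK32
    x ^= x >> 15
    x = (x * 0x846CA68B) & MASK32
    x ^= x >> 16
    return x

def rand_uniform(seed, counter):
    """Float in [0, 1) determined only by (seed, counter); no generator state."""
    return mix32(mix32(seed) ^ counter) / 2**32

# Elements can be produced in any order, e.g. by independent blocks.
forward = [rand_uniform(42, i) for i in range(8)]
backward = [rand_uniform(42, i) for i in reversed(range(8))][::-1]
```

Because the mixer is a bijection on 32-bit values, distinct counters under one seed never collide.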

Triton part 2

vs scheduling DSLs

vs polyhedral DSLs

use affine *access functions*

large search space

separate algo and schedule:  
tile splits, loop reordering  
and unrolling, parallel axes

support fusion, interchange,  
tiling, parallelization

not applicable to  
(structured-)sparse networks

TVM has built-in  
automatic scheduling

loop transformations

thread swizzling

auto optimizations

coalescing

pre-fetching

shared mem  
synchronization

orders threads  
within micro-tile to  
contiguous mem  
access

transforms row-major  
to column-major  
submatrix for each  
group-size rows
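The row-major to grouped column-major remapping can be sketched as index arithmetic on the program id. This follows the grouped ordering as I recall it from Triton's matmul tutorial; the function and parameter names are mine, and the real kernel computes this inline.

```python
def swizzled_tile(pid, num_pid_m, num_pid_n, group_size_m):
    """Map a linear program id to an output tile (pid_m, pid_n).

    Tiles are visited column-major within each band of `group_size_m` tile rows,
    so consecutive programs reuse the same few rows of the left operand.
    """
    num_pid_in_group = group_size_m * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * group_size_m
    # The last band may have fewer rows than group_size_m.
    rows_in_group = min(num_pid_m - first_pid_m, group_size_m)
    pid_in_group = pid % num_pid_in_group
    pid_m = first_pid_m + pid_in_group % rows_in_group
    pid_n = pid_in_group // rows_in_group
    return pid_m, pid_n

# 4 x 3 tiles, bands of 2 tile rows.
order = [swizzled_tile(pid, 4, 3, 2) for pid in range(12)]
```

Here the first six programs cover tile rows 0 and 1 going down each column in turn, and only then does the order move to rows 2 and 3.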

async copy  
scheduling

inserts barriers into  
GPU code by detecting  
read-after-write and  
write-after-read hazards
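A toy version of such a pass, assuming straight-line code where each op lists the shared-memory buffers it reads and writes. The `(name, reads, writes)` encoding and `insert_barriers` are hypothetical; a real compiler pass works on IR with control flow and also has to consider cases this sketch ignores, such as write-after-write.

```python
def insert_barriers(ops):
    """ops: list of (name, reads, writes) over shared-memory buffer names.

    Returns the op names with "barrier" inserted wherever an op would read a
    buffer written since the last barrier (read-after-write) or overwrite a
    buffer read since the last barrier (write-after-read).
    """
    out = []
    pending_reads, pending_writes = set(), set()
    for name, reads, writes in ops:
        reads, writes = set(reads), set(writes)
        raw = reads & pending_writes
        war = writes & pending_reads
        if raw or war:
            out.append("barrier")
            # A barrier synchronizes everything issued before it.
            pending_reads, pending_writes = set(), set()
        out.append(name)
        pending_reads |= reads
        pending_writes |= writes
    return out

program = [
    ("store_tile", [], ["A"]),
    ("load_tile", ["A"], []),
    ("load_tile_again", ["A"], []),
    ("store_next_tile", [], ["A"]),
]
scheduled = insert_barriers(program)
```

The second load needs no barrier of its own, since the barrier before the first load already ordered it after the store.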

2024.11.26

fwd: NumPy pad  
bprop: shrink

fwd: NumPy broadcast\_to  
bprop: reduce SUM

fwd: NumPy subarray view  
bprop: pad
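The three forward/backward pairs above can be checked on 1-D lists: each movement op is linear, so its backward pass should be its adjoint, i.e. `dot(f(x), g) == dot(x, f_bwd(g))`. Plain Python stands in for NumPy here, and all function names are hypothetical.

```python
def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def pad(x, before, after):
    return [0] * before + list(x) + [0] * after

def shrink(x, begin, end):
    return list(x[begin:end])

def expand(x, n):
    """Broadcast a length-1 list to length n."""
    return list(x) * n

# Backward passes, each expressed with the dual movement op.
def pad_bwd(g, before, after):
    return shrink(g, before, len(g) - after)

def expand_bwd(g):
    return [sum(g)]                      # reduce SUM over the broadcast axis

def shrink_bwd(g, begin, end, n):
    return pad(g, begin, n - end)        # n is the length of the original input

x = [1, 2, 3]
g = [10, 20, 30, 40, 50, 60]
pad_ok = dot(pad(x, 1, 2), g) == dot(x, pad_bwd(g, 1, 2))

expand_ok = dot(expand([2], 3), [1, 2, 3]) == dot([2], expand_bwd([1, 2, 3]))

y = [1, 2, 3, 4, 5]
shrink_ok = dot(shrink(y, 1, 3), [10, 20]) == dot(y, shrink_bwd([10, 20], 1, 3, len(y)))
```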

RESHAPE

PERMUTE

PAD

EXPAND

SHRINK

pure movement ops

translates all of NumPy  
syntax to ops

STRIDE

vs teenygrad =  
1/10th of tinygrad

tinygrad ShapeTracker

Decomposed using Flip  
for < 0 and a  
combination of Pad  
and Reshape for > 1
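A 1-D sketch of that decomposition. The final step that keeps only the first column is my assumption about how the element gets selected (a Shrink followed by a Reshape back to 1-D); the helper name `strided` is made up, and the result is compared against ordinary Python slicing.

```python
def strided(x, step):
    """Emulate x[::step] using only flip, pad, reshape and shrink."""
    x = list(x)
    if step < 0:
        x = x[::-1]                      # FLIP handles the sign
        step = -step
    if step > 1:
        n = len(x)
        padded_len = -(-n // step) * step            # round up to a multiple of step
        x = x + [0] * (padded_len - n)               # PAD
        rows = [x[i:i + step] for i in range(0, padded_len, step)]  # RESHAPE to (n/step, step)
        x = [row[0] for row in rows]                 # SHRINK to the first column, RESHAPE to 1-D
    return x

data = list(range(7))
every_third = strided(data, 3)
every_second_reversed = strided(data, -2)
```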

a View has:  
shape (i.e. dims), strides,  
offset, mask (begin-end per  
axis), whether it's contiguous

is a list of View objects

canonicalized, e.g.:  
if mask uncovers  
just 1 index, convert  
to "no stride" and  
adjust the offset

assigned to op nodes except  
DEFINE\_LOCAL/GLOBAL/VAR,  
BUFFER, CONST

default strides assume  
shape is rightmost-major
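A simplified model of a View, to pin down how the fields interact. I read "rightmost-major" as: the rightmost axis is the contiguous, fastest-varying one. This is a sketch under that assumption, not tinygrad's class; it skips canonicalization (for example, tinygrad's handling of size-1 axes), and `address` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

def default_strides(shape):
    """Strides for a dense buffer whose rightmost axis varies fastest."""
    strides, acc = [], 1
    for dim in reversed(shape):
        strides.append(acc)
        acc *= dim
    return tuple(reversed(strides))

@dataclass(frozen=True)
class View:
    shape: Tuple[int, ...]
    strides: Tuple[int, ...]
    offset: int = 0
    mask: Optional[Tuple[Tuple[int, int], ...]] = None   # (begin, end) per axis

    @property
    def contiguous(self):
        return (self.offset == 0 and self.mask is None
                and self.strides == default_strides(self.shape))

    def address(self, idx):
        """Flat buffer address of idx, or None where the mask hides it (padding)."""
        if self.mask is not None:
            if any(not (b <= i < e) for i, (b, e) in zip(idx, self.mask)):
                return None
        return self.offset + sum(i * s for i, s in zip(idx, self.strides))

dense = View((2, 3), default_strides((2, 3)))
permuted = View((3, 2), (1, 3))                         # transpose of `dense`, no copy
padded = View((5,), (1,), offset=-1, mask=((1, 4),))    # 3 elements padded by 1 on each side
```

The padded view shows why a mask is needed: index 0 would otherwise map to address -1.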

VIEW has a  
ShapeTracker

VIEW = non-copy  
movement op

VIEW doesn't have children,  
instead typically provides a view  
for the preceding DEFINE\_GLOBAL

other nodes inherit  
ShapeTracker from children;  
all children must have the  
same ShapeTracker!

*why many views per Tracker? → tomorrow*

2024.11.27

ends up duplicating the part  
of the compute graph below  
reducing of the sharded axes

stores per-device bounds  
of the sharded axis
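For illustration, per-device bounds of a sharded axis could be computed as below. The even-split-with-remainder policy and the name `shard_bounds` are assumptions of this sketch; tinygrad's own splitting rule may differ.

```python
def shard_bounds(axis_size, num_devices):
    """Half-open (start, end) bounds per device, spreading any remainder over the first devices."""
    base, extra = divmod(axis_size, num_devices)
    bounds, start = [], 0
    for d in range(num_devices):
        end = start + base + (1 if d < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds

bounds = shard_bounds(10, 4)
```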

its ScheduleItem  
forms a kernel

dedicated float4 support

STORE: retain after  
kernel finishes

MultiLazyBuffer with  
per-device LazyBuffers

BITCAST first takes address  
and pointer-casts the address

multigpu via sharding

CAST and BITCAST

DEFINE\_GLOBAL args:  
position in parameter list,  
param name, mutability

tinygrad followup

UPCAST UOp actually  
means UNROLL!

axis dimensions placeholder

Variable

What is UPCAST Opt  
(i.e. in program search)?

can create\_schedule\_with\_vars  
perform some shape inference?

has range

this range is used to  
render loops in Linearizer

can remain symbolic

generates symbolic mask guard

passed as a kernel param