

## LAYER 0: HOST SYSTEM (PC - Python/PyTorch)

(Kernel Compilation, Grid Tiling, Data Flattening to Linear Stream)
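The grid-tiling and flattening step can be sketched as follows. This is a minimal illustration, not the project's actual host code: the function name `tile_and_flatten`, the tile shape, and the little-endian float32 encoding are all assumptions, and plain Python lists stand in for PyTorch tensors so the example runs without dependencies.

```python
import struct

def tile_and_flatten(matrix, tile_rows, tile_cols):
    """Walk the tile grid in row-major order and emit each tile's
    elements (also row-major) as little-endian float32 bytes.
    Hypothetical helper; the encoding is an assumption."""
    rows, cols = len(matrix), len(matrix[0])
    stream = bytearray()
    for r0 in range(0, rows, tile_rows):
        for c0 in range(0, cols, tile_cols):
            # Edge tiles are clipped to the matrix bounds.
            for r in range(r0, min(r0 + tile_rows, rows)):
                for c in range(c0, min(c0 + tile_cols, cols)):
                    stream += struct.pack("<f", matrix[r][c])
    return bytes(stream)

weights = [[1.0, 2.0, 3.0, 4.0],
           [5.0, 6.0, 7.0, 8.0]]
stream = tile_and_flatten(weights, tile_rows=2, tile_cols=2)
```

With 2x2 tiles, the first tile's four values are emitted before the second tile's, so each downstream unit can consume a contiguous slice of the linear stream.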

USB Serial / WiFi (Command Stream)

## LAYER 1: GPU MASTER & TENSOR ENGINE [ AMB82-Mini ]

(Simplifying Layer 1 Components)

**[Main Thread / CPU]**

1. Parse USB CMD Packets
2. Manage VRAM Pointers
3. Trigger DMA Transfers
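A sketch of the command-parsing step, modelled in Python for readability. The packet layout here (opcode, target SM, VRAM offset, payload length, then payload) is entirely hypothetical; the source does not define a wire format, so treat `HEADER`, `Command`, and `parse_packet` as illustrative names only.

```python
import struct
from typing import NamedTuple

# Assumed header layout: opcode (u8), sm_id (u8),
# vram_offset (u32 LE), payload_len (u16 LE). Not from the source.
HEADER = struct.Struct("<BBIH")

class Command(NamedTuple):
    opcode: int
    sm_id: int
    vram_offset: int
    payload: bytes

def parse_packet(packet: bytes) -> Command:
    """Validate and decode one command packet."""
    if len(packet) < HEADER.size:
        raise ValueError("packet shorter than header")
    opcode, sm_id, vram_offset, payload_len = HEADER.unpack_from(packet)
    payload = packet[HEADER.size:HEADER.size + payload_len]
    if len(payload) != payload_len:
        raise ValueError("truncated payload")
    return Command(opcode, sm_id, vram_offset, payload)

packet = HEADER.pack(0x01, 0, 0x1000, 4) + b"\xde\xad\xbe\xef"
cmd = parse_packet(packet)
```

In this model, the decoded `vram_offset` is what the main thread would hand to the DMA engine as the source pointer for the next transfer.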

**[Hardware Engine / DMA]**

1. Drive 8080 Parallel Bus
2. VRAM → GPIO (Zero-Copy)
3. Generate WR/CS Timing Signals
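The write cycle the DMA engine drives can be modelled as a sequence of pin states. The sketch below assumes the conventions common to 8080-style parallel interfaces: CS is active low, the receiver latches data on the rising edge of WR, and DC low means command while DC high means data. The DC polarity and the function names are assumptions, not taken from the source.

```python
def write_cycle_trace(data: bytes, is_command: bool):
    """Return (CS, DC, WR, D) pin states for writing `data`.
    D is None when the data lines are not being driven."""
    dc = 0 if is_command else 1          # assumed polarity
    trace = [(0, dc, 1, None)]           # assert CS, WR idle high
    for byte in data:
        trace.append((0, dc, 0, byte))   # WR low, byte driven on D0-D7
        trace.append((0, dc, 1, byte))   # WR rising edge: receiver latches
    trace.append((1, dc, 1, None))       # release CS
    return trace

def latch_on_rising_wr(trace):
    """Receiver model: capture D on each WR rising edge while CS is low."""
    latched = []
    prev_wr = 1
    for cs, _dc, wr, d in trace:
        if cs == 0 and prev_wr == 0 and wr == 1:
            latched.append(d)
        prev_wr = wr
    return bytes(latched)

trace = write_cycle_trace(b"\x10\x20", is_command=False)
received = latch_on_rising_wr(trace)
```

Each byte costs two bus states (WR low, then WR high), which is why the WR strobe rate sets the upper bound on streaming throughput.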

Control (CMD)

Data Stream (Weights)

GLOBAL G-BUS (8-bit Parallel)

~50MB/s Memory Streaming  
[D0-D7 | WR | RD | DC | CS x]

## LAYER 2: SM 0 [ ESP32-S3 ]

### CORE 0 (Receiver)

[ Upstream IO ]

1. Listen to GlobalBus
2. Drive Slave Interface (RX)
3. Write to L1 PSRAM

(RingBuf)
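The ring buffer between the receiver core and the scheduler core can be sketched as a fixed-capacity byte queue. This is a single-threaded model of the data structure only: real dual-core firmware would need atomic or lock-protected indices, and the class and method names are hypothetical.

```python
class RingBuffer:
    """Fixed-capacity byte ring: Core 0 writes, Core 1 reads (model only)."""

    def __init__(self, capacity: int):
        self._buf = bytearray(capacity)
        self._capacity = capacity
        self._head = 0   # next write index (producer side)
        self._tail = 0   # next read index (consumer side)
        self._count = 0  # bytes currently stored

    def write(self, data: bytes) -> int:
        """Store as many bytes as fit; return how many were accepted."""
        n = min(len(data), self._capacity - self._count)
        for i in range(n):
            self._buf[(self._head + i) % self._capacity] = data[i]
        self._head = (self._head + n) % self._capacity
        self._count += n
        return n

    def read(self, max_bytes: int) -> bytes:
        """Remove and return up to max_bytes in FIFO order."""
        n = min(max_bytes, self._count)
        out = bytes(self._buf[(self._tail + i) % self._capacity]
                    for i in range(n))
        self._tail = (self._tail + n) % self._capacity
        self._count -= n
        return out

rb = RingBuffer(4)
accepted_first = rb.write(b"abc")
head_bytes = rb.read(2)
accepted_second = rb.write(b"defg")  # only 3 bytes of space remain
remaining = rb.read(4)
```

Returning the accepted byte count from `write` lets the receiver apply back-pressure instead of overwriting data the scheduler has not yet consumed.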

### CORE 1 (Scheduler)

[ Downstream IO ]

1. Read L1 PSRAM
2. Warp Scheduler Logic
3. Drive Local Master (TX)

## LAYER 2: SM 1 (Collapsed View)

## LAYER 2: SM N (Collapsed View)

LOCAL G-BUS (8-bit Parallel)

~40MB/s Instruction Broadcast  
[D0-D7 | WR | RD | DC | CS [0..3]]
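The two quoted bus rates translate directly into strobe timing. The arithmetic below assumes one byte per WR strobe on an 8-bit bus and takes "MB" to mean 10^6 bytes; both are assumptions, and the helper names are illustrative.

```python
GLOBAL_BUS_BYTES_PER_S = 50_000_000  # ~50 MB/s, global G-Bus
LOCAL_BUS_BYTES_PER_S = 40_000_000   # ~40 MB/s, local G-Bus

def strobe_period_ns(bytes_per_s: float, bus_width_bits: int = 8) -> float:
    """Time budget per WR strobe, assuming one full bus word per strobe."""
    bytes_per_strobe = bus_width_bits // 8
    return 1e9 * bytes_per_strobe / bytes_per_s

def transfer_time_ms(num_bytes: int, bytes_per_s: float) -> float:
    """Ideal time to stream num_bytes, ignoring protocol overhead."""
    return 1e3 * num_bytes / bytes_per_s

global_period = strobe_period_ns(GLOBAL_BUS_BYTES_PER_S)
local_period = strobe_period_ns(LOCAL_BUS_BYTES_PER_S)
one_mb_global = transfer_time_ms(1_000_000, GLOBAL_BUS_BYTES_PER_S)
```

Under these assumptions the global bus allows about 20 ns per byte and the local bus about 25 ns, so 1 MB of weights takes roughly 20 ms to stream from Layer 1 in the ideal case.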

## SMSP 0 [ RP2040 ]

### CORE 0 (IO)

1. PIO RX Manager
2. Fill FIFO
3. Sync Signals

[ FIFO ]

### CORE 1 (ALU)

1. Floating Point Ops
2. Activation Func
3. Matrix Mult.
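The ALU core's workload can be illustrated with a plain matrix multiply followed by an activation. This is a reference model in Python rather than RP2040 firmware, and ReLU is chosen only as an example; the source does not say which activation functions are supported.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    cols = len(b[0])
    return [[sum(a_row[k] * b[k][j] for k in range(inner))
             for j in range(cols)]
            for a_row in a]

def relu(m):
    """Element-wise ReLU (example activation, an assumption)."""
    return [[max(0.0, x) for x in row] for row in m]

x = [[1.0, -2.0]]            # 1x2 input
w = [[3.0, -1.0],
     [1.0, 2.0]]             # 2x2 weights
y = relu(matmul(x, w))       # pre-activation is [[1.0, -5.0]]
```

A reference model like this is also useful on the host side for checking results returned by the hardware.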

## SMSP 1 [ RP2040 ]

### CORE 0 (IO)

1. PIO RX Manager
2. Fill FIFO
3. Sync Signals

[ FIFO ]

### CORE 1 (ALU)

1. Floating Point Ops
2. Activation Func
3. Matrix Mult.

## SMSP 2 [ RP2040 ]

### CORE 0 (IO)

1. PIO RX Manager
2. Fill FIFO
3. Sync Signals

[ FIFO ]

### CORE 1 (ALU)

1. Floating Point Ops
2. Activation Func
3. Matrix Mult.

## SMSP 3 [ RP2040 ]

### CORE 0 (IO)

1. PIO RX Manager
2. Fill FIFO
3. Sync Signals

[ FIFO ]

### CORE 1 (ALU)

1. Floating Point Ops
2. Activation Func
3. Matrix Mult.

Logical "Warp" (32 Threads)

(Synchronized via SYNC\_TRIG Signal)
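One way to picture the logical warp is as 32 lanes split evenly across the four SMSPs, with results released only once every SMSP has finished its slice. The even split of 8 lanes per SMSP is an assumption (32 / 4), as are the function names; the `done` flags stand in for the SYNC_TRIG handshake, which this sketch models sequentially rather than in parallel.

```python
WARP_SIZE = 32
NUM_SMSP = 4
LANES_PER_SMSP = WARP_SIZE // NUM_SMSP  # 8 lanes each, assumed

def lanes_for(smsp_id: int) -> range:
    """Lane indices owned by one SMSP under the assumed even split."""
    start = smsp_id * LANES_PER_SMSP
    return range(start, start + LANES_PER_SMSP)

def run_warp_step(op, operands):
    """Apply one broadcast instruction `op` to all 32 lanes.
    Results are returned only when every SMSP reports done."""
    if len(operands) != WARP_SIZE:
        raise ValueError("a warp step needs exactly 32 operands")
    results = [None] * WARP_SIZE
    done = [False] * NUM_SMSP
    for smsp_id in range(NUM_SMSP):
        for lane in lanes_for(smsp_id):
            results[lane] = op(operands[lane])
        done[smsp_id] = True       # this SMSP has reached the barrier
    sync_trig = all(done)          # stand-in for the SYNC_TRIG signal
    return results if sync_trig else None

doubled = run_warp_step(lambda v: v * 2, list(range(WARP_SIZE)))
```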