

# Contents

|                                                                 |          |
|-----------------------------------------------------------------|----------|
| <b>Stream Index</b>                                             | <b>1</b> |
| <b>STREAM Specification Index</b>                               | <b>1</b> |
| Document Organization . . . . .                                 | 2        |
| Chapter 1: Overview . . . . .                                   | 2        |
| Chapter 2: Functional Blocks . . . . .                          | 2        |
| Quick Reference . . . . .                                       | 2        |
| Functional Unit Blocks (FUB) . . . . .                          | 2        |
| Integration Blocks (MAC) . . . . .                              | 2        |
| Performance Modes (AXI Engines) . . . . .                       | 3        |
| Holistic Overview: V1/V2/V3 Working Together . . . . .          | 3        |
| V1 - Low Performance (Single Outstanding Transaction) . . . . . | 3        |
| V2 - Medium Performance (Command Pipelined) . . . . .           | 4        |
| V3 - High Performance (Out-of-Order Completion) . . . . .       | 4        |
| Performance Comparison Summary . . . . .                        | 5        |
| Clock and Reset Summary . . . . .                               | 5        |
| Clock Domains . . . . .                                         | 5        |
| Reset Signals . . . . .                                         | 5        |
| Interface Summary . . . . .                                     | 6        |
| External Interfaces . . . . .                                   | 6        |
| Internal Buses . . . . .                                        | 6        |
| Area Estimates . . . . .                                        | 6        |
| By Performance Mode . . . . .                                   | 6        |
| Breakdown (Low Performance) . . . . .                           | 6        |
| Related Documentation . . . . .                                 | 7        |
| Specification Conventions . . . . .                             | 7        |
| Signal Naming . . . . .                                         | 7        |
| Parameter Naming . . . . .                                      | 7        |
| State Machine Naming . . . . .                                  | 7        |

## Stream Index

**Generated:** 2025-11-16

## STREAM Specification Index

**Version:** 0.26 **Date:** 2025-10-17 **Purpose:** Complete technical specification for STREAM subsystem

---

## Document Organization

**Note:** All chapters linked below for automated document generation.

### Chapter 1: Overview

- Clocks and Reset

### Chapter 2: Functional Blocks

- Descriptor Engine
  - Scheduler
  - AXI Read Engine
  - AXI Write Engine
  - SRAM Controller
  - Simple SRAM
  - Channel Arbiter
  - APB Config
  - MonBus AXIL Group
  - Top-Level Integration
  - Performance Profiler
- 

## Quick Reference

### Functional Unit Blocks (FUB)

| Module            | File                            | Purpose                          | Lines | Status |
|-------------------|---------------------------------|----------------------------------|-------|--------|
| descriptor_engine | stream_fub/descriptor_engine.sv | Descriptor fetch/parse (256-bit) | ~300  | [Pen]  |
| scheduler         | stream_fub/scheduler.sv         | Transfer coordinator             | ~40   | [Pen]  |
| axi_read_engine   | stream_fub/axi_read_engine.sv   | AXI read master                  | ~30   | [Pen]  |
| axi_write_engine  | stream_fub/axi_write_engine.sv  | AXI write master                 | ~40   | [Pen]  |
| sram_controller   | stream_fub/sram_controller.sv   | Per-channel buffer management    | ~30   | [Pen]  |
| simple_sram       | stream_fub/simple_sram.sv       | Dual-port SRAM primitive         | ~10   | [Pen]  |
| perf_profiler     | stream_fub/perf_profiler.sv     | Channel performance profiling    | ~40   | [Pen]  |

### Integration Blocks (MAC)

| Module          | File                            | Purpose              | Lines | Status   |
|-----------------|---------------------------------|----------------------|-------|----------|
| channel_arbiter | stream_macro/channel_arbiter.sv | Priority arbitration | ~200  | [Pen]    |
| apb_config      | regs/stream_regs.rdl + wrapper  | Config registers     | ~350  | [Future] |

| Module            | File                              | Purpose       | Lines | Stat  |
|-------------------|-----------------------------------|---------------|-------|-------|
| monbus_axil_group | stream_macro/monbus_axil_group.sv | MonBus + AXIL | ~800  | [Don] |
| stream_top        | stream_macro/stream_top.sv        | Top-level     | ~500  | [Pen] |

---

## Performance Modes (AXI Engines)

STREAM AXI engines support three performance modes via compile-time parameters, offering area/performance trade-offs from tutorial simplicity to datacenter throughput.

### Holistic Overview: V1/V2/V3 Working Together

#### Design Philosophy:

- **V1 (Low):** Tutorial-focused, simple architecture, minimal area
- **V2 (Medium):** Command pipelining hides memory latency, 6.7x throughput improvement
- **V3 (High):** Out-of-order completion maximizes memory controller efficiency, 7.0x throughput

**Key Insight:** The benefit scales with memory latency. For low-latency SRAM (2-3 cycles), V1 achieves 40% throughput. For high-latency DDR4 (70-100 cycles), V1 drops to 14% but V2/V3 maintain 94-98% throughput.

#### Parameterization Strategy:

- `ENABLE_CMD_PIPELINE = 0`: V1 (default, tutorial mode)
- `ENABLE_CMD_PIPELINE = 1`: V2 (command pipelined, best area efficiency)
- `ENABLE_CMD_PIPELINE = 1, ENABLE_000_DRAIN = 1` (write) or `ENABLE_000_READ = 1` (read): V3 (out-of-order, maximum throughput)

### V1 - Low Performance (Single Outstanding Transaction)

**Architecture:** Single-burst-per-request, blocking on completion

- **Area:** ~1,250 LUTs per engine
- **Throughput:** 0.14 beats/cycle (DDR4), 0.40 beats/cycle (SRAM)
- **Outstanding Txns:** 1
- **Use Case:** Tutorial examples, embedded systems, low-latency SRAM
- **Key Feature:** Zero-bubble streaming pipeline (NO FSM!)

### **Blocking Behavior:**

- Write: Blocks on B response before accepting next request
- Read: Blocks on R last before accepting next request
- Simple control: 3 flags (`r_ar_inflight`, `r_ar_valid`, completion tracking)

## **V2 - Medium Performance (Command Pipelined)**

**Architecture:** Command queue decouples command acceptance from data completion

- **Area:** ~2,000 LUTs per engine (1.6x increase)
- **Throughput:** 0.94 beats/cycle (DDR4), 0.85 beats/cycle (SRAM)
- **Outstanding Txns:** 4-8 (command queue depth)
- **Use Case:** General-purpose FPGA, DDR3/DDR4, balanced area/performance
- **Key Feature:** Hides memory latency via pipelining (6.7x improvement)

### **Command Pipelining:**

- Write: AW queue + W drain FSM + B scoreboard (async response tracking)
- Read: AR queue + R in-order reception (simpler than write, no B channel)
- Per-command SRAM pointers enable multiple bursts from same channel

**Best Area Efficiency:** 4.19x throughput per unit area (6.7x throughput / 1.6x area)

## **V3 - High Performance (Out-of-Order Completion)**

**Architecture:** OOO command selection maximizes memory controller efficiency

- **Area:** ~3,500-4,000 LUTs per engine (2.8-3.2x increase)
- **Throughput:** 0.98 beats/cycle (DDR4), 0.92 beats/cycle (SRAM)
- **Outstanding Txns:** 8-16 (larger command queue)
- **Use Case:** Datacenter, ASIC, HBM2, high-performance memory controllers
- **Key Feature:** Flexible drain order based on SRAM data availability (7.0x improvement)

### **Out-of-Order Mechanisms:**

- Write: OOO W drain (select command with SRAM data ready), AXI ID matching for B responses
- Read: OOO R reception (match m\_axi\_rid to queue entry), independent SRAM write per command
- Transaction ID structure: {counter, channel\_id} preserves channel routing for MonBus

### Why OOO is Naturally Supported:

1. Per-channel SRAM partitioning (no cross-channel hazards)
2. Per-command pointer tracking (no pointer collision)
3. Transaction ID matching (channel ID in lower bits for routing)

### Performance Comparison Summary

| Configuration | Area | DDR4 Throughput         | SRAM Throughput         | Area Efficiency | Usage    |
|---------------|------|-------------------------|-------------------------|-----------------|----------|
| <b>V1</b>     | 1.0x | 0.14 beats/cycle (1.0x) | 0.40 beats/cycle (1.0x) | 1.00            | Tutorial |
| <b>V2</b>     | 1.6x | 0.94 beats/cycle (6.7x) | 0.85 beats/cycle (2.1x) | 4.19            | General  |
| <b>V3</b>     | 2.8x | 0.98 beats/cycle (7.0x) | 0.92 beats/cycle (2.3x) | 2.50            | Data     |

**Key Takeaway:** V2 offers best area efficiency (4.19x) for typical FPGA use cases. V3 provides marginal throughput improvement (7.0x vs 6.7x) at higher area cost, justified only for high-performance memory controllers that support out-of-order responses.

---

### Clock and Reset Summary

#### Clock Domains

| Clock | Frequency   | Usage                                              |
|-------|-------------|----------------------------------------------------|
| aclk  | 100-500 MHz | Primary - all STREAM logic, AXI/AXIL interfaces    |
| pclk  | 50-200 MHz  | APB configuration interface (may be async to aclk) |

#### Reset Signals

| Reset   | Polarity   | Type                        | Usage                       |
|---------|------------|-----------------------------|-----------------------------|
| aresetn | Active-low | Async assert, sync deassert | Primary - all STREAM logic  |
| presetn | Active-low | Async assert, sync deassert | APB configuration interface |

**See:** Clocks and Reset for complete timing specifications

---

## Interface Summary

### External Interfaces

| Interface        | Type   | Width           | Purpose                         |
|------------------|--------|-----------------|---------------------------------|
| APB              | Slave  | 32-bit          | Configuration registers         |
| AXI (Descriptor) | Master | 256-bit         | Descriptor fetch                |
| AXI (Read)       | Master | 512-bit (param) | Source data read                |
| AXI (Write)      | Master | 512-bit (param) | Destination data write          |
| AXIL (Slave)     | Slave  | 32-bit          | Error/interrupt FIFO access     |
| AXIL (Master)    | Master | 32-bit          | MonBus packet logging to memory |
| IRQ              | Output | 1-bit           | Error interrupt                 |

### Internal Buses

| Interface | Width  | Purpose                                                 |
|-----------|--------|---------------------------------------------------------|
| MonBus    | 64-bit | Internal monitoring bus (channels -> monbus_axil_group) |

---

## Area Estimates

### By Performance Mode

| Configuration      | Total LUTs | SRAM  | Use Case                      |
|--------------------|------------|-------|-------------------------------|
| Low (Tutorial)     | ~9,500     | 64 KB | Educational, area-constrained |
| Medium (Typical)   | ~11,200    | 64 KB | Balanced FPGA implementations |
| High (Performance) | ~13,700    | 64 KB | High-throughput ASIC/FPGA     |

### Breakdown (Low Performance)

| Component              | Instances | Area/Instance | Total       |
|------------------------|-----------|---------------|-------------|
| Descriptor Engine      | 8         | ~300 LUTs     | ~2,400 LUTs |
| Scheduler              | 8         | ~400 LUTs     | ~3,200 LUTs |
| AXI Read Engine (Low)  | 1         | ~250 LUTs     | ~250 LUTs   |
| AXI Write Engine (Low) | 1         | ~250 LUTs     | ~250 LUTs   |
| SRAM Controller        | 1         | ~1,600 LUTs   | ~1,600 LUTs |

| Component              | Instances | Area/Instance  | Total                     |
|------------------------|-----------|----------------|---------------------------|
| Simple SRAM (internal) | 1-8       | 1024x64B total | 64 KB                     |
| Channel Arbiter        | 3         | ~150 LUTs      | ~450 LUTs                 |
| APB Config             | 1         | ~350 LUTs      | ~350 LUTs                 |
| MonBus AXIL Group      | 1         | ~1,000 LUTs    | ~1,000 LUTs               |
| <b>Total</b>           | -         | -              | <b>~9,500 LUTs + 64KB</b> |

---

## Related Documentation

- **PRD.md** - Product requirements and overview
  - **ARCHITECTURAL NOTES.md** - Critical design decisions
  - **CLAUDE.md** - AI development guide
  - **Register Generation** - PeakRDL workflow
- 

## Specification Conventions

### Signal Naming

- **Clock:** aclk, pclk
- **Reset:** aresetn, presetn (active-low)
- **Valid/Ready:** Standard AXI/custom handshake
- **Registers:** r\_ prefix (e.g., r\_state, r\_counter)
- **Wires:** w\_ prefix (e.g., w\_next\_state, w\_grant)

### Parameter Naming

- **Uppercase:** NUM\_CHANNELS, DATA\_WIDTH, ADDR\_WIDTH
- **String parameters:** PERFORMANCE ("LOW", "MEDIUM", "HIGH")

### State Machine Naming

```
typedef enum logic [3:0] {
    IDLE    = 4'h0,
    ACTIVE  = 4'h1,
    // ...
} state_t;

state_t r_state, w_next_state; // Current and next state
```

---

**Last Updated:** 2025-10-17 **Maintained By:** STREAM Architecture Team