



VLSI architecture,  
synthesis & technology



---

# Loom: A Compiler and Hardware Generator for Heterogeneous Reconfigurable Systems

*When Graphs and Streams Are First-Class Citizens*

**Sihao Liu, Tony Nowatzki**

UCLA | CDSC Annual Review | February 2026

# The Representation Problem

## Today: Fragmented Representations

HLS

C → opaque RTL

CGRA

ad-hoc DFG → fixed mapper

Verification

separate testbench

DSE

manual re-integration

## Key Observation

SW kernels

= dataflow graphs

HW architectures

= resource graphs

Mapping

= graph → graph association

Make graphs first-class citizens → accelerator compilation = graph transformations

# The Graph-First Thesis



## SW Optimization

Handshake + Dataflow MLIR dialect  
Graph mutation preserving execution semantics

## SW-HW Mapping

Placement + routing + resource assignment  
Constrained graph-to-graph association

## HW Optimization

Fabric MLIR dialect: PEs, switches, memories  
Topology transformation (tile replication, reshaping)

No prior framework treats both SW and HW as native graph IRs in one compiler

# Loom Full-Stack Pipeline



## Software Frontend

C++ with pragmas → Clang  
SCF → Handshake → Dataflow MLIR  
**DFG: streaming ops + loop deps**

## Mapper P&R

Placement on ADG tiles  
Routing through switches  
**config\_mem bitstream output**

## Hardware Backend

ADGBuilder C++ API  
Fabric MLIR → dual backend  
**SystemC + SystemVerilog**

**Verification: ESI cosim + gem5 full-system simulation**

2 MLIR dialects | Dual backend | 135+ kernels validated end-to-end

From C++ source to verified multi-core hardware in one framework

# Heterogeneous Mapping Problem



Find constrained association from DFG onto ADG: placement + routing + resource assignment
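One way to state this problem formally is sketched below. The notation ($\pi$, $\rho$, $\mathrm{op}$, $\mathrm{cap}$, $c$) is introduced here for illustration and is an assumption, not Loom's own formulation:

$$
\begin{aligned}
&\text{Given a DFG } G_S = (V_S, E_S) \text{ and an ADG } G_H = (V_H, E_H),\\
&\text{find } \pi : V_S \to V_H \text{ and } \rho : E_S \to \mathrm{Paths}(G_H) \text{ such that}\\
&\quad \mathrm{op}(v) \in \mathrm{cap}(\pi(v)) && \forall v \in V_S,\\
&\quad \rho(u,v) \text{ is a path from } \pi(u) \text{ to } \pi(v) && \forall (u,v) \in E_S,\\
&\quad \bigl|\{\, e \in E_S : \ell \in \rho(e) \,\}\bigr| \le c(\ell) && \forall \ell \in E_H .
\end{aligned}
$$

Here $\pi$ is the placement, $\rho$ is the routing, and $c(\ell)$ is the number of streams a link can carry. Because temporal PEs can host several operations, $\pi$ need not be injective. Resource assignment (slots, registers, tags) adds further per-tile choices on top of $\pi$ and $\rho$.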

# Constraint-Driven Mapping

## Phase 1 Placement

DFG node → ADG tile  
Op semantics ∩ capability  
Topological traversal  
**Spatial PE vs. temporal PE**
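The placement phase above can be illustrated with a minimal greedy sketch. It assumes an acyclic DFG (loop-carried edges removed beforehand) and a first-fit tile choice; `DfgNode`, `Tile`, and `placeGreedy` are hypothetical names, not Loom's data structures or its actual placement heuristic.

```cpp
#include <cstddef>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified types for illustration only.
struct DfgNode {
    std::string op;          // operation name, e.g. "add"
    std::vector<int> succs;  // indices of consumer nodes
};

struct Tile {
    std::set<std::string> ops;  // operations this tile can execute
    int capacity;               // 1 for a spatial PE, >1 for a temporal PE
};

// Visits DFG nodes in topological order (Kahn's algorithm) and places each
// on the first tile that supports its op and still has a free slot.
// Returns false if a node cannot be placed or the DFG contains a cycle.
bool placeGreedy(const std::vector<DfgNode>& dfg, std::vector<Tile> tiles,
                 std::vector<int>& placement) {
    const std::size_t n = dfg.size();
    std::vector<int> indegree(n, 0);
    for (const DfgNode& node : dfg)
        for (int s : node.succs) ++indegree[s];

    std::queue<int> ready;
    for (std::size_t i = 0; i < n; ++i)
        if (indegree[i] == 0) ready.push(static_cast<int>(i));

    placement.assign(n, -1);
    std::size_t placed = 0;
    while (!ready.empty()) {
        const int v = ready.front();
        ready.pop();
        int chosen = -1;
        for (std::size_t t = 0; t < tiles.size(); ++t) {
            // Capability check: the tile must support the node's operation.
            if (tiles[t].capacity > 0 && tiles[t].ops.count(dfg[v].op) > 0) {
                chosen = static_cast<int>(t);
                break;
            }
        }
        if (chosen < 0) return false;
        --tiles[chosen].capacity;
        placement[v] = chosen;
        ++placed;
        for (int s : dfg[v].succs)
            if (--indegree[s] == 0) ready.push(s);
    }
    return placed == n;  // fewer placed nodes means the DFG had a cycle
}
```

A tile with `capacity` 1 behaves like a spatial PE and a tile with a larger capacity like a temporal PE, so the same loop covers both kinds.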

## Phase 2 Routing

DFG edge → HW path  
BFS/Dijkstra on ADG  
Connectivity + type check  
Reserve ports, track capacity
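As a rough sketch of the routing phase, the function below runs a breadth-first search over links that still have spare capacity and then reserves the path it finds. The `Link` and `Adg` types and the per-link capacity counter are assumptions made for this example; type checks and port reservation are left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical directed link in the architecture graph (ADG).
struct Link {
    int dst;       // destination node index
    int capacity;  // remaining streams this link can carry
};

using Adg = std::vector<std::vector<Link>>;  // outgoing links per node

// Finds a fewest-hop path src -> dst using only links with spare capacity,
// then reserves one unit of capacity on each link of that path.
// Returns the node sequence, or an empty vector if no route exists.
std::vector<int> routeAndReserve(Adg& adg, int src, int dst) {
    const int n = static_cast<int>(adg.size());
    std::vector<int> parent(n, -1);
    std::vector<bool> seen(n, false);
    std::queue<int> frontier;
    frontier.push(src);
    seen[src] = true;
    while (!frontier.empty()) {
        const int u = frontier.front();
        frontier.pop();
        if (u == dst) break;
        for (const Link& l : adg[u]) {
            if (l.capacity > 0 && !seen[l.dst]) {
                seen[l.dst] = true;
                parent[l.dst] = u;
                frontier.push(l.dst);
            }
        }
    }
    if (!seen[dst]) return {};

    // Walk parent pointers back from dst to recover the path.
    std::vector<int> path;
    for (int v = dst; v != -1; v = parent[v]) path.push_back(v);
    std::reverse(path.begin(), path.end());

    // Reserve capacity so later routes see the link as partly used.
    for (std::size_t i = 0; i + 1 < path.size(); ++i) {
        for (Link& l : adg[path[i]]) {
            if (l.dst == path[i + 1] && l.capacity > 0) {
                --l.capacity;
                break;
            }
        }
    }
    return path;
}
```

Calling the function repeatedly on the same `Adg` shows the effect of capacity tracking: once a link is full, later routes detour around it or fail, which is the point at which a repair step would be triggered.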

## Phase 3 Resource Assign

Temporal slots + registers  
Slot, opcode, reg indices  
Route-table entries  
Graph-coloring allocation
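Graph-coloring allocation can be illustrated with a small greedy coloring sketch: values that are live at the same time interfere and must not share a register. The interference-list input and the name `colorRegisters` are assumptions for this example. Greedy coloring is a heuristic, so a failure here does not prove that no valid assignment exists.

```cpp
#include <cstddef>
#include <vector>

// interferes[v] lists the values that are live at the same time as v.
// Assigns each value the lowest register index not used by an already
// colored neighbor. Returns false if some value needs more than numRegs
// registers under this greedy order. Assumes numRegs >= 0.
bool colorRegisters(const std::vector<std::vector<int>>& interferes,
                    int numRegs, std::vector<int>& reg) {
    const std::size_t n = interferes.size();
    reg.assign(n, -1);
    for (std::size_t v = 0; v < n; ++v) {
        std::vector<bool> used(numRegs, false);
        for (int u : interferes[v])
            if (reg[u] >= 0) used[reg[u]] = true;
        int r = 0;
        while (r < numRegs && used[r]) ++r;
        if (r == numRegs) return false;  // register bound exceeded
        reg[v] = r;
    }
    return true;
}
```

The same scheme applies to temporal slots if "register" is read as "slot" and interference as "mapped to the same temporal PE".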

**Repair:** conflict → reroute → reassign → restart

**Cost:** placement + routing + utilization

**6 hard constraint classes (C1-C6) — never relaxed**

Three-phase place-and-route with iterative repair and 6 hard constraints

# Tagged Types for Heterogeneity

How do multiple operations share one physical tile and interconnect?

## Port Multiplexing

Multiple streams share one port, demuxed by tag

## Instruction Dispatch

Tag matches → selects operation + routing

## Memory Disambiguation

LSQ matches load/store transactions per tag

## Route Multiplexing

Per-tag route-table entries on shared links

Not a separate dimension — tag assignment follows from placement/routing decisions
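The tag mechanism can be pictured as a per-tag lookup at each shared switch. The sketch below is a software model only; `TaggedToken`, `TaggedSwitch`, and the 16-bit tag width are assumptions for this example, not the Fabric dialect's definitions.

```cpp
#include <cstdint>
#include <map>

// A value travelling on a shared link, labelled with the stream it belongs to.
struct TaggedToken {
    std::uint16_t tag;
    std::int32_t value;
};

// Software model of a switch whose route table is indexed by tag.
class TaggedSwitch {
public:
    // Installs or overwrites the route-table entry for one tag.
    void setRoute(std::uint16_t tag, int outPort) { routes_[tag] = outPort; }

    // Returns the output port for the token's tag, or -1 if the tag has
    // no route-table entry.
    int route(const TaggedToken& token) const {
        const auto it = routes_.find(token.tag);
        return it == routes_.end() ? -1 : it->second;
    }

private:
    std::map<std::uint16_t, int> routes_;
};
```

Two streams that enter through the same input port leave through different output ports purely because their tags differ, which is how one physical link can carry several logical routes.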

# Multi-Kernel Scheduling



## Config-Mem Switching

Each kernel → independent config image  
Host writes config between invocations  
**Delta update: overwrite only changed modules**
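A delta update can be sketched as a per-module comparison between the image already loaded and the image of the next kernel. The module-keyed `ConfigImage` layout and the name `configDelta` are assumptions for this example; the actual config\_mem format is not described in these slides.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using ModuleConfig = std::vector<std::uint32_t>;          // config words of one module
using ConfigImage = std::map<std::string, ModuleConfig>;  // module name -> words

// Returns only the modules the host has to rewrite: those that are new in
// `next` or whose config words differ from the currently loaded image.
// Modules present only in `loaded` are left untouched in this sketch.
ConfigImage configDelta(const ConfigImage& loaded, const ConfigImage& next) {
    ConfigImage delta;
    for (const auto& entry : next) {
        const auto it = loaded.find(entry.first);
        if (it == loaded.end() || it->second != entry.second)
            delta.insert(entry);
    }
    return delta;
}
```

When two kernels share most of their routing, the delta contains only a few modules, so the host writes far fewer words than a full reload would need.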

## Cross-Kernel Optimization

Share temporal PE slots across kernels  
Disjoint spatial regions run concurrently  
**Complementary kernels maximize utilization**

**gem5-loom: full-system CGRA simulation in gem5 memory hierarchy**  
CPU dispatch | config loading | DMA | cache effects | workload-level evaluation

From single-kernel mapping to workload-level scheduling on shared fabric

# Experimental Setup

## Benchmark Suite: 135+ Kernels

### Signal Proc

conv, FFT, FIR

### Linear Alg

axpy, gemv, matmul

### Stencil

Jacobi, Laplacian

### ML

batchnorm, softmax

### Sparse

scatter, gather

## Architecture Configs

### Spatial-only

4×4 mesh, all spatial PEs

### Temporal-only

2×2 mesh, all temporal

### Heterogeneous

4×4 spatial + temporal

ESI cosim (bit-exact) | Metrics: success rate, PE utilization, routing congestion, cycle count, config footprint

Topology sweep: Mesh | Torus | DiagonalMesh | Custom

[Preliminary results — detailed data at the poster session]

# Mapping Quality and Performance



Success: XX%

Utilization: +XX%

Config: XX words avg

Heterogeneous mapping wins: best of spatial throughput + temporal flexibility

# Positioning in the Landscape

|                   | CGRA-ME      | OpenCGRA    | Calyx     | Loom               |
|-------------------|--------------|-------------|-----------|--------------------|
| **Input**         | Restricted C | DFG         | Custom IR | C/C++              |
| **HW desc**       | Fixed XML    | Fixed RTL   | Custom    | ADGBuilder API     |
| **IR**            | Custom       | Custom      | Custom    | MLIR native        |
| **Scheduling**    | Modulo (ILP) | Modulo (SA) | Static    | Constraint P&R     |
| **Heterogeneous** | No           | No          | No        | Spatial + Temporal |
| **Retargetable**  | Limited      | No          | Partial   | Any topology       |
| **Verification**  | External     | External    | External  | ESI + gem5         |

Graph-native IR

Heterogeneous scheduling

Full-system verification

*vs. HLS: Loom targets reconfigurable fabrics with transparent config; HLS produces opaque monolithic RTL*

First framework with graph-native IR + heterogeneous scheduling + full verification

# Summary and Future Directions

## What we built

Dataflow + Fabric

Two MLIR dialects

Mapper

Constraint-driven P&R

Backends

SystemC + SystemVerilog

gem5-loom

Full-system simulation

## The bigger thesis

- Graphs and streams as first-class citizens
- Complexity emerges from connectivity
- No fine-tuning HW or SW individually
- Different philosophy from HLS + classical CGRA

## Open Research Directions

ILP/SAT placement

Auto-tiling

Physical design closure

Cross-layer DSE

135+ kernels | Dual backend | Retargetable to any topology

Thank you | sihao@cs.ucla.edu

# Backup: Dialect Details

## Dataflow Dialect

| Op                   | Role                        |
|----------------------|-----------------------------|
| `dataflow.stream`    | index generator + predicate |
| `dataflow.carry`     | loop-carried dependency     |
| `dataflow.invariant` | loop-invariant broadcast    |
| `dataflow.gate`      | stream alignment            |

## Fabric Dialect

| Op                   | Role                         |
|----------------------|------------------------------|
| `fabric.pe`          | compute / const / load-store |
| `fabric.temporal_pe` | time-multiplexed PE          |
| `fabric.switch`      | static / temporal routing    |
| `fabric.memory`      | storage + LSQ                |
| `fabric.fifo`        | buffering (bypassable)       |

## Structure vs. Configuration

HW params (fixed at build) | Runtime config via config\_mem (tags, predicates, routes)

Two MLIR dialects capture both software dataflow and hardware fabric structure

# Backup: ADGBuilder API

## Programmatic Hardware Construction in C++

```cpp
// Describe the fabric once in C++, then export it through both backends.
ADGBuilder builder("my_cgra");
auto pe  = builder.newPE("alu").setLatency(1).addOp(...);         // spatial PE
auto tpe = builder.newTemporalPE("shared").setNumInstruction(8);  // time-multiplexed PE
auto sw  = builder.newSwitch("router").setPortCount(4, 4);        // switch template
builder.buildMesh(4, 4, pe, sw, Topology::Mesh);                  // 4×4 mesh of PEs and switches
builder.exportSV("output/sv");     // SystemVerilog
builder.exportSysC("output/sysc"); // SystemC
```

### Features

Single-source C++ API

Mesh/Torus/Diagonal/Custom

Clone-with-deduplication

### Multi-Backend

Fabric MLIR → backends

SystemC: fast iteration

SystemVerilog: synthesis

### Pipeline

C++ API

Fabric MLIR

SystemC

SystemVerilog

Single C++ API → any topology → dual-backend export

# Backup: Constraints C1-C6

| ID | Class         | Requirement                                    |
|----|---------------|------------------------------------------------|
| C1 | **Node**      | SW op semantics match HW tile capability       |
| C2 | **Port/Type** | Category match, value type, tag-width          |
| C3 | **Route**     | Physical connectivity + directionality         |
| C4 | **Capacity**  | Fan-in/out, temporal slots, memory queues      |
| C5 | **Temporal**  | Tag uniqueness, slot validity, register bounds |
| C6 | **Config**    | Emitted config matches Fabric op spec          |

All 6 are hard constraints — never traded for cost optimization