



# Synthesizing Faster Circuits using Redundant Representations

Sam Coward - Research Fellow, University College London

Collaborators - Cambridge (Grosser Lab), SiFive

# Introduction

IMPERIAL



ROVER



intel

Numerical HW  
Group (April 2025)



Number Representation:



8-bit: E5M2, E4M3

4-bit: E3M0, E2M1

# Current Projects

## E-Graphs as Automata



## Equality Saturation & MLIR



## Parametric Bitvector Proofs

$$\forall w_x, w_y \ M_1(x, y) \cong M_2(x, y)$$

# Time to talk about circuits...

# Addition Is Slow...

$$\begin{array}{r} \textcircled{1} & \textcircled{1} & \textcircled{1} \\ 1 & 1 & 0 & 1 \\ + & 1 & 1 & 1 \\ \hline 1 & 0 & 1 & 0 & 0 \\ \hline \end{array}$$

$$\begin{array}{rcl} 1010 & = & 10 \quad (\text{a}) \\ 1011 & = & 11 \quad (\text{b}) \\ + 0011 & = & 3 \quad (\text{c}) \\ \hline 0010 & = & 2 \\ \hline 1011x & = & 22 \\ \hline 11000 & = & 24 \end{array}$$

Reduced 3 values to 2, but  
we've only looked locally -  
much faster than adding (a+b)

# Efficient 4-bit Integer Dot Product Circuit



# Efficient Integer Dot Product Circuit



4-bit:

- 5% faster
- 4% smaller

Meh...

16-bit:

- 25% faster
- 5% smaller

Say more...

# Key Takeaway

A **hardware intermediate representation** should express **efficient circuit constructs** beyond typical arithmetic operators.

Language Design Target:

- Composable
- Verifiable

# Datapath Intermediate Representation

# MLIR & CIRCT



OpenXLA



Mojo



TensorFlow

Tools & Languages

Dialects

Passes

tensor

convert

arith

analyze

vector

canon.



Multi-Level Intermediate  
Representation



Tools & Languages

Dialects

Passes

comb

convert

HW

analyze

Seq

canon.



Circuit IR Compilers and Tools



# Datapath Dialect: $(a \times b) + (c \times d)$

```
%ab:4 = datapath.partial_product %a, %b : i8
```



# Datapath Dialect: $(a \times b) + (c \times d)$

```
%ab:4 = datapath.partial_product %a, %b : i8  
%cd:4 = datapath.partial_product %c, %d : i8
```



# Datapath Dialect: $(a \times b) + (c \times d)$

```
%ab:4 = datapath.partial_product %a, %b : i8  
%cd:4 = datapath.partial_product %c, %d : i8  
  
// Compress to carry-save form  
%e:2 = datapath.compress %ab#0, ..., %ab#3,  
       %cd#0, ..., %cd#3  
       : i8 [8 -> 2]  
  
// (carry + save) = a*b + c*d  
%result = comb.add %e#0, %e#1 : i8
```



# Verified via Contracts

Implementation not specified - Booth/AND/...

```
%ab:N = datapath.partial_product %a, %b  
// Verification Contract  
assert(%ab#0 + ... + %ab#<N-1> == %a x %b)
```

Functionality constrained

# Verified via Contracts

```
%ab:N = datapath.partial_product %a, %b  
// Verification Contract  
assert(%ab#0 + ... + %ab#<N-1> == %a x %b)
```

```
%z:2 = datapath.compress %y#0, ..., %y#<N-1>  
// Verification Contract  
assert(%z[0] + %z[1] = %y#0 + ... + %y#<N-1>)
```

Software:  
Undef? Poison?

Can you zero-extend y's?

# Composability



```
%ab:4 = datapath.partial_product %a, %b : i8  
%cd:4 = datapath.partial_product %c, %d : i8
```

# Automation

# Always Profitable



# Delay Reducing, Area Increasing



# Automated Datapath Synthesis



# Comparison vs Yosys (ASAP 7nm)



# Comparison vs Hand-Optimized (PULP)



Ours / Manual



(a) 16-bit inputs

Ours / Manual



(b) 32-bit inputs

Timing-Aware Compressor Trees

Booth Encoding

# Verification Challenges



# Datapath Benchmarks



- Parameterized System Verilog modules
- Automated logic synthesis competition
- Generates equivalence checking benchmarks

Please contribute your own benchmarks via PRs?

| Category                | Metric         | circt_d8f76781813 | circt_d8f76781813+reduce_datapath_branch |
|-------------------------|----------------|-------------------|------------------------------------------|
| Overall (61 benchmarks) | Gates          | 204.1             | 203.8 (-0.2%)                            |
|                         | Depth          | 13.0              | 13.3 (+1.8%)                             |
|                         | Area (ASAP7)   | 19.1              | 18.8 (-1.8%)                             |
|                         | Delay (ASAP7)  | 157.4             | 158.5 (+0.7%)                            |
|                         | Area (Sky130)  | 1525.3            | 1519.3 (-0.4%)                           |
|                         | Delay (Sky130) | 647.5             | 650.1 (+0.4%)                            |

# Conclusion & Next Steps

- Upstream and open-source in CIRCT
- Datapath design benchmarks - share your designs?
- Solver improvements desperately needed...
- Signed arithmetic & pipelining to come



CIRCT



Docs



Benchmarks

# BACKUP

# BACKUP - Multiplier



# Back-up

