

# *Once bittern, twice shy*

*Revisiting hardware architectures for lazy functional languages with Heron*

---

Craig Ramsay & Rob Stewart

September 2024

Heriot-Watt University





<https://haflang.github.io/history>



GRIP (1987-1996)

...What are those?!





A map of Europe where specific regions are highlighted in orange, purple, and blue. The orange area covers the British Isles and parts of Northern France and Germany. The purple area covers Central Europe, including Poland and the Czech Republic. The blue area covers Southern Europe, including Spain, Portugal, Italy, and Greece. A light gray background represents the rest of the continent.

Reduceron  
(Template inst.)

PilGRIM  
(Programmed)

≈ 2010



≈ 2017

Reduceron  
(Template inst.)

PHEONIX  
(Programmed)

Cephalopode  
(Combinators)

PilGRIM  
(Programmed)

Cecil Accetti  
(Combinators)

≈ 2020





≈ 2023

# *HAFLANG Project*

---

*add*  $x \ y \ z =$

$$x + y + z$$



*add x y z =*

*x + y + z*



*s1ba\_info: ; Code for helper (a b -> a+b)*  
.Lc1bo: ; Check for stack space  
leaq -40(%rbp),%rax  
cmpq %r15,%rax  
jb .Lc1bp ; Jump if stack full  
.Lc1bq: ; Reduce helper  
movq \$stg\_upd\_frame\_info,-16(%rbp)  
movq %rbx,8(%rbp)  
movq 16(%rbx),%rax ; Load a & b from heap  
movq 24(%rbx),%rbx  
movl \$base\_GHCziNum\_zdfNumInt\_closure,%r14d  
;; Push 'a+b' onto stack  
movq \$stg\_ap\_pp\_info,-40(%rbp)  
movq %rax,-32(%rbp)  
movq %rbx,-24(%rbp)  
addq \$-40,%rbp  
jmp base\_GHCziNum\_zp\_info ; Enter  
.Lc1bp: ; Ask RTS for stack space  
jmp \*-16(%r13)  
Add\_add\_info: ; Code for 'add'  
.Lc1br: ; Check for stack space  
leaq -24(%rbp),%rax  
cmpq %r15,%rax  
jb .Lc1bs ; Jump if stack full

.Lc1bt: ; Check for heap space  
addq \$32,%r12  
cmpq 856(%r13),%r12  
ja .Lc1bv ; Jump if heap full  
.Lc1bu: ; Reduce 'add'  
;; Build 'x+y' thunk on heap  
movq \$s1ba\_info,-24(%r12)  
movq %r14,-8(%r12)  
movq %rsi,(%r12)  
leaq -24(%r12),%rax  
movl \$base\_GHCziNum\_zdfNumInt\_closure,%r14d  
;; Push 'thunk+2' to stack  
movq \$stg\_ap\_pp\_info,-24(%rbp)  
movq %rax,-16(%rbp)  
movq %rdi,-8(%rbp)  
addq \$-24,%rbp  
jmp base\_GHCziNum\_zp\_info ; Enter  
.Lc1bv: ; Ask RTS for heap space  
movq \$32,904(%r13)  
.Lc1bs: ; Ask RTS for stack space  
movl \$Add\_add\_closure,%ebx  
jmp \*-8(%r13)

# Heron

noun [C]

/'herən/

A graph reduction processor.

Performs template instantiation in one clock cycle via multiple, wide, multi-ported memories.



<sup>o</sup>Ramsay and Stewart, "Heron: Modern Graph Reduction Hardware".

# Heron

noun [C]

/'herən/

A graph reduction processor.

Performs template instantiation in one clock cycle via multiple, wide, multi-ported memories.



<sup>o</sup>Ramsay and Stewart, "Heron: Modern Graph Reduction Hardware".

## Template example

`enumFrom :: Int -> [Int]`

`enumFrom n = let m = n + 1`

`ns = enumFrom m`

`in Cons n ns`



# Template example

enumFrom :: Int -> [Int]

enumFrom n = let m = n + 1

ns = enumFrom m

in Cons n ns



let APP [ ARG True 0, PRI 2 +, INT 1 ]  
APP [ FUN True 1 0, VAR False 0, ]

in APP [ CON 2 0, ARG True 0, VAR False 1 ]





Heap is on-chip & small  
(think L1/L2 cache)





### Atoms

- = **FUN s n**
- | **CON s n**
- | **VAR s n**
- | **INT n**
- | **PRI a ⊗**
- | **ARG s n**
- | **REG n**



Atoms

- =  $\text{FUN}_n$
- |  $\text{CON}_n$
- |  $\text{VAR}_n$
- |  $\text{INT}_n$
- |  $\text{PRI}_n \otimes$
- |  $\text{ARG}_n$
- |  $\text{REG}_n$



### Atoms

- = **FUN s n**
- | **CON a n**
- | **VAR s n**
- | **INT n**
- | **PRI a ⊗**
- | **ARG s n**
- | **REG n**





### Atoms

- = FUN<sub>s n</sub>
- | CON<sub>a n</sub>
- | VAR<sub>s n</sub>
- | INT<sub>n</sub>
- | PRI<sub>a ⊗</sub>
- | ARG<sub>s n</sub>
- | REG<sub>n</sub>



Postfix prims for  
long spines

$$(fx\ y) + (gz) \\ \Rightarrow f\ x\ y\ g\ z\ +$$



...But what about  
heap updates?



Avoid most updates via  
run-time sharing analysis!

### Atoms

- = FUN<sub>s</sub> n
- | CON<sub>a</sub> n
- | VAR<sub>s</sub> n
- | INT<sub>n</sub>
- | PRI<sub>a</sub>  $\otimes$
- | ARG<sub>s</sub> n
- | REG<sub>n</sub>



Avoid most updates via  
run-time sharing analysis!

### Atoms

- = FUN<sub>s n</sub>
- | CON<sub>a n</sub>
- | VAR<sub>s n</sub>
- | INT<sub>n</sub>
- | PRI<sub>a</sub>  $\otimes$
- | ARG<sub>s n</sub>
- | REG<sub>n</sub>

$\approx$  One-bit

reference counting!





## *Software for Concurrent GC*



## Software for Concurrent GC



## Software for Concurrent GC



Function alloc (app):  
 $\text{heap}[\text{hp}] \leftarrow \text{app}$   
 $\text{hp}++$

## Software for Concurrent GC



Function alloc (app):

$a \leftarrow \text{pop from freelist}$

if allocBarrier(gcPhase, a)

then

  tag a as Marked

else

  tag a as Unmarked

$\text{heap}[a] \leftarrow \text{app}$

## Software for Concurrent GC

1990

Real-Time Garbage Collection on General-Purpose Machines

Taiichi Yuasa  
Research Institute for Mathematical Sciences, Kyoto University, Kyoto, Japan

2020

Alligator Collector: A Latency-Optimized Garbage Collector for Functional Programming Languages

Ben Gamari      Laura Dietz

"The *nofib* cases are quite mixed [...] most tests slow down, with a median of +21%"

## Software for Concurrent GC

1990

Real-Time Garbage Collection on General-Purpose Machines

Taiichi Yuasa

Research Institute for Mathematical Sciences, Kyoto University, Kyoto, Japan

2020

Alligator Collector: A Latency-Optimized Garbage Collector for Functional Programming Languages

Ben Gamari

Laura Dietz

Key Observation:

Stock CPUs sequentialise write-barriers  
(trades-off GC latency for throughput)

Custom hardware + read-first memories grants us both

# Cloaca

noun [C]

/kloh-ah-kuh/

A concurrent hardware  
garbage collector for Heron



<sup>o</sup>Ramsay and Stewart, "Cloaca: A Concurrent Hardware Garbage Collector for Non-strict Functional Languages".

# Cloaca

noun [C]

/kloh-ah-kuh/

A concurrent hardware  
garbage collector for Heron



<sup>o</sup>Ramsay and Stewart, "Cloaca: A Concurrent Hardware Garbage Collector for Non-strict Functional Languages".

## Reduction Core



## Memory Management



## Reduction Core



## Memory Management



## Reduction Core



## Memory Management



## *Results*

---





# Which Architectures?

|          |                            |                                     |             |
|----------|----------------------------|-------------------------------------|-------------|
| Platform | Heron<br>Xilinx Alveo U280 | GHC + Intel i7 1250U<br>Performance | Power-saver |
|----------|----------------------------|-------------------------------------|-------------|

# Which Architectures?

|          |                            |                                     |             |
|----------|----------------------------|-------------------------------------|-------------|
| Platform | Heron<br>Xilinx Alveo U280 | GHC + Intel i7 1250U<br>Performance | Power-saver |
| Clock    | 185 MHz                    | 4.7 GHz                             | ≈ 2 GHz     |

# Which Architectures?

|            | Heron                                       | GHC + Intel i7 1250U          |                            |
|------------|---------------------------------------------|-------------------------------|----------------------------|
| Platform   | Xilinx Alveo U280                           | Performance                   | Power-saver                |
| Clock      | 185 MHz                                     | 4.7 GHz                       | ≈ 2 GHz                    |
| Power est. | 0.8W dynamic +<br>3.1 W Static <sup>1</sup> | Cores 15 W or<br>Package 16 W | Cores 2W or<br>Package 6 W |

<sup>1</sup> Heron only occupies 1.13% of any resource type though!

# Which Architectures?

|             | Heron                                       | GHC + Intel i7 1250U          |                            |
|-------------|---------------------------------------------|-------------------------------|----------------------------|
| Platform    | Xilinx Alveo U280                           | Performance                   | Power-saver                |
| Clock       | 185 MHz                                     | 4.7 GHz                       | ≈ 2 GHz                    |
| Power est.  | 0.8W dynamic +<br>3.1 W Static <sup>1</sup> | Cores 15 W or<br>Package 16 W | Cores 2W or<br>Package 6 W |
| Fabrication | 16 nm (FPGA!)                               | 10 nm                         | 10 nm                      |

<sup>1</sup> Heron only occupies 1.13% of any resource type though!

## Wall-clock times vs GHC -O2



## Wall-clock times vs GHC -OO



## Conclusion

A tiny template instantiation core nearly keeps up with a modern CPU  
running at x10 speed!

## Conclusion

A tiny template instantiation core nearly keeps up with a modern CPU running at  $\times 10$  speed!

Assisted by a fully concurrent GC (modulo  $\approx 20$  cycles per pass).

## Conclusion

A tiny template instantiation core nearly keeps up with a modern CPU running at  $\times 10$  speed!

Assisted by a fully concurrent GC (modulo  $\approx 20$  cycles per pass).

Lays a path towards a single-chip multi-core architecture.

## Conclusion

A tiny template instantiation core nearly keeps up with a modern CPU running at  $\times 10$  speed!

Assisted by a fully concurrent GC (modulo  $\approx 20$  cycles per pass).

Lays a path towards a single-chip multi-core architecture.

We want to see more research towards custom FP architectures!

Questions?

---