



Functional  
programming

FPGA

CλaSH

Problem  
definition and  
breakdown

CλaSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

# Numerical mathematics on FPGAs using CλaSH

## From Haskell to a hardware accelerator

Martijn Bakker

Computer Architecture for Embedded Systems (CAES)  
University of Twente

July 1, 2015



# Overview

Functional  
programming

FPGA

CλaSH

Problem  
definition and  
breakdown

CλaSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

- ① Functional programming
- ② FPGA
- ③ CλaSH
- ④ Problem definition and breakdown
- ⑤ CλaSH project overview
- ⑥ Results
- ⑦ Performance
- ⑧ Demo
- ⑨ Discussion
- ⑩ Conclusion



# Properties

Functional  
programming

FPGA

C $\lambda$ aSH

Problem  
definition and  
breakdown

C $\lambda$ aSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

- Main building block: functions
- No assignments, only *unchangeable* definitions
- No statements, only expressions
- ‘Variables’ that cannot vary
- The execution of a program is a function evaluation



# Resulting features

Functional  
programming

FPGA

CλaSH

Problem  
definition and  
breakdown

CλaSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

- No global, mutable state
- No side effects
- Pure functions
- Lazy evaluation
- Higher-order functions
- Strong type system
- Partial function application
- Function composition
- Clear structure of the program, similar to mathematics



# Example of functional programming

Functional  
programming

FPGA

CλaSH

Problem  
definition and  
breakdown

CλaSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

Listing 1: A very short introduction to functional programming

```
1 fib :: Integral a => a -> a
2 fib n | n == 0    = 1
3     | n == 1    = 1
4     | otherwise = fib (n-2) + fib (n-1)
5
6 fib_list :: Integral a => a -> [a]
7 fib_list n = map fib [0..n]
8
9 choose_list :: Integral a => (a -> a) -> [a]
10 choose_list function = map function [0..]
11
12 sum_list :: Integral a => [a] -> a
13 sum_list list = foldl (+) 0 list
```



# The Field-Programmable Gate Array

Functional  
programming

FPGA

CλaSH

Problem  
definition and  
breakdown

CλaSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

- 'Programmable hardware'
- Create your own circuit instead of a list of instructions



Figure: FPGA structure



# Why would you want to use an FPGA?

Functional  
programming

FPGA

C $\lambda$ aSH

Problem  
definition and  
breakdown

C $\lambda$ aSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

## Advantages over CPUs

- High throughput
- Low *guaranteed* latency
- Low power use
- Support for new "instructions"

## Advantages over ASICs

- Reconfigurability



# Drawbacks

Functional  
programming

FPGA

CλaSH

Problem  
definition and  
breakdown

CλaSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

Listing 2: A single AND-gate specified in VHDL

```
1 LIBRARY IEEE;
2 USE IEEE.std_logic_1164.all ;
3
4 ENTITY andgate IS
5 PORT (
6   input_0  : in  std_logic ;
7   input_1  : in  std_logic ;
8   output_0 : out std_logic
9 );
10 END ENTITY andgate;
11
12 ARCHITECTURE a OF andgate IS
13 BEGIN
14   output_0 <= input_0 and input_1;
15 END ARCHITECTURE a;
```



# Drawbacks

Functional  
programming

FPGA

C $\lambda$ aSH

Problem  
definition and  
breakdown

C $\lambda$ aSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

## ① FPGA development is **hard**

- Vendor-specific, closed-source tools
- Small communities
- Ancient HDLs
- Synthesis, debugging and verification is a slow process

## ② FPGAs cannot be reconfigured as quickly as CPUs

## ③ FPGAs run at lower clock speeds than ASICs



# CλaSH - What?

Functional  
programming

FPGA

CλaSH

Problem  
definition and  
breakdown

CλaSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

- CAES Language for Synchronous Hardware
- A library for the specification of hardware in Haskell.
- Includes a compiler: generation of VHDL and Verilog
- Written by Christiaan Baaij at CAES
- [www.clash-lang.org](http://www.clash-lang.org)



# CλaSH - Why?

Functional  
programming

FPGA

CλaSH

Problem  
definition and  
breakdown

CλaSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

## Similarities between hardware and functional programming

- Hardware consists of largely combinatorial parts: pure functions
- Combinatorial circuits contain a dependency tree: function dependencies
- Higher-order functions



# CλaSH - Apparent issues?

Functional  
programming

FPGA

CλaSH

Problem  
definition and  
breakdown

CλaSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

- Hardware has a mutable state
- Functional programming has no mutable states

Solution: the Mealy machine.



# CλaSH - Apparent issues?

Functional  
programming

FPGA

CλaSH

Problem  
definition and  
breakdown

CλaSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

- Hardware has a mutable state
- Functional programming has no mutable states

Solution: the Mealy machine.



# The Mealy machine

Functional  
programming

FPGA

C $\lambda$ aSH

Problem  
definition and  
breakdown

C $\lambda$ aSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff



Figure: Block diagram describing a Mealy machine. The loop gets executed once every clock cycle.



# Example

Functional  
programming

FPGA

CλaSH

Problem  
definition and  
breakdown

CλaSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

Listing 3: A basic multiply-accumulate circuit specification in CλaSH

```
1 module MAC where
2   import CLaSH.Prelude
3
4   topEntity :: Signal (Signed 9, Signed 9) -> Signal (Signed 9)
5   topEntity = mealy mac 0
6
7   mac :: Num t => t -> (t, t) -> (t, t)
8   mac state input = (state', output)
9     where
10       (x,y)    = input          -- unpack the two inputs
11       state'  = state + x*y   -- the new state
12       output   = state'        -- output the new state
```



# Wrap-up of introduction

Functional programming

FPGA

CλaSH

Problem definition and breakdown

CλaSH project overview

Results

Performance

Demo

Discussion

Conclusion

Additional stuff

Imperative languages:

- C / C++
- Matlab

Consist of:

- Assignments and statements
- $i = i + i$

compile



Sequence of instructions

execute (CPU)



Result

Functional languages:

- Haskell (CλaSH)

Consist of:

- Function definitions
- $\text{fac } 0 = 1$
- $\text{fac } n = n * \text{fac } (n-1)$

compile (CλaSH)



Hardware Description Language

synthesis  
deploy (FPGA)  
run



Result



# Definition

Functional  
programming

FPGA

CλaSH

Problem  
definition and  
breakdown

CλaSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

The **feasibility** and **performance** of performing **numerical approximations** to **ODEs** on an FPGA with hardware specified in CλaSH.

- ① **Feasibility:** does the most simple version work?
- ② **Performance:** comparison with a desktop CPU
- ③ **Numerical approximations:** simple integration schemes (Euler, RK2, RK4)
- ④ **ODEs:** 4 coupled first order differential equations with constant coefficients (4x4 matrix)
- ⑤ **FPGA:** a Terasic SoCKit development board, containing an Altera Cyclone V SoC FPGA



# Breakdown - requirements for the system

Functional  
programming

FPGA

C $\lambda$ aSH

Problem  
definition and  
breakdown

C $\lambda$ aSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

- Number representation
  - Number storage
  - Control protocol
  - Supplying a suitable clock frequency
- 
- ① Programming
  - ② Getting matrix constants and initial values in
  - ③ Controlling and monitoring the state
  - ④ Performing the actual computation
  - ⑤ Getting results out of the FPGA (go to 3)



# Number representation

Functional  
programming

FPGA

CλaSH

Problem  
definition and  
breakdown

CλaSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

- Floating point or fixed point?
- CλaSH only supports fixed point
  - Signed fixed point with 8 integer bits and 24 fractional bits
  - Total width: 32 bits
  - Integer range: [-128 .. 127]
  - ULP<sup>1</sup>:  $2^{-24} \approx 6 \times 10^{-8}$ , roughly 7 decimal digits.

---

<sup>1</sup>Unit of least precision



# Number representation

Functional  
programming

FPGA

CλaSH

Problem  
definition and  
breakdown

CλaSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

- Floating point or fixed point?
- CλaSH only supports fixed point
  - Signed fixed point with 8 integer bits and 24 fractional bits
  - Total width: 32 bits
  - Integer range: [-128 .. 127]
  - ULP<sup>1</sup>:  $2^{-24} \approx 6 \times 10^{-8}$ , roughly 7 decimal digits.

---

<sup>1</sup>Unit of least precision



# Number representation

Functional  
programming

FPGA

CλaSH

Problem  
definition and  
breakdown

CλaSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

- Floating point or fixed point?
- CλaSH only supports fixed point
- Signed fixed point with 8 integer bits and 24 fractional bits
- Total width: 32 bits
- Integer range: [-128 .. 127]
- ULP<sup>1</sup>:  $2^{-24} \approx 6 \times 10^{-8}$ , roughly 7 decimal digits.

---

<sup>1</sup>Unit of least precision



# Number storage

Functional  
programming

FPGA

C $\lambda$ aSH

Problem  
definition and  
breakdown

C $\lambda$ aSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

- Registers<sup>2</sup>, SRAM<sup>3</sup> or SDRAM<sup>4</sup>?
- All variables get updated every single cycle.
- All data fits in the registers.
- No need for SRAM or SDRAM.

---

<sup>2</sup>Directly accessible on-chip: latches

<sup>3</sup>Static RAM, on-chip, single cycle access delays

<sup>4</sup>Synchronous Dynamic RAM, off-chip, 'high' latency



# Number storage

Functional  
programming

FPGA

C $\lambda$ aSH

Problem  
definition and  
breakdown

C $\lambda$ aSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

- Registers<sup>2</sup>, SRAM<sup>3</sup> or SDRAM<sup>4</sup>?
- All variables get updated every single cycle.
- All data fits in the registers.
- No need for SRAM or SDRAM.

---

<sup>2</sup>Directly accessible on-chip: latches

<sup>3</sup>Static RAM, on-chip, single cycle access delays

<sup>4</sup>Synchronous Dynamic RAM, off-chip, 'high' latency



# Supplying a suitable clock frequency

Functional  
programming

FPGA

CλaSH

Problem  
definition and  
breakdown

CλaSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

- A higher\* clock frequency means higher throughput
  - \*up to a certain limit, otherwise your system fails.
  - Output stabilization of the combinatorial circuits takes time.
- FPGA contains a crystal (50 MHz) and PLL<sup>5</sup> circuits.
- CλaSH + IO + Altera PLL = non-deterministic behaviour.
- Solution: a simple integer frequency divider.

---

<sup>5</sup>Phase-Locked Loop



# Supplying a suitable clock frequency

Functional  
programming

FPGA

CλaSH

Problem  
definition and  
breakdown

CλaSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

- A higher\* clock frequency means higher throughput
- \*up to a certain limit, otherwise your system fails.
- Output stabilization of the combinatorial circuits takes time.
- FPGA contains a crystal (50 MHz) and PLL<sup>5</sup> circuits.
- CλaSH + IO + Altera PLL = non-deterministic behaviour.
- Solution: a simple integer frequency divider.

---

<sup>5</sup>Phase-Locked Loop



# Supplying a suitable clock frequency

Functional  
programming

FPGA

C $\lambda$ aSH

Problem  
definition and  
breakdown

C $\lambda$ aSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

- A higher\* clock frequency means higher throughput
- \*up to a certain limit, otherwise your system fails.
- Output stabilization of the combinatorial circuits takes time.
- FPGA contains a crystal (50 MHz) and PLL<sup>5</sup> circuits.
- C $\lambda$ aSH + IO + Altera PLL = non-deterministic behaviour.
- Solution: a simple integer frequency divider.

---

<sup>5</sup>Phase-Locked Loop



# Actual computation - External Types

Functional  
programming

FPGA

CλaSH

Problem  
definition and  
breakdown

CλaSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

Listing 4: CλaSH topEntity for the ODE solver

```
1 topEntity :: Signal InputSignals -> Signal OutputSignals
2 topEntity = mealy solveODE initialState
3
4 solveODE state input = (state', output)
5   where
6     state  = (systemState,systemConstants,oul,block)
7     state' = (systemState',systemConstants',oul',block')
```



# Actual computation - Internal Types

Functional  
programming

FPGA

CλaSH

Problem  
definition and  
breakdown

CλaSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

Listing 5: Internal types for the ODE solver

```
1 type Data = SFixed 8 24
2 type UInt = Unsigned 32
3 type ValueVector = Vec 5 Data
4 type ConstantVector = Vec 20 Data
5
6 data ODEState = ODEState { valueVector :: ValueVector
7                               , time :: Data
8                               } deriving (Show)
9
10 type Equation = (ODEState, ConstantVector) -> ValueVector
11 type Scheme = SystemConstants -> Equation -> ODEState -> ODEState
```



# Actual computation - Equation

Functional  
programming

FPGA

CλaSH

Problem  
definition and  
breakdown

CλaSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

$$\mathbf{x}' = \begin{bmatrix} c_1 & c_2 \\ c_3 & c_4 \end{bmatrix} \mathbf{x} \quad (1)$$

Listing 6: Computing the derivative

```
1 matrix2d :: Equation
2 matrix2d (odestate, constants) = dxs
3   where
4     xs = valueVector odestate
5
6     c1 = constants !! 4
7     c2 = constants !! 5
8     c3 = constants !! 6
9     c4 = constants !! 7
10
11    x0 = c1 * (xs !! 0) + c2 * (xs !! 1)
12    x1 = c3 * (xs !! 0) + c4 * (xs !! 1)
13
14    dxs = fst $ shiftInAt0 xs (x0 :> x1 :> Nil)
```



# Actual computation - Scheme

$$\mathbf{x}_{k+1} = \begin{cases} \mathbf{x}_k + h\mathbf{x}'_k & \text{if } t_k < t_{max} \\ \mathbf{x}_k & \text{otherwise} \end{cases} \quad (2)$$

Listing 7: Eulers method

```
1 euler :: Scheme
2 euler constants equation state = state'
3 where
4   c_user = userconstants constants
5   c_maxtime = maxtime constants
6   h = timestep constants
7   t = time state
8   xs = valueVector state
9
10 --Apply Euler's integration scheme
11 eulerxs = zipWith (+) xs $ map (*h) (equation (state, c_user))
12 (xs',t') = if t < c_maxtime then (eulerxs, t + h)
13                                else (xs,t) -- already at maximum time
14
15 state' = ODEState {valueVector = xs', time = t'}
```



# Actual computation - Control

Functional  
programming

FPGA

CλaSH

Problem  
definition and  
breakdown

CλaSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

Listing 8: Controlling the ODE solver (important lines start with '!')

```
1 ! scheme = euler
2 ! equation = matrix4d
3
4 (systemState', oul ', block')
5   --Handle the setup (reset the state, insert input values, start the computation)
6   | i_c == 2 && i_cs == 1 && i_ca == 0 = ( initialSystemState, 0, 0)
7   | i_is == 1                               = ( systemState{ odestate = s_odestate.in' }, 0, 1)
8   | i_c == 1 && i_cs == 1 && i_ca == 0 = ( systemState{ step = 0 } , 0, 0)
9
10  --Handle the computation and output:
11 ! | block == 1 && i_os == 1           = ( systemState, pack (xs !! i_oa), block)
12 ! | block == 0 && s_step < c_maxstep = ( systemState.up' , 0, block)
13 ! | block == 0 && s_step >= c_maxstep = ( systemState{ step = UIntMax}, pack UIntMax, 1)
14
15  --Default, do nothing
16 | otherwise                           = ( systemState, oul, block)
17 where
18   s_odestate.in' = s_odestate {valueVector = replace i_ia (unpack i_i :: Data) xs}
19
20 ! s_odestate.up   = scheme systemConstants equation s_odestate
21   valueVector_wt = replace 4 (time s_odestate.up) (valueVector s_odestate.up)
22   s_odestate.up' = s_odestate.up {valueVector = valueVector_wt }
23   s_step'        = s_step + 1
24
25 systemState.up' = systemState{ odestate = s_odestate.up', step = s_step'}
```



# An overview of the entire system

Functional programming  
FPGA  
CλaSH  
Problem definition and breakdown  
CλaSH project overview  
Results  
Performance  
Demo  
Discussion  
Conclusion  
Additional stuff





# Results

Functional  
programming

FPGA

CλaSH

Problem  
definition and  
breakdown

CλaSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

- Each solution plot contains 3 curves
  - ① FPGA-approximation
  - ② MATLAB-approximation (same functionality as FPGA)
  - ③ Analytical or obtained from `ode45()`, the real solution
- Each error plot contains at least 2 curves
  - ① FPGA - real
  - ② FPGA - MATLAB approximation
- MATLAB uses IEEE 754 double precision floating point
- FPGA uses 8/24 signed fixed point.



# Results

Functional  
programming

FPGA

C $\lambda$ aSH

Problem  
definition and  
breakdown

C $\lambda$ aSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

- Each solution plot contains 3 curves
  - ① FPGA-approximation
  - ② MATLAB-approximation (same functionality as FPGA)
  - ③ Analytical or obtained from `ode45()`, the real solution
- Each error plot contains at least 2 curves
  - ① FPGA - real
  - ② FPGA - MATLAB approximation
- MATLAB uses IEEE 754 double precision floating point
- FPGA uses 8/24 signed fixed point.



# Simple oscillation - I

$$x(t) = 50 \cos(t)$$

Functional programming

FPGA

CλaSH

Problem definition and breakdown

CλaSH project overview

Results

Performance

Demo

Discussion

Conclusion

Additional stuff



Figure: The final error is  $\approx 30$  ( $h = 0.01$ )



# Simple oscillation - II

$$x(t) = 50 \cos(t)$$

- Functional programming
- FPGA
- C $\lambda$ aSH
- Problem definition and breakdown
- C $\lambda$ aSH project overview
- Results
- Performance
- Demo
- Discussion
- Conclusion
- Additional stuff



Figure: The final error decreased by a factor 150 ( $h = 1 \times 10^{-4}$ )



# Simple oscillation - III

$$x(t) = 50 \cos(t)$$

Functional  
programming

FPGA

C $\lambda$ aSH

Problem  
definition and  
breakdown

C $\lambda$ aSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff



Figure: Breakdown of the number representation ( $h = 1 \times 10^{-7}$ )

# Switching equations

Functional  
programming

FPGA

CλaSH

Problem  
definition and  
breakdown

CλaSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

- Eulers method (first order) to RK2 (second order)
- Simple oscillations to 4 coupled equations
- Matrix chosen such that all eigenvalues negative

$$\mathbf{x}' = \begin{bmatrix} 2 & 3 & 2 & 0 \\ -5 & -5 & -3 & 1 \\ 3 & -1 & -2 & -3 \\ 4 & 2 & 2 & -3 \end{bmatrix} \mathbf{x} \quad \text{with} \quad \mathbf{x}(t_0) = \begin{bmatrix} 7 \\ 5 \\ 7 \\ 5 \end{bmatrix} \quad (3)$$



# RK2 - I

Functional programming  
FPGA  
CλaSH  
Problem definition and breakdown  
CλaSH project overview  
Results  
Performance  
Demo  
Discussion  
Conclusion  
Additional stuff



Figure: A high value of  $h$ , maximum error  $\approx 0.3$  ( $h = 0.01$ )



# RK2 - II

Functional programming  
FPGA  
CλaSH  
Problem definition and breakdown  
CλaSH project overview  
Results  
Performance  
Demo  
Discussion  
Conclusion  
Additional stuff



Figure: A very low value of  $h$ , maximum error  $\approx 0.2$  ( $h = 5 \times 10^{-7}$ )



# RK4

Functional programming  
FPGA  
CλaSH  
Problem definition and breakdown  
CλaSH project overview  
Results  
Performance  
Demo  
Discussion  
Conclusion  
Additional stuff



Figure: For all tested values of  $h$ , ( $h = 1 \times 10^{-5}$ )



# Euler revisited

Functional programming  
FPGA  
CλaSH  
Problem definition and breakdown  
CλaSH project overview  
Results  
Performance  
Demo  
Discussion  
Conclusion  
Additional stuff



Figure: Euler outperforming both RK2 and RK4, ( $h = 1 \times 10^{-4}$ )



# Results

Functional  
programming

FPGA

CλaSH

Problem  
definition and  
breakdown

CλaSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

Table: Eulers method, sorted by iterations per second.

| Device                | Iterations      | time (s) | Output interrupts | Iterations per second ( $\times 10^6$ ) |
|-----------------------|-----------------|----------|-------------------|-----------------------------------------|
| CPU - C++ - int       | $1 \times 10^8$ | 1,35     | 1                 | 74,1                                    |
| FPGA                  | $1 \times 10^8$ | 2,18     | 1                 | 45,8                                    |
| FPGA                  | $1 \times 10^8$ | 2,25     | $1 \times 10^3$   | 44,4                                    |
| FPGA                  | $1 \times 10^8$ | 2,30     | $1 \times 10^4$   | 43,4                                    |
| FPGA                  | $1 \times 10^8$ | 3,27     | $1 \times 10^5$   | 30,6                                    |
| FPGA                  | $1 \times 10^7$ | 0,39     | 1                 | 25,6                                    |
| FPGA                  | $1 \times 10^6$ | 0,21     | 1                 | 4,76                                    |
| CPU - C++ - float     | $5 \times 10^7$ | 58,1     | 1                 | 0,86                                    |
| FPGA                  | $1 \times 10^5$ | 0,19     | 1                 | 0,53                                    |
| CPU - Haskell - float | $1 \times 10^6$ | 3,23     | 1                 | 0,31                                    |
| CPU - MATLAB - float  | $1 \times 10^7$ | 304      | 1                 | 0,03                                    |



# Notes

Functional  
programming

FPGA

C $\lambda$ aSH

Problem  
definition and  
breakdown

C $\lambda$ aSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

- FPGA for "low-power, cost-sensitive design needs"
- No automatic derivation of the optimal clock frequency
- 4x4 matrix equation only used  $\sim 10\%$  of the FPGA space
- Power usage is  $\sim 2$  orders of magnitude lower than the CPU
- CPUs are heavily optimized for fast integer arithmetic



# The entire process takes ~10 minutes

Functional  
programming

FPGA

CλaSH

Problem  
definition and  
breakdown

CλaSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

- ① CλaSH - Windows/Linux
  - ② Synthesis (Quartus) - Windows
  - ③ Driver compilation - (arm-linux-gnueabihf-g++) - Linux
  - ④ Run and extract results - Windows/Linux
  - ⑤ Verification of results (MATLAB) - Windows
- 
- Normal work flow - Use two machines
  - Linux for compiling and deploying the driver
  - Windows for the other steps



# Discussion

Functional  
programming

FPGA

CλaSH

Problem  
definition and  
breakdown

CλaSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

- Accuracy
  - Equality between the FPGA and MATLAB implementations
  - Limitations of the number representation and CλaSH
- Functionality
  - 4x4 matrix with constant coefficients
  - Variety of basic solver schemes
  - Limitations of the number representation and CλaSH
- Performance
  - Sub-par development FPGA is close to a modern CPU core
  - Trade-off between power usage and performance



# What has been created?

Functional  
programming

FPGA

C $\lambda$ aSH

Problem  
definition and  
breakdown

C $\lambda$ aSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

- Single-command toolchain integration for deploying C $\lambda$ aSH on an FPGA
- IO system of sufficiently high performance
  - Configuration of the Avalon bridges
  - C++ library for easy FPGA accessibility



# What has been shown?

Functional  
programming

FPGA

C $\lambda$ aSH

Problem  
definition and  
breakdown

C $\lambda$ aSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

- Hardware acceleration of numerical solvers is feasible
- Number representations are important
- C $\lambda$ aSH is usable for projects with complex IO by use as a module
- FPGAs are fast: subpar FPGA close to a modern CPU core



# Acknowledgements

Functional  
programming

FPGA

CλaSH

Problem  
definition and  
breakdown

CλaSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff

- Jan Kuper - Introducing me to functional programming, CλaSH and giving feedback
- Christiaan Baaij - Creating CλaSH and answering related questions
- Ruud van Damme & Jan Broenink - Feedback
- Rinse Wester - Input on configuring and using the Avalon bridges



# Frequency too high? Random number generator.

- Functional programming
- FPGA
- C $\lambda$ aSH
- Problem definition and breakdown
- C $\lambda$ aSH project overview
- Results
- Performance
- Demo
- Discussion
- Conclusion
- Additional stuff



Figure: The main source of failure after overclocking.



# Nondeterministic behaviour

Functional  
programming

FPGA

CλaSH

Problem  
definition and  
breakdown

CλaSH  
project  
overview

Results

Performance

Demo

Discussion

Conclusion

Additional  
stuff



Figure: Only when using the Altera PLL



# The core of the circuit: the matrix multiplier

Functional programming  
FPGA  
CλaSH  
Problem definition and breakdown  
CλaSH project overview  
Results  
Performance  
Demo  
Discussion  
Conclusion  
Additional stuff



Figure: A  $4 \times 4$  matrix vector multiplication consists of  $4 * (4 + 3) = 28$  operations



# Synthesized hardware

Functional programming

FPGA

CλaSH

Problem definition and breakdown

CλaSH project overview

Results

Performance

Demo

Discussion

Conclusion

Additional stuff



Figure: An overview of the circuit, generated by Quartus