

# ASCELLA: Accelerating Sparse Computation by Enabling Stream Accesses to Memory

Bahar Asgari, Ramyad Hadidi, Hyesoon Kim





# Sparse matrices are everywhere!

Sparse matrix vector multiplication (SpMV) is a main operation in:

**Neural Networks**



The weight matrix  
is sparse

**Graph Analytics**



The adjacency matrix  
is sparse

**Differential Equations**



The coefficient matrix  
is sparse



# An ideal hardware accelerator for SpMV

3

- ▶ SpMV can be accelerated:
  - ▶ By stream data from memory
  - ▶ By using a parallel dot product engine
- ▶ Ideally, we want compute time and data-transfer time for blocks of data (e.g., A, B, C, and D) to be equal:



\*a block is the unit of streaming data from memory



# Decompression can cause a bottleneck

- ▶ Sparse matrices are often stored and transferred in compressed forms
  - ▶ Example: compressed sparse row (CSR) and blocked CSR (BCSR)
- ▶ The compressed data must be first decompressed
- ▶ Decompression is slow and causes bottleneck





# Why decompression is slow?

CSR and BCSR use three vectors to represent a sparse matrix

- ▶ Row indices (offsets)
- ▶ Column indices
- ▶ Values

To decompress a non-zero row, we need to

- ▶ First, read one element of row indices
- ▶ Then, read column indices and values as required



# Decompression from CSR format

6



| <b>BRAM Accesses Timeline:</b> |   |   |    |    |    |    |
|--------------------------------|---|---|----|----|----|----|
| cycle: 0                       | 1 | 2 | 3  | 4  | 5  | 6  |
| <b>R</b> B0 → 2                |   |   |    |    |    |    |
|                                | 8 | 9 | 10 | 11 | 12 | 13 |



# Decompression from CSR format

7



**BRAM Accesses Timeline:**

**R** Read Operation    **C** Compute Operation

| cycle: 0                    | 1              | 2 | 3  | 4  | 5  | 6  |
|-----------------------------|----------------|---|----|----|----|----|
| <b>R</b> B0 $\rightarrow$ 2 | <b>C</b> 2-0=2 |   |    |    |    |    |
| cycle: 7                    | 8              | 9 | 10 | 11 | 12 | 13 |
|                             |                |   |    |    |    |    |



# Decompression from CSR format

8



**BRAM Accesses Timeline:**

**R** Read Operation    **C** Compute Operation

| cycle: 0                     | 1                | 2                                                            | 3  | 4  | 5  | 6  |
|------------------------------|------------------|--------------------------------------------------------------|----|----|----|----|
| <b>R</b> $B_0 \rightarrow 2$ | <b>C</b> $2-0=2$ | <b>R</b> $B_1 \rightarrow 0$<br><b>R</b> $B_2 \rightarrow 3$ |    |    |    |    |
|                              |                  |                                                              |    |    |    |    |
| cycle: 7                     | 8                | 9                                                            | 10 | 11 | 12 | 13 |
|                              |                  |                                                              |    |    |    |    |



# Decompression from CSR format

9



**BRAM Accesses Timeline:**

R Read Operation      C Compute Operation

| cycle: 0 | 1       | 2                | 3                | 4  | 5  | 6  |
|----------|---------|------------------|------------------|----|----|----|
| R B0→2   | C 2-0=2 | R B1→0<br>R B2→3 | R B1→1<br>R B2→2 |    |    |    |
| cycle: 7 | 8       | 9                | 10               | 11 | 12 | 13 |



# Decompression from CSR format

10



**BRAM Accesses Timeline:**

**R** Read Operation    **C** Compute Operation

| cycle: 0      | 1              | 2                              | 3                              | 4             | 5  | 6  |
|---------------|----------------|--------------------------------|--------------------------------|---------------|----|----|
| <b>R</b> B0→2 | <b>C</b> 2-0=2 | <b>R</b> B1→0<br><b>R</b> B2→3 | <b>R</b> B1→1<br><b>R</b> B2→2 | <b>R</b> B0→2 |    |    |
| cycle: 7      | 8              | 9                              | 10                             | 11            | 12 | 13 |



# Decompression from CSR format

11



**BRAM Accesses Timeline:**

R Read Operation      C Compute Operation

| cycle: 0 | 1       | 2                | 3                | 4      | 5       | 6  |
|----------|---------|------------------|------------------|--------|---------|----|
| R B0→2   | C 2-0=2 | R B1→0<br>R B2→3 | R B1→1<br>R B2→2 | R B0→2 | C 2-2=0 |    |
|          | 8       | 9                | 10               | 11     | 12      | 13 |



# Decompression from CSR format

12



**BRAM Accesses Timeline:**

**R** Read Operation    **C** Compute Operation

| cycle: 0       | 1                              | 2                              | 3                              | 4              | 5                              | 6                              |
|----------------|--------------------------------|--------------------------------|--------------------------------|----------------|--------------------------------|--------------------------------|
| <b>R</b> B0→2  | <b>C</b> 2-0=2                 | <b>R</b> B1→0<br><b>R</b> B2→3 | <b>R</b> B1→1<br><b>R</b> B2→2 | <b>R</b> B0→2  | <b>C</b> 2-2=0                 | <b>R</b> B0→5                  |
| cycle: 7       | 8                              | 9                              | 10                             | 11             | 12                             | 13                             |
| <b>C</b> 5-2=3 | <b>R</b> B1→0<br><b>R</b> B2→4 | <b>R</b> B1→1<br><b>R</b> B2→5 | <b>R</b> B1→2<br><b>R</b> B2→1 | <b>C</b> 7-5=2 | <b>R</b> B1→2<br><b>R</b> B2→3 | <b>R</b> B1→3<br><b>R</b> B2→2 |



# Decompression from BCSR format

13



**BRAM Accesses Timeline:**

cycle: 0    1    2    3    4    5    6

**R** Read Operation

**C** Compute Operation

3

4

5

6



# Decompression from BCSR format

14



| <b>BRAM Accesses Timeline:</b> |   |   |   |   |   |   |
|--------------------------------|---|---|---|---|---|---|
| <b>cycle: 0</b>                | 1 | 2 | 3 | 4 | 5 | 6 |
| <b>R</b> B0→1                  |   |   |   |   |   |   |



# Decompression from BCSR format

15



| <b>BRAM Accesses Timeline:</b> |                |   |   |   |   |   |  |
|--------------------------------|----------------|---|---|---|---|---|--|
| cycle: 0                       | 1              | 2 | 3 | 4 | 5 | 6 |  |
| <b>R</b> B0 → 1                | <b>C</b> 1-0=1 |   |   |   |   |   |  |



# Decompression from BCSR format

16



**BRAM Accesses Timeline:**

**R** Read Operation      **C** Compute Operation

| cycle: 0      | 1              | 2             | 3 | 4 | 5 | 6 |
|---------------|----------------|---------------|---|---|---|---|
| <b>R</b> B0→1 | <b>C</b> 1-0=1 | <b>R</b> B1→0 |   |   |   |   |
|               |                | <b>R</b> B2→3 |   |   |   |   |
|               |                | <b>R</b> B3→2 |   |   |   |   |
|               |                | <b>R</b> B4→0 |   |   |   |   |
|               |                | <b>R</b> B5→0 |   |   |   |   |



# Decompression from BCSR format

17



**BRAM Accesses Timeline:**

**R** Read Operation    **C** Compute Operation

| cycle: 0      | 1              | 2                                                                                 | 3             | 4 | 5 | 6 |
|---------------|----------------|-----------------------------------------------------------------------------------|---------------|---|---|---|
| <b>R</b> B0→1 | <b>C</b> 1-0=1 | <b>R</b> B1→0<br><b>R</b> B2→3<br><b>R</b> B3→2<br><b>R</b> B4→0<br><b>R</b> B5→0 | <b>R</b> B0→3 |   |   |   |



# Decompression from BCSR format

18



**BRAM Accesses Timeline:**

**R** Read Operation      **C** Compute Operation

| cycle: 0      | 1              | 2                                                                                 | 3             | 4              | 5 | 6 |
|---------------|----------------|-----------------------------------------------------------------------------------|---------------|----------------|---|---|
| <b>R</b> B0→1 | <b>C</b> 1-0=1 | <b>R</b> B1→0<br><b>R</b> B2→3<br><b>R</b> B3→2<br><b>R</b> B4→0<br><b>R</b> B5→0 | <b>R</b> B0→3 | <b>C</b> 3-1=2 |   |   |



# Decompression from BCSR format

19



**BRAM Accesses Timeline:**

| <i>cycle: 0</i> | <i>1</i>       | <i>2</i>                                                                          | <i>3</i>      | <i>4</i>       | <i>5</i>                                                                          | <i>6</i> |  |
|-----------------|----------------|-----------------------------------------------------------------------------------|---------------|----------------|-----------------------------------------------------------------------------------|----------|--|
| <b>R</b> B0→1   | <b>C</b> 1-0=1 | <b>R</b> B1→0<br><b>R</b> B2→3<br><b>R</b> B3→2<br><b>R</b> B4→0<br><b>R</b> B5→0 | <b>R</b> B0→3 | <b>C</b> 3-1=2 | <b>R</b> B1→0<br><b>R</b> B2→4<br><b>R</b> B3→5<br><b>R</b> B4→0<br><b>R</b> B5→0 |          |  |



# Decompression from BCSR format

20



**BRAM Accesses Timeline:**

| cycle: 0      | 1              | 2             | 3             | 4              | 5             | 6             |
|---------------|----------------|---------------|---------------|----------------|---------------|---------------|
| <b>R</b> B0→1 | <b>C</b> 1-0=1 | <b>R</b> B1→0 | <b>R</b> B0→3 | <b>C</b> 3-1=2 | <b>R</b> B1→0 | <b>R</b> B1→2 |
|               |                | <b>R</b> B2→3 |               |                | <b>R</b> B2→4 | <b>R</b> B2→1 |
|               |                | <b>R</b> B3→2 |               |                | <b>R</b> B3→5 | <b>R</b> B3→0 |
|               |                | <b>R</b> B4→0 |               |                | <b>R</b> B4→0 | <b>R</b> B4→3 |
|               |                | <b>R</b> B5→0 |               |                | <b>R</b> B5→0 | <b>R</b> B5→2 |



# Key Challenge of Decompressing CSR and BCSR

21

Creating each row of data has following overheads:

- ▶ One access to the meta data
- ▶ One computation

Reading the column indices and values is sequential because

- ▶ We do not know in advance which elements of column indices and values are going to be accessed.
- ▶ We cannot partition and allocate those two vectors across the blocks of BRAM to guarantee parallel reads.



# Key Insights and Solutions

22

To address the challenge we propose Ascella

Ascella achieves the ideal streaming for sparse problems by

- ▶ Avoiding extra accesses to meta data
- ▶ Providing deterministic parallel accesses to data

Ascella is an accelerator for SpMV that sustains a balance between computation and data-transfer time



# Contributions of Ascella

23

Ascella uses a compressed format similar to list of lists (LIL<sup>1</sup>):



Ascella implements a lightweight microarchitecture that

- ▶ Connects the streamlines of memory to the parallel dot-product engine
- ▶ Enables ideal data streaming

<sup>1</sup>[https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.lil\\_matrix.html](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.lil_matrix.html)



# Decompression mechanism of Ascella

24

**SpMV Example:**

|   |   |   |   |   |
|---|---|---|---|---|
|   | 3 | 2 |   |   |
| A | 4 | 5 | 1 |   |
|   |   |   | 3 | 2 |

X

|                |   |   |   |   |
|----------------|---|---|---|---|
| B <sup>T</sup> | 3 | 1 | 5 | 6 |
|----------------|---|---|---|---|

|               |         |   |   |   |          |
|---------------|---------|---|---|---|----------|
|               | row     | 0 | 0 | 2 | 3        |
|               | indices | 2 | 2 | 3 | $\infty$ |
| LIL<br>Format | values  | 3 | 2 | 1 | 2        |
|               |         | 4 | 5 | 3 | 0        |

**Mapping to BRAM Blocks**

| row indices | values |
|-------------|--------|
| B0          | 0      |
| B1          | 0      |
| B2          | 2      |
| B3          | 3      |
| B4          | 3      |
| B5          | 2      |
| B6          | 1      |
| B7          | 2      |





# Decompression mechanism of Ascella

25

| SpMV Example:  |          |
|----------------|----------|
| A              | $\times$ |
| 3 2            |          |
| 4 5 1          |          |
|                | 3 2      |
| B <sup>T</sup> | 3 1 5 6  |

| LIL Format | row indices    | values |
|------------|----------------|--------|
|            | 0 0 2 3        |        |
|            | 2 2 3 $\infty$ |        |
|            | 3 2 1 2        |        |
|            | 4 5 3 0        |        |

| Mapping to BRAM Blocks | B0 B1 B2 B3    | B4 B5 B6 B7 |
|------------------------|----------------|-------------|
|                        | 0 0 2 3        | 3 2 1 2     |
|                        | 2 2 3 $\infty$ | 4 5 3 0     |

| BRAM Accesses for Decompression: |                      |                      |                      |
|----------------------------------|----------------------|----------------------|----------------------|
| cycle: 0                         |                      | R Read Operation     |                      |
| R B0 $\rightarrow$ 0             | R B4 $\rightarrow$ 3 | R B0 $\rightarrow$ 2 | R B4 $\rightarrow$ 4 |
| R B1 $\rightarrow$ 0             | R B5 $\rightarrow$ 2 | R B1 $\rightarrow$ 2 | R B5 $\rightarrow$ 5 |
| R B2 $\rightarrow$ 2             | R B6 $\rightarrow$ 1 | R B2 $\rightarrow$ 2 | R B6 $\rightarrow$ 1 |
| R B3 $\rightarrow$ 3             | R B7 $\rightarrow$ 2 | R B3 $\rightarrow$ 3 | R B7 $\rightarrow$ 2 |



# Decompression mechanism of Ascella

26

| SpMV Example:  |          |
|----------------|----------|
| A              | $\times$ |
| 3 2            |          |
| 4 5 1          |          |
|                | 3 2      |
| B <sup>T</sup> | 3 1 5 6  |

| LIL Format | row indices    | values |
|------------|----------------|--------|
|            | 0 0 2 3        |        |
|            | 2 2 3 $\infty$ |        |
|            | 3 2 1 2        |        |
|            | 4 5 3 0        |        |

| Mapping to BRAM Blocks | B0 B1 B2 B3    | B4 B5 B6 B7 |
|------------------------|----------------|-------------|
|                        | 0 0 2 3        | 3 2 1 2     |
|                        | 2 2 3 $\infty$ | 4 5 3 0     |

| BRAM Accesses for Decompression: |                      |                      |                      |
|----------------------------------|----------------------|----------------------|----------------------|
| cycle: 0                         |                      | 1                    | 2                    |
| R B0 $\rightarrow$ 0             | R B4 $\rightarrow$ 3 | R B0 $\rightarrow$ 2 | R B4 $\rightarrow$ 4 |
| R B1 $\rightarrow$ 0             | R B5 $\rightarrow$ 2 | R B1 $\rightarrow$ 2 | R B5 $\rightarrow$ 5 |
| R B2 $\rightarrow$ 2             | R B6 $\rightarrow$ 1 | R B2 $\rightarrow$ 2 | R B6 $\rightarrow$ 1 |
| R B3 $\rightarrow$ 3             | R B7 $\rightarrow$ 2 | R B3 $\rightarrow$ 3 | R B7 $\rightarrow$ 2 |
|                                  |                      | R B2 $\rightarrow$ 3 | R B6 $\rightarrow$ 2 |
|                                  |                      | R B3 $\rightarrow$ 3 | R B7 $\rightarrow$ 3 |



# Microarchitecture of Ascella

27

At each step of decompression:

- ▶ *read indices* are used to read the *row indices*
- ▶ the minimum of *row indices* is used to create a *mask*
- ▶ the mask
  - ▶ Selects values of *dense row*
  - ▶ Updates the *read indices*





# Microarchitecture of Ascella

28

Creating the first row:





# Microarchitecture of Ascella

29

Creating the third row:





# Microarchitecture of Ascella

30

Creating the last row:





# Experimental Setup

31

We apply SpMV on a group of matrices from SuiteSparse collection

We implement Ascella and the baselines

- ▶ Using Xilinx Vivado HLS
- ▶ On ZYNQ XC7Z020FPGA.

Our comparison metrics are

- ▶ Latency
- ▶ Recourse utilization

The configurations of Ascella and baselines include

- ▶ AXI stream interfaces
- ▶ DDR3 memory
- ▶ 100 MHz frequency
- ▶ 32-bit integers



# Performance Evaluation

32



Dataset: thermomech-TC



# Performance Evaluation

33

On average, Ascella executes SpMV

- ▶ 2.7x faster than using CSR
- ▶ 5.1x faster than using BCSR





# Resource Utilization

34

- ▶ BCSR and Ascella use more BRAM because of partitioning.
- ▶ CSR has the lowest flip-flop and look-up table (LUT) utilization because it does not implement any parallelism.





# Conclusions

35

Ascella is a streaming accelerator for sparse problems that

- ▶ Streams the non-zero values of sparse matrices and processes them as they come at the same pace
- ▶ Is a significant step towards accelerating larger sparse problems because its storage format
  - ▶ facilitates partitioning large matrices
  - ▶ is supported in Python libraries, which makes the implementation straightforward