

# TCDM Burst Access: Breaking the Bandwidth Barrier in Shared-L1 RVV Clusters Beyond 1000 FPUs

Diyou Shen\*

Integrated Systems Laboratory  
ETH Zürich  
Zürich, Switzerland  
disen@iis.ee.ethz.ch

Yichao Zhang\*

Integrated Systems Laboratory  
ETH Zürich  
Zürich, Switzerland  
yiczhang@iis.ee.ethz.ch

Marco Bertuletti

Integrated Systems Laboratory  
ETH Zürich  
Zürich, Switzerland  
mbertuletti@iis.ee.ethz.ch

Luca Benini

Integrated Systems Laboratory  
ETH Zürich  
Zürich, Switzerland  
Università di Bologna  
Bologna, Italy  
lbenini@iis.ee.ethz.ch

**Abstract**—As the computing demand and memory footprint of deep learning applications grow, clusters of cores sharing local (L1) multi-banked memory are widely used as key building blocks in large-scale architectures. When the cluster’s core count increases, a flat all-to-all interconnect between cores and L1 memory banks becomes a physical implementation bottleneck, and hierarchical network topologies are required. However, hierarchical, multi-level intra-cluster networks are subject to internal contention, which may lead to significant performance degradation, especially for SIMD or vector cores, as their memory access is bursty. We present the TCDM Burst Access architecture, which provides software-transparent burst transaction support to improve bandwidth utilization in clusters with many vector cores tightly coupled to a multi-banked L1 data memory. In our solution, a Burst Manager dispatches burst requests to L1 memory banks, and multiple 32-bit words from burst responses are retired in parallel on channels with parametric data width. We validate our design on a RISC-V Vector (RVV) many-core cluster, evaluating the benefits at different core counts. With minimal logic area overhead (less than 8%), we improve the bandwidth of 16-, 256-, and 1024-Floating Point Unit (FPU) baseline clusters without Tightly Coupled Data Memory (TCDM) Burst Access by 118%, 226%, and 77%, respectively. Reaching up to 80% of the cores-memory peak bandwidth, our design demonstrates ultra-high bandwidth utilization and enables efficient performance scaling. Implemented in a 12-nm FinFET technology node, our solution achieves up to 1.9x energy efficiency and 2.76x performance in real-world kernel benchmarks compared to the serialized access baseline.

**Index Terms**—RISC-V, NoC, Vector, Many-Core

## I. INTRODUCTION

In the last ten years, the explosive growth of deep-learning workloads has driven the demand for parallel computing systems with large compute power and memory footprint: computing capacity requirements for training and inference of Machine Learning (ML) models doubled every 6-9 months [1], skyrocketing to quadrillion FLOPs; Large Language Models (LLMs) require substantial memory capacity to handle hundreds of billions of parameters [2].

Shared L1-memory clusters with programmable processing elements (PEs) have become a common architectural pattern to achieve both performance and energy efficiency: as the core count scales up, the memory access latency is kept low. To increase the PE-memory bandwidth while keeping the physical design modular and feasible, these architectures often incorporate a pipelined hierarchical interconnect, forming a Non-Uniform Memory Access (NUMA) system. However, the limited number of ports to the shared interconnect nodes constrains the interconnection bandwidth, restricting the design scalability. Examples can be found in modern designs, where the size of shared-L1-memory clusters is limited to tens of PEs: the Tenstorrent architecture [3] packs only five PEs with shared memory into a Tensix *core warp*, forming the design’s second hierarchy; in ET-SOC-1 [4] by Esperanto, four 8-core *Neighborhood* blocks share only four Scratchpad Memory (SPM) banks through two hierarchical Fully-Connected (FC) crossbars; the Fujitsu A64FX [5] and Kalray MPPA architecture [6] share memory across 13 and 16 cores, respectively.

Larger-scale clusters are desirable to increase the tiling size of computations, thereby reducing data movement overheads and increasing the compute vs. memory transfer ratio in kernels like matrix multiplication (where this ratio grows as  $N^3/N^2$ ) [7], [8]. The PULP-Platform’s MemPool architecture [8] implements this concept by leveraging 256 PEs tightly coupled to shared-L1, specifically Tightly Coupled Data Memory (TCDM), where FC crossbars ensure low-latency access between PEs and banks. Another key trend [9] is to increase the PEs' compute efficiency by leveraging vector Instruction Set Architectures (ISAs), which improve the compute-fetch balance and increase the utilization of computing units. However, vector loads and stores need to access large chunks of consecutive addresses to keep the vector lanes busy. This translates into several simultaneous requests to the ports of the PE-to-L1-memory interconnect. As a result, conflicts between vector PEs trying to concurrently access shared memory through the PE-to-memory interconnect become a critical bottleneck that severely limits design scalability.

Fig. 1 illustrates the problem. First note that the PE-to-memory interconnect is hierarchical for physical scalability. At the lowest level of the hierarchy, the tile, PEs access a subset of the L1 shared memory (banks 0-7 in the example) with full bandwidth. However, accesses to banks in other tiles have to go through the remote crossbars with a number of remote ports smaller than the number of local ports (one to four in the example). As a consequence, data fetching is serialized and fails to keep all Load/Store Unit (LSU) ports busy, thereby underutilizing the full memory bandwidth of the processor. Multiple approaches have been explored to mitigate this effect. Task scheduling techniques reduce memory accesses and alleviate interconnection pressure [9], [10]. However, task-level optimization cannot fundamentally resolve the access conflicts in the core-memory interconnection. An architecture-aware data arrangement makes better use of the available interconnection ports and reduces data movement overhead [11]. Nonetheless, this solution is typically tied to specific application algorithms, and adds complexity to software-managed data allocation and transfer. Topology-level optimizations, such as 2D meshes typically implemented with Networks-on-Chip (NoCs), can achieve high link bandwidth [12]. However, mesh-like NoCs are unsuitable for interconnections between PEs and L1 due to the additional latency from router hops, which substantially reduces the throughput available to the PEs when traffic is not localized in tight sub-meshes.

\*These two authors contributed equally to this work.



Fig. 1. Conflicts on shared interconnection resources reduce the interconnection bandwidth in vector many-core shared-memory processors. The number on each request indicates its target bank.

In this paper, we propose the TCDM Burst Access, which breaks the interconnection bandwidth barrier caused by port competition in vector many-core clusters and helps the shared-L1 cluster to efficiently scale up beyond 1000 Floating Point Units (FPUs). We implement and validate our design approach across various scales of the MemPool-Spatz design [13], an open-source, scalable many-core RISC-V Vector (RVV) cluster. Our results demonstrate bandwidth reaching up to 80% of the theoretical peak while maintaining minimal area impact and superior energy efficiency. The main contributions of this paper are:

- The TCDM Burst Access, a conflict-free mechanism for memory requests that implements narrow-word (32-bit) burst accesses to L1 shared memory. A Burst Manager module is designed to (i) dispatch burst requests to a multi-banked scratchpad and (ii) merge the parallel memory responses into a single transaction, saturating the available interconnection bandwidth.
- A physical-design aware FC interconnect that maximizes the area-utilization of routing resources, by increasing data-width on the response channels only, to reduce serialization of burst responses.
- The validation of our design on a scalable RVV cluster with different core counts. Clusters with 16/256/1024 FPUs obtain 118%, 226% and 77% bandwidth improvement, respectively. Compared to the baseline, we achieve 176%, 64% and 62% performance improvement on real-world kernels: Dot Product (DotP), Fast Fourier Transform (FFT), and Matrix Multiplication (MatMul).

Implemented in GF12nm FinFET technology, our approach incurs less than 8% logic area overhead without introducing critical timing paths. It improves energy efficiency by up to 90% for memory-bound kernels. Our design is fully open-sourced<sup>1</sup>.

## II. TESTBED CLUSTER AND PEAK-BW ANALYSIS

To investigate the internal contention in hierarchical PEs-memory networks, this section presents a bandwidth analysis based on a single instruction, multiple data (SIMD) many-core testbed cluster. We analyze the theoretical peak interconnection bandwidth across various cluster scales, quantify the loss of bandwidth utilization, and outline our proposed solution.

### A. Testbed cluster architecture

As discussed in section I, many-core vector clusters are susceptible to interconnect contention due to SIMD load and store operations accessing consecutive addresses. This leads to conflicts at the same ports of the hierarchical interconnection. We select *MemPool-Spatz* [13] as our testbed architecture, an open-source, scalable, RVV many-core shared-L1 cluster based on the Zve32f ISA. The architecture's PEs are Core Complexes (CCs), in which one *Snitch* scalar core is responsible for executing scalar instructions and forwarding vector instructions to a floating-point *Spatz* vector core. We follow the naming convention  $MP_N Spatz_K$  to represent different design scales, where  $N$  indicates the number of CCs and  $K$  represents the number of vector FPUs per *Spatz* vector core. The total number of FPUs is given by  $N \times K$ .

In the hierarchical multi-level interconnection design of  $MP_N Spatz_K$ , all PEs share NUMA access to  $N \times 4$  fully interleaved 1 KiB banks of SPM. In this paper, we focus on the two most energy-efficient configurations demonstrated by [13], with 16 and 256 FPUs. We further scale the design up to a 1024-FPU configuration, incorporating a hierarchy configuration inspired by [7], as follows:

- 1)  $MP_4 Spatz_4$ : a 16-FPU vector cluster with a maximum vector length (VLEN) of 256 bits. The design is built with one hierarchy, the *Tile*, consisting of 4 CCs and 16 SPM banks with 1-cycle round-trip access latency. Each *Tile* has four hierarchical interconnection ports, accessing other *Tiles* with 3 cycles round-trip latency.
- 2)  $MP_{64} Spatz_4$ : a 256-FPU vector cluster with a maximum VLEN of 256 bits. The design is built in two hierarchies. The *Tile* hierarchy consists of 4 CCs and 16 SPM banks with 1-cycle round-trip access latency. Four *Group* hierarchy blocks contain 16 *Tiles* each. Each *Tile* and *Group* has four hierarchical interconnection ports, accessing other *Tiles* with 3 cycles and other *Groups* with 5 cycles round-trip latency.
- 3)  $MP_{128} Spatz_8$ : a 1024-FPU vector cluster with a maximum VLEN of 512 bits. The design is built in three hierarchies. The *Tile* hierarchy consists of 8 CCs and 32 SPM banks with 1-cycle round-trip access latency, followed by four *SubGroup* hierarchy blocks with 8 *Tiles* each. Four *SubGroups* form the *Group* hierarchy. Each *Tile* has seven hierarchical interconnection ports: one port accesses other *Tiles* within the same *SubGroup* with 3 cycles latency; three ports access the other three *SubGroups* within the same *Group* with 5 cycles latency; and three ports access remote *Groups* with 9 cycles round-trip latency.

<sup>1</sup><https://github.com/pulp-platform/mempool>
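The naming convention and the bank count stated above can be captured in a few lines. This is only an illustrative sketch: the class `ClusterConfig` and its field names are hypothetical and do not come from the MemPool-Spatz code base; the only facts used are that the cluster has $N \times K$ FPUs and $N \times 4$ banks of 1 KiB.

```python
from dataclasses import dataclass

BANKS_PER_CC = 4    # the text states N x 4 interleaved banks
BANK_SIZE_KIB = 1   # each SPM bank holds 1 KiB


@dataclass(frozen=True)
class ClusterConfig:
    """Hypothetical helper describing one MP_N Spatz_K configuration."""
    n_cc: int    # N: number of Core Complexes
    k_fpu: int   # K: number of FPUs per Spatz vector core

    @property
    def name(self) -> str:
        return f"MP{self.n_cc}Spatz{self.k_fpu}"

    @property
    def total_fpus(self) -> int:
        return self.n_cc * self.k_fpu

    @property
    def num_banks(self) -> int:
        return self.n_cc * BANKS_PER_CC

    @property
    def l1_kib(self) -> int:
        return self.num_banks * BANK_SIZE_KIB


# The three testbed clusters discussed in this section.
CONFIGS = [ClusterConfig(4, 4), ClusterConfig(64, 4), ClusterConfig(128, 8)]

if __name__ == "__main__":
    for cfg in CONFIGS:
        print(cfg.name, cfg.total_fpus, "FPUs,", cfg.num_banks, "banks")
```

Evaluating the three configurations gives 16, 256, and 1024 FPUs, matching the cluster sizes named in the list.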

### B. Interconnect bandwidth analysis

The Vector Load/Store Unit (VLSU) manages the memory accesses of the vector core, with the number of request and response ports matching the number of FPUs,  $K$ , in the Spatz<sub>K</sub> design, as shown in Fig. 2. The VLSU splits a vector memory request into multiple 32 b data requests and distributes them across the available VLSU ports. The theoretical VLSU peak bandwidth can be defined as the bandwidth achieved when all requests sent through the VLSU are routed by an all-to-all fully connected crossbar without any contention:

$$BW_{vlsuPeak} = BW_{Spatz_K} = K \times 4 \text{ Bytes/cyc} \quad (1)$$

In hierarchical FC crossbars-based testbed clusters, memory accesses can be categorized as *local-Tile* or *remote-Hierarchy*, depending on the requested target address. In *local-Tile* accesses, memory requests from a VLSU target the SPM banks within its own *Tile*. This achieves full local-interconnection bandwidth if no bank conflicts are encountered, benefiting from a FC crossbar that does not require arbitration. In contrast, *remote-Hierarchy* accesses encounter conflicts when the parallel requests from a VLSU target L1 address portions that are allocated to the same shared interconnection port. This results in decreased bandwidth utilization as parallel requests must be arbitrated and serialized, as previously illustrated in Fig. 1. The estimated bandwidth is as follows:

$$BW_{locTile} = BW_{vlsuPeak} \quad (2)$$

$$BW_{rmtHier} = BW_{serialized} = 4 \text{ Bytes/cyc} \quad (3)$$

Assuming each vector request targets random, uniformly distributed destination banks, and with  $N_{PE}$  denoting the total number of vector cores, the probabilities of *local-Tile* and *remote-Hierarchy* accesses,  $p_l$  and  $p_r$ , are given in eq. (4). The resulting average bandwidth under random access is given in eq. (5).

$$p_l = \frac{1}{N_{PE}}, \quad p_r = 1 - p_l = \frac{N_{PE} - 1}{N_{PE}} \quad (4)$$

$$BW_{hierAvg} = \mathbb{E}[BW] = p_l \cdot BW_{locTile} + p_r \cdot BW_{rmtHier} \quad (5)$$
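Equations (1) to (5) are simple enough to evaluate numerically. The following is a minimal sketch of that calculation; the function names are illustrative and not taken from the paper's artifacts.

```python
WORD_BYTES = 4  # each narrow request carries one 32-bit word


def vlsu_peak_bw(k: int) -> float:
    """Eq. (1): contention-free VLSU bandwidth in bytes per cycle."""
    return k * WORD_BYTES


def hier_avg_bw(n_pe: int, k: int) -> float:
    """Eq. (5): expected bandwidth under uniformly random targets."""
    p_local = 1.0 / n_pe            # eq. (4)
    p_remote = 1.0 - p_local        # eq. (4)
    bw_local = vlsu_peak_bw(k)      # eq. (2)
    bw_remote = WORD_BYTES          # eq. (3): serialized remote access
    return p_local * bw_local + p_remote * bw_remote


if __name__ == "__main__":
    for n_pe, k in [(4, 4), (64, 4), (128, 8)]:
        print(f"N_PE={n_pe:3d} K={k}: peak={vlsu_peak_bw(k):5.2f} B/cyc, "
              f"avg={hier_avg_bw(n_pe, k):.4f} B/cyc")
```

For the three testbed clusters this yields averages of 7.0, 4.1875, and 4.21875 bytes per cycle, against peaks of 16, 16, and 32 bytes per cycle.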

We calculate the theoretical VLSU peak bandwidth ( $BW_{vlsuPeak}$ ) and the hierarchical interconnection average bandwidth with random accessing ( $BW_{hierAvg}$ ) for all three scaled testbed clusters, and summarize the results in the first two rows of Table I. The results demonstrate significantly lower bandwidth in the multi-level hierarchical FC crossbar compared with the peak bandwidth that the VLSU interfaces could support. In the MP<sub>128</sub>Spatz<sub>8</sub>, the local-Tile bandwidth increases, scaling with the number of CCs. As a result, the hierarchical interconnection average bandwidth of the baseline MP<sub>128</sub>Spatz<sub>8</sub> testbed cluster slightly improves, but the bandwidth utilization drops (11.75%) due to the increased VLSU peak bandwidth. Thus, finding a solution to mitigate the hierarchical interconnection conflicts is crucial for maintaining performance scalability in large-scale shared-memory vector cluster designs. In the following subsection, we introduce our proposed solution to address this challenge.

TABLE I  
CALCULATED MEMORY BANDWIDTH: COMPARISON ACROSS CLUSTER SIZES AND CONFIGURATIONS.

|                 |                 | MP <sub>4</sub> Spatz <sub>4</sub> | MP <sub>64</sub> Spatz <sub>4</sub> | MP <sub>128</sub> Spatz <sub>8</sub> |
|-----------------|-----------------|------------------------------------|-------------------------------------|--------------------------------------|
| <b>Baseline</b> | Peak BW [B/cyc] | 16.00                              | 16.00                               | 32.00                                |
|                 | BW [B/cyc]      | 7.00                               | 4.18                                | 4.22                                 |
|                 | Utilization     | 37.50%                             | 21.38%                              | 11.75%                               |
| <b>2xRsp</b>    | BW [B/cyc]      | 10.00                              | 8.13                                | 8.19                                 |
|                 | Utilization     | 62.50%                             | 50.78%                              | 25.59%                               |
|                 | Improvement     | +42.86%                            | +94.38%                             | +94.02%                              |
| <b>4xRsp</b>    | BW [B/cyc]      | 16.00                              | 16.00                               | 16.13                                |
|                 | Utilization     | 100.00%                            | 100.00%                             | 50.39%                               |
|                 | Improvement     | +128.57%                           | +282.78%                            | +282.11%                             |



Fig. 2. MP<sub>64</sub>Spatz<sub>4</sub>'s Tile level architectural schematic with TCDM burst and GF4. The increased data-width response channels are marked in red.

### C. Burst Access for TCDM conflicts reduction

We propose the *TCDM Burst Access* as a solution to break the interconnection bandwidth barrier due to port competition, specifically enhancing the load-request and memory-response channels in multi-level hierarchical FC crossbars. We focus on loads because the latency of store operations is hidden by the synchronization time required to resolve inter-core data dependencies in SPM-based parallel clusters. Stores are consequently non-critical for the cluster performance. Our solution consists of two key contributions:

1) *Burst narrow requests*: To resolve the port contentions on hierarchical interconnect, a widely used approach is to reduce the number of memory requests by employing a *burst access* mechanism. In this mechanism, multiple narrow memory requests (32 b) are combined into a single transfer with the *burst length* information, which specifies the number of consecutive element words to be requested. This is particularly advantageous for vector requests, as their consecutive address patterns can straightforwardly be mapped to a burst format by specifying a start address and burst length.

2) *Increased response data-width*: Upon receiving a burst request, the SPM banks process requests simultaneously and generate data responses in parallel. This scenario introduces port contentions on the memory-response channel, leading us to the second aspect of our solution: the parallel response data can be merged in the memory-response channel with an expanded data field, thereby reducing the number of individual transfers sent across the interconnection. However, the routing complexity in the FC crossbar design increases linearly with the width of the data field. To maintain physical feasibility in differently scaled clusters, the data width extension should remain hardware-configurable, allowing for flexible adjustments to ensure high routing resource utilization. The Grouping Factor (GF) describes the multiplier used to extend the data width on the response channel. In this paper, we explore the bandwidth improvements associated with doubling (*GF2*, listed as 2xRsp in Table I) and quadrupling (*GF4*, listed as 4xRsp in Table I) the response channel data fields. The results from the analytical model outlined in section II-B are presented in Table I. According to our model, full bandwidth utilization becomes achievable when the width factor of the response data field equals the number of VLSU ports. Our proposed solution improves bandwidth in multi-level hierarchical interconnection designs and enhances bandwidth utilization during further scaling up.
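The effect of the Grouping Factor on the analytical model can be sketched as follows. The assumption here, which is ours and is only inferred from the BW rows of Table I, is that a remote access retires up to `min(GF, K)` words per cycle instead of one; the function name and parameters are illustrative.

```python
WORD_BYTES = 4  # one 32-bit word


def hier_avg_bw_gf(n_pe: int, k: int, gf: int = 1) -> float:
    """Average bandwidth (bytes/cycle) with a GF-times wider response channel.

    Assumption: remote responses deliver min(gf, k) words per cycle, so
    gf=1 reproduces the serialized baseline of eq. (3).
    """
    p_local = 1.0 / n_pe
    p_remote = 1.0 - p_local
    bw_local = k * WORD_BYTES
    bw_remote = min(gf, k) * WORD_BYTES
    return p_local * bw_local + p_remote * bw_remote


if __name__ == "__main__":
    for n_pe, k in [(4, 4), (64, 4), (128, 8)]:
        row = [hier_avg_bw_gf(n_pe, k, gf) for gf in (1, 2, 4)]
        print(f"MP{n_pe}Spatz{k}: baseline/GF2/GF4 = {row}")
```

Under this assumption, a cluster with $K = 4$ reaches its 16 bytes-per-cycle peak at *GF4*, while the $K = 8$ cluster reaches 16.125 of its 32 bytes per cycle, in line with the statement that full utilization requires the width factor to equal the number of VLSU ports.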

## III. ARCHITECTURE

This section presents the key architectural components designed to support TCDM Burst Access. We implement TCDM Burst Access in the testbed clusters' *Tiles*. A *Tile* diagram with *GF4* implementation on MP<sub>64</sub>Spatz<sub>4</sub> is shown as an example in Fig. 2.

### A. Burst Sender

The Burst Sender is attached to the VLSU ports in the Spatz processor. When it detects a vector load (VLE) instruction, it combines the  $K$  parallel requests at the VLSU ports into a single burst with a burst length of  $K$  words. In the MemPool-Spatz testbed, the ordering between memory requests and responses is guaranteed using Reorder Buffers (ROBs) at the VLSU ports. These ROBs are also used for latency tolerance by enabling multiple outstanding transactions. Since each burst request contains multiple narrow requests, the depth of the ROB needs to be increased to maintain the same level of outstanding transaction support; in our testbed clusters, for example, we double it.
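A behavioral sketch of the coalescing step may help make the request format concrete. It is a software model under stated assumptions, not the RTL: the names `BurstRequest` and `coalesce` are hypothetical, and returning `None` for non-consecutive addresses (leaving them as narrow requests) is our simplification.

```python
from dataclasses import dataclass
from typing import List, Optional

WORD_BYTES = 4  # narrow requests are 32-bit words


@dataclass(frozen=True)
class BurstRequest:
    """Hypothetical burst format: start address plus length in words."""
    start_addr: int
    burst_len: int


def coalesce(word_addrs: List[int]) -> Optional[BurstRequest]:
    """Combine the K parallel VLSU requests into one burst.

    Returns a burst only if the addresses are consecutive 32-bit words;
    otherwise returns None (assumed fallback to narrow requests).
    """
    if not word_addrs:
        return None
    start = word_addrs[0]
    for i, addr in enumerate(word_addrs):
        if addr != start + i * WORD_BYTES:
            return None
    return BurstRequest(start_addr=start, burst_len=len(word_addrs))


if __name__ == "__main__":
    # K = 4 ports issuing a unit-stride load starting at 0x100.
    print(coalesce([0x100, 0x104, 0x108, 0x10C]))
    # A strided pattern is not mapped to a burst in this sketch.
    print(coalesce([0x100, 0x108, 0x110, 0x118]))
```

With this format, the interconnect carries one request per $K$ words, which is why the ROB has to track more words per outstanding transaction.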

### B. Burst Manager

We design a Burst Manager module that serves as a burst format adapter. It efficiently splits or combines 32 b narrow requests and responses, adapting them to SPM banks without complicating the memory module design. Further details are:

i) On the request channel, the Burst Manager receives burst requests, converts them into parallel 32 b memory requests, and forwards them to the SPM banks. If multiple bursts arrive simultaneously, an arbiter and a small first-in first-out (FIFO) buffer are used to hold the pending burst requests.

ii) On the response channel, the Burst Manager leverages the widened response data width, configured through an elaboration parameter (*GF*). It merges the parallel response data into a single transfer and forwards it through the widened data field. One such block is needed for every *GF* SPM banks to handle the burst requests in parallel.
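The two roles of the Burst Manager can be modeled behaviorally as a split on the request side and a merge on the response side. This sketch assumes word-level interleaving across banks (bank index taken from the low word-address bits), which matches the "fully interleaved" description but is our choice of mapping; the function names are hypothetical.

```python
from typing import List, Tuple

WORD_BYTES = 4  # 32-bit narrow requests


def split_burst(start_addr: int, burst_len: int,
                num_banks: int) -> List[Tuple[int, int]]:
    """Expand a burst into per-bank (bank, row) narrow requests.

    Assumes consecutive words map to consecutive banks.
    """
    first_word = start_addr // WORD_BYTES
    requests = []
    for i in range(burst_len):
        word = first_word + i
        requests.append((word % num_banks, word // num_banks))
    return requests


def merge_responses(words: List[int], gf: int) -> List[List[int]]:
    """Pack 32-bit response words into beats of up to gf words each."""
    return [words[i:i + gf] for i in range(0, len(words), gf)]


if __name__ == "__main__":
    # A 4-word burst starting at byte address 0x20 in a 16-bank Tile.
    print(split_burst(0x20, 4, 16))
    data = [10, 11, 12, 13]
    print(len(merge_responses(data, 1)), "beats without widening")
    print(len(merge_responses(data, 2)), "beats with GF2")
    print(len(merge_responses(data, 4)), "beat with GF4")
```

The number of response beats, and therefore the number of transfers competing for a remote port, shrinks by the Grouping Factor.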

We implemented our solution with modular designs to minimize the changes in the original testbed clusters. In both MP<sub>4</sub>Spatz<sub>4</sub> and MP<sub>64</sub>Spatz<sub>4</sub>, a *GF4* design is implemented for maximizing the bandwidth. A *GF2* design is used in MP<sub>128</sub>Spatz<sub>8</sub> considering the increased routing congestion in scaling. In the next section, we will evaluate the performance of our design as implemented on these testbeds.

## IV. PERFORMANCE ANALYSIS

The roofline model is widely used to analyze an architecture's performance with respect to the memory bandwidth [14]. We present the roofline models of our designs on the testbed clusters in Fig. 3. We determined the ideal no-contention bandwidth per core using the theoretical VLSU peak bandwidth, and the maximum achievable performance with the theoretical maximum throughput of the FPUs in a PE.
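For readers unfamiliar with the roofline model, the bound it places on a kernel is the minimum of the compute roof and the bandwidth roof scaled by arithmetic intensity. The sketch below works per core and assumes two FLOPs per FPU per cycle (one fused multiply-add), which is our assumption; the function and variable names are illustrative.

```python
FLOPS_PER_FPU_PER_CYCLE = 2  # assumption: one fused multiply-add per cycle


def roofline(arith_intensity: float, bw_bytes_per_cyc: float,
             peak_flops_per_cyc: float) -> float:
    """Attainable FLOPs/cycle for a kernel with the given FLOPs/byte."""
    return min(peak_flops_per_cyc, arith_intensity * bw_bytes_per_cyc)


if __name__ == "__main__":
    k = 4                                  # FPUs per core (Spatz_4)
    peak = k * FLOPS_PER_FPU_PER_CYCLE     # compute roof: 8 FLOPs/cycle
    for label, bw in [("serialized average", 7.0), ("VLSU peak", 16.0)]:
        ridge = peak / bw                  # intensity where roofs meet
        print(f"{label}: ridge at {ridge:.2f} FLOPs/byte, "
              f"bound at AI=0.25: {roofline(0.25, bw, peak)} FLOPs/cycle")
```

Raising the achieved bandwidth moves the ridge point to the left, so kernels with low arithmetic intensity gain the most, while kernels already under the compute roof are unaffected.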

In our analysis, we benchmark different real-world kernels with distinct arithmetic intensities to demonstrate the effectiveness of TCDM Burst Access mechanism in improving bandwidth utilization and performance:

- 1) *DotP*: Dot product of two  $n$ -element vectors, with an arithmetic intensity of 0.25 FLOPs/byte.
- 2) *FFT*: Multi-core implementation of the Cooley-Tukey Radix-2 FFT algorithm, running  $k$  instances of  $n$ -point FFTs in parallel across all cores on complex single-precision floating-point samples. Depending on both the FFT problem size and the number of cores involved, the arithmetic intensity ranges between 0.3 FLOPs/byte and 0.5 FLOPs/byte.
- 3) *MatMul*: Matrix multiplication on two  $n \times n$  single-precision floating-point matrices. The arithmetic intensity varies depending on the problem sizes of the matrices. We evaluate the performance on two sizes of MatMul kernels on each hardware configuration, with arithmetic intensities of 1.5 FLOPs/byte and 3.5 FLOPs/byte, respectively.

Additionally, we evaluate the bandwidth analysis model of section II-B in simulation, using a test kernel with vector loads targeting random addresses; the result is shown as the dashed line in Fig. 3. All kernels and tests follow a fork-join programming model, with all data preloaded into the testbed's L1 memory.

The roofline plots, shown in Fig. 3, compare the testbed clusters with and without the TCDM Burst Access implementation. Our *GF4* design improves the hierarchical average bandwidth by 118% and 226%, achieving average bandwidth utilization of 82% and 70%, in the MP<sub>4</sub>Spatz<sub>4</sub> and MP<sub>64</sub>Spatz<sub>4</sub> clusters, respectively. In the DotP kernel, the *GF4* design shows 106% and 176% performance improvements compared to the baseline testbed in MP<sub>4</sub>Spatz<sub>4</sub> and MP<sub>64</sub>Spatz<sub>4</sub>, respectively, closely matching the improvements in bandwidth. Smaller gains of 41% and 64% are observed in the FFT kernel for MP<sub>4</sub>Spatz<sub>4</sub> and MP<sub>64</sub>Spatz<sub>4</sub>, due to the unavoidable inter-core synchronization inherent in the multi-core FFT algorithm. The performance of compute-bound MatMul kernels does not differ between the baseline testbed and our *GF4* design in the MP<sub>4</sub>Spatz<sub>4</sub> cluster. When working on smaller matrix sizes in a MatMul kernel, the ratio of data transfer to computation becomes significant, causing the performance to be limited by the memory bandwidth. In this scenario, such as the 64 × 64 × 64 MatMul on the MP<sub>64</sub>Spatz<sub>4</sub> cluster, a notable performance improvement of 35% is observed with the *GF4* design.

Fig. 3. The roofline plots on original and burst-enabled configurations on MP<sub>4</sub>Spatz<sub>4</sub> (left), MP<sub>64</sub>Spatz<sub>4</sub> (middle), and MP<sub>128</sub>Spatz<sub>8</sub> (right). The hierarchical average bandwidth is shown in dashed lines in the graph; the ideal no-contention bandwidth and maximum achievable performance are in solid black lines.

The MP<sub>128</sub>Spatz<sub>8</sub> shows a higher hierarchical average bandwidth compared to the MP<sub>64</sub>Spatz<sub>4</sub> testbed cluster, consistent with the estimation in section II-B. By implementing the TCDM Burst Access with the *GF2* configuration, the hierarchical average bandwidth is improved by 90%, reaching a utilization of 20.8%. The testbed with *GF2* shows performance improvements of 80% and 47% in the DotP and FFT kernels compared to the baseline MP<sub>128</sub>Spatz<sub>8</sub> cluster. The larger cluster scale requires a larger MatMul problem size to remain in the compute-bound region. Because of this, a 128 × 128 × 128 MatMul kernel achieves a 62% performance improvement, while a 256 × 256 × 256 MatMul kernel moves into the compute-bound region, obtaining a 12% improvement and over 90% FPU utilization.

## V. PHYSICAL IMPLEMENTATION

In this section, we analyze the power, performance, and area (PPA) of the Place-and-Route (PnR) implementation of our design.

We synthesize and place-and-route (Synopsys Fusion Compiler 2022.03) the testbed clusters with the TCDM Burst Access mechanism in GlobalFoundries' 12nm LP-PLUS FinFET technology. Both MP<sub>4</sub>Spatz<sub>4</sub> and MP<sub>64</sub>Spatz<sub>4</sub> are targeted to run at 770 MHz, while MP<sub>128</sub>Spatz<sub>8</sub> targets 634 MHz under worst-case conditions (SS/0.72 V/125 °C), with no frequency degradation compared to the original testbed. Power estimations are obtained using Synopsys PrimeTime 2022.03 under nominal conditions (TT/0.80 V/25 °C) at 910 MHz and 875 MHz, with switching activities extracted from post-PnR gate-level simulations.

Fig. 4. Annotated placed-and-routed layout: Group- and Tile-level views of the GF4 design on the MP<sub>64</sub>Spatz<sub>4</sub> cluster.
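The clock frequencies above link the FPU utilization of a kernel to its throughput in GFLOPS. The following sketch shows that conversion, assuming two FLOPs per FPU per cycle (one fused multiply-add); this factor is our assumption, chosen because it appears consistent with the Table II rows for the 16- and 256-FPU clusters, and the function name is illustrative.

```python
FLOPS_PER_FPU_PER_CYCLE = 2  # assumption: one fused multiply-add per cycle


def gflops(fpu_utilization: float, n_fpus: int, freq_mhz: float) -> float:
    """Throughput in GFLOPS from FPU utilization and clock frequency."""
    flops_per_cycle = fpu_utilization * n_fpus * FLOPS_PER_FPU_PER_CYCLE
    return flops_per_cycle * freq_mhz / 1000.0


if __name__ == "__main__":
    # Baseline DotP on the 16-FPU cluster: 18.88% FPU utilization.
    print(f"{gflops(0.1888, 16, 770):.2f} GFLOPS at 770 MHz (worst case)")
    print(f"{gflops(0.1888, 16, 910):.2f} GFLOPS at 910 MHz (nominal)")
    # GF4 DotP on the 256-FPU cluster: 33.29% FPU utilization.
    print(f"{gflops(0.3329, 256, 770):.2f} GFLOPS at 770 MHz (worst case)")
```

Energy efficiency then follows as throughput divided by power at the same operating point.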

### A. Area Analysis and Breakdown

The post-PnR physical layout of the GF4 design on MP<sub>64</sub>Spatz<sub>4</sub> is shown in Fig. 4. Implementing the TCDM Burst Access results in less than 8% logic area increase in all three clusters. An area breakdown of MP<sub>64</sub>Spatz<sub>4</sub> with GF4 design is shown in the left part of Fig. 5, with total area increased by 4.5 MGE. The 35% area increase in the VLSU is primarily due to the enlarged ROB. The increased data width in the response channel leads to a 51% logic area increase in the interconnection network. The Burst Manager and the Burst Sender blocks contribute an additional 1.5 MGE of the logic area in total, occupied mainly by the FIFOs in the Burst Manager.

### B. Power Analysis and Breakdown

We measure the power consumption per kernel in nominal operating conditions as recorded in Table II across different testbed clusters. A power breakdown is shown in the right part of Fig. 5 on the GF4 design on MP<sub>64</sub>Spatz<sub>4</sub> testbed running 256 × 256 × 256 MatMul kernel. The increased power consumption in the VLSU, SPM banks and interconnection logic indicates a higher data transfer rate due to the increased hierarchical average bandwidth. Even in this compute-bound

TABLE II  
THE SUMMARY OF KERNEL PERFORMANCE AND ENERGY EFFICIENCY.

| Config                                                       | Kernel | Kernel Size | Arithmetic Intensity [FLOP/B] | FPU Utilization | Performance @ss_freq [GFLOPS] | Performance @tt_freq [GFLOPS] | Power @tt_freq [W] | En. Efficiency @tt_freq [GFLOPS/W] | En. Efficiency Comparison |
|--------------------------------------------------------------|--------|-------------|-------------------------------|-----------------|-------------------------------|-------------------------------|--------------------|------------------------------------|---------------------------|
| <b>MP<sub>4</sub>Spatz<sub>4</sub> Cluster<sup>1</sup></b>   |        |             |                               |                 |                               |                               |                    |                                    |                           |
| Baseline                                                     | dotp   | 4096        | 0.25                          | 18.88%          | 4.65                          | 5.50                          | 0.09               | 63.12                              | -                         |
| Baseline                                                     | fft    | 1x512       | 0.47                          | 30.71%          | 7.57                          | 8.94                          | 0.09               | 95.14                              | -                         |
| Baseline                                                     | matmul | 16x16x16    | 1.33                          | 47.06%          | 11.60                         | 13.70                         | 0.12               | 118.70                             | -                         |
| Baseline                                                     | matmul | 64x64x64    | 2.91                          | 94.97%          | 23.40                         | 27.66                         | 0.13               | 218.69                             | -                         |
| GF4                                                          | dotp   | 4096        | 0.25                          | 38.91%          | 9.59                          | 11.33                         | 0.12               | 91.82                              | +45.47%                   |
| GF4                                                          | fft    | 1x512       | 0.47                          | 42.72%          | 10.53                         | 12.44                         | 0.13               | 96.58                              | +1.52%                    |
| GF4                                                          | matmul | 16x16x16    | 1.33                          | 48.30%          | 11.90                         | 14.06                         | 0.12               | 113.28                             | -4.57%                    |
| GF4                                                          | matmul | 64x64x64    | 2.91                          | 94.95%          | 23.40                         | 27.65                         | 0.13               | 206.82                             | -5.43%                    |
| <b>MP<sub>64</sub>Spatz<sub>4</sub> Cluster<sup>1</sup></b>  |        |             |                               |                 |                               |                               |                    |                                    |                           |
| Baseline                                                     | dotp   | 65536       | 0.25                          | 12.06%          | 47.55                         | 56.19                         | 1.32               | 42.70                              | -                         |
| Baseline                                                     | fft    | 4x2048      | 0.37                          | 17.51%          | 69.03                         | 81.58                         | 1.30               | 62.95                              | -                         |
| Baseline                                                     | matmul | 64x64x64    | 1.52                          | 51.64%          | 203.59                        | 240.60                        | 1.45               | 166.05                             | -                         |
| Baseline                                                     | matmul | 256x256x256 | 3.12                          | 94.58%          | 372.87                        | 440.67                        | 1.77               | 248.40                             | -                         |
| GF4                                                          | dotp   | 65536       | 0.25                          | 33.29%          | 131.24                        | 155.10                        | 1.91               | 81.12                              | +89.99%                   |
| GF4                                                          | fft    | 4x2048      | 0.37                          | 28.70%          | 113.15                        | 133.72                        | 1.76               | 75.80                              | +20.42%                   |
| GF4                                                          | matmul | 64x64x64    | 1.52                          | 69.75%          | 274.98                        | 324.98                        | 1.84               | 176.62                             | +6.37%                    |
| GF4                                                          | matmul | 256x256x256 | 3.12                          | 96.93%          | 382.14                        | 451.62                        | 1.97               | 229.01                             | -7.81%                    |
| <b>MP<sub>128</sub>Spatz<sub>8</sub> Cluster<sup>2</sup></b> |        |             |                               |                 |                               |                               |                    |                                    |                           |
| Baseline                                                     | dotp   | 131072      | 0.25                          | 5.49%           | 71.28                         | 98.38                         | 4.24               | 23.20                              | -                         |
| Baseline                                                     | fft    | 4096x8      | 0.42                          | 7.87%           | 102.19                        | 141.03                        | 4.03               | 34.98                              | -                         |
| Baseline                                                     | matmul | 128x128x128 | 1.73                          | 29.56%          | 383.82                        | 529.72                        | 7.30               | 72.52                              | -                         |
| Baseline                                                     | matmul | 256x256x256 | 3.46                          | 80.57%          | 1046.15                       | 1443.81                       | 7.78               | 185.68                             | -                         |
| GF2                                                          | dotp   | 131072      | 0.25                          | 9.85%           | 127.90                        | 176.51                        | 5.41               | 32.64                              | +40.67%                   |
| GF2                                                          | fft    | 4096x8      | 0.42                          | 11.32%          | 146.98                        | 202.85                        | 4.62               | 43.87                              | +25.42%                   |
| GF2                                                          | matmul | 128x128x128 | 1.73                          | 47.86%          | 621.43                        | 857.65                        | 8.14               | 105.40                             | +45.34%                   |
| GF2                                                          | matmul | 256x256x256 | 3.46                          | 90.09%          | 1169.76                       | 1614.41                       | 8.91               | 181.15                             | -2.44%                    |

<sup>1</sup> In MP<sub>4</sub>Spatz<sub>4</sub> and MP<sub>64</sub>Spatz<sub>4</sub>, ss\_freq = 770 MHz, tt\_freq = 910 MHz

<sup>2</sup> In MP<sub>128</sub>Spatz<sub>8</sub>, ss\_freq = 634 MHz, tt\_freq = 875 MHz
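The two footnotes give the clock frequency at each corner, so the two throughput columns of a row should differ only by the frequency ratio. The following is a minimal sketch of that consistency check. It assumes that the first throughput column is taken at the ss corner and the second at the tt corner (the column headers lie outside this excerpt, so this mapping is an assumption), and the helper name `scale_to_tt` is hypothetical. All values are copied from the rows above.

```python
# Corner frequencies in MHz, taken from footnotes 1 and 2.
CORNER_FREQ_MHZ = {
    "MP64Spatz4": {"ss": 770.0, "tt": 910.0},
    "MP128Spatz8": {"ss": 634.0, "tt": 875.0},
}


def scale_to_tt(cluster: str, perf_ss: float) -> float:
    """Rescale throughput from the ss to the tt corner.

    Assumes the kernel takes the same number of cycles at both corners,
    so throughput scales linearly with clock frequency.
    """
    freq = CORNER_FREQ_MHZ[cluster]
    return perf_ss * freq["tt"] / freq["ss"]


# (cluster, assumed ss-column value, assumed tt-column value) from the table.
ROWS = [
    ("MP64Spatz4", 47.55, 56.19),       # Baseline, dotp 65536
    ("MP64Spatz4", 382.14, 451.62),     # GF4, matmul 256x256x256
    ("MP128Spatz8", 71.28, 98.38),      # Baseline, dotp 131072
    ("MP128Spatz8", 1169.76, 1614.41),  # GF2, matmul 256x256x256
]

for cluster, perf_ss, perf_tt in ROWS:
    predicted = scale_to_tt(cluster, perf_ss)
    print(f"{cluster}: predicted {predicted:.2f}, tabulated {perf_tt:.2f}")
```

For the sampled rows, the rescaled values agree with the tabulated ones to within rounding, which is consistent with the assumed column mapping.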



Fig. 5. Area (left) and power (right) breakdown for the MemPool<sub>64</sub>Spatz<sub>4</sub> clusters. Area and power are extracted in 12-nm technology at TT@910MHz while executing the MatMul kernel.

kernel, which cannot benefit from higher bandwidth, we only observe a small reduction (less than 8% on average) in energy efficiency.

The memory-bound kernels show higher power consumption because of the higher FPU utilization. Compared to the baseline, these kernels achieve a large energy efficiency gain, reaching up to 90% alongside a performance improvement of 176%, across the different scales of testbed cluster.
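The headline figures can be recomputed from the MP<sub>64</sub>Spatz<sub>4</sub> dotp rows of the table. The sketch below is illustrative only: it assumes the first throughput column and the energy-efficiency column are the relevant inputs, and the helper name `relative_gain` is hypothetical.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Return the relative improvement over the baseline, in percent."""
    return (improved / baseline - 1.0) * 100.0


# MP64Spatz4, dotp 65536: Baseline row vs. GF4 row, copied from the table.
baseline = {"perf": 47.55, "energy_eff": 42.70}
burst_gf4 = {"perf": 131.24, "energy_eff": 81.12}

speedup = burst_gf4["perf"] / baseline["perf"]
perf_gain = relative_gain(baseline["perf"], burst_gf4["perf"])
eff_gain = relative_gain(baseline["energy_eff"], burst_gf4["energy_eff"])

print(f"speedup: {speedup:.2f}x, performance gain: {perf_gain:.0f}%")
print(f"energy-efficiency gain: {eff_gain:.1f}%")
```

Because the inputs are the rounded table entries, the recomputed efficiency gain differs from the tabulated +89.99% by about 0.01 percentage points.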

## VI. CONCLUSION

In this paper, we presented TCDM Burst Access, a software-transparent burst transaction architecture enhancement that improves bandwidth utilization in many-core vector clusters with tightly coupled L1 memory. By sending 32 b narrow burst requests through the Burst Sender, adapting them to the SPM banks via the Burst Manager, and adding a parametrizable data width on the response channel, TCDM Burst Access significantly enhances bandwidth utilization while maintaining scalability across different cluster scales. We evaluated our design by implementing it in three sizes of the MemPool-Spatz architecture, validated in an advanced 12-nm technology node. Our design improved MatMul kernel performance by up to 62%, fully pushing the kernels into the memory-bound region with less than an 8% increase in area. Additionally, it achieved up to **2.76x** performance and **1.9x** energy efficiency improvements on memory-bound kernels compared to the baseline testbed clusters.

## ACKNOWLEDGMENT

This work is funded in part by the COREnext project supported by the EU Horizon Europe research and innovation programme under grant agreement No. 101092598.

## REFERENCES

- [1] J. Sevilla, L. Heim, A. Ho, T. Besiroglu, M. Hobbahn, and P. Villalobos, “Compute trends across three eras of machine learning,” in *Proceedings of the International Joint Conference on Neural Networks*, vol. 2022-July, 2022.
- [2] S. Minaee, T. Mikolov, N. Nikzad, M. Chenaghlu, R. Socher, X. Amatriain, and J. Gao, “Large language models: A survey,” 2024. [Online]. Available: <https://arxiv.org/abs/2402.06196>
- [3] J. Vasiljevic and D. Capalija, “Blackhole & TT-Metalium: The standalone AI computer and its programming model,” in *2024 IEEE Hot Chips 36 Symposium (HCS)*, 2024, pp. 1–30.
- [4] D. R. Ditzel and the Esperanto team, “Accelerating ML recommendation with over 1,000 RISC-V/tensor processors on Esperanto’s ET-SoC-1 chip,” *IEEE Micro*, vol. 42, no. 3, pp. 31–38, 2022.
- [5] S. Matsuoka, “Fugaku and A64FX: the first exascale supercomputer and its innovative Arm CPU,” in *2021 Symposium on VLSI Circuits*, 2021, pp. 1–3.
- [6] B. D. de Dinechin, “Consolidating high-integrity, high-performance, and cyber-security functions on a manycore processor,” in *Proceedings of the 56th Annual Design Automation Conference 2019*, ser. DAC ’19. New York, NY, USA: Association for Computing Machinery, 2019. [Online]. Available: <https://doi.org/10.1145/3316781.3323473>
- [7] Y. Zhang, M. Bertuletti, S. Riedel, M. Cavalcante, A. Vanelli-Coralli, and L. Benini, “TeraPool-SDR: An 1.89TOPS 1024 RV-cores 4MiB shared-L1 cluster for next-generation open-source software-defined radios,” in *Proceedings of the Great Lakes Symposium on VLSI 2024*. ACM, Jun. 2024, pp. 86–91. [Online]. Available: <https://dl.acm.org/doi/10.1145/3649476.3658735>
- [8] S. Riedel, M. Cavalcante, R. Andri, and L. Benini, “MemPool: A scalable manycore architecture with a low-latency shared L1 memory,” *IEEE Transactions on Computers*, vol. 72, 2023.
- [9] T. Kim, J. Lim, J. Kim, W.-C. Cho, E.-Y. Chung, and H.-J. Lee, “Scalable bandwidth shaping scheme via adaptively managed parallel heaps in manycore-based network processors,” *ACM Trans. Des. Autom. Electron. Syst.*, vol. 22, no. 4, 2017. [Online]. Available: <https://doi.org/10.1145/3065926>
- [10] Y. Hu and T. Li, “Enabling efficient network service function chain deployment on heterogeneous server platform,” in *2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)*, 2018, pp. 27–39.
- [11] J. Wang, Z. Wang, and Y. Hu, “Towards an efficient SIMD virtual radio access network (vRAN) and edge cloud system,” *IEEE Transactions on Cloud Computing*, vol. 11, no. 3, pp. 3226–3238, 2023.
- [12] T. Fischer, M. Rogenmoser, M. Cavalcante, F. K. Gürkaynak, and L. Benini, “FlooNoC: A multi-Tb/s wide NoC for heterogeneous AXI4 traffic,” *IEEE Design and Test*, vol. 40, no. 6, pp. 7–17, 2023.
- [13] M. Cavalcante, D. Wüthrich, M. Perotti, S. Riedel, and L. Benini, “Spatz: A compact vector processing unit for high-performance and energy-efficient shared-L1 clusters,” in *IEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, ICCAD*. Institute of Electrical and Electronics Engineers Inc., Oct. 2022.
- [14] S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,” *Commun. ACM*, vol. 52, no. 4, pp. 65–76, Apr. 2009. [Online]. Available: <https://doi.org/10.1145/1498765.1498785>