

# Instruction Scheduling in the Saturn Vector Unit

Jerry Zhao, Daniel Grubb, Miles Rusch, Tianrui Wei, Kevin Anderson, Borivoje Nikolic, Krste Asanovic

Department of Electrical Engineering and Computer Science, The University of California, Berkeley

{jzh|dgrubb|miles.rusch|tianruiwei|kevinand|bora|krste}@berkeley.edu

**Abstract**—While the challenges and solutions for efficient execution of scalable vector ISAs on long-vector-length microarchitectures have been well established, not all of these solutions are suitable for short-vector-length implementations. This work proposes a novel microarchitecture for instruction sequencing in vector units with short architectural vector lengths. The proposed microarchitecture supports fine-granularity chaining, multi-issue out-of-order execution, zero dead time, and run-ahead memory accesses at low area and complexity cost. We present the Saturn Vector Unit, an RTL implementation of an RVV vector unit. With our instruction scheduling mechanism, Saturn exhibits comparable or superior power, performance, and area characteristics compared to state-of-the-art long-vector and short-vector implementations.

## I. INTRODUCTION

Scalable vector architectures have proven to be a robust instruction-set paradigm for executing modern data-parallel applications across a wide range of microarchitecture design points. The heritage of these vector architectures points back to the original Cray-style [29] vector machines, which featured very long multi-kilobit architectural vector lengths with many-lane microarchitectures. These “long-vector” machines continue to see widespread deployment in scientific computing and machine-learning scenarios, where application vector lengths are suitably long.

However, long-vector units are seldom deployed where short application vector lengths are more prevalent. Mobile SoCs require performant execution of audio, video, and DSP workloads that operate on diverse application vector lengths [18], [28], while many embedded applications rely on small matrix operations [17]. When executing these workloads, large, high-capacity register files provide little performance benefit, as much of the hardware vector length cannot be used.

Even for domains where the application vector lengths tend to be suitably long, power and area constraints further restrict the feasibility of long-vector microarchitectures with many-kilobyte SRAM-backed vector register files. For these reasons, commercial high-performance mobile and edge SoCs generally integrate compact packed-SIMD (P-SIMD) processors instead of long-vector machines. These P-SIMD extensions are typically less rich than modern scalable vector ISAs, but can be integrated efficiently due to their simplicity and compactness. Aggressive instruction-scheduling microarchitectures, along with precisely tuned kernels, are necessary for achieving high utilization in such systems.

We propose a short-vector microarchitecture to address the requirements for compact, efficient data-parallel processors executing modern scalable vector ISAs. Our vector microarchitecture supports all the architectural requirements of scalable vector ISAs while maintaining high utilization of SIMD functional units and memory bandwidth. Unlike prior vector microarchitectures, Saturn does not rely on costly capabilities like high-throughput instruction fetch, general out-of-order issue, register renaming, or long architectural vector lengths.

Saturn’s vector instruction sequencing microarchitecture supports fine-granularity vector chaining, instruction queuing, and limited multi-issue out-of-order execution with zero dead time. The sequencing mechanism requires minimal hardware resources to implement and fits neatly within a compact decoupled vector execution backend.

We develop and open-source Saturn, an RTL implementation of our microarchitecture implementing the full RISC-V vector extension 1.0 (RVV), a representative example of a modern scalable vector ISA. Saturn supports the complete RVV 1.0 ISA, including complex addressing modes, memory translation, and precise traps.

Using Saturn, we perform a simulation and VLSI-driven evaluation of the power, performance, and area characteristics of our proposed microarchitecture. We compare against baseline long-vector and short-vector microarchitectures. We further evaluate a range of design space parameters exposed by our microarchitecture, specifically around sensitivity to issue queue depth, chime length, memory latency, and application vector length.

We summarize our contributions below:

- Saturn, a complete RVV 1.0-compliant short-vector microarchitecture with a physical performance, power, and area evaluation.
- A vector instruction sequencing mechanism that maximizes datapath utilization with short vector lengths.
- A comparison of Saturn against baseline long-vector and short-vector designs.
- A discussion of the implications of critical design parameters in Saturn’s instruction scheduling microarchitecture.

Our evaluation demonstrates that combining **short hardware-vector-lengths**, **compact SIMD datapaths**, and **dynamic vector instruction scheduling** with a modern **scalable vector ISA** yields a performant, efficient, and programmable microarchitecture for extracting vectorized DLP.

## II. BACKGROUND

We discuss specific challenges of implementing modern scalable vector ISAs, compare the long-vector and short-vector archetypes, and assert the advantages of short-vector units.

### A. Scalable Vector Architectures

The growing complexity of modern data-parallel workloads has reinforced the importance of binary compatibility and performance portability across varied design points. This has prompted a movement away from traditional P-SIMD ISAs towards modern *scalable vector architectures* [9]. Chief among these are the ARM Scalable Vector Extensions (SVE) [31], M-Profile Vector Extensions (MVE) [5] and RISC-V Vector Extensions (RVV) [6].

While the machine vector length (MVL) is a static architectural constant in SIMD ISAs, in scalable vector ISAs the MVL is an implementation-defined parameter made discoverable to software. Applications written for a scalable vector ISA can be expressed as MVL-agnostic stripmine loops. This ensures binary compatibility with some expectation of performance portability across a wide range of potential implementations.
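To make the idea of an MVL-agnostic stripmine loop concrete, the following Python sketch models a `saxpy` kernel. It is an illustration only: the function name and the `mvl_elems` parameter are hypothetical stand-ins for the implementation-defined vector length that hardware would report to software (in RVV, the `vsetvli` instruction plays this role), and the inner `for` loop stands in for a single vector instruction.

```python
def stripmine_saxpy(a, x, y, mvl_elems):
    """Model of an MVL-agnostic stripmine loop computing a*x + y.

    mvl_elems is a hypothetical stand-in for the hardware's maximum
    vector length in elements; the loop body never hard-codes it.
    """
    n = len(x)
    out = list(y)
    i = 0
    while i < n:
        # The hardware grants at most MVL elements per iteration.
        vl = min(n - i, mvl_elems)
        # One "vector instruction" worth of work over vl elements.
        for j in range(i, i + vl):
            out[j] = a * x[j] + y[j]
        i += vl
    return out
```

The same loop yields the same result whether `mvl_elems` is 2 or 8; only the number of iterations changes, which is the sense in which the binary is portable across implementations.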

We identify three specific properties of modern scalable vector ISAs that substantially differentiate the backend microarchitecture of a vector implementation from that of a P-SIMD implementation.

1) *Variable Chime Lengths*: The *chime length* in a data-parallel microarchitecture can be defined as the ratio of the architectural register width (VLEN) to the datapath width (DLEN) [11]. One design challenge of modern vector ISAs is the imposition of variable dynamic chime lengths during program execution. This arises from the observation that the number of architectural registers needed by a vector kernel to avoid register spilling is highly varied. Simple, yet common kernels, like `memcpy` or `saxpy`, may only need a handful of registers, while complex or unrolled kernels may need tens of registers.

To maximize register file utilization and to reduce instruction fetch pressure across all these kernels, modern scalable vector ISAs have adopted the concept of *register grouping*. With register grouping, a single instruction can be specified to encode operations over a group of vector registers, effectively treating the group as a single longer vector register. RVV facilitates register grouping with a length multiplier (LMUL) control field, while ARM SVE and ARM MVE directly provide *multi-vector* instruction encodings.

Since software will frequently leverage register grouping to reduce dynamic instruction counts, implementations should not expose a performance overhead when executing register-grouped instructions, compared to an equivalent block of single-vector instructions. This effectively requires that implementations remain efficient at varying chime lengths, since register grouping directly scales the chime length.
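The scaling of chime length with register grouping can be sketched numerically. The helper below is a simplified model, not part of Saturn: it assumes full-length vectors, an integer LMUL, and that DLEN evenly divides VLEN. The function name and parameter names are chosen here for illustration.

```python
def chime_length(vlen_bits, dlen_bits, lmul=1):
    """Cycles a DLEN-wide datapath needs for one (grouped) vector instruction.

    Simplified model: full-length vectors, integer LMUL, DLEN divides VLEN.
    """
    if vlen_bits % dlen_bits != 0:
        raise ValueError("this sketch assumes DLEN divides VLEN")
    return lmul * (vlen_bits // dlen_bits)
```

With illustrative values VLEN=256 and DLEN=128, an ungrouped instruction has a chime of 2, while the same instruction at LMUL=8 has a chime of 16, so a single implementation has to remain efficient across that whole range.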

2) *Irregular Vector Memory Instructions*: To enable efficient execution of kernels where vector operand data is interleaved in memory, modern vector ISAs feature load and store instructions that directly de-interleave memory into vector registers. RVV provides “segmented” load and store instructions, while SVE provides “structure” loads and stores.

Since execution of these instructions should still fully utilize the available memory bandwidth, vector microarchitectures must perform a streaming transpose of memory data into vector registers. Although complex to implement, these “transpose” or “segmentation” operations should also not stall the pipeline or disallow chaining.

3) *Decoupled Implementation Support*: Packed-SIMD implementations have historically been designed as extensions within the existing scalar pipeline of general-purpose cores to reuse existing scalar functionality, while modern vector ISAs are designed to more easily support decoupled implementations. A rich set of vector instructions, especially the mask instructions, reduces the need for introducing scalar-vector data or control dependencies into vector loops.

RVV avoids overlaying the vector registers on scalar registers to simplify decoupling. SVE preserves the overlay, but the Streaming SVE [24] variant, which explicitly targets decoupled implementations, uses a different scalar register space than the conventional scalar floating-point register space.

Since modern vector ISAs require precise resumable traps and scalar-vector memory ordering, a modern vector microarchitecture exhibits characteristics of both tightly-integrated SIMD implementations and classic decoupled long-vector implementations. Specifically, precise exceptions must be reported ahead of commit, deep in the scalar pipeline. However, the execution backend could still be implemented as a decoupled post-commit unit without degrading performance.

### B. Long-Vector Machines

We characterize *long-vector machines* as those implementations bearing close resemblance to the original Cray vector units. Modern long-vector machines feature banked, lane-distributed vector register files, typically implemented as dense SRAM macros.

The dense register file in these designs enables very long machine vector lengths, spanning into the multi-kilobit range. The long vector lengths imply long chime lengths and deep temporal execution, which simplify hazard resolution and reduce the dead-time penalty. As long as the instruction sequencing mechanism determines an efficient execution plan that maximizes utilization and avoids data and structural hazards, low instruction throughput or dead-time stalls are amortized away by the long chime length.

Examples of modern decoupled long-vector implementations include Ara [25], Hwacha [20], Vitruvius [21], and the NEC-Aurora [4]. All these implementations feature lane-distributed SRAM-backed vector register files, and represent the state of the art for compute-maximizing long-vector microarchitectures.

To summarize, long-vector microarchitectures rely on **low-instruction-throughput sequencing of highly-efficient execution plans with long vector chimes** to maximize datapath utilization.

### C. Short Vector Machines

In contrast to long-vector machines, short-vector machines feature a unified multi-ported register file driving a unified cluster of SIMD functional units. At a high level, these machines are more similar in datapath structure and register-file organization to traditional packed-SIMD machines, but must still handle the complexities of executing a scalable vector ISA.

Shorter vector lengths also dictate the need to execute shorter chimes with high efficiency, precise chaining, and zero dead time. Notably, the cost of handling short chime lengths in vector implementations increases as the chime length approaches 1, since the required instruction throughput to saturate a vector functional unit is inversely proportional to chime length. Thus, short-vector implementations, with shorter chime lengths, must sequence instructions with higher throughput than an implementation with longer chime lengths.
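The inverse relationship can be written out explicitly. As a simplified model, assume each instruction occupies a functional unit for exactly one chime of $C$ cycles and that no stalls occur; the symbol $R_{\min}$ is introduced here for illustration and denotes the instruction rate needed to keep one functional unit saturated.

$$
R_{\min} = \frac{1}{C} = \frac{\text{DLEN}}{\text{LMUL} \times \text{VLEN}} \quad \text{instructions per cycle per functional unit}
$$

For example, with VLEN equal to twice DLEN and no register grouping, $C = 2$ and each functional unit needs a new instruction every other cycle. At $C = 1$ it needs one every cycle, which is the regime where sequencing throughput becomes the limiting factor.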

Additionally, portable vector workloads likely contain vector code that is suboptimally scheduled for the target vector implementation. To ensure efficient execution in spite of suboptimal code, short-vector implementations should support some degree of flexible out-of-order instruction scheduling, with chaining and simultaneous issues across the load, store, and execute paths. These combined requirements pose unique challenges for the sequencing microarchitecture of a short-vector implementation.

To summarize, unlike P-SIMD or long-vector microarchitectures, a short-vector design needs to maximize datapath utilization through **efficient, precise sequencing of short vector chimes without relying on high instruction throughput or complex instruction scheduling**.

### D. Why Short Vectors

We identify three key advantages of short-vector microarchitectures over long-vector microarchitectures.

**Many application domains feature short application vector lengths and/or narrow datatypes**, too short to fully utilize a vector register in a long-vector microarchitecture [28]. For instance, Khadem et al. [18] found that the performance of many mobile workloads exhibits sublinear, or even inverse, scaling with SIMD widths above 256 bits. Frison et al. [17] found that embedded numerical optimization fundamentally relies on small matrix sizes.

Such small problem sizes would map inefficiently to a long-vector microarchitecture with a long native hardware vector length, as much of the vector register storage would be unused. Critically for performance, long-vector microarchitectures designed around temporal execution over long vector lengths may be less efficient with short vector lengths. Conversely, short-vector microarchitectures seek to achieve high datapath utilization even with short application vector lengths using their precise sequencing mechanisms.

**Even with long application vector lengths, short-vector microarchitectures still present some inherent advantages over long-vector units.** A lower-capacity VRF has positive implications for area and power. Additionally, short hardware vector lengths do not imply low performance with long application vector lengths; a short-vector machine can be as performant as a long-vector machine, but with superior power and area characteristics. Notably, the register grouping feature in modern scalable vector ISAs directly provides a mechanism through which a short-vector microarchitecture can execute longer vector lengths with minimal area or power overhead.

**Short-vector microarchitectures share physical characteristics with existing P-SIMD datapaths commonly deployed in commercial SoCs**, reducing the barrier to deployment for such designs. While long-vector microarchitectures are not deployed in any commercial mobile or desktop SoCs, compact P-SIMD cores are pervasive, and have not been displaced in mobile or edge domains by specialized accelerators. For instance, the HVX P-SIMD extension and datapath in the Hexagon DSP processor is critical to compute-intensive tasks on all Snapdragon mobile SoCs [14], [15]. Tens of billions of ARM cores with NEON, SVE, or MVE extensions ship annually to meet this market [27].

Short-vector microarchitectures have the potential to provide the same necessary capability for efficient data-parallel compute in a compact and efficient physical footprint as these P-SIMD systems, but while executing a more future-proof and programmer-friendly scalable vector ISA.

## III. SHORT-VECTOR MICROARCHITECTURE



Fig. 1. Overview of the Saturn short-vector microarchitecture (gray) and its intended integration into an in-order core. The gradated regions indicate components relevant to the scheduling mechanism.



Fig. 2. Pipeline stages of Saturn when integrated into a host in-order RISC-V core. Hatched lines indicate queues between stages.

Figure 1 depicts how all three components of Saturn integrate into a minimal host in-order CPU. In this section we describe how these components can be microarchitected to support a modern vector ISA with short vector lengths. We provide only a cursory overview of these components here. A more detailed description is available at [saturn-vectors.org](http://saturn-vectors.org).

- The **Frontend** sits ahead of the commit point of the host CPU and implements checks for precise vector faults in a shallow pipeline.
- The **Load/Store Unit** receives post-commit vector instructions and issues requests into the memory system.
- The **Backend Datapath** includes the vector register file, instruction sequencers, and functional units.

### A. Frontend

The frontend precisely checks for the presence of faults in vector memory instructions and enables decoupling of the post-commit load-store and execute datapaths. Unlike P-SIMD microarchitectures, this vector frontend does not reuse the scalar load/store unit. The frontend merely checks for potential faults, while the actual address generation and memory issue occur in the separate post-commit load-store unit.

The frontend operates in two modes: *pipelined* or *iterative*.

1) *Pipelined*: The pipelined mode executes in lock-step with the scalar core's pipeline. This mode computes a conservative bound on the extent of the vector memory access from the base address provided by a scalar register, and the encoded vector register length and addressing mode. For unit-strided or constant-strided accesses, the bound computation is trivial and can be performed in the same stage as scalar address-generation.

Single-page contiguous accesses to a homogeneous physical memory region can be verified to be free of faults with a single TLB access. Such instructions can be safely committed and issued to the backend with the physical address without stalling any instructions. Multi-page contiguous accesses are cracked into single-page operations, while indexed or strided accesses are deferred to the iterative mode. In summary, the pipelined mode is designed to catch the common-case contiguous unit-stride accesses that must be issued to the load-store unit with high throughput.
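The cracking of contiguous accesses at page boundaries can be sketched as follows. This is a minimal model under stated assumptions: a 4 KiB page size (not specified in the text), byte-granularity addresses, and a hypothetical function name. Translation, permission checks, and the check for a homogeneous physical region are not modeled.

```python
PAGE_BYTES = 4096  # assumed page size; not specified in the text


def crack_by_page(base, nbytes, page_bytes=PAGE_BYTES):
    """Split a contiguous access [base, base + nbytes) into single-page pieces.

    Returns a list of (address, length) pairs, each within one page.
    """
    pieces = []
    addr = base
    end = base + nbytes
    while addr < end:
        # First byte of the page following the one that contains addr.
        next_page = (addr // page_bytes + 1) * page_bytes
        piece_end = min(end, next_page)
        pieces.append((addr, piece_end - addr))
        addr = piece_end
    return pieces
```

An access that yields a single piece corresponds to the common case that needs only one TLB lookup; an access that straddles a boundary yields one piece per page touched.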

2) *Iterative*: This mode prevents the scalar core from retiring instructions younger than the currently processed vector instruction, giving cycles for an iterative state machine to step through the instruction element-wise, fetching indices from the backend, and precisely computing the accessed address for each element. While this procedure imposes a very high performance overhead, it is used only to support high-extent indexed memory operations which would fundamentally be performance-limited by a single-ported TLB, TLB reach, and page-table walks.

### B. Load-Store Unit

The vector load-store unit is microarchitecturally similar to per-lane load-store paths in long-vector implementations and load-store paths in decoupled SIMD microarchitectures. In Saturn, this unit is implemented like the "access processor" in the decoupled-access-execute (DAE) paradigm [30] instead of an integrated memory pipeline.

Following the DAE paradigm allows exploiting memory-level parallelism by adjusting the size of low-cost decoupling queues without further costly implications on the backend datapath and control components. Figure 3 depicts the independent load and store paths within the load-store unit. We provide only a broad overview of the rest of the load-store unit components here.

Fig. 3. The vector load and store paths handle variable-chime and long-latency memory operations with minimal storage requirements. The load path depicts a long-latency load in the Agen unit running ahead of a long-chime load in the Merge and SegBuf units. The store path depicts a sequence of short-chime stores in the SegBuf, Merge, and Agen units.

Load and store in-flight queues track the base and extent of in-flight or pending vector memory operations. Vector instructions are tracked in the in-flight queues without requiring cracking, minimizing the structural cost of long-chime instructions. Address CAMs across each queue are used to enforce scalar-vector and vector-vector memory disambiguation.

Separate load and store paths enable simultaneous issue of loads and stores. Both paths can be designed as pipelined streams of latency-insensitive units. Each path processes memory requests and packets in order, and is driven by pointers into the circular in-flight-load and in-flight-store queues.

Merge units correct for misaligned memory accesses and convert between sparse memory packets and dense vector register rows. Segment buffers provide streaming, high-throughput reformatting of segmented fields into vector registers.

The load path is decoupled from the rest of the backend, enabling run-ahead load address generation to exploit memory-level parallelism in high-latency memory systems. Similarly, the store path runs behind the backend.

### C. Backend Datapath

The backend datapath comprises the issue queues, instruction sequencers, unified vector register file (VRF), and SIMD execution units. The issue queues buffer vector instructions and feed the sequencers, which crack vector instructions into single-cycle micro-ops of execution. Upon issue, vector micro-ops read their operands from the VRF before proceeding through the SIMD execution units.

The backend is organized around "element-groups" as the base unit of compute, where an element-group is a DLEN-wide segment of a vector register. The width of each VRF bank, the width of the register access crossbars, and the width of the SIMD functional units are all DLEN. Thus, an element-group represents a packet of register file data that can be consumed or produced in one cycle.

Fig. 4. The backend organization for a configuration with two arithmetic sequencers, separate load/store sequencers, 4-entry instruction queues, and a 4x3R1W vector register file. Gradated regions indicate the out-of-order execution window.

The VRF is implemented as a banked multi-ported flip-flop array. As shown in Figure 4, read and write crossbars arbitrate for access to the VRF ports across the banks. The vector register file is striped across the banks by element group; neighboring element groups reside in consecutive banks.
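One way to picture the striping is as a round-robin mapping from element-group index to bank. The sketch below is an assumption consistent with the description above (neighboring element groups land in consecutive banks); the modulo mapping, function name, and parameters are illustrative and may not match the Saturn RTL.

```python
def eg_location(vreg, eg_in_reg, egs_per_reg, n_banks):
    """Map an element group to (bank, row) under round-robin striping.

    Assumed mapping for illustration: flat element-group index modulo
    the number of banks selects the bank.
    """
    eg_index = vreg * egs_per_reg + eg_in_reg  # flat element-group index
    return eg_index % n_banks, eg_index // n_banks
```

Under this mapping, an instruction stepping through consecutive element groups visits a different bank on successive cycles whenever there is more than one bank.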

The sequencers each execute independently, enabling dynamic “slip” across the load, store, and arithmetic paths. This yields limited out-of-order execution of vector instructions. Combined with the decoupling queues and fine-grained hazard tracking, this sequencer structure can tolerate many suboptimal code patterns through dynamic sequencing.

The entire backend pipeline can be implemented with very few pipeline stages, improving its suitability for compact implementations. Aggressive implementations can bypass directly from the dispatch queue into the sequencers. Our hazard checking also requires minimal logic depth and can be implemented in the same stage as register-read. Thus, an instruction can proceed from the dispatch queue to vector-register-write-back in as few as three cycles.

## IV. INSTRUCTION SCHEDULING IN SHORT-VECTOR UNITS

Saturn’s instruction scheduling mechanism performs **vector sequencing** with **explicit chaining**. We first motivate these design decisions before discussing the details of the scheduling microarchitecture.

### A. Cracked vs Sequenced Scheduling



Fig. 5. A comparison of instruction cracking vs sequencing for a block of vector instructions A/B/C/D. The cracking approach will stall dispatch without deep issue queues.

One approach to schedule vector instructions is to crack them into single-cycle micro-ops early in the pipeline, ahead of where hazard resolution occurs at issue select. This approach is especially tempting as a means to handle register-grouped instructions by cracking them into single-vector-register instructions. Although this approach reduces hazard resolution complexity, it tends to stall dispatch without deep issue queues, as shown in Figure 5.

Vector sequencing cracks vector instructions into micro-ops behind the issue queues. Since issue queue entries may encode many cycles of work accessing many vector elements or registers, managing hazards across the issue queues is more costly. However, late sequencing reduces pressure on issue queue depth.

### B. Explicit vs Implicit Chaining

Implicit-chaining microarchitectures rely on regular vector access patterns to effect chaining. This approach builds on the observation that if the source and sink instructions both write and read their operands at the same fixed rate, the sink instruction can *always* schedule the next micro-op after a source instruction writes back *any* element.

While this approach is attractive for its low cost and ease of implementation, it struggles to handle irregular vector instructions that produce or consume operands at irregular rates. It additionally cannot elegantly handle variable-latency memory systems without global stalls. In contrast, explicitly chained vector units precisely track the availability of operands at sub-register granularity. Explicit chaining microarchitectures opportunistically perform chaining in response to dynamic behaviors.

### C. Vector Instruction Scheduling

The instruction window for out-of-order execution in Saturn includes the per-sequencer issue queues and the instructions within the sequencers themselves. Thus, the scheduling mechanism must resolve potential data hazards across all the issue queues, sequencing instructions, and sequenced micro-ops. We define this set of instructions and micro-ops as the *OoO window*, as shown in Figure 4.

To resolve structural hazards on VRF ports and the execution units, the microarchitecture implements read and write port arbitration across all the sequencers. The sequencers advertise the requested read and write addresses for the next micro-op to the arbiters. Bank or port conflicts induce a structural hazard and stall the sequencing of the younger instruction. For data hazards, the sequencing mechanism should track data dependencies at element group granularity and support chaining across RAW, WAW, and WAR conditions.

Resolving data hazards using a traditional CDC 6600-like scoreboard [32] is a poor fit for an explicit chaining system where in-order register read and write-back are not enforced. We propose an augmented scoreboard scheme that tracks pending read/write scoreboards (PRSb/PWSb) for each instruction in the issue window at element-group, rather than register, granularity. We divide our discussion into a section on the microarchitectural state required for our algorithm, followed by a discussion of the algorithm itself.

1) *Tracking Data Hazards:* Our approach tracks the PRSb and PWSb scoreboards for all instructions within the OoO instruction window. The bit-width of each is the total number of element groups in the VRF, or $(\text{VLEN} \times \#\text{registers})/\text{DLEN}$.

In the general case, this would be an infeasibly costly state to represent across all the necessary components in the OoO window. However, for short-vector microarchitectures with shallow pipelines and issue-queues, we demonstrate that the structural cost to represent the scoreboards is minimal, or already provided as part of existing state in the backend. The only additional state is a unique tag to enable age disambiguation between instructions in the OoO window. This tag is assigned when an instruction enters the OoO window and freed when the instruction completes sequencing.

TABLE I

PER-INSTRUCTION SCOREBOARD TABLE, FOR A MACHINE WITH 4 VECTOR REGISTERS AND VLEN = 2 × DLEN

| Instruction       | ID/Age | PRSb                | PWSb                |
|-------------------|--------|---------------------|---------------------|
| vadd.2 v0, v0, v2 | 0      | 8'b00000000         | 8'b00001 <u>100</u> |
| vle.2 v2          | 1      | 8'b00000000         | 8'b11 <u>100000</u> |
| vadd.2 v0, v0, v2 | 2      | 8'b11 <u>111111</u> | 8'b00001111         |
| vle.2 v2          | 3      | 8'b00000000         | 8'b1111 <u>0000</u> |
| vadd.2 v0, v0, v2 | 4      | 8'b11111111         | 8'b00001111         |



Fig. 6. A diagram depicting how Saturn implements the PRSb and PWSb scoreboards for instructions in Table I. At the next clock edge, the load sequencer will sequence a load into register v1[0], while the arithmetic sequencer will sequence reads from v0[0] and v2[0] to effect a pipelined write to v0[0]. Gray sequencer scoreboard elements will be cleared on the next clock edge.

Table I depicts the PRSb and PWSb scoreboards for a simple register-grouped loop with RAW, WAW, and WAR hazards, where all these instructions are currently in the OoO window. In this example, the machine has four architectural registers, each comprising two element groups. The code sequence in this example leverages register grouping on pairs of vector registers. Instructions 0, 1, and 2 are currently in flight, while instructions 3 and 4 are resident in the issue queues. On the next clock edge, instructions 0 and 1 will perform writes, while instruction 2 will perform a read, clearing the underlined bits in Table I.

Within the issue queues, the PRSb and PWSb need only to be known coarsely since no reads and writes have been performed at this stage. Thus, they can be derived from the existing operand specifier and register-grouping control signals, which are already stored in the issue queues. Figure 6 depicts how instructions 3 and 4 in the example from Table I are resident in the issue queues and require no additional state beyond operand specifiers and an age tag.
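The derivation of coarse scoreboards from operand specifiers can be sketched as below. This is an illustrative model, not Saturn's logic: it assumes bit `i` of a scoreboard corresponds to element group `i` counting up from `v0[0]` (which matches the bit order of the issue-queue rows in Table I), that all operands share the same LMUL, and that mask and widening operands are ignored. The function names are hypothetical.

```python
def group_mask(vreg, lmul, egs_per_reg):
    """Bit-mask covering every element group of a register group."""
    n_bits = lmul * egs_per_reg
    return ((1 << n_bits) - 1) << (vreg * egs_per_reg)


def coarse_scoreboards(dest, sources, lmul, egs_per_reg):
    """Return (PRSb, PWSb) for an instruction that has not begun sequencing."""
    prsb = 0
    for src in sources:
        prsb |= group_mask(src, lmul, egs_per_reg)
    pwsb = group_mask(dest, lmul, egs_per_reg)
    return prsb, pwsb
```

For the machine in Table I (two element groups per register, LMUL of 2), `coarse_scoreboards(2, [], 2, 2)` reproduces the row for instruction 3 and `coarse_scoreboards(0, [0, 2], 2, 2)` reproduces the row for instruction 4.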

Within the SIMD functional units, only the PWSb needs to be known, as register reads occur immediately after sequencing. Since each micro-op in each pipeline stage in the SIMD functional units already needs to track a single destination register element group, the PWSb is trivially derived. Figure 6 depicts how instruction 0 is nearing completion in the arithmetic pipeline and requires no additional state to track the pending write hazard.



Fig. 7. Diagram of a vector instruction sequencer. The sequencer tracks precise PRSb and PWSb bit-vectors, updating them when it determines the next micro-op is free-of-hazards.

The only components in the microarchitecture that maintain precise element-group-granularity PRSb/PWSb state are the sequencers. These scoreboards are aggressively cleared as micro-ops are sequenced. Since the number of sequencers in Saturn is small (only 3 for load/store/arithmetic), the overhead of this state is tolerable. Figure 6 depicts how the load and arithmetic sequencers track per-operand scoreboards for the currently-sequenced instructions 1 and 2. Figure 7 depicts how the microarchitecture of the sequencer updates its internal PRSb and PWSb bit-vectors.

2) *Sequencing Algorithm:* At the sequencing stage, each sequencer must validate that the register reads and writes effected by the next micro-op are free of hazards against any older instructions or micro-ops within the OoO instruction window. To do this, the microarchitecture must perform age disambiguation across the instructions within the OoO window to determine hazards from older instructions. The PRSb/PWSbs from the older instructions can be bitwise OR'd together to form per-sequencer PRSb and PWSbs.

RAW, WAW, and WAR hazards can all be checked precisely using the element-group specifiers of the reads and writes of the next micro-op the sequencer will issue. The presence of a hazard stalls sequencing. While the broadcasts of the PRSb and PWSbs may seem to be prohibitively expensive, several key properties enable optimizations that reduce the fan-in into each sequencer.

- **Issue-queues are in-order**, implying that the sequenced instruction will always be older than any other instructions from its parent issue queue. Only adjacent issue queues, sequencers, and pipelined functional units need to be checked for pending reads and writes.
- **The number of pipelined functional units is low** since the load path is effectively a zero-deep pipeline. Thus, the cost of interlocks on pending writes resident in functional units is limited.
- **The OoO instruction window contains few instructions** across the shallow issue queues. Unlike in scalar microarchitectures, where speculative execution requires a very deep instruction window, OoO execution here is used only to improve load-balancing across datapaths.
- **Sequenced micro-ops will always be the oldest write to any element group**, eliminating the need to compare age tags against micro-ops in-flight in the functional units.

Once a micro-op is cleared of hazards, it can be issued to register-read and the functional units. Micro-ops issued by the sequencer are fire-and-forget, proceeding irrevocably through the pipelined functional units.

If the issued micro-op guarantees that the currently sequenced instruction will no longer read or write that element group again, as is the case for regular vector instructions, the sequencer can clear the accessed bit in its PRSb and/or PWSb, as shown in Figure 7. This enables cycle-granularity chaining across instructions with regular access patterns into the vector register file. Conversely, irregular vector instructions that do not read or write their operands in a static order can avoid clearing the sequencer scoreboards, disabling chaining from these instructions.
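The sequencing check described above can be sketched in software. The following Python model is only an illustration under simplifying assumptions: scoreboards are plain integers used as bit-vectors with one bit per element group, age disambiguation has already selected the scoreboards of older instructions, and the names `MicroOp`, `Sequencer`, `try_issue`, and `mask` are hypothetical rather than taken from Saturn's RTL.

```python
from dataclasses import dataclass, field
from typing import List, Optional


def mask(element_groups: List[int]) -> int:
    """Build a bit-vector with one bit set per element-group index."""
    m = 0
    for eg in element_groups:
        m |= 1 << eg
    return m


@dataclass
class MicroOp:
    reads: List[int] = field(default_factory=list)  # element groups read
    write: Optional[int] = None                     # element group written, if any


@dataclass
class Sequencer:
    prsb: int             # pending reads of the instruction being sequenced
    pwsb: int             # pending writes of the instruction being sequenced
    regular: bool = True  # False: access order is not static, so never clear early

    def try_issue(self, uop: MicroOp,
                  older_prsbs: List[int], older_pwsbs: List[int]) -> bool:
        # OR together the scoreboards of every older instruction in the window.
        older_prsb = 0
        for sb in older_prsbs:
            older_prsb |= sb
        older_pwsb = 0
        for sb in older_pwsbs:
            older_pwsb |= sb

        read_mask = mask(uop.reads)
        write_mask = mask([uop.write]) if uop.write is not None else 0

        raw = read_mask & older_pwsb   # read of a group an older op still writes
        waw = write_mask & older_pwsb  # write of a group an older op still writes
        war = write_mask & older_prsb  # write of a group an older op still reads
        if raw or waw or war:
            return False               # hazard: stall sequencing this cycle

        if self.regular:
            # The instruction will not touch these groups again, so the
            # sequencer-level bits can be cleared to allow chaining.
            self.prsb &= ~read_mask
            self.pwsb &= ~write_mask
        return True
```

In this model, a regular instruction releases one element group per issued micro-op, which is what lets a younger instruction chain on it in a later cycle, while an instruction constructed with `regular=False` keeps its scoreboards set until it completes.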

## V. IMPLEMENTATION

Saturn is a Chisel RTL implementation of the complete RISC-V vector extension 1.0, supporting all application-profile features, including memory translation, precise traps, and irregular addressing modes. Saturn is deeply parameterized and can target a wide range of short-vector design points, from small element-wise area-minimal vector units to wider, more performant, yet still compact DSP-core-like implementations, by varying the VLEN and DLEN parameters.

Saturn targets integration with a RISC-V dual-issue 6-stage host core, interfacing with the host core's existing TLB, CSR-file, and scalar load-store path. Precise traps, virtual memory support, and the scalar-vector sequential memory ordering are all provided. Additionally, parallel execution of scalar and vector memory operations is supported by providing a path for the scalar load-store path to access the vector load/store CAMs in parallel with the L1 cache access. Memory ordering violations are caught before the commit stage of the host core.

The vector memory system bypasses the scalar L1 data cache, instead accessing a shared last-level-cache. This enables efficient execution of kernels where scalar loads stream through elements resident in the L1, while vector loads and stores use a dedicated high-bandwidth path to shared system memory.

## VI. EVALUATION

We evaluate the performance, power, and area of Saturn in comparison to existing long-vector and short-vector implementations. For the area and frequency evaluations we use Cadence VLSI tools with a commercial 16nm process targeting 800 MHz at the SS corner.

### A. Performance

We compare the utilization of Saturn across a range of benchmarks. We report utilization as the main performance metric since all compared implementations share the same datapath throughput and memory bandwidth (both DLEN). Utilization provides a portable and interpretable metric for understanding to what degree the various designs can saturate the fundamental structural resources of each machine.

Sufficiently large application sizes are chosen such that the long-vector microarchitectures are not penalized while keeping the working sets resident in the LLC and TLB. The routines were implemented using C intrinsics or vector assembly, with the LMUL register grouping factor selected to maximize vector length while avoiding vector register spilling. Table II depicts the problem sizes, datatypes, and LMUL for each benchmark.

For all comparison points, we use the same host dual-issue 6-stage core and cache-based memory systems. This core can overlap RVV's `vsetvl` instructions alongside vector instructions, supporting 1 IPC issue into the vector unit for ideally scheduled code. The outer memory system is configured with 4 banks of last-level cache, providing bandwidth of 256 bits/cycle and a total capacity of 512 KB, sufficient to hold the working set for all workloads after warm-up. Access time to the cache is 4 cycles, but realistically degrades under load.

We evaluate a VLEN = 512, DLEN = 256 configuration of Saturn as **SV-Full**. All comparison points described below, whether short-vector or long-vector, are configured with matching DLEN = 256.

TABLE II  
WORKLOAD CONFIGURATIONS

|               | Benchmark  | Problem Size  | Datatype | LMUL |
|---------------|------------|---------------|----------|------|
| high-reuse    | conv3d     | 112x112x7x7x3 | F64      | 2    |
|               | conv2d     | 112x112x7x7   | F64      | 2    |
|               | jacobi2d   | 130x130       | F64      | 4    |
|               | sepconv    | 119x119x3x3   | F32      | 4    |
|               | gemm       | 87x87         | F32      | 4    |
| no-reuse      | cos        | 1024          | F32      | 4    |
|               | exp        | 1024          | F32      | 4    |
|               | axpy       | 30720         | F64      | 8    |
|               | gemv       | 128x128       | F32      | 8    |
| non elem.wise | pathfinder | 64x1024       | I32      | 8    |
|               | spmv       | 128x128 60%   | F32      | 8    |
|               | fft2       | 1024          | F32      | 4    |
|               | transpose  | 180x180       | F32      | 1    |



Fig. 8. Utilization across a variety of kernels. Comparison points include short-vector (SV) designs evaluated using Saturn, long-vector (LV) designs evaluated using Saturn, and Ara evaluated using its original RTL.

As a baseline, we configure a variant of Saturn without support for load-store decoupling and out-of-order issue. This **SV-Base** variant is comparable to the Spatz [12] microarchitecture, since Spatz serializes execution through its global controller. Among short-vector implementations, Spatz is the only comparable baseline, as it supports variable memory latency and does not require register renaming. We could not compare directly against Spatz’s implementation due to its partial support for RVV. We also compare to **SV-Base+DAE** and **SV-Base+OOO** variants, enabling the decoupled load-store and out-of-order issue features of Saturn, respectively.

For long-vector microarchitectures, we simulate **Ara**’s RTL directly [25] as a baseline implementation. We use a 4-lane configuration of Ara to match the DLEN of the SV design and reproduce the integration approach in the Ara+CVA6 system with our dual-issue host core.

We also attempt a comparison with Hwacha’s [20] scheduling mechanism by modelling Hwacha’s fundamental behavior with modifications to Saturn’s RTL. Hwacha uses a central 8-entry master sequencer where complex instructions occupy multiple entries. We evaluate this behavior for both short and long vector lengths, where **SV-Hwacha** uses VLEN = 512 and **LV-Hwacha** uses VLEN = 4096.

We additionally model a “full-fury” no-restraints long-vector microarchitecture by setting VLEN = 4096 without disabling any other features of Saturn. This **LV-Full** configuration represents a hypothetical unrealistic long-vector design point.

Figure 8 shows the utilizations across all the evaluation points. The **SV-Base** configuration, lacking the ability to exploit memory-level parallelism or to dynamically load-balance across issue paths, suffers in all evaluated workloads. While a decoupled load-store unit does improve performance in some simpler loops, like `axpy` or `gemv`, strict in-order sequencing results in frequent stalls in poorly load-balanced code. Likewise, relaxing in-order sequencing alone is insufficient, as a tightly coupled load-store path still induces frequent RAW stalls. The **SV-Hwacha** configuration also underperforms, especially in kernels with complex instructions that would occupy multiple Hwacha sequencer slots. Notably, **SV-Hwacha** suffers in the convolution kernels, where poorly scheduled loops greatly benefit when the microarchitecture can schedule across many inflight instructions. The **SV-Full** results demonstrate that combining a DAE architecture with highly dynamic instruction scheduling and many inflight instructions is necessary for achieving near-peak (> 90%) utilization across a wide range of kernels.

Compared to long-vector baselines, the **SV-Full** design exceeds the performance of the Ara implementation. Even though the **SV-Hwacha** configuration underperformed, the **LV-Hwacha** design can still achieve 99% utilization in some kernels, demonstrating that inefficient instruction scheduling can be offset by longer vector lengths. Still, the **LV-Hwacha** design underperforms compared to **SV-Full** in `fft2`, `spmv`, and `transpose`, demonstrating that long vector lengths cannot always compensate for scheduling inefficiencies. Unsurprisingly, **LV-Full** achieves the highest utilization in almost all benchmarks. Section VII-A presents further analysis on the implications of longer vector lengths in Saturn.

We compare Saturn’s physical implementation results to reported numbers for prior vector units. As seen in Table III, Saturn can achieve comparable frequencies to existing academic vector units. Critical paths are in the SIMD datapath, which is not unique to Saturn’s instruction scheduling. Given that performance is equivalent to (utilization × frequency) when the datapath throughputs are equivalent (DLEN is the same), we conclude that the **SV-Full** Saturn implementation achieves state-of-the-art performance among similarly configured academic vector units.
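The relationship between utilization, frequency, and delivered throughput can be made concrete with a short calculation. This sketch assumes that each 32-bit lane of the datapath completes one fused multiply-add (counted as 2 FLOPs) per cycle; that accounting, the function names, and the 0.9 utilization used in the usage note are illustrative assumptions, not figures from the evaluation.

```python
def peak_sgflops(dlen_bits: int, freq_mhz: float,
                 flops_per_lane_per_cycle: int = 2) -> float:
    """Peak single-precision GFLOPS of a DLEN-bit datapath.

    Assumes one 32-bit lane per 32 bits of DLEN and, by default,
    one FMA (2 FLOPs) per lane per cycle.
    """
    lanes = dlen_bits // 32
    return lanes * flops_per_lane_per_cycle * freq_mhz / 1000.0


def delivered_sgflops(utilization: float, dlen_bits: int,
                      freq_mhz: float) -> float:
    """Delivered throughput = utilization x peak throughput."""
    return utilization * peak_sgflops(dlen_bits, freq_mhz)
```

Under these assumptions, `peak_sgflops(256, 800)` evaluates to 12.8, and two designs with equal DLEN differ in `delivered_sgflops` only through the product of utilization and frequency, for example `delivered_sgflops(0.9, 256, 800)` versus `delivered_sgflops(0.9, 256, 950)`.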

TABLE III  
PHYSICAL COMPARISON OF OUR SHORT VECTORS MICROARCHITECTURE

|                         | Ours | Spatz <sub>2</sub> | Ara  | Hwacha | Vitruvius+ |
|-------------------------|------|--------------------|------|--------|------------|
| VLEN (bits)             | 512  | 256                | 4096 | 4096   | 16384      |
| DLEN (bits)             | 256  | 64                 | 256  | 256    | 512        |
| Process Node            | 16nm | 22nm               | 22nm | 16nm   | 22nm       |
| SS Freq. (MHz)          | 800  | 485                | 950  | 800    | 1200       |
| Area (kGE)              | 1235 | 170 <sup>1</sup>   | 2747 | 1118   | 6532       |
| Area (mm <sup>2</sup> ) | 0.41 | 0.14               | 0.55 | 0.38   | 1.3        |
| SGFLOPS                 | 12.5 | -                  | 15.2 | 12.5   | 18.6       |
| SGFLOPS/W (TT)          | 121  | -                  | 90   | 57     | 47         |

<sup>1</sup>Spatz<sub>2</sub> does not support floating-point or 64-bit operations.

### B. Area

Table III compares Saturn’s area against numbers reported in publications of prior academic vector units. These numbers capture only the area of the vector unit, ignoring any scalar core or memory-system components. For prior work, kGE and mm<sup>2</sup> numbers were extrapolated from published results and area breakdowns. kGE provides a process-normalized approximation of area across process nodes.

The results show that Saturn’s short-vector microarchitecture can yield a more area-efficient design than a few-lane implementation of a long-vector microarchitecture, suggesting better suitability for area-constrained deployment scenarios.



Fig. 9. Synthesized area breakdown across three configurations of Saturn alongside the breakdown of the host scalar core.



Fig. 10. Layout of the SV-Full V512D256 implementation of Saturn, at 60% density.

We additionally break down the area of the various components of Saturn across three machine configurations with varying VLEN and DLEN. Figure 9 depicts this breakdown alongside the area of the host dual-issue core, while Figure 10 depicts a layout of one of the configurations.

As expected, the significant component of Saturn which scales linearly with architectural vector lengths is the vector register file. The load-store unit and SIMD functional units, as the main datapath components, scale linearly with datapath width (DLEN). The key components of Saturn’s sequencing microarchitecture, the sequencers and issue queues, have consistently low area costs across all evaluated configurations. The moderate scaling observed in the sequencers with higher DLEN reflects the area overhead of the accumulation registers, which are within the sequencer’s module hierarchy. The area of the frontend module for providing precise traps is marginal, demonstrating that supporting precise-traps in a decoupled vector unit has low cost.

The area results and layout show that Saturn enables compact area-efficient implementations built around a wide SIMD datapath. The short-vector-specific components that enable efficient dynamic instruction scheduling are not costly compared to the necessary register file and SIMD functional units.

### C. Power

We estimate the average power consumption of Saturn running a small 64x64 SGEMM kernel in Figure 11. Power estimates are gathered post-synthesis at the TT corner and a clock frequency of 800 MHz, annotated with waveform switching activities and post-PnR routing parasitics.



Fig. 11. Power consumption across three configurations of Saturn running a 64x64 SGEMM.

The power consumption of the VRF and vector FP units scales linearly with datapath width while remaining roughly constant with greater VLEN. As noted in Table III, the **SV-Full** design point with VLEN = 512 and DLEN = 256 achieves an efficiency of 121 SGFLOPS/W. The power breakdown shows that the scalar core and caches consume a significant portion of each design’s power when executing short-vector-length code. Further optimizations to the scalar core, chiefly around reducing spurious ICache accesses within a low-IPC vector loop, could improve Saturn’s power efficiency even further.

## VII. DISCUSSION

In this section, we explore several of the key parameters of Saturn. We discuss and evaluate Saturn’s sensitivity to native chime length, issue queue depth, memory latency, and application vector length.

### A. Chime Length

The “native” chime length in Saturn, for instructions with no register grouping, is given by the VLEN:DLEN ratio. Increasing the native chime length reduces overall throughput pressure on many components in Saturn, including instruction fetch, load-store queue sizes, issue-queue sizes, and sequencer throughput.

However, a higher native chime length for a constant datapath linearly scales the size of the multi-ported register file. The interlock logic in Saturn scales linearly with the chime length as well. We evaluate several configurations with varying VLEN:DLEN ratios, varying VLEN with fixed DLEN = 256.
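As a small worked illustration of these scaling relationships, the sketch below computes the chime length and the architectural register file capacity for the swept configurations. It assumes integer LMUL values and relies on RVV defining 32 architectural vector registers; the function names are hypothetical.

```python
def chime_length(vlen: int, dlen: int, lmul: int = 1) -> int:
    """Cycles one instruction occupies a DLEN-wide datapath.

    With lmul=1 this is the native chime length, VLEN/DLEN.
    Assumes integer LMUL and VLEN a multiple of DLEN.
    """
    return lmul * vlen // dlen


def vrf_bits(vlen: int, num_vregs: int = 32) -> int:
    """Architectural vector register file capacity in bits."""
    return num_vregs * vlen


# Sweep VLEN at a fixed DLEN = 256, matching the ratios 1, 2, 4, and 8.
for vlen in (256, 512, 1024, 2048):
    print(vlen, chime_length(vlen, 256), vrf_bits(vlen) // 8, "bytes")
```

Doubling VLEN at fixed DLEN doubles both the native chime length and the register file capacity, which is the linear scaling noted above.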

TABLE IV  
PERCENT SPEEDUP WITH INCREASING CHIME LENGTH (VLEN/DLEN)  
AND ISSUE QUEUE DEPTH, WITH DLEN=256

|            | Relative % speedup with increasing VLEN/DLEN |     |     | Relative % speedup with increasing IQ depth |     |     |
|------------|-----|-----|-----|-----|-----|-----|
| IQ depth   | 4   | 4   | 4   | 0→1 | 1→2 | 2→4 |
| VLEN/DLEN  | 1→2 | 2→4 | 4→8 | 2   | 2   | 2   |
| conv3d     | 57  | 22  | 1   | 2   | 3   | 3   |
| conv2d     | 61  | 20  | -1  | 3   | 5   | 3   |
| jacobi2d   | 82  | -7  | -8  | 12  | 13  | 6   |
| sepconv    | 23  | -1  | -1  | 20  | 2   | 0   |
| gemm       | 3   | 0   | 5   | 3   | 1   | 1   |
| cos        | 2   | 1   | -1  | 11  | 4   | 4   |
| exp        | 10  | 4   | 0   | 2   | -1  | 0   |
| axpy       | 7   | 4   | 0   | -1  | 0   | 0   |
| gemv       | 1   | 2   | 0   | 1   | 0   | 0   |
| pathfinder | 57  | 9   | 1   | 26  | -5  | 3   |
| spmv       | 6   | -6  | -3  | 50  | 19  | 3   |
| fft2       | 2   | 1   | -2  | 7   | 2   | 1   |
| transpose  | 21  | 31  | -7  | 1   | 1   | 0   |
As we increment the VLEN:DLEN ratio, we record the percent improvement in performance for each kernel. Table IV depicts the results of this comparison.

A native chime length of 1 imposes a very high performance burden, as this implementation effectively requires 1 IPC instruction issue when executing non-register-grouped code. Moving to a 2:1 ratio yields significant performance improvements across most kernels, as greater chime lengths diminish the performance impact of scalar stalls. A 2:1 ratio is feasible with the precise scheduling mechanism and is the target design point for Saturn.

The effect is largely diminished at a 4:1 ratio, where only a few low-LMUL kernels see the benefit. We additionally observe that some benchmarks actually exhibit performance *degradation* at high chime lengths. Analysis of these cases revealed that deep temporal execution diminishes the effectiveness of load balancing across issue paths. These cases exposed a performance peculiarity specific to our evaluation system’s memory system: the evaluated LLC achieves higher throughput with a dynamic mix of loads and stores, compared to separate load-intensive and store-intensive phases.

### B. Issue Queue Depth

In the design of scalar microarchitectures, deep issue queues are critical for maintaining high IPC, as the issue queues comprise the out-of-order execution window for the execution units. However, Saturn’s vector microarchitecture does not rely on the issue queue depth directly as a mechanism for maintaining functional unit utilization.

Rather, Saturn relies on the issue queues solely as a mechanism to enact load-balancing for suboptimal code sequences. While optimally scheduled code might be perfectly balanced across all the sequencing paths, common vector code could benefit from some degree of per-sequencer queuing.

Since Saturn’s scheduling microarchitecture scales with the issue queue depths, shallow issue queues are especially desirable. Thus, we evaluate the degree to which vector code sequences benefit from instruction queuing. We compare variants of Saturn with issue queue sizes ranging from 0 to 4, recording the percent improvement in performance as we increment the queue depths.

The results in Table IV show that moving to single-entry issue queues immediately yields performance uplifts across many kernels. Again, the effect diminishes rapidly, becoming insignificant for many kernels towards 4-deep issue queues. Issue queue depths of 2-4 are likely the target design point for Saturn. As shown in Figure 9, the area cost of the issue queues is minimal, as Saturn does not impose substantial additional storage requirements for those queues.

### C. Memory Latency



Fig. 12. Performance degradation with memory latency injection on top of the base latency to the last-level cache.

We evaluate Saturn’s tolerance for memory latency through the DAE design paradigm of the vector load-store unit. Unlike long-vector microarchitectures, a short-vector implementation cannot rely on hiding memory latency across long vector lengths. The DAE microarchitecture enables run-ahead load request generation with minimal impact on the backend microarchitecture. We add a latency-injection buffer between the load-store unit and the memory system and evaluate the performance degradation with increasing memory latency.
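One way to picture the latency-injection buffer is as a FIFO that holds each response for a fixed number of extra cycles. The sketch below is a behavioral model only, assuming that the buffer delays responses, preserves their order, and has unbounded capacity; the class name `LatencyInjector` and its methods are hypothetical and do not describe the RTL used in the evaluation.

```python
from collections import deque
from typing import Any, Deque, List, Tuple


class LatencyInjector:
    """Delay every item by a fixed number of cycles, preserving order."""

    def __init__(self, extra_latency: int) -> None:
        self.extra_latency = extra_latency
        # Entries are (cycle at which the item becomes visible, payload).
        self._inflight: Deque[Tuple[int, Any]] = deque()

    def push(self, cycle: int, payload: Any) -> None:
        """Accept an item at `cycle`; it is released extra_latency cycles later."""
        self._inflight.append((cycle + self.extra_latency, payload))

    def pop_ready(self, cycle: int) -> List[Any]:
        """Return, in arrival order, all items whose delay has elapsed."""
        ready = []
        while self._inflight and self._inflight[0][0] <= cycle:
            ready.append(self._inflight.popleft()[1])
        return ready
```

Sweeping `extra_latency` in such a model corresponds to the x-axis of a latency-sensitivity experiment: the unit under test sees the base memory latency plus the injected delay.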

Figure 12 shows that across a variety of memory-bound kernels, the DAE microarchitecture can effectively tolerate high memory latencies. Notably, the SPMV kernel suffers more significantly from memory latency since the current frontend cracks page-crossing indexed loads into single-element operations, decreasing the effectiveness of the decoupling queues. Reducing the degree of cracking performed under these cases or avoiding page-crossing indexed loads entirely would ameliorate this issue.

We can analyze the extent of tolerable memory latency by considering the decoupling and issue queue depths. For a DAE microarchitecture, the maximum tolerable memory latency is determined by the depth of the queues between the access processor and the execute processor.

In Saturn, this includes both the explicit decoupling queue and the load issue queue. Furthermore, a single vector instruction might encode multiple cycles of memory access. Thus, the maximum tolerable memory latency is the number of instructions in the decoupling queue and load-issue-queue, multiplied by the maximum register grouping, multiplied by the native chime length.

In a VLEN = 512 DLEN = 256 machine with a 4-entry dispatch queue and a 4-entry load issue queue, this is 128 cycles of load latency in the optimal case, far more than would be necessary for a standard cached memory system. As the decoupling queue plays no role in the instruction scheduling microarchitecture, its capacity can be increased at will when integrating Saturn into high-latency memory systems.
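The 128-cycle figure follows directly from the product described above. The short sketch below reproduces the arithmetic, assuming the maximum register grouping is LMUL = 8 (the largest value defined by RVV 1.0); the function and parameter names are illustrative.

```python
def max_tolerable_load_latency(decoupling_queue_depth: int,
                               load_issue_queue_depth: int,
                               vlen: int, dlen: int,
                               max_lmul: int = 8) -> int:
    """Best-case cycles of load latency hidden by queued vector loads.

    Each queued instruction covers up to max_lmul * (VLEN / DLEN)
    cycles of memory access.
    """
    queued_instructions = decoupling_queue_depth + load_issue_queue_depth
    native_chime = vlen // dlen
    return queued_instructions * max_lmul * native_chime


# 4-entry decoupling queue, 4-entry load issue queue, VLEN=512, DLEN=256:
# (4 + 4) instructions * 8 (LMUL) * 2 (native chime) = 128 cycles.
print(max_tolerable_load_latency(4, 4, vlen=512, dlen=256))
```

Because the decoupling queue depth enters the product linearly, doubling it to 8 entries in this model raises the bound to 192 cycles without changing the issue queues.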

### D. Application Vector Length



Fig. 13. Utilization of SGEMM with varying problem sizes.

We sweep problem size for the compute-bound floating-point matrix-multiplication algorithm and show the results in Figure 13.

With short vector lengths and shallower temporal execution, instruction throughput becomes more critical. In the **SV-Base** configuration, inflexible scheduling reduces the effectiveness of the vector unit at draining instructions. Neither the **SV-Base** nor the **Ara** design can reach its peak utilization without an application vector length of 48.

In contrast, the **SV-Full** configuration can achieve near its peak utilization with a shorter application vector length of 32 elements. Thus, Saturn’s short-vector approach is more suitable for domains where application vector lengths are frequently short, as in mobile, DSP, or embedded deployments.

## VIII. RELATED WORK

Existing work on vector microarchitectures has required long vector lengths, fixed low-latency memory systems, register renaming, imprecise traps, and/or a simpler vector ISA. Saturn is the first to provide a full no-compromises short-vector implementation of a complete modern vector ISA. Table V summarizes the salient differences between Saturn and the most notable comparable academic and commercial vector implementations.

Recent academic interest in vector microarchitecture has been focused on long-vector machines for HPC applications. These microarchitectures operate with fundamentally different constraints compared to short-vector units. Deep temporal execution with low IPC and scalability towards many lanes are critical concerns for long-vectors, but do not reflect the requirements for short-vector units.

TABLE V  
COMPARING RELATED VECTOR MICROARCHITECTURES

|            |                          | Classification | VLEN (Bits) | No Renaming | Var. Mem. Lat. | Prec. Traps | RVV 1.0 |
|------------|--------------------------|----------------|-------------|-------------|----------------|-------------|---------|
| Academic   | **This work**            | Short          | 128-1024    | ✓           | ✓              | ✓           | Full    |
|            | Torrent [10]             | Long           | 1024        | ✓           | ✗              | ✗           | ✗       |
|            | OOOVA [16]               | Long           | 8192        | ✗           | ✓              | ✓           | ✗       |
|            | Hwacha [20]              | Long           | 4096        | ✓           | ✓              | ✗           | ✗       |
|            | Ara [25]                 | Long           | 4096        | ✓           | ✓              | ✗           | Partial |
|            | Vitruvius+ [21]          | Long           | 16384       | ✗           | ✓              | ✗           | Partial |
|            | AVA [19]                 | Long           | 1024        | ✗           | ✓              | ✗           | Partial |
|            | Spatz [12]               | Short          | 512         | ✓           | ✗              | ✗           | Partial |
|            | Vicuna [26]              | Short          | 512         | ✓           | ✗              | ✗           | Partial |
|            | RISC-V <sup>2</sup> [23] | Short          | 128         | ✗           | ✓              | ✗           | Partial |
| Industrial | NEC Aurora [4]           | Long           | 16384       | ✗           | ✓              | ✗           | ✗       |
|            | A64FX [22]               | Short          | 512         | ✗           | ✓              | ✓           | ✗       |
|            | P-series [7]             | Short          | 256         | ✗           | ✓              | ✓           | Full    |
|            | X-series [3]             | Short          | 512         | ✓           | ✓              | ✓           | Full    |
|            | NX27V [1]                | Short          | 512         | ✓           | ✓              | ✓           | Full    |

Among comparable work, Saturn is the first to demonstrate efficient execution of short vector lengths without requiring register renaming or a constrained memory system. RISC-V<sup>2</sup> and AVA require register renaming, simplifying hazard determination at the expense of register-file size. Saturn does not require register renaming, but can still chain past all types of data hazards. Spatz, Vicuna, and Torrent all assume a low-latency memory system. Vicuna and Torrent further require a global memory stall to adapt to variable latency, a significant limitation. Saturn supports dynamic instruction scheduling and chaining in the presence of long and variable-latency memory systems.

Many commercial vector implementations, including the Xuantie 910 [13], SiFive P-series [7], Ventana Veyron [8], Semidynamics Atrevido [2], modern ARMv9 cores [24], and Fujitsu A64FX [22], integrate their vector units deeply within the out-of-order core microarchitecture. Compared to these implementations, Saturn does not rely on general-purpose-optimized capabilities of the host core, such as aggressive superscalar instruction fetch, arbitrary out-of-order execution, or register renaming.

The Andes NX27V [1] and SiFive X [3] series processors are commercial RVV implementations with similar short vector lengths and compact, unified datapaths. While the details of these commercial implementations are unknown, we observe several key differences with Saturn. The NX27V seems to implement a unified physical scoreboard, while Saturn’s scheduler distributes the scoreboard across sequencers. The X-series features a unified scalar and vector load-store unit, while Saturn’s load-store unit is vector-specialized and separate from the scalar access path.

## IX. CONCLUSION

We propose, implement, and evaluate a short-vector microarchitecture that addresses the requirements for general-purpose data-parallel compute in compact, efficient cores. Our implementation can achieve competitive performance compared to aggressive long-vector microarchitectures, while also fulfilling all the necessary architectural requirements in modern scalable vector ISAs.

## X. ACKNOWLEDGMENTS

Research was partially funded by SLICE Lab industrial sponsors and affiliates, and by the NSF CCRI ENS Chipyard Award #2016662. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. We thank Kevin He, Nico Casteneda, Mihai Tudor, and Vikram Jain for their work leading tapeouts which have integrated early versions of Saturn. We also thank Albert Ou, Colin Schmidt, Andrew Waterman, and Chris Batten for insightful conversations on vector microarchitecture.

## REFERENCES

- [1] “RISC-V:NX27V.” Available: <https://www.andestech.com/en/products-solutions/andescore-processors/riscv-nx27v/>
- [2] “Semidynamics Vector Unit - Only 100% customisable RISC-V Vector Unit.” Available: <https://semidynamics.com/en/technology/vector-unit>
- [3] “SiFive Intelligence X280.” Available: <https://www.sifive.com/cores/intelligence-x280>
- [4] “SX-Aurora TSUBASA Architecture.” Available: <https://www.nec.com/en/global/solutions/hpc/sx/architecture.html?>
- [5] “Armv8-M Architecture Reference Manual,” 2015. Available: <https://developer.arm.com/documentation/ddi0553/bx/?lang=en>
- [6] “RISC-V “V” Vector Extension,” Sep. 2021. Available: <https://github.com/riscv/riscv-v-spec/releases/download/v1.0/riscv-v-spec-1.0.pdf>
- [7] “P870 High-Performance RISC-V Processor,” in *2023 IEEE Hot Chips 35 Symposium (HCS)*, Aug. 2023, pp. 1–19. Available: <https://ieeexplore.ieee.org/document/10254712>
- [8] “Veyron V1 Data Center-Class RISC-V Processor,” in *2023 IEEE Hot Chips 35 Symposium (HCS)*, Aug. 2023, pp. 1–16. Available: <https://ieeexplore.ieee.org/document/10254710>
- [9] K. Al-Hawaj, T. Ta, N. Cebry, S. Agwa, O. Afuye, E. Hall, C. Golden, A. B. Apsel, and C. Batten, “EVE: Ephemeral Vector Engines,” in *2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)*. Montreal, QC, Canada: IEEE, Feb. 2023, pp. 691–704. Available: <https://ieeexplore.ieee.org/document/10071074/>
- [10] K. Asanović, “Vector microprocessors,” Ph.D. dissertation, EECS Department, University of California, Berkeley, 1998. Available: <http://www2.eecs.berkeley.edu/Pubs/TechRpts/1998/6404.html>
- [11] K. Asanovic, *Computer Architecture: A Quantitative Approach, Appendix G*, 2019.
- [12] M. Cavalcante, D. Wüthrich, M. Perotti, S. Riedel, and L. Benini, “Spatz: A Compact Vector Processing Unit for High-Performance and Energy-Efficient Shared-L1 Clusters,” in *Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design*, ser. ICCAD ’22. New York, NY, USA: Association for Computing Machinery, Dec. 2022, pp. 1–9. Available: <https://dl.acm.org/doi/10.1145/3508352.3549367>
- [13] C. Chen, X. Xiang, C. Liu, Y. Shang, R. Guo, D. Liu, Y. Lu, Z. Hao, J. Luo, Z. Chen, C. Li, Y. Pu, J. Meng, X. Yan, Y. Xie, and X. Qi, “Xuantie-910: A Commercial Multi-Core 12-Stage Pipeline Out-of-Order 64-bit High Performance RISC-V Processor with Vector Extension : Industrial Product,” in *2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)*, May 2020, pp. 52–64. Available: <https://ieeexplore.ieee.org/document/9138983>
- [14] L. Codrescu, “Architecture of the Hexagon™ 680 DSP for mobile imaging and computer vision,” in *2015 IEEE Hot Chips 27 Symposium (HCS)*, Aug. 2015, pp. 1–26. Available: <https://ieeexplore.ieee.org/document/7477329>
- [15] L. Codrescu, W. Anderson, S. Venkumahanti, M. Zeng, E. Plondke, C. Koob, A. Ingle, C. Tabony, and R. Maule, “Hexagon DSP: An Architecture Optimized for Mobile Multimedia and Communications,” *IEEE Micro*, vol. 34, no. 2, pp. 34–43, Mar. 2014. Available: <https://ieeexplore.ieee.org/document/6762801>
- [16] R. Espasa, M. Valero, and J. Smith, “Out-of-order vector architectures,” in *Proceedings of 30th Annual International Symposium on Microarchitecture*. Research Triangle Park, NC, USA: IEEE Comput. Soc, 1997, pp. 160–170. Available: <http://ieeexplore.ieee.org/document/645807/>
- [17] G. Frison, D. Kouzoupis, T. Sartor, A. Zanelli, and M. Diehl, “BLASFEO: Basic Linear Algebra Subroutines for Embedded Optimization,” *ACM Trans. Math. Softw.*, vol. 44, no. 4, pp. 42:1–42:30, Jul. 2018. Available: <https://doi.org/10.1145/3210754>
- [18] A. Khadem, D. Fujiki, N. Talati, S. Mahlke, and R. Das, “Vector-Processing for Mobile Devices: Benchmark and Analysis,” Sep. 2023. Available: <http://arxiv.org/abs/2309.02680>
- [19] C. R. Lazo, E. Reggiani, C. R. Morales, R. F. Bagué, L. A. Vargas, M. A. R. Salinas, M. V. Cortés, O. S. Unsal, and A. Cristal, “Adaptable Register File Organization for Vector Processors,” in *2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA)*, Apr. 2022, pp. 786–799. Available: <http://arxiv.org/abs/2111.05301>
- [20] Y. Lee, C. Schmidt, A. Ou, A. Waterman, and K. Asanović, “The Hwacha vector-fetch architecture manual, version 3.8.1,” Tech. Rep. UCB/EECS-2015-262, Dec. 2015. Available: <http://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-262.html>
- [21] F. Minervini, O. Palomar, O. Unsal, E. Reggiani, J. Quiroga, J. Marimon, C. Rojas, R. Figueras, A. Ruiz, A. González, J. Mendoza, I. Vargas, C. Hernandez, J. Cabre, L. Khoirunisa, M. Bouhalil, J. Pavon, F. Moll, M. Olivieri, M. Kovac, M. Kovac, L. Dragic, M. Valero, and A. Cristal, “Vitruvius+: An Area-Efficient RISC-V Decoupled Vector Coprocessor for High Performance Computing Applications,” *ACM Transactions on Architecture and Code Optimization*, vol. 20, no. 2, pp. 28:1–28:25, Mar. 2023. Available: <https://dl.acm.org/doi/10.1145/3575861>
- [22] T. Odajima, Y. Kodama, M. Tsuji, M. Matsuda, Y. Maruyama, and M. Sato, “Preliminary Performance Evaluation of the Fujitsu A64FX Using HPC Applications,” in *2020 IEEE International Conference on Cluster Computing (CLUSTER)*, Sep. 2020, pp. 523–530. Available: <https://ieeexplore.ieee.org/document/9229635>
- [23] K. Patsidis, C. Nicopoulos, G. C. Sirakoulis, and G. Dimitrakopoulos, “RISC-V2: A Scalable RISC-V Vector Processor,” in *2020 IEEE International Symposium on Circuits and Systems (ISCAS)*, Oct. 2020, pp. 1–5. Available: <https://ieeexplore.ieee.org/document/9181071>
- [24] A. Pellegrini, “Arm Neoverse N2: Arm’s 2nd generation high performance infrastructure CPUs and system IPs,” in *2021 IEEE Hot Chips 33 Symposium (HCS)*, Aug. 2021, pp. 1–27. Available: <https://ieeexplore.ieee.org/document/9567483>
- [25] M. Perotti, M. Cavalcante, R. Andri, L. Cavigelli, and L. Benini, “Ara2: Exploring Single- and Multi-Core Vector Processing with an Efficient RVV1.0 Compliant Open-Source Processor,” Nov. 2023. Available: <http://arxiv.org/abs/2311.07493>
- [26] M. Platzer and P. Puschner, “Vicuna: A Timing-Predictable RISC-V Vector Coprocessor for Scalable Parallel Computation,” 2021. Available: <https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ECRTS.2021.1>
- [27] Arm Holdings plc, “FYE24-Q3 shareholder letter,” Dec. 2023. Available: <https://investors.arm.com/static-files/4404a89a-d033-419e-aaf0-d7b15d40e11f>
- [28] C. Ramírez, C. A. Hernández, O. Palomar, O. Unsal, M. A. Ramírez, and A. Cristal, “A RISC-V simulator and benchmark suite for designing and evaluating vector architectures,” *ACM Transactions on Architecture and Code Optimization (TACO)*, vol. 17, no. 4, pp. 1–30, 2020.
- [29] R. M. Russell, “The CRAY-1 computer system,” *Communications of the ACM*, vol. 21, no. 1, pp. 63–72, Jan. 1978. Available: <https://dl.acm.org/doi/10.1145/359327.359336>
- [30] J. E. Smith, “Decoupled access/execute computer architectures,” *ACM SIGARCH Computer Architecture News*, vol. 10, no. 3, pp. 112–119, Apr. 1982. Available: <https://dl.acm.org/doi/10.1145/1067649.801719>

- [31] N. Stephens, S. Biles, M. Boettcher, J. Eapen, M. Eyole, G. Gabrielli, M. Horsnell, G. Magklis, A. Martinez, N. Premillieu, A. Reid, A. Rico, and P. Walker, “The ARM Scalable Vector Extension,” *IEEE Micro*, vol. 37, no. 2, pp. 26–39, Mar. 2017. Available: <http://ieeexplore.ieee.org/document/7924233/>
- [32] J. E. Thornton, “Parallel operation in the Control Data 6600,” in *Proceedings of the October 27-29, 1964, Fall Joint Computer Conference, Part II: Very High Speed Computer Systems*, ser. AFIPS ’64 (Fall, Part II). New York, NY, USA: Association for Computing Machinery, Oct. 1964, pp. 33–40. Available: <https://dl.acm.org/doi/10.1145/1464039.1464045>