

# Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference

Haoran Wu  
University of Cambridge  
Cambridge, UK

Binglei Lou  
Imperial College London  
London, UK

Przemyslaw Forys  
Imperial College London  
London, UK

Timothy M. Jones  
University of Cambridge  
Cambridge, UK

Can Xiao  
Imperial College London  
London, UK

Jeffrey T. H. Wong  
Imperial College London  
London, UK

Wayne Luk  
Imperial College London  
London, UK

Rika Antonova  
University of Cambridge  
Cambridge, UK

Jiayi Nie  
University of Cambridge  
Cambridge, UK

Zhiwen Mo  
Imperial College London  
London, UK

Hongxiang Fan  
Imperial College London  
London, UK

Robert Mullins  
University of Cambridge  
Cambridge, UK

Xuan Guo  
Imperial College London  
London, UK

Cheng Zhang  
Imperial College London  
London, UK

Jianyi Cheng  
University of Edinburgh  
Edinburgh, UK

Aaron Zhao  
Imperial College London  
London, UK

## Abstract

LLMs now form the backbone of AI agents for a diverse array of applications, including tool use, command-line agents, and web or computer use agents. These agentic LLM inference tasks are fundamentally different from chatbot-focused inference – they often have much larger context lengths to capture complex, prolonged inputs, such as entire webpage DOMs or complicated tool call trajectories. This, in turn, generates significant off-chip memory traffic for the underlying hardware at the inference stage and causes the workload to be constrained by two memory walls, namely the *bandwidth* and *capacity* memory walls, preventing the on-chip compute units from achieving high utilization.

In this paper, we introduce PLENA, a hardware-software co-designed system that applies three core optimization pathways to tackle these challenges. PLENA includes an efficient hardware implementation of compute and memory units supporting an asymmetric quantization scheme. PLENA also features a novel flattened systolic array architecture that has native support for FlashAttention to tackle these memory walls in the scenario of inference serving for long-context LLMs. Additionally, PLENA is developed with a complete stack, including a custom ISA, a compiler, a cycle-emulated simulator, and an automated design space exploration flow. The simulated results show that PLENA achieves up to 8.5 $\times$  higher utilization than existing accelerators, and delivers 2.24 $\times$  higher throughput than the A100 GPU and 3.85 $\times$  higher throughput than the TPU v6e, under the same multiplier count and memory settings. The full PLENA system will also be open-sourced.

## 1 Introduction

Transformer models have revolutionised AI across numerous fields, including language, vision, and science [34, 65, 69]. Decoder-only transformer-based autoregressive large language models (LLMs), like GPT [50] and LLaMA [63], are now widely deployed in many applications, such as real-time chatbots [49], code generation [32] and agentic tool-use and computer-use workflows [48].

The rapid rise of agentic LLM capabilities, e.g. computer use [41], tool use [27, 46], and command-line agents [1], relies heavily on their ability to process and reason over very long contexts. For instance, command-line agents need to both comprehend and generate large-scale codebases [30, 55, 71], while tool- and computer-use agentic workflows must keep track of multiple pieces of information across prolonged inputs, such as an entire web page DOM, which typically require very long contexts [12, 20, 35]. Figure 1(a) shows that, compared with chatbot workloads, agentic workloads consume 100 $\times$ more tokens per inference on average and up to 1,000 $\times$ more in the most extreme case. In response, modern LLMs have expanded their context windows: the original GPT-3 [11] supports roughly 2K tokens, whereas GPT-4 [50] reaches up to 32K tokens, and LLaMA4-Maverick [2] extends the context window to 1M tokens.

(a) Compared with standard chatbot workloads, the selected agentic web and code tasks generally consume over 100× more tokens.

(b) Compute shifts from FFN to Attention with increasing context length.

(c) KV cache scales with context length, eventually dominating memory usage.

**Figure 1:** An illustration of agentic inference workloads shows how they typically generate many more tokens per inference run (Figure 1(a)), contain both FFN-compute-intensive and attention-compute-intensive phases (Figure 1(b)), and include weight memory-capacity-dominant and KV-dominant phases (Figure 1(c)) within a single inference run.

To clarify the computational impact of agentic workloads, Figure 1(b) analyzes a LLaMA 3.3 70B model with long-context capability and shows that, when the number of generated tokens is low, the Feed-Forward Networks (FFNs) account for most of the total inference FLOPs, whereas the attention layers become dominant as the number of tokens generated grows. Note that both phases can occur within a single inference run because decoding is autoregressive. For instance, in the Longwriter [8] workload, the prefilling phase finishes at 5K tokens, and the decoding phase starts from there and continues to expand the context up to 85K tokens, causing the workload to shift from the FFN-intensive to the attention-intensive region in terms of FLOPs within a single inference run, as shown in Figure 1(b). Furthermore, with such large context lengths, the KV cache becomes the primary consumer of HBM resources.

Figure 1(c) also identifies two major limiting factors on the memory side. The large number of KV values and weights that must be read, together with the portion of KV values written back, imposes very substantial memory bandwidth demands. In addition, as context length increases, the KV-cache requirement grows linearly, quickly increasing memory usage and often surpassing the size of the model weights, making HBM capacity a primary limiting factor. For example, in LLaMA-3.3-70B, at a 128k context [45], the FP16 KV cache for a single batch is approximately 39 GB, which limits how many batches can be kept on the chip [23]. Building on this observation, we identify two main challenges on the off-chip memory side: (i) the limited memory bandwidth and (ii) the restricted memory capacity. We collectively term these *memory walls*. Together, they prevent devices from reaching peak performance at inference time, consistent with observations in prior work [18, 23, 74].
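The 39 GB figure can be reproduced with a short back-of-the-envelope calculation. The sketch below assumes the publicly documented LLaMA-3 70B attention configuration (80 layers, 8 grouped-query KV heads, head dimension 128), a 128,000-token context, and 2 bytes per FP16 value. These values are assumptions rather than numbers taken from this paper, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, bytes_per_value=2):
    """KV-cache size for one sequence: K and V, per layer, per KV head, per token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return tokens * per_token

per_token = kv_cache_bytes(1)
fp16_total = kv_cache_bytes(128_000)
# 4-bit elements (0.5 byte each); per-block scale overhead is ignored here.
int4_total = kv_cache_bytes(128_000, bytes_per_value=0.5)

print(f"per token:            {per_token / 1024:.0f} KiB")
print(f"FP16, 128K context:   {fp16_total / 2**30:.1f} GiB")
print(f"4-bit, 128K context:  {int4_total / 2**30:.1f} GiB")
```

Under these assumptions the cache grows by a fixed amount per token, which is the linear scaling described above, and a 4-bit KV format would cut the footprint to roughly a quarter.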

The memory wall phenomenon leads to underutilization of computing resources on modern hardware, including TPUs and GPUs. This effect is particularly evident in compute units dedicated to General Matrix-Matrix Multiplication (GEMM) operations ( $\mathbb{R}^{M \times K} \times \mathbb{R}^{K \times N} \rightarrow \mathbb{R}^{M \times N}$ ), denoted as $(M, K) \times (K, N)$, which constitute the core computational workload during LLM inference [26]. At the microarchitectural level, most hardware is built with square-shaped systolic arrays or matrix multiplication units, typically designed so that the $M$ and $N$ dimensions are close in size to $K$. For example, TPU v3 [24] features a $128 \times 128$ systolic array, supporting $M = K = N = 128$ GEMM operations. The NVIDIA Blackwell B200 architecture [15] introduces a minimal computation granularity of $64 \times 8 \times 16$. However, in long-context models, as demonstrated in Figure 1(c), memory often constrains the inference batch size. This results in a *fat GEMM*, where the batch-related dimension (typically $M$ in $(M, K) \times (K, N)$) is much smaller than the others, producing an uneven matrix shape. This imbalance prevents systolic arrays and Tensor Cores from achieving high utilization, resulting in significant underuse of computational resources [28].

To this end, we propose the **P**rogrammable **L**ong-context **E**fficient **N**eural **A**ccelerator (PLENA), an efficient transformer model accelerator system designed to maintain high utilization of GEMM units across all inference stages (prefilling and decoding), particularly for agentic LLM inference tasks with large contexts. PLENA achieves high efficiency for long-context inference by exploring three optimization pathways across both hardware and software design spaces: i) a flattened systolic array architecture tailored to *fat GEMM* (large inner dimension  $K$ ); ii) a set of quantization methods with mixed data types and precisions to address both memory wall challenges; and iii) a set of custom instructions (the PLENA\_ISA) that contain native FlashAttention support [16].

Figure 2 shows how these three pathways together can **increase the utilization** compared to the conventional square-shaped GEMM hardware without any optimization. First, our novel flattened systolic-array optimization (*Pathway 1*) achieves a higher attainable compute utilization. The $(M, K) \times (K, N)$ matrix multiplication typically has $M \ll K$ because of the memory capacity wall.<sup>2</sup> Our flattened systolic array thus makes more effective use of the multiplication resources, as illustrated in Figure 2(a) and Figure 2(b). Second, we apply an *asymmetric quantization strategy* (*Pathway 2*), where Weights (W), Activations (A), and KV Cache (KV) can be set to different arithmetic widths and precisions to address both memory bandwidth and capacity limitations. With more aggressive W and KV cache quantization such as $W_{mxint4}$, $A_{mxint8}$, $KV_{mxint4}$, we free up more space in HBM for data scaling (e.g., larger batch sizes). We strategically integrate and enhance state-of-the-art (SOTA) optimization schemes for quantization, incorporating techniques such as micro-scaling [56], output-norm-guided Hessian-based iterative optimization [22], and selective rotation [6]. Our final quantization results demonstrate SOTA performance, effectively reducing limitations related to both memory bandwidth and capacity. Finally, we design and implement PLENA with a custom ISA to have native support for FlashAttention (*Pathway 3*), since Figure 1(b) shows that attention dominates at longer context lengths. We present a novel approach for effectively supporting FlashAttention on systolic-array-based architectures that avoids excessive off-chip memory I/O during attention. Together, these three optimizations yield significantly higher utilization than conventional square-shaped systolic-array accelerators.

<sup>1</sup>64×64 square-shaped systolic array and 8×512 flattened systolic array. Data derived from 144 GB HBM capacity and 512 GB/s memory bandwidth.

<sup>2</sup>All KVs must be stored, so the batch size (the $M$ dimension) is kept lower than the hidden size ($K$). While various offloading techniques are available [4], they complicate system-level trade-offs and tend to make the system more memory I/O-bound.

(a) PLENA achieves higher utilization than the standard square systolic array (same resources).

(b) PLENA’s optimization pathways, (1) a flattened systolic array and (2) asymmetric quantization, together achieve improved effective memory bandwidth utilization and help reduce memory capacity limitations.

**Figure 2:** A comparison of attainable FLOPS between a square-shaped systolic array (e.g. TPUs) and PLENA when using the same number of multipliers for running LLaMA3.3 (70B, 128K context)<sup>1</sup>. PLENA’s optimization pathways yield higher attainable FLOPS.

The main contributions of our work are as follows:

- We analytically characterize the bandwidth and capacity memory walls in agentic LLM inference and show that existing systolic-array accelerators are normally under-utilized when running this workload.
- We introduce three optimization pathways that jointly address the under-utilization caused by memory walls: (i) a flattened systolic array architecture; (ii) an asymmetric quantization scheme with mixed data types and precisions; and (iii) native support for FlashAttention.
- We present PLENA, a complete hardware-software system that realizes the above optimizations. It comprises (i) a custom instruction set (PLENA\_ISA) for large Transformer inference; (ii) a PyTorch-to-PLENA\_ISA compiler; (iii) an HBM-enabled transaction-level simulator; (iv) an automated, accuracy-aware design space exploration (DSE) flow; and (v) a full RTL implementation. We demonstrate that PLENA supports different SOTA transformer model variants (e.g., GQA, MHA and MLA [43], Dense and MoE [5]). We also show that PLENA achieves SOTA energy efficiency for agentic LLM inference tasks – it achieves 2.24× higher throughput than A100 GPUs and 3.85× higher throughput than the v6e TPU. The overall PLENA system is illustrated in Figure 10, and the entire system will be fully open-sourced upon acceptance.

## 2 Background

### 2.1 Model Quantization

Quantization compresses LLMs by mapping high-precision floating-point parameters  $\mathbf{X}$  into lower-bit representations. Following the standard integer quantization definition [47], we formalize the process over the arbitrary target data format under a single-level scaling scheme using three elements: the *data format* ( $\tau$ ), the *scale factor* ( $s$ ) and the *zero point* ( $z$ ).

A data format is defined as a tuple $\tau = (d, b)$, where $d$ denotes the numerical datatype and $b$ is the bit-width specifying its precision. For a data format $\tau$, the values it can represent are restricted to a finite interval. We denote this interval as the representable set:

$$\Omega(\tau) = \{x \in \mathbb{R} \mid \min_{\tau} \leq x \leq \max_{\tau}\}, \quad (1)$$

with  $\min_{\tau}$  and  $\max_{\tau}$  as the representable bounds. The scale factor  $s$  maps the dynamic range of  $\mathbf{X}$  into  $\Omega(\tau)$ , typically defined as:

$$s = \frac{\max(\mathbf{X})}{\max_{\tau}}, \quad (2)$$

while the zero-point  $z$  shifts the range for alignment (with  $z = 0$  in symmetric quantization). Quantization then maps  $\mathbf{X}$  into the target format as:

$$X_{\tau} = \text{clip}\left(\text{RTN}\left(\frac{\mathbf{X}}{s}\right) + z, \min_{\tau}, \max_{\tau}\right), \quad (3)$$

where RTN( $\cdot$ ) denotes round-to-nearest. To approximate the original tensor, the quant-dequant operator is:

$$Q(\mathbf{X}; s, \tau) = s(X_{\tau} - z). \quad (4)$$

In Equation 3, values exceeding the representable range are clipped, introducing the clipping error. In this work, we address this with a novel adaptive clipping search, described in Section 4.2.
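Equations (2) to (4) can be traced in a few lines of NumPy. This is a minimal sketch for a signed integer format with a single per-tensor scale. It assumes symmetric quantization ($z = 0$ by default) and uses $\max|\mathbf{X}|$ in place of $\max(\mathbf{X})$ in Equation (2), a common variant for signed data. The function name `quant_dequant` is hypothetical and is not part of PLENA.

```python
import numpy as np

def quant_dequant(x, bits=8, z=0):
    """Quant-dequant for a signed integer format; assumes x is not all zeros."""
    q_min, q_max = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1   # bounds of Omega(tau)
    s = np.abs(x).max() / q_max                              # Eq. (2), with max |X|
    x_q = np.clip(np.rint(x / s) + z, q_min, q_max)          # Eq. (3): RTN, shift, clip
    return s * (x_q - z), s                                  # Eq. (4)

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
x_hat, s = quant_dequant(x, bits=8)
max_err = np.abs(x_hat - x).max()
print(f"scale = {s:.5f}, max abs error = {max_err:.5f}")
```

With this choice of scale nothing is clipped, so the error comes from rounding alone and stays within half a quantization step. Choosing a smaller scale trades some clipping error for finer resolution, which is the trade-off a clipping search explores.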

As the tensor size grows, the probability of encountering such outliers increases, widening the dynamic range and amplifying clipping error. Prior work mitigates this by varying the granularity at which scale and zero-point parameters are shared: from per-tensor, to per-channel, and to vector-wise schemes. In this work, we adopt block-wise micro-scaling datatypes (MXINT and MXFP), with both software and hardware implementations to support our data-format-aware co-design, which we defer to Section 4.2.

**Table 1:** A comparison of LLM accelerators: most lack cycle-accurate simulators for RTL-level timing, omit accurate HBM simulation in evaluation, are constrained by the lack of an ISA with compiler support, and accelerate only a subset of kernels, resulting in restricted flexibility, the need to offload to GPUs/CPUs, frequent host-device transfers, and significant data-movement overheads.

|                            | PLENA | PICACHU [53] | MicroScopiQ [54] | FlightLLM [73] | Tender [36] | FIGNA [31] | SystolicAttention [38] | Olive [25] |
|----------------------------|-------|--------------|------------------|----------------|-------------|------------|------------------------|------------|
| Simulator                  | L3    | L1           | L2               | L1             | L1          | L1         | L3                     | L1         |
| Custom ISA & Auto Code Gen | ✓     | ✓            | ✗                | ✓              | ✗           | ✗          | ✓                      | ✗          |
| DSE                        | ✓     | ✗            | ✗                | ✗              | ✗           | ✗          | ✗                      | ✗          |
| FlashAttention support     | ✓     | ✗            | ✗                | ✗              | ✗           | ✓          | ✓                      | ✗          |
| Full inference coverage*   | ✓     | ✗            | ✗                | ✓              | ✓           | ✗          | ✗                      | ✗          |
| Open source                | ✓     | ✗            | —                | ✗              | —           | ✗          | ✓                      | —          |

L1: functional simulator; L2: cycle-accurate simulator; L3: cycle-accurate simulator with HBM enabled.

—: partial or planned open-source. Full inference coverage\*: all Transformer computations executed on-accelerator.



**Figure 3:** A typical setting of the MX data formats in this design. A scale is shared by a group of elements. The scale uses power-of-two quantization, and the elements can be quantized to integers or minifloats.

Additionally, quantization approaches generally fall into two categories: *quantization-aware training* (QAT), which integrates quantization during fine-tuning, and *post-training quantization* (PTQ), which applies quantization directly to a pretrained model. Our PTQ method achieves accuracy competitive with full-precision baselines, even under aggressive low-bit, full-system quantization, as demonstrated in Table 3.

**Microscaling.** Microscaling (MX) data formats, proposed in prior work [56], define a standardized format that enables block-wise scaling sharing. These formats support multi-level scaling schemes. We adopt only the level-1 scaling strategy, as illustrated in Figure 3. The scaling factor in the MX format can be computed similarly to Equation (2), after which it is quantized using power-of-two (PoT) quantization. The data elements in MX formats can be represented either as integers or as minifloats. In our design, we include both representations in the search space to evaluate software performance.
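The level-1 scheme described above can be sketched as follows for integer elements. The block size of 32, the 4-bit element width, and the choice to round the shared exponent upward (so that no element is clipped) are illustrative assumptions. `mxint_quant_dequant` is a hypothetical name, and the sketch is not the paper's emulator.

```python
import numpy as np

def mxint_quant_dequant(x, bits=4, block=32):
    """Block-wise quant-dequant with one power-of-two scale shared per block."""
    x = np.asarray(x, dtype=np.float64)
    assert x.size % block == 0, "length must be a multiple of the block size"
    blocks = x.reshape(-1, block)
    q_max = 2 ** (bits - 1) - 1
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    safe = np.where(amax > 0, amax, 1.0)              # avoid log2(0) on all-zero blocks
    exp = np.ceil(np.log2(safe / q_max))              # PoT-quantized version of Eq. (2)
    scale = np.exp2(exp)
    q = np.clip(np.rint(blocks / scale), -q_max - 1, q_max)
    return (q * scale).reshape(x.shape), exp.astype(int).ravel()

rng = np.random.default_rng(0)
x = rng.standard_normal(128)
x[5] = 40.0                                           # an outlier inflates one block only
x_hat, exps = mxint_quant_dequant(x)
print("shared exponents per block:", exps)
```

Because each block carries its own exponent, the outlier enlarges the step size of its own block only, while the other blocks keep a finer resolution.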

**Quantization Comparison.** In long-context scenarios, KV cache size is a key challenge [76], but hardware support for efficient quantization of KV cache remains limited [25, 54]. Existing frameworks often treat quantization in isolation rather than as part of full-system design, leaving gaps in non-GEMM operations and causing a mismatch between algorithmic advances and practical deployment on hardware.

### 2.2 FlashAttention

FlashAttention optimizes memory I/O in the standard attention layer [16]. In a standard attention layer, computing $QK^T$ produces a prohibitively large square matrix, often thousands by thousands in size. Because on-chip memory cannot hold this intermediate result, it must be written to off-chip memory and later reloaded for the subsequent softmax and PV steps, which significantly degrades performance. FlashAttention avoids this round trip by tiling and fusing the attention computation (GEMM–Softmax–GEMM) so that all intermediate results fit on-chip.
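The tiling idea can be illustrated with a single-head NumPy sketch of the online-softmax recurrence. It is a functional illustration only. The tile size, shapes, and function names are assumptions, and it says nothing about how PLENA schedules the computation in hardware.

```python
import numpy as np

def attention_reference(q, k, v):
    """Standard attention: materializes the full score matrix."""
    s = q @ k.T / np.sqrt(q.shape[1])
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ v

def attention_tiled(q, k, v, tile=16):
    """Processes K/V tile by tile, keeping only running statistics per query row."""
    n_q, d = q.shape
    out = np.zeros((n_q, v.shape[1]))
    row_max = np.full((n_q, 1), -np.inf)
    row_sum = np.zeros((n_q, 1))
    for start in range(0, k.shape[0], tile):
        k_t, v_t = k[start:start + tile], v[start:start + tile]
        s = q @ k_t.T / np.sqrt(d)                        # scores for this tile only
        new_max = np.maximum(row_max, s.max(axis=1, keepdims=True))
        correction = np.exp(row_max - new_max)            # rescales earlier partial results
        p = np.exp(s - new_max)
        row_sum = row_sum * correction + p.sum(axis=1, keepdims=True)
        out = out * correction + p @ v_t
        row_max = new_max
    return out / row_sum

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 32))
k = rng.standard_normal((100, 32))
v = rng.standard_normal((100, 32))
same = np.allclose(attention_tiled(q, k, v), attention_reference(q, k, v))
print("tiled result matches reference:", same)
```

The tiled version never holds more than one `(n_q, tile)` block of scores at a time, which is why the intermediate results can stay on-chip.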

## 3 PLENA Hardware System

The overall configuration of PLENA is shown in Figure 4. It is designed to support instruction-level pipelining and mainly consists of three compute units: the Matrix Unit, the Vector Unit, and the Scalar Unit. All units are highly configurable, supporting multiple data types and precisions, enabling the application of different quantization methods to the accelerator. PLENA also includes two main on-chip SRAM blocks. The Vector SRAM acts as a scratchpad for computation, storing frequently used data such as activations, which do not need to be written back to HBM, thereby reducing memory access overhead. The custom Matrix SRAM is dedicated to loading weights and KV tensors and supports reading data in either transposed or untransposed layouts with no additional overhead.

### 3.1 Hardware Support for Asymmetric Arithmetic Types

To support asymmetric quantization strategies (Section 4.1), PLENA natively offers multiple numeric formats—covering different data types and precisions—across its compute and memory units (Table 14). This innovative *asymmetric* data-handling configuration has the following characteristics:

- (i) Activations are stored in a high-precision floating-point (FP) format on-chip in the Vector SRAM, as they are more sensitive to quantization errors than KV or weights.
- (ii) KV and weights, being less accuracy-sensitive, can be more aggressively quantized and staged in the Matrix SRAM using lower-precision MX formats (MX-FP or MX-INT).
- (iii) An optional on-chip rotation step can suppress outliers before quantization to preserve accuracy.

**Figure 4:** PLENA architecture overview. Execution is controlled by the decoder’s system-pipeline controller, which derives control signals from decoded instructions and monitors memory dependencies. For example, if the current instruction needs to read from a Vector SRAM row that is still being updated by the vector or matrix unit, the controller inserts a stall to ensure correctness. Vector SRAM acts as the on-chip scratchpad, providing data to the matrix and vector units and accepting their results.

Figure 5 illustrates the precision formats used by each unit and the dataflow between them. When appending newly computed $K$ and $V$ to the KV cache, we optionally apply a selective rotation (Hadamard transform) to suppress outliers before quantizing to MX-INT. Because $K$ and $V$ are consumed only by the attention layer’s GEMM, they are loaded exclusively into the Matrix SRAM. Before use, the matrix unit applies the inverse Hadamard transform to de-rotate $K$ and $V$. These rotation/de-rotation stages can be selectively applied per tensor; for example, weights loaded into the matrix unit bypass the inverse transform.
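To see why a Hadamard rotation helps before quantization, the sketch below applies an orthonormal fast Walsh–Hadamard transform to a vector with a single large outlier. The implementation is a generic textbook butterfly rather than PLENA's hardware datapath, and `fwht` is a hypothetical name. With the $1/\sqrt{n}$ normalization used here, the transform is its own inverse.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform of a length-2^k vector."""
    v = np.array(x, dtype=np.float64)
    n = v.size
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = v[i:i + h].copy()
            b = v[i + h:i + 2 * h].copy()
            v[i:i + h] = a + b          # butterfly: sums
            v[i + h:i + 2 * h] = a - b  # butterfly: differences
        h *= 2
    return v / np.sqrt(n)

x = np.zeros(8)
x[3] = 100.0                            # one outlier dominates the dynamic range
rotated = fwht(x)                       # energy is spread across all elements
restored = fwht(rotated)                # applying the transform again de-rotates

print("max |x|       :", np.abs(x).max())
print("max |rotated| :", np.abs(rotated).max())
```

After rotation the largest magnitude shrinks, so a shared scale wastes less of the representable range on a single element.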



**Figure 5:** Asymmetric-precision datapath example. Vector SRAM stores FP4 values, whereas Matrix SRAM stores MX-INT4 values. Green paths denote the selective rotational quantization flow: a fast Walsh–Hadamard transform is applied, with its inverse used to map back [51]. Blue paths indicate the data flow for the remaining computation.

### 3.2 Computational Units

All compute units are optimized for feed-forward (FFN) and attention computations in transformer inference, with particular emphasis on long-context workloads. As shown in Figure 2(b), long-context workloads frequently involve *fat* GEMMs, where the batch-related dimension (typically $M$ in $(M, K) \times (K, N)$) is much smaller than the others, resulting in uneven matrix shapes (Figure 6). The reduction dimension $K$ tends to be very long. For example, the weight–activation GEMM reduces over the model’s hidden size (e.g., 4,096 for LLaMA-8B and 8,192 for LLaMA-70B). In addition, a variety of arithmetic operations, such as elementwise addition, summation, and special functions like the exponential, are required across long-dimension tensors.

**Figure 6:** Processing flow for the weight–activation GEMM. Because memory capacity constrains batch size, the $M$ dimension remains small. Setting $BLEN = M$ on the flattened systolic array yields near-100% utilization.

**Matrix Unit.** To optimize GEMM in long-context workloads involving *fat* GEMMs, we propose flattened systolic arrays, enabling higher utilization across the entire *fat* GEMM computation flow. The unit computes a  $(BLEN, MLEN) \times (MLEN, BLEN)$  GEMM and produces results of shape  $(BLEN, BLEN)$ , and normally  $BLEN$  is set to be much smaller than  $MLEN$  to match the workload characteristics of long-context LLM inference.
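A rough occupancy model shows why the flattened shape matters for small batches. The sketch below compares the two configurations from the Figure 2 footnote (a 64×64 square array and an 8×512 flattened array, both with 4,096 multipliers) under the simplifying assumption that, in an output-stationary array, a PE is useful only if it holds a valid output element. Pipeline fill, memory stalls, and scheduling are ignored, and both function names are hypothetical.

```python
def square_utilization(m, n, size):
    """Fraction of PEs in a size x size output-stationary array that hold valid outputs."""
    return (min(m, size) * min(n, size)) / (size * size)

def flattened_utilization(m, n, blen):
    """Same model for a row of BLEN x BLEN sub-arrays that split the K dimension."""
    return (min(m, blen) * min(n, blen)) / (blen * blen)

batch = 8                                              # small M due to the capacity wall
square = square_utilization(batch, n=64, size=64)      # 64 x 64 array
flattened = flattened_utilization(batch, n=8, blen=8)  # 8 x 512 array, BLEN = 8

print(f"square 64x64   : {square:.1%}")
print(f"flattened 8x512: {flattened:.1%}")
```

In this model the square array leaves most rows idle when $M$ is small, whereas the flattened array spends its multipliers along the long reduction dimension instead.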



**Figure 7:** The flattened systolic array is composed of a series of smaller square-shaped systolic arrays arranged in a row to form the desired *fat GEMM* shape. Each receives inputs distributed from the MLEN vector buffers W and X, as shown in Figure 4.

This flattened systolic array is designed for output-stationary dataflow in order to maintain high utilization and avoid frequent reads/writes of partial sums—and the bubbles associated with streaming operands into the systolic array. As shown in Figure 6, operands stream along the large reduction dimension  $K$  while partial sums remain resident in the PEs. The array is fully pipelined, eliminating bubbles between consecutive GEMM tiles.

The microarchitecture of the flattened systolic array is shown in Figure 7. It is built from a series of small square-shaped systolic arrays (*sub-arrs*), each consisting of a grid of processing elements (PEs). Each PE repeatedly performs multiply–accumulate operations and passes data to its neighboring PEs below and to the right across the array. As described in Section 3.1, the systolic array is designed to natively accept data in the MX format. The detailed PE configuration is provided in Figure 13.

On each cycle, the flattened systolic array fetches two MLEN-wide inputs, one from the Matrix SRAM (top) and one from the Vector SRAM (left). Each input is buffered and reordered, then partitioned into MLEN/BLEN vectors (assuming MLEN is divisible by BLEN), each of length BLEN. Each of the *sub-arrs* then receives one vector from the top and one from the left.

However, a matrix unit composed solely of *sub-arrs* is insufficient to complete a $(\text{BLEN}, \text{MLEN}) \times (\text{MLEN}, \text{BLEN})$ GEMM. Each array accumulates only partial sums for a fragment of the final result; producing a complete $(\text{BLEN}, \text{BLEN})$ output requires a cross-array reduction that sums the partial sums held in the PEs across the tiled row. To address this, we integrate an output adder tree (see Figure 7) that performs the cross-array summation efficiently. This unit is invoked via a dedicated instruction, as only one cross-array summation is required when computing a GEMM along the large reduction dimension. This avoids bubbles and improves computational efficiency.
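Functionally, the flow above amounts to splitting the reduction dimension across the sub-arrays, accumulating locally, and reducing once at the end. The NumPy sketch below models that behavior at the tile level. It is not cycle-accurate, the contiguous assignment of $K$ indices to sub-arrays is an assumption, and `flattened_gemm` and `adder_tree` are hypothetical names.

```python
import numpy as np

def adder_tree(partials):
    """Pairwise reduction of per-sub-array partial sums, as an adder tree would do."""
    level = list(partials)
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return level[0]

def flattened_gemm(a, b, blen, mlen):
    """(BLEN, K) x (K, BLEN) GEMM with K streamed in MLEN-wide steps."""
    k = a.shape[1]
    assert a.shape[0] == blen and b.shape == (k, blen)
    assert mlen % blen == 0 and k % mlen == 0
    n_sub = mlen // blen
    acc = np.zeros((n_sub, blen, blen))           # output-stationary partial sums
    for k0 in range(0, k, mlen):                  # stream along the reduction dimension
        for j in range(n_sub):                    # each sub-array handles one K slice
            lo = k0 + j * blen
            acc[j] += a[:, lo:lo + blen] @ b[lo:lo + blen, :]
    return adder_tree(acc)                        # single cross-array reduction

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 1024))
b = rng.standard_normal((1024, 8))
out = flattened_gemm(a, b, blen=8, mlen=512)
print("matches a @ b:", np.allclose(out, a @ b))
```

The adder tree runs once after all of $K$ has been streamed, which mirrors the single cross-array summation described above.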

**Vector Unit.** This unit supports all vector operations required during LLM inference, including elementwise computations (e.g., addition, multiplication, and exponential) and reduction operations (e.g., summation, maximum). The vector dimension is parameterised by VLEN. A complete list of vector-unit instructions is provided in Table 12.

**Scalar Unit.** The scalar unit has two separate ALUs, one for each supported data type: Integer (INT) and Floating Point (FP). Both the INT and FP units are connected to their respective SRAMs and register files and operate independently.

INT operations are used primarily for on-chip address generation and indexing, and run on a control path decoupled from the FP datapath. In contrast, the FP unit implements basic arithmetic and the non-linear functions required by transformer workloads (e.g., exponential, reciprocal, and reciprocal square root (rsqrt)). To accommodate future models that may require additional special functions, we also include a look-up table (LUT) unit so new functions can be realized via table lookups without introducing additional logic.

### 3.3 Memory System

Our memory system is characterized by two key properties:

- Support for asymmetric precisions, variable-length memory transfers, and strided loads/stores to HBM.
- Latency hiding for HBM accesses via a hardware prefetcher, enabling high bandwidth utilization.

To make more effective use of HBM capacity, as discussed in Section 3.1, all data stored in HBM is kept in MX format. However, due to address alignment constraints, it is impractical to concatenate each data block with its associated per-block scale. This is because the resulting combined size seldom aligns to a power-of-two ($2^n$) size, making it inefficient for the memory system.

To address this problem, we store the blocks and their scales separately – laying out all blocks contiguously, followed by the corresponding scales at the end of the block region. With this technique, the memory address alignment is preserved while locality is maintained. The resulting layout is shown in Figure 8.
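The address arithmetic behind this layout is simple enough to sketch. The example assumes blocks of 32 four-bit elements with one 8-bit scale per block and byte addressing. These sizes are illustrative assumptions, `mx_layout` is a hypothetical helper, and the actual layout is the one shown in Figure 8.

```python
def mx_layout(num_blocks, block_elems=32, elem_bits=4, scale_bits=8, base=0):
    """Byte addresses for contiguous element blocks followed by a scale region."""
    block_bytes = block_elems * elem_bits // 8
    scale_bytes = scale_bits // 8
    scale_base = base + num_blocks * block_bytes      # scales start after all blocks
    block_addr = [base + i * block_bytes for i in range(num_blocks)]
    scale_addr = [scale_base + i * scale_bytes for i in range(num_blocks)]
    return block_bytes, block_addr, scale_addr

block_bytes, block_addr, scale_addr = mx_layout(num_blocks=4)
interleaved_stride = block_bytes + 1   # stride if each block carried its own scale

print("block bytes       :", block_bytes)
print("block addresses   :", block_addr)
print("scale addresses   :", scale_addr)
print("interleaved stride:", interleaved_stride)
```

With these sizes an interleaved layout would have a 17-byte stride, whereas the separated layout keeps every block on a 16-byte boundary and packs the scales densely behind them.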

To support variable-length transfers, the HBM controller integrates two data-packing units. MX-format blocks fetched via TileLink [62] (the on-chip interconnect used to access the HBM controller) are repacked into (i) MLEN-wide vectors for the Matrix SRAM and (ii) VLEN-wide vectors for the Vector SRAM. The controller automatically locates and fetches the corresponding per-block scales based on the active precision and the requested transfer size. On the write path, dedicated units accept vectors from the Matrix and Vector SRAMs, partition them into MX blocks, attach the appropriate per-block scales, and commit the aligned layout back to HBM.

The loading logic is critical to fully utilizing the HBM memory bandwidth. A hardware load unit resides in each of the Matrix and Vector SRAMs and is connected directly to the HBM controller. This enables background fetching and streaming into each SRAM while the rest of PLENA executes other instructions, sustaining full utilization of the matrix unit and avoiding stalls on HBM accesses. The two load units are controlled directly by instructions, with the amount of data to be loaded encoded in each instruction. For example, during weight–activation GEMMs, where GEMM operations are invoked repeatedly while streaming data across the hidden dimension, the loaded amount is set to this dimension, so the load instruction only needs to be issued once.

**Figure 8:** Data layout and interaction in HBM. Data of different precisions can be stored simultaneously according to the defined storage pattern in HBM. Strided load and store operations are managed by the address remap unit, which generates and passes strided addresses to the TileLink channel.

### 3.4 PLENA ISA

Our customized ISA is designed to cover all operations required for transformer inference. The instructions are structured to balance efficiency with flexibility and are built to support multiple transformer-based models and computation optimizations. In addition to FlashAttention, the ISA also supports different transformer variants, such as MHA, MLA [43], and MoE [5]. A brief summary is provided in Table 11, with the detailed specification given in Table 12.

To achieve the efficiency and flexibility balance, the ISA is designed to minimize overhead while maximizing utilization of compute and memory resources. This is achieved through features such as tile-level scheduling, which enables fine-grained control of computation and memory instructions at the tile granularity. Furthermore, the ISA defines dedicated instruction classes (Matrix, Vector, Scalar, Memory, and Control) that decouple responsibilities, simplify scheduling, and allow flexible mixing across different computation types.

The instructions (32 bits each) are dynamically passed from the CPU to the instruction buffer via PCIe. The scalar unit contains an integer register file storing on-chip addresses. Vector- or matrix-related instructions control reads and writes to the matrix and vector SRAMs using the specified integer registers. Simple arithmetic operations in the scalar unit are used for address manipulation.

### 3.5 FlashAttention

Most current accelerators cannot execute FlashAttention natively because (i) they expose only GEMM primitives and lack in-line, row-wise reductions and nonlinear operations ( $\max/\sum$ ,  $\exp$ ,  $\text{div}$ ) required for the online softmax; (ii) they lack memory-layout support such as transpose-on-read and efficient strided/blocked streaming; and (iii) they rely on rigid ISAs with fixed scheduling and coarse-grained kernel boundaries, preventing tile-by-tile flexible execution.

In PLENA, we address (i) with tightly coupled vector and scalar units that implement the required reductions and elementwise operations; the vector unit’s width is configurable to match the tile dimensions used by FlashAttention. For (ii), we introduce a *Matrix SRAM* that can be read in either standard or transposed order without extra data movement. In the  $QK^T$  step, explicitly transposing large tiles on the fly is costly in area, energy, and latency, and storing  $K^T$  in HBM is impractical because it complicates appending new  $K$  vectors to the existing  $K$  cache during decoding. The Matrix SRAM avoids both issues by banking the storage across multiple sub-SRAMs and using lightweight address remapping to present a transposed view at read time (implementation details in Figure 12). For (iii), our custom ISA offers composable, fine-grained control that enables persistent, tile-by-tile scheduling of the fused attention pipeline. This allows each operation in FlashAttention to be controlled individually at the tile level. Combined with the above capabilities, this allows PLENA to support FlashAttention natively.

## 4 Quantization

### 4.1 Asymmetric Quantization

The proposed quantization framework supports a wide range of datatypes and precisions. As shown in Figure 9, to accurately reflect hardware behavior in LLM architectures, the framework must satisfy two key requirements: 1) different operands within the same operation can be quantized to different datatypes and precisions, and 2) all operations in the model must be quantized. Table 2 summarizes existing quantization methods. Most of these approaches focus on GEMMs only, several support mixed precision, while none of them support mixed data types. In contrast, our quantization flow allows both mixed precision and mixed data types in GEMMs, with all intermediate data between GEMM operations quantized.

For GEMM operations (e.g., linear layers and matrix multiplications between activations), the two operands can have two different precisions; e.g., INT4 activations multiplied

**Table 2:** Comparison of post-training quantization methods for LLMs across key features. (QW, QACT, QKV) denote quantization of weights, activations, and key-value cache, respectively. Each decoder layer in LLaMA contains nine matrix multiplications, as outlined in Algorithm 2. **PLENA** introduces the first accuracy evaluator supporting mixed MX datatypes, providing software emulation for MXINT, MXFP, and MiniFloat formats. Unlike prior approaches, it fully simulates hardware-precision behavior in software, extending quantization beyond matrix multiplications to also cover embedding layers, LM output heads, and nonlinear operations such as RMSNorm, softmax, and SiLU (see Algorithm 2).

|                     | PLENA                  | MicroScopiQ* [54] | GPTQ[22] | QuaRot[6] | OmniQuant[60] | SmoothQuant[70] | Atom[75] | KiVi[76] | M-ANT[29] |
|---------------------|------------------------|-------------------|----------|-----------|---------------|-----------------|----------|----------|-----------|
| (QW, QACT, QKV)     | (✓,✓,✓)                | (✓,✓,✓)           | (✓,✗,✗)  | (✓,✓,✓)   | (✓,✓,✓)       | (✓,✓,✓)         | (✓,✓,✓)  | (✗,✓,✗)  | (✓,✓,✓)   |
| No. GEMMs           | 9/9                    | 9/9               | 7/9      | 7/9       | 9/9           | 7/9             | 9/9      | 0/9      | 9/9       |
| Nonlinear_FN        | ✓                      | ✗                 | ✗        | ✗         | ✓*            | ✗               | ✗        | ✗        | ✗         |
| Embedding & lm_head | ✓                      | ✗                 | ✗        | ✗         | ✗             | ✗               | ✗        | ✗        | ✗         |
| RMSNorm             | ✓                      | ✗                 | ✗        | ✗         | ✗             | ✗               | ✗        | ✗        | ✗         |
| ROPE                | ✓                      | ✗                 | ✗        | ✗         | ✗             | ✗               | ✗        | ✗        | ✗         |
| Supported datatypes | MXFP, MXINT, MiniFloat | MXFP, MXINT       | INT      | INT       | INT           | INT             | INT, FP  | INT      | MANT      |
| Mixed-precision     | ✓                      | ✓                 | ✗        | ✓         | ✓             | ✓               | ✓        | ✗        | ✓         |

✓\* denotes partial quantization support. \* At the time of writing, MicroScopiQ has not yet released its code; the comparison is based on information obtained directly from the authors.



**Figure 9:** A dataflow graph of LLM workloads in PLENA. The blue lines indicate data represented in MX datatypes, and the orange lines indicate minifloat datatypes. The GEMM and projection layers are executed on the PLENA matrix unit, which takes inputs in MX and produces outputs in minifloats. All other operations are executed on the vector unit in minifloats.

with INT8 weights. The operands may also use two different datatypes, e.g., MXFP activations multiplied with MXINT weights. The GEMM operation also models the cast of its output to minifloat. For non-GEMM operations, which are executed on the vector unit in hardware, the data is stored as minifloats. When data flows from a non-GEMM operation to a GEMM, a cast module is required to convert minifloats to the corresponding target formats, and this is also modelled in the quantization framework. Beyond this basic setup, we adopt and refine advanced quantization techniques for a more aggressive quantization scheme than plain casting, including Hessian-based quantization optimization (GPTQ) and selective online activation rotation (QuaRot).
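As an illustration of what emulating such a mixed-format GEMM can look like in software, the sketch below quantizes the two operands with different bit widths and casts the output to a reduced-mantissa float. It is a simplified stand-in, not the PLENA evaluator: `mx_int_quantize` uses a shared power-of-two scale per block as a rough model of an MXINT-style format, `minifloat_cast` ignores exponent range limits, and all function names, block sizes, and bit widths are assumptions chosen for the example.

```python
import math

def mx_int_quantize(block, bits):
    """Quantize-dequantize one block with a shared power-of-two scale (simplified)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    qmax = 2 ** (bits - 1) - 1
    # Smallest power-of-two scale for which amax / scale still fits in qmax.
    scale = 2.0 ** math.ceil(math.log2(amax / qmax))
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in block]

def minifloat_cast(x, mant_bits):
    """Round x to mant_bits mantissa bits; exponent range limits are ignored here."""
    if x == 0.0:
        return 0.0
    step = 2.0 ** (math.floor(math.log2(abs(x))) - mant_bits)
    return round(x / step) * step

def quantize_rows(mat, bits, block):
    """Apply block-wise quantization along each row (the reduction dimension)."""
    return [sum((mx_int_quantize(row[i:i + block], bits)
                 for i in range(0, len(row), block)), [])
            for row in mat]

def mixed_gemm(X, W, act_bits=4, w_bits=8, block=4, out_mant_bits=5):
    """Y = X W^T with separately quantized operands and a minifloat-cast output."""
    Xq = quantize_rows(X, act_bits, block)
    Wq = quantize_rows(W, w_bits, block)
    return [[minifloat_cast(sum(a * b for a, b in zip(x, w)), out_mant_bits)
             for w in Wq] for x in Xq]
```

The point of the structure is that each operand and the output pass through their own format-specific rounding step, so a change of datatype on one side does not require touching the other.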

### 4.2 Fusing Output-Guided Blockwise Clipping into GPTQ

GPTQ was initially designed for integer quantization. When adapting GPTQ for MXFP/MXINT, we observe that the clipping range within each microscaling block significantly affects overall model performance. To address this problem, we propose a blockwise clipping range search method that minimizes the quantization error of each output block.

Algorithm 1 outlines the quantization process of PLENA. PLENA uses per-microscaling-block quantization error to guide the search of the clipping range, and fuses this clipping range optimization into GPTQ’s iterative error propagation. This also mitigates the outlier problem of weights, which later on affects the value of the shared exponents in MX format, and eventually enables a better end-to-end model performance.

Formally, let  $\mathbf{X} \in \mathbb{R}^{M \times K}$  be the inputs for calibration, and  $\mathbf{W} \in \mathbb{R}^{N \times K}$  the layer weights. Given a linear layer  $\mathbf{Y} = \mathbf{X}\mathbf{W}^\top$ , slice the weights across the  $K$  dimension with the block size  $B$  (i.e., MLEN in Figure 6) defined in our MX data format  $\tau$ , yielding  $\mathbf{W}_b \in \mathbb{R}^{N \times B}$  to be quantized. We also slice activations across the  $K$  dimension with the same block size, giving  $\mathbf{X}_b \in \mathbb{R}^{M \times B}$ . Let  $\text{QUANTIZE}(\cdot; p, \tau)$  denote per-row quantization in data format  $\tau$  with clipping percentile  $p$ . For each row  $i = 1, \dots, N$ , we search for the percentile by

$$p_i^* = \arg \min_{p \in \mathcal{P}} \left\| \mathbf{X}_b \left( \mathbf{w}_{i,b:b+B-1} - \text{QUANTIZE}(\mathbf{w}_{i,b:b+B-1}; p, \tau) \right)^\top \right\|_2^2 \quad (5)$$

and get the quantized weight block:

$$\mathbf{q}_{i,b:b+B-1} = \text{QUANTIZE}(\mathbf{w}_{i,b:b+B-1}; p_i^*, \tau). \quad (6)$$

We now detail the per-block clipping search. Following the quantization definitions in Section 2.1, consider a block of weights  $w_\tau$  in data format  $\tau$ , with representable range  $[\min_\tau, \max_\tau]$  and empirical weight range  $[x_{\min}, x_{\max}]$ . Directly mapping the full weight range usually wastes precision due to extreme outliers. To mitigate this, we introduce a *clipping parameter*  $p \in \mathcal{P} \subset [0.5, 0.99]$ , which shrinks the

**Algorithm 1** PLENA L2-Norm-Guided Hessian-Based Weights Quantization

---

**Require:** full-precision weight matrix  $\mathbf{W} \in \mathbb{R}^{N \times K}$   
**Require:** calibration activations  $\mathbf{X} \in \mathbb{R}^{M \times K}$   
**Require:** block size  $B$  (i.e., MLEN) defined in our MX data format  $\tau$   
**Require:** percentile set  $\mathcal{P}$ , target format  $\tau$   
**Ensure:** quantized weight matrix  $\mathbf{Q}$ ; block quantization errors  $\mathbf{E}$

- 1: Initialize quantized weights  $\mathbf{Q} \leftarrow \mathbf{0} \in \mathbb{R}^{N \times K}$
- 2: Initialize quantization errors  $\mathbf{E} \leftarrow \mathbf{0} \in \mathbb{R}^{N \times B}$
- 3:  $\mathbf{H}^{-1} = (2\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}$
- 4:  $\mathbf{H}^{-1} \leftarrow \text{Cholesky}(\mathbf{H}^{-1})^\top$
- 5: **for** each block  $b = 0, B, 2B, \dots, K - 1$  **do**
- 6:    $b_2 \leftarrow \min(b+B, K)$
- 7:    $\mathbf{W}_b \leftarrow \mathbf{W}_{:,b:b_2}$  ▷ Extract weight block
- 8:    $\mathbf{X}_b \leftarrow \mathbf{X}_{:,b:b_2}$  ▷ Extract activation block
- 9:   Initialize  $\mathbf{Q}_b^{\text{best}} \leftarrow \mathbf{0}$ ,  $\epsilon^{\text{best}} \leftarrow \infty$
- 10:   **for** each candidate percentile  $p \in \mathcal{P}$  **do**
- 11:      $\tilde{\mathbf{W}}_b \leftarrow \text{QUANTIZE}(\mathbf{W}_b, p, \tau)$ ;  $\epsilon \leftarrow \|\mathbf{X}_b \mathbf{W}_b^\top - \mathbf{X}_b \tilde{\mathbf{W}}_b^\top\|_2^2$
- 12:      $\text{mask} \leftarrow \epsilon < \epsilon^{\text{best}}$
- 13:      $\mathbf{Q}_b^{\text{best}}[\text{mask}, :] \leftarrow \tilde{\mathbf{W}}_b[\text{mask}, :]$ ;  $\epsilon^{\text{best}}[\text{mask}] \leftarrow \epsilon[\text{mask}]$
- 14:    $\mathbf{Q}_{:,b:b_2} \leftarrow \mathbf{Q}_b^{\text{best}}$ ;  $\Delta_b \leftarrow \mathbf{W}_b - \mathbf{Q}_b^{\text{best}}$ ;  $\mathbf{d}_{bb} \leftarrow \text{diag}(\mathbf{H}_{b:b_2, b:b_2}^{-1})$
- 15:    $\mathbf{E}_b \leftarrow \Delta_b \text{diag}(\mathbf{d}_{bb})^{-1}$ ;  $\mathbf{W}_{:,b_2:} \leftarrow \mathbf{W}_{:,b_2:} - \mathbf{E}_b \cdot \mathbf{H}_{b:b_2, b_2:}^{-1}$

---

effective range to  $[p x_{\min}, p x_{\max}]$ . Adopting the symmetric quantization setting with the zero-point fixed at  $z = 0$ , the corresponding scale factor is

$$s(p) = \frac{px_{\max}}{\max_\tau}. \quad (7)$$

The blockwise quant-dequant operator then becomes

$$\hat{w}(p) = s(p) \text{clip}\left(\text{RTN}\left(\frac{w}{s(p)}\right), \min_\tau, \max_\tau\right). \quad (8)$$

where RTN denotes round-to-nearest MX numbers.

By sweeping over a discrete candidate set  $\mathcal{P}$  of clipping parameters, we evaluate multiple effective ranges and select the  $p$  per block that minimizes the output reconstruction loss defined in Equation (5).
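The search in Equations (5) to (8) can be sketched for a single weight row and a single block as follows. This is a hedged illustration rather than the PLENA implementation: a symmetric integer grid with limit `qmax` stands in for the MX format $\tau$, the scale uses the largest absolute weight in place of $x_{\max}$, and GPTQ's error propagation is left out. The function names are assumptions.

```python
def quant_dequant(w, p, qmax):
    """Eq. (7)-(8): scale from the clipped range, round to nearest, clip, rescale."""
    x_max = max(abs(v) for v in w)
    if x_max == 0.0:
        return list(w)
    s = p * x_max / qmax                       # Eq. (7)
    return [s * max(-qmax, min(qmax, round(v / s))) for v in w]   # Eq. (8)

def output_error(X_b, w, w_hat):
    """Squared L2 norm of X_b (w - w_hat)^T, the objective in Eq. (5)."""
    diff = [a - b for a, b in zip(w, w_hat)]
    return sum(sum(x * d for x, d in zip(row, diff)) ** 2 for row in X_b)

def search_clip(X_b, w, percentiles, qmax=7):
    """Return (best p, its output error, quantized block) over the candidate set."""
    best = None
    for p in percentiles:
        w_hat = quant_dequant(w, p, qmax)
        err = output_error(X_b, w, w_hat)
        if best is None or err < best[1]:
            best = (p, err, w_hat)
    return best
```

Measuring the error on $\mathbf{X}_b \mathbf{w}^\top$ rather than on the weights alone is what makes the search output-guided: a clipping value that distorts a large weight can still win if the calibration activations give that weight little influence on the output.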

### 4.3 Selective Online Activation Rotation

As shown in prior work, activations in LLMs typically contain more outliers than weights [37] and are therefore more sensitive to quantization. QuaRot [6] recently demonstrated that applying a rotation matrix to LLMs can effectively suppress outliers. However, we observe that rotating all tensors, as suggested by QuaRot, may not yield the best performance for MX formats. When a tensor (e.g., the weight matrix) does not exhibit significant outliers, the benefit of rotation diminishes. Equation (9) is a simplified form of the rotation mechanism in QuaRot [6], where a Hadamard matrix  $\mathbf{H}$  smooths out the activation distributions, and its inverse is fused into the weights. We performed experiments to empirically identify the activation tensors with extreme outliers and propose a selective online rotation scheme.

We notice that applying the rotation to finer-grained weight quantization (e.g., MXINT with smaller block sizes) may increase perplexity. Intuitively, weights have smaller dynamic ranges than activations. The rotation may be unnecessary since most weight outliers are effectively captured by the shared exponents, while rotating the weights with  $\mathbf{H}$  leads to different quantized values, which may degrade model performance.

$$\mathbf{Y} = \text{Quantize}(\mathbf{XH}) \cdot \text{Quantize}(\mathbf{H}^{-1}\mathbf{W}) \quad (9)$$

To address the observation that weights with fine-grained blocking do not need rotation, we propose an *activation-only rotation* strategy. As shown in Equation (10), the inverse rotation matrix  $\mathbf{H}^{-1}$  is decoupled from weight quantization and is instead applied directly to the quantized rotated activations at runtime.

$$\mathbf{Y} = \text{Quantize}(\mathbf{XH}) \cdot \mathbf{H}^{-1} \cdot \text{Quantize}(\mathbf{W}) \quad (10)$$

The activation distribution varies significantly across layers. Consequently, the effect of rotation also differs from layer to layer. Rather than rotating all activations, we apply the rotation matrix *selectively*. A search is performed to identify the layers where rotation yields the greatest benefit. This selective activation rotation is performed on-the-fly (the green paths in Figure 5). The ablation of the above quantization modifications is shown in Table 8.
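A small numerical sketch may help show why a Hadamard rotation suppresses outliers and how Equation (10) is evaluated. It assumes a normalized Sylvester-type Hadamard matrix, for which $\mathbf{H}^{-1} = \mathbf{H}^\top$, takes $\mathbf{W}$ with shape $K \times N$ as in Equations (9) and (10), and accepts the quantizer as a caller-supplied function. The helper names are illustrative and are not taken from PLENA or QuaRot.

```python
def hadamard(n):
    """Normalized Sylvester Hadamard matrix; n must be a power of two."""
    H = [[1.0]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    s = n ** -0.5
    return [[v * s for v in row] for row in H]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def activation_only_rotation(X, W, H, quantize):
    """Eq. (10): Quantize(XH) . H^{-1} . Quantize(W), with H^{-1} = H^T."""
    x_rot_q = quantize(matmul(X, H))           # rotate, then quantize activations
    x_back = matmul(x_rot_q, transpose(H))     # undo the rotation at runtime
    return matmul(x_back, quantize(W))         # weights are quantized unrotated
```

For example, the row `[8.0, 0.0, 0.0, 0.0]` rotated by `hadamard(4)` becomes `[4.0, 4.0, 4.0, 4.0]`: the single outlier is spread evenly, so the largest magnitude a quantizer must represent is halved. With an identity `quantize`, the function reduces to the unrotated product $\mathbf{X}\mathbf{W}$, confirming that the rotation itself introduces no error.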

## 5 PLENA Software Tooling

As discussed in Table 1, existing works lack several key components necessary to achieve complete end-to-end LLM inference. Some of these missing elements include a compiler, a simulator, and design-space exploration tools. In contrast, PLENA features a complete design and verification framework that allows it to rapidly adapt to new models or even new hardware accelerators and optimize for them. We also anticipate that future accelerators in the field could reuse certain components of this comprehensive framework to establish end-to-end performance comparisons.

### 5.1 Compiler

To efficiently deploy decoder-style LLMs, we design a compiler stack that exclusively targets LLMs on our PLENA hardware. The models are first exported from the PyTorch framework into the ONNX format [7], where standard graph optimizations such as constant folding are applied. The optimized graph is then parsed into our custom IR through pattern matching, which lowers high-level operators into primitives such as GEMM, quantization, dequantization, and FlashAttention.

The critical challenge lies in searching for an optimal scheduling strategy tailored to PLENA. Our scheduling policies include operator fusion, tiling configurations, memory placement, and loop transformations, which jointly determine data reuse, memory traffic, and compute unit utilization. To accelerate the search, we systematically traverse candidate configurations and validate them by checking memory footprint constraints and transaction requirements. Feasible candidates are further evaluated by a lightweight roofline-based performance model, and finally, the top-K schedules are selected to generate the assembly code for execution on PLENA.
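The traverse, filter, rank flow described above can be sketched for the tiling of a single GEMM. Everything in this example is an assumption made for illustration: the candidate tile sizes, the footprint and traffic formulas, and the hardware parameters are placeholders, and the actual PLENA compiler also considers fusion, memory placement, and loop transformations.

```python
from itertools import product

def enumerate_schedules(M, N, K, sram_bytes, peak_flops, hbm_bw,
                        bytes_per_elem=1, top_k=3):
    """Rank GEMM tilings (tm, tn, tk) by a roofline time estimate."""
    candidates = []
    for tm, tn, tk in product([32, 64, 128, 256], repeat=3):
        if M % tm or N % tn or K % tk:
            continue                      # keep tilings that divide the problem evenly
        footprint = (tm * tk + tk * tn + tm * tn) * bytes_per_elem
        if footprint > sram_bytes:
            continue                      # memory-footprint constraint
        # Simple traffic model: A is re-read once per column tile, B once per
        # row tile, and C is written once.
        traffic = (M * K * (N // tn) + K * N * (M // tm) + M * N) * bytes_per_elem
        flops = 2 * M * N * K
        est_time = max(flops / peak_flops, traffic / hbm_bw)   # roofline bound
        candidates.append((est_time, (tm, tn, tk)))
    candidates.sort()
    return candidates[:top_k]
```

Filtering on footprint before scoring keeps the expensive part of the search small, and the roofline estimate only has to be good enough to order candidates, since the surviving top-K schedules are the ones that get lowered to assembly.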

### 5.2 Cycle-accurate Simulator

Our Rust-based cycle-accurate simulator offers significant advantages over the functional-level simulators used in most published accelerators:

- Supports full cycle-accurate emulation.
- Event-driven simulation that directly executes the generated machine code from the compiler.
- HBM-enabled simulation, incorporating realistic HBM timing and bandwidth characteristics (via Ramulator [42]).

This simulator supports the same data types and precisions as the PLENA accelerator, and we verified that its results closely match those of the RTL simulation of the accelerator.

### 5.3 Hardware-Software Co-Design

To automate the search for optimal hardware design and quantization parameters, we employ active learning for design space exploration (DSE). We also provide the capability to investigate trade-offs between different objectives, such as maximizing accuracy while minimizing latency and area. For this, we employ multi-objective Bayesian optimization (BO), which explores the Pareto frontier in an active manner.

BO is a framework for optimizing non-differentiable functions [59]. Multi-objective BO searches for optimal points in the design space that minimize a multi-objective function  $f$ , i.e.  $f(\mathbf{x}_*) = \min_{\mathbf{x}} f(\mathbf{x})$ . In our case, the objective function has three components: perplexity, latency, and chip area:  $f = [f_p(\cdot), f_l(\cdot), f_a(\cdot)]$ .  $f$  is modelled with a multi-output Gaussian Process, which keeps track of the predictive mean and uncertainty for all points  $\mathbf{x}$  in the design space. BO selects which candidate to evaluate next such that uncertainty is reduced globally, while also returning to regions with high predictive mean to further improve upon previous points with favorable outcomes. BO scales to high-dimensional spaces [40, 66], supports both discrete and continuous search variables [9, 17, 19], and does not impose limiting restrictions on the properties of the objective  $f$ . Its model of the global posterior also facilitates interpretable analysis of the search results. Hence, this setup yields a flexible and informative framework for automating DSE.

We base our DSE implementation on the Optuna package [3] and conduct experiments with a BoTorch sampler



**Figure 10:** An overview of the open-source PLENA system.

and a tree-search sampler. With the BoTorch [9] sampler, we treat the design space as continuous during posterior modelling, but discretize the points proposed by BO when evaluating concrete design choices. We also test the alternative of using the Tree-Structured Parzen Estimator [68], which is often used for discrete spaces.

In our co-design setup, we incorporate post-training quantization directly into the optimization loop. This allows us to evaluate candidate hardware and quantization configurations jointly, using pre-trained model weights while searching over quantization parameters such as datatype and precision settings for activations and KV cache. The joint search space is defined in Table 14. For each candidate design, we assess *accuracy*, *latency*, and *area*:

- Accuracy is measured in terms of language modeling quality, where we evaluate perplexity on Wikitext2 using our accuracy evaluator.
- Latency and area utilization are obtained from our Roofline-based simulators, as illustrated in Figure 10.

To ensure efficient exploration, we impose input constraints over the design space (Table 15) and apply rejection sampling to discard invalid or infeasible candidates. This avoids unnecessary costly objective evaluations and accelerates convergence of the search. We first conduct experiments on LLaMA3.2-1B to enable rapid iteration, and then extend our evaluation to LLaMA-3-8B. The results are described in Section 6.3.
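Two pieces of this loop are independent of the sampler and can be sketched in a few lines: rejecting candidates that violate input constraints before any costly evaluation, and extracting the Pareto-optimal set from evaluated (perplexity, latency, area) triples. The sketch below is not the Optuna or BoTorch machinery used in the paper; the design-space keys and the constraint passed in by the caller are hypothetical, and all objectives are assumed to be minimized.

```python
import random

def dominates(a, b):
    """a dominates b: no worse in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

def rejection_sample(rng, space, is_valid, max_tries=1000):
    """Draw candidates uniformly until one satisfies the input constraints."""
    for _ in range(max_tries):
        cand = {name: rng.choice(values) for name, values in space.items()}
        if is_valid(cand):
            return cand
    raise RuntimeError("no valid candidate found within max_tries")
```

As a usage example, `rejection_sample(random.Random(0), {"rows": [2, 4, 8], "cols": [128, 256, 512]}, lambda c: c["rows"] * c["cols"] <= 2048)` only ever returns array shapes within a multiplier budget, so the accuracy evaluator and the simulator are never invoked on a design that could not be built.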

## 6 Evaluation

### 6.1 Experiment Setup

**Models and Datasets.** We evaluate our quantization framework mainly on two families of LLMs, namely LLaMA-2 [64] and LLaMA-3 [45]. We also demonstrate our system on MoE (e.g., GPT-OSS) and MLA-based (QWen-MLA) models. Quantization performance is measured in terms of perplexity on the WikiText-2 dataset [44]. The entire quantization process requires approximately 2–20 GPU hours on NVIDIA H100 GPUs, depending on the model size and configuration.

**Quantization Baselines.** We compare against several state-of-the-art quantization methods, including software-based approaches such as GPTQ [22], OmniQuant [61], and QuaRot [6], as well as hardware-accelerated approaches such as Atom [75] and MicroScopiQ [54].

**Accelerator Implementation.** PLENA is implemented in SystemVerilog RTL. We perform synthesis using Synopsys Design Compiler with the 7 nm OpenROAD predictive process design kit [13] to generate area and power estimates under a 1 GHz clock frequency.

**Accelerator Baselines.** Since the works we selected for comparison, MicroScopiQ [54], FIGNA [31], and Olive [25], are not open-source and were not evaluated using the same technology node or toolchain, we re-implemented their core components and integrated them into the PLENA system for a fair inference performance comparison. Additionally, Deep-Scale [58] is used for overall system performance estimation, scaling all designs to the 7 nm process. The detailed area and power of their core units are evaluated using our own implementations.

**Inference Process.** Beyond SOTA accelerators, we also evaluate against the latest high-performance commercial compute units, including GPUs (A100 80G) and TPUs (v6e-8). The GPU experiments are conducted in an environment with Ubuntu 22.04, CUDA 12.8, Python 3.11, PyTorch 2.8.0, and vLLM 0.10 V1. The TPU experiments are conducted in an environment with the v2-alpha-tpuv6e software and the vllm\vllm\_tpu Docker image.

### 6.2 Quantization

**Quantization Comparison.** We evaluate our quantization method against related work; results are summarized in Table 3. For a fair comparison, we first match prior settings by quantizing only the nine GEMMs in the LLaMA decoder. We then report full-system experiments that also quantize nonlinear functions, RoPE, and embeddings in Table 4. In the W4A4KV16 setting, our results outperform all related work. For LLaMA-3-8B, our method reduces perplexity by at least 1.24 compared with prior approaches. The key contributions to this performance improvement come from three aspects: 1). **MXINT operation:** While previous

**Table 3:** WikiText-2 perplexity (lower is better) under GEMM-only emulation (nonlinear ops left in full precision) for LLaMA. Results for GPTQ, AWQ, OmniQuant, and Atom are taken from MicroScopiQ; QuaRot numbers are from the paper or reproduced when not reported. W/A/KV denote bit widths for weights, activations, and KV cache.

| <b>Method</b>    | <b>W/A/KV</b> | <b>LLaMA-2 [64]</b> |             |             | <b>LLaMA-3 [45]</b> |             |
|------------------|---------------|---------------------|-------------|-------------|---------------------|-------------|
|                  |               | <b>7B</b>           | <b>13B</b>  | <b>70B</b>  | <b>8B</b>           | <b>70B</b>  |
| Baseline         | 16/16/16      | 5.47                | 4.83        | 3.31        | 6.13                | 2.85        |
| GPTQ [22]        | 4/16/16       | 6.23                | 5.58        | 4.28        | 8.12                | 3.75        |
| AWQ [39]         | 4/16/16       | 5.82                | 5.19        | 4.08        | 7.96                | 3.58        |
| OmniQuant [61]   | 4/16/16       | 5.74                | 5.02        | 3.47        | 7.09                | 3.46        |
| MicroScopiQ [54] | 4/16/16       | 5.65                | 5.02        | 3.42        | 6.89                | 3.25        |
| QuaRot [6]       | 4/16/16       | <b>5.60</b>         | 5.00        | 3.41        | 6.52*               | 3.53*       |
| Ours             | 4/16/16       | 5.61                | <b>4.97</b> | <b>3.41</b> | <b>6.45</b>         | 3.59        |
| OmniQuant [61]   | 4/4/16        | 11.47               | 8.32        | 5.41        | 10.21               | 5.30        |
| SmoothQuant [70] | 4/4/16        | 20.47               | 15.63       | 17.62       | 29.54               | 19.32       |
| Atom [75]        | 4/4/16        | 6.16                | 6.12        | 5.20        | 8.12                | 4.69        |
| MicroScopiQ [54] | 4/4/16        | 6.11                | 5.57        | 4.48        | 8.12                | 4.65        |
| QuaRot [6]       | 4/4/16        | 6.02*               | 5.36*       | 3.78        | 8.00*               | 6.33*       |
| M-ANT [29]       | 4/4/16        | 5.92                | 5.24        | -           | -                   | -           |
| Ours             | 4/4/16        | <b>5.69</b>         | <b>5.03</b> | <b>3.59</b> | <b>6.76</b>         | <b>4.51</b> |
| QuaRot [6]       | 4/4/4         | 6.10                | 5.40        | 3.79        | 8.16                | 6.66        |
| QuaRot-128G [6]  | 4/4/4         | 5.93                | 5.26        | <b>3.61</b> | 7.36                | 5.51        |
| Ours             | 4/4/4         | <b>5.89</b>         | <b>5.18</b> | 3.62        | <b>7.22</b>         | <b>4.77</b> |

Note: Results marked with \* are reproduced from the authors' released code. Specifically, for the QuaRot 4/4/16 configuration, we follow the experimental setup described in their paper, where activations are per-token symmetric-quantized with a clipping ratio of 0.9.

**Table 4:** Quantization results for LLaMA comparing GEMM-only quantization with full-system quantization (including GEMM, nonlinear ops, input embeddings, and LM head). Nonlinear operators are simulated in MiniFloat E6M5.

| <b>Method</b>    | <b>W/A/KV</b> | <b>LLaMA-2 [64]</b> |            |            | <b>LLaMA-3 [45]</b> |            |
|------------------|---------------|---------------------|------------|------------|---------------------|------------|
|                  |               | <b>7B</b>           | <b>13B</b> | <b>70B</b> | <b>8B</b>           | <b>70B</b> |
| Baseline         | 16/16/16      | 5.47                | 4.83       | 3.31       | 6.13                | 2.85       |
| Ours             | 4/4/4         | 5.89                | 5.18       | 3.62       | 7.22                | 4.77       |
| Ours-Full System | 4/4/4         | 5.91                | 5.19       | 3.63       | 7.23                | 4.82       |

work [29] adopts a group size of 32, our design keeps the group size small while still maintaining high hardware efficiency. 2). **Selective rotation:** Our approach searches for the best layer-wise rotation combination for each model. Unlike QuaRot [6], which merges rotation into weights, we apply online rotation only to specific layers. This provides an additional design space for finding optimal solutions in the PTQ setting. 3). **Clipping strategy:** By integrating *output-guided*, blockwise clipping into iterative weight quantization, we validate that output reconstruction error correlates strongly with end-task performance; consequently, our approach substantially reduces perplexity degradation.

**Full System Design.** We performed a brute-force sweep to select the vector-core precision and found that quantizing the remaining operators to a MiniFloat E6M5 format is effectively lossless in perplexity while reducing precision

**Table 5:** System-level comparison across standard (Prompt = 1k, Gen = 128) and agentic (Prompt = 5.6k, Gen = 85k) workloads. For fairness, we use four A100 GPUs with a total HBM capacity of 320 GB as the reference. PLENA and MicroScopiQ are both assumed to have four cores and identical HBM configurations, including capacity and bandwidth. The selected configurations are listed in Table 17. Since the GPU’s silicon area includes significant overhead for non-compute functionality, we ensure that the multiplier count is matched across systems for a balanced comparison.

| System             | LLaMA-3.1-8B |                                |          |                                | LLaMA-3.3-70B |                      |          |                                |
|--------------------|--------------|--------------------------------|----------|--------------------------------|---------------|----------------------|----------|--------------------------------|
|                    | Standard     |                                | Agentic  |                                | Standard      |                      | Agentic  |                                |
|                    | TTFT (s)     | TPS ( $\times$ A100)           | TTFT (s) | TPS ( $\times$ A100)           | TTFT (s)      | TPS ( $\times$ A100) | TTFT (s) | TPS ( $\times$ A100)           |
| A100 80G           | 6.20         | 1.00 $\times$                  | 0.22     | 1.00 $\times$                  | 1.12          | 1.00 $\times$        | 1.05     | 1.00 $\times$                  |
| A100 80G (With Q)* | 5.13         | 1.66 $\times$                  | 0.19     | 1.39 $\times$                  | 3.41          | 1.23 $\times$        | 2.46     | 1.32 $\times$                  |
| TPU v6e            | 5.63         | 0.90 $\times$                  | 4.58     | 0.39 $\times$                  | 50.07         | 0.31 $\times$        | 7.98     | 0.84 $\times$                  |
| MicroScopiQ [54]   | 16.43        | 0.35 $\times$                  | 3.27     | 0.57 $\times$                  | 61.63         | 0.16 $\times$        | 19.23    | 0.09 $\times$                  |
| <b>PLENA</b>       | 4.19         | <b>2.24<math>\times</math></b> | 0.21     | <b>1.58<math>\times</math></b> | 5.65          | 1.19 $\times$        | 1.27     | <b>1.49<math>\times</math></b> |

Note: The MicroScopiQ [54] baseline was re-implemented by us, and we deploy its replicated compute unit on the PLENA platform for testing. The version of LLaMA-3.1-8B used is LLaMA-3.1-8B-Instruct-quantized.w8a8. (With Q)\* denotes QuaRot quantization [6].

**Table 6:** System-level comparison on GPT-OSS 20B (MoE) [5] and QWen with MLA [43], showing that PLENA can be adapted to new models with both MLA and MoE configurations and achieve higher TPS than A100 80G under the same experimental settings as Table 5.

| System       | GPT-OSS 20B (MoE) |                                |          |                                | qwen2.5-7B* |                                |          |                                |
|--------------|-------------------|--------------------------------|----------|--------------------------------|-------------|--------------------------------|----------|--------------------------------|
|              | Standard          |                                | Agentic  |                                | Standard    |                                | Agentic  |                                |
|              | TTFT (s)          | TPS ( $\times$ A100)           | TTFT (s) | TPS ( $\times$ A100)           | TTFT (s)    | TPS ( $\times$ A100)           | TTFT (s) | TPS ( $\times$ A100)           |
| A100 80G     | 9.39              | 1.00 $\times$                  | 1.87     | 1.00 $\times$                  | 8.21        | 1.00 $\times$                  | 1.17     | 1.00 $\times$                  |
| <b>PLENA</b> | 6.13              | <b>1.36<math>\times</math></b> | 1.41     | <b>1.21<math>\times</math></b> | 5.71        | <b>1.42<math>\times</math></b> | 1.52     | <b>1.30<math>\times</math></b> |

Note: The remaining accelerators and TPUs are not included since they do not support these configurations.

by 25% relative to FP16. As shown in Table 4, the perplexity increase under full-system quantization is at most 0.05.

### 6.3 Co-design

This subsection shows the results of our design space exploration experiments. Figure 11 shows the Empirical Attainment Surfaces (EAS) for the Pareto fronts found when optimizing with LLaMA3.2-1B and LLaMA-3-8B. EAS is a visualization approach well-suited for conveying the uncertainty of the Pareto fronts from multiple runs with different random seeds [21, 33]. Existing tools support visual analysis for two objectives [67], hence we plot EAS for accuracy and latency first. Then, in Table 16, we analyze the relationship between all objectives. Figure 11 shows that active learning with the BoTorch sampler achieves a significantly better trade-off between latency and perplexity than naive randomized sampling. The Tree-Structured Parzen Estimator (TPE) shows more modest gains than the BoTorch sampler when optimizing with LLaMA3.2-1B, so we focus on the latter for experiments with LLaMA-3-8B.

### 6.4 Compute Performance

The system-level performance comparison is shown in Table 5, evaluating both small and large GQA-based LLaMA



**Figure 11:** Empirical Attainment Surfaces for latency ( $\downarrow$ ) and perplexity ( $\downarrow$ ) objectives across multiple seeds, evaluated with LLaMA3.2-1B and LLaMA-3-8B over the co-design space shown in Table 14. For the 1B model, we run 9 seeds with 50 trials, comparing BoTorch and TPE methods against Random sampling. For the 8B model, we run 5 seeds with 50 trials, comparing BoTorch against Random. Shaded regions show the 25% and 75% attainment bands across seeds.

models as well as the recently published MoE-based GPT-OSS model, all implemented in 7 nm technology and supporting long-context inputs. This experiment investigates peak TPS by scaling the batch size to the maximum capacity that HBM can accommodate. As shown, PLENA achieves consistently higher TPS than both the A100 and TPU v6e under identical HBM settings and multiplier counts, with peak performance reaching up to 2.24 $\times$  that of the A100 and 3.85 $\times$  that of the TPU v6e. The higher TTFT observed in PLENA is explained by its ability to store more batches within the same HBM capacity using our quantization scheme. As batch size increases, the prefill stage grows longer due to additional memory accesses and computation.

**Table 7:** Compute area, utilization, and attainable FLOPs of systolic arrays under W4A4KV4 bitwidth for LLaMA-3.3-70B. Baselines use  $64 \times 64$  arrays, while PLENA employs a flattened (4, 512) array. Results are shown for Standard (Prompt = 1k, Gen = 128) and Agentic (Prompt = 5.6k, Gen = 8k) workloads.

| Metric                        | Micro [54] | Olive [25] | FIGNA [31] | <b>PLENA</b> |
|-------------------------------|------------|------------|------------|--------------|
| Comp Area (mm<sup>2</sup>)    | 0.1378     | 0.319      | 0.471      | 0.237        |
| TOPs/mm<sup>2</sup>           | 59.45      | 25.66      | 17.39      | 34.49        |
| S. A FLOPs/mm<sup>2</sup>\*   | 28.76      | 11.59      | 7.51       | <b>32.80</b> |
| A. A FLOPs/mm<sup>2</sup>\*   | 1.08       | 0.44       | 0.31       | <b>5.31</b>  |

\*Attainable FLOPs are computed from utilization and peak design throughput.  
Micro = MicroScopiQ. S. A FLOPs = Standard workload attainable FLOPs. A. A FLOPs = Agentic workload attainable FLOPs.

As shown in Table 7, PLENA achieves significantly higher utilization than prior designs in both short- and long-context workloads, with up to 8.5× improvement in attainable utilization.
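The relationship between the rows of Table 7 can be checked with a short calculation. The sketch below assumes that attainable throughput equals utilization times peak throughput (as the table footnote states) and that the peak and attainable rows share the same unit, so the implied utilization is simply their ratio. The numbers are copied from Table 7; the dictionary layout and function name are illustrative.

```python
# Peak and attainable throughput per mm^2, copied from Table 7.
table7 = {
    "Micro": {"peak": 59.45, "standard": 28.76, "agentic": 1.08},
    "Olive": {"peak": 25.66, "standard": 11.59, "agentic": 0.44},
    "FIGNA": {"peak": 17.39, "standard": 7.51, "agentic": 0.31},
    "PLENA": {"peak": 34.49, "standard": 32.80, "agentic": 5.31},
}


def implied_utilization(design: str, workload: str) -> float:
    """Utilization implied by attainable = utilization * peak."""
    row = table7[design]
    return row[workload] / row["peak"]


for workload in ("standard", "agentic"):
    plena = implied_utilization("PLENA", workload)
    for baseline in ("Micro", "Olive", "FIGNA"):
        ratio = plena / implied_utilization(baseline, workload)
        print(f"{workload:8s} PLENA vs {baseline:5s}: {ratio:.2f}x utilization")
```

For the agentic workload, the implied utilization ratios against the three baselines fall between roughly 8.5 and 9, in line with the magnitude of the improvement quoted above.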

## 7 Conclusion

This paper introduces **PLENA**, a hardware–software co-designed system that features a flattened systolic array, an asymmetric quantization scheme, and native architectural support for FlashAttention, addressing the underutilization challenges posed by the memory bandwidth and capacity walls. Beyond the hardware, PLENA is supported by a full toolchain (a compiler, a cycle-emulated simulator, and a design space exploration framework) that enables rapid adaptation and optimization for emerging transformer models. Future work will focus on further optimizing GEMM utilization in FlashAttention and extending PLENA with a multi-core flattened systolic array to better exploit parallelism. In addition, the compiler can be enhanced to provide finer-grained control over execution scheduling. Finally, we plan to integrate PLENA with GPUs to form a heterogeneous LLM acceleration system.

## References

- [1] Mayank Agarwal, Jorge J. Barroso, Tathagata Chakraborti, Eli M. Dow, Kshitij Fadnis, Borja Godoy, Madhavan Pallan, and Kartik Talamadupula. 2020. Project CLAI: Instrumenting the Command Line as a New Environment for AI Agents. arXiv:2002.00762 [cs.HC] <https://arxiv.org/abs/2002.00762>
- [2] Meta AI. 2025. *The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation*. <https://ai.meta.com/blog/llama-4-multimodal-intelligence/> Accessed: 2025-08-16.
- [3] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A Next-generation Hyperparameter Optimization Framework. arXiv:1907.10902
- [4] Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, and Yuxiong He. 2022. DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. arXiv:2207.00032 [cs.LG] <https://arxiv.org/abs/2207.00032>
- [5] Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, Giri Anantharaman, Xian Li, Shuohui Chen, Halil Akin, Mandeep Baines, Louis Martin, Xing Zhou, Punit Singh Koura, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Mona Diab, Zornitsa Kozareva, and Ves Stoyanov. 2022. Efficient Large Scale Language Modeling with Mixtures of Experts. arXiv:2112.10684 [cs.CL] <https://arxiv.org/abs/2112.10684>
- [6] Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. 2024. QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs. arXiv:2404.00456 [cs.LG] <https://arxiv.org/abs/2404.00456>
- [7] Junjie Bai, Fang Lu, and Ke Zhang. 2019. ONNX: Open Neural Network Exchange. <https://github.com/onnx/onnx>. GitHub repository (2019).
- [8] Yushu Bai, Jiajie Zhang, Xin Lv, Linzhi Zheng, Siqi Zhu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2024. LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs. arXiv:2408.07055 [cs.CL] <https://arxiv.org/abs/2408.07055>
- [9] Maximilian Balandat, Brian Karrer, Daniel Jiang, Samuel Daulton, Ben Letham, Andrew G Wilson, and Eytan Bakshy. 2020. BoTorch: A framework for efficient Monte-Carlo Bayesian optimization. *Advances in neural information processing systems* 33 (2020), 21524–21538.
- [10] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. 2020. Piqa: Reasoning about physical commonsense in natural language. In *Proceedings of the AAAI conference on artificial intelligence*, Vol. 34. 7432–7439.
- [11] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs.CL] <https://arxiv.org/abs/2005.14165>
- [12] Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, and Alexandre Lacoste. 2025. The BrowserGym Ecosystem for Web Agent Research. arXiv:2412.05467 [cs.LG] <https://arxiv.org/abs/2412.05467>
- [13] L. T. Clark, V. Vashishtha, L. Shifren, A. Gujja, S. Sinha, B. Cline, C. Ramamurthy, and G. Yeric. 2016. ASAP: A 7-nm finFET predictive process design kit. *Microelectronics Journal* 53 (July 2016), 105–115. doi:10.1016/j.mejo.2016.04.006
- [14] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. *arXiv preprint arXiv:1803.05457* (2018).
- [15] NVIDIA Corporation. 2024. *NVIDIA Blackwell Architecture Technical Brief*. Technical Report. NVIDIA Corporation. <https://resources.nvidia.com/en-us-blackwell-architecture>
- [16] Tri Dao. 2023. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv:2307.08691 [cs.LG] <https://arxiv.org/abs/2307.08691>
- [17] Samuel Daulton, Xingchen Wan, David Eriksson, Maximilian Balandat, Michael A Osborne, and Eytan Bakshy. 2022. Bayesian optimization over discrete and mixed spaces via probabilistic reparameterization. *Advances in Neural Information Processing Systems* 35 (2022), 12760–12774.
- [18] Michael Davies, Neal Crago, Karthikeyan Sankaralingam, and Christos Kozyrakis. 2025. Efficient LLM Inference: Bandwidth, Compute, Synchronization, and Capacity are all you need. arXiv:2507.14397 [cs.AR] <https://arxiv.org/abs/2507.14397>
- [19] Aryan Deshwal, Syrine Belakaria, and Janardhan Rao Doppa. 2021. Bayesian optimization over hybrid spaces. In *International Conference on Machine Learning*. PMLR, 2632–2643.
- [20] Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. 2024. WorkArena: How Capable are Web Agents at Solving Common Knowledge Work Tasks?. In *Proceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 235)*, Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (Eds.). PMLR, 11642–11662. <https://proceedings.mlr.press/v235/drouin24a.html>
- [21] Carlos M Fonseca, Andreia P Guerreiro, Manuel López-Ibáñez, and Luís Paquete. 2011. On the computation of the empirical attainment function. In *International Conference on Evolutionary Multi-criterion Optimization*. Springer, 106–120.
- [22] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. Gptq: Accurate post-training quantization for generative pre-trained transformers. *arXiv preprint arXiv:2210.17323* (2022).
- [23] Amir Gholami, Zhewei Yao, Sehoon Kim, Coleman Hooper, Michael W. Mahoney, and Kurt Keutzer. 2024. AI and Memory Wall. *IEEE Micro* 44, 3 (May 2024), 33–39. doi:10.1109/MM.2024.3373763
- [24] Google. 2025. *System Architecture: TPU VM*. Technical Report. Google Cloud. Last updated August 1, 2025.
- [25] Cong Guo, Jiaming Tang, Weiming Hu, Jingwen Leng, Chen Zhang, Fan Yang, Yunxin Liu, Minyu Guo, and Yuhao Zhu. 2023. Olive: Accelerating large language models via hardware-friendly outlier-victim pair quantization. In *Proceedings of the 50th Annual International Symposium on Computer Architecture*. 1–15.
- [26] Cong Guo, Chiyue Wei, Jiaming Tang, Bowen Duan, Song Han, Hai Li, and Yiran Chen. 2025. Transitive Array: An Efficient GEMM Accelerator with Result Reuse. arXiv:2504.16339 [cs.AR] <https://arxiv.org/abs/2504.16339>
- [27] Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. 2024. WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. arXiv:2401.13919 [cs.CL] <https://arxiv.org/abs/2401.13919>
- [28] Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, XiuHong Li, Jun Liu, Kangdi Chen, Yuhan Dong, and Yu Wang. 2024. FlashDecoding++: Faster Large Language Model Inference on GPUs. arXiv:2311.01282 [cs.LG] <https://arxiv.org/abs/2311.01282>
- [29] Weiming Hu, Haoyan Zhang, Cong Guo, Yu Feng, Renyang Guan, Zhendong Hua, Zihan Liu, Yue Guan, Minyu Guo, and Jingwen Leng. 2025. M-ANT: Efficient Low-bit Group Quantization for LLMs via Mathematically Adaptive Numerical Type. In *2025 IEEE International Symposium on High Performance Computer Architecture (HPCA)*. IEEE, 1112–1126.
- [30] Yoichi Ishibashi and Yoshimasa Nishimura. 2024. Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization. arXiv:2404.02183 [cs.SE] <https://arxiv.org/abs/2404.02183>
- [31] Jaeyong Jang, Yulhwa Kim, Juheun Lee, and Jae-Joon Kim. 2024. FIGNA: Integer Unit-Based Accelerator Design for FP-INT GEMM Preserving Numerical Accuracy. In *2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA)*. 760–773. doi:10.1109/HPCA57654.2024.00064
- [32] Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A Survey on Large Language Models for Code Generation. arXiv:2406.00515 [cs.CL] <https://arxiv.org/abs/2406.00515>
- [33] Joshua Knowles. 2005. A summary-attainment-surface plotting method for visualizing the performance of stochastic multiobjective optimizers. In *5th International Conference on Intelligent Systems Design and Applications (ISDA'05)*. IEEE, 552–557.
- [34] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2023. Large Language Models are Zero-Shot Reasoners. arXiv:2205.11916 [cs.CL] <https://arxiv.org/abs/2205.11916>
- [35] Dongjun Lee, Juyong Lee, Kyuyoung Kim, Jihoon Tack, Jinwoo Shin, Yee Whye Teh, and Kimin Lee. 2025. Learning to Contextualize Web Pages for Enhanced Decision Making by LLM Agents. arXiv:2503.10689 [cs.CL] <https://arxiv.org/abs/2503.10689>
- [36] Jungi Lee, Wonbeom Lee, and Jaewoong Sim. 2024. Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization. In *2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)*. 1048–1062. doi:10.1109/ISCA59077.2024.00080
- [37] Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, and Song Han. 2024. Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models. *arXiv preprint arXiv:2411.05007* (2024).
- [38] Jiawei Lin, Guokai Chen, Yuanlong Li, and Thomas Bourgeat. 2025. SystolicAttention: Fusing FlashAttention within a Single Systolic Array. arXiv:2507.11331 [cs.AR] <https://arxiv.org/abs/2507.11331>
- [39] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. *Proceedings of Machine Learning and Systems* 6 (2024), 87–100.
- [40] Haitao Liu, Yew-Soon Ong, Xiaobo Shen, and Jianfei Cai. 2020. When Gaussian process meets big data: A review of scalable GPs. *IEEE transactions on neural networks and learning systems* 31, 11 (2020), 4405–4423.
- [41] Yadong Lu, Jianwei Yang, Yelong Shen, and Ahmed Awadallah. 2024. OmniParser for Pure Vision Based GUI Agent. arXiv:2408.00203 [cs.CV] <https://arxiv.org/abs/2408.00203>
- [42] Haocong Luo, Yahya Can Tuğrul, F. Nisa Bostancı, Ataberk Olgun, A. Giray Yağlıkçı, and Onur Mutlu. 2023. Ramulator 2.0: A Modern, Modular, and Extensible DRAM Simulator. arXiv:2308.11030 [cs.AR] <https://arxiv.org/abs/2308.11030>
- [43] Fanxu Meng, Pingzhi Tang, Xiaojuan Tang, Zengwei Yao, Xing Sun, and Muhan Zhang. 2025. TransMLA: Multi-Head Latent Attention Is All You Need. arXiv:2502.07864 [cs.LG] <https://arxiv.org/abs/2502.07864>
- [44] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. *arXiv preprint arXiv:1609.07843* (2016).
- [45] AI Meta. 2024. Introducing meta llama 3: The most capable openly available llm to date. *Meta AI* (2024).
- [46] Magnus Müller and Gregor Žunić. 2024. *Browser Use: Enable AI to control your browser*. <https://github.com/browser-use/browser-use>
- [47] Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart Van Baalen, and Tijmen Blankevoort. 2021. A white paper on neural network quantization. *arXiv preprint arXiv:2106.08295* (2021).
- [48] Boye Niu, Yiliao Song, Kai Lian, Yifan Shen, Yu Yao, Kun Zhang, and Tongliang Liu. 2025. Flow: Modularized Agentic Workflow Automation. arXiv:2501.07834 [cs.AI] <https://arxiv.org/abs/2501.07834>
- [49] OpenAI. 2024. ChatGPT. <https://openai.com/index/chatgpt/>. Accessed: 2024-08-04.
- [50] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecco, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschle, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kirov, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly Lin, 
Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanj, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Rei-ichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emry Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael Pokorny, Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Jun-tang Zhuang, William Zhuk, and Barret Zoph. 2024. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL] <https://arxiv.org/abs/2303.08774>
- [51] Hongyi Pan, Diaa Dabawi, and Ahmet Enis Cetin. 2021. Fast Walsh-Hadamard Transform and Smooth-Thresholding Based Binary Layers in Deep Neural Networks. arXiv:2104.07085 [cs.CV] <https://arxiv.org/abs/2104.07085>
- [52] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. *arXiv preprint arXiv:1606.06031* (2016).
- [53] Jiajun Qin, Tianhua Xia, Cheng Tan, Jeff Zhang, and Sai Qian Zhang. 2025. PICACHU: Plug-In CGRA Handling Upcoming Nonlinear Operations in LLMs. In *Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2* (Rotterdam, Netherlands) (ASPLOS '25). Association for Computing Machinery, New York, NY, USA, 845–861. doi:10.1145/3676641.3716013
- [54] Akshat Ramachandran, Souvik Kundu, and Tushar Krishna. 2025. Microscopiq: Accelerating foundational models through outlier-aware microscaling quantization. In *Proceedings of the 52nd Annual International Symposium on Computer Architecture*. 1193–1209.
- [55] Stefano Rando, Luca Romani, Alessio Sampieri, Luca Franco, John Yang, Yuta Kyuragi, Fabio Galasso, and Tatsunori Hashimoto. 2025. LongCodeBench: Evaluating Coding LLMs at 1M Context Windows. arXiv:2505.07897 [cs.CL] <https://arxiv.org/abs/2505.07897>
- [56] Bita Darvish Rouhani, Ritchie Zhao, Ankit More, Mathew Hall, Alireza Khodamoradi, Summer Deng, Dhruv Choudhary, Marius Cornea, Eric Dellinger, Kristof Denolf, Stosic Dusan, Venmugil Elango, Maximilian Golub, Alexander Heinecke, Phil James-Roxby, Dharmesh Jani, Gaurav Kolhe, Martin Langhammer, Ada Li, Levi Melnick, Maral Mesmkhosroshahi, Andres Rodriguez, Michael Schulte, Rasoul Shafipour, Lei Shao, Michael Siu, Pradeep Dubey, Paulius Micikevicius, Maxim Naumov, Colin Verrilli, Ralph Wittig, Doug Burger, and Eric Chung. 2023. Microscaling Data Formats for Deep Learning. arXiv:2310.10537 [cs.LG] <https://arxiv.org/abs/2310.10537>
- [57] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale. *Commun. ACM* 64, 9 (2021), 99–106.
- [58] Satyabrata Sarangi and Bevan Baas. 2021. DeepScaleTool: A Tool for the Accurate Estimation of Technology Scaling in the Deep-Submicron Era. In *2021 IEEE International Symposium on Circuits and Systems (ISCAS)*. 1–5. doi:10.1109/ISCAS51556.2021.9401196
- [59] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas. 2015. Taking the human out of the loop: A review of Bayesian optimization. *Proc. IEEE* 104, 1 (2015), 148–175.
- [60] Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. 2023. Omniquant: Omnidirectionally calibrated quantization for large language models. *arXiv preprint arXiv:2308.13137* (2023).
- [61] Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. 2023. Omniquant: Omnidirectionally calibrated quantization for large language models. *arXiv preprint arXiv:2308.13137* (2023).
- [62] SiFive, Inc. 2020. *TileLink Specification*. Specification v1.8.1. SiFive, Inc. <https://starfivetech.com/uploads/tilelink_spec_1.8.1.pdf>
- [63] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971 [cs.CL] <https://arxiv.org/abs/2302.13971>
- [64] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. *arXiv preprint arXiv:2307.09288* (2023).
- [65] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2023. Attention Is All You Need. arXiv:1706.03762 [cs.CL] <https://arxiv.org/abs/1706.03762>
- [66] Ziyu Wang, Frank Hutter, Masrour Zoghi, David Matheson, and Nando De Feitas. 2016. Bayesian optimization in a billion dimensions via random embeddings. *Journal of Artificial Intelligence Research* 55 (2016), 361–387.
- [67] Shuhei Watanabe. 2023. Python tool for visualizing variability of Pareto fronts over multiple runs. *arXiv preprint arXiv:2305.08852* (2023).
- [68] Shuhei Watanabe. 2023. Tree-Structured Parzen Estimator: Understanding Its Algorithm Components and Their Roles for Better Empirical Performance. arXiv:2304.11127
- [69] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sébastien Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022. Emergent Abilities of Large Language Models. arXiv:2206.07682 [cs.CL] <https://arxiv.org/abs/2206.07682>
- [70] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. Smoothquant: Accurate and efficient post-training quantization for large language models. In *International Conference on Machine Learning*. PMLR, 38087–38099.
- [71] Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Kai Shen, and Liang Xiang. 2025. Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving. arXiv:2504.02605 [cs.SE] <https://arxiv.org/abs/2504.02605>
- [72] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? *arXiv preprint arXiv:1905.07830* (2019).
- [73] Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang, Yadong Dai, Jintao Li, Zehao Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, and Yu Wang. 2024. FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs. In *Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays* (Monterey, CA, USA) (FPGA '24). Association for Computing Machinery, New York, NY, USA, 223–234. doi:10.1145/3626202.3637562
- [74] Hengrui Zhang, August Ning, Rohan Baskar Prabhakar, and David Wentzlaff. 2024. LLMCompass: Enabling Efficient Hardware Design for Large Language Model Inference. In *2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)*. 1080–1096. doi:10.1109/ISCA59077.2024.00082
- [75] Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, and Baris Kasikci. 2024. Atom: Low-bit quantization for efficient and accurate LLM serving. *Proceedings of Machine Learning and Systems* 6 (2024), 196–209.

- [76] Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. 2023. KIVI: Plug-and-play 2bit KV Cache Quantization with Streaming Asymmetric Quantization. (2023). doi:10.13140/RG.2.2.28167.37282

## A Appendix

### A.1 Ablation Study on Quantization Methods

**Table 8:** Ablation study on quantization techniques, covering all 9 GEMMs in the LLaMA-3-8B model. Quantization is configured as W\_MXINT4, A\_MXINT4, KV\_MXINT4 with block size 16.

| Method                                           | PPL ( $\downarrow$ )           |
|--------------------------------------------------|--------------------------------|
| <b>Baseline FP16</b>                             | 6.14                           |
| RTN                                              | 8.2763 (2.3793 $\uparrow$ )    |
| RTN + Err<sub>w</sub> Clip                       | 8.1948 (0.0815 $\downarrow$ )  |
| GPTQ + Err<sub>w</sub> Clip                      | 8.5193 (0.3245 $\uparrow$ )    |
| GPTQ + Err<sub>y</sub> Clip                      | 7.6026 (0.9167 $\downarrow$ )  |
| GPTQ + Err<sub>y</sub> Clip + Selective Rotation | 7.2218 (0.3808 $\downarrow$ )  |
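As a point of reference for the RTN row, the following sketch shows round-to-nearest quantization with one shared power-of-two scale per block of 16 elements. It is a simplified illustration under stated assumptions, not the PLENA implementation: the exact MXINT4 encoding (scale format, rounding mode, clipping rule) is not specified here, so the sketch assumes signed 4-bit elements in [-8, 7], a scale chosen so that the block maximum fits in the positive range, and Python's built-in `round` (round-half-to-even).

```python
import math
from typing import List, Tuple

BLOCK = 16          # block size from the Table 8 configuration
QMIN, QMAX = -8, 7  # assumed signed 4-bit element range


def quantize_block(block: List[float]) -> Tuple[int, List[int]]:
    """Round-to-nearest quantization of one block with a shared power-of-two scale."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0] * len(block)
    # Smallest exponent e such that amax / 2**e does not exceed QMAX.
    exp = math.ceil(math.log2(amax / QMAX))
    scale = 2.0 ** exp
    q = [min(QMAX, max(QMIN, round(v / scale))) for v in block]
    return exp, q


def dequantize_block(exp: int, q: List[int]) -> List[float]:
    """Reconstruct real values from the shared exponent and integer elements."""
    return [v * 2.0 ** exp for v in q]


def quantize(values: List[float]) -> List[Tuple[int, List[int]]]:
    """Split a flat list into blocks of BLOCK elements and quantize each block."""
    return [quantize_block(values[i:i + BLOCK]) for i in range(0, len(values), BLOCK)]


# A block with a single outlier: the shared scale is set by the outlier,
# so the small entries lose all their resolution.
exp, q = quantize_block([0.1] * 15 + [6.0])
print(exp, q)
```

The example block illustrates why plain RTN degrades perplexity: one outlier fixes the block scale, which motivates the clipping and rotation steps in the subsequent rows.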

### A.2 A/KV Datatype Search

**Table 9:** Perplexity on WikiText2 (lower is better) for various A/KV quantization datatypes applied to LLaMA-3-8B. Weights are kept in MXINT4.

| Quant Method  | A/KV: e1m2 (MXFP4) | A/KV: e2m1 (MXFP4) | A/KV: MXINT4 |
|---------------|--------------------|--------------------|--------------|
| Baseline FP16 | 6.14               |                    |              |
| Ours          | 8.7205             | 23.1579            | 7.22         |

### A.3 Selective Rotation

**Table 10:** Effect of online rotation on activations in the linear layers. For the LLaMA2-7B model, rotating the down\_proj layer performs worse than leaving it unrotated, whereas this effect is not observed in the LLaMA3-8B model. Moreover, rotating the o\_proj layer severely degrades the performance of LLaMA3-8B. These results suggest that the effectiveness of rotation is highly model-dependent.

| Rotated Layer    | LLaMA2-7B | LLaMA3-8B |
|------------------|-----------|-----------|
| Attn Only        | 5.9367    | 7.3933    |
| Attn + Down_proj | 5.9405    | 7.2721    |
| Attn + Up_proj   | 5.9263    | 7.3529    |
| Attn + Gate_proj | 5.9241    | 7.3875    |
| Attn + Q_proj    | 5.9183    | 7.3616    |
| Attn + K_proj    | 5.9182    | 7.3555    |
| Attn + V_proj    | 5.9322    | 7.3788    |
| Attn + O_proj    | 5.9238    | nan       |
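The mechanism behind rotation can be sketched in a few lines. The example below assumes, for illustration only, that the rotation is an orthonormal Walsh-Hadamard transform (as in the rotation-based quantization methods cited in the references); the rotation matrix actually used by PLENA is not specified in this appendix. Applying the same orthogonal rotation to an activation vector and a weight column leaves their dot product unchanged while spreading an activation outlier over all entries.

```python
import math
from typing import List


def fwht(x: List[float]) -> List[float]:
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)
    return [v * norm for v in y]


x = [8.0, 0.0, 0.0, 0.0]         # hypothetical activation vector with one outlier
w = [0.5, -1.0, 0.25, 2.0]       # hypothetical weight column
x_rot, w_rot = fwht(x), fwht(w)  # rotate both operands with the same matrix

dot = sum(a * b for a, b in zip(x, w))
dot_rot = sum(a * b for a, b in zip(x_rot, w_rot))
print(max(abs(v) for v in x), max(abs(v) for v in x_rot), dot, dot_rot)
```

The rotated activation has a smaller dynamic range, which is easier to quantize at 4 bits. Whether this helps in practice depends on the layer and the model, as Table 10 shows.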

### A.4 Computation Flow of LLaMA Decoder-only Transformer

**Algorithm 2** Computation flow of a LLaMA decoder-only Transformer with embedding, lm\_head, and  $L$  layers: each decoder layer performs [MatMul1–9] interleaved with RMSNorm, RoPE, and nonlinear activations (Softmax, SiLU).

---

**Require:**  $t \in [V]^{B \times T}$  ▷ token ids  
**Require:**  $B, T, d, L, H, H_{kv}$  ▷ batch, seq, hidden\_dim, #layers, #Q heads, #KV heads; $d_h = d/H$ is the per-head dimension  
**Require:**  $(\cos \theta, \sin \theta)$  ▷ RoPE parameters  
1:  $X^{(1)} \leftarrow \text{EMBED}(t)$   $\triangleright X^{(1)} \in \mathbb{R}^{B \times T \times d}$   
2: **for**  $\ell = 1$  **to**  $L$  **do**  
3:     *Layer input:*  $X^{(\ell)} \in \mathbb{R}^{B \times T \times d}$   
4:      $X_n \leftarrow \text{RMSNorm}(X^{(\ell)})$   
5:      $Q \leftarrow X_n W_Q$  [MatMul1]  
6:      $K \leftarrow X_n W_K$  [MatMul2]  
7:      $V \leftarrow X_n W_V$  [MatMul3]  
8:      $(Q, K) \leftarrow \text{RoPE}(Q, K; \cos \theta, \sin \theta)$   
9:      $(K, V) \leftarrow \text{RepeatGroups}(K, V, H/H_{kv})$  ▷ GQA  
10:     $A_w \leftarrow \text{SOFTMAX}\left(\frac{QK^\top}{\sqrt{d_h}}\right)$  [MatMul4]  
11:     $A_w \leftarrow A_w V$  [MatMul5]  
12:     $A_o \leftarrow A_w W_O$  [MatMul6]  
13:     $X' \leftarrow X^{(\ell)} + A_o$  ▷ residual add  
14:     $X'_n \leftarrow \text{RMSNorm}(X')$   
15:     $X_{\text{up}} \leftarrow X'_n W_{\text{up}}$  [MatMul7]  
16:     $X_{\text{gate}} \leftarrow \text{SiLU}(X'_n W_{\text{gate}})$  [MatMul8]  
17:     $X_{\text{mlp}} \leftarrow (X_{\text{gate}} \odot X_{\text{up}}) W_{\text{down}}$  [MatMul9]  
18:     $X^{(\ell+1)} \leftarrow X' + X_{\text{mlp}}$  ▷ residual add  
19: logits  $\leftarrow X^{(L+1)} W_{\text{LM}}$  ▷ after the last layer  
20:  $\hat{p} \leftarrow \text{SOFTMAX}(\text{logits})$   
21: **return**  $\hat{p}$

---
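The nine MatMuls of Algorithm 2 can be enumerated with their GEMM shapes, which is useful when reasoning about how they map onto a systolic array. The sketch below is illustrative: it covers the full-sequence (prefill) case without causal-mask savings, counts a multiply-accumulate as two operations, and introduces an MLP width `d_ff` that is not among the algorithm's inputs. The function and key names are hypothetical.

```python
from typing import Dict, Tuple


def layer_matmuls(B: int, T: int, d: int, H: int, H_kv: int, d_ff: int
                  ) -> Dict[str, Tuple[int, int, int, int]]:
    """Return {name: (count, M, K, N)} for the nine MatMuls of one decoder layer.

    Each entry stands for `count` independent (M x K) @ (K x N) products.
    """
    d_h = d // H   # per-head dimension
    rows = B * T   # tokens processed by the projection GEMMs
    return {
        "MatMul1 Q":    (1, rows, d, H * d_h),
        "MatMul2 K":    (1, rows, d, H_kv * d_h),
        "MatMul3 V":    (1, rows, d, H_kv * d_h),
        "MatMul4 QK^T": (B * H, T, d_h, T),
        "MatMul5 AV":   (B * H, T, T, d_h),
        "MatMul6 O":    (1, rows, H * d_h, d),
        "MatMul7 up":   (1, rows, d, d_ff),
        "MatMul8 gate": (1, rows, d, d_ff),
        "MatMul9 down": (1, rows, d_ff, d),
    }


def flops(count: int, M: int, K: int, N: int) -> int:
    """Operation count, with each multiply-accumulate counted as two operations."""
    return 2 * count * M * K * N


# Tiny hypothetical configuration, chosen so the numbers are easy to verify by hand.
shapes = layer_matmuls(B=1, T=4, d=8, H=2, H_kv=1, d_ff=16)
for name, shape in shapes.items():
    print(f"{name:13s} {shape} -> {flops(*shape)} ops")
```

Note that MatMul4 and MatMul5 scale quadratically with the sequence length $T$, whereas the projection GEMMs scale linearly, which is why attention dominates in long-context workloads.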

### A.5 Matrix SRAM



**Figure 12:** This matrix SRAM supports both transposed and untransposed reads without additional cost. The key idea is to store each row of data separately across a set of sub-SRAMs, where the number of sub-SRAMs equals the vector dimension being stored. The row index assigned to each element differs across the sub-SRAMs, ensuring that elements from the same matrix column (green dotted line) are distributed across different sub-SRAMs. With this organization, when reading from the SRAM—whether in transposed or untransposed mode—each requested element resides in a different sub-SRAM. As a result, only one read port per sub-SRAM is required.
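One mapping with the property described in Figure 12 is a diagonal skew, sketched below. This is an assumed mapping for illustration (element $(r, c)$ placed in sub-SRAM $(r + c) \bmod N$ at address $r$); the exact index assignment used in the hardware may differ. The sketch shows that both a row read and a column read touch every sub-SRAM exactly once, so a single read port per sub-SRAM suffices.

```python
N = 4  # vector dimension = number of sub-SRAMs (hypothetical small size)


def bank_of(row: int, col: int) -> int:
    """Sub-SRAM holding matrix element (row, col) under a diagonal skew."""
    return (row + col) % N


def store(matrix):
    """Write an N x N matrix: banks[b][r] holds the element of row r kept in bank b."""
    banks = [[None] * N for _ in range(N)]
    for r in range(N):
        for c in range(N):
            banks[bank_of(r, c)][r] = matrix[r][c]
    return banks


def read_row(banks, r):
    """Untransposed read: every bank is accessed at the same address r."""
    return [banks[bank_of(r, c)][r] for c in range(N)]


def read_col(banks, c):
    """Transposed read: bank b is accessed at address (b - c) mod N."""
    return [banks[bank_of(r, c)][r] for r in range(N)]


matrix = [[r * N + c for c in range(N)] for r in range(N)]
banks = store(matrix)
print(read_row(banks, 1))
print(read_col(banks, 2))
```

In the untransposed case all sub-SRAMs share one address, while in the transposed case each sub-SRAM receives a different address; in both cases no two requested elements collide in the same sub-SRAM.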

### A.6 PE Array



**Figure 13:** In the hardware implementation of the PE array, elements and scales flow from top to bottom and from left to right. All computations are performed using integer arithmetic.

### A.7 Custom ISA

**Table 11:** A summary of the PLENA custom ISA for the accelerator.

| Instruction Type | Description                                                                   | No. of Instructions |
|------------------|-------------------------------------------------------------------------------|---------------------|
| Matrix           | Controls GEMM and GEMV operations, with or without matrix transposition       | 6                   |
| Vector           | Performs elementwise and reduction operations, plus rotation for quantization | 13                  |
| Scalar           | Performs scalar INT and FP arithmetic                                         | 17                  |
| HBM              | Handles data transfers between HBM and matrix/vector SRAMs                    | 3                   |
| Control          | Defines operation settings, including the HBM physical address                | 4                   |

### A.8 Custom Instructions

**Table 12:** Summary of Custom ISA Instructions.

| Type        | Instruction (Format)                               | Description                                                                 |
|-------------|----------------------------------------------------|-----------------------------------------------------------------------------|
| Matrix (M)  | M_MM (opcode, rd, rs1, rs2)                        | Multiply Matrix[rs2] and Vector[rs1]; accumulate in systolic array.         |
|             | M_TMM (opcode, rd, rs1, rs2)                       | Same as M_MM but with matrix transpose.                                     |
|             | M_MV (opcode, rd, rs1)                             | Multiply Matrix[rs2] and Vector[rs1]; store in first row of systolic array. |
|             | M_TMV (opcode, rd, rs1)                            | Same as M_MV but with matrix transpose.                                     |
|             | M_MV_WO (opcode, rd, imm)                          | Write out first row of systolic array to Vector SRAM[rd+imm].               |
|             | M_MM_WO (opcode, rd, imm)                          | Write out systolic array results to Vector SRAM[rd+imm].                    |
| Vector (V)  | V_ADD_VV (opcode, rd, rs1, rs2)                    | Elementwise vector addition.                                                |
|             | V_ADD_VF (opcode, rd, rs1, rs2)                    | Vector plus broadcasted FP register.                                        |
|             | V_SUB_VV (opcode, rd, rs1, rs2)                    | Elementwise vector subtraction.                                             |
|             | V_SUB_VF (opcode, rd, rs1, fp2)                    | Vector minus broadcasted FP register.                                       |
|             | V_MUL_VV (opcode, rd, rs1, rs2)                    | Elementwise vector multiplication.                                          |
|             | V_MUL_VF (opcode, rd, rs1, fp2)                    | Vector times broadcasted FP register.                                       |
|             | V_EXP_V (opcode, rd, rs1)                          | Elementwise exponentiation.                                                 |
|             | V_REC_V (opcode, rd, rs1)                          | Elementwise reciprocal.                                                     |
|             | V_LD_F (opcode, rd, rs1)                           | Broadcast FP register value to vector.                                      |
|             | V_RED_SUM (opcode, rd, rs1)                        | Reduction sum of vector into FP register.                                   |
|             | V_RED_MAX (opcode, rd, rs1)                        | Reduction max of vector into FP register.                                   |
|             | V_ROTATION_EN (opcode, rd, rs1)                    | Selectively apply Hadamard rotation                                         |
|             | V_INV_ROTATION_EN (opcode, rd, rs1)                | Selectively apply inverse Hadamard rotation                                 |
| Scalar (S)  | S_ADD_INT (opcode, rd, rs1, rs2)                   | Integer addition.                                                           |
|             | S_ADDI_INT (opcode, rd, rs1, imm)                  | Integer add immediate.                                                      |
|             | S_SUB_INT (opcode, rd, rs1, rs2)                   | Integer subtraction.                                                        |
|             | S_LUI_INT (opcode, rd, imm)                        | Load upper immediate.                                                       |
|             | S_MUL_INT (opcode, rd, rs1, rs2)                   | Integer multiplication.                                                     |
|             | S_DIV_INT (opcode, rd, rs1, rs2)                   | Integer division.                                                           |
|             | S_LD_INT (opcode, rd, rs1, imm)                    | Load from FIX_MEM into integer register.                                    |
|             | S_ST_INT (opcode, rd, rs1, imm)                    | Store integer register to FIX_MEM.                                          |
|             | S_ADD_FP (opcode, rd, rs1, rs2)                    | FP addition.                                                                |
|             | S_SUB_FP (opcode, rd, rs1, rs2)                    | FP subtraction.                                                             |
|             | S_MUL_FP (opcode, rd, rs1, rs2)                    | FP multiplication.                                                          |
|             | S_EXP_FP (opcode, rd, rs1)                         | FP exponentiation.                                                          |
|             | S_MAX_FP (opcode, rd, rs1, rs2)                    | FP maximum.                                                                 |
|             | S_LD/ST_FP (opcode, rd, rs1, imm)                  | Load/store FP register from/to FP_MEM.                                      |
| Memory (H)  | H_PREFETCH_M (opcode, rd, rs1, rs2, rstride, prec) | Prefetch specified rows from HBM to Matrix SRAM.                            |
|             | H_PREFETCH_V (opcode, rd, rs1, rs2)                | Prefetch specified number of rows from HBM to Vector SRAM.                  |
|             | H_STORE_V (opcode, rd, rs1, rs2, stride, prec)     | Store VLEN rows from Vector SRAM to HBM.                                    |
| Control (C) | C_SET_ADDR_REG (opcode, rd, rs1, rs2)              | Set HBM address register from two FIX regs.                                 |
|             | C_SET_SCALE_REG (rd, opcode)                       | Set MX scale offset for quantized data.                                     |
|             | C_SET_LUT_REG (rd, opcode)                         | Set MX scale offset for quantized data.                                     |
|             | C_BREAK (opcode)                                   | Trigger breakpoint exception.                                               |
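
As an illustration of how these instructions compose, the sketch below emulates a numerically stable softmax using Python stand-ins for a handful of the vector instructions in Table 12. The functions model only the arithmetic described in the table. Register and SRAM operands are omitted, and the sequence is one plausible lowering under those assumptions, not necessarily the schedule emitted by the PLENA compiler.

```python
import math

# Hypothetical Python stand-ins for Table 12 vector instructions.
def v_red_max(v): return max(v)                      # V_RED_MAX: vector -> FP register
def v_red_sum(v): return sum(v)                      # V_RED_SUM: vector -> FP register
def v_sub_vf(v, f): return [x - f for x in v]        # V_SUB_VF: vector minus broadcast FP
def v_exp_v(v): return [math.exp(x) for x in v]      # V_EXP_V: elementwise exponentiation
def v_ld_f(f, vlen): return [f] * vlen               # V_LD_F: broadcast FP register to vector
def v_rec_v(v): return [1.0 / x for x in v]          # V_REC_V: elementwise reciprocal
def v_mul_vv(a, b): return [x * y for x, y in zip(a, b)]  # V_MUL_VV: elementwise product

def softmax(scores):
    m = v_red_max(scores)                  # row maximum for numerical stability
    e = v_exp_v(v_sub_vf(scores, m))       # exp(x - max)
    s = v_red_sum(e)                       # normalizer
    inv = v_rec_v(v_ld_f(s, len(scores)))  # broadcast the sum, then take 1 / sum
    return v_mul_vv(e, inv)

probs = softmax([1.0, 2.0, 3.0])
```

The reciprocal is taken on a broadcast vector because the table lists an elementwise vector reciprocal but no scalar FP reciprocal or division.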

## A.9 Downstream Tasks

**Table 13:** Zero-shot accuracy of LLAMA-3 and LLAMA-2 models under 4-bit quantization (A4W4KV4), compared with QuaRot on PIQA (PQ), WinoGrande (WG), HellaSwag (HS), Arc-Easy (A-e), Arc-Challenge (A-c), and LAMBADA (LA). Baseline and QuaRot results are taken from Table 2 and Table 12 of the QuaRot paper.

| Model      | Method     | PQ [10] | WG [57] | HS [72] | A-e [14] | A-c [14] | LA [52] | Avg.  |
|------------|------------|---------|---------|---------|----------|----------|---------|-------|
| LLAMA-3-8B | FP16       | 80.74   | 72.77   | 79.06   | 77.82    | 53.33    | 75.63   | 73.22 |
|            | QuaRot [6] | 75.14   | 65.82   | 72.94   | 68.01    | 43.34    | 65.81   | 65.18 |
|            | Ours       | 79.11   | 71.35   | 76.97   | 74.07    | 50.51    | 74.07   | 71.01 |
| LLAMA-2-7B | FP16       | 79.11   | 69.06   | 75.99   | 74.58    | 46.25    | 73.90   | 69.82 |
|            | QuaRot [6] | 76.77   | 63.77   | 72.16   | 69.87    | 40.87    | 70.39   | 65.64 |
|            | Ours       | 78.73   | 68.19   | 74.24   | 72.52    | 43.69    | 73.30   | 68.45 |

## A.10 Co-Design Space and Analysis

**Table 14:** Co-design search space of hardware and quantisation parameters. Categorical parameters are one-hot encoded; integer parameters are expressed as powers of 2.

| Parameter       | Description                              | Search range                                         |
|-----------------|------------------------------------------|------------------------------------------------------|
| BLEN            | Tile size of block unit                  | [2, 4, 8, 16, 32]                                    |
| MLEN            | Tile size of matrix unit                 | [2, 4, 8, 16, 32, 64, 128, 256, 512]                 |
| VLEN            | Tile size of vector unit                 | [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]           |
| HBM_M_Prefetch  | Prefetch amount for matrix data from HBM | [2, 4, 8, 16, 32, 64, 128, 256]                      |
| HBM_V_Prefetch  | Prefetch amount for vector data from HBM | [2, 4, 8, 16, 32, 64, 128, 256]                      |
| HBM_V_Writeback | Writeback amount for vector data to HBM  | [2, 4, 8, 16, 32, 64, 128, 256]                      |
| ACT_WIDTH       | Activation precision                     | MXINT_{2,3,4,8}, MXFP_{E1M2, E2M1, E3M4, E4M3, E5M2} |
| KV_WIDTH        | Key/Value precision                      | MXINT_{2,3,4,8}, MXFP_{E1M2, E2M1, E3M4, E4M3, E5M2} |
| FP_SETTING      | Floating-point precision setting         | FP_{E3M2, E2M3, E6M5, E5M6, E4M7, E8M5}              |
| INT_DATA_WIDTH  | Integer data width                       | [16, 32, 64]                                         |

**Table 15:** Constraints applied to the hardware and quantisation co-design search space.

| Constraint                                                                 | Description                                               |
|----------------------------------------------------------------------------|-----------------------------------------------------------|
| $MLEN \geq BLEN$                                                           | Matrix tile size must be at least the block tile size     |
| $MLEN \bmod BLEN = 0$                                                      | Matrix tile size must be divisible by the block tile size |
| $MATRIX\_SRAM\_DEPTH \geq 2 \times MLEN$                                   | Matrix SRAM depth must accommodate $2 \times MLEN$        |
| $VECTOR\_SRAM\_DEPTH \geq 2 \times HEAD\_DIM + \frac{HIDDEN\_DIM}{VLEN}$   | Vector SRAM depth must store heads and hidden slices      |
| $INT\_SRAM\_DEPTH \geq 16$                                                 | Minimum integer SRAM depth                                |
| $FP\_SRAM\_DEPTH \geq 3 \times MLEN + FP\_CONSTANT\_NUM$                   | Floating-point SRAM depth constraint                      |
| $(MLEN \times ACT\_WIDTH + (MLEN / BLEN) \times ACT\_SCALE\_WIDTH) < 1510$ | Bandwidth constraint at 1 GHz, 1 TB/s                     |
| $(VLEN \times ACT\_WIDTH + (VLEN / BLEN) \times ACT\_SCALE\_WIDTH) < 1510$ | Bandwidth constraint at 1 GHz, 1 TB/s                     |
| $(MLEN \times ACT\_WIDTH + (MLEN / BLEN) \times ACT\_SCALE\_WIDTH) < 1510$ | Bandwidth constraint at 1 GHz, 1.5 TB/s                   |
| $(VLEN \times ACT\_WIDTH + (VLEN / BLEN) \times ACT\_SCALE\_WIDTH) < 1510$ | Bandwidth constraint at 1 GHz, 1.5 TB/s                   |
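
The constraints in Table 15 can be expressed as a simple feasibility filter over candidate designs. The sketch below is a minimal version under stated assumptions: the `Design` fields, the SRAM depths, `act_scale_width`, and the model dimensions are hypothetical inputs that are not part of the Table 14 search space, and the bandwidth limit of 1510 is taken verbatim from Table 15.

```python
from dataclasses import dataclass, replace

@dataclass
class Design:
    blen: int
    mlen: int
    vlen: int
    act_width: int          # bits per activation element
    act_scale_width: int    # bits per shared block scale (assumed input)
    matrix_sram_depth: int
    vector_sram_depth: int
    int_sram_depth: int
    fp_sram_depth: int
    head_dim: int
    hidden_dim: int
    fp_constant_num: int

def bits_per_access(length, d):
    # Element bits plus one scale per BLEN-sized block, as in the bandwidth rows.
    return length * d.act_width + (length / d.blen) * d.act_scale_width

def violated_constraints(d, bandwidth_limit=1510):
    checks = {
        "MLEN >= BLEN": d.mlen >= d.blen,
        "MLEN mod BLEN = 0": d.mlen % d.blen == 0,
        "MATRIX_SRAM_DEPTH": d.matrix_sram_depth >= 2 * d.mlen,
        "VECTOR_SRAM_DEPTH": d.vector_sram_depth >= 2 * d.head_dim + d.hidden_dim / d.vlen,
        "INT_SRAM_DEPTH": d.int_sram_depth >= 16,
        "FP_SRAM_DEPTH": d.fp_sram_depth >= 3 * d.mlen + d.fp_constant_num,
        "MLEN bandwidth": bits_per_access(d.mlen, d) < bandwidth_limit,
        "VLEN bandwidth": bits_per_access(d.vlen, d) < bandwidth_limit,
    }
    return [name for name, ok in checks.items() if not ok]

good = Design(blen=32, mlen=128, vlen=32, act_width=8, act_scale_width=8,
              matrix_sram_depth=256, vector_sram_depth=512, int_sram_depth=16,
              fp_sram_depth=512, head_dim=128, hidden_dim=4096, fp_constant_num=8)
# Quadrupling MLEN (with SRAM depths scaled to match) exceeds the bandwidth limit.
wide = replace(good, mlen=512, matrix_sram_depth=1024, fp_sram_depth=2048)
good_violations = violated_constraints(good)
wide_violations = violated_constraints(wide)
```

A filter of this kind would typically run before the cost models, so that infeasible points are rejected without being evaluated.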

**Table 16:** Design space exploration on Llama-3-8B: multi-objective results for five configurations from a BoTorch run. We report perplexity ( $\downarrow$ ) from the accuracy evaluator, and end-to-end latency (seconds  $\downarrow$ ) and area ( $\mu$m$^2$   $\downarrow$ ) from the respective cost models. Perplexity is computed with GEMM-only emulation (nonlinear ops omitted) for faster iteration; the FP setting therefore affects latency and area but not the accuracy metric. We load weights pre-quantized to MXINT4 via our PTQ method and quantize activations and the KV cache on-the-fly during inference.

| Parameters |      |      |          |          |           |           |           |         |          | Metrics                 |                      |                              |
|------------|------|------|----------|----------|-----------|-----------|-----------|---------|----------|-------------------------|----------------------|------------------------------|
| BLEN       | MLEN | VLEN | HBM_M_Prefetch | HBM_V_Prefetch | HBM_V_Writeback | ACT_WIDTH | KV_WIDTH  | FP_SETTING | INT_DATA_WIDTH | Perplexity $\downarrow$ | Lat (s) $\downarrow$ | Area ($\mu$m$^2$) $\downarrow$ |
| 32         | 128  | 32   | 16       | 8        | 256       | MXFP_E4M3 | MXFP_E3M4 | FP_E4M7 | 64       | 6.70                    | 0.24                 | 49615017.52                  |
| 32         | 128  | 64   | 4        | 8        | 256       | MXINT_8   | MXINT_4   | FP_E3M2 | 32       | 6.76                    | 0.24                 | 51639793.20                  |
| 32         | 256  | 128  | 256      | 64       | 128       | MXFP_E1M2 | MXINT_8   | FP_E6M5 | 16       | 12.14                   | 0.15                 | 99425984.56                  |
| 8          | 128  | 32   | 128      | 8        | 256       | MXFP_E3M4 | MXFP_E3M4 | FP_E5M6 | 16       | 6.54                    | 1.47                 | 26456937.52                  |
| 16         | 128  | 16   | 4        | 16       | 64        | MXINT_8   | MXFP_E4M3 | FP_E3M2 | 64       | 6.60                    | 0.49                 | 31983011.76                  |
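
Because the three metrics trade off against one another, a dominance check is a convenient way to read the table. The sketch below is plain Python rather than BoTorch code. It applies the standard Pareto-dominance definition (all objectives minimized) to the five rows of Table 16, and the function names are illustrative.

```python
def dominates(a, b):
    # a dominates b if it is no worse on every objective and strictly better on one.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    # Indices of points that no other point dominates.
    return [i for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]

# (perplexity, latency in s, area) for the five rows of Table 16, in table order.
table16 = [
    (6.70, 0.24, 49615017.52),
    (6.76, 0.24, 51639793.20),
    (12.14, 0.15, 99425984.56),
    (6.54, 1.47, 26456937.52),
    (6.60, 0.49, 31983011.76),
]
front = pareto_front(table16)
```

Under this definition, the second row is the only one dominated within this five-row sample (by the first row, which has lower perplexity and area at equal latency). The remaining rows each lead on at least one metric.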

## A.11 Compute Performance Experiment Settings

**Table 17:** Configuration settings for compute performance experiments, chosen to match the multiplier count of the A100 GPU. For MicroscopiQ, MLEN and BLEN are set to the same value to form a square shape.

| System      | Freq (GHz) | MLEN | BLEN | VLEN | SRAM (MB) | W. Width | A. Width | KV Width | FP Setting |
|-------------|------------|------|------|------|-----------|----------|----------|----------|------------|
| PLENA       | 1          | 2048 | 32   | 2048 | 128       | MXINT4   | MXINT4   | MXINT4   | FP E4M3    |
| MicroscopiQ | 1          | 256  | 256  | 2048 | 128       | MXINT4   | MXINT4   | MXINT4   | FP E4M3    |
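
The matching of multiplier counts can be checked with simple arithmetic. The snippet below assumes one multiplier per processing element in an MLEN × BLEN array. That assumption is ours, although it is consistent with both rows of Table 17.

```python
def multiplier_count(mlen, blen):
    # Assumption: the array holds MLEN x BLEN PEs with one multiplier each.
    return mlen * blen

plena = multiplier_count(2048, 32)         # flattened array
microscopiq = multiplier_count(256, 256)   # square array
```

Both configurations come to 65,536 multipliers under this assumption, so the comparison differs in array shape rather than in raw compute resources.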