



# REIS: A High-Performance and Energy-Efficient Retrieval System with In-Storage Processing

Kangqi Chen  
ETH Zürich  
Zürich, Switzerland  
kangqichen695@gmail.com

Nika Mansouri Ghiasi  
ETH Zürich  
Zürich, Switzerland  
n.mansorighiasi@gmail.com

Jisung Park  
POSTECH  
Pohang, Republic of Korea  
jisung.park@postech.ac.kr

Rakesh Nadig  
ETH Zürich  
Zürich, Switzerland  
rakesh.nadig@gmail.com

Yu Liang  
ETH Zürich  
Zürich, Switzerland  
yulianglenny@gmail.com

Mohammad Sadrosadati  
ETH Zürich  
Zürich, Switzerland  
m.sadr89@gmail.com

Manos Frouzakis  
ETH Zürich  
Zürich, Switzerland  
manos.frouzakis@gmail.com

Haiyu Mao  
ETH Zürich  
Zürich, Switzerland  
King's College London (KCL)  
London, United Kingdom  
maohaiyu1993@gmail.com

Onur Mutlu  
ETH Zürich  
Zürich, Switzerland  
omutlu@gmail.com

## Abstract

Large Language Models (LLMs) face an inherent challenge: their knowledge is confined to the data that they have been trained on. This limitation, combined with the significant cost of retraining, renders them incapable of providing up-to-date responses. To overcome these issues, Retrieval-Augmented Generation (RAG) complements the static training-derived knowledge of LLMs with an external knowledge repository. RAG consists of three stages: (i) indexing, which creates a database that facilitates similarity search on text embeddings, (ii) retrieval, which, given a user query, searches and retrieves relevant data from the database, and (iii) generation, which uses the user query and the retrieved data to generate a response.

The retrieval stage of RAG in particular becomes a significant performance bottleneck in inference pipelines. In this stage, (i) a given user query is mapped to an embedding vector and (ii) an Approximate Nearest Neighbor Search (ANNS) algorithm searches for the most semantically similar embedding vectors in the database to identify relevant items. Due to the large database sizes, ANNS incurs significant data movement overheads between the host and the storage system. To alleviate these overheads, prior works propose In-Storage Processing (ISP) techniques that accelerate ANNS workloads by performing computations inside the storage system. However, existing works that leverage ISP for ANNS (i) employ algorithms that are *not* tailored to ISP systems, (ii) do not accelerate data retrieval operations for data selected by ANNS, and (iii) introduce significant hardware modifications to the storage system, limiting performance and hindering their adoption.

We propose REIS, the first Retrieval system tailored for RAG with In-Storage processing that addresses the limitations of existing implementations with three key mechanisms. First, REIS employs a database layout that links database embedding vectors to their associated documents, enabling efficient retrieval. Second, it enables efficient ANNS by introducing an ISP-tailored algorithm and data placement technique that: (i) distributes embeddings across all planes of the storage system to exploit parallelism, and (ii) employs a lightweight Flash Translation Layer (FTL) to improve performance. Third, REIS leverages an ANNS engine that uses the *existing* computational resources inside the storage system, without requiring hardware modifications. The three key mechanisms form a cohesive framework that substantially improves both the performance and energy efficiency of RAG pipelines. Compared to a high-end server-grade system, REIS improves the performance (energy efficiency) of the retrieval stage by an average of $13\times$ ($55\times$). REIS offers improved performance against existing ISP-based ANNS accelerators, without introducing any hardware modifications, enabling easier adoption for RAG pipelines.

## CCS Concepts

- Information systems → Top-k retrieval in databases
- Hardware → Memory and dense storage

## Keywords

Retrieval-Augmented Generation, In-Storage Processing, SSD, LLM

## ACM Reference Format:

Kangqi Chen, Rakesh Nadig, Manos Frouzakis, Nika Mansouri Ghiasi, Yu Liang, Haiyu Mao, Jisung Park, Mohammad Sadrosadati, and Onur Mutlu. 2025. REIS: A High-Performance and Energy-Efficient Retrieval System with In-Storage Processing. In *Proceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA '25)*, June 21–25, 2025, Tokyo, Japan. ACM, New York, NY, USA, 22 pages. <https://doi.org/10.1145/3695053.3731116>



This work is licensed under a Creative Commons Attribution 4.0 International License.  
ISCA '25, Tokyo, Japan  
© 2025 Copyright held by the owner/author(s).  
ACM ISBN 979-8-4007-1261-6/25/06  
<https://doi.org/10.1145/3695053.3731116>

## 1 Introduction

The rapid development of Large Language Models (LLMs) [73, 74, 122, 181, 272, 331] during the past decade has led to their widespread adoption, as witnessed by the popularity of chatbots such as ChatGPT [222], Gemini [271, 272] and DeepSeek [181]. Despite this progress, modern LLMs remain limited to generating responses only from data present in their training sets. The significant cost and hardware requirements [73, 181, 285] of training further compound this problem, making frequent retraining on new data impractical and thus limiting the effectiveness of LLMs, especially in domain-specific and real-time scenarios [119, 120, 200].

Retrieval-Augmented Generation (RAG) [24, 39, 45, 77, 83, 84, 111, 123, 124, 145, 166, 169, 232, 233, 253, 303, 314, 324, 326] presents a compelling solution to this problem by leveraging information retrieval techniques to feed relevant content from a document database into LLMs for text generation. At inference-time, RAG systems retrieve documents from the database that are relevant to user queries, using these to complement the training-derived knowledge of LLMs and generate contextually relevant responses. Many recent works demonstrate the applicability of RAG to fields such as healthcare [186, 287, 309, 325], law [56, 105, 185, 304], finance [322, 329, 336], and scientific research [156, 302].

The general workflow of RAG is a pipeline of three stages: (i) indexing, (ii) retrieval, and (iii) generation. First, the indexing stage is an *offline* process that builds a *vector database* of high-dimensional embeddings [58, 158, 160, 164, 212, 221, 274]. Indexing employs algorithms that cluster similar data or create graph-like structures [63, 68, 195, 343] to facilitate future search operations on the data. Second, for each incoming query, the retrieval stage identifies document chunks that are semantically relevant to the query. To do so, RAG employs a process known as *dense retrieval* [81, 108, 335], which encodes the query in the same vector space as the document chunks and performs a similarity search between the query and the database embeddings. Third, the generation stage feeds both the identified document chunks and the query into the LLM to generate the final response.

Although dense retrieval enables accurate semantic similarity comparison between incoming queries and document chunks [83, 273, 274, 335], the large embedding space results in expensive distance computations. For RAG pipelines, a retriever that achieves *both* high recall and low latency is essential because it (i) determines the quality of generated responses, and (ii) resides in the critical path of the response generation process. To strike a balance between these two conflicting objectives, RAG commonly performs dense retrieval with *Approximate Nearest Neighbor Search* (ANNS) techniques, e.g., [78, 79, 97, 109, 153, 172, 183, 194, 232, 295]. Examples of such techniques are: (i) employing data structures that accelerate the search [63, 195, 328, 343], and (ii) quantizing data to reduce the computational complexity [116, 239] of search operations without significantly affecting recall.

The reduced computational complexity of ANNS renders I/O data transfers a significant bottleneck that limits search performance [106, 178, 299, 310]. As a result, several works [106, 178, 299, 310] propose In-Storage Processing (ISP) as a promising solution to accelerate ANNS-based workloads [79, 81, 108, 170, 330, 333]. In particular, NDSearch [299] demonstrates that (i) storage I/O accounts for up to 75% of the end-to-end ANNS latency, and (ii) ISP improves ANNS performance by $31.7\times$ over a conventional CPU-based system, effectively mitigating the aforementioned I/O bottleneck.

We empirically make a similar observation for RAG pipelines, where the ANNS-based retrieval stage becomes the performance bottleneck due to substantial I/O overheads, as presented in Sec. 3.1. For example, when examining a RAG database containing 41.5 million document entries [60], the I/O traffic from the storage system accounts for 84% of the overall latency of the entire RAG pipeline. Although various software and hardware solutions that reduce the storage footprint do exist, these approaches are either unscalable (e.g., quantization methods [81, 135, 209]) or unsustainable (e.g., memory expansion [113]). We conclude that In-Storage Processing (ISP) techniques are essential for fundamentally addressing the critical I/O data movement bottleneck in RAG pipelines.

Existing ISP-based ANNS accelerators [106, 178, 299, 310] face three key limitations that hinder their application to RAG workloads. First, previous works employ search algorithms that cause performance degradation in ISP systems. Graph-based algorithms [115, 195] used by ISP accelerators [178, 299] perform searches using graph traversal, a sequential process. During graph traversal, the algorithm determines the next vertex to visit based on its analysis of the vertex currently being visited. This process exhibits irregular data access patterns [75, 76], which complicate optimization and efficient execution in ISP systems. Second, existing ISP schemes mainly focus on accelerating ANNS, the search stage in RAG applications, without optimizing the document retrieval stage, which, as we show in Sec. 3.2, contributes significant latency to the RAG pipeline. Third, in their quest to accelerate ANNS applications, existing ISP schemes introduce significant storage [106] or hardware [192] overheads.

**Our goal** is to fundamentally alleviate the I/O data movement bottlenecks in the retrieval stage of the RAG pipeline. To this end, we propose REIS, a Retrieval system with In-Storage Processing that employs three new key ideas: 1) an efficient ISP implementation of the clustering-based Inverted File (IVF) algorithm [63, 328, 343] to improve end-to-end retrieval performance, 2) a new low-cost hardware-assisted mechanism in the storage system that links embeddings to their corresponding document chunks, enabling faster document retrieval, and 3) a customized in-storage ANNS computation engine that uses the resources already available within a modern storage system to enhance the energy efficiency of the retrieval process without additional hardware.

**Key Mechanisms.** To implement the aforementioned ideas, REIS leverages three key mechanisms. First, we propose an ISP-tailored data placement technique and execution flow that take into account the properties of the Inverted File (IVF) algorithm [63, 343]. Since IVF organizes embeddings into clusters of similar vectors, our data placement technique (i) stores embeddings contiguously, reducing the address translation overhead from the Flash Translation Layer (FTL), and (ii) distributes embeddings across planes to exploit the available parallelism. To execute the IVF algorithm, REIS uses: (i) the existing logic within the planes to calculate the similarity between embeddings and (ii) the SSD controller to identify the most similar embeddings. Second, to efficiently link embeddings to documents, REIS employs a new database layout that (i) stores embeddings and document chunks in separate regions, and (ii) creates connections between the two using the Out-Of-Band (OOB) area of the NAND Flash array, enabling efficient document retrieval. Third, we customize the ANNS engine by using binary quantization [81, 209, 260] and a hybrid SSD design [36]. Binary quantization reduces the computational complexity of ANNS, while the hybrid SSD design combines reliable ISP with high storage density. Specifically, our hybrid SSD design employs (i) SLC, using Enhanced SLC programming [224] for high-performance and reliable in-storage computation on embeddings, and (ii) TLC for storing document chunks at high density.

**Key Results.** We evaluate REIS on two SSD configurations based on a cost-oriented [250] and a performance-oriented [207] SSD design. We compare its performance and energy efficiency against a high-end 256-core CPU system on two commonly used benchmark datasets from BEIR [274] and a large-scale public dataset [60], demonstrating that REIS (i) achieves an average speedup of $13\times$ and up to $112\times$, and (ii) improves energy efficiency by an average of $55\times$ and up to $157\times$. Compared to a state-of-the-art ISP-based ANNS accelerator [106], REIS yields an average speedup of $21.4\times$ ($7.67\times$) and $24.2\times$ ($9.76\times$) at 0.98 (0.90) $Recall@10$ across all evaluated datasets for the cost- and performance-oriented SSDs, respectively. Since REIS does not introduce any hardware modifications to the storage system, its adoption for RAG is much easier than that of prior ISP-based accelerators.

The contributions of this work are listed as follows:

- This is the first work to quantitatively evaluate the large performance overheads of I/O data movement in the retrieval stage of the Retrieval-Augmented Generation pipeline.
- We comprehensively analyze the limitations of existing techniques that aim to alleviate the I/O data movement bottleneck of the RAG pipeline. We identify two major issues that make integrating existing ISP-based ANNS accelerators into the RAG pipeline inefficient and impractical.
- We propose REIS, the first ISP-based retrieval system tailored for RAG. REIS (i) supports efficient document retrieval by building the correlation between embeddings and documents within the storage system, (ii) improves retrieval performance by introducing an ISP-friendly algorithm, and (iii) improves energy and area efficiency via a customized in-storage ANNS computation engine using computational resources already available in a modern storage system.
- We implement REIS based on a cost-oriented and a performance-oriented SSD design and evaluate its performance and energy efficiency. Against a 256-core CPU system, REIS provides an average speedup (energy efficiency improvement) of $13\times$ ($55\times$). Compared to a state-of-the-art ISP-based ANNS accelerator, REIS accelerates RAG retrieval by $7.67\times$ up to $24.1\times$, depending on (i) the SSD configuration used, and (ii) the target $Recall@10$ value.

## 2 Background

### 2.1 Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) [24, 39, 45, 77, 83, 84, 111, 123, 124, 145, 166, 169, 232, 233, 253, 303, 314, 324, 326] is the process of incorporating knowledge from an external document database into LLM inference. To identify the most relevant data, RAG employs dense retrieval [83, 335], a similarity search operation on dense vectors representing the semantics of the text, called embeddings [20, 158, 160, 164, 202, 211, 292, 293]. To encode this information, embeddings feature high dimensionality, often containing 768 to 8192 dimensions [20, 158, 160, 164, 202, 211, 212, 292, 293].

As mentioned in Sec. 1, RAG is a pipeline of three stages: (i) indexing, (ii) retrieval, and (iii) generation. Indexing creates data structures, such as clusters or graphs, that facilitate faster semantic similarity search on the embeddings [63, 68, 195, 343]. In the retrieval stage, the RAG system receives a query and encodes it as an embedding. It then searches the database for the $k$ most similar embeddings, with $k$ being a parameter specified by the system. Once the most similar embeddings are identified, the RAG system retrieves the corresponding document chunks. In the generation stage, both the retrieved document chunks and the query are fed to the LLM to perform inference and generate a response.

While the main application for RAG currently is document retrieval for question answering [38], researchers have also proposed multi-modal RAG pipelines [100, 303, 337]. For example, Vision Transformers [71] enable joint image and text retrieval [121, 234, 338]. Other works [88] combine even more modalities into the same embedding space such as audio, depth, thermal and movement data.

### 2.2 Approximate Nearest Neighbor Search

The retrieval stage forms a critical bottleneck in the RAG pipeline, as generation cannot begin before the relevant document chunks have been retrieved. The simplest method of identifying the $k$ most relevant (top-$k$) embeddings is *Nearest Neighbor Search* (NNS), which entails: (i) calculating the distances (e.g., Euclidean Distance [12, 15]) between the query and all database embeddings and (ii) selecting the $k$ database embeddings with the lowest distance. However, such a brute-force approach incurs significant computational overheads due to (i) the large size of embedding vectors [212] and (ii) the large number of database embeddings, reaching multiple millions [60, 320] and even billions [8, 114], resulting in expensive distance computations. To accelerate the retrieval stage, RAG often performs *Approximate Nearest Neighbor Search* (ANNS) [172], trading off some retrieval accuracy for faster similarity search. To quantify this drop in accuracy, researchers often use the $Recall@k$ metric [42, 172, 183, 197, 295], defined as the fraction of the true $k$ most relevant document chunks that ANNS actually retrieves.
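The following minimal Python sketch illustrates brute-force NNS and the $Recall@k$ metric described above. The toy vectors, the `approx` result list, and all function names are illustrative assumptions and are not taken from the evaluated system.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(query, database, k):
    """Exact NNS: score every database vector and return the ids of the k closest."""
    ranked = sorted(range(len(database)), key=lambda i: euclidean(query, database[i]))
    return ranked[:k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k ids that the approximate search also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Toy database of 2-dimensional embeddings (hypothetical values).
database = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [0.5, 0.5]]
query = [0.1, 0.1]

exact = nearest_neighbors(query, database, k=3)   # ground truth from exhaustive search
approx = [0, 4, 3]                                # hypothetical output of some ANNS method
print(exact, recall_at_k(approx, exact))
```

Here the hypothetical approximate search finds two of the three true neighbors, so its $Recall@3$ is 2/3. The exhaustive search touches every database vector, which is the cost that ANNS tries to avoid.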

Two popular methods for performing ANNS are (i) quantization [81, 116, 135, 209, 239] and (ii) algorithm-based techniques [78, 79, 97, 109, 153, 172, 183, 194, 232, 295]. Quantization methods compress data, reducing their storage footprint and speeding up computation. For example, Product Quantization (PQ) [116] partitions large embedding vectors into smaller sub-vectors and assigns each sub-vector to a cluster. PQ then concatenates the IDs of the clusters into a new vector that represents the original vector. Binary Quantization (BQ) compresses each embedding component from its original floating-point precision (e.g., FP32) down to a single bit, achieving a $32\times$ compression ratio. Recent studies [135, 212, 239, 260] show that BQ accelerates ANNS by up to $40\times$, with a small impact on recall when combined with a low-cost rescoring step [239].
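As a rough illustration of BQ with rescoring, the sketch below binarizes each component by its sign, ranks candidates by Hamming distance, and then rescores a short list with full-precision distances. Sign-based thresholding, the `rescore_factor` parameter, and the toy data are assumptions made for this example; the cited works may use different thresholds and rescoring schemes.

```python
def binarize(vector):
    """Keep one bit per component (1 if positive, else 0), packed into an int."""
    bits = 0
    for component in vector:
        bits = (bits << 1) | (1 if component > 0 else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two packed binary codes."""
    return bin(a ^ b).count("1")

def squared_l2(a, b):
    """Squared Euclidean distance on the original full-precision vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def bq_search(query, database, codes, k, rescore_factor=2):
    """Shortlist k * rescore_factor ids by Hamming distance, then rescore them."""
    q_code = binarize(query)
    by_hamming = sorted(range(len(codes)), key=lambda i: hamming(q_code, codes[i]))
    shortlist = by_hamming[: k * rescore_factor]
    return sorted(shortlist, key=lambda i: squared_l2(query, database[i]))[:k]

# Toy 4-dimensional embeddings (hypothetical values).
database = [
    [0.9, -0.2, 0.4, -0.7],
    [-0.8, 0.3, -0.5, 0.6],
    [0.7, -0.1, 0.5, 0.2],
    [-0.1, -0.9, 0.3, -0.4],
]
codes = [binarize(v) for v in database]   # 1 bit per component instead of 32
query = [0.8, -0.3, 0.6, -0.1]

print(codes, bq_search(query, database, codes, k=1))
```

In this toy case, vector 0 has the smallest Hamming distance to the query, but rescoring the two-entry shortlist with full-precision distances returns vector 2, which is also the exact nearest neighbor. This is the kind of recall loss that a rescoring step is meant to recover.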

Algorithm-based methods organize data by clustering them or creating graph-like data structures, which can be searched efficiently without traversing the entire database. For example, the Inverted-File (IVF) algorithm [63, 328, 343] organizes embeddings into clusters that are each represented by a centroid. To perform a search for a given query embedding, first a *coarse-grained* search identifies the cluster centroids closest to the query embedding. Second, a *fine-grained* search on all embeddings within these clusters (approximately) yields the closest neighbors to the query embedding. Other algorithms also exist, such as: (i) *Hierarchically Navigable Small World* (HNSW) [195], which constructs a hierarchy of graphs, where higher and lower levels of the hierarchy direct the search in a coarse- and fine-grained manner, respectively, and (ii) Locality-Sensitive Hashing (LSH) [68], which hashes similar embeddings into the same bucket with high probability.
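The two-step IVF search can be sketched as follows. This is a minimal host-side illustration, not the in-storage implementation: the centroids are given rather than trained (a real index would typically obtain them with k-means), and `nprobe`, the number of clusters to scan, is a name borrowed from common ANNS libraries rather than from the text.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(database, centroids):
    """Indexing: put every vector id into the inverted list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for vec_id, vec in enumerate(database):
        nearest = min(range(len(centroids)), key=lambda c: squared_l2(vec, centroids[c]))
        lists[nearest].append(vec_id)
    return lists

def ivf_search(query, database, centroids, lists, k, nprobe):
    """Coarse-grained search over centroids, then fine-grained search in nprobe clusters."""
    by_centroid = sorted(range(len(centroids)), key=lambda c: squared_l2(query, centroids[c]))
    probed = by_centroid[:nprobe]
    candidates = [vec_id for c in probed for vec_id in lists[c]]
    return sorted(candidates, key=lambda i: squared_l2(query, database[i]))[:k]

# Toy data (hypothetical values): three clusters in a 2-dimensional space.
centroids = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
database = [[0.5, 0.2], [9.5, 9.9], [0.1, 9.0], [1.0, 1.0], [10.2, 9.7], [4.0, 6.0]]
lists = build_ivf(database, centroids)
query = [3.0, 4.0]

print(ivf_search(query, database, centroids, lists, k=2, nprobe=1))
print(ivf_search(query, database, centroids, lists, k=2, nprobe=2))
```

With `nprobe=1` the search scans only the cluster of the closest centroid and misses vector 5, the true nearest neighbor, which lies in a neighboring cluster. With `nprobe=2` it is found. Scanning more clusters raises recall at the cost of more distance computations, which is the accuracy-performance trade-off of ANNS.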

### 2.3 SSDs & NAND Flash Memory

Figure 1 presents an overview of a modern SSD architecture based on NAND flash memory. An SSD comprises an SSD controller, DRAM and multiple NAND flash chips. The SSD controller (1) [3, 25, 27–29, 32, 270] handles the I/O requests from the host, and performs maintenance tasks such as garbage collection (e.g., [3, 26, 27, 29, 53, 130, 161, 259, 270, 307, 316]) and wear-leveling (e.g., [3, 180, 262, 340]). The SSD controller contains multiple embedded microprocessors (2) [14] that execute the firmware called the Flash Translation Layer (FTL) [95, 180, 262, 270, 340]. The SSD controller stores metadata (e.g., logical-to-physical page mapping table [95]) and frequently-accessed pages in a DRAM (3) internal to the SSD. The DRAM size is typically 0.1% of the storage capacity (e.g., 1GB DRAM for each TB of storage capacity [251]). The SSD controller translates the logical page address of each I/O request to a physical page address, and issues commands to the flash chips [3, 26, 214] via the flash controllers. An SSD consists of multiple flash controllers (4) [141, 143, 203, 205, 305], which are embedded processors that interface the SSD controller with flash chips (5). Each flash controller is responsible for communication with multiple flash chips sharing the same channel. The flash controller selects a flash chip for read/write operations and initiates command and data transfers.
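To make the logical-to-physical translation concrete, the sketch below models a heavily simplified page-level FTL. The class and its fields are hypothetical, and the sketch assumes out-of-place updates with no garbage collection, wear-leveling, or mapping to channels, dies, and planes.

```python
class PageLevelFTL:
    """Toy page-level FTL: a logical-to-physical (L2P) table with out-of-place writes."""

    def __init__(self, num_physical_pages):
        self.l2p = {}                              # logical page number -> physical page number
        self.flash = [None] * num_physical_pages   # stand-in for the NAND flash pages
        self.invalid = set()                       # physical pages that hold stale data
        self.next_free = 0                         # next unwritten physical page

    def write(self, lpn, data):
        """Write to a fresh physical page and remap the logical page to it."""
        if self.next_free >= len(self.flash):
            raise RuntimeError("no free pages; a real FTL would run garbage collection")
        if lpn in self.l2p:
            self.invalid.add(self.l2p[lpn])        # the old copy becomes stale
        ppn = self.next_free
        self.flash[ppn] = data
        self.l2p[lpn] = ppn
        self.next_free += 1
        return ppn

    def read(self, lpn):
        """Translate the logical page number, then read the physical page."""
        return self.flash[self.l2p[lpn]]

ftl = PageLevelFTL(num_physical_pages=4)
first = ftl.write(7, "v1")     # logical page 7 lands on physical page 0
second = ftl.write(7, "v2")    # the update goes to physical page 1; page 0 is now stale
print(first, second, ftl.read(7), ftl.invalid)
```

Every host access passes through a table lookup of this kind, which is why the mapping table is kept in the SSD-internal DRAM.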



Figure 1: NAND Flash Memory Architecture

Each NAND flash chip is comprised of multiple flash dies (6), which operate independently of each other. Each die consists of 2–16 planes (7) [51, 147, 152] that can perform read or write operations in parallel. Planes are further divided into groups of blocks, with each block consisting of hundreds of pages (8). A flash page, typically sized at 16KB, consists of thousands of NAND flash cells placed horizontally. A flash page typically stores user data, and includes a dedicated out-of-band area (9) [107, 267, 321] (e.g., 64–256 bytes) to store metadata related to error correction codes and logical-to-physical mapping. NAND flash memory executes read and program operations at page granularity, and performs erase operations at block granularity [25, 30, 32, 62, 184, 203, 224]. A flash die employs a page buffer, which acts as an intermediate buffer during read and write operations. The page buffer consists of multiple buffers [101, 137, 165, 189, 204, 206, 246, 261, 284] (e.g., three buffers if each flash cell stores 3 bits) to store the bits in a flash page. The sensing buffer (10) is the main buffer that temporarily stores flash page data during the read operation. The cache buffer (11) improves read performance by enabling data transfer from the flash chip to the flash controller in parallel with the next read operation. Data buffers (12) are typically used when (1) programming multiple bits per cell, and (2) reading a single bit from a flash cell that stores multiple bits.

Based on the number of bits stored in a flash cell, a flash cell can be classified as a single-level cell (SLC; 1 bit) [49], multi-level cell (MLC; 2 bits) [163], triple-level cell (TLC; 3 bits) [51, 147, 190], or quad-level cell (QLC; 4 bits) [50]. While the SSD capacity increases as each flash cell stores more bits, the increased value density leads to higher latency and lower endurance [26, 27, 29, 31, 203]. To enable reliable writes to flash cells, SSD manufacturers use Incremental Step Pulse/Erasure Programming (ISPP/ISPE) techniques [117, 266]. The ISPP/ISPE technique iterates through multiple steps of gradually inserting/ejecting electrons into/from the flash cell until the desired charge level is reached. The peripheral circuitry (13) in each flash die includes an on-chip digital bit counter, pass/fail checking logic and XOR logic between the latches. The digital bit counter and pass/fail checking logic [48, 52, 203] are used to test the state of the cells and guide the ISPP/ISPE process. To further increase reliability, the XOR logic between the latches is used for on-chip data randomization [33, 106, 139, 188, 224].
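The program-and-verify loop behind ISPP can be illustrated with the toy model below. It is not a physical model: the integer "charge units", the linearly growing pulse strength, and all parameter names are assumptions chosen only to show the iterate-until-verified structure.

```python
def ispp_program(target_charge, first_pulse=2, pulse_increment=1, max_pulses=20):
    """Apply increasingly strong program pulses until the verify check passes.

    Returns the number of pulses used and the final charge (arbitrary integer units).
    """
    charge = 0
    pulse_strength = first_pulse
    for pulse in range(1, max_pulses + 1):
        charge += pulse_strength              # program pulse: inject some charge
        if charge >= target_charge:           # verify step: pass/fail check
            return pulse, charge
        pulse_strength += pulse_increment     # step up the next pulse and retry
    raise RuntimeError("cell did not reach the target charge level")

pulses, final_charge = ispp_program(target_charge=10)
print(pulses, final_charge)
```

In this toy run the cell accumulates 2, 5, 9, and then 14 units, so the verify check passes after four pulses. In a real die, the pass/fail checking logic and bit counter mentioned above play the role of the `if` test.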

### 2.4 In-Storage Processing

In-storage processing (ISP) is a computation paradigm that enables processing of data within the storage device. ISP techniques provide significant performance and energy efficiency benefits over conventional systems for data-intensive applications, such as genomics [85, 198], neural networks [112, 142, 215, 300], databases [1, 35, 89, 93, 96, 98, 125, 148, 173, 174, 219, 238, 255–257, 306] and graph analytics [19, 21, 82, 98, 127, 162, 171, 227, 257, 283]. Unlike conventional systems, ISP techniques leverage the high internal bandwidth of the storage system and reduce data movement across the memory hierarchy. ISP techniques perform computation by (1) leveraging the embedded general-purpose cores (e.g., [2, 17, 22, 94, 131, 133, 136, 146, 151, 167, 179, 199, 254, 280–282, 291, 298]) already present in the SSDs, or (2) placing hardware accelerators (e.g., [7, 46, 69, 118, 127–129, 149, 162, 176, 177, 193, 226, 243, 244, 283]) near the flash chips.

Prior works (e.g., [2, 17, 22, 94, 131, 133, 136, 146, 151, 167, 179, 199, 254, 280–282, 291, 298]) propose techniques to utilize the embedded cores in the SSD for computations such as filtering, aggregation, and encryption. These general-purpose embedded cores are beneficial only for simple computations because the primary responsibility of these cores is to execute the FTL and handle I/O requests. A large body of prior work (e.g., [7, 46, 69, 118, 127–129, 149, 162, 176, 177, 193, 226, 243, 244, 283]) proposes to embed hardware accelerators near the flash packages to accelerate application-specific computations. These hardware accelerators provide significant performance benefits, but add area and power overheads to the SSD. Several prior works [106, 144, 178, 192, 299, 310] identify the substantial I/O traffic (up to 70–75% of the end-to-end search latency [178, 299]) in billion-scale ANNS applications [11, 115, 126, 264, 265] and propose offloading ANNS to the storage system. These ISP-based ANNS accelerators can largely alleviate the I/O overhead, improving performance over conventional systems.

## 3 Motivation

A key limitation of modern LLMs is their inability to generate responses with information beyond their training data. To solve this issue, modern LLM application frameworks [37, 182, 228] support Retrieval-Augmented Generation, combining the text generation capabilities of LLMs [73, 74, 122, 181, 272, 331] with an external knowledge database, as described in Sec. 2.1. Beyond information retrieval, RAG also enables long-tail knowledge memorization [196], alleviating the need for large models with billions of parameters [111], and mitigating the risk of revealing training data [326].

While current research on RAG [39, 84, 169, 324, 334] mainly focuses on further enhancing these capabilities, to our knowledge, no existing works attempt to characterize and address the inefficiencies found in RAG pipelines. In this section, we analyze the performance bottlenecks of RAG and discuss the issues existing systems face when tackling these bottlenecks.

### 3.1 Performance Bottleneck of RAG Pipelines

As described in Sec. 2.1, RAG pipelines consist of one offline stage, indexing, and two online stages, i.e., retrieval and generation. The online stages of RAG entail (i) encoding the query into an embedding vector, (ii) performing dense retrieval for relevant documents, and (iii) using the retrieved documents and the query to generate a response. For these steps, respectively, the RAG system has to load (i) the embedding model, (ii) the RAG database and (iii) the generation model (i.e., the LLM) from the storage system. With the aim of identifying potential inefficiencies, we measure the latency contributions of the above stages to the RAG pipeline.

**Methodology.** For encoding and generation, we choose two popular open-source models, *all-roberta-large-v1* [240, 241] and Llama 3.2 1B [73], respectively. We use FAISS [72] flat indexes to link embeddings to document chunks. We evaluate RAG performance on two datasets, *HotpotQA* [320] with 5.3 million entries and the English subset of a Wikipedia-based dataset (*wiki\_en*) [60], with 41.5 million entries. For each query, we retrieve the 10 most relevant document chunks (top-10 retrieval). Our RAG system consists of a high-end NVIDIA A100 GPU [55] for embedding and generation and two high-end Intel Xeon Gold 5118 CPUs [110], with a Samsung PM9A3 PCIe 4.0 SSD [250] for retrieval. The system is also equipped with 1.5TB of DDR4 memory [208].

**Results.** Figure 2 shows the contribution of different operations in the RAG pipeline to end-to-end execution time. We make two key observations. First, dataset loading accounts for a substantial portion of the pipeline’s overall latency, reaching 84% for *wiki\_en*. Second, the latency attributed to dataset loading increases with dataset size. For example, as the dataset size grows by approximately 8× from *HotpotQA* to *wiki\_en*, the percentage of latency attributed to dataset loading increases by around 1.7×. We conclude that dataset loading during the retrieval stage contributes significant latency to the RAG pipeline and becomes a performance bottleneck, especially for large datasets. We refer to this bottleneck as the *I/O data movement bottleneck* in RAG pipelines.

Figure 2: Latency breakdown for a typical RAG pipeline. Total time is displayed next to each bar.



Figure 3: Latency breakdown for a RAG pipeline using Binary Quantization (BQ). Total time is displayed next to each bar.

As an important caveat, we acknowledge that the contribution of *I/O data movement* to end-to-end RAG performance largely depends on encoding and generation model sizes. Larger models (e.g., Llama 3.2 90B [73]) increase generation latency, due to the increased computation cost, potentially reducing the impact of I/O data movement in the RAG retrieval stage. Even in this case, I/O data movement can still bottleneck the RAG pipeline for two key reasons. First, LLM acceleration techniques [6, 65, 66, 104, 155, 237, 258, 263, 289, 323] and more powerful hardware [217, 225, 269, 278] can substantially reduce generation latency, exacerbating the I/O bottleneck of the RAG pipeline. For instance, tensor parallelism [155, 263, 289] enables efficient LLM generation on multi-node GPU systems [54, 217, 278, 279], significantly improving performance. Second, the increasingly popular Mixture-of-Experts (MoE) LLM architecture [122, 181, 339] can reduce computational cost and increase generation performance of large LLMs. As a result, we anticipate that the retrieval, and not the generation stage, will remain a significant bottleneck in future RAG pipelines.

### 3.2 Limitations of Existing RAG Optimizations

We discuss the limitations of existing optimizations when trying to alleviate the I/O data movement bottleneck in RAG pipelines.

**Batching.** One possible solution is to batch multiple queries before performing retrieval to amortize dataset loading overheads. However, the effectiveness of this technique remains limited in practice as queries from different domains (e.g., medical, law, finance) must be served from *different*, domain-specific [105, 156, 185, 287, 302, 304, 309, 322, 325, 329, 336] or multi-modal datasets [16, 87, 90, 201, 210, 223, 311, 315, 344] to enhance generation quality.

**Quantization.** Quantization techniques, such as Product Quantization (PQ) or Binary Quantization (BQ), can reduce the memory footprint of RAG applications. Recent studies [135, 209, 212, 239, 260] demonstrate that BQ provides a good trade-off between storage footprint and recall. To further evaluate this trade-off, we repeat the previous experiment using BQ for the embeddings. As shown in Fig. 3, while BQ reduces the I/O data movement overhead by 17–29% for both datasets, dataset loading remains the bottleneck for the larger *wiki\_en* dataset, amounting to 67% of the total latency.

While quantization significantly reduces the size of embeddings, this is not possible for the document chunks, which amount to 9GB of the total 14GB transferred for the *wiki\_en* dataset (after BQ on the embeddings). Therefore, we conclude that quantization techniques are useful in reducing the I/O data movement bottleneck, but they cannot eliminate it.
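A back-of-the-envelope calculation shows why the embeddings stop dominating the transfer once BQ is applied. The sketch below assumes 1024-dimensional embeddings, which this section does not state; the entry count is the *wiki\_en* figure given in Sec. 3.1.

```python
def embedding_footprint_bytes(num_vectors, dims, bits_per_component):
    """Total size of an embedding table for a given per-component precision."""
    return num_vectors * dims * bits_per_component // 8

NUM_VECTORS = 41_500_000   # wiki_en entries, as stated in the text
DIMS = 1024                # assumption: the dimensionality is not given in this section

fp32_bytes = embedding_footprint_bytes(NUM_VECTORS, DIMS, 32)
bq_bytes = embedding_footprint_bytes(NUM_VECTORS, DIMS, 1)

print(f"FP32: {fp32_bytes / 1e9:.1f} GB")
print(f"BQ:   {bq_bytes / 1e9:.1f} GB")
print(f"compression: {fp32_bytes // bq_bytes}x")
```

Under this assumption, the binary embeddings occupy about 5.3GB, which is roughly in line with the approximately 5GB implied above (14GB total minus 9GB of document chunks). The document chunks are not compressed by BQ and therefore make up most of the remaining traffic.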

**Algorithmic Optimization.** ANNS algorithms often improve retrieval performance by using sophisticated indexes [63, 68, 195, 343], which reduce search time. The data structures used to store these indexes are often larger than the flat indexes used for simple brute-force approaches, potentially exacerbating the I/O data movement bottleneck. Hybrid ANNS algorithms [41, 115] attempt to overcome the I/O data movement bottleneck by storing the index in SSDs and loading parts of it in memory for distance computations on demand. SPANN [41] provides the state-of-the-art performance-accuracy tradeoff among hybrid ANNS solutions, enabling small amounts of DRAM (e.g., 32GB) to accelerate searches in TB-sized SSD-resident datasets. Specifically, SPANN groups embeddings into clusters and stores them in the SSD, only keeping cluster centroids in memory. We conduct an experimental study on SPANN and find two major limitations of this type of solution. First, we observe that achieving a reasonable recall-accuracy tradeoff requires selecting a large number of centroids, increasing memory footprint and lowering performance. For example, reaching 0.92 *Recall@10* in *HotpotQA* requires storing 24% of all embeddings as centroids in memory, yielding only a 22% speedup over exhaustive search. This observation is consistent with the original study of this algorithm [41]. Second, hybrid ANNS algorithms such as SPANN only optimize storage and retrieval for embeddings and not for the document chunks of a vector database. We conclude that hybrid ANNS algorithms also do *not* fundamentally alleviate the I/O data movement bottleneck.

**Memory Expansion.** As our analysis in Sec. 3.1 shows, data movement between storage and the host contributes significant latency to the RAG retrieval stage. Memory expansion techniques such as those enabled by Compute Express Link (CXL) [4, 67, 91, 113, 168] enable very large memory capacities that could theoretically keep RAG datasets resident in memory. However, such approaches suffer from two key drawbacks. First, main memory is significantly (i.e., more than an order of magnitude) more expensive per GB than flash storage, at approximately 3.10 [248] vs 0.1 [250] USD per GB, respectively. Second, such approaches are unsustainable as (i) continuously increasing dataset sizes, and (ii) the growing number of datasets for domain-specific applications [105, 156, 185, 287, 302, 304, 309, 322, 325, 329, 336] eventually overwhelm the capacity of such systems.

**ANNS Acceleration Inside the Storage.** Prior works propose In-Storage Processing (ISP) techniques [178, 192, 299] to alleviate the I/O data movement bottleneck in the ANNS kernel. Although ANNS forms a key component of RAG, existing ISP-based ANNS accelerators cannot entirely eliminate the I/O data movement bottleneck for three key reasons. First, prior ANNS acceleration works [178, 192, 299] employ graph-based algorithms such as HNSW [195] and DiskANN [115], using graph traversal to identify similar neighbors. During graph traversal, the algorithm performs an analysis on the current vertex to identify the next vertex. As a result, graph traversal induces irregular access patterns [75, 76] that underutilize the internal bandwidth of the SSD due to costly channel and NAND Flash chip conflicts [143, 214]. Second, prior ISP-based ANNS accelerators [106, 178, 192, 299] focus primarily on accelerating the search operation without providing efficient support for retrieving the associated documents. However, as shown in Figs. 2 and 3, the dataset loading step contributes significant latency to RAG retrieval. Third, works such as [106, 192] introduce significant storage and hardware overheads. For example, to perform computations inside NAND flash dies, ICE [106] stores data in a format that can tolerate errors without error correction. This format incurs a 32× (8×) storage overhead for data in 8-bit (4-bit) precision. Another example is DeepStore [192], which incurs significant area and power overheads by introducing a systolic array-based architecture in the storage system to perform query matching by executing Deep Neural Networks. Overall, these limitations hinder the adoption of ISP-based acceleration techniques in RAG pipelines.

### 3.3 Our Goal

Based on our observations and analyses in Sec. 3.1 and 3.2, we conclude that (1) the I/O data movement of RAG significantly bottlenecks its performance, and (2) none of the prior techniques effectively eliminate this bottleneck in the RAG pipeline. **Our goal** is to fundamentally alleviate the I/O data movement bottleneck in RAG through an ISP design that does not introduce modifications to the hardware of the storage system.

## 4 REIS

REIS is an In-Storage Processing (ISP)-based retrieval system that alleviates the I/O data movement bottleneck in the RAG pipeline. REIS works by receiving query embeddings from the host, querying the database inside the storage, and then returning relevant document chunks, greatly reducing communication between host and storage system.

ISP introduces two significant design challenges. First, the available embedded cores are limited in terms of both performance and functionality (e.g., lack of floating point support [13]). Second, the flash channel bandwidth is limited compared to the total NAND flash read bandwidth. As described in Sec. 4.3, REIS uses the existing hardware inside the NAND flash planes to alleviate the load on the embedded cores, which, however, introduces new limitations: (I) The logic inside flash dies only supports simple bitwise and bit-counting operations. (II) NAND flash reads are unreliable, requiring the use of error correction codes (ECC) [31] to achieve robust operation. Since ECC is typically performed by the controller [14], performing computation inside flash dies requires fundamentally different error mitigation mechanisms.

In this section, we explain the design decisions behind REIS, which alleviate the aforementioned issues. Figure 4 presents an overview of the system and the key mechanisms it consists of. First, REIS employs a vector database layout that links embeddings with documents in order to enable efficient document retrieval (Sec. 4.1). Second, REIS introduces support for the Inverted File (IVF) algorithm in ISP systems, improving the end-to-end retrieval performance (Sec. 4.2.1). Third, an in-storage ANNS engine efficiently executes the ANNS kernel (Sec. 4.3).



Figure 4: Overview of REIS.

## 4.1 Database Layout

REIS introduces a vector database layout that distributes and links embeddings and documents in order to maximize the data access parallelism for in-storage computation. The database layout (i) distributes the vector database into an index region and a document region, (ii) creates low-overhead links between each embedding and its associated document chunk, and (iii) provides coarse-grained access to each dataset to avoid frequent FTL invocations.

**4.1.1 Database Distribution.** During the retrieval stage of RAG, the ANNS kernel performs distance calculations on the database embeddings to select the top- $k$  most similar documents. As a result, accesses to embeddings are far more frequent than accesses to documents. Based on this observation, we distribute the database in three ways to improve the efficiency of accessing embeddings. First, we map embeddings and documents to two separate regions of the NAND flash array, the *embedding* (❶ in Fig. 4) and the *document* (❷) regions, respectively. Second, we employ Parallelism-First Page Allocation [332] to evenly distribute embeddings across all planes of the storage system. Third, we assign each document chunk to an individual 4KB sub-page or a 16KB page, adapting to different document chunking granularities [5, 20, 158, 175, 211, 245, 293].

**4.1.2 Hybrid SSD design.** Modern SSDs employ Triple-Level Cells (TLC), which rely on ECC to combine high density with data integrity, requiring data transfers to the embedded cores of the SSD controller for error correction. As will be shown in Sec. 4.3, REIS performs operations within the planes and dies of the storage system. Thus, performing ECC on the controller would create significant data movement overheads, negating potential speedups. To (i) eliminate such overheads and (ii) allow error-free in-plane embedding distance calculation without ECC, REIS employs *HybridSSD* [247, 286, 308, 332] techniques in the ANNS engine. Specifically, we employ soft partitioning to create (i) a robust, non-ECC Single Level Cell (SLC) partition for storing binary embeddings, and (ii) a typical, high-density TLC partition that stores the database's document chunks and embeddings that are not processed within the planes (e.g., INT8 embeddings for reranking). To further improve the robustness of the SLC partition, REIS uses Enhanced SLC-mode Programming (ESP) [224], which maximizes the margin between the voltage ranges of the values in SLC, achieving zero BER *without* ECC. As an added benefit, SLC programming slightly enhances RAG performance due to the decreased read latency of SLC compared to TLC [247].

**4.1.3 Embedding-Document Linkage.** While the database layout of Sec. 4.1.1 can increase performance by separating the frequently accessed embeddings from the less frequently accessed document chunks, performing document retrieval requires a connection between the two. To achieve this, REIS employs a low-cost linkage mechanism within the storage system that associates each embedding with the address of its corresponding document chunk.

Modern NAND flash memory provisions some storage space for ECC bits known as the Out-Of-Band (OOB) area (e.g., 2208 spare bytes for each 16KB page [230, 236]). During each page read, the page buffer loads OOB data together with the page. We re-purpose a small portion of the OOB area to store the address of the document chunk that is associated with each embedding (❸ in Fig. 4). For example, assuming a dataset where (i) each embedding and document chunk occupies 4KB (i.e., a sub-page [159]) and (ii) each document chunk requires a 4-byte address, linking embeddings to documents requires 16 spare bytes (or 0.7% of the OOB area) for each page. This approach ensures that whenever an embedding is loaded to the page buffer, the address of its associated document chunk is also loaded. Therefore, when the storage system conducts distance computation for a page of embeddings using the mechanisms proposed in Sec. 4.3, the addresses of associated document chunks are available in the page buffer for document identification and retrieval. Our proposed mechanism eliminates the need to maintain a specialized data structure for document retrieval with minimal space overhead to the storage system.
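The overhead figure in the example above can be reproduced with a short calculation. The sketch below uses only the numbers stated in the text (16KB pages, 4KB sub-pages, 4-byte addresses, 2208 spare bytes); the function name is illustrative.

```python
def linkage_bytes_per_page(page_bytes: int, embedding_bytes: int, addr_bytes: int) -> int:
    """Spare bytes needed to store one document address per embedding in a page."""
    embeddings_per_page = page_bytes // embedding_bytes
    return embeddings_per_page * addr_bytes

PAGE_BYTES = 16 * 1024     # 16KB page
SUBPAGE_BYTES = 4 * 1024   # one embedding per 4KB sub-page (example in the text)
OOB_BYTES = 2208           # spare bytes per page (example in the text)

used = linkage_bytes_per_page(PAGE_BYTES, SUBPAGE_BYTES, addr_bytes=4)
fraction = used / OOB_BYTES  # share of the OOB area consumed by the links
```

With four sub-pages per page, the links occupy 16 of the 2208 spare bytes, which rounds to the 0.7% quoted above.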

**4.1.4 Coarse-Grained Access.** With the aim of (i) distinguishing between different RAG datasets in the storage system and (ii) reducing the frequent address translation overheads when accessing embeddings, REIS introduces a coarse-grained access scheme. Specifically, REIS stores an address information entry for each region of the database in the internal DRAM. Each entry includes an integer index as the distinct signature of a database and the addresses of the first and last entries of the embedding and document regions. The coarse-grained access scheme enables database management in two ways. First, during database deployment, the storage system reserves two non-overlapping, consecutive regions and creates the address entries based on the size of a database before deploying the database to the storage system. In this way, we ensure the isolation of the database from other user data or databases. Second, during a database search operation, the storage system finds the starting embedding address of a database through the address entry to start the retrieval process. For each upcoming page read, the SSD controller infers the next address to read by incrementing the current address, instead of frequently invoking the address translation using the L2P mapping table. To ensure data integrity, REIS retains page-level FTL metadata, which contains essential information for operations such as refresh and wear-leveling. This metadata is used for (i) writes during database initialization and (ii) periodic maintenance operations such as data refresh, which are rare (e.g., once a year [207]). After these operations, FTL metadata is flushed from the SSD's DRAM.

Coarse-grained access eliminates the need to maintain the page-level FTL for both regions of the database after deployment, conserving the valuable space of the internal DRAM for other operations (see Sec. 4.3). For example, for a 1TB vector database that originally demands 1GB for page-level FTL [99, 296, 341], the maintenance cost for addressing is reduced to 21 bytes. Since REIS is designed with the aim of serving potentially many different RAG databases, we store the necessary information (i.e., the integer index of the database and the addresses of the first and last entries in the embedding and document regions) in a small array in the SSD Controller's DRAM. This structure is called R-DB (**A** in Fig. 4) and serves as a record of deployed databases. A potential downside of coarse-grained access is that it requires the existence of a large contiguous block of storage, which may necessitate defragmentation operations during database deployment. However, this is an initial upfront overhead that can be amortized over time.
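The coarse-grained access scheme can be sketched as a small record table. This is a host-side illustration rather than controller firmware: the class and field names, the use of plain integers as page addresses, and the overlap check are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple

@dataclass(frozen=True)
class RDBEntry:
    """One R-DB record: database index plus first/last addresses of both regions."""
    db_id: int
    emb_first: int
    emb_last: int
    doc_first: int
    doc_last: int

    def regions(self) -> List[Tuple[int, int]]:
        return [(self.emb_first, self.emb_last), (self.doc_first, self.doc_last)]

class RDB:
    """Hypothetical record of deployed databases kept in the SSD's DRAM."""

    def __init__(self) -> None:
        self._entries: Dict[int, RDBEntry] = {}

    def deploy(self, entry: RDBEntry) -> None:
        # Isolation: the new regions must not overlap any region already reserved.
        for other in self._entries.values():
            for lo, hi in other.regions():
                for a, b in entry.regions():
                    if a <= hi and lo <= b:
                        raise ValueError("regions overlap an existing database")
        self._entries[entry.db_id] = entry

    def embedding_pages(self, db_id: int) -> Iterator[int]:
        # The next address is obtained by incrementing, with no per-page L2P lookup.
        entry = self._entries[db_id]
        return iter(range(entry.emb_first, entry.emb_last + 1))
```

For instance, after `rdb.deploy(RDBEntry(7, 100, 103, 500, 509))`, iterating `rdb.embedding_pages(7)` visits the pages 100 through 103 in order using only the stored first and last addresses.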

## 4.2 ISP-Friendly ANNS Algorithms

Apart from graph-based ANNS algorithms used by prior works [115, 195, 299], two other types of mainstream ANNS algorithms exist: cluster-based (e.g., Inverted File (IVF) [63, 343]) and hash-based algorithms (e.g., Locality Sensitive Hashing (LSH) [68]). With the aim of selecting the most suitable algorithms for our system, we perform a quantitative comparison, measuring throughput and recall on a CPU-based system (described in Table 3). Specifically, we compare the performance of IVF, HNSW, and LSH on the *wiki\_en* dataset [61] using the Cohere [239] embedding model and the FAISS [72] library. We measure throughput in Queries per Second (QPS) and normalize it to that of exhaustive search. We first evaluate the performance of different implementations without quantization. Figure 5 demonstrates that: (i) HNSW is the best performing base (i.e., without quantization) algorithm, (ii) both HNSW and IVF provide up to 0.99 recall, and (iii) LSH is the worst performing algorithm, with lower performance than exhaustive search for recall values above 0.8 (1.2× slower for Recall@10=0.9).

**Figure 5: Comparison of ANNS algorithms in terms of throughput and recall running on CPU. For IVF, *nlist* denotes the number of clusters for a dataset. For HNSW, *M* denotes the number of neighbors for each vertex.**

Since ISP hardware has limited capabilities (e.g., lack of floating-point support [13]), ISP-based ANNS requires quantization. For this reason, in Fig. 5 we also analyze the performance of IVF and HNSW when using Binary Quantization (BQ) and Product Quantization (PQ), combined with reranking [239]. We make four key observations: (i) IVF recall remains high even with BQ (PQ) at 0.97 (0.96), (ii) PQ performs worse than BQ and even floating-point IVF, (iii) IVF throughput increases significantly with BQ, and (iv) HNSW throughput remains constant with BQ, while still outperforming IVF by approximately 3×. While these observations suggest that both HNSW and IVF are compelling options for ANNS-based RAG, graph-based algorithms (e.g., HNSW) feature irregular access patterns [75, 76] that underutilize the internal bandwidth of the SSD, making them unsuitable for ISP. In contrast, IVF performs searches in contiguous data, exhibiting streaming access patterns. We thus select IVF as our algorithm of choice, and perform modifications to our database layout that support its execution.

**4.2.1 IVF-tailored Database Layout.** In order to accelerate retrieval, REIS employs ANNS via the Inverted File (IVF) algorithm. As will be shown in Sec. 4.3, REIS uses IVF with quantization and reranking, which requires storing data in both binary and INT8 precision. To efficiently support IVF with these optimizations, we modify the database layout of Sec. 4.1 in three ways. First, we divide the embedding region into three sub-regions, one for storing cluster centroids, and two other regions for storing embeddings in binary and INT8 precision, respectively. Second, to facilitate IVF search operations, we create an array which serves as a record of all clusters. Each element of the array corresponds to an IVF cluster and contains: (i) the address of the cluster centroid, (ii) the index of the first and the last embedding within the cluster, and (iii) an 8-bit tag associated with the cluster. We name this array R-IVF (**B** in Fig. 4) and store it in the SSD's DRAM, resulting in a memory footprint of *Number\_of\_entries* × 15B. Third, we extend the *Embedding-Document Linkage* of Sec. 4.1.3 in two distinct ways. (I) In order to link binary embeddings to their INT8 counterparts for reranking, apart from the document address corresponding to each embedding, we also store the address of the INT8 embedding (RADR) in the OOB region. (II) For reasons that will become apparent in Sec. 4.3, we store the 8-bit tag of the cluster in the OOB area of the page that contains the cluster centroid.
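As a quick illustration of the R-IVF footprint formula, the sketch below evaluates it for one cluster count. The 15-byte entry size is taken from the text; the value of `nlist` is a hypothetical choice for the example and is not a configuration reported in this paper.

```python
R_IVF_ENTRY_BYTES = 15  # per-cluster entry size stated in the text

def r_ivf_footprint_bytes(number_of_entries: int) -> int:
    """DRAM footprint of R-IVF: one fixed-size entry per IVF cluster."""
    return number_of_entries * R_IVF_ENTRY_BYTES

nlist = 16384                               # hypothetical number of clusters
footprint = r_ivf_footprint_bytes(nlist)    # bytes of SSD DRAM
footprint_kib = footprint / 1024
```

Even for tens of thousands of clusters the record stays in the hundreds-of-kilobytes range, which is small relative to typical SSD DRAM capacities.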

Supporting IVF also requires allocating data structures in the SSD Controller's DRAM. Specifically, during IVF operations REIS maintains lists containing (i) clusters and (ii) embedding vectors, as well as their distances from the query embedding. These structures are called *Temporal Top Lists* (TTL) (**C** in Fig. 4) and, as will be shown in Sec. 4.3, are employed in our In-Storage ANNS Engine.

## 4.3 In-Storage ANNS Engine

Prior ISP-based ANNS accelerators [178, 192, 299] commonly integrate Multiply-Accumulate (MAC) units to compute Euclidean distance [12, 15] for ANNS. Introducing such changes to the storage system creates (i) significant power and area overheads, and (ii) adoption issues due to the intrusive nature of such modifications. As explained in Sec. 2.2, there exist opportunities to reduce the computational overhead of ANNS while retaining accuracy. Recent studies [135, 209, 212, 239, 260] have shown that Binary Quantization (BQ) can achieve a recall of 96%, due to the large dimensionality of text embeddings [20, 158, 160, 164, 202, 211, 212, 292, 293]. With REIS, our goal is to avoid the power and area overheads of prior designs. To this end, we design an In-Storage ANNS engine based on BQ which (i) utilizes only existing components within the SSD system to perform retrieval, (ii) exploits the plane-level, die-level, and channel-level parallelism of the storage system, and (iii) incorporates two major optimizations, distance filtering and pipelining.

**4.3.1 Search Process.** The search process for the Inverted File algorithm (IVF) [63, 343] consists of two steps, a coarse-grained and a fine-grained search. First, in the coarse-grained search, REIS searches through all cluster centroids to identify those closest to the query embedding. To achieve this, REIS starts by reading and calculating the distance for all embeddings stored in one page. For each embedding, it then creates an entry consisting of the distance value (DIST), embedding (EMB), embedding address (EADR), and the associated tag (TAG). It sends these entries to a table, the Temporal Top List for Centroids (TTL-C), which resides in the SSD’s DRAM. After filling the TTL-C for each page read, the embedded cores of the SSD controller execute a quickselect kernel [191] on the distance numbers, identifying the entries that correspond to the  $N$  nearest clusters to the query. Quickselect finds the  $k$ -th smallest element in an unordered array in average time linear in the array length, simultaneously selecting the  $k$  smallest elements in the process without sorting them. At the same time, the storage system reads the next page of centroids and conducts distance computations to hide the latency of selection. Each iteration consists of (i) a page read, (ii) distance computations, and (iii) embedding selection, updating the TTL-C with the new closest clusters. After the last iteration, REIS selects the nearest clusters according to the finalized TTL-C. In the second step, REIS conducts a fine-grained search inside the clusters identified in the first step. The fine-grained search has two major differences compared to the coarse-grained search. (i) Instead of forming the TTL entry using TAG, for the fine-grained search, each TTL entry consists of DIST, EMB, RADR, and the address of the associated document (DADR). We name the table for the fine-grained search Temporal Top List for Embeddings (TTL-E). (ii) After the last iteration of selecting the  $k$  nearest embeddings to the query, the storage system performs quicksort [102] to obtain a distance-ordered top- $k$  list for the query.
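The two-step search described above can be sketched functionally as follows. This is a host-side NumPy illustration, not the in-storage implementation: it processes all centroids at once instead of page by page, uses `np.argpartition` (a selection routine in the spirit of quickselect) in place of the embedded-core kernel, and the names `ivf_search`, `cluster_of`, and `nprobe` are assumptions for the example.

```python
import numpy as np

def hamming(query: np.ndarray, codes: np.ndarray) -> np.ndarray:
    """Hamming distance between one packed binary query and each packed code."""
    return np.unpackbits(np.bitwise_xor(codes, query), axis=1).sum(axis=1)

def ivf_search(query, centroids, codes, cluster_of, nprobe, k):
    """Two-step IVF search over packed binary vectors (uint8); nprobe and k >= 1."""
    # Coarse-grained step: select the nprobe nearest centroids without sorting.
    c_dist = hamming(query, centroids)
    nprobe = min(nprobe, len(centroids))
    nearest = np.argpartition(c_dist, nprobe - 1)[:nprobe]

    # Fine-grained step: scan only embeddings that belong to the selected clusters.
    cand = np.flatnonzero(np.isin(cluster_of, nearest))
    if cand.size == 0:
        return cand, np.empty(0, dtype=np.int64)
    e_dist = hamming(query, codes[cand])
    k = min(k, cand.size)
    top = np.argpartition(e_dist, k - 1)[:k]

    # Only the k survivors are fully ordered (the final sorting step in the text).
    order = np.argsort(e_dist[top], kind="stable")
    return cand[top[order]], e_dist[top[order]]
```

Selection keeps the per-iteration work linear in the number of candidates, and the full sort is deferred to the final top-$k$ list, mirroring the split between quickselect and quicksort in the text.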

**4.3.2 Retrieval architecture and execution.** Document retrieval is performed by the ANNS engine, which (i) receives the query embedding from the host system, (ii) computes the distance between the query embedding and database embeddings, and (iii) returns the top- $k$  results. Fig. 6 breaks down REIS’s execution flow in nine steps.

Figure 6: REIS’s In-Storage ANNS Engine

The execution flow begins with the reception of a new query by the storage system, which is placed in the SSD’s DRAM and which triggers the execution of the ANNS kernel (steps ②–⑧). The storage system first transfers the query embedding from the DRAM to the data buffer in each NAND Flash plane ① and then writes multiple copies of the data, filling the whole Cache Latch (CL). These copies are aligned to the database embeddings in order to enable bitwise operations, as will be described in step ③. We refer to this step as *Input Broadcasting* (IBC). After IBC each CL holds  $N$  duplicates, where  $N = \text{Page\_Size} / \text{Embedding\_Size}$ . In step ②, the storage system issues a page read command to each plane, loading a page of database embeddings to the Sensing Latch (SL). By performing an XOR operation between the CL (which stores the query embedding) and the SL (which now stores the database embeddings), and storing the result in the Data Latch (DL) ③, each plane calculates the bitwise difference of the query and the database embeddings. Next, in step ④, we employ the fail-bit counter [48, 52, 203] within the peripheral logic to measure the number of logical ones in the DL, which corresponds to the distance between the query and the database embeddings.
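Steps ①–④ can be modeled in software for a single plane as shown below. This is a functional sketch only: NumPy arrays stand in for the latches, the fail-bit counter is emulated with a per-embedding population count, and the page and embedding sizes (16KB and 128 bytes) follow the example used later in this section. The function names are illustrative.

```python
import numpy as np

def input_broadcast(query: np.ndarray, page_bytes: int) -> np.ndarray:
    """Step 1 (IBC): fill the cache latch with N aligned copies of the query."""
    n = page_bytes // query.size        # N = Page_Size / Embedding_Size
    return np.tile(query, n)

def plane_distances(cache_latch: np.ndarray, sensing_latch: np.ndarray,
                    emb_bytes: int) -> np.ndarray:
    """Steps 3-4: XOR the latches, then count the ones per embedding."""
    data_latch = np.bitwise_xor(cache_latch, sensing_latch)   # bitwise difference
    per_embedding = data_latch.reshape(-1, emb_bytes)
    return np.unpackbits(per_embedding, axis=1).sum(axis=1)   # Hamming distances

PAGE_BYTES, EMB_BYTES = 16 * 1024, 128
query = np.zeros(EMB_BYTES, dtype=np.uint8)
page = np.zeros(PAGE_BYTES, dtype=np.uint8)   # step 2: page read into the sensing latch
page[EMB_BYTES] = 0b00000111                  # second embedding differs in three bits
cl = input_broadcast(query, PAGE_BYTES)
dist = plane_distances(cl, page, EMB_BYTES)
```

Because the query copies are aligned with the stored embeddings, a single page-wide XOR yields the bitwise difference for every embedding in the page at once.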

The data that is transferred out from the flash dies to the SSD Controller’s DRAM changes depending on whether steps ①–④ are executed during the coarse-grained or the fine-grained search. For coarse-grained search, the ANNS engine transfers (i) the embedding vector (EMB), (ii) its calculated distance (DIST), and (iii) the tag of the cluster that this embedding belongs to, forming a single entry. For fine-grained search, instead of transferring the tag of the cluster, the ANNS engine transfers (iv) the addresses of the INT8 version of the embeddings (RADR), and (v) the correlated document chunk address (DADR). Steps ②–④ are repeated until the whole database is searched. The SSD controller retrieves distance numbers from the TTL and performs quickselect [103] using the embedded core ⑥, selecting the  $10k$  embeddings closest to the query. In step ⑦, the embedded core of the SSD controller executes the reranking kernel [135, 239, 312]. Reranking performs a *costlier* but more accurate search on the subset of data elements that are selected by ANNS. Rerankers usually (i) employ cross-encoder models that accurately calculate the similarity between queries and document chunks [40, 216], or (ii) recalculate distances with higher precision (e.g., INT8) [260]. REIS uses the second approach: ANNS is performed using Binary Quantization, while reranking is performed using INT8 embeddings. For reranking, the embedded core first fetches the top- $10k$  embeddings from the INT8 embedding region using the RADR. It then recalculates the distances in INT8 precision and sorts them using quicksort [102] ⑧ to finally select the top- $k$  embeddings, which ends the search process. Once the ANNS search is completed, the ANNS engine executes *document identification* to find relevant document chunks according to the DADR of the top- $k$  results and transfers them to the host system ⑨ for generation.
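The reranking step can be illustrated with the sketch below, which recomputes distances for the candidates selected by the binary search and keeps the top-$k$. The squared Euclidean metric and the function signature are assumptions for the example; the text only states that distances are recalculated in INT8 precision.

```python
import numpy as np

def rerank_int8(query_i8: np.ndarray, cand_i8: np.ndarray,
                cand_ids: np.ndarray, k: int):
    """Re-score candidates with INT8 embeddings and return the k nearest, ordered.

    Assumption: squared Euclidean distance; differences are widened to int32
    so that int8 arithmetic cannot overflow.
    """
    diff = cand_i8.astype(np.int32) - query_i8.astype(np.int32)
    dist = (diff * diff).sum(axis=1)
    order = np.argsort(dist, kind="stable")[:k]   # sort, then keep the top-k
    return cand_ids[order], dist[order]

query = np.array([0, 0], dtype=np.int8)
cands = np.array([[10, 0], [1, 1], [-3, 4]], dtype=np.int8)   # fetched via RADR
ids = np.array([7, 8, 9])                                     # candidate identifiers
top_ids, top_dist = rerank_int8(query, cands, ids, k=2)
```

Since only the roughly $10k$ shortlisted candidates are re-scored, the higher-precision pass touches a small fraction of the database.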

**Exploiting SSD Parallelism.** As described in Sec. 4.3.2, REIS uses the buffers and the peripheral logic within the planes and the dies of the storage system in order to perform distance computations. This approach allows multiple simultaneous XOR and bit-counting operations across planes and dies, exploiting the available parallelism within the storage system. Once these computations have been performed, the flash channels of the storage system collectively provide massive internal bandwidth (e.g., 9.6 GB/s bandwidth for an 8-channel system with 1.2 GB/s bandwidth per channel [47]), which can efficiently transfer entries from the flash dies to the SSD controller’s DRAM by leveraging the channel-level parallelism.

**Fine-grained Embedding Access.** To ensure fine-grained access to each embedding, REIS introduces *Mini-Pages* for addressing. REIS composes a *Mini-Page* address by appending an offset to the original physical page address, filling each page with as many embeddings as possible (e.g., 128 binary 1024-dimension embeddings per 16KB page, leading to a 7-bit offset for the *Mini-Page* address). During execution of the ANNS engine, REIS performs retrieval using the *Mini-Page* address as the embedding address (EADR) for each entry in the TTL.
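A minimal sketch of *Mini-Page* addressing follows. Placing the offset in the low-order bits of the address is an assumption about what "appending" means here, and the helper names are illustrative.

```python
def offset_bits(page_bytes: int, embedding_bytes: int) -> int:
    """Bits needed to index every embedding slot within one page."""
    slots = page_bytes // embedding_bytes
    return (slots - 1).bit_length()

def make_eadr(page_addr: int, slot: int, bits: int) -> int:
    """Compose a Mini-Page address: physical page address followed by the slot offset."""
    return (page_addr << bits) | slot

def split_eadr(eadr: int, bits: int):
    """Recover (physical page address, slot offset) from a Mini-Page address."""
    return eadr >> bits, eadr & ((1 << bits) - 1)

# 1024-dimension binary embeddings occupy 128 bytes, so a 16KB page holds 128 of them.
BITS = offset_bits(16 * 1024, 1024 // 8)
eadr = make_eadr(page_addr=5, slot=3, bits=BITS)
```

For the example in the text, `BITS` evaluates to 7, matching the 7-bit offset quoted above.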

**4.3.3 Distance Filtering.** We experimentally find that, for each query, a significant fraction of document chunks within the database are irrelevant (i.e., the distance between their embeddings and the query embedding is very large). For example, various retrieval tasks, such as fact-checking [276], retrieve only 1.2–3.0 relevant document chunks per query on average from the BEIR datasets [274]. To avoid forwarding irrelevant data to the SSD controller, we employ distance filtering, which discards database embeddings when their distance from the query embedding exceeds a certain threshold. By discarding highly irrelevant embeddings, distance filtering (i) conserves SSD channel bandwidth, and (ii) reduces the number of entries that the SSD controller has to select and sort.

We introduce a modification to step ④ with which we apply distance filtering to the ANNS kernel. To determine suitable thresholds, we perform filtering experiments on 4 BEIR [59] datasets targeting different retrieval tasks: *HotpotQA* [320], *NQ* [154], *FEVER* [276], and *Quora* [64]. We make two observations. First, for *HotpotQA* we can filter out 99% of the documents and still retrieve the  $k=10$  most relevant ones for each query. Second, the choice of filtering threshold only weakly depends on the dataset size. For  $k = 10$ , the threshold would only be 1.6% higher for the biggest dataset, *FEVER*, compared to the smallest, *Quora*. We conclude that (i) distance filtering significantly reduces the number of candidate embeddings and thus computation, and (ii) it is possible to employ one filtering threshold for effectively filtering datasets with different sizes.

We implement distance filtering using the comparator logic within the flash dies (i.e., the pass/fail checker) [48, 52, 203], which compares distance numbers with a pre-defined threshold. Each embedding whose distance (DIST) value is below the threshold is transferred to the SSD’s DRAM for further processing.
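The effect of the threshold comparison can be sketched as follows. The in-die comparator is emulated with a NumPy mask, and the distances, addresses, and threshold value are made-up illustration data rather than measured values.

```python
import numpy as np

def distance_filter(dist: np.ndarray, eadr: np.ndarray, threshold: int):
    """Keep only entries whose distance is below the threshold (as in the text)."""
    keep = dist < threshold
    return dist[keep], eadr[keep]

dist = np.array([10, 300, 50, 400])      # illustrative Hamming distances for one page
eadr = np.array([640, 641, 642, 643])    # illustrative Mini-Page addresses
kept_dist, kept_eadr = distance_filter(dist, eadr, threshold=100)
forwarded_fraction = kept_dist.size / dist.size
```

Only the surviving entries would cross the flash channel, so both the channel traffic and the input to the selection kernel shrink by the filtered fraction.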

**4.3.4 Pipelining.** To further accelerate RAG retrieval, REIS exploits three pipelining opportunities within the storage system. First, REIS leverages the widely implemented *Read Page Cache Sequential* mode [203], inside the flash chips, to overlap operations between two iterations of steps ②–④. Specifically, during step ④, after the PL transfers its data to the DL for readout, it can immediately read the next page. Second, REIS overlaps distance calculation on the NAND Flash dies with kernel execution on the embedded cores. According to our evaluation, a single core can efficiently run Quicksort and reranking without stalling the pipeline. Therefore, REIS only uses one core for Quicksort and reranking, while the other cores (e.g., 3 out of 4 [249, 251]) are still available for regular SSD operations. Third, during IBC (see Sec. 4.3.2), REIS enables all planes per die to receive the input query from the die I/O simultaneously, an optimization that we name Multi-Plane IBC (MPIBC). This reduces the IBC latency by a factor equivalent to the number of planes per die. We assume the plane selection is handled by a dedicated Multiplexer logic within the die periphery. Therefore, enabling MPIBC requires raising the select signal for all planes together so that they can receive the input query embedding concurrently.

## 4.4 System Integration of REIS

To enable communication with the host, REIS introduces an Application Programming Interface (API) that defines RAG-specific extensions to the NVM command set [218]. Similarly, to support the operations described in Sec. 4.3, REIS extends the NAND flash command set with commands that enable communication between the controller and the flash dies.

**4.4.1 Application Programming Interface.** REIS specifies a high-level API for the host system to perform the indexing and the retrieval stage of the RAG workflow. To achieve this, we extend the NVM command set [218] with custom REIS operations. The specification provides a range (80h–FFh) in the opcode values for vendor-specific commands, which is adequate for implementing all REIS operations. To perform indexing, the host system issues *DB\_Deploy()* (or *IVF\_Deploy()*) to the SSD. REIS reserves the required space in the NAND Flash memory according to the API and performs defragmentation operations to create a physically contiguous region. It then waits for the host to write the database content to the DRAM, which it subsequently writes to storage as explained in Sec. 4.1. When REIS receives *Search()* (or *IVF\_Search()*) from the host system, it performs retrieval and returns a *done* signal once it has identified the document chunks to be retrieved. Once the host system acknowledges the signal, the storage system starts to transfer the identified document chunks to the host system. Table 1 describes each API command.
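To illustrate how the four API commands could fit in the vendor-specific opcode range, the sketch below assigns each one an opcode and checks that it lies within 80h–FFh. The specific opcode values are hypothetical placeholders chosen for the example; the text does not specify them.

```python
from enum import IntEnum

VENDOR_OPCODE_MIN = 0x80   # vendor-specific range stated in the text (80h-FFh)
VENDOR_OPCODE_MAX = 0xFF

class ReisOpcode(IntEnum):
    """Hypothetical opcode assignments for the REIS API commands."""
    DB_DEPLOY = 0x80
    IVF_DEPLOY = 0x81
    SEARCH = 0x82
    IVF_SEARCH = 0x83

def is_vendor_specific(opcode: int) -> bool:
    """True if the opcode falls in the vendor-specific range."""
    return VENDOR_OPCODE_MIN <= int(opcode) <= VENDOR_OPCODE_MAX

all_valid = all(is_vendor_specific(op) for op in ReisOpcode)
```

Four commands occupy a small part of the 128 available opcodes, which is consistent with the observation that the range is adequate for all REIS operations.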

**4.4.2 NAND Flash Command Set.** REIS adds new commands to the NAND flash die control logic to support the operations of the in-storage ANNS engine for retrieval tasks. To enable this, the controller first receives the previously described API commands and translates them into the flash command set. It then issues the flash commands to the flash dies to perform the necessary operations. The control logic within each flash die is a finite-state machine, which receives the commands and uses them to control the peripheral logic in the flash array. Table 2 describes the NAND flash command set extensions for querying the database.

**Table 1: REIS Application Programming Interfaces**

| API Commands                                               | Description                                                                                                                                  |
|------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------|
| <i>DB_Deploy(DB, D<sub>id</sub>, N)</i>                    | Write the $N$-entry database $DB$, with ID $D_{id}$, to storage.                                                                             |
| <i>IVF_Deploy(DB, D<sub>id</sub>, N, CI)</i>               | Write the $N$-entry IVF-based database $DB$, with ID $D_{id}$, to storage. $CI$ contains information on the IVF clusters.                    |
| <i>Search(Q, Q<sub>id</sub>, D<sub>id</sub>, k)</i>        | Perform a top-$k$ search for a batch of queries $Q$, indexed by $Q_{id}$, in the database with ID $D_{id}$.                                  |
| <i>IVF_Search(Q, Q<sub>id</sub>, D<sub>id</sub>, k, R)</i> | Perform a top-$k$ IVF search for a batch of queries $Q$, indexed by $Q_{id}$, in the database with ID $D_{id}$. The target accuracy is $R$. |

**Table 2: NAND Flash Command Set Extensions**

| ISA Format    | Description                                                                                                |
|---------------|------------------------------------------------------------------------------------------------------------|
| IBC Q_EMB     | Send a copy of the query (Q_EMB) to each page buffer of the NAND Flash memory. (<i>Input Broadcasting</i>) |
| XOR ADR_P     | Perform the XOR operation between PL and CL of a plane (addressed by ADR_P).                               |
| GEN_DIST EADR | Compute the distance for a database embedding stored at address EADR.                                      |
| RD_TTL EADR   | Transfer the TTL entry for the embedding stored at EADR to the SSD DRAM.                                   |

## 5 Methodology

**Evaluated System Configurations.** We evaluate REIS on two SSD configurations, **REIS-SSD1** and **REIS-SSD2**, based on two commercial SSD products, Samsung PM9A3 [250] and Micron 9400 [207]. These SSDs focus on low cost and high performance, respectively. As a baseline for document retrieval, we use a high-end server equipped with an AMD EPYC 9554 CPU [9] and a Samsung PM9A3 SSD [250]. Table 3 provides the properties of our SSDs and the baseline CPU system (**CPU-Real**). To highlight the improvements stemming from our database layout and In-Storage Processing, we first compare REIS and CPU-Real using brute force search (BF). We then compare REIS and CPU-Real on Approximate Nearest Neighbor Search. Since (i) the loading time makes up the biggest fraction of the execution time (see Sec. 3.2), and (ii) HNSW indexes take up significantly more space than IVF ones, IVF outperforms HNSW when loading time is taken into account. We evaluate both REIS and CPU-Real with the IVF algorithm using BQ and reranking, provided by the FAISS library [72], sweeping the accuracy of IVF from 0.98 down to 0.9 *Recall@10*. In order to perform a sensitivity study, we introduce **No-OPT** as a baseline, a REIS configuration that uses the In-Storage ANNS Engine without DF, PL, and MPIBC. To quantify the performance overheads stemming from ANNS only, we introduce an additional comparison point based on the CPU baseline, which incurs zero overheads from data movement due to storage I/O, called **No-I/O**. We additionally compare REIS to two state-of-the-art designs, NDSearch [299] and ICE [106], which use graph-based and cluster-based ANNS, respectively. To ensure a fair comparison we make the appropriate modifications to our experimental methodology whenever required.

**Performance & Energy Evaluation.** Our SSD operation model and parameters are based on Flash-Cosmos [224] while the internal SSD DRAM is modeled using CACTI7 [18]. We use Zsim [252] and Ramulator [57, 150] to simulate the embedded SSD controller cores. We model SSD power consumption based on a commodity product [249] and real chip characterization results from Flash-Cosmos [224]. The power of the SSD’s internal DRAM and that of the embedded cores are also derived from CACTI7 [18] and the characteristics of a commodity embedded SSD controller processor [13], respectively. We measure the power of CPU-Real using AMD $\mu$Prof [10] for the CPU and a DDR4 model [86, 208] for DRAM.

**Evaluated Datasets.** We evaluate four datasets: *NQ* and *HotpotQA*, both from an information retrieval benchmark [274], and a public dataset based on Wikipedia [61] (*wiki\_full*) along with its English subset (*wiki\_en*). For the comparison to NDSearch [299], we use two billion-scale datasets that were used to evaluate NDSearch, *SIFT1B* and *DEEP1B* [265].

**Table 3: Evaluated System Configurations**

| System    | Configuration                                                                                                                                      |
|-----------|----------------------------------------------------------------------------------------------------------------------------------------------------|
| CPU-Real  | CPU: 2 sockets, 128 cores, 3.1GHz [9]; DRAM: 1.5TB DDR4 [208]; SSD: PM9A3 [250]                                                                    |
| REIS-SSD1 | 8 channels; 16 512Gb dies/channel; 2 planes; 1.2 GB/s channel bandwidth; 22.5 $\mu$s tR (ESP-SLC) [224]; Embedded Cores: Cortex R8 [13]; 4 cores   |
| REIS-SSD2 | 16 channels; 8 512Gb dies/channel; 4 planes; 2.0 GB/s channel bandwidth; 22.5 $\mu$s tR (ESP-SLC) [224]; Embedded Cores: Cortex R8 [13]; 4 cores   |
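The aggregate resources implied by Table 3 follow from simple arithmetic. The sketch below derives the total internal channel bandwidth and the die and plane counts of both configurations. It assumes that the plane count in Table 3 is per die and that internal bandwidth is the product of channel count and per-channel bandwidth; the `SsdConfig` class is introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SsdConfig:
    channels: int
    dies_per_channel: int
    planes_per_die: int      # assumption: Table 3 lists planes per die
    channel_bw_gbs: float    # GB/s per channel

    @property
    def internal_bw_gbs(self):
        """Aggregate channel bandwidth in GB/s (channels x per-channel bandwidth)."""
        return self.channels * self.channel_bw_gbs

    @property
    def total_dies(self):
        return self.channels * self.dies_per_channel

    @property
    def total_planes(self):
        return self.total_dies * self.planes_per_die


REIS_SSD1 = SsdConfig(channels=8, dies_per_channel=16, planes_per_die=2, channel_bw_gbs=1.2)
REIS_SSD2 = SsdConfig(channels=16, dies_per_channel=8, planes_per_die=4, channel_bw_gbs=2.0)

if __name__ == "__main__":
    for name, cfg in (("REIS-SSD1", REIS_SSD1), ("REIS-SSD2", REIS_SSD2)):
        print(f"{name}: {cfg.internal_bw_gbs:.1f} GB/s, "
              f"{cfg.total_dies} dies, {cfg.total_planes} planes")
```

Under these assumptions, both configurations have 128 dies, while REIS-SSD2 has twice the planes (512 vs. 256) and roughly 3.3× the aggregate channel bandwidth (32 GB/s vs. 9.6 GB/s) of REIS-SSD1.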

## 6 Evaluation

We evaluate the effectiveness of REIS compared to different baselines. First, we evaluate the effectiveness of REIS at improving the performance and energy efficiency of the retrieval stage of the RAG pipeline. Second, we evaluate the effect of REIS on the performance of the end-to-end RAG pipeline. Third, we conduct a sensitivity study to analyze the effect of different optimization techniques in REIS. Fourth, we compare REIS to two prior works [106, 299] that use cluster- and graph-based ANNS algorithms, respectively.

### 6.1 Retrieval Performance & Energy Efficiency

**Performance.** Figure 7 shows the performance of REIS, measured in Queries-per-Second (QPS) and normalized to CPU-Real. We make three observations. First, REIS-SSD1 and REIS-SSD2 improve performance over CPU-Real by an average of $13\times$ with a maximum of $112\times$, demonstrating the benefit of alleviating the I/O bottleneck of the RAG retrieval process. Second, REIS-SSD1 and REIS-SSD2 outperform No-I/O by an average of $1.8\times$ with a maximum of $5.3\times$ due to the massive internal parallelism of storage systems that REIS exploits. Third, REIS-SSD2 provides a $2.6\times$ average speedup over REIS-SSD1, with a maximum of $3.2\times$, reflecting the benefits of higher channel counts ($2\times$) and channel bandwidth ($1.7\times$).

Figure 7: Performance (QPS) normalized to CPU-Real

Figure 8: Energy efficiency (QPS/W) normalized to CPU-Real

**Energy Efficiency.** Figure 8 presents the energy efficiency (QPS/W) of REIS normalized to CPU-Real. We make two observations. First, REIS-SSD1 and REIS-SSD2 improve energy efficiency over CPU-Real by 55× on average and up to 157×. This improvement in energy efficiency fundamentally stems from the 29.7× lower power consumption of SSDs compared to the CPU baseline on average. Second, REIS-SSD2 provides 2.2× higher energy efficiency over REIS-SSD1 on average, with a maximum of 2.6×. This improvement in energy efficiency is similar to REIS-SSD2’s performance improvement over REIS-SSD1, suggesting that most of the energy efficiency gains stem from the higher throughput of SSD2’s design.

### 6.2 End-to-End Performance Analysis

Table 4 breaks down the latency of different stages of the RAG pipeline on REIS-SSD1 and on a CPU-based system using binary quantization (i.e., the same system as in Fig. 3). Similarly to our analysis in Fig. 3, we use the *HotpotQA* and *NQ* datasets. Since REIS performs retrieval within the storage system, it does not perform the *Dataset Loading* step that transfers data to the host’s DRAM. We observe that REIS reduces the combined latency of *Dataset Loading* and *Search* from 20.3%–69.3% down to 0.02%–0.15%, which demonstrates that REIS effectively eliminates the data movement bottleneck of RAG retrieval. When using REIS, *Generation* accounts for 92% of the total time, which demonstrates that LLM inference becomes the new bottleneck. Overall, REIS reduces the average end-to-end latency by 1.25× and 3.24× on *HotpotQA* and *NQ*, respectively.

### 6.3 Sensitivity Study

Fig. 9 presents a sensitivity study of all optimizations introduced by REIS, i.e., Distance Filtering (DF), Pipelining (PL), and Multi-Plane Input Broadcasting (MPIBC), on top of No-OPT. We choose *wiki\_full* [61] as the dataset to analyze and normalize results (i.e., QPS) to the performance of CPU-Real. We make three observations. First, among all proposed optimizations, DF contributes the most, providing an average speedup over No-OPT of 4.7× and 5.7× (with a maximum of 5.1× and 6.5×) for REIS-SSD1 and REIS-SSD2, respectively. The main source of this speedup is that filtering out embeddings with large distances inside each NAND flash die significantly reduces (i) unnecessary data movement to the SSD controller’s DRAM, and (ii) the amount of data input to the Quickselect kernel. Second, the benefit from PL increases for SSDs with higher internal bandwidth due to more channels and higher I/O rate. Specifically, in SSDs with high internal bandwidth (e.g., the 32 GB/s of bandwidth for REIS-SSD2), pipelining can completely overlap (i) reading a new page, and (ii) transferring out the filtered TTL entries from the NAND flash dies to the SSD’s internal DRAM. Third, the benefit from MPIBC increases for SSDs with more planes per die. Specifically, the average speedup of DF+PL+MPIBC over DF+PL is 6% and 26% for REIS-SSD1 and REIS-SSD2, respectively.
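The interplay between Distance Filtering and the Quickselect kernel can be sketched in a few lines. The example below is a software illustration under stated assumptions: candidates are `(distance, id)` pairs, the filter uses a fixed distance `threshold` (how REIS actually chooses its filtering criterion is not modeled), and the function names are hypothetical. Filtering shrinks the candidate list before selection, which is the effect the paragraph above attributes to DF.

```python
import random


def distance_filter(entries, threshold):
    """Keep only (distance, id) pairs whose distance does not exceed the threshold."""
    return [e for e in entries if e[0] <= threshold]


def quickselect_top_k(entries, k, seed=0):
    """Return the k entries with the smallest distances (in no particular order)."""
    items = list(entries)
    if k <= 0:
        return []
    rng = random.Random(seed)
    result = []
    while items:
        if k >= len(items):
            # Everything that remains belongs to the top-k set.
            return result + items
        pivot = rng.choice(items)[0]
        smaller = [e for e in items if e[0] < pivot]
        equal = [e for e in items if e[0] == pivot]
        larger = [e for e in items if e[0] > pivot]
        if k <= len(smaller):
            items = smaller                      # answer lies entirely in `smaller`
        elif k <= len(smaller) + len(equal):
            return result + smaller + equal[: k - len(smaller)]
        else:
            result += smaller + equal            # all of these are in the top-k
            k -= len(smaller) + len(equal)
            items = larger
    return result


if __name__ == "__main__":
    candidates = [(5, "a"), (1, "b"), (9, "c"), (3, "d"), (7, "e")]
    survivors = distance_filter(candidates, threshold=7)
    print(sorted(quickselect_top_k(survivors, k=2)))
```

Only the entries that survive the filter are passed to the selection step, so a tighter filter reduces both the data moved out of the flash dies and the work done by Quickselect, provided it does not discard true top-$k$ results.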



Figure 9: Effects of different REIS optimizations on throughput (normalized to CPU-Real), evaluated on the *wiki\_full* dataset [61].

**Table 4: RAG Latency Breakdown for REIS and the CPU-based system with Binary Quantization of Fig. 3.**

| Latency contribution (%)        | HotpotQA: REIS | HotpotQA: CPU+BQ | NQ: REIS | NQ: CPU+BQ |
|---------------------------------|----------------|------------------|----------|------------|
| Embedding Model Loading         | 3.26     | 2.61   | 3.26 | 1.01   |
| Encoding                        | 0.58     | 0.46   | 0.58 | 0.18   |
| Dataset Loading                 | N/A      | 20.0   | N/A  | 67.3   |
| Search (and retrieval for REIS) | 0.02     | 0.29   | 0.15 | 2.00   |
| Generation Model Loading        | 4.16     | 3.32   | 4.16 | 1.28   |
| Generation                      | 92.0     | 73.0   | 92.0 | 28.0   |
| End-to-End Latency (s)          | 18.97    | 23.79  | 19.0 | 61.69  |
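The end-to-end figures quoted in Sec. 6.2 can be re-derived from the entries of Table 4. The short sketch below recomputes the end-to-end speedup and the combined *Dataset Loading* and *Search* share of the CPU+BQ system. The dictionary layout and function names are choices made for this example. Because Table 4 rounds its entries, recomputed values may differ from the reported ones in the last digit.

```python
# Values copied from Table 4 (percentages of end-to-end latency, and seconds).
TABLE4 = {
    "HotpotQA": {
        "REIS": {"search_pct": 0.02, "e2e_s": 18.97},
        "CPU+BQ": {"loading_pct": 20.0, "search_pct": 0.29, "e2e_s": 23.79},
    },
    "NQ": {
        "REIS": {"search_pct": 0.15, "e2e_s": 19.0},
        "CPU+BQ": {"loading_pct": 67.3, "search_pct": 2.00, "e2e_s": 61.69},
    },
}


def end_to_end_speedup(dataset):
    """Ratio of CPU+BQ to REIS end-to-end latency for one dataset."""
    row = TABLE4[dataset]
    return row["CPU+BQ"]["e2e_s"] / row["REIS"]["e2e_s"]


def cpu_retrieval_share(dataset):
    """Combined Dataset Loading + Search share (%) on the CPU+BQ system."""
    cpu = TABLE4[dataset]["CPU+BQ"]
    return cpu["loading_pct"] + cpu["search_pct"]


if __name__ == "__main__":
    for name in TABLE4:
        print(f"{name}: speedup {end_to_end_speedup(name):.2f}x, "
              f"CPU loading+search {cpu_retrieval_share(name):.1f}%")
```

The recomputed retrieval shares (about 20.3% and 69.3%) match the range given in the text, and the speedups agree with the reported 1.25× and 3.24× up to rounding of the tabulated latencies.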



Figure 10: Speedup of REIS over ICE [106].

**6.3.1 Comparison with REIS-ASIC.** To quantify the performance loss due to not using ESP (thus requiring ECC, which incurs data transfers to the SSD controller), we compare REIS against a new scheme, REIS-ASIC, which (i) uses ECC performed by the SSD controller instead of ESP, (ii) performs all other operations using an ideal ASIC with no computational overhead, but (iii) requires that all data be transferred to the controller. Compared to REIS, REIS-ASIC experiences a slowdown of $4.1\times$–$5.0\times$ ($3.9\times$–$6.5\times$) for SSD-1 (SSD-2) across all recall values and datasets, due to the data movement overheads of the transfers to the controller that are required when ESP is not used.

### 6.4 Comparison to Prior Works

We compare the performance of REIS to two state-of-the-art ISP-based ANNS accelerators, ICE [106] and NDSearch [299], which use cluster- and graph-based algorithms, respectively.

**Comparison to ICE.** Fig. 10 shows the speedup of REIS compared to ICE [106], a state-of-the-art ISP scheme for vector similarity search. When using brute force (BF), REIS achieves a speedup greater than $10\times$ across all configurations. For IVF, the speedup over ICE increases with higher recall values. Specifically, across all datasets with SSD-2, REIS outperforms ICE by an average of $7.1\times$ ($22.9\times$) at 0.90 (0.98) Recall@10. We also perform a comparison to ICE-ESP, an idealistic implementation of ICE that does *not* require ECC, but still uses 4-bit quantization (not shown in Fig. 10). Even compared to ICE-ESP, REIS achieves a geometric mean speedup of $3.85\times$ ($3.92\times$) in BF for SSD-1 (SSD-2). When configured to target 0.9 Recall@10 using IVF, REIS achieves $2.08\times$ ($2.29\times$) higher performance than ICE-ESP for SSD-1 (SSD-2), a number that rises to $2.84\times$ ($3.18\times$) for 0.98 Recall@10.

**Comparison to NDSearch.** Fig. 11 compares the performance of REIS using IVF [72] against NDSearch using HNSW [195] and DiskANN [115]. We perform this comparison using two billion-scale datasets, *SIFT1B* and *DEEP1B* [265], with 0.94 and 0.93 Recall@10, respectively. We normalize the throughput of REIS to that of NDSearch with HNSW and DiskANN and observe that REIS outperforms NDSearch by an average of $1.7\times$ with a maximum of $2.6\times$.



Figure 11: Performance comparison of REIS and NDSearch.

## 7 Discussion

In this section, we discuss potential extensions and optimizations to REIS. First, we discuss augmenting REIS with filtered search on user-defined metadata. Second, we address the impact of REIS on typical SSD management operations and lifetime. Third, we provide alternative implementations for REIS’s embedding-document linkage that relax the logical-to-physical contiguity requirements.

### 7.1 Metadata Filtering

To improve generation quality, modern LLM serving frameworks [37, 182] incorporate *metadata filtering* [229, 290, 301] into RAG retrieval. Metadata filtering augments database entries with information such as timestamps, author information, or other relevant metadata that can be used during the search process to improve document retrieval. REIS could potentially be enhanced with this feature by storing the metadata of each embedding in reserved NAND flash memory (i.e., in the OOB region [319]).

To perform metadata filtering in a read-only database [124, 229], this enhanced version of REIS (i) assigns a corresponding metadata tag (an integer number) to each embedding and (ii) places the tag in the OOB area during database deployment. During RAG retrieval, REIS receives the query embedding alongside a metadata tag and compares this tag to the tag of each database embedding, using the existing approach for calculating the embedding distance. Before performing the subsequent retrieval steps, REIS checks the result of the metadata computation, filtering out results that do not match. For continuously updated databases providing real-time knowledge retrieval [43, 44, 80, 134], REIS (i) periodically creates new databases to store new information at a predefined frequency (e.g., every hour), (ii) treats each sub-database as a normal database tagged with an individual timestamp, and (iii) maintains an entry for each database in the internal DRAM, including the database address and the timestamp. When the host sends a query with a requested time, REIS identifies the corresponding databases to be searched by first comparing the requested time with the timestamps stored in the internal DRAM and then performs search and retrieval operations within the identified databases.
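Both filtering modes described above can be illustrated with a small functional model. In this sketch, tag comparison reuses an XOR (a zero result means the tags match), mirroring the idea of reusing the distance computation. Sub-database selection assumes that each sub-database covers a fixed period starting at its timestamp. The `SubDatabase` record, the `period` parameter, and the matching rule are assumptions of this example rather than details given in the text.

```python
from dataclasses import dataclass


def tag_matches(query_tag, entry_tag):
    """Tags match when their XOR is zero, i.e., no bit differs."""
    return (query_tag ^ entry_tag) == 0


def filter_by_tag(candidates, query_tag):
    """Drop candidates whose tag differs from the query tag.

    `candidates` holds (embedding_id, distance, tag) tuples.
    """
    return [(eid, dist) for eid, dist, tag in candidates if tag_matches(query_tag, tag)]


@dataclass(frozen=True)
class SubDatabase:
    """Hypothetical DRAM directory entry: database address and creation timestamp."""
    address: int
    timestamp: int  # seconds; start of the period this sub-database covers


def select_databases(directory, requested_time, period):
    """Pick sub-databases whose period [timestamp, timestamp + period) contains the time."""
    return [db for db in directory
            if db.timestamp <= requested_time < db.timestamp + period]


if __name__ == "__main__":
    candidates = [(0, 12, 3), (1, 5, 7), (2, 8, 3)]
    print(filter_by_tag(candidates, query_tag=3))
    directory = [SubDatabase(0x1000, 0), SubDatabase(0x2000, 3600), SubDatabase(0x3000, 7200)]
    print(select_databases(directory, requested_time=4000, period=3600))
```

With an hourly `period` of 3600 seconds, a request for time 4000 selects only the sub-database created at timestamp 3600, and the search then proceeds inside that database as usual.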

### 7.2 Implications on the Storage System

**Typical SSD operations.** While REIS is primarily designed to accelerate RAG, it also serves as a conventional storage system. As such, the SSD controller must handle routine maintenance tasks, such as data refresh and garbage collection [132, 313, 317]. To ensure uninterrupted execution of maintenance operations, we (i) confine REIS to only one of the embedded cores of the SSD and (ii) prioritize maintenance tasks over RAG operations when all cores are needed for maintenance. Since REIS primarily targets read-intensive RAG workloads, write operations are expected to be infrequent, making full core utilization a rare occurrence. To simplify the design, REIS operates exclusively in either RAG mode or normal SSD mode at any given time. To switch between the two modes, REIS must load the corresponding FTL data (coarse-grained for RAG (see Sec. 4.1.4), fine-grained for normal operations). Since REIS exclusively operates in one of the two modes, the performance of normal read/write operations from the host remains unaffected.

**Impact on SSD Lifetime.** Although REIS disables ECC in the SLC partition to support in-die logic operations, this does not reduce SSD lifetime for two reasons. First, using SLC mode instead of MLC inherently increases the distance between threshold voltages, enhancing flash memory cell reliability. Second, REIS employs ESP for the SLC partition, which achieves a bit error rate (BER) of zero even in a worst-case scenario (i.e., 1-year retention time and 10k Program/Erase cycles) [224].

**Contiguity Requirements.** Coarse-grained access (i.e., the lightweight L2P mapping scheme of Sec. 4.1.4) requires the existence of contiguous unallocated physical space. In order to further reduce (i) the memory footprint, and (ii) translation overheads stemming from L2P metadata, REIS also uses the same contiguity-based approach in the document region of the database. An alternative approach, which does not require contiguity in the document region, would be to link embeddings to the physical addresses of their corresponding document chunks via the OOB area, enabling document chunks to be placed anywhere in storage. However, this approach introduces additional complexity, as it entails updating the physical address in the OOB region whenever the documents are remapped to another region of the SSD (e.g., during updates).
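The trade-off between the two linkage options can be made concrete with a small model. The sketch below contrasts (a) a contiguity-based lookup, where the location of a document chunk is computed from a base address and an index, with (b) an explicit per-embedding link that must be updated on every remap. It assumes a region described only by a base page and a fixed number of chunks per page. These parameters, and the `OobLinks` class that stands in for links stored in the OOB area, are hypothetical and do not reproduce the actual scheme of Sec. 4.1.4.

```python
def contiguous_address(base_page, chunks_per_page, chunk_index):
    """Contiguity-based lookup: (page, slot) computed arithmetically, no per-chunk metadata."""
    if chunk_index < 0:
        raise ValueError("chunk_index must be non-negative")
    return base_page + chunk_index // chunks_per_page, chunk_index % chunks_per_page


class OobLinks:
    """Explicit embedding-to-chunk links (stand-in for addresses kept in the OOB area)."""

    def __init__(self):
        self._links = {}
        self.updates = 0  # counts link writes, including those caused by remapping

    def link(self, embedding_id, page, slot):
        self._links[embedding_id] = (page, slot)
        self.updates += 1

    def remap(self, embedding_id, new_page, new_slot):
        """Must be called whenever the chunk moves; this is the added complexity."""
        if embedding_id not in self._links:
            raise KeyError(embedding_id)
        self.link(embedding_id, new_page, new_slot)

    def resolve(self, embedding_id):
        return self._links[embedding_id]


if __name__ == "__main__":
    print(contiguous_address(base_page=1000, chunks_per_page=4, chunk_index=10))
    links = OobLinks()
    links.link(10, page=5000, slot=1)
    links.remap(10, new_page=7000, new_slot=0)
    print(links.resolve(10), links.updates)
```

The contiguous variant needs no per-chunk state but requires the region to stay physically contiguous, whereas the linked variant allows arbitrary placement at the cost of one link update per relocated chunk.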

## 8 Related Work

To our knowledge, REIS is the first system based on In-Storage Processing (ISP) that accelerates the retrieval stage of Retrieval-Augmented Generation (RAG). We have already qualitatively and quantitatively compared REIS to two existing state-of-the-art ISP-based ANNS accelerators [106, 299] in Section 6.4. In this section, we discuss works that improve RAG from other perspectives and relevant works for Nearest Neighbor Search Acceleration.

### 8.1 RAG Enhancements

Prior work has proposed various optimizations to the RAG pipeline. RQ-RAG [34], a representative prompt engineering [34, 83, 140, 294] method, decomposes complex queries and disambiguates queries with more than one possible interpretation. Small-to-Big Retrieval [318], an improved document chunking strategy [157, 235, 275], uses small document chunks for the retrieval search and returns bigger chunks covering the same context. Hybrid approaches incorporate dense retrieval with sparse retrieval to capture both semantic and lexical similarity between query and documents [187, 297], or combine database search with web search when the knowledge base cannot provide relevant information [314].

### 8.2 Nearest Neighbor Search Acceleration

Due to the widespread adoption of ANNS in billion-scale recommendation systems [79, 81, 108, 170, 330, 333], recent works have proposed dedicated libraries [70, 72, 268] and optimized algorithms [23, 194, 195, 213, 288] to improve its performance. Since these optimizations target processor-centric systems, they cannot overcome the I/O data movement bottleneck that REIS aims to alleviate.

Various ANNS hardware accelerators [92, 113, 138, 178, 220, 242, 277, 299, 310, 327, 342] leverage approaches such as memory expansion [113, 242] and multi-node parallelism [92, 277, 327]. Processing-in-Memory (PIM) techniques have also been explored for accelerating Nearest Neighbor Search. For example, [232] proposes a CXL-based device that places vector product accelerators near LPDDR memory, aiming to improve the performance of Exact Nearest Neighbor Search (ENNS). In [231], Qin et al. leverage the properties of Non-Volatile Memory technologies to perform matrix-vector multiplication in the analog domain and accelerate RAG pipelines in edge devices. Despite performance improvements, DRAM-based approaches either fail to fundamentally address the I/O data movement bottleneck from storage or incur significant costs to serve large datasets.

## 9 Conclusion

We introduce REIS, a new retrieval system tailored to Retrieval-Augmented Generation based on In-Storage Processing. REIS improves performance and energy efficiency by leveraging the existing computational resources within the storage system. REIS comprises three key mechanisms dedicated to RAG: (i) a vector database layout that builds the correlation between embeddings and documents to enable efficient document retrieval for ISP systems, (ii) algorithmic support customized for the ISP-friendly Inverted File algorithm to improve retrieval performance, and (iii) an in-storage Approximate Nearest Neighbor Search (ANNS) engine that efficiently executes the ANNS kernel. Our evaluation shows that REIS significantly outperforms both (i) a modern CPU-based system for document retrieval and (ii) two state-of-the-art ISP-based ANNS accelerators. We believe and hope that REIS will inspire further research in In-Storage Processing, both in RAG and beyond.

## Acknowledgments

We sincerely thank Andreas Kosmas Kakolyris for his very significant contributions to the work during and after the rebuttal process. Andreas should be a major co-author of the published ISCA 2025 version of this paper, but due to the policy dictated by the ISCA leadership, which we, as all co-authors, wholeheartedly disagree with and find very problematic and unethical, he was not allowed to be a co-author. We thank the anonymous reviewers of ISCA 2025 for feedback. We thank the SAFARI Research Group members for feedback and the stimulating intellectual environment they provide. We acknowledge the generous gifts from our industrial partners, including Google, Huawei, Intel, and Microsoft. This work is supported in part by the ETH Future Computing Laboratory (EFCL), Huawei ZRC Storage Team, Semiconductor Research Corporation (SRC), AI Chip Center for Emerging Smart Systems (ACCESS), sponsored by InnoHK funding, Hong Kong SAR, and European Union’s Horizon programme for research and innovation [101047160 - BioPIM]. Jisung Park was supported by the National Research Foundation of Korea (RS-2024-00347394, RS-2024-00415602, RS-2024-00459026).

## References

- [1] 2004. FastBit: An Efficient Compressed Bitmap Index Technology. <https://sdm.lbl.gov/fastbit/>.
- [2] Anurag Acharya, Mustafa Uysal, and Joel Saltz. 1998. Active Disks: Programming Model, Algorithms and Evaluation. *ASPLOS* (1998).
- [3] Nitin Agrawal, Vijayan Prabhakaran, Ted Wobber, John D Davis, Mark Manasse, and Rina Panigrahy. 2008. Design Tradeoffs for SSD Performance. In *USENIX ATC*.
- [4] Minseon Ahn, Andrew Chang, Donghun Lee, Jongmin Gim, Jungmin Kim, Jaemin Jung, Oliver Rebholz, Vincent Pham, Krishna Malladi, and Yang Seok Ki. 2022. Enabling CXL memory expansion for in-memory database management systems. In *Proceedings of the 18th International Workshop on Data Management on New Hardware*. 1–5.
- [5] Voyage AI. 2024. voyage-multilingual-2: Multilingual Embedding Model. <https://blog.voyageai.com/2024/06/10/voyage-multilingual-2-multilingual-embedding-model/>
- [6] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. *arXiv preprint arXiv:2305.13245* (2023).
- [7] Mohammadamin Ajdari, Pyeongsu Park, Joonsung Kim, Dongup Kwon, and Jangwoo Kim. 2019. CIDR: A Cost-effective In-line Data Reduction System for Terabit-per-second Scale SSD Arrays. In *HPCA*.
- [8] Amazon Web Services. 2024. Build a RAG data ingestion pipeline for large-scale ML workloads. <https://aws.amazon.com/blogs/big-data/build-a-rag-data-ingestion-pipeline-for-large-scale-ml-workloads/>.
- [9] AMD. 2023. EPYC™ 9554. <https://www.amd.com/en/products/processors/server/epyc/4th-generation-9004-and-8004-series/amd-epyc-9554.html>
- [10] AMD. 2024. AMD® µProf. <https://www.amd.com/en/developer/uprof.html>
- [11] Laurent Amsaleg and Hervé Jégou. 2010. *Datasets for approximate nearest neighbor search*. <http://corpus-texmex.irisa.fr/>
- [12] David C Anastasiu and George Karypis. 2015. L2knnng: Fast exact k-nearest neighbor graph construction with l2-norm pruning. In *Proceedings of the 24th ACM International Conference on Information and Knowledge Management*. 791–800.
- [13] Arm. 2016. Cortex-R8. <https://www.arm.com/products/silicon-ip/cpu/cortex-r/cortex-r8>
- [14] Arm. 2020. Arm Storage Solution for SSD Controllers. <https://armkeil.blob.core.windows.net/developer/Files/pdf/solution-brief/arm-storage-solution-for-ssd-solutions-brief.pdf>
- [15] Sunil Arya, David M Mount, Nathan S Netanyahu, Ruth Silverman, and Angela Y Wu. 1998. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. *Journal of the ACM (JACM)* 45, 6 (1998), 891–923.
- [16] Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, and Kristen Grauman. 2023. Hiervl: Learning hierarchical video-language embeddings. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*. 23066–23078.
- [17] Duck-Ha Bae, Jin-Hyung Kim, Sang-Wook Kim, Hyunok Oh, and Chanik Park. 2013. Intelligent SSD: A Turbo for Big Data Mining. In *CIKM*.
- [18] Rajeev Balasubramonian, Andrew B Kahng, Naveen Muralimanohar, Ali Shafiee, and Vaishnav Srinivas. 2017. CACTI 7: New tools for interconnect exploration in innovative off-chip memories. *ACM Transactions on Architecture and Code Optimization (TACO)* 14, 2 (2017), 1–25.
- [19] Scott Beamer, Krste Asanovic, and David Patterson. 2012. Direction-Optimizing Breadth-First Search. In *SC*.
- [20] Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, and Siva Reddy. 2024. Llm2vec: Large language models are secretly powerful text encoders. *arXiv preprint arXiv:2404.05961* (2024).
- [21] Maciej Besta, Raghavendra Kanakagiri, Grzegorz Kwasniewski, Rachata Ausavarungnirun, Jakub Beránek, Konstantinos Kanellopoulos, Kacper Janda, Zur Vonarburg-Shmaria, Lukas Gianinazzi, Ioana Stefan, et al. 2021. SISA: Set-Centric Instruction Set Architecture for Graph Mining on Processing-in-Memory Systems. In *MICRO*.
- [22] Simona Boboila, Youngjae Kim, Sudharshan S Vazhkudai, Peter Desnoyers, and Galen M Shipman. 2012. Active Flash: Out-of-core Data Analytics on Flash Storage. In *MSST*.
- [23] Sebastian Bruch, Aditya Krishnan, and Franco Maria Nardini. 2024. Optimistic Query Routing in Clustering-based Approximate Maximum Inner Product Search. *arXiv preprint arXiv:2405.12207* (2024).
- [24] Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. 2024. Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*. 1818–1826.
- [25] Yu Cai, Saugata Ghose, Erich F. Haratsch, Yixin Luo, and Onur Mutlu. 2017. Error Characterization, Mitigation, and Recovery in Flash-Memory-Based Solid-State Drives. *Proc. IEEE* (2017).
- [26] Yu Cai, Saugata Ghose, Erich F Haratsch, Yixin Luo, and Onur Mutlu. 2017. Error Characterization, Mitigation, and Recovery in Flash-Memory-Based Solid-State Drives. *Proc. IEEE* (2017).
- [27] Yu Cai, Saugata Ghose, Erich F Haratsch, Yixin Luo, and Onur Mutlu. 2018. Errors in Flash-Memory-Based Solid-State Drives: Analysis, Mitigation, and Recovery. In *Inside Solid State Drives*.
- [28] Yu Cai, Saugata Ghose, Erich F. Haratsch, Yixin Luo, and Onur Mutlu. 2018. Reliability Issues in Flash-memory-based Solid-state Drives: Experimental Analysis, Mitigation, Recovery. In *Inside Solid State Drives* (2nd ed.).
- [29] Yu Cai, Saugata Ghose, Yixin Luo, Ken Mai, Onur Mutlu, and Erich F Haratsch. 2017. Vulnerabilities in MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and Mitigation Techniques. In *HPCA*.
- [30] Yu Cai, Erich F Haratsch, Onur Mutlu, and Ken Mai. 2012. Error Patterns in MLC NAND Flash Memory: Measurement, Characterization, and Analysis. In *DATE*.
- [31] Yu Cai, Erich F Haratsch, Onur Mutlu, and Ken Mai. 2012. Error patterns in MLC NAND flash memory: Measurement, characterization, and analysis. In *2012 Design, Automation & Test in Europe Conference & Exhibition (DATE)*. IEEE, 521–526.
- [32] Yu Cai, Gulay Yalcin, Onur Mutlu, Erich F. Haratsch, Adrian Cristal, Osman S. Unsal, et al. 2012. Flash Correct-and-Refresh: Retention-Aware Error Management for Increased Flash Memory Lifetime. In *ICCD*.
- [33] HM Cao, F Liu, Q Wang, ZC Du, L Jin, and ZL Huo. 2022. An efficient built-in error detection methodology with fast page-oriented data comparison in 3D NAND flash memories. *Electronics Letters* 58, 12 (2022), 483–485.
- [34] Chi-Min Chan, Chunpu Xu, Ruibin Yuan, Hongyin Luo, Wei Xue, Yike Guo, and Jie Fu. 2024. Rq-rag: Learning to refine queries for retrieval augmented generation. *arXiv preprint arXiv:2404.00610* (2024).
- [35] Chee-Yong Chan and Yannis E. Ioannidis. 1998. Bitmap Index Design and Evaluation. In *SIGMOD*.
- [36] Li-Pin Chang. 2010. A hybrid approach to NAND-flash-based solid-state disks. *IEEE Trans. Comput.* 59, 10 (2010), 1337–1349.
- [37] Harrison Chase. 2022. LangChain. <https://github.com/langchain-ai/langchain>
- [38] Danqi Chen and Wen-tau Yih. 2020. Open-domain question answering. In *Proceedings of the 58th annual meeting of the association for computational linguistics: tutorial abstracts*. 34–37.
- [39] Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. 2024. Benchmarking large language models in retrieval-augmented generation. In *Proceedings of the AAAI Conference on Artificial Intelligence*, Vol. 38. 17754–17762.
- [40] Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. *arXiv:2402.03216 [cs.CL]*
- [41] Qi Chen, Bing Zhao, Haidong Wang, Mingqin Li, Chuanjie Liu, Zengzhong Li, Mao Yang, and Jingdong Wang. 2021. Spann: Highly-efficient billion-scale approximate nearest neighborhood search. *Advances in Neural Information Processing Systems* 34 (2021), 5199–5212.
- [42] Rihan Chen, Bin Liu, Han Zhu, Yaoxuan Wang, Qi Li, Buting Ma, Qingbo Hu, Jun Jiang, Yunlong Xu, Hongbo Deng, et al. 2022. Approximate nearest neighbor search under neural similarity metric for large-scale recommendation. In *Proceedings of the 31st ACM International Conference on Information & Knowledge Management*. 3013–3022.
- [43] Wenhui Chen, Xinyi Wang, and William Yang Wang. 2021. A dataset for answering time-sensitive questions. *arXiv preprint arXiv:2108.06314* (2021).
- [44] Qinyuan Cheng, Xiaonan Li, Shimin Li, Qin Zhu, Zhangyue Yin, Yunfan Shao, Linyang Li, Tianxiang Sun, Hang Yan, and Xipeng Qiu. 2024. Unified Active Retrieval for Retrieval Augmented Generation. *arXiv preprint arXiv:2406.12534* (2024).
- [45] Xin Cheng, Di Luo, Xiuying Chen, Lemao Liu, Dongyan Zhao, and Rui Yan. 2024. Lift yourself up: Retrieval-augmented text generation with self-memory. *Advances in Neural Information Processing Systems* 36 (2024).
- [46] Benjamin Y Cho, Won Seob Jeong, Doohwan Oh, and Won Woo Ro. 2013. XSD: Accelerating MapReduce by Harnessing the GPU inside an SSD. In *WoNDP*.
- [47] Jiho Cho, D Chris Kang, Jongyeol Park, Sang-Wan Nam, Jung-Ho Song, Bong-Kil Jung, Jaedoeg Lyu, Hogil Lee, Won-Tae Kim, Hongsoo Jeon, et al. 2021. 30.3 A 512Gb 3b/Cell 7th-Generation 3D-NAND Flash Memory with 184MB/s Write Throughput and 2.0 Gb/s Interface. In *2021 IEEE International Solid-State Circuits Conference (ISSCC)*, Vol. 64. IEEE, 426–428.
- [48] Sungjun Cho, Beomjun Kim, Hyunkuk Cho, Gyeongseob Seo, Onur Mutlu, Myungsuk Kim, and Jisung Park. 2024. AERO: Adaptive Erase Operation for Improving Lifetime and Performance of Modern NAND Flash-Based SSDs. In *Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3*. 101–118.
- [49] Taehee Cho, Yeong-Taek Lee, Eun-Cheol Kim, Jin-Wook Lee, Sunmi Choi, Seungjae Lee, Dong-Hwan Kim, Wook-Ghee Han, Young-Ho Lim, Jae-Duk Lee, et al. 2001. A Dual-Mode NAND Flash Memory: 1-Gb Multilevel and High-Performance 512-Mb Single-Level Modes. *JSSC* (2001).

- [50] Wanik Cho, Jongseok Jung, Jongwoo Kim, Junghoon Ham, Sangkyu Lee, Yujong Noh, Dauni Kim, Wanseob Lee, Kayoung Cho, Kwanho Kim, et al. 2022. A 1-Tb, 4b/Cell, 176-Stacked-WL 3D-NAND Flash Memory with Improved Read Latency and a 14.8 Gb/mm<sup>2</sup> Density. In *ISSCC*.
- [51] Wanik Cho, Jongseok Jung, Jongwoo Kim, Junghoon Ham, Sangkyu Lee, Yujong Noh, Dauni Kim, Wanseob Lee, Kayoung Cho, Kwanho Kim, Heejoo Lee, Sooyeon Choi, Eunwoo Jo, Hanna Cho, Jong-Seok Kim, Chankeun Kwon, Cheoljoong Park, Hyeonsu Nam, Haeun Won, Taeho Kim, Kyeonghwan Park, Sanghoon Oh, Jinhyun Ban, Junyoung Park, Jaehyeon Shin, Taisik Shin, Junseon Jang, Jiseong Mun, Jehyun Choi, Hyunseung Choi, Sung-Wook Choi, Wonsun Park, Dongkyu Yoon, Minsu Kim, Junyoun Lim, Chiwook An, Hyunyoung Shir, Haesoon Oh, Haechan Park, Sungbo Shim, Hwang Huh, Hongsok Choi, Seungpil Lee, Jaesung Sim, Kichang Gwon, Jumsoo Kim, Woopyo Jeong, Jungdal Choi, and Kyo-Won Jin. 2022. A 1-Tb, 4b/Cell, 176-Stacked-WL 3D-NAND Flash Memory with Improved Read Latency and a 14.8Gb/mm<sup>2</sup> Density. In *2022 IEEE International Solid-State Circuits Conference (ISSCC)*, Vol. 65. 134–135. <https://doi.org/10.1109/ISSCC42614.2022.9731785>
- [52] Nayoung Choi and Jaeha Kim. 2020. Modeling and simulation of NAND flash memory sensing systems with cell-to-cell Vth variations. In *Proceedings of the 39th International Conference on Computer-Aided Design*. 1–8.
- [53] Wonil Choi, Myoungsoo Jung, Mahmut Kandemir, and Chita Das. 2018. Parallelizing Garbage Collection with I/O to Improve Flash Resource Utilization. In *HPDC*.
- [54] Jack Choquette and Wish Gandhi. 2020. Nvidia a100 gpu: Performance & innovation for gpu computing. In *2020 IEEE Hot Chips 32 Symposium (HCS)*. IEEE Computer Society, 1–43.
- [55] Jack Choquette, Wishwesh Gandhi, Olivier Giroix, Nick Stam, and Ronny Krashinsky. 2021. Nvidia a100 tensor core gpu: Performance and innovation. *IEEE Micro* 41, 2 (2021), 29–35.
- [56] Ashish Chouhan and Michael Gertz. 2024. LexDrafter: Terminology Drafting for Legislative Documents using Retrieval Augmented Generation. *arXiv preprint arXiv:2403.16295* (2024).
- [57] CMU-SAFARI. 2015. Ramulator. <https://github.com/CMU-SAFARI/ramulator.git>.
- [58] Cohere. 2023. Introducing Embed v3. <https://cohere.com/blog/introducing-embed-v3>
- [59] Cohere. 2024. beir-embed-english-v3. <https://huggingface.co/datasets/Cohere/beir-embed-english-v3>
- [60] Cohere. 2024. wikipedia-2023-11-embed-multilingual-v3. <https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3>
- [61] Cohere. 2024. wikipedia-2023-11-embed-multilingual-v3-int8-binary. <https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3-int8-binary>
- [62] Christian Monzio Compagnoni, Akira Goda, Alessandro S Spinelli, Peter Feeley, Andrea L Lacaita, and Angelo Visconti. 2017. Reviewing the evolution of the NAND flash technology. *Proc. IEEE* 105, 9 (2017), 1609–1633.
- [63] Rickard Cöster and Martin Svensson. 2002. Inverted file search algorithms for collaborative filtering. In *Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval*. 246–252.
- [64] Kornél Csernai. 2017. First Quora Dataset Release: Question Pairs. <https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs>
- [65] Tri Dao. 2023. Flashattention-2: Faster attention with better parallelism and work partitioning. *arXiv preprint arXiv:2307.08691* (2023).
- [66] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. *Advances in Neural Information Processing Systems* 35 (2022), 16344–16359.
- [67] Debendra Das Sharma, Robert Blankenship, and Daniel Berger. 2024. An introduction to the compute express link (cxl) interconnect. *Comput. Surveys* 56, 11 (2024), 1–37.
- [68] Anirban Dasgupta, Ravi Kumar, and Tamás Sarlos. 2011. Fast locality-sensitive hashing. In *Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining* (San Diego, California, USA) (*KDD '11*). Association for Computing Machinery, New York, NY, USA, 1073–1081. <https://doi.org/10.1145/2020408.2020578>
- [69] Jaeyoung Do, Yang-Suk Kee, Jignesh M Patel, Chanik Park, Kwanghyun Park, and David J DeWitt. 2013. Query Processing on Smart SSDs: Opportunities and Challenges. In *ACM SIGMOD*.
- [70] Magdalen Dobson, Zheqi Shen, Guy E. Blelloch, Laxman Dhulipala, Yan Gu, Harsha Vardhan Simhadri, and Yihan Sun. 2023. Scaling Graph-Based ANNS Algorithms to Billion-Size Datasets: A Comparative Analysis. *CoRR* abs/2305.04359 (2023). <https://doi.org/10.48550/arXiv.2305.04359> arXiv:2305.04359
- [71] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. *arXiv preprint arXiv:2010.11929* (2020).
- [72] Matthijs Douze, Alexandre Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvásy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. 2024. The Faiss library. *arXiv:2401.08281 [cs.LG]*
- [73] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. *arXiv preprint arXiv:2407.21783* (2024).
- [74] EleutherAI. 2021. GPT-J 6B. <https://huggingface.co/EleutherAI/gpt-j-6b>
- [75] Priyank Faldu, Jeff Diamond, and Boris Grot. 2019. A closer look at lightweight graph reordering. In *2019 IEEE International Symposium on Workload Characterization (IISWC)*. IEEE, 1–13.
- [76] Priyank Faldu, Jeff Diamond, and Boris Grot. 2020. Domain-specialized cache management for graph analytics. In *2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)*. IEEE, 234–248.
- [77] Minghui Fang, Shengpeng Ji, Jialong Zuo, Hai Huang, Yan Xia, Jieming Zhu, Xize Cheng, Xiaoda Yang, Wenrui Liu, Gang Wang, et al. 2024. ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling. *arXiv preprint arXiv:2406.17507* (2024).
- [78] Hakan Ferhatosmanoglu, Ertem Tuncel, Divyakant Agrawal, and Amr El Abbadi. 2001. Approximate nearest neighbor searching in multimedia databases. In *Proceedings 17th International Conference on Data Engineering*. IEEE, 503–511.
- [79] Cong Fu, Chao Xiang, Changxu Wang, and Deng Cai. 2017. Fast approximate nearest neighbor search with the navigating spreading-out graph. *arXiv preprint arXiv:1707.00143* (2017).
- [80] Anoushka Gade and Jorjeta Jetcheva. 2024. It's About Time: Incorporating Temporality in Retrieval Augmented Language Models. *arXiv preprint arXiv:2401.13222* (2024).
- [81] Yukang Gan, Yixiao Ge, Chang Zhou, Shupeng Su, Zhouchuan Xu, Xuyuan Xu, Quanchao Hui, Xiang Chen, Yexin Wang, and Ying Shan. 2023. Binary Embedding-based Retrieval at Tencent. In *Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining*. 4056–4067.
- [82] Congming Gao, Xin Xin, Youyou Lu, Youtao Zhang, Jun Yang, and Jiwu Shu. 2021. ParaBit: Processing Parallel Bitwise Operations in NAND Flash Memory Based SSDs. In *MICRO*.
- [83] Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. 2022. Precise zero-shot dense retrieval without relevance labels. *arXiv preprint arXiv:2212.10496* (2022).
- [84] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinlin Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofeng Wang. 2023. Retrieval-augmented generation for large language models: A survey. *arXiv preprint arXiv:2312.10997* (2023).
- [85] Nika Mansouri Ghiasi, Mohammad Sadrosadati, Harun Mustafa, Arvid Gollwitzer, Can Firtina, Julien Eudine, Haiyu Mao, Joël Lindegger, Meryem Banu Cavlak, Mohammed Alser, et al. 2024. MegIS: High-Performance, Energy-Efficient, and Low-Cost Metagenomic Analysis with In-Storage Processing. *arXiv preprint arXiv:2406.19113* (2024).
- [86] Saugata Ghose, Tianshi Li, Nastaran Hajinazar, Damla Senol Cali, and Onur Mutlu. 2019. Demystifying complex workload-DRAM interactions: An experimental study. *Proceedings of the ACM on Measurement and Analysis of Computing Systems* 3, 3 (2019), 1–50.
- [87] Sreyan Ghosh, Sonal Kumar, Chandra Kiran Reddy Evuru, Ramani Duraiswami, and Dinesh Manocha. 2024. Recap: retrieval-augmented audio captioning. In *ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*. IEEE, 1161–1165.
- [88] Rohit Girdhar, Alaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. 2023. Imagebind: One embedding space to bind them all. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*. 15180–15190.
- [89] Bob Goodwin, Michael Hopcroft, Dan Luu, Alex Clemmer, Mihaela Curmei, Sameh Elnikety, and Yuxiong He. 2017. BitFunnel: Revisiting Signatures for Search. In *SIGIR*.
- [90] Google. 2024. Get multimodal embeddings. <https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-multimodal-embeddings>
- [91] Donghyun Gouk, Miryeong Kwon, Hanyeoreum Bae, Sangwon Lee, and Myoungsoo Jung. 2023. Memory pooling with cxl. *IEEE Micro* 43, 2 (2023), 48–57.
- [92] Fabian Groh, Lukas Ruppert, Patrick Wieschollek, and Hendrik PA Lensch. 2022. Ggnn: Graph-based gpu nearest neighbor search. *IEEE Transactions on Big Data* 9, 1 (2022), 267–279.
- [93] Boncheol Gu, Andre S Yoon, Duck-Ho Bae, Insoon Jo, Jinyoung Lee, Jonghyun Yoon, Jeong-Uk Kang, Moonsang Kwon, Chanho Yoon, Sangyeun Cho, et al. 2016. Biscuit: A framework for near-data processing of big data workloads. *ACM SIGARCH Computer Architecture News* 44, 3 (2016), 153–165.
- [94] Boncheol Gu, Andre S. Yoon, Duck-Ho Bae, Insoon Jo, Jinyoung Lee, Jonghyun Yoon, Jeong-Uk Kang, Moonsang Kwon, Chanho Yoon, Sangyeun Cho, et al. 2016. Biscuit: A Framework for Near-Data Processing of Big Data Workloads. In *ISCA*.
- [95] Aayush Gupta, Youngjae Kim, and Bhuvan Urgaonkar. 2009. DFTL: A Flash Translation Layer Employing Demand-based Selective Caching of Page-level Address Mappings. In *ASPLOS*.
- [96] Zvika Guz, Manu Awasthi, Vijay Balakrishnan, Mrinmoy Ghosh, Anahita Shayesteh, Tameesh Suri, and Samsung Semiconductor. 2014. Real-Time Analytics as the Killer Application for Processing-In-Memory. *WoNDP* (2014).

- [97] Kiana Hajebi, Yasin Abbasi-Yadkori, Hossein Shahbazi, and Hong Zhang. 2011. Fast approximate nearest-neighbor search with k-nearest neighbor graph. In *Twenty-Second International Joint Conference on Artificial Intelligence*.
- [98] Nastaran Hajinazar, Geraldo F. Oliveira, Sven Gregorio, João Dinis Ferreira, Nika Mansouri Ghiasi, Minesh Patel, Mohammed Alser, Saugata Ghose, Juan Gómez-Luna, and Onur Mutlu. 2021. SIMDRAM: A Framework for Bit-Serial SIMD Processing Using DRAM. In *ASPLOS*.
- [99] Kyuhwa Han, Hyukjoong Kim, and Dongkun Shin. 2019. WAL-SSD: Address remapping-based write-ahead-logging solid-state disks. *IEEE Trans. Comput.* 69, 2 (2019), 260–273.
- [100] Weidong He, Zhi Li, Hao Wang, Tong Xu, Zhefeng Wang, Baoxing Huai, Nicholas Jing Yuan, and Enhong Chen. 2024. Multimodal Dialogue Systems via Capturing Context-aware Dependencies and Ordinal Information of Semantic Elements. *ACM Transactions on Intelligent Systems and Technology* 15, 3 (2024), 1–25.
- [101] Tsutomu Higuchi, Takuyo Kodama, Koji Kato, Ryo Fukuda, Naoya Tokiwa, Mitsuhiro Abe, Teruo Takagiwa, Yuki Shimizu, Junji Musha, Katsuaki Sakurai, et al. 2021. A 1Tb 3b/Cell 3D-Flash Memory in a 170+ Word-Line-Layer Technology. In *ISSCC*.
- [102] Charles AR Hoare. 1962. Quicksort. *The computer journal* 5, 1 (1962), 10–16.
- [103] C. A. R. Hoare. 1961. Algorithm 65: find. *Commun. ACM* 4, 7 (jul 1961), 321–322. <https://doi.org/10.1145/366622.366647>
- [104] Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, et al. 2024. DeepSpeed-fastgen: High-throughput text generation for llms via mii and deepspeed-inference. *arXiv preprint arXiv:2401.08671* (2024).
- [105] Abe Bohan Hou, Orion Weller, Guanghui Qin, Eugene Yang, Dawn Lawrie, Nils Holzenberger, Andrew Blair-Stanek, and Benjamin Van Durme. 2024. CLERC: A Dataset for Legal Case Retrieval and Retrieval-Augmented Analysis Generation. *arXiv preprint arXiv:2406.17186* (2024).
- [106] Han-Wen Hu, Wei-Chen Wang, Yuan-Hao Chang, Yung-Chun Lee, Bo-Rong Lin, Hui-Mu Wang, Yen-Po Lin, Yu-Ming Huang, Chong-Ying Lee, Tzu-Hsiang Su, et al. 2022. Ice: An intelligent cognition engine with 3d nand-based in-memory computing for vector similarity search acceleration. In *2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)*. IEEE, 763–783.
- [107] Jian Huang, Anirudh Badam, Moinuddin K Qureshi, and Karsten Schwan. 2015. Unified Address Translation for Memory-Mapped SSDs with Flashmap. In *ISCA*.
- [108] Jui-Ting Huang, Ashish Sharma, Shuying Sun, Li Xia, David Zhang, Philip Pronin, Janani Padmanabhan, Giuseppe Ottaviano, and Linjun Yang. 2020. Embedding-based retrieval in facebook search. In *Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining*. 2553–2561.
- [109] Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. In *Proceedings of the thirtieth annual ACM symposium on Theory of computing*. 604–613.
- [110] Intel. 2017. Intel® Xeon® Gold Processor 5118. <https://www.intel.de/content/www/de/de/products/sku/120473/intel-xeon-gold-5118-processor-16-5m-cache-2-30-ghz/specifications.html>
- [111] Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2023. Atlas: few-shot learning with retrieval augmented language models. *J. Mach. Learn. Res.* 24, 1, Article 251 (Jan. 2023), 43 pages.
- [112] Hongsun Jang, Jaeyong Song, Jaewon Jung, Jaeyoung Park, Youngsok Kim, and Jinho Lee. 2024. Smart-Infinity: Fast Large Language Model Training using Near-Storage Processing on a Real System. In *2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA)*. IEEE, 345–360.
- [113] Junhyeok Jang, Hanjin Choi, Hanyeoreum Bae, Seungjun Lee, Miryeong Kwon, and Myoungsoo Jung. 2023. CXL-ANNS: Software-Hardware collaborative memory disaggregation and computation for Billion-Scale approximate nearest neighbor search. In *2023 USENIX Annual Technical Conference (USENIX ATC 23)*. 585–600.
- [114] Tim Jansen, Yangling Tong, Victoria Zevallos, and Pedro Ortiz Suarez. 2022. Perplexed by quality: A perplexity-based method for adult and harmful content detection in multilingual heterogeneous web data. *arXiv preprint arXiv:2212.10440* (2022).
- [115] Suhas Jayaram Subramanya, Fnu Devvrit, Harsha Vardhan Simhadri, Ravishankar Krishnaswamy, and Rohan Kadekodi. 2019. Diskann: Fast accurate billion-point nearest neighbor search on a single node. *Advances in Neural Information Processing Systems* 32 (2019).
- [116] Herve Jegou, Matthijs Douze, and Cordelia Schmid. 2010. Product quantization for nearest neighbor search. *IEEE transactions on pattern analysis and machine intelligence* 33, 1 (2010), 117–128.
- [117] Jaeyong Jeong, Sangwook Shane Hahn, Sungjin Lee, and Jihong Kim. 2014. Lifetime Improvement of NAND Flash-based Storage Systems Using Dynamic Program and Erase Scaling. In *FAST*.
- [118] Won Seob Jeong, Changmin Lee, Keunsoo Kim, Myung Kuk Yoon, Won Jeon, Myoungsoo Jung, and Won Woo Ro. 2019. REACT: Scalable and High-performance Regular Expression Pattern Matching Accelerator for In-storage Processing. *IEEE TPDS* (2019).
- [119] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. *Comput. Surveys* 55, 12 (2023), 1–38.
- [120] Ziwei Ji, Tiezheng Yu, Yan Xu, Nayeon Lee, Etsuko Ishii, and Pascale Fung. 2023. Towards mitigating LLM hallucination via self reflection. In *Findings of the Association for Computational Linguistics: EMNLP 2023*. 1827–1843.
- [121] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In *International conference on machine learning*. PMLR, 4904–4916.
- [122] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024. Mixtral of experts. *arXiv preprint arXiv:2401.04088* (2024).
- [123] Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Active retrieval augmented generation. *arXiv preprint arXiv:2305.06983* (2023).
- [124] Jiajie Jin, Yutao Zhu, Xinyu Yang, Chenghao Zhang, and Zhicheng Dou. 2024. FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research. *arXiv preprint arXiv:2405.13576* (2024).
- [125] Insoon Jo, Duck-Ho Bae, Andre S Yoon, Jeong-Uk Kang, Sangyeun Cho, Daniel DG Lee, and Jaeheon Jeong. 2016. YourSQL: a high-performance database system leveraging in-storage computing. *Proceedings of the VLDB Endowment* 9, 12 (2016), 924–935.
- [126] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. *IEEE Transactions on Big Data* 7, 3 (2019), 535–547.
- [127] Sang-Woo Jun, Ming Liu, Sungjin Lee, Jamey Hicks, John Ankcorn, Myron King, Shuotao Xu, et al. 2015. BlueDBM: An Appliance for Big Data Analytics. In *ISCA*.
- [128] Sang-Woo Jun, Huy T Nguyen, Vijay Gadepally, et al. 2016. In-Storage Embedded Accelerator for Sparse Pattern Processing. In *HPEC*.
- [129] Sang-Woo Jun, Andy Wright, Sizhuo Zhang, Shuotao Xu, and Arvind. 2018. GraFBoost: Using Accelerated Flash Storage for External Graph Analytics. In *ISCA*.
- [130] Myoungsoo Jung, Ramya Prabhakar, and Mahmut Taylan Kandemir. 2012. Taking Garbage Collection Overheads Off the Critical Path in SSDs. In *Middleware*.
- [131] Luyi Kang, Yuqi Xue, Weiwei Jia, Xiaohao Wang, Jongryool Kim, Changhwan Youn, Myeong Joon Kang, Hyung Jin Lim, Bruce Jacob, and Jian Huang. 2021. IceClave: A Trusted Execution Environment for In-Storage Computing. In *MICRO*.
- [132] Wonkyung Kang, Dongkun Shin, and Sungjoo Yoo. 2017. Reinforcement learning-assisted garbage collection to mitigate long-tail latency in SSD. *ACM Transactions on Embedded Computing Systems (TECS)* 16, 5s (2017), 1–20.
- [133] Yangwook Kang, Yang-Suk Kee, Ethan L. Miller, and Chanik Park. 2013. Enabling Cost-effective Data Processing with Smart SSD. In *MSST*.
- [134] Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Akari Asai, Xinyan Yu, Dragomir Radev, Noah A Smith, Yejin Choi, Kentaro Inui, et al. 2024. REALTIME QA: what's the answer right now? *Advances in Neural Information Processing Systems* 36 (2024).
- [135] Nirant Kasliwal. 2023. Binary Quantization - Vector Search, 40x Faster. <https://qdrant.tech/articles/binary-quantization/>
- [136] Kimberly Keeton, David A Patterson, and Joseph M Hellerstein. 1998. A Case for Intelligent Disks (IDISKs). *SIGMOD Rec.* (1998).
- [137] Ali Khakifirooz, Sriram Balasubrahmanyam, Richard Fastow, Christopher H Gaewsky, Chang Wan Ha, Rezaul Haque, Owen W Jungroth, Steven Law, Aliasgar S Madraswala, Binh Ngo, et al. 2021. A 1Tb 4b/Cell 144-Tier Floating-Gate 3D-NAND Flash Memory with 40MB/s Program Throughput and 13.8Gb/mm<sup>2</sup> Bit Density. In *ISSCC*.
- [138] Saim Khan, Somesh Singh, Harsha Vardhan Simhadri, Jyothi Vedurada, et al. 2024. BANG: Billion-Scale Approximate Nearest Neighbor Search using a Single GPU. *arXiv preprint arXiv:2401.11324* (2024).
- [139] Chulbum Kim, Jinho Ryu, Taesung Lee, Hyunggon Kim, Jaewoo Lim, Jaeyong Jeong, Seonghwan Seo, Hongsoo Jeon, Bokeun Kim, Inyoul Lee, et al. 2012. A 21 nm high performance 64 Gb MLC NAND flash memory with 400 MB/s asynchronous toggle DDR interface. *IEEE Journal of Solid-State Circuits* 47, 4 (2012), 981–989.
- [140] Gangwoo Kim, Sungdong Kim, Byeongguk Jeon, Joonsuk Park, and Jaewoo Kang. 2023. Tree of clarifications: Answering ambiguous questions with retrieval-augmented large language models. *arXiv preprint arXiv:2310.14696* (2023).
- [141] Jiho Kim, Myoungsoo Jung, and John Kim. 2021. Decoupled SSD: Reducing Data Movement on NAND-Based Flash SSD. *IEEE CAL* (2021).
- [142] Junkyun Kim, Myoungsoo Jung, Yunki Han, Yang-Gon Kim, and Lee-Sup Kim. 2023. Optimstore: In-storage optimization of large scale dnns with on-die processing. In *2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)*. IEEE, 611–623.

- [143] Jiho Kim, Seokwon Kang, Yongjun Park, and John Kim. 2022. Networked SSD: Flash Memory Interconnection Network for High-Bandwidth SSD. In *MICRO*.
- [144] Ji-Hoon Kim, Yeo-Reum Park, Jaeyoung Do, Soo-Young Ji, and Joo-Young Kim. 2022. Accelerating large-scale graph-based nearest neighbor search on a computational storage platform. *IEEE Trans. Comput.* 72, 1 (2022), 278–290.
- [145] Minkuk Kim, Hyeon Bae Kim, Jinyoung Moon, Jinwoo Choi, and Seong Tae Kim. 2024. Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*. 13894–13904.
- [146] Minsub Kim and Sungjin Lee. 2020. Reducing Tail Latency of DNN-based Recommender Systems using In-Storage Processing. In *APSys*.
- [147] Moosung Kim, Sung Won Yun, Jungjune Park, Hyun Kook Park, Jungyu Lee, Yeong Seon Kim, Daehoon Na, Sara Choi, Youngsun Song, Jonghoon Lee, Hyunjun Yoon, Kangbin Lee, Byunghoon Jeong, Sanglok Kim, Junhong Park, Cheon An Lee, Jaeyoun Lee, Jisang Lee, Jin Young Chun, Joonsuc Jang, Youngwhi Yang, Seung Hyun Moon, Myunghoon Choi, Wontae Kim, Jungsoo Kim, Seokmin Yoon, Pansuk Kwak, Myunghun Lee, Raehyun Song, Sunghoon Kim, Chiweon Yoon, Dongku Kang, Jin-Yub Lee, and Jaihyuk Song. 2022. A 1Tb 3b/Cell 8th-Generation 3D-NAND Flash Memory with 164MB/s Write Throughput and a 2.4Gb/s Interface. In *2022 IEEE International Solid-State Circuits Conference (ISSCC)*, Vol. 65. 136–137. <https://doi.org/10.1109/ISSCC42614.2022.9731640>
- [148] Sungchan Kim, Hyunok Oh, Chanik Park, Sangyeun Cho, Sang-Won Lee, and Bongki Moon. 2016. In-storage processing of database scans and joins. *Information Sciences* 327 (2016), 183–200.
- [149] Sungchan Kim, Hyunok Oh, Chanik Park, Sangyeun Cho, Sang-Won Lee, and Bongki Moon. 2016. In-Storage Processing of Database Scans and Joins. *Information Sciences* (2016).
- [150] Yoongu Kim, Weikun Yang, and Onur Mutlu. 2015. Ramulator: A fast and extensible DRAM simulator. *IEEE Computer architecture letters* 15, 1 (2015), 45–49.
- [151] Gunjae Koo, Kiran Kumar Matam, Te I, HV Krishna Giri Narra, Jing Li, Hung-Wei Tseng, Steven Swanson, and Murali Annavaram. 2017. Summarizer: Trading Communication with Computing Near Storage. In *MICRO*.
- [152] Toshiyuki Kouchi, Mami Kakoi, Noriyasu Kumazaki, Akio Sugahara, Akihiro Imamoto, Yasufumi Kajiyama, Yuri Terada, Bushnaq Sanad, Naoki Kanagawa, Takuyo Kodama, et al. 2020. A 128Gb 1-bit/cell 96-word-line-layer 3D flash memory to improve the random read latency with $t_{PROG}=75\ \mu s$ and $t_R=4\ \mu s$. *IEEE Journal of Solid-State Circuits* 56, 1 (2020), 225–234.
- [153] Eyal Kushilevitz, Rafail Ostrovsky, and Yuval Rabani. 1998. Efficient search for approximate nearest neighbor in high dimensional spaces. In *Proceedings of the thirtieth annual ACM symposium on Theory of computing*. 614–623.
- [154] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural questions: a benchmark for question answering research. *Transactions of the Association for Computational Linguistics* 7 (2019), 453–466.
- [155] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Yu, Joey Gonzalez, Hao Zhang, and Ion Stoica. 2023. vllm: Easy, fast, and cheap llm serving with pagedattention. See <https://vllm.ai/> (accessed 9 August 2023) (2023).
- [156] Jakub Lála, Odhran O'Donoghue, Aleksandar Shtedritski, Sam Cox, Samuel G Rodrigues, and Andrew D White. 2023. Paperqa: Retrieval-augmented generative agent for scientific research. *arXiv preprint arXiv:2312.07559* (2023).
- [157] LangChain. 2023. Text Splitters. <https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/>
- [158] Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. 2024. NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models. *arXiv preprint arXiv:2405.17428* (2024).
- [159] Dusol Lee, Duwon Hong, Wonil Choi, and Jihong Kim. 2022. MQSim-E: An Enterprise SSD Simulator. *IEEE Computer Architecture Letters* 21, 1 (2022), 13–16.
- [160] Jinhyuk Lee, Zhuyun Dai, Xiaoqi Ren, Blair Chen, Daniel Cer, Jeremy R Cole, Kai Hui, Michael Boratko, Rajvi Kapadia, Wen Ding, et al. 2024. Gecko: Versatile text embeddings distilled from large language models. *arXiv preprint arXiv:2403.20327* (2024).
- [161] Junghee Lee, Youngjae Kim, Galen M Shipman, Sarp Oral, and Jongman Kim. 2013. Preemptible I/O Scheduling of Garbage Collection for Solid State Drives. *IEEE TCAD* (2013).
- [162] Joo Hwan Lee, Hui Zhang, Veronica Lagrange, Praveen Krishnamoorthy, Xiaodong Zhao, and Yang Seok Ki. 2020. SmartSSD: FPGA Accelerated Near-Storage Data Analytics on SSD. *IEEE CAL* (2020).
- [163] Seungjae Lee, Young-Taek Lee, Wook-Kee Han, Dong-Hwan Kim, Moo-Sung Kim, Seung-Hyun Moon, Hyun Chul Cho, Jung-Woo Lee, Dae-Seok Byeon, Young-Ho Lim, et al. 2004. A 3.3V 4Gb Four-Level NAND Flash Memory with 90nm CMOS Technology. In *ISSCC*.
- [164] Sean Lee, Aamir Shakir, Darius Koenig, and Julius Lipp. 2024. Open Source Strikes Bread - New Fluffy Embeddings Model. <https://www.mixedbread.ai/blog/mxbai-embed-large-v1>
- [165] Nancy Leong, Sachit Chandra, and Hounien Chen. 2008. Random Cache Read Using a Double Memory. US Patent 7,423,915.
- [166] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. *Advances in Neural Information Processing Systems* 33 (2020), 9459–9474.
- [167] Cangyuan Li, Ying Wang, Cheng Liu, Shengwen Liang, Huawei Li, and Xiaowei Li. 2021. GLIST: Towards In-Storage Graph Learning. In *USENIX ATC*.
- [168] Huaicheng Li, Daniel S Berger, Lisa Hsu, Daniel Ernst, Pantea Zardoshti, Stanko Novakovic, Monish Shah, Samir Rajadnya, Scott Lee, Ishwar Agarwal, et al. 2023. Pond: Cxl-based memory pooling systems for cloud platforms. In *Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2*. 574–587.
- [169] Huayang Li, Yixuan Su, Deng Cai, Yan Wang, and Lemao Liu. 2022. A survey on retrieval-augmented text generation. *arXiv preprint arXiv:2202.01110* (2022).
- [170] Sen Li, Fuyi Lv, Taiwei Jin, Guli Lin, Keping Yang, Xiaoyi Zeng, Xiao-Ming Wu, and Qianli Ma. 2021. Embedding-based product retrieval in taobao search. In *Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining*. 3181–3189.
- [171] Shuangchen Li, Cong Xu, Qiaosha Zou, Jishen Zhao, Yu Lu, and Yuan Xie. 2016. Pinatubo: A Processing-in-Memory Architecture for Bulk Bitwise Operations in Emerging Non-Volatile Memories. In *DAC*.
- [172] Wen Li, Ying Zhang, Yifang Sun, Wei Wang, Mingjie Li, Wenjie Zhang, and Xuemin Lin. 2019. Approximate nearest neighbor search on high dimensional data—experiments, analyses, and improvement. *IEEE Transactions on Knowledge and Data Engineering* 32, 8 (2019), 1475–1488.
- [173] Yinan Li and Jignesh M. Patel. 2013. BitWeaving: Fast Scans for Main Memory Data Processing. In *SIGMOD*.
- [174] Yinan Li and Jignesh M. Patel. 2014. WideTable: An Accelerator for Analytical Data Processing. In *VLDB*.
- [175] Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. 2023. Towards general text embeddings with multi-stage contrastive learning. *arXiv preprint arXiv:2308.03281* (2023).
- [176] Shengwen Liang, Ying Wang, Cheng Liu, Huawei Li, and Xiaowei Li. 2019. InS-DLA: An In-SSD Deep Learning Accelerator for Near-Data Processing. In *FPL*.
- [177] Shengwen Liang, Ying Wang, Youyou Lu, Zhe Yang, Huawei Li, and Xiaowei Li. 2019. Cognitive SSD: A Deep Learning Engine for In-Storage Data Retrieval. In *USENIX ATC*.
- [178] Shengwen Liang, Ying Wang, Ziming Yuan, Cheng Liu, Huawei Li, and Xiaowei Li. 2022. VStore: in-storage graph based vector search accelerator. In *Proceedings of the 59th ACM/IEEE Design Automation Conference*. 997–1002.
- [179] Minje Lim, Jeeyoon Jung, and Dongkun Shin. 2021. LSM-Tree Compaction Acceleration Using In-Storage Processing. In *ICCE-Asia*.
- [180] Sang-Phil Lim, Sang-Won Lee, and Bongki Moon. 2010. FASTER FTL for Enterprise-Class Flash Memory SSDs. In *SNAPI*.
- [181] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. Deepseek-v3 technical report. *arXiv preprint arXiv:2412.19437* (2024).
- [182] Jerry Liu. 2022. LlamaIndex. <https://doi.org/10.5281/zenodo.1234>
- [183] Ting Liu, Andrew Moore, Ke Yang, and Alexander Gray. 2004. An investigation of practical approximate nearest neighbor algorithms. *Advances in neural information processing systems* 17 (2004).
- [184] Zengtao Tony Liu. 2022. Flash Memory and NAND. In *Advanced Driver Assistance Systems and Autonomous Vehicles: From Fundamentals to Applications*. Springer.
- [185] Antoine Louis, Gijs van Dijck, and Gerasimos Spanakis. 2024. Interpretable long-form legal question answering with retrieval-augmented large language models. In *Proceedings of the AAAI Conference on Artificial Intelligence*, Vol. 38. 22266–22275.
- [186] Alejandro Lozano, Scott L Fleming, Chia-Chun Chiang, and Nigam Shah. 2023. Clinfo.ai: An open-source retrieval-augmented large language model system for answering medical questions using scientific literature. In *Pacific Symposium on Biocomputing 2024*. World Scientific, 8–23.
- [187] Shuai Lu, Nan Duan, Hojae Han, Daya Guo, Seung-won Hwang, and Alexey Svyatkovskiy. 2022. ReACC: A retrieval-augmented code completion framework. *arXiv preprint arXiv:2203.07722* (2022).
- [188] Shye-Kung Lu, Shang-Xiu Zhong, and Masaki Hashizume. 2018. Fault Leveling Techniques for Yield and Reliability Enhancement of NAND Flash Memories. *Journal of Electronic Testing* 34 (2018), 559–570.
- [189] Macronix. 2013. Technical Note: Improving NAND Throughput with Two-Plane and Cache Operations. <https://www.macronix.com/Lists/ApplicationNote/Attachments/1907/AN0268V1_Improving%20NAND%20Throughput%20with%20Two-Plane%20and%20Cache%20Operations.pdf>
- [190] Hiroshi Maejima, Kazushige Kanda, Susumu Fujimura, Teruo Takagiwa, Susumu Ozawa, Junpei Sato, Yoshihiko Shindo, Manabu Sato, Naoki Kanagawa, Junji Musha, et al. 2018. A 512Gb 3b/Cell 3D Flash Memory on a 96-Word-Line-Layer Technology. In *ISSCC*.

- [191] Hosam M Mahmoud, Reza Modarres, and Robert T Smythe. 1995. Analysis of quickselect: An algorithm for order statistics. *RAIRO-Theoretical Informatics and Applications* 29, 4 (1995), 255–276.
- [192] Vikram Sharma Mailthody, Zaid Qureshi, Weixin Liang, Ziyan Feng, Simon Garcia De Gonzalo, Youjie Li, Hubertus Franke, Jinjun Xiong, Jian Huang, and Wen-mei Hwu. 2019. Deepstore: In-storage acceleration for intelligent queries. In *Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture*. 224–238.
- [193] Vikram Sharma Mailthody, Zaid Qureshi, Weixin Liang, Ziyan Feng, Simon Garcia De Gonzalo, Youjie Li, Hubertus Franke, Jinjun Xiong, Jian Huang, and Wen-mei Hwu. 2019. Deepstore: In-Storage Acceleration for Intelligent Queries. In *MICRO*.
- [194] Yury Malkov, Alexander Ponomarenko, Andrey Logvinov, and Vladimir Krylov. 2014. Approximate nearest neighbor algorithm based on navigable small world graphs. *Information Systems* 45 (2014), 61–68.
- [195] Yu A Malkov and Dmitry A Yashunin. 2018. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. *IEEE transactions on pattern analysis and machine intelligence* 42, 4 (2018), 824–836.
- [196] Alex Mallen, Akira Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. In *Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, Anna Rogers, Jordan Boyd-Graber, and Naoki Okazaki (Eds.). Association for Computational Linguistics, Toronto, Canada, 9802–9822. <https://doi.org/10.18653/v1/2023.acl-long.546>
- [197] Magdalen Dobson Manohar, Zheqi Shen, Guy Blelloch, Laxman Dhulipala, Yan Gu, Harsha Vardhan Simhadri, and Yihan Sun. 2024. ParlayANN: Scalable and Deterministic Parallel Graph-Based Approximate Nearest Neighbor Search Algorithms. In *Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming*. 270–285.
- [198] Nika Mansouri Ghiasi, Jisung Park, Harun Mustafa, Jeremie Kim, Ataberk Olgun, Arvid Gollwitzer, Damla Senol Cali, Can Firtina, Haiyu Mao, Nour Almadhoun Alserri, et al. 2022. GenStore: A high-performance in-storage processing system for genome sequence analysis. In *Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems*. 635–654.
- [199] Nika Mansouri Ghiasi, Jisung Park, Harun Mustafa, Jeremie Kim, Ataberk Olgun, Arvid Gollwitzer, Damla Senol Cali, Can Firtina, Haiyu Mao, Nour Almadhoun Alserri, Rachata Ausavarungnirun, Nandita Vijaykumar, Mohammed Alser, and Onur Mutlu. 2022. GenStore: A High-Performance In-Storage Processing System for Genome Sequence Analysis. In *ASPLOS*.
- [200] Ariana Martino, Michael Iannelli, and Coleen Truong. 2023. Knowledge injection to counter large language model (LLM) hallucination. In *European Semantic Web Conference*. Springer, 182–185.
- [201] Pratik Mazumder, Pravendra Singh, Kranti Kumar Parida, and Vinay P Namboodiri. 2021. Avgzslnet: Audio-visual generalized zero-shot learning by reconstructing label features from multi-modal embeddings. In *Proceedings of the IEEE/CVF winter conference on applications of computer vision*. 3090–3099.
- [202] Rui Meng, Ye Liu, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. 2024. SFR-Embedding-Mistral: enhance text retrieval with transfer learning. *Salesforce AI Research Blog* 3 (2024).
- [203] Rino Micheloni, Luca Crippa, and Alessia Marelli. 2010. *Inside NAND Flash Memories*. Springer.
- [204] Rino Micheloni, Luca Crippa, and Alessia Marelli. 2010. *Inside NAND Flash Memories*.
- [205] Microchip. 2022. Microchip 16-Channel PCIe Gen 5 Enterprise NVMe SSD Controller, <https://www.microchip.com/en-us/about/news-releases/products/highest-performance-16-channel-pcie-gen-5-enterprise-nvme-ssd-controller>.
- [206] Micron. 2009. NAND Flash Memory Data Sheet: MT29F16G08ABABA, MT29F32G08AFABA, MT29F64G08A[J/K/M]ABA, MT29F128G08AUABA, MT29F16G08ABCBB, MT29F32G08AECBB, MT29F64G08A[K/M]CBB, MT29F128G08AUCBB.
- [207] Micron. 2023. 9400 NVMe™ SSD. <https://www.micron.com/products/storage/ssd/data-center-ssd/9400-ssd>
- [208] Micron. 2025. DDR4 SDRAM. <https://www.micron.com/products/memory/dram-components/ddr4-sdram>
- [209] Microsoft. 2024. Binary quantization in Azure AI Search: optimized storage and faster search. <https://techcommunity.microsoft.com/blog/azure-ai-services-blog/binary-quantization-in-azure-ai-search-optimized-storage-and-faster-search/4221918>
- [210] Microsoft. 2024. Multimodal embeddings (version 4.0). <https://learn.microsoft.com/en-us/azure/ai-services/computer-vision/concept-image-retrieval>
- [211] Niklas Muennighoff, Hongjin Su, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, and Douwe Kiela. 2024. Generative representational instruction tuning. *arXiv preprint arXiv:2402.09906* (2024).
- [212] Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. 2022. MTEB: Massive text embedding benchmark. *arXiv preprint arXiv:2210.07316* (2022).
- [213] Javier Vargas Munoz, Marcos A Gonçalves, Zanoni Dias, and Ricardo da S Torres. 2019. Hierarchical clustering-based graphs for large scale approximate nearest neighbor search. *Pattern Recognition* 96 (2019), 106970.
- [214] Rakesh Nadig, Mohammad Sadrosadati, Haiyu Mao, Nika Mansouri Ghiasi, Arash Tavakkol, Jisung Park, Hamid Sarbazi-Azad, Juan Gómez Luna, and Onur Mutlu. 2023. Venice: Improving Solid-State Drive Parallelism at Low Cost via Conflict-Free Accesses. In *ISCA*.
- [215] Fuping Niu, Jianhui Yue, Jiangqiu Shen, Xiaofei Liao, and Hai Jin. 2024. FlashGNN: An In-SSD Accelerator for GNN Training. In *2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA)*. IEEE, 361–378.
- [216] Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin. 2019. Multi-stage document ranking with BERT. *arXiv preprint arXiv:1910.14424* (2019).
- [217] NVIDIA Corp. 2023. NVIDIA H100. <https://www.nvidia.com/en-us/data-center/h100/>
- [218] NVM Express. 2024. NVM Command Set Specification Revision 1.1.
- [219] Elizabeth O’Neil, Patrick O’Neil, and Kesheng Wu. 2007. Bitmap Index Design Choices and their Performance Implications. In *IDEAS*.
- [220] Hiroyuki Ootomo, Akira Naruse, Corey Nolet, Ray Wang, Tamas Feher, and Yong Wang. 2023. CAGRA: Highly parallel graph construction and approximate nearest neighbor search for GPUs. *arXiv preprint arXiv:2308.15136* (2023).
- [221] OpenAI. 2024. New embedding models and API updates. <https://openai.com/index/new-embedding-models-and-api-updates/>
- [222] OpenAI. 2025. ChatGPT. <https://chatgpt.com/>.
- [223] Aditya Pal, Chantal Eksombatchai, Yitong Zhou, Bo Zhao, Charles Rosenberg, and Jure Leskovec. 2020. PinnerSage: Multi-modal user embedding framework for recommendations at Pinterest. In *Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining*. 2311–2320.
- [224] Jisung Park, Roknoddin Azizi, Geraldo F Oliveira, Mohammad Sadrosadati, Rakesh Nadig, David Novo, Juan Gómez-Luna, Myungsuk Kim, and Onur Mutlu. 2022. Flash-Cosmos: In-flash bulk bitwise operations using inherent computation capability of nand flash memory. In *2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)*. IEEE, 937–955.
- [225] Jaehyun Park, Jaewan Choi, Kwanhee Kyung, Michael Jaemin Kim, Yongsuk Kwon, Nam Sung Kim, and Jung Ho Ahn. 2024. AttAcc! Unleashing the Power of PIM for Batched Transformer-based Generative Model Inference. In *Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2*. 103–119.
- [226] Shuyi Pei, Jing Yang, and Qing Yang. 2019. REGISTOR: A Platform for Unstructured Data Processing inside SSD Storage. *ACM TOS* (2019).
- [227] Ben Perach, Ronny Ronen, Benny Kimelfeld, and Shahar Kvatinsky. 2022. PIMDB: Understanding Bulk-Bitwise Processing In-Memory Through Database Analytics. *arXiv:2203.10486* (2022).
- [228] Malte Pietsch, Timo Möller, Bogdan Kostic, Julian Risch, Massimiliano Pippi, Mayank Jobanputra, Sara Zanzottera, Silvano Cerza, Vladimir Blagojevic, Thomas Stadelmann, Tanay Soni, and Sebastian Lee. 2019. Haystack: the end-to-end NLP framework for pragmatic builders. <https://github.com/deepset-ai/haystack>
- [229] Mykhailo Poliakov and Nadiya Shvai. 2024. Multi-Meta-RAG: Improving RAG for Multi-Hop Queries using Database Filtering with LLM-Extracted Metadata. *arXiv preprint arXiv:2406.13213* (2024).
- [230] Hongwei Qin, Dan Feng, Wei Tong, Yutong Zhao, Sheng Qiu, Fei Liu, and Shu Li. 2021. Better atomic writes by exposing the flash out-of-band area to file systems. In *Proceedings of the 22nd ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems*. 12–23.
- [231] Ruiyang Qin, Zheyu Yan, Dewen Zeng, Zhenghe Jia, Dancheng Liu, Jianbo Liu, Ahmed Abbasi, Zhi Zheng, Ningyuan Cao, Kai Ni, et al. 2024. Robust implementation of retrieval-augmented generation on edge-based computing-in-memory architectures. In *Proceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design*. 1–9.
- [232] Derrick Quinn, Mohammad Nouri, Neel Patel, John Salihu, Alireza Salemi, Sukhan Lee, Hamed Zamani, and Mohammad Alian. 2025. Accelerating Retrieval-Augmented Generation. In *Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1*. 15–32.
- [233] Zackary Rackauckas. 2024. Rag-fusion: a new take on retrieval-augmented generation. *arXiv preprint arXiv:2402.03367* (2024).
- [234] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In *International conference on machine learning*. PMLR, 8748–8763.
- [235] Vatsal Raina and Mark Gales. 2024. Question-Based Retrieval using Atomic Units for Enterprise RAG. *arXiv preprint arXiv:2405.12363* (2024).
- [236] Md Raquibuzzaman, Aleksandar Milenkovic, and Biswajit Ray. 2022. Intrablock wear leveling to counter layer-to-layer endurance variation of 3-D NAND flash memory. *IEEE Transactions on Electron Devices* 70, 1 (2022), 70–75.
- [237] Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In *Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining*. 3505–3506.
- [238] Redis. 2025. Redis bitmaps. <https://redis.io/docs/data-types/bitmaps/>.
- [239] Nils Reimers. 2022. Cohere int8 & binary Embeddings - Scale Your Vector Database to Large Datasets. <https://cohere.com/blog/int8-binary-embeddings>
- [240] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In *Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing*. Association for Computational Linguistics. <https://arxiv.org/abs/1908.10084>
- [241] Nils Reimers and Iryna Gurevych. 2020. Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation. In *Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing*. Association for Computational Linguistics. <https://arxiv.org/abs/2004.09813>
- [242] Jie Ren, Minjia Zhang, and Dong Li. 2020. Hm-ann: Efficient billion-point nearest neighbor search on heterogeneous memory. *Advances in Neural Information Processing Systems* 33 (2020), 10672–10684.
- [243] Erik Riedel, Christos Faloutsos, Garth A Gibson, and David Nagle. 2001. Active Disks for Large-Scale Data Processing. *Computer* (2001).
- [244] Erik Riedel, Garth Gibson, and Christos Faloutsos. 1998. Active Storage for Large-Scale Data Mining and Multimedia Applications. *VLDB* (1998).
- [245] Meng Rui, Liu Ye, Rayhan Joty Shafiq, Xiong Caiming, Zhou Yingbo, and Yavuz Semih. 2024. SFR-Embedding-2: Advanced Text Embedding with Multi-stage Training. <https://huggingface.co/Salesforce/SFR-Embedding_2_R>
- [246] Samsung. 2009. 32Gb A-die NAND Flash Datasheet.
- [247] Samsung. 2013. Samsung Solid State Drive TurboWrite Technology White Paper.
- [248] Samsung. 2020. Samsung 128 GB DDR4 3200 LRDIMM ECC Registered. <https://semiconductor.samsung.com/dram/module/lrdimm/m386aag40am3-cwe/>.
- [249] Samsung. 2021. 980 Pro. <https://semiconductor.samsung.com/consumer-storage/internal-ssd/980pro/>
- [250] Samsung. 2021. PM9A3 NVMe PCIe SSD. <https://semiconductor.samsung.com/ssd/datacenter-ssd/pm9a3/>
- [251] Samsung. 2022. 990 Pro. <https://semiconductor.samsung.com/consumer-storage/internal-ssd/990-pro/>
- [252] Daniel Sanchez and Christos Kozyrakis. 2013. ZSim: Fast and accurate microarchitectural simulation of thousand-core systems. *ACM SIGARCH Computer architecture news* 41, 3 (2013), 475–486.
- [253] Kunal Sawarkar, Abhilasha Mangal, and Shivam Raj Solanki. 2024. Blended rag: Improving rag (retriever-augmented generation) accuracy with semantic search and hybrid query-based retrievers. In *2024 IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR)*. IEEE, 155–161.
- [254] Sudharsan Seshadri, Mark Gahagan, Sundaram Bhaskaran, Trevor Bunker, Arup De, Yanqin Jin, Yang Liu, and Steven Swanson. 2014. Willow: A User-Programmable SSD. In *USENIX OSDI*.
- [255] Vivek Seshadri, Kevin Hsieh, Amirali Boroumand, Donghyuk Lee, Michael A. Kozuch, Onur Mutlu, Phillip B. Gibbons, and Todd C. Mowry. 2015. Fast Bulk Bitwise AND or OR in DRAM. *IEEE CAL* (2015).
- [256] Vivek Seshadri, Young Kim, Chris Fallin, Donghyuk Lee, Rachata Ausavarungnirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Phillip B Gibbons, Michael A Kozuch, et al. 2013. RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization. In *MICRO*.
- [257] Vivek Seshadri, Donghyuk Lee, Thomas Mullins, Hasan Hassan, Amirali Boroumand, Jeremie Kim, Michael A Kozuch, Onur Mutlu, Phillip B Gibbons, and Todd C Mowry. 2017. Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology. In *MICRO*.
- [258] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. 2024. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. *arXiv preprint arXiv:2407.08608* (2024).
- [259] Narges Shahidi, Mahmut T Kandemir, Mohammad Arjomand, Chita R Das, Myoungsoo Jung, and Anand Sivasubramaniam. 2016. Exploring the Potentials of Parallel Garbage Collection in SSDs for Enterprise Storage Systems. In *SC*.
- [260] Aamir Shakir, Tom Aarsen, and Sean Lee. 2024. Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval. *Hugging Face Blog* (2024). <https://huggingface.co/blog/embedding-quantization>.
- [261] Noboru Shibata, Kazushige Kanda, Takahiro Shimizu, Jun Nakai, Osamu Nagao, Naoki Kobayashi, Makoto Miakoshi, Yasushi Nagadomi, Tomoaki Nakano, Takahisa Kawabe, et al. 2019. A 1.33-Tb 4-Bit/Cell 3-D Flash Memory on a 96-Word-Line-Layer Technology. *JSSC* (2019).
- [262] Ji-Yong Shin, Zeng-Lin Xia, Ning-Yi Xu, Rui Gao, Xiong-Fei Cai, Seungryoul Maeng, and Feng-Hsiung Hsu. 2009. FTL Design Exploration in Reconfigurable High-Performance SSD for Server Applications. In *ICS*.
- [263] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-Lm: Training multi-billion parameter language models using model parallelism. *arXiv preprint arXiv:1909.08053* (2019).
- [264] Harsha Simhadri et al. 2023. Big ANN Benchmarks. <https://github.com/harsha-simhadri/big-ann-benchmarks/tree/main>
- [265] Harsha Vardhan Simhadri, George Williams, Martin Aumüller, Matthijs Douze, Artem Babenko, Dmitry Baranchuk, Qi Chen, Lucas Hosseini, Ravishankar Krishnaswamy, Gopal Srinivasa, et al. 2022. Results of the NeurIPS'21 challenge on billion-scale approximate nearest neighbor search. In *NeurIPS 2021 Competitions and Demonstrations Track*. PMLR, 177–189.
- [266] Kang-Deog Suh, Byung-Hoon Suh, Young-Ho Lim, Jin-Ki Kim, Young-Joon Choi, Yong-Nam Koh, Sung-Soo Lee, Suk-Chon Kwon, Byung-Soon Choi, Jin-Sun Yum, Jung-Hyuk Choi, Jang-Rae Kim, and Hyung-Kyu Lim. 1995. A 3.3 V 32 Mb NAND Flash Memory with Incremental Step Pulse Programming Scheme. *JSSC* (1995).
- [267] Jinghan Sun, Shaobo Li, Yunxin Sun, Chao Sun, Dejan Vucinic, and Jian Huang. 2023. LeafTL: A Learning-Based Flash Translation Layer for Solid-State Drives. In *ASPLOS*.
- [268] Philip Sun. 2020. Announcing ScaNN: Efficient Vector Similarity Search. <https://research.google/blog/announcing-scann-efficient-vector-similarity-search/>
- [269] Weiyi Sun, Mingyu Gao, Zhaoshi Li, Aoyang Zhang, Iris Ying Chou, Jianfeng Zhu, Shaojun Wei, and Leibo Liu. 2025. Lincoln: Real-Time 50~100B LLM Inference on Consumer Devices with LPDDR-Interfaced, Compute-Enabled Flash Memory. In *2025 IEEE International Symposium on High Performance Computer Architecture (HPCA)*. IEEE, 1734–1750.
- [270] Arash Tavakkol, Juan Gómez-Luna, Mohammad Sadrosadati, Saugata Ghose, and Onur Mutlu. 2018. MQSim: A Framework for Enabling Realistic Studies of Modern Multi-Queue SSD Devices. In *FAST*.
- [271] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. 2023. Gemini: a family of highly capable multimodal models. *arXiv preprint arXiv:2312.11805* (2023).
- [272] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. *arXiv preprint arXiv:2403.05530* (2024).
- [273] Nandan Thakur, Nils Reimers, and Jimmy Lin. 2023. Injecting Domain Adaptation with Learning-to-hash for Effective and Efficient Zero-shot Dense Retrieval. *arXiv preprint arXiv:2205.11498* (2023).
- [274] Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models. *arXiv preprint arXiv:2104.08663* (2021).
- [275] Ravi Theja. 2023. Evaluating the Ideal Chunk Size for a RAG System using LlamaIndex. <https://www.llamaindex.ai/blog/evaluating-the-ideal-chunk-size-for-a-rag-system-using-llamaindex-6207e5d3fec5>
- [276] James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and VERification. *arXiv preprint arXiv:1803.05355* (2018).
- [277] Bing Tian, Haikun Liu, Zhuohui Duan, Xiaofei Liao, Hai Jin, and Yu Zhang. 2024. Scalable Billion-point Approximate Nearest Neighbor Search Using SmartSSDs. In *2024 USENIX Annual Technical Conference (USENIX ATC 24)*. 1135–1150.
- [278] Ajay Tirumala and Raymond Wong. 2024. Nvidia blackwell platform: Advancing generative ai and accelerated computing. In *2024 IEEE Hot Chips 36 Symposium (HCS)*. IEEE Computer Society, 1–33.
- [279] Ajay Tirumala and Raymond Wong. 2024. NVIDIA Blackwell Platform: Advancing Generative AI and Accelerated Computing. In *2024 IEEE Hot Chips 36 Symposium (HCS)*. IEEE Computer Society, 1–33.
- [280] Devesh Tiwari, Simona Boboila, Sudharshan Vazhkudai, Youngjae Kim, Xiaosong Ma, Peter Desnoyers, and Yan Solihin. 2013. Active Flash: Towards Energy-Efficient, In-Situ Data Analytics on Extreme-Scale Machines. In *FAST*.
- [281] Devesh Tiwari, Sudharshan S Vazhkudai, Youngjae Kim, Xiaosong Ma, Simona Boboila, and Peter J Desnoyers. 2012. Reducing Data Movement Costs Using Energy-Efficient, Active Computation on SSD. In *HotPower*.
- [282] Mahdi Torabzadehkashi, Siavash Rezaei, Vladimir Alves, and Nader Bagherzadeh. 2018. CompStor: An In-Storage Computation Platform for Scalable Distributed Processing. In *IPDPSW*.
- [283] Mahdi Torabzadehkashi, Siavash Rezaei, Ali Heydarigori, Hosein Bobarshad, Vladimir Alves, and Nader Bagherzadeh. 2019. Catalina: In-Storage Processing Acceleration for Scalable Big Data Analytics. In *PDP*.
- [284] Toshiba. 2012. NAND Memory Toggle DDR1.0 Technical Data Sheet.
- [285] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaee, Nikolay Bashlykov, Soumya Batra, Prajwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. *arXiv preprint arXiv:2307.09288* (2023).
- [286] Shivani Tripathy and Manoranjan Satpathy. 2022. SSD internal cache management policies: A survey. *Journal of Systems Architecture* 122 (2022), 102334.
- [287] Ozan Unlu, Jiyeon Shin, Charlotte J Mailly, Michael F Oates, Michela R Tucci, Matthew Varughese, Kavishwar Wagholikar, Fei Wang, Benjamin M Scirica, Alexander J Blood, et al. 2024. Retrieval-Augmented Generation-Enabled GPT-4 for Clinical Trial Screening. *NEJM AI* (2024), AIoa2400181.
- [288] Thomas Vecchiato, Claudio Lucchese, Franco Maria Nardini, and Sebastian Bruch. 2024. A Learning-to-Rank Formulation of Clustering-Based Approximate Nearest Neighbor Search. In *Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval*. 2261–2265.
- [289] Boxiang Wang, Qifan Xu, Zhengda Bian, and Yang You. 2022. Tesseract: Parallelize the tensor parallelism efficiently. In *Proceedings of the 51st International Conference on Parallel Processing*. 1–11.
- [290] Haoyu Wang, Ruirui Li, Haoming Jiang, Jinjin Tian, Zhengyang Wang, Chen Luo, Xianfeng Tang, Monica Cheng, Tuo Zhao, and Jing Gao. 2024. Blendfilter: Advancing retrieval-augmented large language models via query generation blending and knowledge filtering. *arXiv preprint arXiv:2402.11129* (2024).
- [291] Jianguo Wang, Dongchul Park, Yang-Suk Kee, Yannis Papakonstantinou, and Steven Swanson. 2016. SSD In-Storage Computing for List Intersection. In *DaMoN*.
- [292] Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text embeddings by weakly-supervised contrastive pre-training. *arXiv preprint arXiv:2212.03533* (2022).
- [293] Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2023. Improving text embeddings with large language models. *arXiv preprint arXiv:2401.00368* (2023).
- [294] Liang Wang, Nan Yang, and Furu Wei. 2023. Query2doc: Query expansion with large language models. *arXiv preprint arXiv:2303.07678* (2023).
- [295] Mengzhao Wang, Xiaoliang Xu, Qiang Yue, and Yuxiang Wang. 2021. A comprehensive survey and experimental comparison of graph-based approximate nearest neighbor search. *arXiv preprint arXiv:2101.12631* (2021).
- [296] Shengwang Wang, Zihang Lin, Suzhen Wu, Hong Jiang, Jie Zhang, and Bo Mao. 2024. LearnedFTL: A Learning-Based Page-Level FTL for Reducing Double Reads in Flash-Based SSDs. In *2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA)*. IEEE, 616–629.
- [297] Weishi Wang, Yue Wang, Shafiq Joty, and Steven CH Hoi. 2023. Rap-gen: Retrieval-augmented patch generation with codet5 for automatic program repair. In *Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering*. 146–158.
- [298] Xiaohao Wang, Yifan Yuan, You Zhou, Chance C Coats, and Jian Huang. 2019. Project Almanac: A Time-Traveling Solid-State Drive. In *EuroSys*.
- [299] Yitu Wang, Shiyu Li, Qilin Zheng, Linghao Song, Zongwang Li, Andrew Chang, Hai “Helen” Li, and Yiran Chen. 2024. NDSEARCH: Accelerating Graph-Traversal-Based Approximate Nearest Neighbor Search through Near Data Processing. In *Proceedings of the 51st Annual International Symposium on Computer Architecture*.
- [300] Yuyue Wang, Xiurui Pan, Yuda An, Jie Zhang, and Glenn Reinman. 2024. Beacon-GNN: Large-Scale GNN Acceleration with Out-of-Order Streaming In-Storage Computing. In *2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA)*. IEEE, 330–344.
- [301] Zhiruo Wang, Jun Araki, Zhengbao Jiang, Md Rizwan Parvez, and Graham Neubig. 2023. Learning to filter context for retrieval-augmented generation. *arXiv preprint arXiv:2311.08377* (2023).
- [302] Zichao Wang, Weili Nie, Zhuran Qiao, Chaowei Xiao, Richard Baraniuk, and Anima Anandkumar. 2022. Retrieval-based controllable molecule generation. *arXiv preprint arXiv:2208.11126* (2022).
- [303] Zheng Wang, Shu Xian Teo, Jieer Ouyang, Yongjun Xu, and Wei Shi. 2024. M-RAG: Reinforcing Large Language Model Performance through Retrieval-Augmented Generation with Multiple Partitions. *arXiv preprint arXiv:2405.16420* (2024).
- [304] Nirmalie Wiratunga, Ramitha Abeyratne, Lasal Jayawardena, Kyle Martin, Stewart Massie, Ikechukwu Nkisi-Orji, Ruvan Weerasinghe, Anne Liret, and Bruno Fleisch. 2024. CBR-RAG: case-based reasoning for retrieval augmented generation in LLMs for legal question answering. In *International Conference on Case-Based Reasoning*. Springer, 445–460.
- [305] Guanying Wu and Xubin He. 2012. Reducing SSD Read Latency via NAND Flash Program and Erase Suspension. In *FAST*.
- [306] Ming-Chuan Wu and Alejandro P Buchmann. 1998. Encoded Bitmap Indexing for Data Warehouses. In *ICDE*.
- [307] Suzhen Wu, Yanping Lin, Bo Mao, and Hong Jiang. 2016. GCaR: Garbage Collection aware Cache Management with Improved Performance for Flash-based SSDs. In *ICS*.
- [308] Chunhua Xiao, Shi Qiu, and Dandan Xu. 2022. PASM: Parallelism Aware Space Management strategy for hybrid SSD towards in-storage DNN training acceleration. *Journal of Systems Architecture* 128 (2022), 102565.
- [309] Guangzhi Xiong, Qiao Jin, Zhiyong Lu, and Aidong Zhang. 2024. Benchmarking retrieval-augmented generation for medicine. *arXiv preprint arXiv:2402.13178* (2024).
- [310] Weihong Xu, Junwei Chen, Po-Kai Hsu, Jaeyoung Kang, Minxuan Zhou, Sumukh Ping, Shimeng Yu, and Tajana Rosing. 2023. Proxima: Near-storage Acceleration for Graph-based Approximate Nearest Neighbor Search in 3D NAND. *arXiv preprint arXiv:2312.04257* (2023).
- [311] Jinlong Xue, Yayue Deng, Yingming Gao, and Ya Li. 2024. Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining. *arXiv preprint arXiv:2406.03714* (2024).
- [312] Ikuya Yamada, Akari Asai, and Hannaneh Hajishirzi. 2021. Efficient passage retrieval with hashing for open-domain question answering. *arXiv preprint arXiv:2106.00882* (2021).
- [313] Shiqin Yan, Huaicheng Li, Mingzhe Hao, Michael Hao Tong, Swaminathan Sundararaman, Andrew A Chien, and Haryadi S Gunawi. 2017. Tiny-tail flash: Near-perfect elimination of garbage collection tail latencies in NAND SSDs. *ACM Transactions on Storage (TOS)* 13, 3 (2017), 1–26.
- [314] Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, and Zhen-Hua Ling. 2024. Corrective retrieval augmented generation. *arXiv preprint arXiv:2401.15884* (2024).
- [315] Antoine Yang, Arsha Nagrani, Paul Hongseok Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, and Cordelia Schmid. 2023. Vid2seq: Large-scale pretraining of a visual language model for dense video captioning. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*. 10714–10726.
- [316] Ming-Chang Yang, Yu-Ming Chang, Che-Wei Tsao, Po-Chun Huang, Yuan-Hao Chang, and Tei-Wei Kuo. 2014. Garbage Collection and Wear Leveling for Flash Memory: Past and Future. In *SMARTCOMP*.
- [317] Pan Yang, Ni Xue, Yuqi Zhang, Yangxu Zhou, Li Sun, Wenwen Chen, Zhonggang Chen, Wei Xia, Junke Li, and Kihyoun Kwon. 2019. Reducing garbage collection overhead in SSD based on workload prediction. In *11th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 19)*.
- [318] Sophia Yang. 2023. Advanced RAG 01: Small-to-Big Retrieval. <https://towardsdatascience.com/advanced-rag-01-small-to-big-retrieval-172181b396d4>
- [319] Yongpeng Yang, Dejun Jiang, Bo Jiang, Hao-Chiang Hsu, Liang Peng, and Zifeng Yang. 2024. LBZ: A Lightweight Block Device for Supporting F2FS on ZNS SSD. In *2024 IEEE 42nd International Conference on Computer Design (ICCD)*. IEEE, 340–347.
- [320] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. *arXiv preprint arXiv:1809.09600* (2018).
- [321] Yingbiao Yao, Jinlong Fan, Jie Zhou, Xiaochong Kong, and Nenghua Gu. 2021. HDFTL: An on-demand flash translation layer algorithm for hybrid solid state drives. *IEEE Transactions on Consumer Electronics* 67, 1 (2021), 50–57.
- [322] Antonio Jimeno Yepes, Yao You, Jan Milczek, Sebastian Laverde, and Leah Li. 2024. Financial Report Chunking for Effective Retrieval Augmented Generation. *arXiv preprint arXiv:2402.05131* (2024).
- [323] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A distributed serving system for Transformer-Based generative models. In *16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)*. 521–538.
- [324] Hao Yu, Aoran Gan, Kai Zhang, Shiwei Tong, Qi Liu, and Zhaofeng Liu. 2024. Evaluation of Retrieval-Augmented Generation: A Survey. *arXiv preprint arXiv:2405.07437* (2024).
- [325] Cyril Zakka, Rohan Shad, Akash Chaurasia, Alex R Dalal, Jennifer L Kim, Michael Moor, Robyn Fong, Curran Phillips, Kevin Alexander, Euan Ashley, et al. 2024. Almanac—retrieval-augmented language models for clinical medicine. *NEJM AI* 1, 2 (2024), AIoa2300068.
- [326] Shenglai Zeng, Jiankun Zhang, Pengfei He, Yue Xing, Yiding Liu, Han Xu, Jie Ren, Shuaiqiang Wang, Dawei Yin, Yi Chang, et al. 2024. The good and the bad: Exploring privacy issues in retrieval-augmented generation (rag). *arXiv preprint arXiv:2402.16893* (2024).
- [327] Shulin Zeng, Zhenhua Zhu, Jun Liu, Haoyu Zhang, Guohao Dai, Zixuan Zhou, Shuangchen Li, Xuefei Ning, Yuan Xie, Huazhong Yang, et al. 2023. DF-GAS: a Distributed FPGA-as-a-Service Architecture towards Billion-Scale Graph-based Approximate Nearest Neighbor Search. In *Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture*. 283–296.
- [328] Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, and Shaoping Ma. 2022. Learning discrete representations via constrained clustering for effective and efficient dense retrieval. In *Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining*. 1328–1336.
- [329] Boyu Zhang, Hongyang Yang, Tianyu Zhou, Muhammad Ali Babar, and Xiaoyang Liu. 2023. Enhancing financial sentiment analysis via retrieval augmented large language models. In *Proceedings of the fourth ACM international conference on AI in finance*. 349–356.
- [330] Jianjin Zhang, Zheng Liu, Weihao Han, Shitao Xiao, Ruicheng Zheng, Yingxia Shao, Hao Sun, Hanqing Zhu, Premkumar Srinivasan, Weiwei Deng, et al. 2022. Uni-retriever: Towards learning the unified embedding based retriever in bing sponsored search. In *Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining*. 4493–4501.
- [331] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuhui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. Opt: Open pre-trained transformer language models. *arXiv preprint arXiv:2205.01068* (2022).
- [332] Wenhui Zhang, Qiang Cao, Hong Jiang, Jie Yao, Yuanyuan Dong, and Puyuan Yang. 2019. SPA-SSD: Exploit heterogeneity and parallelism of 3D SLC-TLC hybrid SSD to improve write performance. In *2019 IEEE 37th International Conference on Computer Design (ICCD)*. IEEE, 613–621.
- [333] Yanhao Zhang, Pan Pan, Yun Zheng, Kang Zhao, Yingya Zhang, Xiaofeng Ren, and Rong Jin. 2018. Visual search at alibaba. In *Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining*. 993–1001.
- [334] Penghao Zhao, Hailin Zhang, Qinhua Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, and Bin Cui. 2024. Retrieval-augmented generation for ai-generated content: A survey. *arXiv preprint arXiv:2402.19473* (2024).
- [335] Wayne Xin Zhao, Jing Liu, Ruiyang Ren, and Ji-Rong Wen. 2024. Dense text retrieval based on pretrained language models: A survey. *ACM Transactions on Information Systems* 42, 4 (2024), 1–60.
- [336] Yiyun Zhao, Prateek Singh, Hanoz Bhathena, Bernardo Ramos, Aviral Joshi, Swaroop Gadiyaram, and Saket Sharma. 2024. Optimizing LLM Based Retrieval Augmented Generation Pipelines in the Financial Domain. In *Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track)*. 279–294.
- [337] Zexuan Zhong, Tao Lei, and Danqi Chen. 2022. Training language models with memory augmentation. *arXiv preprint arXiv:2205.12674* (2022).
- [338] Junjie Zhou, Zheng Liu, Shitao Xiao, Bo Zhao, and Yongping Xiong. 2024. VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval. *arXiv preprint arXiv:2406.04292* (2024).
- [339] Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Quoc V Le, James Laudon, et al. 2022. Mixture-of-experts with expert choice routing. *Advances in Neural Information Processing Systems* 35 (2022), 7103–7114.
- [340] You Zhou, Fei Wu, Ping Huang, Xubin He, Changsheng Xie, and Jian Zhou. 2015. An Efficient Page-level FTL to Optimize Address Translation in Flash Memory. In *EuroSys*.
- [341] You Zhou, Qiulin Wu, Fei Wu, Hong Jiang, Jian Zhou, and Changsheng Xie. 2021. Remap-SSD: Safely and Efficiently Exploiting SSD Address Remapping to Eliminate Duplicate Writes. In *19th USENIX Conference on File and Storage Technologies (FAST 21)*. 187–202.
- [342] Zhenhua Zhu, Jun Liu, Guohao Dai, Shulin Zeng, Bing Li, Huazhong Yang, and Yu Wang. 2023. Processing-In-Hierarchical-Memory Architecture for Billion-Scale Approximate Nearest Neighbor Search. In *2023 60th ACM/IEEE Design Automation Conference (DAC)*. IEEE, 1–6.
- [343] Justin Zobel and Alistair Moffat. 2006. Inverted files for text search engines. *ACM computing surveys (CSUR)* 38, 2 (2006), 6–es.
- [344] Mohammadreza Zolfaghari, Yi Zhu, Peter Gehler, and Thomas Brox. 2021. Cross-clr: Cross-modal contrastive learning for multi-modal video representations. In *Proceedings of the IEEE/CVF International Conference on Computer Vision*. 1450–1459.