

# MARS: Processing-In-Memory Acceleration of Raw Signal Genome Analysis Inside the Storage Subsystem

Melina Soysal<sup>†</sup> Konstantina Koliogeorgi<sup>†</sup> Can Firtina<sup>†</sup> Nika Mansouri Ghiasi<sup>†</sup>  
 Rakesh Nadig<sup>†</sup> Haiyu Mao<sup>\*</sup> Geraldo F. Oliveira<sup>†</sup>  
 Yu Liang<sup>†</sup> Klea Zambaku<sup>†</sup> Mohammad Sadrosadati<sup>†</sup> Onur Mutlu<sup>†</sup>

<sup>†</sup> ETH Zürich      <sup>\*</sup> King's College London

Conventional genome analysis relies on translating the noisy raw electrical signals generated by DNA sequencing technologies into nucleotide bases (i.e., A, C, G, and T) through a computationally-intensive process called basecalling. Raw signal genome analysis (RSGA) has emerged as a promising approach towards enabling real-time genome analysis by directly analyzing raw electrical signals without the need for basecalling. However, rapid advancements in sequencing technologies make it increasingly difficult for software-based RSGA to match the throughput of raw signal generation. Hardware-based RSGA acceleration has the potential to bridge the gap between software-based RSGA and sequencing throughput.

This paper demonstrates that while (i) conventional hardware acceleration techniques (e.g., specialized ASICs) in tandem with (ii) memory-centric approaches (e.g., Processing-In-Memory) can significantly accelerate RSGA, the high volume of genomic data greatly shifts the performance and energy bottleneck from computation to I/O data movement. As sequencing throughput increases, I/O overhead becomes the dominant contributor to both runtime and energy consumption, limiting the scalability of both processor-centric and main-memory-centric accelerators. Therefore, there is a pressing need to design a high-performance, energy-efficient system for RSGA that can both alleviate the data movement bottleneck and provide large acceleration capabilities.

We propose MARS, a storage-centric system that leverages the heterogeneous resources available within modern storage systems (e.g., storage-internal DRAM, storage controller, flash chips) alongside their large storage capacity to tackle both data movement and computational overheads of RSGA in an area-efficient and low-cost manner. MARS accelerates RSGA through a novel hardware/software co-design approach using three major techniques. First, MARS modifies the RSGA pipeline via a previously unexplored combination of two filtering mechanisms and a quantization scheme, reducing hardware demands and optimizing for in-storage execution. Second, MARS accelerates the modified RSGA steps directly within the storage device by leveraging both Processing-Near-Memory and Processing-Using-Memory paradigms, tailored to the internal architecture of the storage system. Third, MARS orchestrates the execution of all steps via a streamlined control and data flow to fully exploit in-storage parallelism and minimize data movement. Our evaluation shows that MARS outperforms basecalling-based software and hardware-accelerated state-of-the-art read mapping pipelines by 93× and 40×, respectively, on average across different datasets, while reducing their energy consumption by 427× and 72×. MARS improves the performance of the state-of-the-art RSGA-based read mapping pipeline by 28× while reducing its energy consumption by 180× on average across different datasets.

## 1. Introduction

Identifying and analyzing an organism’s DNA sequence, i.e., genome analysis, has led to important advances in areas such as personalized medicine [1–3], outbreak tracing [4, 5], and evolutionary biology [6–8]. Genome sequencing is the experimental process of determining the nucleotide sequence of an organism’s DNA. As current technologies cannot generate a single long sequence for an entire genome, DNA is first fragmented into short sequences, called *reads*, which serve as input to computational analyses [9–18] to reconstruct the genome and extract biological insights. The analysis typically starts with *mapping* reads to a known *reference genome* [19, 20], followed by identifying mutations and other genetic variations [10–18, 21] during downstream analyses.

Nanopore sequencing technology [11, 22–27] enables DNA sequencing by passing DNA strands through nano-scale pores, known as *nanopores*, and measuring the resulting fluctuations in electrical current. These current fluctuations, referred to as *raw signals*, correspond to distinct sequences of DNA nucleotides and form the basis for downstream analyses. The small dimensions of the nanopores enable sequencing in compact devices [27], paving the way for portable, scalable, and low-cost [28] sequencing for a wide range of applications, including outbreak tracing and disease diagnosis [29, 30]. Nanopore sequencers’ rapid adoption is further driven by their unique capability of early termination of sequencing when further data is no longer needed [31, 32], **reducing the sequencing time and cost** and enabling **real-time** analysis [10, 33].

Typical genome analysis pipelines first translate noisy raw electrical signals into sequences of nucleobase characters through a process called basecalling [14, 23, 34, 35]. Subsequent downstream analyses are then performed on these text-based sequences. However, basecalling is computationally intensive and represents a major bottleneck for real-time analysis, as it relies heavily on sophisticated deep learning models [11, 35–37]. Given the increasing demand for real-time processing, there is a pressing need for developing fundamentally new algorithmic approaches to keep up with the rapid advances in nanopore sequencing in terms of performance, energy consumption, and cost [10, 17, 36, 38, 39].

**Raw signal genome analysis (RSGA)** [10–13, 17, 32, 33, 36, 39–44] has been proposed as a new paradigm that bypasses traditional basecalling by operating *directly* on raw electrical signals. Instead of translating signals into nucleotide sequences, RSGA analyzes the raw signals themselves to perform genomic tasks such as read mapping and variant detection. RSGA can complement basecalling by serving as a lightweight pre-basecalling filter [45] to reduce redundant basecalling operations or even replace basecalling entirely by directly analyzing raw signals **in real-time**, without translating them to nucleotide sequences first [10, 17, 36, 39]. RSGA can lead to more comprehensive [12, 17, 46] genome analysis as it preserves richer sequencing information in the raw signals [17, 47–52]. These key benefits have fueled rapid research progress in the field of RSGA [10–13, 17, 32, 33, 36, 39–44], opening new directions such as direct alignment [17, 36, 53] and *de novo* assembly [40] on raw signals.

As advancements in sequencing technologies continue at a rapid pace, scalability challenges arise, placing increasing pressure on software-based RSGA to match the throughput of raw signal generation and meet the real-time requirements. To bridge the widening gap between sequencing throughput and downstream analysis, hardware acceleration is required to either process larger data volumes with the same computational resources or reduce execution time and energy consumption. Research efforts target the computational bottlenecks of RSGA by using GPUs (e.g., [13, 54–57]) or co-designing algorithms with specialized hardware architectures and ASICs (e.g., [33, 36, 58–60]). While these approaches effectively reduce computational overhead, they largely overlook the impact of I/O data movement from the storage subsystem on the *end-to-end* RSGA pipeline. Our motivational analysis (§3) shows that as the computational bottlenecks of RSGA are accelerated, the contribution of I/O becomes dominant and ultimately emerges as the primary bottleneck in the end-to-end analysis. For instance, as the accelerator speedups increase, the adverse impact of the storage subsystem dominates the accelerated end-to-end execution latency, reaching up to 78% of total execution time for large genomes (see §3.2). This motivational study highlights the need for an architecture for RSGA that (i) alleviates the large data movement overhead, (ii) accelerates the computational steps of RSGA, and (iii) scales to the large volumes of genomic datasets.

Our goal in this work is to design a high-performance, energy-efficient, and scalable system for RSGA by effectively addressing *both* the data movement and computation overheads of the end-to-end RSGA pipeline for read mapping. Our key idea is to design a *storage-centric* system that leverages the *heterogeneous compute-capable resources* (e.g., SSD internal DRAM, SSD controller), alongside the large storage capacity available within modern storage systems to alleviate I/O data movement and computational bottlenecks within the RSGA pipeline in an area-efficient and low-cost manner. To this end, we propose **MARS** (Processing-In-**M**emory **A**cceleration of **R**aw **S**ignal Genome Analysis Inside the Storage Subsystem), the first In-Storage-Processing (ISP) design combining Processing-Using-DRAM and Processing-Near-DRAM **within** a storage system.

**Challenges.** Despite ISP’s promising potential, designing a storage-centric system for RSGA presents several key challenges. First, RSGA steps (e.g., event detection, seeding and chaining) exhibit high memory demands and irregular data access patterns. In contrast, SSDs lack architectural support for fine-grained (i.e., small-size) memory operations and are optimized for sequential access to fully utilize the high flash memory channel bandwidth inside the SSD. Second, exploiting heterogeneous resources and computation capabilities within the storage system introduces a complex design space and a rich set of tuning parameters. Third, deploying the end-to-end RSGA pipeline consisting of multiple steps inside the storage system creates contention over shared resources, requiring careful coordination and isolation. Addressing these challenges necessitates a carefully-constructed design to ensure a synergistic and efficient orchestration of the available in-storage resources.

We address these challenges through a novel hardware/software co-design approach that modifies and enables RSGA computational primitives to leverage in-storage execution capabilities, while carefully taking into account storage system constraints. First, we propose two software modifications: (1) a novel combination of two filtering mechanisms [15, 39, 61, 62] that selectively remove redundant or low-quality candidate matches between the input and reference genomes early in the RSGA pipeline, reducing both computational workload and intermediate data storage requirements, and (2) an arithmetic conversion scheme that reduces the precision of intermediate signal representations to lower storage and computation overheads, carefully placed in the RSGA pipeline to preserve accuracy. Second, we augment the storage system’s functionality to support the RSGA pipeline by placing accelerators for individual steps in different parts of the storage subsystem, leveraging different ‘*Processing-In-Memory*’ paradigms: (i) inside the memory array of the storage-internal DRAM through the ‘*Processing-Using-DRAM*’ approach, (ii) near the subarrays of the storage-internal DRAM using the ‘*Processing-Near-DRAM*’ approach, and (iii) inside the storage controller via the ‘*Processing-Near-DRAM*’ approach, which operates on data fetched from the storage-internal DRAM. MARS orchestrates these individual components through a unified control and data flow that minimizes data movement and efficiently exploits the available bandwidth between them.

We evaluate MARS-based read mapping in terms of accuracy, latency and energy consumption across five diverse genomic input datasets from different species. We compare our design against four state-of-the-art software and hardware baselines using both RSGA and basecalling-based approaches and make four major observations. First, MARS outperforms the state-of-the-art CPU-based RSGA implementation for read mapping [39] by $28\times$, on average, across all datasets while improving the energy consumption by $180\times$ on average. Second, MARS provides an average speedup of $93\times$ over a hybrid CPU/GPU-accelerated basecalling-based pipeline [15, 63], while improving energy consumption by $427\times$ on average. Third, MARS is superior to GenPIP [37], a state-of-the-art Processing-In-Memory-based read mapping system relying on basecalling, achieving a speedup of $40\times$ and energy savings of $72\times$ on average across all five datasets. Fourth, MARS provides analysis accuracy *on par* with the conventional basecalling-based pipeline.

This work makes the following **key contributions**:

- It is the first work to demonstrate the I/O bottleneck of hardware-accelerated Raw Signal Genome Analysis (RSGA) and propose In-Storage-Processing of RSGA.
- We propose MARS, the *first* In-Storage-Processing system for RSGA, which mitigates both I/O data movement and computational overheads through a tightly integrated hardware/software co-design.
- To our knowledge, MARS is the first architecture to integrate *multiple Processing-In-Memory paradigms* within the storage system. We implement accelerators *inside* the SSD's DRAM, *near the subarrays* of the SSD's DRAM as well as *inside* the SSD controller leveraging both Processing-Using-DRAM and Processing-Near-DRAM paradigms to efficiently enable diverse RSGA computation primitives.
- We extensively compare MARS to state-of-the-art software and hardware baselines that use both RSGA and basecalling. We show that MARS improves performance over software and hardware-accelerated state-of-the-art read mapping pipelines by a factor of $93\times$ and $40\times$, respectively, while reducing their energy consumption by $427\times$ and $72\times$ on average across five real-world datasets.

## 2. Background

### 2.1. Genome Analysis

**Raw Signal Genome Analysis.** Nanopore sequencing can sequence relatively long fragments of DNA [11, 22–27], called *reads*, by measuring the electrical current changes caused when a DNA fragment traverses a tiny pore, called a *nanopore*. The generated sequence data [11, 22–27], referred to as *raw signals*, are then used in downstream analysis, e.g., for read mapping [10–13, 33, 39] and alignment [17, 36, 53] purposes. In the conventional genome analysis approach, raw signals are first translated into sequences of nucleobase characters (i.e., A, C, G, T) during the basecalling process [14, 20, 45, 63, 64], and then mapped to a reference genome to find similarities and differences [14, 34, 35, 65–67]. In contrast, RSGA eliminates the need for basecalling by directly operating on raw signals [10–13, 17, 32, 33, 36, 39–42].

RSGA requires comparing sequences from the reference genome with sequences derived from each input query, i.e., the raw electrical signals generated by the sequencer for a given DNA sample. To enable this comparison, both reference subsequences and raw signals are converted into *events*, i.e., a series of values corresponding to genomic subsequences of certain length. These event sequences are then passed through a quantization step that accounts for sequencing noise and enables robust signal-domain comparisons between reference and input query. A typical state-of-the-art RSGA pipeline for read mapping, illustrated in Fig. 1, consists of two main stages: (A) Indexing and (B) Mapping.



Figure 1: Overview of a typical RSGA read mapping workflow based on a hash-table for indexing.

(A) **Indexing (offline):** The reference genome is converted into events through reference-to-event conversion and quantization. These events are then stored in an efficient data structure, e.g., a hash table, to enable fast lookup of matching signal patterns. (B) **Mapping (online):** This stage maps raw signals to the reference genome using the previously constructed index. The first step ① in mapping is *event detection*, which performs the signal-to-event conversion of raw signals and applies quantization. In the second step ②, called *seeding*, consecutive events are grouped to generate hash values which represent signal segments known as *seeds* and are used to query the reference index. Matching entries, referred to as *seed hits*, represent candidate matches between the input and reference. During the last step ③, called *chaining*, seeds are sorted based on their positions in the reference genome. Seeds that are both spatially close and colinear, i.e., those that maintain consistent relative positions in the reference and input query, are grouped into *anchors*, which form the basis for constructing *chains* representing high-confidence matching regions between the query and the reference genome.
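The indexing and seeding steps described above can be illustrated with a minimal sketch. It assumes events are already quantized to small integers and that a seed is a run of `k` consecutive events; the function names, the toy event values, and the use of a Python dictionary as the hash table are illustrative assumptions, not the implementation used by RSGA tools.

```python
from collections import defaultdict

def build_index(ref_events, k):
    """Indexing (offline): map every run of k consecutive reference events to its positions."""
    index = defaultdict(list)
    for pos in range(len(ref_events) - k + 1):
        # The tuple serves as the seed key; the dict hashes it internally.
        index[tuple(ref_events[pos:pos + k])].append(pos)
    return index

def find_seed_hits(query_events, index, k):
    """Seeding (online): return (query_pos, ref_pos) pairs for seeds present in the index."""
    hits = []
    for qpos in range(len(query_events) - k + 1):
        seed = tuple(query_events[qpos:qpos + k])
        for rpos in index.get(seed, []):
            hits.append((qpos, rpos))
    return hits

# Toy quantized event sequences (made-up values).
ref_events = [3, 1, 4, 1, 5, 9, 2, 6]
query_events = [1, 5, 9, 2]
index = build_index(ref_events, k=3)
hits = find_seed_hits(query_events, index, k=3)
print(hits)  # [(0, 3), (1, 4)]
```

Both seed hits in this toy example share the same offset between reference and query position (3), which is the kind of colinearity that chaining looks for when grouping seed hits.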

**Filtering Techniques.** Filtering techniques [15, 21, 62, 68–79] are extensively used in genome analysis pipelines to reduce the need for costly alignment operations by eliminating unlikely candidate matches early during the read mapping process. One popular filtering approach, adopted in both conventional and RSGA approaches, is *frequency filtering* [15, 39]. The goal of frequency filtering is to identify and eliminate the seeds that cause a large number of seed hits in the reference genome. These frequent seed hits usually appear due to repetitions in the genome or hash collisions, which can cause ambiguity [80] in read mapping and increase the computational cost of the subsequent steps [20], such as chaining. To eliminate these issues, these seed hits are *not* considered in the subsequent chaining stage, effectively reducing the computational load. A dataset-specific value defines the threshold for filtering out such frequent matches. Another promising method is the *seed-and-vote* filtering technique [61, 62] that has been applied in conventional basecalling-based pipelines to discard anchors that are unlikely to generate valid alignments. As shown in Fig. 2, the reference genome is partitioned into overlapping, equal-length windows $W_i$. Each anchor votes for the window(s) it appears in (see orange X's in Fig. 2). We define the *voting threshold* as the minimum number of votes a window must receive to be considered likely to contain a correct alignment. A region whose vote count falls below this predefined threshold is excluded from further analysis, reducing the computational load of chaining.



Figure 2: Overview of the seed-and-vote filtering technique for a threshold value of 5.
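A minimal sketch of seed-and-vote filtering over overlapping windows follows. It assumes that windows overlap by half their length (the `stride` default) and that a window is kept when its vote count is at least the threshold; both choices, as well as the function name and the toy anchor positions, are assumptions made for illustration.

```python
from collections import Counter

def seed_and_vote(anchor_positions, window_len, thresh_voting, stride=None):
    """Return start offsets of reference windows whose vote count reaches thresh_voting.

    Windows start at multiples of `stride` and span `window_len` positions.
    Assumption: stride defaults to half the window length (50% overlap).
    """
    if stride is None:
        stride = window_len // 2
    votes = Counter()
    for pos in anchor_positions:
        # Walk backwards over every window that still covers this anchor position.
        i = pos // stride
        while i >= 0 and i * stride + window_len > pos:
            votes[i * stride] += 1
            i -= 1
    return sorted(start for start, n in votes.items() if n >= thresh_voting)

# Toy anchor positions on the reference (made-up values).
anchors = [10, 20, 30, 40, 45, 120, 400]
kept = seed_and_vote(anchors, window_len=100, thresh_voting=5)
print(kept)  # [0]
```

With a threshold of 5, as in Fig. 2, only the window starting at offset 0 collects enough votes; the isolated anchors at 120 and 400 each cast a single vote per window and their windows are discarded.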

### 2.2. SSD Architecture

Fig. 3 depicts the architecture of a typical modern NAND flash-memory-based Solid State Drive (SSD) [81], which consists of three main components: (1) an array of NAND flash chips, (2) SSD controller, and (3) DRAM.



Figure 3: Organizational overview of a modern SSD.

**NAND Flash Memory.** NAND flash memory consists of multiple flash chips [82, 83], which are connected to the SSD controller via multiple parallel flash channels. Each flash chip typically contains one or more independent dies. Each die has multiple (e.g., 2 or 4) planes and each plane contains thousands of blocks. A block includes hundreds to thousands of pages, each of which is 4–16 KiB in size.

**SSD Controller.** The SSD controller [81, 82, 84] consists of two primary components: (1) multiple general-purpose cores running the SSD firmware, i.e., the *flash translation layer (FTL)*, and (2) per-channel hardware *flash controllers*. The FTL manages communication with the host system, maintains logical-to-physical (L2P) address mappings for read operations, handles internal I/O scheduling, and performs various SSD management functions to hide the complexities of NAND flash memory from the host processor. Flash controllers handle (i) requests between the SSD controller and the flash chips and (ii) error-correcting codes (ECC) for the NAND flash chips [82, 84–86].

**SSD-Internal DRAM.** Modern SSDs employ DRAM to store metadata crucial for SSD management (e.g., L2P page mapping table) and to cache frequently accessed pages [87–92]. Typically, the DRAM takes up 0.1% of the SSD’s capacity (e.g., 4 GB LPDDR4 DRAM [93] for a 4 TB SSD [94]). As shown in Fig. 4, DRAM is organized in a hierarchical structure. At the highest level, a DRAM module comprises multiple chips, each containing several banks (e.g., 8–16), subdivided into multiple subarrays (e.g., 64–128). A subarray is a 2D array of cells organized into multiple rows (e.g., 512–1024) and columns (e.g., 2–8 KB) [95, 96]. Cells in a row share a wordline while cells in the same column share a bitline. The bitline is used to read from and write to the cells via the row buffer, which contains sense amplifiers (SA in Fig. 4).



Figure 4: Organizational overview of a DRAM module.

**SSD I/O Bandwidth.** SSDs are characterized by the external and internal bandwidth (BW). The external BW, e.g., PCIe [97, 98] lane BW, refers to the data transfer rate between the SSD and the host system and is determined by the number of PCIe lanes. In contrast, the internal BW refers to the bandwidth between the NAND flash chips and the SSD controller. The internal BW typically exceeds the external BW. For example, recent enterprise SSD controllers [99] support 6.55 GB/s external and 19.2 GB/s internal BW, distributed over 16 channels operating at 1.2 GB/s each [100]. To bridge the performance gap between main memory and storage systems, modern SSDs integrate cutting-edge PCIe-Gen4 interfaces, e.g., 7 GB/s PCIe in Samsung PM1735 [101].
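As a quick check of the example figures above, the aggregate internal bandwidth follows from the channel count and per-channel rate, and its ratio to the external bandwidth indicates the headroom available to in-storage computation. The symbols $\mathrm{BW}_{\text{int}}$ and $\mathrm{BW}_{\text{ext}}$ are shorthand introduced here for illustration:

$$
\mathrm{BW}_{\text{int}} = 16 \times 1.2\ \text{GB/s} = 19.2\ \text{GB/s}, \qquad \frac{\mathrm{BW}_{\text{int}}}{\mathrm{BW}_{\text{ext}}} = \frac{19.2\ \text{GB/s}}{6.55\ \text{GB/s}} \approx 2.9
$$

For this example controller, logic placed inside the SSD can therefore draw on roughly three times the bandwidth that is visible to the host.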

## 3. Motivation

### 3.1. Computational Requirements of RSGA

RSGA is a promising approach for bridging the performance gap between sequencing technologies, such as Nanopore sequencing [11, 22–27], and analysis times. It can reduce the basecalling workload by serving as a pre-basecalling filtering approach [45] or enable real-time analysis by completely bypassing the costly deep-learning based basecalling step [14, 34, 35, 63, 65–67]. However, given the rapid growth of sequencing throughput, it becomes exceedingly challenging for software-based RSGA to meet the requirements of real-time analysis [36]. The increasing number of flow cells and nanopores per flow cell leads to scalability challenges in processing generated data simultaneously and in real-time. Real-time RSGA [10, 11, 13, 31, 33, 36, 39], particularly for large genomes and extensive datasets, requires medium to large-sized server-grade systems to meet the significant computational and memory needs [10–12]. For example, mapping a human genome with RSGA on our server-grade system (configuration in §7) requires 52 CPU threads and 128 GB DRAM capacity to meet the real-time analysis requirements of a single portable palm-sized sequencing device [39]. Recent state-of-the-art works [33, 36] meet real-time requirements for small genomes, but fail to scale to larger inputs due to the computationally costly operations of full-genome *alignment*, which slows down the system at quadratic rates as the genome size increases. To further understand the acceleration obstacles of RSGA workflows and exploit the full potential for acceleration, a systematic analysis is required.

We focus on RawHash2 [10], the state-of-the-art RSGA pipeline for read mapping that uses efficient quantization and a lightweight hash-based similarity search to scale to larger genomes. We choose RawHash2 as it introduces a highly efficient seed search mechanism that leads to a better accuracy-throughput trade-off in comparison to prior RSGA read mapping mechanisms, Sigmap [11], UNCALLED [12] and RawHash [10]. We execute RawHash2 on a high-end, latency-optimized SSD [101] with a PCIe Gen4 interface (PCIe) [97]. Fig. 5 shows the breakdown of RawHash2 into the steps described in §2 (i.e., event detection, seeding, chaining) as well as I/O overhead. We measure I/O overhead by executing the pipeline once with data fully preloaded in memory (i.e., without I/O overhead), and once with no data preloaded into memory (i.e., with full I/O overhead from storage). The difference in total runtime between the two runs reflects the I/O data movement time from SSD to memory. We use five different datasets as inputs, enumerated from the smallest (D1, viral SARS-CoV-2 genome) to the largest one (D5, human genome). For all genome sizes, chaining is consistently a primary computational overhead, contributing between 33.1% (D1) and 94.9% (D5) of the total execution time. Seeding takes up 4.3%–9.3% of the execution time. Event detection and I/O data overhead are considerable bottlenecks, especially for small datasets (D1, D2, D3), taking up to 20.48% and 40.84% of the execution time, respectively.



Figure 5: RawHash2 runtime breakdown for real-world genomic datasets, from smallest (D1) to largest (D5).

While no prior work has accelerated the full RSGA read mapping pipeline end-to-end, several of its individual compute primitives have been the focus of hardware acceleration efforts [19, 64, 102, 103]. **Chaining acceleration** has received significant attention in the literature. Researchers have employed GPUs [57, 104] as well as FPGAs and custom hardware architectures [33, 57, 59, 105, 106], achieving performance improvements ranging from $5.4\times$ to $277\times$ compared to their respective software baselines. More recently, novel computing paradigms have been explored to accelerate chaining, including PIM architectures [107] and RISC-V custom instructions [108]. **Seeding acceleration** has similarly been investigated. Custom hardware designs [109] and GPU-based implementations [110, 111] have demonstrated the potential for significant performance gains. In particular, hash-based seeding has emerged as a promising target for in-memory acceleration, with several works proposing PIM-based solutions [112–117]. For example, [37] implements a ReRAM-based accelerator for hash-based seeding within basecalling pipelines, leveraging similar compute primitives as the seeding step in RSGA. pLUTo [118], an in-DRAM accelerator optimized for lookup-table (LUT) operations, is a promising approach for accelerating hash-based seeding and achieves speedups of up to $700\times$ over CPU baselines for seeding-relevant workloads.

### 3.2. Impact of Data Movement on Hardware Accelerated RSGA

Despite the promising results of these standalone accelerators, there are no mature end-to-end accelerated systems for RSGA. Current works overlook the impact of storage I/O on the end-to-end accelerated system. As more RSGA pipeline steps are accelerated to meet the real-time requirements and the growing throughput of modern sequencing devices, the distribution of latency across RSGA read mapping steps will change drastically. We expect I/O to emerge as the dominant bottleneck in the end-to-end analysis as the computational steps are increasingly accelerated and thus minimized.

We validate this hypothesis through a motivational experiment that analyzes RawHash2 [10] using the same real-world datasets and hardware setup as introduced in our previous experiment (§3.1). We assume a scenario that applies state-of-the-art accelerators to the two most frequently accelerated steps: seeding and chaining. We model the latency of the accelerated workflow by incrementally reducing the latency of the seeding and chaining steps by 10% until we reach 100% total latency reduction, i.e., zero execution time. The results are shown in Fig. 6.



Figure 6: Impact of I/O on overall execution time under increasing acceleration of computation bottlenecks.

Fig. 6 shows how I/O data movement overhead progressively dominates end-to-end execution time as latency reduction increases. We make the key observation that as latency reduction increases, the I/O data overhead becomes the limiting factor across all datasets. In particular, for small genome datasets (D1-D3), I/O overhead reaches up to 66% of the total execution time. For larger genomes, I/O overhead remains modest until a large latency reduction of 90%. However, as computation bottlenecks are further minimized, storage I/O emerges as the primary performance limiter. **For example, I/O overhead accounts for 57% and 78% of the total execution time for D5 and D4 respectively when execution time of seeding and chaining is reduced by 100%.** These results indicate that accelerating the seeding and chaining alone is insufficient, and that I/O data movement from SSDs becomes the dominant overhead in accelerated RSGA.
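The latency model behind this experiment can be sketched in a few lines. The sketch assumes that I/O time and compute time add up without overlap (consistent with measuring I/O as the runtime difference between a preloaded and a non-preloaded run) and that only seeding and chaining are scaled down. The function name and the breakdown values are hypothetical and are not taken from Fig. 5 or Fig. 6.

```python
def io_fraction(t_io, t_event, t_seed, t_chain, reduction):
    """Share of end-to-end time spent on I/O when seeding and chaining
    latency is cut by `reduction` (0.0 = no acceleration, 1.0 = zero latency)."""
    accelerated = (1.0 - reduction) * (t_seed + t_chain)
    return t_io / (t_io + t_event + accelerated)

# Hypothetical runtime breakdown in arbitrary time units (illustrative only).
breakdown = dict(t_io=10.0, t_event=5.0, t_seed=5.0, t_chain=80.0)

# Sweep the latency reduction from 0% to 100% in steps of 10%.
sweep = [(r / 10, io_fraction(reduction=r / 10, **breakdown)) for r in range(11)]
for reduction, share in sweep:
    print(f"{reduction:4.0%} reduction -> I/O share {share:5.1%}")
```

In this hypothetical breakdown, the I/O share grows from 10% without acceleration to about 67% once seeding and chaining take no time, at which point only I/O and event detection remain.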

### 3.3. Our Goal

Based on our motivational analysis, acceleration of RSGA is critical for achieving real-time genome analysis. While computational complexity is a key challenge, I/O data movement from SSDs becomes the dominant bottleneck across all genome datasets once computational steps are heavily accelerated. *In-Storage-Processing* (ISP) can, therefore, be a key enabler for designing a real-time system for RSGA. Specifically, ISP can uniquely address the I/O data movement bottleneck, manage the high volume of genomic data, and provide fine-grained parallelism to accelerate the computational steps. However, designing an ISP system for RSGA is challenging due to architectural constraints of SSDs. Limited hardware resources (such as main memory capacity) and inefficient random accesses prevent a straightforward implementation of the RSGA pipeline inside storage. **Our goal** in this work is to carefully leverage ISP capabilities to accelerate RSGA by alleviating the I/O overheads and accelerating key computational steps.

## 4. MARS Key Idea

The core design idea for MARS is to enable multiple Processing-In-Memory paradigms within the SSD and leverage the high SSD-internal flash channel bandwidth to create a highly-parallel heterogeneous computing environment for RSGA inside the storage system. Our optimization strategy consists of two key components. First, we propose targeted software modifications on existing RSGA pipelines that take into account SSD limitations and parallelization capabilities while maintaining accuracy. Second, we provide specialized near-memory computation units within the SSD for individual computational steps of the RSGA pipeline and orchestrate the data flow between them. We leverage two computational approaches within the SSD: (i) Processing-Using-DRAM, which exploits the analog properties of DRAM arrays in the SSD to perform massively parallel in-memory operations with minimal data movement overhead, and (ii) Processing-Near-DRAM, which adds lightweight compute logic *close to* the SSD’s internal DRAM, either *near the DRAM subarrays* or *inside the SSD controller*, tailored to the demands of each RSGA step.

## 5. MARS Genome Analysis Workflow

MARS implements a genome analysis workflow based on the state-of-the-art RSGA approach presented in Fig. 1. The scope of MARS’s software modifications is to reduce both computational workload and intermediate storage requirements, resulting in a version of RSGA that is optimized for efficient in-storage execution.

### 5.1. Filtering Techniques

We adopt two distinct filtering techniques to reduce the load on the computationally intensive and resource-demanding chaining step.

**Frequency Filters.** First, we leverage *frequency filters* [15, 39] to only examine unique, meaningful matches between signal queries, e.g., seeds, and reference genomes. Frequency filters are applied to the hash values created by multiple seeds (§2.1), and discard seeds that appear within the reference genome above a predefined threshold frequency (*thresh\_freq*).
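A minimal sketch of a frequency filter is shown below, assuming the reference index maps each seed hash value to the list of reference positions where it occurs. The function name, the dictionary layout, and the toy data are illustrative assumptions; only the parameter name `thresh_freq` comes from the text.

```python
def frequency_filter(index, thresh_freq):
    """Keep only seeds that occur at most thresh_freq times in the reference index."""
    return {seed: positions
            for seed, positions in index.items()
            if len(positions) <= thresh_freq}

# Toy index: seed hash -> reference positions (made-up values).
index = {
    "seed_a": [5, 90],
    "seed_b": [1, 2, 3, 4, 5, 6],  # highly repetitive seed
    "seed_c": [42],
}
filtered = frequency_filter(index, thresh_freq=3)
print(sorted(filtered))  # ['seed_a', 'seed_c']
```

The repetitive seed is dropped before chaining, so it contributes neither ambiguous candidate locations nor additional chaining work.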

**Seed-and-Vote Filtering.** Second, we adopt the *seed-and-vote* filtering technique [61, 62] to discard anchors unlikely to generate a correct alignment. As described in §2.1, we partition the reference genome into windows, and anchors vote for windows that contain exact matches. A window with a high number of votes is more likely to contain the correct alignment. Only windows receiving a number of votes above *thresh\_voting* are retained for further processing. This threshold is selected to balance accuracy (measured via F1-score) and performance, ensuring sufficient anchors are preserved for sensitivity, while discarding redundant matches to reduce workload. This is the first work to apply the seed-and-vote technique to raw signals. For raw signals, this process is particularly challenging because reads and references, when converted to events, can include noise. To address this, we apply the seed-and-vote technique *after the quantization and hash-table query steps*, to preserve accuracy.

Based on the size and characteristics of the target genome, parameter values for both filtering techniques, i.e., *thresh\_freq*, *thresh\_voting*, and the window size for seed-and-vote filtering may vary. To ensure robust performance across a wide range of datasets, we perform an offline parameter space exploration to tune the parameter values and achieve a fair trade-off between accuracy and performance of the analysis. Our exploration space is defined by the tuple (*thresh\_freq*, *thresh\_voting*, *voting\_window*). We test different configurations on a subset of each dataset (0.5-2%) and observe that genomes with similar properties (e.g., size or complexity) consistently benefit from the same parameter configurations. Small genomes yield the best trade-off between accuracy and performance across different datasets for values of (2000, 5, 256) and large genomes for (20000, 2, 256). Although the values cover a representative set of diverse genomes, they are easily reconfigurable for new genome types. The parameter exploration is performed only once offline and therefore does not impact the end-to-end runtime and energy.

### 5.2. Arithmetic Conversion Techniques

We improve the utilization of the internal flash-channel bandwidth available within the SSD by using arithmetic conversion techniques. The key idea of this optimization is to convert **floating-point** values to **fixed-point** values. This conversion reduces the storage requirements for intermediate data (in most cases, the bit-width drops from 64 or 32 bits to 16 bits) and enables resource-efficient, faster fixed-point operations. We perform an experimental analysis at the software level and evaluate the accuracy achieved with fixed-point arithmetic using 32, 16, and 8 bits. Using 16 bits leads to a small accuracy loss compared to floating point while yielding significant resource utilization savings.
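As an illustration of the floating-point to fixed-point conversion, the following sketch uses a signed 16-bit format with 8 fractional bits. The split between integer and fractional bits, the saturation behavior, and the helper names `to_fixed` and `to_float` are assumptions made for this example; the text above only fixes the total width of 16 bits.

```python
def to_fixed(value, frac_bits=8, total_bits=16):
    """Convert a float to a signed fixed-point integer, saturating on overflow."""
    scaled = round(value * (1 << frac_bits))
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, scaled))


def to_float(fixed, frac_bits=8):
    """Convert a fixed-point integer back to a float."""
    return fixed / (1 << frac_bits)


x = 1.37
q = to_fixed(x)
# The round-trip error is bounded by half of the resolution 2**-frac_bits.
print(q, to_float(q), abs(x - to_float(q)))
```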

Our goal is to maximize savings by applying arithmetic conversion as early as possible in the pipeline. However, adopting fixed-point arithmetic at the beginning of the pipeline is challenging due to the noise of raw signals, i.e., leveraging fewer bits for raw signals interferes with subsequent signal-to-event conversion and quantization leveraged in typical RSGA pipelines [39], leading to much lower accuracy. Applying early quantization, i.e., applying quantization directly on the raw signal before signal-to-event conversion, alleviates this challenge. It increases stability against possible noise and facilitates the adoption of fixed-point arithmetic. Unlike previous works [39], our workflow first applies quantization, followed by converting floating-point to fixed-point arithmetic, and then executes the signal-to-event conversion. We show the accuracy results of our implementation for both fixed- and floating-point in §8.

## 6. MARS Architecture and System

We propose MARS, the first ISP system designed for accelerating RSGA by reducing data movement overheads and leveraging highly parallel computation capabilities present inside modern storage systems. We design MARS as an end-to-end In-Storage-Processing system that expands the capabilities of state-of-the-art SSDs and autonomously executes the RSGA pipeline without host intervention.

### 6.1. MARS In-Storage Architecture

**6.1.1. Overview.** Fig. 7 shows a high-level overview of our system and the application flow.<sup>1</sup> MARS consists of five types of components: *MARS Control Unit* ①, *Sorter Unit* ②, *Merger Unit* ③, *Arithmetic Unit* ④ and *Querying Unit* ⑤.

**SSD Controller Components:** *MARS Control Unit*, *Sorter Unit* and *Merger Unit* are placed inside the SSD controller. *MARS Control Unit* ① acts as a Finite State Machine (FSM) that controls and coordinates the data flow between MARS computation units. *MARS Sorter Unit* ② (§6.4) is an accelerator that sorts sequences up to a predefined length. The *Merger Unit* ③ (§6.4) efficiently combines short sorted sequences into longer ones. Both units follow the Processing-Near-DRAM approach, operating on data that originates from SSD-internal DRAM. One Sorter and Merger pair is added per Flash Controller, adding up to 8 instances.

<sup>1</sup>To ease readability, Fig. 7 and 10 exclude control paths.

**SSD-internal DRAM Components:** The *Arithmetic Unit* ④ and *Querying Unit* ⑤ are placed inside the SSD-internal DRAM chips. The *Arithmetic Unit* ④ (§6.2) performs arithmetic and logical operations. It leverages the Processing-Near-DRAM approach: An Arithmetic Unit is placed at the edge of each pair of subarrays' peripheral logic, leading to 256 instances. The *Querying Unit* ⑤ (§6.3) performs efficient hash-table lookups. It leverages the Processing-Using-DRAM paradigm, i.e., exploits the analog operational properties of the SSD-internal DRAM: One *Querying Unit* is placed per subarray, leading to 512 instances.



Figure 7: High-level overview of MARS architecture.

**6.1.2. Mapping Workflow to Compute Units.** We perform a detailed analysis of the RSGA workflow to partition the RSGA steps (i.e., event detection, seeding, chaining) into more fine-grained tasks such as arithmetic (e.g., addition, multiplication, division), querying and sorting operations. The entire RSGA workflow is described as a pipeline of these fine-grained tasks and each one is mapped to one of the available computation units (Arithmetic, Querying, Sorter, or Merger Unit) for efficient execution. The MARS Control Unit encodes the pipeline steps and the order of their execution into a Finite State Machine and sequentially orchestrates them at runtime. While the pipeline sequence is predefined, actual computations in each step are triggered dynamically based on the availability of inputs. Each compute unit is activated only when its inputs are available, ensuring resource efficiency and avoiding contention.

**6.1.3. Control and Data Flow.** Each step in MARS's pipeline begins as soon as the previous one finishes. Before starting execution, the index and raw input data are distributed uniformly in terms of size across all SSD channels. Data is transferred to the MARS *Arithmetic Units* ④, close to the SSD-internal DRAM subarrays. The *Arithmetic Units* ④ perform the *event detection* step (1), consisting of signal-to-event conversion (a) and quantization (b), executed sequentially. Next, as part of the seeding step (2), the *Arithmetic Units* ④ execute the hash-value generation (c) and the frequency filter (d). The filtered hash values are used for querying (e) the hash table for seed hits inside the DRAM at the *Querying Units* ⑤. The seed-and-vote filtering step (f) discards non-promising seed hits by once more leveraging the *Arithmetic Units* ④. During the chaining step (3), the data is first bucketized (g) within the *Arithmetic Units* ④ and transferred to the *Sorter* ② and *Merger Units* ③ inside the SSD controller for the sorting step (h). The sorted data fragments are consolidated back in the SSD-internal DRAM for the final part of chaining, i.e., a dynamic programming-based algorithm (i) implemented within the *Arithmetic Units* ④.
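The step order and unit assignment described above can be summarized in a small table-driven sketch. The `PIPELINE` table follows the text; the `run_pipeline` function and its `ready` callback are hypothetical stand-ins for the Control Unit's FSM and its input-availability checks, not a description of the actual hardware.

```python
# (step, compute unit) in execution order, as described in the text.
PIPELINE = [
    ("signal-to-event conversion", "Arithmetic"),
    ("quantization", "Arithmetic"),
    ("hash-value generation", "Arithmetic"),
    ("frequency filter", "Arithmetic"),
    ("hash-table query", "Querying"),
    ("seed-and-vote filter", "Arithmetic"),
    ("bucketize", "Arithmetic"),
    ("sort", "Sorter+Merger"),
    ("dynamic programming", "Arithmetic"),
]


def run_pipeline(ready):
    """Walk the fixed step order; a step fires only if ready(state) is true."""
    trace = []
    state = 0
    while state < len(PIPELINE) and ready(state):
        trace.append(PIPELINE[state])
        state += 1
    return trace


for step, unit in run_pipeline(lambda state: True):
    print(f"{step:28s} -> {unit} Unit")
```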

### 6.2. Event Detection Implementation

We map event detection (i.e., signal-to-event conversion and quantization §2.1) to the *Arithmetic Unit* as it mainly comprises additions and multiplications. One *Arithmetic Unit* is placed next to two SSD-internal DRAM subarrays to perform arithmetic operations close to the data and leverage the large subarray-level parallelism available within the DRAM. MARS is the first work to implement Processing-Near-DRAM inside the storage-internal DRAM.

**Arithmetic Unit Architecture and Mechanism.** Our design is inspired by a previous DRAM-based design, FULCRUM [119]. Fig. 8 illustrates the main components of the design. A single-word ALU (1) is placed next to a DRAM subarray and performs addition, comparison, multiplication, and bitwise operations. Registers (2) are placed near the ALU to store intermediate results. A programmable *Instruction Buffer* (3) stores pre-decoded information for potential instructions, i.e., different operands and branch outcomes. *Column-Selection Latches* (4) are placed on each column of each subarray to enable sequential access to individual columns [119]. A *Control Unit* (5) determines the order of instructions and location of next access to the memory array.



Figure 8: Overview of the MARS Arithmetic Unit near a DRAM subarray.

In order to map the operations of signal-to-event conversion and quantization to the Arithmetic Unit, we first break each of them down into arithmetic, predicate-based and condition-based operations. We construct pre-decoded instructions for all potential branches of execution within these operations and store them in the programmable instruction buffer. Based on the outcome of the previous operation, the Control Unit (i) selects the next instruction from the Instruction Buffer and (ii) identifies the columns of the subarray that need to be accessed. This ensures that the Column-Selection Latches either capture the correct input operands (read from the subarray) or hold the correct target values before writing them back to the subarray.

### 6.3. Seeding Implementation

The hash-value generation, frequency filter and seed-and-vote filtering steps in seeding comprise arithmetic operations and pairwise comparisons. To execute those operations efficiently, we use the *Arithmetic Unit* described in Section 6.2. However, hash table querying presents unique challenges due to the hash table’s large size and the frequent random memory accesses it requires. To address this, we implement the hash-querying mechanism inside SSD-internal DRAM leveraging Processing-Using-Memory, in particular the pLUTo [118] approach. This method exploits DRAM’s high storage density to enable massively parallel storage and querying of lookup tables (LUTs), ensuring efficient and scalable operations.

**Querying Unit Architecture and Mechanism.** Fig. 9 shows the architecture and step-by-step control flow of the Querying Unit. The hash table is stored in the SSD-internal DRAM and is queried by successively activating DRAM rows using custom match logic and gated sense amplifiers (SA) [118] (highlighted in orange). The custom match logic, located adjacent to the row buffer, uses comparators to compare the index of the currently activated row against the key values loaded into the source row buffer. A matchline is implemented as part of the custom match logic to enable the gated SA to selectively copy the corresponding value into the output buffer when a match is detected. A single query proceeds in four steps. (1) **Key Loading**. The source row buffer is populated with the input keys (e.g., in Fig. 9: random values K, O, V). (2) **Row Sweeping & Matching**. DRAM rows containing candidate hash entries are sequentially activated. For each row, the match logic compares the row index to the loaded keys. If a match is detected, the corresponding matchline is asserted. (3) **Selective Copying**. The gated sense amplifiers sense and copy only those values in the currently activated row that correspond to matched keys. (4) **Result Assembly**. The matched hash values (e.g., in Fig. 9: 6, 1, 4) are assembled in the row buffer.
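The four query steps can be mimicked with a purely functional Python model. The sketch assumes that each table row holds its value replicated across all columns and that keys are small integers equal to row indices; both are simplifications for illustration, and the name `row_sweep_query` is hypothetical. The inner loop over columns is sequential here, whereas the match logic in DRAM evaluates all columns of a row at once.

```python
def row_sweep_query(lut_rows, keys):
    """Functional model of a row-sweep lookup with one key per column."""
    # (1) Key loading: `keys` plays the role of the source row buffer.
    output = [None] * len(keys)
    # (2) Row sweeping: activate the rows one after another.
    for row_index, row in enumerate(lut_rows):
        for col, key in enumerate(keys):
            if key == row_index:          # matchline asserted for this column
                output[col] = row[col]    # (3) gated copy of the matched value
    # (4) Result assembly: one looked-up value per key.
    return output


num_cols = 3
lut_values = [5, 6, 1, 4]                  # toy table: row i stores lut_values[i]
lut_rows = [[v] * num_cols for v in lut_values]
print(row_sweep_query(lut_rows, keys=[1, 2, 3]))
```

A key without a matching row leaves `None` in its output slot, which corresponds to a seed with no hit.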



Figure 9: Overview of our hash table query mechanism in the SSD-internal DRAM.

If the DRAM size allows it, we store several copies of the hash table in the computation-enhanced subarrays to query multiple values in parallel. If the genome index exceeds DRAM capacity, MARS adopts a partitioning strategy: large indexes (e.g., 52 GB for the human genome in D5) are divided into smaller regions (e.g., 2.6 GB), which are loaded into the SSD DRAM and queried sequentially. To minimize performance impact, MARS overlaps computation with data loading, effectively hiding the data movement latency.
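A short sketch of the partitioning arithmetic and of the effect of overlapping loads with queries. The partition sizes come from the example above; the per-region load and query times are hypothetical placeholders in arbitrary time units, and the double-buffering model in `pipelined_time` is an assumption about how the overlap could be scheduled.

```python
def num_partitions(index_mb, region_mb):
    """Number of regions needed to cover the index (integer ceiling division)."""
    return -(-index_mb // region_mb)


def serial_time(n, load, query):
    """Total time if every region is loaded and then queried without overlap."""
    return n * (load + query)


def pipelined_time(n, load, query):
    """Total time if loading region i+1 overlaps with querying region i."""
    return load + (n - 1) * max(load, query) + query


n = num_partitions(52_000, 2_600)          # 52 GB index, 2.6 GB regions
print(n)
# Hypothetical per-region times, in arbitrary units.
print(serial_time(n, load=1.0, query=2.0), pipelined_time(n, load=1.0, query=2.0))
```

When the query time per region is at least as long as the load time, all loads except the first are hidden behind computation.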

### 6.4. Chaining Implementation

Chaining (§2.1) consists of a sorting step (i.e., to sort seed positions) and a dynamic programming algorithm to extend chains from sorted seeds. While the dynamic programming part, based on additions and min operations, is efficiently handled by our near-DRAM *Arithmetic Unit*, sorting large sets of seeds directly near DRAM would be either slow or require substantial area due to custom comparator logic. Instead, we implement a highly parallel, custom sorter design inside the storage controller and benefit from increased scalability provided by the available SSD controller resources.

**Key Idea.** The main implementation challenge is efficiently sorting input sequences of variable length with high throughput and minimal area overhead. We address this challenge by designing a resource-efficient hierarchical mechanism consisting of (1) a *Sorter Unit* that processes input sequences of up to 128 elements and (2) a *Merger Unit* that combines smaller sorted subsequences into larger sorted outputs, enabling scalability beyond 128 elements.

**Sorter and Merger Unit Architecture.** MARS’s Sorter and Merger Unit is based on the bitonic sorter and merger, respectively [120–122], to benefit from their inherent parallelism and hardware-friendly structure and operations. Sorter and merger units are throughput-matched to prevent pipeline stalls and are sized to balance area efficiency with maximum utilization of the available internal SSD bandwidth.

**Mechanism.** MARS’s Control Unit manages the sorting and merging process, including data movement between the SSD-internal DRAM and the Sorter and Merger Units. Fig. 10 shows the Sort-and-Merge mechanism flow.



Figure 10: Simplified overview of our Sort-and-Merge workflow.

As shown in Fig. 10, (1) the Control Unit groups unsorted seeds stored in SSD-internal DRAM into eight buckets, with each bucket corresponding to a non-overlapping region of the genome. (2) It transfers each bucket to one of eight parallel Sorter–Merger units located inside the storage controller. Each Sorter Unit splits its assigned bucket into smaller subsequences of at most 128 elements and sorts them locally using bitonic sorting. (3) If a bucket contains more than 128 elements, the Sorter Unit forwards the sorted subsequences to the Merger Unit. The Merger Unit then merges these short, sorted input sequences into a longer, fully sorted sequence using a streaming, one-pass merge strategy with no intermediate buffering or feedback. This design keeps control complexity low and throughput high, especially for long or variable-length inputs. (4) The Control Unit writes the sorted outputs back to the SSD-internal DRAM. Since buckets are non-overlapping, they can be directly concatenated without further merging. If local registers near the Merger Units are insufficient, the Control Unit temporarily buffers intermediate results in DRAM. The final sorted sequences are subsequently consumed by the dynamic programming stage of chaining.
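The Sort-and-Merge flow can be sketched in software as follows. The bucket count (8) and the sorter width (128) come from the text. The sketch assumes seeds are plain integer reference positions, and it uses Python's `heapq.merge` as a software stand-in for the hardware Merger Unit, which is based on a bitonic merger. All function names are hypothetical.

```python
import heapq

SORTER_WIDTH = 128   # maximum sequence length handled by one Sorter Unit
NUM_BUCKETS = 8      # one bucket per Sorter-Merger pair


def bitonic_sort(values):
    """Sort a list in ascending order with an iterative bitonic network."""
    n = 1
    while n < len(values):
        n *= 2
    data = list(values) + [float("inf")] * (n - len(values))  # pad to 2**k
    k = 2
    while k <= n:
        j = k // 2
        while j >= 1:
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (data[i] > data[partner]) == ascending:
                        data[i], data[partner] = data[partner], data[i]
            j //= 2
        k *= 2
    return data[:len(values)]


def sort_bucket(bucket):
    """Sort chunks of at most SORTER_WIDTH elements, then merge them."""
    chunks = [bitonic_sort(bucket[i:i + SORTER_WIDTH])
              for i in range(0, len(bucket), SORTER_WIDTH)]
    return list(heapq.merge(*chunks))  # stand-in for the Merger Unit


def sort_seeds(seeds, genome_len):
    """Bucketize seeds by genome region, sort each bucket, and concatenate."""
    span = -(-genome_len // NUM_BUCKETS)
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for s in seeds:
        buckets[s // span].append(s)
    result = []
    for bucket in buckets:  # buckets cover disjoint regions, so no final merge
        result.extend(sort_bucket(bucket))
    return result


print(sort_seeds([907, 12, 5550, 3, 9999, 4201], genome_len=10_000))
```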

### 6.5. System Integration

MARS is integrated into a modern SSD with two different modes of operation: *conventional* and *accelerator*. In *conventional mode*, the SSD operates as a storage device only. In *accelerator mode*, a MARS-enabled SSD only performs RSGA. This dual-mode of operation is feasible through small changes to the Flash Translation Layer (FTL).

**MARS FTL and Data Placement.** At the beginning of *accelerator* mode, the Control Unit flushes all metadata essential to the *conventional mode* (e.g., the page status table, block read counts, logical-to-physical (L2P) mapping, etc.) to the flash storage. MARS leverages the access pattern of the RSGA workflow to apply a storage-efficient custom L2P mapping for the *accelerator* mode. Since the access pattern of the genome index and reference is sequential, data is placed on the flash chips in a log-structured manner. The Control Unit then accesses the data sequentially from the starting logical page address (LPA) and reads across channels in a round-robin manner. Thus, this design allows MARS to keep a small mapping data structure consisting of (1) the mapping between the starting LPA and the physical page address (PPA), (2) the database size, and (3) a sequence of physical block addresses (PBAs), rather than the complete LPA-to-PPA mappings, to store the genomic data.
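To make the compact mapping concrete, the sketch below translates an LPA using only a starting LPA, the database size, and an ordered block sequence. The class `CompactMap`, the per-channel organization of the block sequence, and the `pages_per_block` parameter are assumptions introduced for this example; the text specifies only which three items the mapping structure contains.

```python
from dataclasses import dataclass


@dataclass
class CompactMap:
    start_lpa: int         # first logical page of the database
    num_pages: int         # database size in pages
    block_sequence: list   # per channel: physical blocks in write order (assumed)
    pages_per_block: int   # assumed flash geometry parameter
    num_channels: int = 8

    def translate(self, lpa):
        """Return (channel, physical block, page in block) for a logical page."""
        offset = lpa - self.start_lpa
        if not 0 <= offset < self.num_pages:
            raise ValueError("LPA outside the mapped database")
        channel = offset % self.num_channels           # round-robin striping
        page_in_channel = offset // self.num_channels  # sequential per channel
        block = self.block_sequence[channel][page_in_channel // self.pages_per_block]
        return channel, block, page_in_channel % self.pages_per_block


# Toy layout: 32 pages, 2 blocks per channel, 2 pages per block.
toy = CompactMap(start_lpa=100, num_pages=32,
                 block_sequence=[[10 + c, 50 + c] for c in range(8)],
                 pages_per_block=2)
print(toy.translate(100), toy.translate(109), toy.translate(117))
```

Because placement is sequential and striped, the structure grows with the number of blocks rather than the number of pages.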

**SSD Management Tasks. Error Correction, Read Disturbance and Data Retention:** Since MARS’s accelerators operate within the SSD controller and the SSD-internal DRAM, all data is accessed by the Control Unit after ECC decoding [82, 84–86]. MARS effectively tackles read disturbance and data retention impact [82, 123–127] since: (i) The sequential access pattern of the RSGA pipeline minimizes repeated reads to the same page within short intervals, reducing the likelihood of read disturbances [123, 126], (ii) Commodity SSDs automatically apply data refresh policies that refresh pages once their read counts exceed predefined thresholds, (iii) The time interval between subsequent refreshes does not exceed the duration of RSGA, which is substantially shorter than the manufacturer-specified threshold for reliable retention age (e.g., one year [128]).

**Wear-leveling:** MARS effectively mitigates the impact of writes on flash lifetime thanks to two design choices: (i) Our design employs an out-of-place write policy and selects new blocks for writing based on their age, thereby effectively reducing long-term degradation, and (ii) *flash writes* are minimized as the Control Unit only writes the final read mapping results from the SSD-internal DRAM to the flash memory at the end of the RSGA workflow.

**Storage Interface Commands.** MARS operates independently of the host during RSGA execution, using the FSM in the SSD controller. Our design introduces two new NVMe commands, i.e., standardized interfaces used by the host to communicate with SSDs, for the host to support MARS execution: (i) *MARS\_Init* initiates the RSGA analysis and signals the SSD to switch from the conventional into the accelerator mode, (ii) *MARS\_Write* command updates both the MARS FTL and regular FTL at the end of the application when the read mapping results are written from the SSD-internal DRAM to flash cells.

## 7. Evaluation Methodology

**Evaluated Systems.** We evaluate MARS by comparing it against state-of-the-art RSGA and conventional basecalling-based read-mapping systems in terms of accuracy, performance and energy. As a baseline for RSGA-based read mapping, we select state-of-the-art RawHash2 [39], which offers a better accuracy-throughput trade-off compared to prior RSGA tools and techniques, including RawHash [10], Sigmap [11] and UNCALLED [12].

We evaluate the following systems: (1) **BC**: a baseline pipeline for basecalling-based read mapping comprising the GPU-based basecaller *Dorado* [63] and the *minimap2* [15] read-mapping tool (Version 2.24-r1122). To simulate a real-time setting, we assume the basecaller processes raw signal chunks incrementally as they are generated by the sequencer rather than waiting for the full completion of each read’s raw signal. (2) **RH2**: RawHash2 [39] RSGA-based read mapping baseline running on a state-of-the-art server-grade CPU [129]. (3) **MS-CPU<sub>Float</sub>**: MARS executed on CPU using floating-point arithmetic and the filtering optimizations presented in Section 5. (4) **MS-CPU<sub>Fixed</sub>**: MARS executed on CPU using both fixed-point arithmetic and filtering optimizations. (5) **MARS**: our proposed in-storage design of MARS using fixed-point arithmetic, implemented as described in Section 6. (6) **MS-EXT**: a variant of MARS that adds all computation units outside (external to) the SSD. Sorting is offloaded to a near-CPU ASIC based on our custom design, while arithmetic and hash querying operations are executed in DRAM-based PIM units [118, 119]. This configuration represents a PIM-only system that avoids any in-storage computation and serves as a comparison point to evaluate the benefits of tightly integrated compute within the storage hierarchy. (7) **MS-SIMDRAM**: a MARS variant that replaces the Processing-Near-DRAM-based Arithmetic Unit with a SIMDRAM-based [130] Arithmetic Unit. (8) **GenPIP** [37]: a state-of-the-art hardware-accelerated, basecalling-based read mapping pipeline combining non-volatile memory (NVM)-based PIM with algorithmic optimizations. (9) **MS-SmartSSD** [131]: an existing system [132] which directly connects an FPGA with the SSD via an external 3 GB/s link [133]. We map MARS’s Sorter and Merger Logic Units to the FPGA (300 MHz clock frequency [132]) and place our PIM components (§6.2, §6.3) in the SSD-internal DRAM.

**CPU and GPU Configurations.** For the CPU-based systems, we use a high-end server with two 64-core AMD EPYC 7742 CPUs [129], 1TB of DDR4 DRAM [93] and a performance-optimized SSD [101] connected to the CPU via a PCIe4 interface [97]. For the BC system, the basecalling step (*Dorado* [63]) runs on an NVIDIA RTX A6000 GPU [134]. All software tools support multi-threaded processing where each raw signal sequence is handled by a separate thread. We run all tools with the best-performing configuration of 128 threads to compare against our system.

**SSD and DRAM Configurations.** To evaluate MARS and **MS-SIMDRAM**, we consider a performance-optimized SSD with internal LPDDR4 DRAM [93] (Table 1). Since accelerators and compute units operate sequentially, we simulate each component individually, including the data movement between them. For DRAM-based components (i.e., *Arithmetic* and *Querying* Units), we use timing parameters extracted from the LPDDR4 DRAM model in CACTI7 [135]. We assume that single-word ALUs embedded in SSD-internal DRAM operate at 164 MHz. For SSD components, we use MQSim [90, 136], a widely-adopted simulator for modern SSDs. Our *Sorter* and *Merger* Units are implemented in Verilog HDL and synthesized using Synopsys Design Compiler [137] at 1 GHz to obtain timing, area, and energy results. We model data movement overheads by calculating the transfer latency between each computing and storage element based on the size of the data to be transferred and the available bandwidth between components. We combine the simulation results from the DRAM and SSD simulators, the Verilog synthesis, and the data movement overheads to evaluate the end-to-end performance of MARS.
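The data-movement model described above reduces to dividing the transferred size by the available bandwidth. The sketch below applies it to one example, assuming that all 8 flash channels stream in parallel at the 1 GB/s per-channel bandwidth of Table 1 and that 1 GB equals 10^9 bytes. The function name and the choice of a 39 GB input are illustrative; the result is an idealized lower bound, not a reported measurement.

```python
GB = 10**9  # assumption: decimal gigabytes


def transfer_latency_s(size_bytes, bandwidth_bytes_per_s):
    """Transfer latency as data size divided by available bandwidth."""
    return size_bytes / bandwidth_bytes_per_s


# Aggregate internal bandwidth: 8 channels x 1 GB/s each (Table 1),
# assuming data is evenly striped so that all channels are busy.
internal_bw = 8 * 1 * GB

# Idealized time to stream a 39 GB dataset out of the flash chips.
print(transfer_latency_s(39 * GB, internal_bw), "s")
```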

**Table 1: Simulation configuration of our design.**

| Component              | Detailed Configuration                                                                                                                                          |
|------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
| SSD                    | NVMe, PCIe 4.0, PCIe lane BW: 1.2 GB/s, TLC, 8 channels, 8 chips/channel, tDMA: 16 µs, tR (TLC): 22.5 µs, flash channel BW: 1 GB/s, 4 ARM Cortex R7 |
| SSD-Internal DRAM      | 4 GB LPDDR4 DRAM, 16 banks, 512 subarrays, 256 rows/subarray, row size: 2048 bytes                                                                              |
| Sorter and Merger Unit | Frequency: 1 GHz                                                                                                                                                |
| Arithmetic Unit        | Frequency: 164 MHz                                                                                                                                              |

**Datasets.** We evaluate MARS on five real-world datasets from different organisms, covering a wide range of genome sizes. Table 2 summarizes the dataset characteristics [138–143] and reference genomes [144–149], all obtained from public repositories. We use the *fast5* file format for our input data and assume that the data is already correctly placed, i.e., sequentially and evenly distributed across all SSD channels, for all evaluated systems.

## 8. Evaluation

### 8.1. Accuracy Analysis

We evaluate **RH2**, **MS-CPU<sub>Fixed</sub>** and **MS-CPU<sub>Float</sub>** accuracy based on the ground truth generated by basecalling reads with Dorado [63] and mapping the generated *basecalled* reads to the reference genome using minimap2 [15]. All hardware systems implement MS-CPU<sub>Float</sub> workflow and thus achieve the same accuracy. We use the UNCALLED pafstats [12] tool to identify true positives (TP: correct mappings), false positives (FP: incorrect mappings), and false negatives (FN: unmapped reads that are mapped in the ground truth) based on the mapping position distance from the respective ground truth. Using these values, we calculate precision ($P = TP / (TP + FP)$), recall ($R = TP / (TP + FN)$), and the $F_1$ score ($F_1 = 2 \times (P \times R) / (P + R)$).

**Table 2: Details of datasets used in our evaluation.**

| Organism          | Reads (#) | Bases (#) | Genome Size (bp) | Dataset Size |
|-------------------|-----------|-----------|------------------|--------------|
| D1 SARS-CoV-2     | 1,382,016 | 594 M     | 29,903           | 11 GB        |
| D2 <i>E. coli</i> | 353,317   | 2,365 M   | 5 M              | 27 GB        |
| D3 Yeast          | 49,989    | 380 M     | 12 M             | 39 GB        |
| D4 Green Algae    | 29,933    | 609 M     | 111 M            | 74 GB        |
| D5 Human HG001    | 269,507   | 1,584 M   | 3,117 M          | 39 GB        |

We make two observations based on the accuracy results reported in Table 3. (1) MS-CPU<sub>Fixed</sub> outperforms RH2 in terms of recall and $F_1$ score for all evaluated datasets, while maintaining on-par precision for small genomes and only a slight reduction in precision for larger genomes. This improvement is due to the integration of our two proposed filtering techniques (§5.1) and early quantization (§5.2), which together eliminate ambiguous or redundant candidate matches, i.e., matches that are frequent, low-quality or non-specific, and allow the pipeline to focus on signal regions that are more likely to represent correct alignments. (2) The use of fixed-point and integer operations instead of floating-point operations only minimally decreases accuracy for all datasets.
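The precision, recall, and $F_1$ definitions used in this section can be written as a short Python sketch. The function names and the TP/FP/FN counts in the first call are hypothetical; the second call recomputes the $F_1$ score of RH2 on D1 from the precision and recall values reported in Table 3.

```python
def f1_score(precision, recall):
    """F1 = 2 * (P * R) / (P + R)."""
    return 2 * precision * recall / (precision + recall)


def mapping_metrics(tp, fp, fn):
    """Return (precision, recall, F1) from mapping outcome counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, f1_score(precision, recall)


# Hypothetical counts, for illustration only.
print(mapping_metrics(tp=90, fp=10, fn=30))

# RH2 on D1: precision and recall as reported in Table 3.
print(round(f1_score(0.9868, 0.8735), 4))
```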

### 8.2. Performance Analysis

We evaluate the performance of the systems described in §7 using the five diverse datasets of Table 2. Fig. 11 shows the execution time speedup achieved by each evaluated system over CPU-based RawHash2, RH2. We make three observations. First, MARS outperforms *all* other baselines across *all* datasets. Compared to the GPU-accelerated basecalling-based pipeline BC, MARS delivers a speedup of $93\times$ on average across all five datasets, with larger speedups for smaller genomes. This is because MARS (i) eliminates basecalling, (ii) applies filtering mechanisms, (iii) reduces data movement, and (iv) enables highly parallel in-storage execution.



**Figure 11: End-to-end execution time speedup of each system over RH2.**

Second, MARS outperforms all prior hardware-accelerated solutions: MS-EXT, MS-SIMDRAM, GenPIP, and MS-SmartSSD. Specifically, MARS improves performance by $3.1\times$ on average over MS-EXT, which adopts PIM solutions (MARS-based ASIC and PIM accelerator) outside the storage. This comparison point shows that MS-EXT fails to fundamentally solve the I/O data movement overhead problem and highlights the importance of and need for in-storage processing for RSGA. MS-SmartSSD performs worse than MARS due to its limited 3 GB/s bandwidth between SSD and FPGA [131], which prevents it from exploiting the internal SSD bandwidth between flash and storage controller that MARS fully utilizes. While MS-SIMDRAM addresses I/O overhead through in-storage computation, its use of bit-serial operations for arithmetic (e.g., multiplication, division) results in an execution time $21.4\times$ longer than that of MARS. MARS is the only design that both eliminates the I/O bottleneck and meets the computational demands of RSGA acceleration.

Third, our algorithmic improvements alone (MS-CPU<sub>Fixed</sub>), i.e., without leveraging ISP capabilities, provide a considerable speedup of $1.2\text{-}10.2\times$ over RH2 for medium- to large-sized genomes (i.e., D3-D5) and on-par performance for small genomes (i.e., D1, D2). This demonstrates the effectiveness of our software optimizations, including filtering, in reducing the computational load, particularly during chaining.

**Throughput evaluation.** We compare MARS’s throughput with the throughput of a single nanopore, which is 450 bases per second (i.e., 4000-5000 samples per second) [150]. As Table 4 shows, MARS’s throughput is substantially higher than 450 bp/sec for all datasets. In fact, MARS exceeds the real-time analysis requirement of a full MinION sequencer [27], which generates data at 230,400 bp/sec, by $46\times$ on average across all datasets (ranging from $1.2\times$ for large genomes (D5) to $202\times$ for small genomes (D1)).
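The per-dataset margins over the MinION requirement follow directly from the throughput values in Table 4, as the sketch below shows. The variable names are illustrative. The sketch also makes explicit that the MinION rate corresponds to 512 nanopores at 450 bp/sec each.

```python
NANOPORE_BPS = 450        # single nanopore
MINION_BPS = 230_400      # full MinION sequencer

# MARS throughput in bp/sec, as listed in Table 4.
throughput = {
    "D1": 46_655_128,
    "D2": 5_274_148,
    "D3": 1_202_660,
    "D4": 1_277_764,
    "D5": 286_728,
}

print("pores per MinION:", MINION_BPS // NANOPORE_BPS)
for dataset, bps in throughput.items():
    print(f"{dataset}: {bps / MINION_BPS:.1f}x the MinION rate")
```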

### 8.3. Energy Analysis

To demonstrate the energy benefits of MARS, we measure the energy consumption of all components (i.e., SSD, DRAM, CPU and if applicable GPU) involved in the respective systems. We use AMD µProf [151] to measure the energy consumption for CPU-based systems, and the CACTI7 [135] DDR4 model to estimate the power overheads on our PIM-enabled DRAM design. We synthesize logic components with the Synopsys Design Compiler [137] using a 65nm process node to estimate their power consumption.

Fig. 12 shows the end-to-end energy reduction achieved by all evaluated systems over RawHash2 (RH2). We make three observations. (1) All hardware-accelerated systems, i.e., MARS, MS-EXT, MS-SIMDRAM, and GenPIP, achieve greater energy reduction than the CPU-based setups, i.e., BC and MS-CPU<sub>Fixed</sub>. (2) Only MS-SIMDRAM yields a higher energy reduction than MARS (by  $3.5\times$  on average across datasets), due to its simplified Arithmetic Unit based on bit-serial, in-memory execution. However, because of MS-SIMDRAM’s significantly higher latency (§8.2), MARS still provides a more favorable trade-off between latency and energy consumption. (3) MS-EXT reduces energy by  $22.3\times$  over RH2, as opposed to MARS’s  $79.4\times$  reduction, due to high data movement from the storage to the host and accelerators and a greater reliance on the CPU for orchestration, which increases energy use on the host side. Overall, MARS achieves the best trade-off between energy consumption and performance among all designs.

**Table 3: Mapping accuracy of three RSGA pipelines compared to basecalling-based ground truth.**

|                         | D1 SARS-CoV-2 |        |        | D2 E. coli |        |        | D3 Yeast |        |        | D4 Green Algae |        |        | D5 Human HG001 |        |        |
|-------------------------|---------------|--------|--------|-----------|--------|--------|----------|--------|--------|----------------|--------|--------|----------------|--------|--------|
|                         | Prec.         | Recall | $F_1$  | Prec.     | Recall | $F_1$  | Prec.    | Recall | $F_1$  | Prec.          | Recall | $F_1$  | Prec.          | Recall | $F_1$  |
| RH2                     | 0.9868        | 0.8735 | 0.9267 | 0.9573    | 0.9009 | 0.9282 | 0.9862   | 0.8412 | 0.9079 | 0.9691         | 0.7015 | 0.8139 | 0.8949         | 0.4054 | 0.5582 |
| MS-CPU<sub>Fixed</sub>  | 0.9917        | 0.9694 | 0.9803 | 0.9854    | 0.9574 | 0.9712 | 0.9533   | 0.9643 | 0.9588 | 0.9125         | 0.9166 | 0.9141 | 0.8723         | 0.6318 | 0.7300 |
| MS-CPU<sub>Float</sub>  | 0.9939        | 0.9796 | 0.9867 | 0.9893    | 0.9616 | 0.9753 | 0.9551   | 0.9655 | 0.9603 | 0.9254         | 0.9438 | 0.9354 | 0.8763         | 0.6729 | 0.7612 |

**Table 4: Throughput of MARS.** A single nanopore has a throughput of 450 bp/sec; an entire MinION sequencer achieves 230,400 bp/sec.

|                     | D1         | D2        | D3        | D4        | D5      |
|---------------------|------------|-----------|-----------|-----------|---------|
| Throughput [bp/sec] | 46,655,128 | 5,274,148 | 1,202,660 | 1,277,764 | 286,728 |
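
To put the Table 4 numbers in context, the following minimal sketch divides each reported MARS throughput by the MinION throughput quoted in the Table 4 caption. The throughput values are copied from Table 4; the helper name `headroom` and the idea of expressing the result as a multiple of one MinION are illustrative assumptions, not part of the evaluation methodology.

```python
# Throughput of MARS per dataset, copied from Table 4 (bp/sec).
MARS_THROUGHPUT_BP_PER_SEC = {
    "D1": 46_655_128,
    "D2": 5_274_148,
    "D3": 1_202_660,
    "D4": 1_277_764,
    "D5": 286_728,
}

# Sequencer figures from the Table 4 caption (bp/sec).
NANOPORE_BP_PER_SEC = 450
MINION_BP_PER_SEC = 230_400

# Number of nanopores implied by the two caption figures.
IMPLIED_PORES = MINION_BP_PER_SEC // NANOPORE_BP_PER_SEC


def headroom(throughput_bp_per_sec, sequencer_bp_per_sec=MINION_BP_PER_SEC):
    """Return analysis throughput as a multiple of the sequencer's output rate.

    Hypothetical helper: a value above 1.0 means the analysis keeps up with
    one sequencer running at the quoted rate.
    """
    return throughput_bp_per_sec / sequencer_bp_per_sec


if __name__ == "__main__":
    print(f"Implied nanopores per MinION: {IMPLIED_PORES}")
    for dataset, throughput in MARS_THROUGHPUT_BP_PER_SEC.items():
        print(f"{dataset}: {headroom(throughput):.2f}x one MinION")
```

Under these figures, the ratio stays above 1 for every dataset, with the smallest margin on D5 (Human HG001).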



**Figure 12: Energy reduction of each system compared to RH2.**

#### 8.4. Area Analysis

**SSD-internal DRAM overhead.** We estimate the base area of our PIM-enabled DRAM with CACTI7 [135] to be  $55.48 \text{ mm}^2$  in a 22nm technology. Each Arithmetic Unit occupies  $0.0295 \text{ mm}^2$ , leading to  $7.56 \text{ mm}^2$  [119, 152] total overhead for all 256 Arithmetic Units. Each LUT-based Querying Unit occupies  $0.018 \text{ mm}^2$ , leading to  $9.22 \text{ mm}^2$  [118] for 512 instances. The total DRAM overhead of our design, i.e.,  $16.78 \text{ mm}^2$ , is low compared to the total SSD area available, i.e., at least  $6400 \text{ mm}^2$  for our SSD configuration of 8 channels and 8 typical  $100 \text{ mm}^2$  NAND flash chips per channel.

**SSD Controller Logic overhead.** We estimate the area overhead of our logic components using Synopsys Design Compiler [137] with the UMC 65nm technology node [153]. The areas of the Sorter, Merger, and Controller Unit are  $0.78 \text{ mm}^2$,  $0.14 \text{ mm}^2$, and  $0.002 \text{ mm}^2$, respectively. Compared to a 14nm Intel Processor [154], the Sorter and Merger introduce only 0.028% area overhead (their area is  $0.09 \text{ mm}^2$  when scaled to 14nm [155]).

**Table 5: Area analysis overview per component.**

| Placement in SSD  | Unit       | Number of Instances | Area per Unit [mm<sup>2</sup>] | Total Area [mm<sup>2</sup>] |
|-------------------|------------|------------------|----------------------------------|-------------------------------|
| SSD-internal DRAM | Arithmetic | 256              | 0.0295                           | 7.56                          |
|                   | Querying   | 512              | 0.018                            | 9.22                          |
| SSD controller    | Sorter     | 8                | 0.78                             | 6.24                          |
|                   | Merger     | 8                | 0.14                             | 1.12                          |
|                   | Control    | 1                | 0.002                            | 0.002                         |
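
The totals in Table 5 follow from multiplying the per-unit area by the number of instances. The sketch below recomputes them from the printed values and relates the DRAM overhead to the 6400 mm<sup>2</sup> SSD area stated in the text. All names and the rounding tolerance are illustrative assumptions. Because the per-unit areas are printed with limited precision, a recomputed total can differ from the reported one in the last digit (e.g., 256 × 0.0295 = 7.552 versus the reported 7.56).

```python
# Rows copied from Table 5:
# (placement, unit, instances, area per unit [mm^2], reported total [mm^2]).
TABLE_5 = [
    ("SSD-internal DRAM", "Arithmetic", 256, 0.0295, 7.56),
    ("SSD-internal DRAM", "Querying", 512, 0.018, 9.22),
    ("SSD controller", "Sorter", 8, 0.78, 6.24),
    ("SSD controller", "Merger", 8, 0.14, 1.12),
    ("SSD controller", "Control", 1, 0.002, 0.002),
]

# SSD configuration from the text: 8 channels x 8 chips per channel x 100 mm^2 per chip.
SSD_AREA_MM2 = 8 * 8 * 100

# Assumed tolerance that absorbs rounding of the printed per-unit areas.
ROUNDING_TOLERANCE_MM2 = 0.01


def recomputed_total(instances, area_per_unit_mm2):
    """Total area of all instances of one unit, in mm^2."""
    return instances * area_per_unit_mm2


def dram_overhead_fraction(reported_dram_total_mm2=16.78):
    """Reported DRAM overhead (16.78 mm^2 in the text) as a fraction of the SSD area."""
    return reported_dram_total_mm2 / SSD_AREA_MM2


if __name__ == "__main__":
    for placement, unit, instances, per_unit, reported in TABLE_5:
        total = recomputed_total(instances, per_unit)
        print(f"{placement:18s} {unit:10s} {total:7.3f} mm^2 (reported {reported})")
    print(f"DRAM overhead vs. SSD area: {100 * dram_overhead_fraction():.2f}%")
```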

#### 8.5. Sensitivity to SSD-Internal DRAM Size

We perform a sensitivity analysis to examine how the two ISP designs, MARS and MS-SIMDRAM, scale with the size of the SSD-internal DRAM, i.e., 2 GB, 4 GB (base configuration), and 8 GB. Fig. 13 shows that MARS’s performance increases by $1.70\times$ on average when we double the internal DRAM size, while MS-SIMDRAM’s performance increases by almost $1.99\times$ on average. Therefore, the proposed design scales well with increasing internal DRAM resources and is not bound by the internal bandwidth. MS-SIMDRAM’s slightly better scaling indicates that PuM-based computation benefits more from additional DRAM capacity.
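
One way to read these two numbers is as a scaling efficiency, i.e., the achieved speedup divided by the factor by which the resource grew. The sketch below applies this to the reported average speedups for a doubling of the SSD-internal DRAM. The metric and the name `scaling_efficiency` are assumptions introduced here for illustration; the evaluation itself reports only the speedups.

```python
# Average speedups per doubling of SSD-internal DRAM, as reported for Fig. 13.
REPORTED_SPEEDUP_PER_DOUBLING = {
    "MARS": 1.70,
    "MS-SIMDRAM": 1.99,
}


def scaling_efficiency(speedup, resource_factor=2.0):
    """Fraction of ideal (linear) scaling achieved; 1.0 means perfectly linear.

    Hypothetical metric, not defined in the paper.
    """
    return speedup / resource_factor


if __name__ == "__main__":
    for design, speedup in REPORTED_SPEEDUP_PER_DOUBLING.items():
        print(f"{design}: {100 * scaling_efficiency(speedup):.1f}% of linear scaling")
```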



**Figure 13: Sensitivity to SSD-internal DRAM size.**

## 9. Related Work

To our knowledge, this is the first work to 1) enable in-storage acceleration of Raw Signal Genome Analysis and 2) combine the use of Processing-Near-Memory and Processing-Using-Memory inside the storage system. In this section, we briefly review prior work on hardware acceleration for genome analysis and ISP.

**Hardware Acceleration for RSGA.** Prior works on hardware acceleration for RSGA propose FPGA-based [33, 58–60] and GPU-based [13, 54, 56, 57] systems. Specifically, SquiggleFilter [36] proposes an edge-GPU-based system for RSGA that performs contamination analysis for small, viral genomes based on a 1D systolic array. HARU [33] uses an MPSoC with an on-chip FPGA to accelerate RSGA, and f5c [56] presents a GPU-based accelerator. None of these systems 1) considers the impact of I/O data movement on the end-to-end execution of RSGA or 2) provides a system for genome analysis that scales to medium- and larger-sized genomes, due to their use of costly dynamic time-warping alignment operations [17, 36, 55]. A comparison of MARS with SquiggleFilter and HARU is out of scope, as these works focus on performing read alignment for small, mostly viral genomes. MARS can be integrated into these tools to help them quickly identify seed hits, thereby avoiding a search of the entire genome and enabling scaling to large genomes.
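
To illustrate why dynamic time warping (DTW) becomes costly as genomes grow, the following is a minimal textbook DTW sketch, not the banded or subsequence variants used by the cited tools. The function name `dtw_distance` and the absolute-difference cost are illustrative assumptions. The nested loops fill one cell per (signal sample, reference sample) pair, so the work grows with the product of the two lengths.

```python
import math


def dtw_distance(signal, reference):
    """Classic DTW distance between two numeric sequences.

    Uses absolute difference as the local cost (an assumption made for
    illustration) and keeps only two rows of the dynamic-programming table.
    """
    n, m = len(signal), len(reference)
    # prev[j] holds the best cost of aligning the first i-1 signal samples
    # with the first j reference samples; only the empty alignment costs 0.
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(signal[i - 1] - reference[j - 1])
            # Extend the cheapest of the three neighboring alignments.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]


if __name__ == "__main__":
    # A stretched copy of the reference aligns at zero cost.
    print(dtw_distance([1, 2, 2, 3], [1, 2, 3]))
    # A shifted signal accumulates cost at the boundaries.
    print(dtw_distance([1, 2, 3], [2, 3, 4]))
```

For a signal of length n and a reference of length m, this sketch evaluates n·m cells, which is the quadratic behavior that seed-based filtering aims to avoid on large references.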

**Hardware Acceleration for Genome Analysis.** Multiple prior works propose accelerator designs for basecalling-based genome analysis, targeting the basecalling and read mapping steps with different architectures such as ASICs [156–158], GPUs [57, 159–170], FPGAs [105, 171–185], ISP [186], and PIM [187–194]. Basecalling accelerators [37, 65, 67, 195–200] speed up the translation of raw signals into nucleotide sequences, a step that is entirely bypassed by our RSGA-based design. Read mapping accelerators [75, 105, 106, 112, 159, 161, 175, 176, 179, 186, 189, 201–210] are *not* applicable to RSGA as they do *not* consider the noise within raw signals.

**In-Storage Processing.** Prior works explore ISP through various approaches: (1) Processing-Near-Flash memory, which integrates processing capabilities into the SSD controller in a general-purpose [211–216] or application-specific way [217–225], (2) Processing-Using-Flash memory, which exploits the analog properties of flash memory [226–234], and (3) close integration of SSDs with GPUs [235] or FPGAs [236–239] (e.g., SmartSSD [131, 133]). While SmartSSD [131] places an FPGA near the SSD, MARS integrates computation inside the SSD-internal DRAM and controllers. Several works also consider other storage technologies, such as HDDs [215, 223, 224, 240], for computation. None of these works leverages and enhances the SSD’s computational capabilities to accelerate RSGA.

## 10. Conclusion

We propose MARS, the first in-storage processing architecture that enables multiple Processing-In-Memory paradigms within the SSD to reduce both data movement and computation overheads of RSGA read mapping. MARS (1) proposes targeted software modifications, such as early signal quantization and read filtering, to minimize hardware resources while maintaining accuracy, and (2) provides near-memory computation units within the SSD for accelerating computational steps of the RSGA pipeline. MARS improves performance over software- and hardware-accelerated state-of-the-art read mapping pipelines by $93\times$ and $40\times$, respectively, while reducing their energy consumption by $427\times$ and $72\times$, respectively, on average across five real-world datasets.

## Acknowledgments

We thank the anonymous reviewers of ICS 2025, ISCA 2025 and HPCA 2025 for their feedback. We thank the SAFARI group members for the feedback and stimulating intellectual environment they provide. We acknowledge the generous gifts from our industrial partners including Google, Huawei, Intel, and Microsoft. This work is supported in part by the ETH Future Computing Laboratory (EFCL), Huawei ZRC Storage Team, Semiconductor Research Corporation, AI Chip Center for Emerging Smart Systems (ACCESS), sponsored by InnoHK funding, Hong Kong SAR, and European Union’s Horizon programme for research and innovation [101047160 - BioPIM].

## References

- [1] E. A. Ashley, “Towards Precision Medicine,” *Nature Reviews Genetics*, 2016.
- [2] M. Flores, G. Glusman, K. Brogaard, N. D. Price, and L. Hood, “P4 Medicine: How Systems Medicine Will Transform the Healthcare Sector and Society,” *Personalized Medicine*, 2013.
- [3] N. M. Sweeney, S. A. Nahas, S. Chowdhury, S. Batalov, M. Clark, S. Taylor, J. Cakici, J. J. Nigro, Y. Ding, N. Veeraraghavan *et al.*, “Rapid Whole Genome Sequencing Impacts Care and Resource Utilization in Infants with Congenital Heart Disease,” *NPJ Genomic Medicine*, 2021.
- [4] J. S. Bloom, L. Sathe, C. Munugala, E. M. Jones, M. Gasperini, N. B. Lubock, F. Yarza, E. M. Thompson, K. M. Kovary, J. Park *et al.*, “Massively Scaled-Up Testing for SARS-CoV-2 RNA via Next-Generation Sequencing of Pooled and Barcoded Nasal and Saliva Samples,” *Nature Biomedical Engineering*, 2021.
- [5] R. Yelagandula, A. Bykov, A. Vogt, R. Heinen, E. Özkan, M. M. Strobl, J. C. Baar, K. Uzunova, B. Hajdusits, D. Kordic *et al.*, “Multiplexed Detection of SARS-CoV-2 and Other Respiratory Infections in High Throughput by SARSq,” *Nature Communications*, 2021.
- [6] J. Romiguier, P. Gayral, M. Ballenghien, A. Bernard, V. Cahais, A. Chenuil, Y. Chiari, R. Dernat, L. Duret, N. Faivre *et al.*, “Comparative Population Genomics in Animals Uncovers the Determinants of Genetic Diversity,” *Nature*, 2014.
- [7] H. Ellegren and N. Galtier, “Determinants of Genetic Diversity,” *Nature Reviews Genetics*, 2016.
- [8] J. Prado-Martinez, P. H. Sudmant, J. M. Kidd, H. Li, J. L. Kelley, B. Lorente-Galdos, K. R. Veeramah, A. E. Woerner, T. D. O’Connor, G. Santpere, A. Cagan, C. Theunert, F. Casals, H. Laayouni, K. Munch, A. Hobolth, A. E. Halager, M. Malig, J. Hernandez-Rodriguez, I. Hernando-Herraez, K. Prüfer, M. Pybus, L. Johnstone, M. Lachmann, C. Alkan, D. Twigg, N. Petit, C. Baker, F. Hormozdiari, M. Fernandez-Callejo, M. Dabad, M. L. Wilson, L. Stevson, C. Camprubí, T. Carvalho, A. Ruiz-Herrera, L. Vives, M. Mele, T. Abello, I. Kondova, R. E. Bontrop, A. Pusey, F. Lankester, J. A. Kiyang, R. A. Bergl, E. Lonsdorf, S. Myers, M. Ventura, P. Gagneux, D. Comas, H. Siegmund, J. Blanc, L. Agueda-Calpena, M. Gut, L. Fulton, S. A. Tishkoff, J. C. Mullikin, R. K. Wilson, I. G. Gut, M. K. Gonder, O. A. Ryder, B. H. Hahn, A. Navarro, J. M. Akey, J. Bertranpetti, D. Reich, T. Mailund, M. H. Schierup, C. Hvilsom, A. M. Andrés, J. D. Wall, C. D. Bustamante, M. F. Hammer, E. E. Eichler, and T. Marques-Bonet, “Great Ape Genetic Diversity and Population History,” *Nature*, 2013.
- [9] C. Alkan, J. M. Kidd, T. Marques-Bonet, G. Aksay, F. Antonacci, F. Hormozdiari, J. O. Kitzman, C. Baker, M. Malig, O. Mutlu, S. C. Sahinalp, R. A. Gibbs, and E. E. Eichler, “Personalized Copy Number and Segmental Duplication Maps Using Next-Generation Sequencing,” *Nature Genetics*, 2009.
- [10] C. Firtina, N. Mansouri Ghiasi, J. Lindegger, G. Singh, M. B. Cavlak, H. Mao, and O. Mutlu, “RawHash: Enabling Fast and Accurate Real-Time Analysis of Raw Nanopore Signals for Large Genomes,” *Bioinformatics*, 2023.
- [11] H. Zhang, H. Li, C. Jain, H. Cheng, K. F. Au, H. Li, and S. Aluru, “Real-time Mapping of Nanopore Raw Signals,” *Bioinformatics*, 2021.
- [12] S. Kovaka, Y. Fan, B. Ni, W. Timp, and M. C. Schatz, “Targeted Nanopore Sequencing by Real-time Mapping of Raw Electrical Signal with UNCALLED,” *Nature Biotechnology*, 2021.
- [13] Y. Bao, J. Wadden, J. R. Erb-Downward, P. Ranjan, W. Zhou, T. L. McDonald, R. E. Mills, A. P. Boyle, R. P. Dickson, D. Blaauw, and J. D. Welch, “SquiggleNet: Real-time, Direct Classification of Nanopore Signals,” *Genome Biology*, 2021.
- [14] “Bonito,” <https://github.com/nanoporetech/bonito>.
- [15] H. Li, “Minimap2: Pairwise Alignment for Nucleotide Sequences,” *Bioinformatics*, 2018.
- [16] R. Poplin, P.-C. Chang, D. Alexander, S. Schwartz, T. Colthurst, A. Ku, D. Newburger, J. Dijamco, N. Nguyen, P. T. Afshar, S. S. Gross, L. Dorfman, C. Y. McLean, and M. A. DePristo, “A Universal SNP and Small-Indel Variant Caller Using Deep Neural Networks,” *Nature Biotechnology*, 2018.
- [17] J. Lindegger, C. Firtina, N. M. Ghiasi, M. Sadrosadati, M. Alser, and O. Mutlu, “RawAlign: Accurate, Fast, and Scalable Raw Nanopore Signal Mapping via Combining Seeding and Alignment,” *IEEE Access*, 2024.
- [18] H. Li and J. Ruan, “Mapping Short DNA Sequencing Reads and Calling Variants Using Mapping Quality Scores,” *Genome research*, 2008.
- [19] M. Alser, Z. Bingöl, D. S. Cali, J. Kim, S. Ghose, C. Alkan, and O. Mutlu, “Accelerating Genome Analysis: A Primer on An Ongoing Journey,” *Micro*, 2020.
- [20] M. Alser, J. Rotman, K. Taraszka, H. Shi, P. I. Baykal, H. T. Yang, V. Xue, S. Knyazev, B. D. Singer, B. Balliu *et al.*, “Technology Dictates Algorithms: Recent Developments in Read Alignment,” *Genome Biology*, 2021.
- [21] Z. Bingöl, M. Alser, O. Mutlu, O. Ozturk, and C. Alkan, “GateKeeper-GPU: Fast and Accurate Pre-Alignment Filtering in Short Read Mapping,” *IPDPSW*, 2021.
- [22] M. Jain, S. Koren, K. H. Miga, J. Quick, A. C. Rand, T. A. Sasani, J. R. Tyson, A. D. Beggs, A. T. Dilthey, I. T. Fiddes, S. Malla, H. Marriott, T. Nieto, J. O’Grady, H. E. Olsen, B. S. Pedersen, A. Rhee, H. Richardson, A. R. Quinlan, T. P. Snutch, L. Tee, B. Paten, A. M. Phillippy, J. T. Simpson, N. J. Loman, and M. Loose, “Nanopore Sequencing and Assembly of a Human Genome with Ultra-Long Reads,” *Nature Biotechnology*, 2018.
- [23] D. Senol Cali, J. S. Kim, S. Ghose, C. Alkan, and O. Mutlu, “Nanopore Sequencing Technology and Tools for Genome Assembly: Computational Analysis of the Current State, Bottlenecks and Future Directions,” *Briefings in Bioinformatics*, 2018.
- [24] G. M. Cherf, K. R. Lieberman, H. Rashid, C. E. Lam, K. Karplus, and M. Akeson, “Automated Forward and Reverse Ratcheting of DNA in a Nanopore at 5-Å Precision,” *Nature Biotechnology*, 2012.
- [25] A. H. Laszlo, I. M. Derrington, B. C. Ross, H. Brinkerhoff, K. W. Langford, I. C. Nova, J. M. Samson, J. J. Bartlett, M. Pavlenok, and J. H. Gundlach, “Detection and Mapping of 5-methylcytosine and 5-hydroxymethylcytosine with Nanopore MspA,” *PNAS*, 2013.
- [26] A. H. Laszlo, I. M. Derrington, B. C. Ross, H. Brinkerhoff, A. Adey, I. C. Nova, J. M. Craig, K. W. Langford, J. M. Samson, R. Daza, K. Doering, J. Shendure, and J. H. Gundlach, “Decoding Long Nanopore Sequencing Reads of Natural DNA,” *Nature Biotechnology*, 2014.
- [27] M. Jain, H. E. Olsen, B. Paten, and M. Akeson, “The Oxford Nanopore MinION: Delivery of Nanopore Sequencing to the Genomics Community,” *Genome Biology*, 2016.

- [28] J. Shendure, S. Balasubramanian, G. M. Church, W. Gilbert, J. Rogers, J. A. Schloss, and R. H. Waterston, "DNA Sequencing at 40: Past, Present and Future," *Nature*, 2017.
- [29] A. L. Greninger, S. N. Naccache, S. Federman, G. Yu, P. Mbala, V. Bres, D. Stryke, J. Bouquet, S. Somasekar, J. M. Linnen, R. Dodd, P. Mulembakani, B. S. Schneider, J.-J. Muyembe-Tamfum, S. L. Stramer, and C. Y. Chiu, "Rapid Metagenomic Identification of Viral Pathogens in Clinical Samples by Real-time Nanopore Sequencing Analysis," *Genome Medicine*, 2015.
- [30] L. E. Kafetzopoulou, K. Eftymiadis, K. Lewandowski, A. Crook, D. Carter, J. Osborne, E. Aarons, R. Hewson, J. A. Hiscox, M. W. Carroll, R. Vipond, and S. T. Pullan, "Assessment of Metagenomic Nanopore and Illumina Sequencing for Recovering Whole Genome Sequences of Chikungunya and Dengue Viruses Directly from Clinical Samples," *Euro Surveill*, 2018.
- [31] M. Loose, S. Malla, and M. Stout, "Real-time Selective Sequencing using Nanopore Technology," *Nat. Methods*, 2016.
- [32] A. Payne, N. Holmes, T. Clarke, R. Munro, B. J. Debebe, and M. Loose, "Readfish Enables Targeted Nanopore Sequencing of gigabase-sized Genomes," *Nature Biotechnology*, 2021.
- [33] P. J. Shih, H. Saadat, S. Parameswaran, and H. Gamaarachchi, "Efficient Real-time Selective Genome Sequencing on Resource-Constrained Devices," *GigaScience*, 2023.
- [34] N. Huang, F. Nie, P. Ni, F. Luo, and J. Wang, "SACall: A Neural Network Basecaller for Oxford Nanopore Sequencing Data Based on Self-Attention Mechanism," *TCBB*, 2020.
- [35] G. Singh, M. Alser, A. Khodamoradi, K. Denolf, C. Firtina, M. B. Cavlak, H. Corporaal, and O. Mutlu, "RUBICON: A Framework for Designing Efficient Deep Learning-Based Genomic Basecallers," *Genome Biology*, 2024.
- [36] T. Dunn, H. Sadasivan, J. Wadden, K. Goliya, K.-Y. Chen, D. Blaauw, R. Das, and S. Narayanasamy, "SquiggleFilter: An Accelerator for Portable Virus Detection," in *MICRO*, 2021.
- [37] H. Mao, M. Alser, M. Sadrosadati, C. Firtina, A. Baranwal, D. S. Cali, A. Manglik, N. A. Alserr, and O. Mutlu, "GenPIP: In-Memory Acceleration of Genome Analysis via Tight Integration of Basecalling and Read Mapping," in *MICRO*, 2022.
- [38] Y. Wang, Y. Zhao, A. Bollas, Y. Wang, and K. F. Au, "Nanopore Sequencing Technology, Bioinformatics and Applications," *Nature Biotechnology*, 2021.
- [39] C. Firtina, M. Soysal, J. Lindegger, and O. Mutlu, "RawHash2: Mapping Raw Nanopore Signals Using Hash-Based Seeding and Adaptive Quantization," *Bioinformatics*, 2024.
- [40] C. Firtina, M. Mordig, H. Mustafa, S. Goswami, N. M. Ghiasi, S. Mercogliano, F. Eris, J. Lindegger, A. Kahles, and O. Mutlu, "Rawsamble: Overlapping and Assembling Raw Nanopore Signals using a Hash-based Seeding Mechanism," *arXiv*, 2024.
- [41] H. S. Edwards, R. Krishnakumar, A. Sinha, S. W. Bird, K. D. Patel, and M. S. Bartsch, "Real-Time Selective Sequencing with RUBRIC: Read Until with Basecall and Reference-Informed Criteria," *Sci. Rep.*, 2019.
- [42] H. Sadasivan, J. Wadden, K. Goliya, P. Ranjan, R. P. Dickson, D. Blaauw, R. Das, and S. Narayanasamy, "Rapid Real-time Squiggle Classification for Read Until Using RawMap," *Arch. Clin. Biomed. Res.*, 2023.
- [43] A. J. Mikalsen and J. Zola, "Coriolis: Enabling Metagenomic Classification on Lightweight Mobile Devices," *Bioinform.*, 2023.
- [44] V. S. Shivakumar, O. Y. Ahmed, S. Kovaka, M. Zakeri, and B. Langmead, "Sigmoni: Classification of Nanopore Signal with a Compressed Pangenome Index," *Bioinformatics*, 2024.
- [45] M. B. Cavlak, G. Singh, M. Alser, C. Firtina, J. Lindegger, M. Sadrosadati, N. M. Ghiasi, C. Alkan, and O. Mutlu, "TargetCall: Eliminating the Wasted Computation in Basecalling via Pre-Basecalling Filtering," *Frontiers in Genetics*, 2024.
- [46] R. E. Workman, A. D. Tang, P. S. Tang, M. Jain, J. R. Tyson, R. Razaghi, P. C. Zuzarte, T. Gilpatrick, A. Payne, J. Quick, N. Sadowski, N. Holmes, J. G. de Jesus, K. L. Jones, C. M. Soulette, T. P. Snutch, N. Loman, B. Paten, M. Loose, J. T. Simpson, H. E. Olsen, A. N. Brooks, M. Akeson, and W. Timp, "Nanopore Native RNA Sequencing of a Human Poly(A) Transcriptome," *Nature Methods*, 2019.
- [47] R. R. Wick, L. M. Judd, and K. E. Holt, "Performance of Neural Network Basecalling Tools for Oxford Nanopore Sequencing," *Genome Biology*, 2019.
- [48] Y. K. Wan, C. Hendra, P. N. Pratanwanich, and J. Göke, "Beyond Sequencing: Machine Learning Algorithms Extract Biology Hidden in Nanopore Signal Data," *Trends in Genetics*, 2022.
- [49] A. C. Rand, M. Jain, J. M. Eizenga, A. Musselman-Brown, H. E. Olsen, M. Akeson, and B. Paten, "Mapping DNA Methylation with High-throughput Nanopore Sequencing," *Nature Methods*, 2017.
- [50] J. T. Simpson, R. E. Workman, P. C. Zuzarte, M. David, L. J. Dursi, and W. Timp, "Detecting DNA Cytosine Methylation using Nanopore Sequencing," *Nature Methods*, 2017.
- [51] W. Stephenson, R. Razaghi, S. Busan, K. M. Weeks, W. Timp, and P. Smibert, "Direct Detection of RNA Modifications and Structure using Single-Molecule Nanopore Sequencing," *Cell Genomics*, 2022.
- [52] A. Sneddon, A. Ravindran, N. Hein, N. Shirokikh, and E. Eyras, "Real-time Biochemical-free Targeted Sequencing of RNA species with RISER," *bioRxiv*, 2022.
- [53] S. Kovaka, P. W. Hook, K. M. Jenike, V. Shivakumar, L. B. Morina, R. Razaghi, W. Timp, and M. C. Schatz, "Uncalled4 Improves Nanopore DNA and RNA Modification Detection via Fast and Accurate Signal Alignment," *Nature Methods*, 2024.
- [54] H. Sadasivan, D. Stiffler, A. Tirumala, J. Israeli, and S. Narayanasamy, "Accelerated Dynamic Time Warping on GPU for Selective Nanopore Sequencing," *J. Biomed. Biotechnol.*, 2023.
- [55] D. Sart, A. Mueen, W. Najjar, E. Keogh, and V. Niennattrakul, "Accelerating Dynamic Time Warping Subsequence Search with GPUs and FPGAs," in *ICDM*, 2010.
- [56] H. Gamaarachchi, C. W. Lam, G. Jayatilaka, H. Samarakoon, J. T. Simpson, M. A. Smith, and S. Parameswaran, "GPU Accelerated Adaptive Banded Event Alignment for Rapid Comparative Nanopore Signal Analysis," *BMC Bioinformatics*, 2020.
- [57] L. Guo, J. Lau, Z. Ruan, P. Wei, and J. Cong, "Hardware Acceleration of Long Read Pairwise Overlapping in Genome Sequencing: A Race between FPGA and GPU," in *FCCM*, 2019.
- [58] V. Sundaresan, S. Nichani, N. Ranganathan, and R. Sankar, "A VLSI Hardware Accelerator for Dynamic Time Warping," in *ICPR*, 1992.
- [59] K. Liyanage, H. Gamaarachchi, R. Ragel, and S. Parameswaran, "Cross Layer Design Using HW/SW Co-Design and HLS to Accelerate Chaining in Genomic Analysis," *TCAD*, 2023.
- [60] S. Samarasinghe, P. Premathilaka, W. Herath, H. Gamaarachchi, and R. Ragel, "Energy Efficient Adaptive Banded Event Alignment using OpenCL on FPGAs," in *ICIAFS*, 2021.
- [61] S. Liu, Y. Wang, and F. Wang, "A Fast Read Alignment Method based on Seed-and-Vote for Next Generation Sequencing," *BMC bioinformatics*, 2016.
- [62] Y. Liao, G. K. Smyth, and W. Shi, "The Subread Aligner: Fast, Accurate and Scalable Read Mapping by Seed-and-Vote," *Nucleic acids research*, 2013.
- [63] "Dorado," <https://github.com/nanoporetech/dorado>.
- [64] M. Alser, J. Lindegger, C. Firtina, N. Almadhoun, H. Mao, G. Singh, J. Gomez-Luna, and O. Mutlu, "From Molecules to Genomic Variations: Accelerating Genome Analysis via Intelligent Algorithms and Architectures," *CSBJ*, 2022.
- [65] Z. Xu, Y. Mai, D. Liu, W. He, X. Lin, C. Xu, L. Zhang, X. Meng, J. Mafofo, W. Zaher et al., "Fast-bonito: A Faster Deep Learning-Based Basecaller for Nanopore Sequencing," *Artificial Intelligence in the Life Sciences*, 2021.
- [66] J. Zeng, H. Cai, H. Peng, H. Wang, Y. Zhang, and T. Akutsu, "Causcall: Nanopore Basecalling using a Temporal Convolutional Network," *Frontiers in Genetics*, 2020.
- [67] G. Singh, M. Alser, K. Denolf, C. Firtina, A. Khodamoradi, M. B. Cavlak, H. Corporaal, and O. Mutlu, "RUBICON: A Framework for Designing Efficient Deep Learning-Based Genomic Basecallers," *Genome Biology*, 2024.
- [68] M. Alser, H. Hassan, A. Kumar, O. Mutlu, and C. Alkan, "Shouji: A Fast and Efficient Pre-alignment Filter for Sequence Alignment," *Bioinformatics*, 2019.
- [69] M. Alser, T. Shahroodi, J. Gomez-Luna, C. Alkan, and O. Mutlu, "SneakySnake: A Fast and Accurate Universal Genome Pre-alignment Filter for CPUs, GPUs and FPGAs," *Bioinformatics*, 2020.
- [70] H. Xin, D. Lee, F. Hormozdiari, S. Yedkar, O. Mutlu, and C. Alkan, "Accelerating Read Mapping with FastHASH," *BMC Genomics*, 2013.
- [71] M. Alser, H. Hassan, H. Xin, O. Ergin, O. Mutlu, and C. Alkan, "GateKeeper: a new Hardware Architecture for Accelerating Pre-alignment in DNA Short Read Mapping," *Bioinformatics*, 2017.
- [72] J. S. Kim, D. S. Cali, H. Xin, D. Lee, S. Ghose, M. Alser, H. Hassan, O. Ergin, C. Alkan, and O. Mutlu, "GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping Using Processing-in-memory Technologies," *BMC Genomics*, 2018.
- [73] G. Rizk and D. Lavenier, "GASSST: Global Alignment Short Sequence Search Tool," *Bioinformatics*, 2010.
- [74] F. Hach, I. Sarrafi, F. Hormozdiari, C. Alkan, E. E. Eichler, and S. C. Sahinalp, "mrsFAST-Ultra: A Compact, SNP-Aware Mapper for High Performance Sequencing Applications," *Nucleic acids research*, 2014.
- [75] A. F. Laguna, H. Gamaarachchi, X. Yin, M. Niemier, S. Parameswaran, and X. S. Hu, "Seed-and-vote based In-Memory Accelerator for DNA Read Mapping," in *ICCAD*, 2020.
- [76] H. Xin, J. Greth, J. Emmons, G. Pekhimenko, C. Kingsford, C. Alkan, and O. Mutlu, "Shifted Hamming distance: a fast and accurate SIMD-friendly filter to accelerate alignment verification in read mapping," *Bioinformatics*, 2015.
- [77] H. Xin, S. Nahar, R. Zhu, J. Emmons, G. Pekhimenko, C. Kingsford, C. Alkan, and O. Mutlu, "Optimal seed solver: optimizing seed selection in read mapping," *Bioinformatics*, 2016.
- [78] G. Myers and W. Miller, "Chaining multiple-alignment fragments in sub-quadratic time," in *ACM-SIAM Symposium on Discrete Algorithms*, 1995.
- [79] M. Alser, O. Mutlu, and C. Alkan, "MAGNET: Understanding and improving the accuracy of genome pre-Alignment filtering," 2017.
- [80] C. Firtina and C. Alkan, "On Genomic Repeats and Reproducibility," *Bioinformatics*, 2016.
- [81] R. Micheloni, A. Marelli, and K. Eshghi, "Inside Solid State Drives (SSDs)," 2018.
- [82] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, and O. Mutlu, "Error Characterization, Mitigation, and Recovery in Flash-Memory-based Solid-State Drives," *Proceedings of the IEEE*, 2017.
- [83] N. Agrawal, V. Prabhakaran, T. Wobber, J. D. Davis, M. Manasse, and R. Panigrahy, "Design Tradeoffs for SSD Performance," in *USENIX ATC*, 2008.
- [84] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, and O. Mutlu, "Errors in Flash-Memory-based Solid-State Drives: Analysis, Mitigation, and Recovery," *Inside Solid State Drives*, 2018.
- [85] K. Zhao, W. Zhao, H. Sun, X. Zhang, N. Zheng, and T. Zhang, "LDPC-in-SSD: Making Advanced Error Correction Codes Work Effectively in Solid State Drives," in *FAST'13*, 2013.
- [86] S. Tanakamaru, Y. Yanagihara, and K. Takeuchi, "Error-Prediction LDPC and Error-Recovery Schemes for Highly Reliable Solid-State Drives (SSDs)," *IEEE J. Solid-State Circuits*, 2013.
- [87] A. Gupta, Y. Kim, and B. Urgaonkar, "DFTL: A Flash Translation Layer Employing Demand-based Selective Caching of Page-level Address Mappings," in *ASPLOS*, 2009.
- [88] S.-P. Lim, S.-W. Lee, and B. Moon, "FASTER FTL for Enterprise-Class Flash Memory SSDs," in *SNAPI*, 2010.

- [89] Y. Zhou, F. Wu, P. Huang, X. He, C. Xie, and J. Zhou, "An Efficient Page-level FTL to Optimize Address Translation in Flash Memory," in *EuroSys*, 2015.

[90] A. Tavakkol, J. Gómez-Luna, M. Sadrosadati, S. Ghose, and O. Mutlu, "MQSim: A Framework for Enabling Realistic Studies of Modern Multi-Queue SSD Devices," in *FAST*, 2018.

[91] J.-Y. Shin, Z.-L. Xia, N.-Y. Xu, R. Gao, X.-F. Cai, S. Maeng, and F.-H. Hsu, "FTL Design Exploration in Reconfigurable High-Performance SSD for Server Applications," in *ICS*, 2009.

[92] J. Meza, Q. Wu, S. Kumar, and O. Mutlu, "A Large-Scale Study of Flash Memory Failures in the Field," in *ACM SIGMETRICS*, 2015.

[93] J.-B. JEDEC, "Low Power Double Data Rate 4 (LPDDR4) Standard," 2017.

[94] Samsung, "Samsung SSD 860 PRO," <https://www.samsung.com/semiconductor/minisite/ssd/product/consumer/860pro/>, 2018.

[95] J. S. Kim, M. Patel, H. Hassan, and O. Mutlu, "Solar-DRAM: Reducing DRAM Access Latency by Exploiting the Variation in Local Bitlines," in *ICCD*, 2018.

[96] D. Lee, S. Khan, L. Subramanian, S. Ghose, R. Ausavarungnirun, G. Pekhimenko, V. Seshadri, and O. Mutlu, "Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms," in *SIGMETRICS*, 2017.

[97] PCI-SIG, "PCI Express Base Specification Revision 4.0, Version 1.0," <https://pcisig.com/specifications>.

[98] K. Eshghi and R. Micheloni, "SSD Architecture and PCI Express Interface," in *SSDS*, 2018.

[99] AnandTech, "New Enterprise SSD Controllers," <https://www.anandtech.com/show/16275/new-enterprise-ssd-controllers-from-silicon-motion-phison-fadu>, 2020.

[100] D.-H. Kang, M.-S. Kim, S.-C. Jeon, W.-S. Jung, J.-Y. Park, G.-T. Choo, D.-K. Shim, A. Kavalal, S.-B. Kim, K.-M. Kang, J.-H. Lee, K.-Y. Ko, H.-W. Park, B.-J. Min, C. Yu, S.-K. Yun, N. Kim, Y. Jung, S. Seo, S. Kim, M.-K. Lee, J.-Y. Park, J.-C. Kim, Y.-S. Cha, K. Kim, Y. Jo, H. Kim, Y. Choi, J. Byun, J.-H. Park, K. Kim, T.-H. Kwon, Y. Min, C. Yoon, Y. Kim, D.-H. Kwak, E. Lee, W.-G. Hahn, K.-S. Kim, K. Kim, E. Yoon, W.-T. Kim, I. Lee, S.-H. Moon, J. Ihm, D.-S. Byeon, K.-W. Song, S. Hwang, and C.-H. Kyung, "A 512Gb 3-Bit/Cell 3D 6th-Generation V-NAND Flash Memory with 82MB/s Write Throughput and 1.2Gb/s Interface," in *ISSCC*, 2019.

[101] Samsung, "Samsung SSD PM1735," <https://www.samsung.com/semiconductor/ssd/enterprise-ssd/MZPLJ3T2HBJR-00007/>, 2020.

[102] O. Mutlu and C. Firtina, "Accelerating Genome Analysis via Algorithm-Architecture Co-Design," in *DAC*, 2023.

[103] C. Firtina, "Enabling Fast, Accurate, and Efficient Real-Time Genome Analysis via New Algorithms and Techniques," *arXiv preprint arXiv:2503.02997*, 2025.

[104] H. Sadasivan, M. Marie, E. Dawson, V. Iyer, J. Israeli, and S. Narayanasamy, "Accelerating Minimap2 for Accurate Long Read Alignment on GPUs," *Journal of biotechnology and biomedicine*, 2023.

[105] K. Liyanage, H. Samarakoon, S. Parameswaran, and H. Gamaarachchi, "Efficient End-to-End Long-read Sequence Mapping using Minimap2-FPGA integrated with hardware-accelerated Chaining," *Scientific Reports*, 2023.

[106] Y. Gu, A. Subramaniyan, T. Dunn, A. Khadem, K.-Y. Chen, S. Paul, M. Vasimuddin, S. Misra, D. Blaauw, S. Narayanasamy et al., "GenDP: A Framework of Dynamic Programming Acceleration for Genome Sequencing Analysis," in *ISCA*, 2023.

[107] F. Chen, L. Song, Y. Chen et al., "PARC: A Processing-In-CAM Architecture for Genomic Long Read Pairwise Alignment using ReRAM," in *ASP-DAC*, 2020.

[108] K. Liyanage, H. Gamaarachchi, H. Saadat, T. Li, H. Samarakoon, and S. Parameswaran, "Accelerating Chaining in Genomic Analysis Using RISC-V Custom Instructions," in *DATE*, 2024.

[109] S. Han, S. Moon, T. Suh, J. Heo, and J.-Y. Kim, "BLESS: Bandwidth and Locality Enhanced SMEM Seeding Acceleration for DNA Sequencing," in *ISCA*, 2024.

[110] Y. Li and G. Guidi, "High-Performance Sorting-Based K-mer Counting in Distributed Memory with Flexible Hybrid Parallelism," in *ICPP*, 2024.

[111] Y. Cheng, X. Sun, and Q. Luo, "RapidGKC: GPU-Accelerated K-Mer Counting," in *ICDE*, 2024.

[112] Y. Huang, L. Kong, D. Chen, Z. Chen, X. Kong, J. Zhu, K. Mamouras, S. Wei, K. Yang, and L. Liu, "CASA: An Energy-Efficient and High-Speed CAM-based SMEM Seeding Accelerator for Genome Alignment," in *MICRO*, 2023.

[113] W. Huangfu, X. Li, S. Li, X. Hu, P. Gu, and Y. Xie, "MEDAL: Scalable DIMM-based Near Data Processing Accelerator for DNA Seeding Algorithm," in *MICRO*, 2019.

[114] F. Zhang, S. Angizi, N. A. Fahmi, W. Zhang, and D. Fan, "PIM-Quantifier: A Processing-In-Memory Platform for mRNA Quantification," in *DAC*, 2021.

[115] Z. Jahshan and L. Yavits, "MajorK: Majority Based kmer Matching in Commodity DRAM," *CAL*, 2024.

[116] W. Huangfu, K. T. Malladi, S. Li, P. Gu, and Y. Xie, "NEST: DIMM-based Near-Data-Processing Accelerator for K-mer Counting," in *ICCAD*, 2020.

[117] F. Zokaei, M. Zhang, and L. Jiang, "Finder: Accelerating FM-index-based Exact Pattern Matching in Genomic Sequences through ReRAM Technology," in *PACT*, 2019.

[118] J. D. Ferreira, G. Falcao, J. Gómez-Luna, M. Alser, L. Orosa, M. Sadrosadati, J. S. Kim, G. F. Oliveira, T. Shahroodi, A. Nori, and O. Mutlu, "pLUTO: Enabling Massively Parallel Computation in DRAM via Lookup Tables," in *MICRO*, 2022.

[119] M. Lenjani, P. Gonzalez, E. Sadredini, S. Li, Y. Xie, A. Akel, S. Eilert, M. R. Stan, and K. Skadron, "Fulcrum: A Simplified Control and Access Mechanism Toward Flexible and Practical In-Situ Accelerators," in *HPCA*, 2020.

[120] W. Song, D. Koch, M. Luján, and J. Garside, "Parallel Hardware Merge Sorter," in *FCCM*, 2016.

[121] N. Samardzic, W. Qiao, V. Aggarwal, M.-C. F. Chang, and J. Cong, "Bonsai: High-Performance Adaptive Merge Tree Sorting," in *ISCA*, 2020.

[122] K. E. Batcher, "Sorting Networks and their Applications," in *AFIPS*, 1968.

[123] K. Ha, J. Jeong, and J. Kim, "An Integrated Approach for Managing Read Disturbances in High-density NAND Flash Memory," *TCAD*, 2015.

[124] Y. Luo, Y. Cai, S. Ghose, J. Choi, and O. Mutlu, "WARM: Improving NAND Flash Memory Lifetime with Write-Hotness Aware Retention Management," in *MSST*, 2015.

[125] Y. Luo, S. Ghose, Y. Cai, E. F. Haratsch, and O. Mutlu, "Improving 3D NAND Flash Memory Lifetime by Tolerating Early Retention Loss and Process Variation," *POMACS*, 2018.

[126] Y. Cai, Y. Luo, S. Ghose, and O. Mutlu, "Read Disturb Errors in MLC NAND Flash Memory: Characterization, Mitigation, and Recovery," in *DSN*, 2015.

[127] Y. Cai, Y. Luo, E. F. Haratsch, K. Mai, and O. Mutlu, "Data Retention in MLC NAND Flash Memory: Characterization, Optimization, and Recovery," in *HPCA*, 2015.

[128] Micron, "Product Flyer: Micron 3D NAND Flash Memory," [https://www.micron.com/-/media/client/global/documents/products/product-flyer/3d\\_nand\\_flyer.pdf?la=en](https://www.micron.com/-/media/client/global/documents/products/product-flyer/3d_nand_flyer.pdf?la=en), 2016.

[129] AMD®, EPYC® 7742 CPU, 2019, <https://www.amd.com/en/products/cpu/amd-epyc-7742>.

[130] N. Hajinazar, G. F. Oliveira, S. Gregorio, J. D. Ferreira, N. M. Ghiasi, M. Patel, M. Alser, S. Ghose, J. Gómez-Luna, and O. Mutlu, "SIMDRAM: A Framework for Bit-Serial SIMD Processing Using DRAM," in *ASPLOS*, 2021.

[131] J. H. Lee, H. Zhang, V. Lagrange, P. Krishnamoorthy, X. Zhao, and Y. S. Ki, "SmartSSD: FPGA-accelerated Near-Storage Data Analytics on SSD," *CAL*, 2020.

[132] SmartSSD Computational Storage Drive Installation and User Guide, Xilinx, 2021. [Online]. Available: <https://fpga.eetrend.com/files/2022-02/wen_zhang_100558024_244024-ug1382-smartsdd-csd.pdf>

[133] Y. Wang, S. Li, Q. Zheng, L. Song, Z. Li, A. Chang, and Y. Chen, "NDSEARCH: Accelerating Graph-Traversal-Based Approximate Nearest Neighbor Search through Near Data Processing," in *ISCA*, 2024.

[134] NVIDIA RTX A6000, 2020, <https://www.nvidia.com/en-us/design-visualization/rtx-a6000/>.

[135] R. Balasubramonian, A. B. Kahng, N. Muralimanohar, A. Shafiee, and V. Srinivas, "CACTI 7: New Tools for Interconnect Exploration in Innovative Off-Chip Memories," *ACM Trans. Archit. Code Optim.*, 2017.

[136] C. S. R. Group, "MQSim: A Framework for SSD Simulation - GitHub," <https://github.com/CMU-SAFARI/MQSim>, 2018.

[137] Synopsys, Inc., "Design Compiler," <https://www.synopsys.com/implementation-and-signoff/rtl-synthesis-test/design-compiler-graphical.html>.

[138] "SARS-CoV-2 Genome Dataset," 2020, <https://cadd.e3s.climb.ac.uk/SP1-raw.tgz>.

[139] "Escherichia coli Genome Dataset, SRA Accession: ERR9127551," 2021, <https://sra-pub-src-2.s3.amazonaws.com/ERR9127551/ecoli_r9.tar.gz>.

[140] "Yeast Genome Dataset, SRA Accession: SRR8648503," 2019, <https://sra-pub-src-1.s3.amazonaws.com/SRR8648503/GLUIII_basecalled_fast5_1tar.gz>.

[141] "Green Algae Genome Dataset, SRA Accession: ERR3237140," 2019, <https://sra-pub-src-2.s3.amazonaws.com/ERR3237140/Chlamydomonas_0tar.gz>.

[142] "Human Genome Dataset, SRA Accession: FAB42260," 2017, <https://s3.amazonaws.com/nanopore-human-wgs/rel6/MultiFast5Tars/FAB42260-4177064552_MultiFast5tar>.

[143] "Oxford Nanopore Human Reference Datasets," 2019, <https://github.com/nanopore-wgs-consortium/NA12878>.

[144] "SARS-CoV-2 Reference Genome GCF\_009858895.2," 2020, <https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/95/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.fna.gz>.

[145] "Escherichia coli Reference Genome GCA\_000007445.1," 2002, <https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/007/445/GCA_000007445.1_ASM744v1_genomic.fna.gz>.

[146] "Yeast Reference Genome GCA\_000146045.2," 2014, <https://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.fa.gz>.

[147] "Green Algae Reference Genome GCF\_000002595.2," 2018, <https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/025/95/GCF_000002595.2_Chlamydomonas_reinhardtii_v5.5/GCF_000002595.2_Chlamydomonas_reinhardtii_v5.5_genomic.fna.gz>.

[148] K. H. Miga, S. Koren, A. Rhie, M. R. Vollger, A. Gershman, A. Bzikadze, S. Brooks, E. Howe, D. Porubsky, G. A. Logsdon, V. A. Schneider, T. Potapova, J. Wood, W. Chow, J. Armstrong, J. Fredrickson, E. Pak, K. Tigyi, M. Kremitzki, C. Markovic, V. Maduro, A. Dutra, G. G. Bouffard, A. M. Chang, N. F. Hansen, A. B. Wilfert, F. Thibaud-Nissen, A. D. Schmitt, J.-M. Belton, S. Selvarajan, M. Y. Dennis, D. C. Soto, R. Sahasrabudhe, G. Kaya, J. Quick, N. J. Loman, N. Holmes, M. Loose, U. Surti, R. a. Risques, T. A. Graves Lindsay, R. Fulton, I. Hall, B. Paten, K. Howe, W. Timp, A. Young, J. C. Mullikin, P. A. Pevzner, J. L. Gerton, B. A. Sullivan, E. E. Eichler, and A. M. Phillippy, "Telomere-to-telomere Assembly of a Complete Human X Chromosome," *Nature*, 2020.

[149] A. Rhie, S. Nurk, M. Cechova, S. J. Hoyt, D. J. Taylor, N. Altemose, P. W. Hook, S. Koren, M. Rautiainen, I. A. Alexandrov, J. Allen, M. Asri, A. V. Bzikadze, N.-C. Chen, C.-S. Chin, M. Diekhans, P. Flicek, G. Formenti, A. Fungtammasan, C. G. Giron, E. Garrison, A. Gershman, J. L. Gerton, P. G. Grady, A. Guarracino, L. Haggerty, R. Halabian, N. F. Hansen, R. Harris, G. A. Hartley, W. T. Harvey, M. Haukness, J. Heinz, T. Hourlier, R. M. Hubley, S. E. Hunt, S. Hwang, M. Jain, R. K. Kesharwani, A. P. Lewis, H. Li, G. A. Logsdon, J. K. Lucas, W. Makalowski, C. Markovic, F. J. Martin, A. M. M. Cartney, R. C. McCoy, J. McDaniel, B. M. McNulty, P. Medvedev, A. Mikheenko, K. M. Munson, T. D. Murphy, H. E. Olsen, N. D. Olson, L. F. Paulin, D. Porubsky, T. Potapova, F. Ryabov, S. L. Salzberg, M. E. Sauria, F. J. Sedlazeck, K. Shafin, V. A. Shepelev, A. Shumate, J. M. Storer, L. Surapaneni, A. M. T. Oill, F. Thibaud-Nissen, W. Timp, M. Tomaszkiewicz, M. R. Vollger, B. P. Walenz, A. C. Watwood, M. H. Weissensteiner, A. M. Wenger, M. A. Wilson, S. Zarate, Y. Zhu, J. M. Zook, E. E. Eichler, R. J. O'Neill, M. C. Schatz, K. H. Miga, K. D. Makova, and A. M. Phillippy, "The Complete Sequence of a Human Y Chromosome," *Nature*, 2023.
- [150] S. Wei, Z. R. Weiss, and Z. Williams, "Rapid Multiplex Small DNA Sequencing on the MinION Nanopore Sequencing Platform," *G3 Genes/Genomes/Genetics*, 2018.
- [151] AMD, "AMD µProf," <https://www.amd.com/en/developer/uprof.html>.
- [152] M. Lenjani and K. Skadron, "Supporting Moderate Data Dependency, Position Dependency, and Divergence in PIM-Based Accelerators," *Micro*, 2022.
- [153] UMC, "55 / 65 / 90nm," <https://www.umc.com/en/Product/technologies/Detail/55_65_90nm>.
- [154] WikiChip, "Cascade Lake SP - Intel," <https://en.wikichip.org/wiki/intel/cores/cascade_lake_sp>.
- [155] A. Stillmaker and B. Baas, "Scaling Equations for the Accurate Prediction of CMOS Device Performance from 180nm to 7nm," *Integration*, 2017.
- [156] Y. Turakhia, G. Bejerano, and W. J. Dally, "Darwin: A Genomics Co-processor Provides up to 15,000× Acceleration on Long Read Assembly," in *ASPLOS*, 2018.
- [157] D. Fujiki, A. Subramaniyan, T. Zhang, Y. Zeng, R. Das, D. Blaauw, and S. Narayanasamy, "GenAx: A Genome Sequencing Accelerator," in *ISCA*, 2018.
- [158] A. Madhavan, T. Sherwood, and D. Strukov, "Race Logic: A Hardware Acceleration for Dynamic Programming Algorithms," in *ISCA*, 2014.
- [159] H. Cheng, Y. Zhang, and Y. Xu, "Bitmapper2: A GPU-accelerated All-Mapper based on the Sparse q-gram Index," *TCBB*, 2018.
- [160] E. J. Houtgast, V.-M. Sima, K. Bertels, and Z. Al-Ars, "Hardware Acceleration of BWA-MEM Genomic Short Read Mapping for Longer Read Lengths," *Computational Biology and Chemistry*, 2018.
- [161] E. J. Houtgast, V. Sima, K. Bertels, and Z. Al-Ars, "An Efficient GPU-accelerated Implementation of Genomic Short Read Mapping with BWA-MEM," *ACM SIGARCH Computer Architecture News*, 2017.
- [162] A. Zeni, G. Guidi, M. Ellis, N. Ding, M. D. Santambrogio, S. Hofmeyr, A. Buluç, L. Oliker, and K. Yelick, "Logan: High Performance GPU-based X-drop Long-Read Alignment," in *IPDPS*, 2020.
- [163] N. Ahmed, J. Levy, S. Ren, H. Mushtaq, K. Bertels, and Z. Al-Ars, "GASAL2: a GPU Accelerated Sequence Alignment Library for High-throughput NGS data," *BMC Bioinformatics*, 2019.
- [164] T. Nishimura, J. L. Bordim, Y. Ito, and K. Nakano, "Accelerating the Smith-Waterman Algorithm using Bitwise Parallel Bulk Computation technique on GPU," in *IPDPSW*, 2017.
- [165] E. F. de Oliveira Sandes, G. Miranda, X. Martorell, E. Ayguade, G. Teodoro, and A. C. M. Melo, "CUDAlign 4.0: Incremental Speculative Traceback for Exact Chromosome-wide Alignment in GPU Clusters," *TPDS*, 2016.
- [166] Y. Liu and B. Schmidt, "GSWABE: faster GPU-accelerated Sequence Alignment with Optimal Alignment Retrieval for Short DNA Sequences," *Concurrency and Computation: Practice and Experience*, 2015.
- [167] Y. Liu, A. Wirawan, and B. Schmidt, "CUDASW++ 3.0: Accelerating Smith-Waterman Protein Database Search by Coupling CPU and GPU SIMD Instructions," *BMC Bioinformatics*, 2013.
- [168] R. Wilton, T. Budavari, B. Langmead, S. J. Wheelan, S. L. Salzberg, and A. S. Szalay, "Arioc: High-Throughput Read Alignment with GPU-accelerated Exploration of the Seed-and-Extend Search Space," *PeerJ*, 2015.
- [169] Y. Liu, D. L. Maskell, and B. Schmidt, "CUDASW++: Optimizing Smith-Waterman Sequence Database Searches for CUDA-enabled Graphics Processing Units," *BMC Research Notes*, 2009.
- [170] Y. Liu, B. Schmidt, and D. L. Maskell, "CUDASW++ 2.0: Enhanced Smith-Waterman Protein Database Search on CUDA-enabled GPUs based on SIMT and Virtualized SIMD Abstractions," *BMC Research Notes*, 2010.
- [171] D. Fujiki, S. Wu, N. Ozog, K. Goliya, D. Blaauw, S. Narayanasamy, and R. Das, "SeedEx: A Genome Sequencing Accelerator for Optimal Alignments in Subminimal Space," in *MICRO*, 2020.
- [172] S. S. Banerjee, M. El-Hadedy, J. B. Lim, Z. T. Kalbarczyk, D. Chen, S. S. Lumetta, and R. K. Iyer, "ASAP: Accelerated Short-Read Alignment on Programmable Hardware," *TC*, 2018.
- [173] A. Goyal, H. J. Kwon, K. Lee, R. Garg, S. Y. Yun, Y. H. Kim, S. Lee, and M. S. Lee, "Ultra-fast Next Generation Human Genome Sequencing Data Processing using DRAGEN™ bio-IT Processor for Precision Medicine," *Open Journal of Genetics*, 2017.
- [174] Y.-T. Chen, J. Cong, Z. Fang, J. Lei, and P. Wei, "When Spark Meets FPGAs: A Case Study for Next-Generation DNA Sequencing Acceleration," in *HotCloud*, 2016.
- [175] P. Chen, C. Wang, X. Li, and X. Zhou, "Accelerating the Next Generation Long Read Mapping with the FPGA-based System," *TCBB*, 2014.
- [176] Y.-L. Chen, B.-Y. Chang, C.-H. Yang, and T.-D. Chiueh, "A High-Throughput FPGA Accelerator for Short-Read Mapping of the Whole Human Genome," *TPDS*, 2021.
- [177] X. Fei, Z. Dan, L. Lina, M. Xin, and Z. Chunlei, "FPGASW: Accelerating Large-scale Smith-Waterman Sequence Alignment Application with Backtracking on FPGA Linear Systolic Array," *Interdisciplinary Sciences: Computational Life Sciences*, 2018.
- [178] H. M. Waidyasooriya and M. Hariyama, "Hardware-Acceleration of Short-Read Alignment based on the Burrows-Wheeler Transform," *TPDS*, 2015.
- [179] Y.-T. Chen, J. Cong, J. Lei, and P. Wei, "A Novel High-Throughput Acceleration Engine for Read Alignment," in *FCCM*, 2015.
- [180] E. Rucci, C. Garcia, G. Botella, A. De Giusti, M. Naiouf, and M. Prieto-Matias, "SWIFOLD: Smith-Waterman Implementation on FPGA with OpenCL for Long DNA Sequences," *BMC Systems Biology*, 2018.
- [181] A. Haghi, S. Marco-Sola, L. Alvarez-Zafra, D. Diamantopoulos, C. Hagleitner, and M. Moreto, "An FPGA Accelerator of the Wavefront Algorithm for Genomics Pairwise Alignment," in *FPL*, 2021.
- [182] L. Li, J. Lin, and Z. Wang, "PipeBSW: A two-stage pipeline structure for Banded Smith-Waterman Algorithm on FPGA," in *ISVLSI*, 2021.
- [183] T. J. Ham, D. Bruns-Smith, B. Sweeney, Y. Lee, S. H. Seo, U. G. Song, Y. H. Oh, K. Asanovic, J. W. Lee, and L. W. Wills, "Genesis: A Hardware Acceleration Framework for Genomic Data Analysis," in *ISCA*, 2020.
- [184] T. J. Ham, Y. Lee, S. H. Seo, U. G. Song, J. W. Lee, D. Bruns-Smith, B. Sweeney, K. Asanovic, Y. H. Oh, and L. W. Wills, "Accelerating Genomic Data Analytics with Composable Hardware Acceleration Framework," *Micro*, 2021.
- [185] L. Wu, D. Bruns-Smith, F. A. Nothaft, Q. Huang, S. Karandikar, J. Le, A. Lin, H. Mao, B. Sweeney, K. Asanovic *et al.*, "FPGA Accelerated INDEL Realignment in the Cloud," in *HPCA*, 2019.
- [186] N. Mansouri Ghiasi, J. Park, H. Mustafa, J. Kim, A. Olgun, A. Gollwitzer, D. Senol Cali, C. Firtina, H. Mao, N. Almadhoun Alser, R. Ausavarungnirun, N. Vijaykumar, M. Alser, and O. Mutlu, "GenStore: A High-Performance in-Storage Processing System for Genome Sequence Analysis," in *ASPLOS*, 2022.
- [187] D. Senol Cali, G. Kalsi, Z. Bingöl, L. Subramanian, C. Firtina, J. Kim, R. Ausavarungnirun, M. Alser, A. Nori, J. Gómez-Luna *et al.*, "GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Framework for Genome Sequence Analysis," in *MICRO*, 2020.
- [188] W. Huangfu, S. Li, X. Hu, and Y. Xie, "RADAR: A 3D-ReRAM based DNA Alignment Accelerator Architecture," in *DAC*, 2018.
- [189] S. K. Khatamifard, Z. Chowdhury, N. Pande, M. Razaviyayn, C. Kim, and U. R. Karpuzcu, "GeNVoM: Read Mapping Near Non-Volatile Memory," *TCBB*, 2021.
- [190] S. Gupta, M. Imani, B. Khaleghi, V. Kumar, and T. Rosing, "RAPID: A ReRAM Processing In-Memory Architecture for DNA Sequence Alignment," in *ISLPED*, 2019.
- [191] X.-Q. Li, G.-M. Tan, and N.-H. Sun, "PIM-align: a Processing-In-Memory Architecture for FM-index Search Algorithm," *Journal of Computer Science and Technology*, 2021.
- [192] S. Angizi, J. Sun, W. Zhang, and D. Fan, "AlignS: A Processing-in-Memory Accelerator for DNA Short Read Alignment Leveraging SOT-MRAM," in *DAC*, 2019.
- [193] F. Zokaee, H. R. Zarandi, and L. Jiang, "AligneR: A Process-In-Memory Architecture for Short Read Alignment in RRAMs," *CAL*, 2018.
- [194] F. Zhang, S. Angizi, J. Sun, W. Zhang, and D. Fan, "Aligner-D: Leveraging In-DRAM Computing to Accelerate DNA Short Read Alignment," *JETCAS*, 2023.
- [195] Q. Lou, S. C. Janga, and L. Jiang, "Helix: Algorithm/Architecture co-design for Accelerating Nanopore Genome Base-calling," in *PACT*, 2020.
- [196] Q. Lou and L. Jiang, "Brawl: A Spintronics-based Portable Basecalling-In-Memory Architecture for Nanopore Genome Sequencing," *CAL*, 2018.
- [197] T. Shahroodi, G. Singh, M. Zahedi, H. Mao, J. Lindegger, C. Firtina, S. Wong, O. Mutlu, and S. Hamdioui, "Swordfish: A Framework for Evaluating Deep Neural Network-based Basecalling using Computation-In-Memory with Non-Ideal Memristors," in *MICRO*, 2023.
- [198] A. Ankit, I. E. Hajj, S. R. Chalamalasetti, G. Ndu, M. Foltin, R. S. Williams, P. Faraboschi, W.-m. Hwu, J. P. Strachan, K. Roy, and D. S. Milojicic, "PUMA: A Programmable Ultra-efficient Memristor-based Accelerator for Machine Learning Inference," in *ASPLOS*, 2019.
- [199] Z. Wu, K. Hammad, E. Ghafar-Zadeh, and S. Magierowski, "FPGA-Accelerated 3rd Generation DNA Sequencing," *IEEE Transactions on Biomedical Circuits and Systems (TBCS)*, 2020.
- [200] Z. Wu, K. Hammad, A. Beyene, Y. Dawji, E. Ghafar-Zadeh, and S. Magierowski, "An FPGA Implementation of a Portable DNA Sequencing Device Based on RISC-V," in *IEEE International New Circuits and Systems Conference (NEWCAS)*, 2022.
- [201] S. Angizi, J. Sun, W. Zhang, and D. Fan, "PIM-Aligner: A Processing-in-MRAM Platform for Biological Sequence Alignment," in *DATE*, 2020.
- [202] R. Kaplan, L. Yavits, and R. Ginosar, "RASSA: Resistive Prealignment Accelerator for Approximate DNA Long Read Mapping," *Micro*, 2018.
- [203] R. Kaplan, L. Yavits, and R. Ginosar, "BioSEAL: In-Memory Biological Sequence Alignment Accelerator for Large-Scale Genomic Data," in *SYSTOR*, 2020.
- [204] M. Doblas, O. Lostes-Cazorla, Q. Aguado-Puig, N. Cebray, P. Fontova-Musté, C. Batten, S. Marco-Sola, and M. Moreto, "GMX: Instruction Set Extensions for Fast, Scalable, and Efficient Genome Sequence Alignment," in *MICRO*, 2023.
- [205] Z. Jahshan, I. Merlin, E. Garzón, and L. Yavits, "DASH-CAM: Dynamic Approximate Search Content Addressable Memory for Genome Classification," in *MICRO*, 2023.
- [206] A. Haghi, L. Alvarez, J. Fornt, J. M. De Haro Ruiz, R. Figueras, M. Doblas, S. Marco-Sola, and M. Moreto, "WFAsic: A High-Performance ASIC Accelerator for DNA Sequence Alignment on a RISC-V SoC," in *ICPP*, 2023.
- [207] M. Pham, Y. Tu, and X. Lv, "Accelerating BWA-MEM Read Mapping on GPUs," in *ICS*, 2023.
- [208] H. Zhong, Z. Chen, W. Huangfu, C. Wang, Y. Xu, T. Wang, Y. Yu, Y. Liu, V. Narayanan, H. Yang *et al.*, "ASMCap: An Approximate String Matching Accelerator for Genome Sequence Analysis Based on Capacitive Content Addressable Memory," in *DAC*, 2023.
- [209] L. Burchard, M. X. Zhao, J. Langguth, A. Buluç, and G. Guidi, "Space Efficient Sequence Alignment for SRAM-Based Computing: X-Drop on the Graphcore IPU," in *SC*, 2023.
- [210] V. Y. Gudur, S. Maheshwari, A. Acharyya, and R. Shafik, "An FPGA based Energy-Efficient Read Mapper with Parallel Filtering and In-Situ Verification," *TCBB*, 2021.
- [211] B. Gu, A. S. Yoon, D.-H. Bae, I. Jo, J. Lee, J. Yoon, J.-U. Kang, M. Kwon, C. Yoon, S. Cho, J. Jeong, and D. Chang, "Biscuit: A Framework for near-Data Processing of Big Data Workloads," in *ISCA*, 2016.
- [212] Y. Kang, Y.-s. Kee, E. L. Miller, and C. Park, "Enabling Cost-Effective Data Processing with Smart SSD," in *MSST*, 2013.

- [213] X. Wang, Y. Yuan, Y. Zhou, C. C. Coats, and J. Huang, “Project Almanac: A Time-Traveling Solid-State Drive,” in *EuroSys*, 2019.
- [214] A. Acharya, M. Uysal, and J. Saltz, “Active Disks: Programming Model, Algorithms and Evaluation,” in *ASPLOS*, 1998.
- [215] K. Keeton, D. A. Patterson, and J. M. Hellerstein, “A Case for Intelligent Disks (IDISKs),” *SIGMOD Rec.*, 1998.
- [216] C. Zou and A. A. Chien, “ASSASIN: Architecture Support for Stream Computing to Accelerate Computational Storage,” in *MICRO*, 2022.
- [217] V. S. Mailthody, Z. Qureshi, W. Liang, Z. Feng, S. G. De Gonzalo, Y. Li, H. Franke, J. Xiong, J. Huang, and W.-m. Hwu, “DeepStore: In-storage Acceleration for Intelligent Queries,” in *MICRO*, 2019.
- [218] S. Pei, J. Yang, and Q. Yang, “REGISTOR: A Platform for Unstructured Data Processing inside SSD Storage,” *ACM TOS*, 2019.
- [219] S.-W. Jun, A. Wright, S. Zhang, S. Xu, and Arvind, “GraFBoost: Using Accelerated Flash Storage for External Graph Analytics,” in *ISCA*, 2018.
- [220] J. Do, Y.-S. Kee, J. M. Patel, C. Park, K. Park, and D. J. DeWitt, “Query Processing on Smart SSDs: Opportunities and Challenges,” in *SIGMOD*, 2013.
- [221] S. Seshadri, M. Gahagan, S. Bhaskaran, T. Bunker, A. De, Y. Jin, Y. Liu, and S. Swanson, “Willow: A User-Programmable SSD,” in *USENIX OSDI*, 2014.
- [222] S. Kim, H. Oh, C. Park, S. Cho, S.-W. Lee, and B. Moon, “In-Storage Processing of Database Scans and Joins,” *Information Sciences*, 2016.
- [223] E. Riedel, C. Faloutsos, G. A. Gibson, and D. Nagle, “Active Disks for Large-Scale Data Processing,” *Computer*, 2001.
- [224] E. Riedel, G. Gibson, and C. Faloutsos, “Active Storage for Large-Scale Data Mining and Multimedia Applications,” in *VLDB*, 1998.
- [225] Y. Wang, X. Pan, Y. An, J. Zhang, and G. Reinman, “BeaconGNN: Large-Scale GNN Acceleration with Out-of-Order Streaming In-Storage Computing,” in *HPCA*, 2024.
- [226] J. Park, R. Azizi, G. F. Oliveira, M. Sadrosadati, R. Nadig, D. Novo, J. Gómez-Luna, M. Kim, and O. Mutlu, “Flash-Cosmos: In-Flash Bulk Bitwise Operations Using Inherent Computation Capability of NAND Flash Memory,” in *MICRO*, 2022.
- [227] W. H. Choi, P.-F. Chiu, W. Ma, G. Hemink, T. T. Hoang, M. Lueker-Boden, and Z. Bandic, “An In-Flash Binary Neural Network Accelerator with SLC NAND Flash Array,” in *ISCAS*, 2020.
- [228] R. Han, P. Huang, Y. Xiang, C. Liu, Z. Dong, Z. Su, Y. Liu, L. Liu, X. Liu, and J. Kang, “A Novel Convolution Computing Paradigm based on NOR Flash Array with High Computing Speed and Energy Efficiency,” *TCAS I*, 2019.
- [229] W. Shim and S. Yu, “GP3D: 3D NAND Based In-Memory Graph Processing Accelerator,” *JETCAS*, 2022.
- [230] F. Merrikh-Bayat, X. Guo, M. Klachko, M. Prezioso, K. K. Likharev, and D. B. Strukov, “High-Performance Mixed-Signal Neurocomputing with Nanoscale Floating-Gate Memory Cell Arrays,” *TNNLS*, 2017.
- [231] P.-H. Tseng, F.-M. Lee, Y.-H. Lin, L.-Y. Chen, Y.-C. Li, H.-W. Hu, Y.-Y. Wang, C.-C. Hsieh, M.-H. Lee, H.-L. Lung *et al.*, “In-Memory-Searching Architecture Based on 3D-NAND Technology with Ultra-High Parallelism,” in *IEDM*, 2020.
- [232] P. Wang, F. Xu, B. Wang, B. Gao, H. Wu, H. Qian, and S. Yu, “Three-Dimensional NAND flash for Vector-Matrix Multiplication,” *TVLSI*, 2018.
- [233] H.-T. Lue, P.-K. Hsu, M.-L. Wei, T.-H. Yeh, P.-Y. Du, W.-C. Chen, K.-C. Wang, and C.-Y. Lu, “Optimal Design Methods to Transform 3D NAND Flash into a High-Density, High-Bandwidth and Low-Power Nonvolatile Computing In Memory (nvCIM) Accelerator for Deep-Learning Neural Networks (DNN),” in *IEDM*, 2019.
- [234] C. Gao, X. Xin, Y. Lu, Y. Zhang, J. Yang, and J. Shu, “ParaBit: Processing Parallel Bitwise Operations in NAND Flash Memory Based SSDs,” in *MICRO*, 2021.
- [235] B. Y. Cho, W. S. Jeong, D. Oh, and W. W. Ro, “XSD: Accelerating MapReduce by Harnessing the GPU Inside an SSD,” in *WoNDP*, 2013.
- [236] G. Koo, K. K. Matam, T. I, H. K. G. Narra, J. Li, H.-W. Tseng, S. Swanson, and M. Annavaram, “Summarizer: Trading Communication with Computing Near Storage,” in *MICRO*, 2017.
- [237] S.-W. Jun, M. Liu, S. Lee, J. Hicks, J. Ankcorn, M. King, S. Xu, and Arvind, “BlueDBM: An Appliance for Big Data Analytics,” in *ISCA*, 2015.
- [238] M. Torabzadehkashi, S. Rezaei, A. Heydarigorji, H. Bobarshad, V. Alves, and N. Bagherzadeh, “Catalina: In-storage Processing Acceleration for Scalable Big Data Analytics,” in *Euromicro PDP*, 2019.
- [239] M. Ajdari, P. Park, J. Kim, D. Kwon, and J. Kim, “CIDR: A Cost-effective in-line Data Reduction System for terabit-per-second Scale SSD Arrays,” in *HPCA*, 2019.
- [240] S. Cho, C. Park, H. Oh, S. Kim, Y. Yi, and G. R. Ganger, “Active Disk Meets Flash: A Case for Intelligent SSDs,” in *ICS*, 2013.