From Molecules 🧬🧪 to Genomic Variations 🩺💊: Accelerating Genome Analysis via Intelligent Algorithms and Architectures

We describe the ongoing journey in significantly improving the performance, accuracy, and efficiency of genome analysis using intelligent algorithms and hardware architectures. We need to read, analyze, and interpret our genomes not only quickly, but also accurately and efficiently enough to scale the analysis to population level. There currently exist major computational bottlenecks and inefficiencies throughout the entire genome analysis pipeline, because state-of-the-art genome sequencing technologies are still not able to read a genome in its entirety. Our paper is the first to provide a comprehensive survey of a prominent set of algorithmic improvement and hardware acceleration efforts for the entire genome analysis pipeline.

We explain state-of-the-art algorithmic methods and hardware-based acceleration approaches for each step of the genome analysis pipeline and provide experimental evaluations. Algorithmic approaches exploit the structure of the genome as well as the structure of the underlying hardware. Hardware-based acceleration approaches exploit specialized microarchitectures or various execution paradigms (e.g., processing inside or near memory) along with algorithmic changes, leading to new hardware/software co-designed systems. We conclude by outlining the future challenges, benefits, and research directions triggered by the development of both very low-cost yet highly error-prone new sequencing technologies and specialized hardware chips for genomics. We hope that these efforts and the challenges we discuss provide a foundation for future work in making genome analysis more intelligent.

Computer algorithms and hardware architectures are called intelligent if they efficiently satisfy three principles: they are data-centric, data-driven, and architecture/algorithm/data-aware.

First, we would like to process genomic data efficiently by minimizing data movement and maximizing the efficiency with which data is handled, i.e., stored, accessed, and processed.

Second, we would like to take advantage of the vast amounts of genomic data and metadata to continuously improve decision making (self-optimizing decisions) for many different use cases in science, medicine, and technology.

Third, we would like to orchestrate the multiple components across the entire analysis system and adapt algorithms by understanding the structure of the underlying hardware, understanding analysis algorithms, and understanding various properties (i.e., the structure of the genome, type of sequencing data, quality of sequencing data) of each piece of data.

Table of Contents

Introduction
Obtaining Genomic Sequencing Data
1. Generating Sequencing Data
2. Downloading Real Sequencing Data
3. Simulating Sequencing Data
Types of Genomic Sequencing Data
1. Short Reads
2. Ultra-long Reads
3. Accurate Long Reads
4. Discussion on Types of Sequencing Reads
Genome Analysis Using Different Types of Sequencing Reads
Basecalling
1. Illumina
2. ONT
3. PacBio
Quality Control
Read Mapping
1. Accelerating Indexing and Seeding
  1. Sampling Seeds
  2. Improving Data Structures for Seed Lookups
  3. Reducing Data Movement During Indexing
2. Accelerating Pre-Alignment Filtering
  1. Pigeonhole Principle
  2. Base Counting
  3. q-gram Filtering Approach
  4. Sparse Dynamic Programming
3. Accelerating Sequence Alignment
  1. Accurate Alignment Accelerators
  2. Alignment Accelerators with Limited Functionality
Variant Calling
Discussion and Future Opportunities

Citing this paper

If you use this paper in your work, please cite:

Mohammed Alser, Joel Lindegger, Can Firtina, Nour Almadhoun, Haiyu Mao, Gagandeep Singh, Juan Gomez-Luna, Onur Mutlu. "From Molecules to Genomic Variations: Accelerating Genome Analysis via Intelligent Algorithms and Architectures," arXiv preprint arXiv:2205.07957 (2022). https://arxiv.org/abs/2205.07957

The corresponding BibTeX entry is:

@misc{https://doi.org/10.48550/arxiv.2205.07957,
  doi = {10.48550/ARXIV.2205.07957},
  url = {https://arxiv.org/abs/2205.07957},
  author = {Alser, Mohammed and Lindegger, Joel and Firtina, Can and Almadhoun, Nour and Mao, Haiyu and Singh, Gagandeep and Gomez-Luna, Juan and Mutlu, Onur},
  title = {Going From Molecules to Genomic Variations to Scientific Discovery: Intelligent Algorithms and Architectures for Intelligent Genome Analysis},
  publisher = {arXiv},
  year = {2022}
}
