This document describes the design of the seqrepo package. It is currently a collection of thoughts during development. If you find dangling sentences, you should

Goals and design implications

This section summarizes goals and their architectural implications.

  • Space-efficient within a release/snapshot ⇒ compress sequences ⇒ dedupe sequences ⇒ use hashes
  • Space-efficient across releases ⇒ use hard links
  • Bandwidth-efficient distribution of incremental updates ⇒ immutable, journaled add-only sequence storage
  • Zero or more namespaced aliases associated with a sequence ⇒ store aliases for hashed sequences
  • Fast sequence lookup and slicing (random access) ⇒ when coupled with compression ⇒ blocked gzip
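The first and third goals can be illustrated with a minimal sketch. It assumes a hypothetical `DedupedStore` class (not part of seqrepo) that keys each sequence by a SHA-512 digest and only ever appends to its journal; the choice of digest here is for illustration only.

```python
import hashlib


class DedupedStore:
    """Toy add-only store (hypothetical): one record per distinct sequence."""

    def __init__(self):
        self._by_digest = {}  # digest -> sequence
        self.journal = []     # append-only record of digests, in insertion order

    def add(self, seq: str) -> str:
        """Store seq if unseen and return its digest, which serves as the key."""
        digest = hashlib.sha512(seq.encode("ascii")).hexdigest()
        if digest not in self._by_digest:
            # New content: record it once. Existing entries are never modified.
            self._by_digest[digest] = seq
            self.journal.append(digest)
        return digest

    def fetch(self, digest: str) -> str:
        return self._by_digest[digest]


store = DedupedStore()
k1 = store.add("ACGTACGT")
k2 = store.add("ACGTACGT")  # duplicate content, so nothing new is stored
k3 = store.add("TTGACA")
```

Because entries are only ever appended, an incremental update amounts to shipping the journal entries (and their files) that the receiver does not yet have.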

Space-efficient storage usually means compression. Conventional compression precludes random access within a file: reading an arbitrary region of a chromosome, for example, would require decompressing the entire chromosome up to that region.

Fortunately, the blocked gzip format (BGZF) enables random access on compressed files. The solution implemented here uses block-gzipped FASTA files with access provided by the pysam.FastaFile class. Taken together, BGZF and pysam enable compression and fast random access.
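The idea behind blocked compression can be sketched with the standard library alone. This is not the BGZF format and does not use pysam; it is an illustration, with hypothetical function names and an unrealistically small block size, of why independently compressed blocks plus an index permit random access.

```python
import zlib

BLOCK_SIZE = 64  # uncompressed bytes per block; tiny, for demonstration only


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Compress each fixed-size block independently.

    Returns (blob, index), where index[i] is the (offset, length) of
    compressed block i within blob.
    """
    blob = bytearray()
    index = []
    for start in range(0, len(data), block_size):
        comp = zlib.compress(data[start:start + block_size])
        index.append((len(blob), len(comp)))
        blob += comp
    return bytes(blob), index


def read_slice(blob: bytes, index, start: int, end: int,
               block_size: int = BLOCK_SIZE) -> bytes:
    """Return data[start:end], decompressing only the blocks that overlap it."""
    if start >= end:
        return b""
    first = start // block_size
    last = min((end - 1) // block_size, len(index) - 1)
    out = bytearray()
    for i in range(first, last + 1):
        offset, length = index[i]
        out += zlib.decompress(blob[offset:offset + length])
    rel = start - first * block_size  # position of start within the first block
    return bytes(out[rel:rel + (end - start)])


chromosome = b"ACGT" * 250  # 1000 bases
blob, index = compress_blocks(chromosome)
region = read_slice(blob, index, 130, 142)  # touches a single block
```

In the real setup, the `.fai` and `.gzi` files shown in the filesystem layout play the role of the index in this sketch.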

Space efficiency across snapshots is well served by hard-linking sequence files from one snapshot to the next. (SQLite databases are not hard-linked.)
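A minimal sketch of this snapshot strategy follows. The `snapshot` function and the file names are hypothetical, and the rule "copy anything ending in `.sqlite3`, link everything else" is an assumption made for illustration rather than a description of how seqrepo creates snapshots.

```python
import os
import shutil
import tempfile


def snapshot(src_dir: str, dst_dir: str) -> None:
    """Create dst_dir from src_dir: hard-link sequence files, copy databases."""
    for root, _dirs, files in os.walk(src_dir):
        target = os.path.join(dst_dir, os.path.relpath(root, src_dir))
        os.makedirs(target, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(target, name)
            if name.endswith(".sqlite3"):
                shutil.copy2(src, dst)  # databases change, so each snapshot gets its own copy
            else:
                os.link(src, dst)       # immutable sequence files share storage


# Build a throwaway "master" tree and snapshot it.
work = tempfile.mkdtemp()
master = os.path.join(work, "master")
os.makedirs(os.path.join(master, "sequences"))
seq_path = os.path.join(master, "sequences", "example.fa.bgz")
with open(seq_path, "wb") as fh:
    fh.write(b"placeholder sequence data")
db_path = os.path.join(master, "aliases.sqlite3")
with open(db_path, "wb") as fh:
    fh.write(b"placeholder database")

snap = os.path.join(work, "2016-08-27")
snapshot(master, snap)

same_seq = os.path.samefile(seq_path, os.path.join(snap, "sequences", "example.fa.bgz"))
same_db = os.path.samefile(db_path, os.path.join(snap, "aliases.sqlite3"))
```

Hard links require that the snapshots live on the same filesystem, and they are only safe here because sequence files are never modified after they are written.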

Components

The biocommons.seqrepo package provides five classes:

FabgzReader, FabgzWriter

  • Provide fast random access to sequences using the blocked gzip format (BGZF)
  • On commit, FabgzWriter closes the file and creates indices
  • FabgzReader is a thin wrapper around pysam.FastaFile (which provides BGZF reading)

FastaDir

  • Key-value store for sequences using immutable and journaled files
  • An SQLite database tracks metadata and the file location of each key's value (sequence)
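The division of labor described above might be sketched as follows. The table name, columns, and functions are hypothetical and do not reflect the actual FastaDir schema; a dictionary stands in for the immutable files on disk.

```python
import sqlite3

# Hypothetical schema: one row per stored sequence key.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE seqinfo ("
    " seq_id TEXT PRIMARY KEY,"
    " len INTEGER NOT NULL,"
    " relpath TEXT NOT NULL)"
)

files = {}  # relpath -> {seq_id: sequence}; stand-in for write-once files


def store(seq_id: str, seq: str, relpath: str) -> bool:
    """Add a sequence under seq_id; return False if the key already exists."""
    if db.execute("SELECT 1 FROM seqinfo WHERE seq_id = ?", (seq_id,)).fetchone():
        return False  # add-only: existing keys are never rewritten
    files.setdefault(relpath, {})[seq_id] = seq
    db.execute("INSERT INTO seqinfo VALUES (?, ?, ?)", (seq_id, len(seq), relpath))
    return True


def fetch(seq_id: str, start=None, end=None) -> str:
    """Look up the file that holds seq_id, then slice the sequence."""
    row = db.execute(
        "SELECT relpath FROM seqinfo WHERE seq_id = ?", (seq_id,)
    ).fetchone()
    if row is None:
        raise KeyError(seq_id)
    return files[row[0]][seq_id][start:end]


added = store("k1", "ACGTACGT", "2016/0824/045923/example.fa.bgz")
added_again = store("k1", "ACGTACGT", "2016/0824/045927/other.fa.bgz")
piece = fetch("k1", 2, 5)
```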

SeqAliasDb

  • Associates sequence key with "namespaced aliases" (e.g., ensembl-75, ENST00000012432)
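As a rough sketch of what "namespaced aliases" implies, the following uses a hypothetical `seqalias` table (not the actual SeqAliasDb schema) in which each (namespace, alias) pair maps to one sequence key, while a key may carry zero or more aliases. The key `"k1"` is a placeholder.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE seqalias ("
    " seq_id TEXT NOT NULL,"
    " namespace TEXT NOT NULL,"
    " alias TEXT NOT NULL,"
    " UNIQUE (namespace, alias))"  # an alias is unambiguous within its namespace
)


def add_alias(seq_id: str, namespace: str, alias: str) -> None:
    db.execute("INSERT INTO seqalias VALUES (?, ?, ?)", (seq_id, namespace, alias))


def lookup(namespace: str, alias: str):
    """Return the sequence key for a namespaced alias, or None if unknown."""
    row = db.execute(
        "SELECT seq_id FROM seqalias WHERE namespace = ? AND alias = ?",
        (namespace, alias),
    ).fetchone()
    return row[0] if row else None


def aliases_for(seq_id: str):
    """Return all (namespace, alias) pairs recorded for a sequence key."""
    return db.execute(
        "SELECT namespace, alias FROM seqalias WHERE seq_id = ?"
        " ORDER BY namespace, alias",
        (seq_id,),
    ).fetchall()


add_alias("k1", "ensembl-75", "ENST00000012432")
add_alias("k1", "example-ns", "alias-2")  # hypothetical second alias for the same key
```

Keeping aliases separate from sequence storage means a new alias can be attached to an existing sequence without touching any sequence file.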

SeqRepo

Known reference types: gi, refseq, Ensembl, LRG, GRC, BIC

hashes: SHA1, SHA1/8, SHA256, SHA512, MD5, SEGUID
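Several of the listed hashes can be computed with the standard library, as in the sketch below. The `sequence_digests` helper is hypothetical. SEGUID is taken here to be the base64-encoded SHA-1 digest of the uppercased sequence with trailing `=` padding removed; SHA1/8 is omitted because this document does not define it.

```python
import base64
import hashlib


def sequence_digests(seq: str) -> dict:
    """Return a mapping of digest name to digest string for one sequence."""
    data = seq.upper().encode("ascii")  # assumption: digests are case-insensitive
    sha1 = hashlib.sha1(data)
    return {
        "MD5": hashlib.md5(data).hexdigest(),
        "SHA1": sha1.hexdigest(),
        "SHA256": hashlib.sha256(data).hexdigest(),
        "SHA512": hashlib.sha512(data).hexdigest(),
        # SEGUID: base64 of the raw SHA-1 digest, without '=' padding
        "SEGUID": base64.b64encode(sha1.digest()).decode("ascii").rstrip("="),
    }


digests = sequence_digests("ACGTACGT")
```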

Filesystem Layout

FS Layout:

/opt/seqrepo/
├── master
│   ├── aliases.sqlite3
│   └── sequences
│       ├── 2016
│       │   ├── 0824
│       │   │   ├── 045923
│       │   │   │   ├── 1472014763.7728612.fa.bgz
│       │   │   │   ├── 1472014763.7728612.fa.bgz.fai
│       │   │   │   └── 1472014763.7728612.fa.bgz.gzi
│       │   │   ├── 045927
│       │   │   │   ├── 1472014767.3542793.fa.bgz
├── 2016-08-27
│   ├── aliases.sqlite3
│   └── sequences
│       ├── 2016 ...
├
└── 2016-08-28
    ├── aliases.sqlite3
    └── sequences
        │   ├── 0824...
        │   └── 0828
        │       ├── 000003
        │       │   ├── 1472342403.26.fa.bgz
        │       │   ├── 1472342403.26.fa.bgz.fai
        │       │   └── 1472342403.26.fa.bgz.gzi
        │           └── 1472357923.36.fa
        └── db.sqlite3
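The directory names in the listing above appear to be derived from the UTC time embedded in each file name (year, then month and day, then hour, minute, and second). The following sketch reproduces that mapping; the function name is hypothetical and the naming rule is inferred from the listing rather than taken from the seqrepo source.

```python
import posixpath
import time


def seq_file_relpath(timestamp: float) -> str:
    """Map a creation timestamp to YYYY/MMDD/HHMMSS/<timestamp>.fa.bgz (UTC)."""
    t = time.gmtime(timestamp)
    return posixpath.join(
        time.strftime("%Y", t),
        time.strftime("%m%d", t),
        time.strftime("%H%M%S", t),
        f"{timestamp}.fa.bgz",
    )


relpath = seq_file_relpath(1472342403.26)
```

Partitioning by time keeps any one directory small and means that files written after a given snapshot land in new directories, leaving earlier ones untouched.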