seqrepo includes a command line interface for loading, fetching, and exporting sequences.
This documentation assumes that the seqrepo base directory is:
SEQREPO_ROOT=/usr/local/share/seqrepo
Current convention is to add sequences to $SEQREPO_ROOT/master, then snapshot this to a dated directory like $SEQREPO_ROOT/2016-08-28. (This convention is conceptually similar to source code development on a master branch with tags.)
$ seqrepo --root-directory $SEQREPO_ROOT/master init
$ seqrepo --root-directory $SEQREPO_ROOT/master load -n NCBI mirror/ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/human.*.gz
$ seqrepo --root-directory $SEQREPO_ROOT/master show-status
seqrepo 0.1.0
root directory: /usr/local/share/seqrepo/master, 0.2 GB
backends: fastadir (schema 1), seqaliasdb (schema 1)
sequences: 3 files, 33080 sequences, 110419437 residues
aliases: 165481 aliases, 165481 current, 5 namespaces, 33080 sequences
Snapshots are made with the snapshot command:
$ seqrepo -v snapshot 2017-07-17
INFO:biocommons.seqrepo.cli:snapshot created in $SEQREPO_ROOT/2017-07-17
The snapshot command:
- creates the same directory structure as the source directory
- hardlinks the sequence files and indexes to the new location
- copies the sqlite databases
- removes write permissions from directories and sqlite databases (sequence files are made unwritable after creation).
$ seqrepo -v -r $SEQREPO_ROOT export | head
>NCBI:NM_013305.4 seguid:EqjiLe... MD5:04e8c3c75... SHA512:000a70c470f6... SHA1:12a8e22d...
GTACGCCCCCTCCCCCCGTCCCTATCGGCAGAACCGGAGGCCAACCTTCGCGATCCCTTGCTGCGGGCCCGGAGATCAAACGTGGCCCGCCCCCGGCAGG
GCACAGCGCGCTGGGCAACCGCGATCCGGCGCCGGACTGGAGGGGTCGATGCGCGGCGCGCTGGGGCGCACAGGGGACGGAGCCCGGGTCTTGCTCCCCA
- SEQREPO_BGZIP_PATH may be used to specify an alternative location for the bgzip binary. (Default: /usr/bin/bgzip)