Genome Graph Annotation Schemes

Sparse binary relation representations for genome graphs annotations

Reference

Mikhail Karasikov, Harun Mustafa, Amir Joudaki, Sara Javadzadeh-No, Gunnar Rätsch, and André Kahles. Sparse Binary Relation Representations for Genome Graph Annotation. Apr 2020. Journal of Computational Biology, 27(4), 626-639. http://doi.org/10.1089/cmb.2019.0324

This repository implements the following schemes for representing graph annotation:

Column-major compressed
Row-major flat
Rainbowfish
BinRel-WT
BRWT
Multi-BRWT
...

This repository is no longer maintained. Check out the MetaGraph project for a significantly optimized and scaled-up implementation of Multi-BRWT as well as many other graph annotation representations.

As an underlying graph structure, the following representations are implemented:

Hash-based de Bruijn graph
Complete de Bruijn graph, taking constant space

Comparison

The figures below show the final size of two compressed binary relations

Kingsford with 3.7 bln rows and 2,652 columns, density ~0.3%
Refseq (family) with 1 bln rows and 3,173 columns, density ~3.8%

Method	Kingsford, Gb	RefSeq, Gb
Column	36.6	80.2
Flat	41.2	121.6
BinRel-WT	49.6	N/A
BinRel-WT (sdsl)	31.4	150.6
Rainbowfish	23.2	136.6
BRWT	14.1	57.2
Multi-BRWT	9.9	43.6

Prerequisites

cmake 3.6.1
GNU GCC with C++17 (gcc-8 or higher) or LLVM Clang (clang-7 or higher)
HTSlib
boost
folly (optional)

All can be installed with brew or linuxbrew

For compiling with GNU GCC:

brew install gcc autoconf automake libtool cmake make htslib
brew install --build-from-source boost
(optional) brew install --build-from-source double-conversion gflags glog lz4 snappy zstd folly
brew install gcc@8

Then set the environment variables accordingly:

echo "\
# Use gcc-8 with cmake
export CC=\"\$(which gcc-8)\"
export CXX=\"\$(which g++-8)\"
" >> $( [[ "$OSTYPE" == "darwin"* ]] && echo ~/.bash_profile || echo ~/.bashrc )

For compiling with LLVM Clang:

brew install llvm libomp autoconf automake libtool cmake make htslib boost folly

Then set the environment variables accordingly:

echo "\
# OpenMP
export LDFLAGS=\"\$LDFLAGS -L$(brew --prefix libomp)/lib\"
export CPPFLAGS=\"\$CPPFLAGS -I$(brew --prefix libomp)/include\"
# Clang C++ flags
export LDFLAGS=\"\$LDFLAGS -L$(brew --prefix llvm)/lib -Wl,-rpath,$(brew --prefix llvm)/lib\"
export CPPFLAGS=\"\$CPPFLAGS -I$(brew --prefix llvm)/include\"
export CXXFLAGS=\"\$CXXFLAGS -stdlib=libc++\"
# Path to Clang
export PATH=\"$(brew --prefix llvm)/bin:\$PATH\"
# Use Clang with cmake
export CC=\"\$(which clang)\"
export CXX=\"\$(which clang++)\"
" >> $( [[ "$OSTYPE" == "darwin"* ]] && echo ~/.bash_profile || echo ~/.bashrc )

Compile

git clone --recursive https://github.com/ratschlab/genome_graph_annotation
make sure all submodules are downloaded: git submodule update --init --recursive
install third-party libraries from external-libraries/ following the corresponding istructions
or simply run the following script

git submodule update --init --recursive

pushd external-libraries/sdsl-lite
./install.sh $(pwd)
popd

pushd external-libraries/libmaus2
cmake -DCMAKE_INSTALL_PREFIX:PATH=$(pwd) .
make -j $(($(getconf _NPROCESSORS_ONLN) - 1))
make install
popd

go to the build directory mkdir -p build && cd build
compile by cmake .. && make -j $(($(getconf _NPROCESSORS_ONLN) - 1))
run unit tests ./unit_tests

Typical issues

Linking against dynamic libraries in Anaconda when compiling libmaus2
- make sure that packages like Anaconda are not listed in the exported environment variables

Build types: `cmake .. <arguments>` where arguments are:

-DCMAKE_BUILD_TYPE=[Debug|Release|Profile] -- build modes (Release by default)
-DBUILD_STATIC=[ON|OFF] -- link statically (OFF by default)
-DWITH_AVX=[ON|OFF] -- compile with support for the avx instructions (ON by default)

Typical workflow

Build de Bruijn graph from Fasta files, FastQ files, or KMC k-mer counters:
./annograph build
Annotate graph using the column compressed annotation:
./annograph annotate
Transform the built annotation to a different annotation scheme:
./annograph transform_anno
Merge annotations (optional):
./annograph merge_anno
Query annotated graph
./annograph classify

Example

./annograph build -k 12 -o tiny_example ../tests/data/tiny.fa

./annograph annotate -i tiny_example --anno-filename -o tiny_example ../tests/data/tiny.fa

./annograph classify -i tiny_example -a tiny_example.column.annodbg ../tests/data/tiny.fa

./annograph stats -a tiny_example tiny_example

Scripts

For real benchmarking scripts, see scripts.

Experiments on simulated matrices

Compressed simulated binary relation matrices can be generated using the script experiments/run_benchmarks.py. Given a column count $N_COLUMNS, the simulation mode $MODE three available simulation modes are

norepl uniformly random matrix of size 1,000,000 x $N_COLUMNS,
uniform_rows 200,000 rows of size $N_COLUMNS duplicated 5 times to form a 1,000,000 x $N_COLUMNS matrix, and
uniform_columns $N_COLUMNS / 5 columns of size 1,000,000 duplicated 5 times to form a 1,000,000 x $N_COLUMNS matrix.

The compressor $METHOD can be one of: brwt, bin_rel_wt, bin_rel_wt_sdsl, column, rbfish, flat.

An experiment can then be run with the command

run_benchmarks.py $METHOD $MODE $N_COLUMNS $N_THREADS

when . is passed in place of $METHOD and/or $MODE, all methods/modes are run.

For method brwt, additional parameters can be passed at the end of the command. These can be one of

--arity <N> generate BRWT of arity N, or
--greedy 1 --relax <N> greedy optimization of column arrangement before construction of a BRWT of maximum arity N

All resulting matrices are saved to the simulate folder in the directory where the script is run.

Figures from manuscript

To reproduce the simulated matrix experiment results from the manuscript, run the following commands

for N_COLUMNS in 500 1000 3000; do
    run_benchmarks.py . . $N_COLUMNS $N_THREADS 
    run_benchmarks.py brwt $N_COLUMNS $N_THREADS --greedy 1 --relax 7
done

To plot all data for the figures in the experiments, run the command

run_benchmarks.py plot $N_COLUMNS

An alternative set of methods can be passed as subsequent arguments if desired, for example

run_benchmarks.py plot $N_COLUMNS brwt_arity_2 brwt_greedy_relax_6 bin_rel_wt

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
annotation		annotation
common		common
dbg_hash		dbg_hash
experiments		experiments
external-libraries		external-libraries
scripts		scripts
tests		tests
.gitignore		.gitignore
.gitmodules		.gitmodules
AUTHORS		AUTHORS
CMakeLists.txt		CMakeLists.txt
CMakeLists.txt.in		CMakeLists.txt.in
CMakeListsKMC.txt.in		CMakeListsKMC.txt.in
COPYRIGHT		COPYRIGHT
LICENSE		LICENSE
README.md		README.md
config.cpp		config.cpp
config.hpp		config.hpp
main.cpp		main.cpp

License

ratschlab/genome_graph_annotation

Folders and files

Latest commit

History

Repository files navigation

Genome Graph Annotation Schemes

Reference

This repository is no longer maintained. Check out the MetaGraph project for a significantly optimized and scaled-up implementation of Multi-BRWT as well as many other graph annotation representations.

Comparison

Prerequisites

For compiling with GNU GCC:

For compiling with LLVM Clang:

Compile

Typical issues

Build types: cmake .. <arguments> where arguments are:

Typical workflow

Example

Scripts

Experiments on simulated matrices

Figures from manuscript

About

Topics

Resources

License

Stars

Watchers

Forks

Languages

Build types: `cmake .. <arguments>` where arguments are: