
Prefilter for mapping

Problem: Mapping metagenomes against large databases, such as the GMGCv1, takes too much memory. Partitioning the database is a common solution (and is supported by NGLess), but it has drawbacks: it is slow.

This repository explores the possibility of prefiltering the database by removing sequences that are extremely unlikely to be matches.

Approach

  1. Parse all the reads and collect all randstrobes (or rather their hashes)
  2. Parse the database and select only unigenes that are expected to be present in the reads
  3. Map as usual to the pre-filtered database
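Step 1 above can be sketched in miniature. This is a simplified toy randstrobe: the hash function, linking rule, and parameters are illustrative assumptions, and strobealign's actual randstrobes differ in the details:

```python
import hashlib

def kmer_hash(kmer):
    # Stable 64-bit hash of a k-mer (Python's built-in hash() is randomized per run)
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), 'little')

def randstrobe_hashes(seq, k=8, w_min=3, w_max=12):
    """Return the set of randstrobe hashes of a sequence (toy version).

    For each k-mer (first strobe), the second strobe is the k-mer in the
    downstream window [i+w_min, i+w_max] minimizing a link function.
    """
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    out = set()
    for i in range(len(hashes)):
        lo, hi = i + w_min, min(i + w_max, len(hashes) - 1)
        if lo > hi:
            break
        h1 = hashes[i]
        # Link to the window k-mer minimizing XOR with the first strobe
        j = min(range(lo, hi + 1), key=lambda p: h1 ^ hashes[p])
        # Asymmetric combination so that strobe order matters
        out.add((h1 // 2) + (hashes[j] // 3))
    return out
```

Because the linking windows are relative, reads and references that share sequence also share interior randstrobe hashes, which is what the prefilter exploits.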

For step 2, different strategies are possible. The simplest is to keep any unigene that shares any hash with the set of hashes from the reads. Strategies currently being considered:

  • min1: keep all references that match at least one hash
  • min2: keep all references that match at least two hashes

We also tested counting shared hashes exactly versus approximating the counts with a hacky Bloom-filter-like structure (a single fixed-size array), but the hacky version gave poor estimates.
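The min1/min2 strategies can be sketched as follows (the function name and the in-memory dict-of-sets representation are illustrative; the real pipeline streams the database rather than holding it in memory):

```python
def filter_references(read_hashes, references, min_shared=1):
    """Keep references that share at least `min_shared` distinct hashes
    with the pooled read-hash set (min1: min_shared=1; min2: min_shared=2).

    read_hashes: set of randstrobe hashes pooled over all reads
    references:  dict mapping reference name -> set of its randstrobe hashes
    """
    kept = {}
    for name, ref_hashes in references.items():
        shared = 0
        for h in ref_hashes:
            if h in read_hashes:
                shared += 1
                if shared >= min_shared:  # early exit once the threshold is met
                    kept[name] = ref_hashes
                    break
    return kept
```

min2 trades some sensitivity for a smaller filtered database, on the reasoning that a single shared hash is more likely to be a spurious match than two.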

Requirements

To install most dependencies (assuming you have conda-forge & bioconda set up):

conda install python=3.11 numpy pandas requests tabulate jug ngless

To install strobealign's Python bindings (which are not installed by default with conda):

# To ensure you have a recent C++ compiler (not always needed)
conda install gxx_linux-64 gcc_linux-64
export CC CXX

git clone https://github.com/ksahlin/strobealign
cd strobealign
pip install .

If available, stly is used to save memory; otherwise, the code will fall back on the standard Python set (which is actually faster).

Data

  1. Database: GMGCv1 (from Coelho et al., 2022). This is downloaded by jugfile.py
  2. Metagenomes: the dog dataset (from Coelho et al., 2018) and a human gut dataset (from Zeller et al., 2014). These can be downloaded with ena-mirror. More guidance on how to do this will be provided soon, but get in touch if you have questions.

Note that running this benchmark will use a lot of disk storage!

About

Benchmarking pre-filtering large databases before short-read mapping