kmtricks

kmtricks is a modular tool suite for counting k-mers and constructing Bloom filters or k-mer matrices from large collections of sequencing data.

Citation

Lemane, T., Medvedev, P., Chikhi, R., & Peterlongo, P. (2022). kmtricks: Efficient and flexible construction of Bloom filters for large sequencing data collections. Bioinformatics Advances.

Rationale

kmtricks is optimized for the analysis of multiple FASTA/FASTQ files (gzipped or not). It features:

- Fast k-mer matrix construction
- Fast Bloom filter construction
- Rescue of low-abundance k-mers that are seen in multiple samples

Note: kmtricks can count k-mers from a single file, but it is slightly slower than a traditional k-mer counter (e.g. KMC). It is optimized for merging count information across multiple samples, which traditional k-mer counters cannot do.
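
The rescue idea listed above can be illustrated with a minimal sketch. It assumes a simple rule: a low-abundance count is kept only if the same k-mer is abundant ("solid") in enough other samples. The function name and the solid_min and share_min thresholds are hypothetical and chosen for illustration; they are not kmtricks options, and this is not the kmtricks implementation.

```python
from typing import Dict, List


def rescue_kmers(
    counts: Dict[str, List[int]],
    solid_min: int = 2,   # hypothetical: abundance at which a k-mer counts as solid
    share_min: int = 2,   # hypothetical: number of solid samples needed for a rescue
) -> Dict[str, List[int]]:
    """Filter a k-mer count table (k-mer -> one count per sample).

    A count >= solid_min is always kept. A lower, non-zero count is kept
    ("rescued") only if the k-mer is solid in at least share_min samples.
    All other counts are set to 0, and k-mers left with no counts are dropped.
    """
    filtered: Dict[str, List[int]] = {}
    for kmer, row in counts.items():
        # Number of samples in which this k-mer is solid.
        n_solid = sum(1 for c in row if c >= solid_min)
        kept = []
        for c in row:
            if c >= solid_min:
                kept.append(c)            # solid: keep as-is
            elif c > 0 and n_solid >= share_min:
                kept.append(c)            # low abundance, but rescued
            else:
                kept.append(0)            # likely noise: discard
        if any(kept):
            filtered[kmer] = kept
    return filtered


table = {
    "ACGT": [1, 5, 7],  # the low count in sample 0 is backed by two solid samples
    "TTGA": [1, 0, 0],  # seen once in one sample only
    "GGCA": [1, 1, 3],  # solid in a single sample
}
rescued = rescue_kmers(table)
```

A single-sample counter would have to discard every count of 1 here; looking across samples is what allows the first k-mer to be kept in sample 0.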

Overview

Input: a set of read sets in FASTA or FASTQ format, gzipped or not.

The final output is one of:

- A matrix of k-mer abundances: M_i,j is the abundance of k-mer i in read set j.
- A matrix of k-mer membership: M_i,j is the presence (1) or absence (0) of k-mer i in read set j.
- A vector of Bloom filters: M_i,j is the presence (1) or absence (0) of hash value i (row numbers are hash values) in read set j.
  - In this case, the matrix is provided vertically (one column is a Bloom filter corresponding to one dataset).
  - After transposition, the matrix may also be provided horizontally (one row is a Bloom filter corresponding to one dataset). This makes it possible to efficiently provide an independent Bloom filter per input read file.
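
The three output layouts described above can be sketched on toy data. This is an illustration only: it works in memory on plain strings, ignores reverse complements, and uses a single CRC32 hash for the Bloom filter rows. All function names are hypothetical and none of them is part of kmtricks.

```python
import zlib
from collections import Counter
from typing import List, Tuple


def kmers(seq: str, k: int) -> List[str]:
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def abundance_matrix(samples: List[List[str]], k: int) -> Tuple[List[str], List[List[int]]]:
    """Rows are k-mers (sorted), columns are read sets, M[i][j] is a count."""
    counters = [Counter(km for read in reads for km in kmers(read, k)) for reads in samples]
    rows = sorted(set().union(*counters))
    return rows, [[c[km] for c in counters] for km in rows]


def presence_matrix(matrix: List[List[int]]) -> List[List[int]]:
    """Same shape as the abundance matrix, with 1 for present and 0 for absent."""
    return [[1 if v > 0 else 0 for v in row] for row in matrix]


def bloom_matrix(samples: List[List[str]], k: int, m: int) -> List[List[int]]:
    """Rows are hash values in [0, m), columns are read sets (vertical layout)."""
    bits = [[0] * len(samples) for _ in range(m)]
    for j, reads in enumerate(samples):
        for read in reads:
            for km in kmers(read, k):
                bits[zlib.crc32(km.encode()) % m][j] = 1
    return bits


def transpose(matrix: List[List[int]]) -> List[List[int]]:
    """Turn the vertical layout into one row per read set (horizontal layout)."""
    return [list(col) for col in zip(*matrix)]


samples = [["ACGTA"], ["CGTAC", "ACG"]]   # two toy read sets
rows, counts = abundance_matrix(samples, k=3)
presence = presence_matrix(counts)
vertical = bloom_matrix(samples, k=3, m=16)
horizontal = transpose(vertical)          # horizontal[j] is the filter of read set j
```

In the vertical layout, one row answers "which read sets contain this hash value?", while in the horizontal layout each row is a standalone filter for one read set.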

Installation and usage

Instructions for installation and usage are provided in the wiki.

Limitations

kmtricks needs disk space to run. Disk usage varies with the data, the parameters and the output format. Based on our observations, the required space, outputs included, ranges from 20% of the total (gzipped) input size up to the total input size.
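
As a rough planning aid, the observed range above can be turned into a small calculation. This sketch assumes both bounds are fractions of the total gzipped input size; the function name and its parameters are hypothetical, and the result is an estimate rather than a guarantee.

```python
from typing import Iterable, Tuple


def disk_usage_range(
    gzipped_input_bytes: Iterable[int],
    low_fraction: float = 0.20,   # lower bound observed by the authors
    high_fraction: float = 1.00,  # upper bound: the total input size
) -> Tuple[float, float]:
    """Return (low, high) estimates in bytes of the disk space needed for a run."""
    total = sum(gzipped_input_bytes)
    return low_fraction * total, high_fraction * total


# Example: three gzipped FASTQ files of 40, 60 and 100 GB.
GB = 10**9
low, high = disk_usage_range([40 * GB, 60 * GB, 100 * GB])
```

For this example the estimate spans roughly 40 GB to 200 GB, so provisioning for the upper bound is the safer choice.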

Reporting an issue

If you encounter a problem, please open an issue with a description of your run and the output of kmtricks infos. If you encounter a critical error such as a segmentation fault, kmtricks automatically dumps a file named kmtricks_backtrace.log in your current directory. This file is hard to read in release mode. If you can, compile kmtricks in debug mode, run it again and attach the content of this file to the issue. If you cannot compile kmtricks on your system, the conda package provides a kmtricks-debug binary for this case.

Reference

T. Lemane, P. Medvedev, R. Chikhi and P. Peterlongo, "kmtricks: Efficient and flexible construction of Bloom filters for large sequencing data collections." Bioinformatics Advances, 2022, doi:10.1093/bioadv/vbac029.

@article{kmtricks,
    author = {Lemane, Téo and Medvedev, Paul and Chikhi, Rayan and Peterlongo, Pierre},
    title = "{kmtricks: Efficient and flexible construction of Bloom filters for large sequencing data collections}",
    journal = {Bioinformatics Advances},
    year = {2022},
    doi = {10.1093/bioadv/vbac029},
    url = {https://doi.org/10.1093/bioadv/vbac029},
}

Contacts

Téo Lemane: teo[dot]lemane[at]proton[dot]me
Rayan Chikhi: rayan[dot]chikhi[at]pasteur[dot]fr
Pierre Peterlongo: pierre[dot]peterlongo[at]inria[dot]fr
