rdxon: FASTQ filtering for rare (somatic) variants

rdxon filters FASTQ files against the k-mers of all 1000 Genomes sequencing data. It keeps only reads that contain k-mers absent from 1000 Genomes.
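
To illustrate the idea, here is a toy sketch of k-mer based read filtering written with awk. It is not how rdxon is implemented. The helper name `rare_reads`, the plain-text inputs (one k-mer or one read sequence per line), and the rule "keep a read if at least one of its k-mers is missing from the reference set" are assumptions made for illustration only, and reverse complements are ignored.

```shell
# Toy illustration only (hypothetical helper, not part of rdxon).
# Usage: rare_reads K KNOWN_KMERS READS
#   KNOWN_KMERS: text file, one reference k-mer per line
#   READS:       text file, one read sequence per line
rare_reads() {
  awk -v k="$1" -v known_file="$2" '
    BEGIN {
      # Load the reference k-mer set into memory.
      while ((getline line < known_file) > 0) known[line] = 1
    }
    {
      # Slide a window of length k over the read and print the read as
      # soon as one k-mer is not found in the reference set.
      for (i = 1; i + k - 1 <= length($0); i++) {
        if (!(substr($0, i, k) in known)) { print; next }
      }
    }
  ' "$3"
}
```

With k=3 and a reference set containing ACG and CGT, the read ACGT would be dropped because both of its 3-mers are known, while ACGA would be kept because CGA is not in the set.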

Installation

rdxon is available as a pre-compiled, statically linked binary from rdxon's GitHub release page, as a Singularity container (SIF file), or as a minimal Docker container. Alternatively, you can build rdxon from source:

git clone --recursive https://github.com/tobiasrausch/rdxon.git

cd rdxon/

make all

1000 Genomes k-mer map

Download the 1000 Genomes k-mer maps here: http://gear-genomics.embl.de/data/rdxon/

Running

To filter an input FASTQ file against the 1000 Genomes sequencing data, run:

rdxon filter -x kmer.x.map -y kmer.y.map -o <output.fq.gz> <input.fq.gz>

You can also write all rare k-mers, i.e. those absent from 1000 Genomes, to a file with the -u option:

rdxon filter -x kmer.x.map -y kmer.y.map -u <kmer.gz> -o <output.fq.gz> <input.fq.gz>

For paired-end data you can run Read1 and Read2 in parallel and then concatenate the output FASTQ files.
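
A minimal shell sketch of this approach follows. The helper name `filter_both_mates`, the output file names, and the `RDXON` variable (which only lets you point at a specific binary) are hypothetical; the rdxon arguments are the ones shown above. The sketch assumes the outputs are gzip-compressed, as the `.fq.gz` suffix suggests; concatenated gzip files form a valid gzip file, so no decompression is needed.

```shell
# Path to the rdxon binary (assumption: override if it is not on PATH).
RDXON="${RDXON:-rdxon}"

# Usage: filter_both_mates READ1 READ2 OUTPREFIX   (hypothetical helper)
# Writes OUTPREFIX.fq.gz containing the filtered reads of both mates.
filter_both_mates() {
  # Start one filter job per mate in the background.
  "$RDXON" filter -x kmer.x.map -y kmer.y.map -o "$3.read1.fq.gz" "$1" &
  pid1=$!
  "$RDXON" filter -x kmer.x.map -y kmer.y.map -o "$3.read2.fq.gz" "$2" &
  pid2=$!

  # Wait for both jobs and stop if either one failed.
  wait "$pid1"; status1=$?
  wait "$pid2"; status2=$?
  if [ "$status1" -ne 0 ] || [ "$status2" -ne 0 ]; then
    echo "rdxon filter failed" >&2
    return 1
  fi

  # Concatenate the two filtered outputs into a single file.
  cat "$3.read1.fq.gz" "$3.read2.fq.gz" > "$3.fq.gz"
}
```

For example, `filter_both_mates sample.1.fq.gz sample.2.fq.gz sample.filtered` would leave the combined reads in `sample.filtered.fq.gz`. Note that the concatenated file is not interleaved, so read pairing is lost.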

Paired-end mode

Some downstream applications require properly paired reads. To keep read pairs intact, use the paired-end mode of the filter subcommand:

rdxon filter -x kmer.x.map -y kmer.y.map -o <outprefix> <read1.fq.gz> <read2.fq.gz>
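
If you want to confirm that two FASTQ files are still properly paired, a simple sanity check is to compare their record counts. The sketch below assumes gzip-compressed FASTQ with exactly four lines per record; the helper names `fastq_records` and `check_pairs` are hypothetical, and the file paths are placeholders because the names rdxon derives from `<outprefix>` are not listed here.

```shell
# Count FASTQ records in a gzip-compressed file (assumes 4 lines per record).
fastq_records() {
  lines=$(gzip -dc "$1" | wc -l)
  echo $((lines / 4))
}

# Usage: check_pairs READ1.fq.gz READ2.fq.gz   (hypothetical helper)
# Succeeds only if both files hold the same number of records.
check_pairs() {
  n1=$(fastq_records "$1")
  n2=$(fastq_records "$2")
  if [ "$n1" -eq "$n2" ]; then
    echo "OK: $n1 read pairs"
  else
    echo "MISMATCH: $n1 vs $n2 records" >&2
    return 1
  fi
}
```

Equal record counts are a necessary condition for proper pairing, not a proof of it; a stricter check would also compare read names in order.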

Rare and somatic k-mers

For tumor-normal sequencing in cancer genomics, you can also filter for reads that contain rare and somatic k-mers:

rdxon somatic -x kmer.x.map -y kmer.y.map -o <output.fq.gz> <tumor.fq.gz> <control.fq.gz>

The somatic subcommand is also available in paired-end mode:

rdxon somatic -x kmer.x.map -y kmer.y.map -o <outprefix> <tumor.1.fq.gz> <tumor.2.fq.gz> <control.1.fq.gz> <control.2.fq.gz>

Approximate runtime and memory usage for filtering reads containing rare k-mers

Whole-exome sequencing: ~1 hour and ~4 GB RAM (single CPU, one job for Read1 and Read2)

Whole-genome sequencing: ~6 hours and ~4 GB RAM (single CPU, one job for Read1 and Read2)

Acknowledgement

The 1000 Genomes high-coverage data were generated at the New York Genome Center with funds provided by NHGRI Grant 3UM1HG008901-03S1. All cell lines were obtained from the Coriell Institute for Medical Research and from the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research. More information regarding the 1000 Genomes high-coverage data and data reuse is available here: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/.
