Skip to content

Filter out sequences in FASTQ files that match a reference genome

License

Notifications You must be signed in to change notification settings

petersm3/filterfastq

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

36 Commits
 
 
 
 
 
 

Repository files navigation

Summary

Filter out sequences in Illumina HiSeq/MiSeq gzipped FASTQ files (single or paired-end) that match a reference genome, e.g., PhiX (Control Libraries)

Disclaimer

  • For removal of contamination it is recommended that you use other programs, e.g., JGI's BBDuk "Kmer filtering" feature as part of Data Preprocessing
  • filterfastq.pl was originally written as a proof of concept in 2015 (and used in production on a limited basis) due to Illumina's elimination of the dedicated PhiX control lane (#8) on their HiSeq 3000/4000 instruments
    • e.g., Sequencing a non-indexed sample (whole genome) with a large PhiX spike-in (e.g., 10%) may have required filtering of the final set of FASTQ reads
  • filterfastq.pl uses NCBI's command-line BLAST (blastn) to compare sequences in a semi-serial fashion, i.e., sets of 4 million sequences, using multiple blastn threads, against the reference genome (as a BLAST database), which can take a long time
    • e.g., Approximately 11.75 days to process 264,724,686 single end reads, which had a recorded PhiX spike-in at 0.6%; 0.521% reads were filtered out by the script

Dependencies

Usage

$ ./filterfastq.pl -h

Usage: ./filterfastq.pl -r reference_genome.fa -i fastq_input_dir -o fastq_output_dir [-t threads] [-e evalue]

-r, --reference
       FASTA reference genome for sequences to be filtered against
-i, --input
       Directory containing the gzipped FASTQ file(s) to be filtered
       Sets of paired-end reads need to be in the same directory and are
       identified by their filenames containing either *_R1_* and *_R2_*
-o, --output
       Directory to store the filtered FASTQ and intermediary files
       Script will create output directory but not parents, e.g., mkdir -p
-t, --threads
       Optional number of threads for blastn to use (default is 1)
-e, --evalue
       Optional evalue for blastn to use (default is 1e-15)
-q, --quiet
       Do not write script progress to standard output
-h, --help
       This usage information.

About

Filter out sequences in FASTQ files that match a reference genome

Topics

Resources

License

Stars

Watchers

Forks

Languages