benjsmith/mubiomics

Software for Massively-Parallel Next-Generation Sequencing Data

Description:

A package for processing short-read sequences generated by "next-generation" sequencing platforms (e.g. Roche 454, Illumina HiSeq2000, Illumina MiSeq). A combination of generally useful scripts and those specific to microbiome analysis.

Code is now published as part of a manuscript. Please cite us if you use it: pubmed link

The mubiomics package is licensed under GNU GPLv3.

Contents:

Generally useful:

  • qc.py a quality control script. Filters and trims reads based on a number of settable parameters. Takes FASTA + QUAL or FASTQ files as input and can output FASTA or FASTQ.

  • demultiplexer.py a sequencing run demultiplexer for assigning reads to multiple samples based on a DNA barcode contained within the read. Currently supports 8-bp Hamming barcodes and regular barcodes of any fixed length. Can demultiplex single- or paired-end runs either before or after running quality control (qc.py). Barcodes can be contained on either or both ends of the reads, and the way reads are demultiplexed is controlled by the mapping file and the command-line parameters. Additional useful features are:

    1. It checks the primer sequence as well as the barcode, so multiple primer sets can be used with the same barcode and reads will still be demultiplexed correctly. It also tolerates a user-specified number of mismatches in the primer.
    2. If there is uncertainty about where in each read the barcode begins, it can search the read for the barcode, while tolerating a user-specified number of mismatches in the barcode.
    3. The number of "padding" nucleotides before and after the barcode can be specified by setting the appropriate parameters (or set to 0 if none were included).

    Input is FASTA or FASTQ files and a mapping file. For single-end reads, output is a FASTA or FASTQ file of assigned reads, where the header name has been modified to reflect the sample of origin according to the mapping file, and a file of unassigned sequences (if desired). For paired-end runs, it produces a pair of matched files containing all paired sequences: the forward reads are all in one file and the reverse reads are all in the ".mates" file. It also produces a file of "singletons" (reads for which a barcode was identified but no mate was found) and, optionally, a file of unassigned sequences.

    For demultiplexing large files, the default method requires a large amount of RAM (approx. 1/5th of the total input file size). If memory is limited, a viable strategy is to split the original file(s) into smaller pieces before processing (Note: for paired-end reads this should be done prior to running qc.py, since an uneven number of reads may be removed from each file by qc.py). A mode is also available that indexes sequences on the hard drive and uses only a small amount of RAM even for large files, removing the need to split. This is easier for the user than splitting, but runs significantly more slowly on most hard drives.

  • trim_by_seq.py a script to trim, e.g., primers from both ends of reads in a sequence file. Allows specification of a 5' sequence and an optional 3' reverse complement sequence. Accepts ambiguous nucleotide values in the 5' and 3' sequences. Returned reads will begin after the 5' sequence and end before the 3' sequence. If either sequence is not located, that end will not be trimmed.

  • separate_assigned_reads.py will take a FASTA or FASTQ file that has been demultiplexed and create a folder of files containing the sample-specific sequences, one for each of the samples in the mapping file (the same mapping file used for demultiplexer.py). This step is essential for using pplacer to place reads from multiple samples.

  • seqparser.py provides several sequence manipulation functions, including conversion between file formats (e.g., from FASTA + QUAL files to FASTQ files), complement, reverse complement and reverse.
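
The barcode search described for demultiplexer.py (scanning for the barcode near the start of a read while tolerating mismatches) can be illustrated with a minimal sketch. This is not the code used in demultiplexer.py; the function names, the barcode-to-sample dictionary and the `max_offset` parameter are all hypothetical and chosen only for illustration. It uses the standard library only and does not try to resolve reads that match two barcodes equally well.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


def find_barcode(read, barcodes, max_mismatches=1, max_offset=0):
    """Return (sample, offset) for the closest barcode match, or None.

    barcodes maps barcode sequence -> sample name (a hypothetical layout,
    not the mubiomics mapping-file format). Start positions 0..max_offset
    are tried in order; on equal distance the earliest match is kept.
    """
    best = None  # (distance, offset, sample)
    for offset in range(max_offset + 1):
        for barcode, sample in sorted(barcodes.items()):
            window = read[offset:offset + len(barcode)]
            if len(window) < len(barcode):
                continue  # read too short to hold the barcode at this offset
            dist = hamming_distance(window, barcode)
            if dist <= max_mismatches and (best is None or dist < best[0]):
                best = (dist, offset, sample)
    if best is None:
        return None
    return best[2], best[1]


if __name__ == "__main__":
    barcodes = {"ACGTACGT": "sampleA", "TTGGCCAA": "sampleB"}
    # Barcode starts two bases into the read (two "padding" nucleotides).
    print(find_barcode("NNACGTACGTGGGG", barcodes, max_mismatches=0, max_offset=3))
    # One mismatch in the barcode, accepted only if one mismatch is tolerated.
    print(find_barcode("ACGTACCTAAAA", barcodes, max_mismatches=1))
```

Restricting `max_offset` keeps the search close to the expected barcode position, which reduces the chance of a spurious match deeper in the read.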

Useful for usearch-based microbiome pipeline:

  • ucstripper.py a python script for processing the output from a usearch --cluster run against a reference database. The output is an OTU table where each column corresponds to a sample and each row corresponds to a sequence in the database matched by a read. The cells are filled with count data.

  • ucstripper_paired.py identical to ucstripper.py except that it's designed to work with reads that are from a paired-end run. Reads can be analysed in usearch as though they are separate, but this script will count a hit to an OTU only if both ends agree on the assigned OTU. Optionally, if only one end is present in the .uc file, its OTU is counted.

  • name2tax.py takes the sequence names from ucstripper.py (and ucstripper_paired.py) output and assigns a bacterial name at a user-specified taxonomic level. It requires specifically formatted mapping and taxonomy files. The format for these files should match that used in taxtastic databases.
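
The paired-end counting rule described for ucstripper_paired.py (count a hit only when both ends agree, and optionally count a lone end) can be sketched as follows. This is an illustration under stated assumptions, not the script's implementation: the function name, the `allow_single` flag and the input layout (dictionaries mapping read ID to a sample and OTU pair) are hypothetical, and parsing of the .uc file is left out.

```python
from collections import defaultdict


def count_paired_hits(forward_hits, reverse_hits, allow_single=False):
    """Build {otu: {sample: count}} from per-read OTU assignments.

    forward_hits and reverse_hits map read ID -> (sample, otu). This layout
    is an assumption for illustration, not the .uc format itself.
    """
    table = defaultdict(lambda: defaultdict(int))
    for read_id in sorted(set(forward_hits) | set(reverse_hits)):
        fwd = forward_hits.get(read_id)
        rev = reverse_hits.get(read_id)
        if fwd is not None and rev is not None:
            # Both ends present: count only if they agree on sample and OTU.
            if fwd == rev:
                sample, otu = fwd
                table[otu][sample] += 1
        elif allow_single:
            # Only one end present: count it if the caller opted in.
            sample, otu = fwd if fwd is not None else rev
            table[otu][sample] += 1
    return {otu: dict(samples) for otu, samples in table.items()}


if __name__ == "__main__":
    fwd = {"r1": ("s1", "otuA"), "r2": ("s1", "otuA"), "r3": ("s2", "otuB")}
    rev = {"r1": ("s1", "otuA"), "r2": ("s1", "otuB")}
    print(count_paired_hits(fwd, rev))                     # r1 only
    print(count_paired_hits(fwd, rev, allow_single=True))  # r1 and r3
```

In the example, read r2 is never counted because its two ends disagree, whichever setting is used.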

Useful for RDP-based microbiome pipeline:

  • rdpstripper.py similar to ucstripper.py, but for the output from RDP Classifier.

  • taxtastic2rdp.py converts a database in the taxtastic format to a set of files suitable for creating and training an RDP Classifier database.

  • rdp_multi_script.sh a script to run RDP Multi-Classifier on your file of processed sequences with your custom trained database. It will prompt you for the various files etc. that it needs.

Useful for pplacer-based microbiome pipeline:

  • separate_assigned_reads.py see "Generally useful", above.

  • pplacer_batch.sh runs multiple samples through pplacer and guppy fat to produce phylogenetic placement files and fattened reference trees showing microbiome distributions for each sample.

  • sqlite_script.sh produces a CSV table of read classifications from a guppy classify generated database. This CSV table can then be used to produce OTU tables or, using gcstripper.py, specific taxon-level classification tables.

  • gcstripper.py similar to ucstripper.py, but for the output from sqlite_script.sh (i.e. pplacer classifications).

Useful for manipulating classification or OTU tables:

  • pool_otus.py takes an OTU table and pools counts whose OTU names are identical. Any OTUs that didn't find a match in the database can be combined into a category called "Noise". It also allows setting of a minimum count threshold; if no sample contains more counts than this threshold, the OTU is discarded.

  • otu_formatter.py takes multiple OTU tables and reformats them such that they all contain identical row and column names and numbers. Counts from the original tables are transferred exactly to the appropriate cells of the reformatted tables and where new rows or columns have to be created, zeros are entered.
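
The pooling and thresholding behaviour described for pool_otus.py can be sketched in a few lines. This is a hedged illustration, not the script itself: the function signature, the in-memory row layout and the `"*"` label for unmatched OTUs are hypothetical, and the sketch assumes the minimum count threshold is applied after pooling, which the description above does not specify.

```python
def pool_otus(rows, min_count=0, noise_labels=("*",)):
    """Pool rows with identical OTU names and drop low-count OTUs.

    rows is a list of (otu_name, [count per sample]) tuples, a hypothetical
    in-memory layout rather than the OTU table file format. Names listed in
    noise_labels (an assumed marker for "no database match") are pooled
    into a single "Noise" row.
    """
    pooled = {}
    for name, counts in rows:
        key = "Noise" if name in noise_labels else name
        if key in pooled:
            # Add counts sample by sample to the existing row.
            pooled[key] = [a + b for a, b in zip(pooled[key], counts)]
        else:
            pooled[key] = list(counts)
    # Keep an OTU only if at least one sample exceeds the threshold.
    return {
        name: counts
        for name, counts in pooled.items()
        if any(c > min_count for c in counts)
    }


if __name__ == "__main__":
    rows = [
        ("Bacteroides", [3, 0]),
        ("Bacteroides", [2, 1]),
        ("*", [1, 1]),
        ("Rare", [1, 0]),
    ]
    print(pool_otus(rows, min_count=1))
```

With `min_count=1`, only the pooled Bacteroides row survives in this example, because no sample in the other rows contains more than one count.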

For all included python scripts, type the name of the script followed by -h or --help to see help documentation. Many of the scripts output logfiles with useful information about the run; for example, the demultiplexer.py logfile reports how many reads were assigned to each sample.

Installation:

To install, place the mubiomics directory anywhere on your hard drive, add it to the $PYTHONPATH shell variable and add the mubiomics/scripts directory to your $PATH variable.

The following prerequisites must be installed for scripts to work:

  • Python v2.7 or greater
  • Biopython v1.5.8 or greater

Some of the scripts are designed to process results from the following read classification programs. To run the shell scripts that use them, they should also be installed:

  • usearch
  • rdp multiclassifier
  • pplacer

Many of the scripts also depend on files from a taxtastic database as used by pplacer. These have a very specific format and can be constructed using the taxtastic program from Erick Matsen's group at FHCRC:

  • taxtastic

Testing:

To run tests, cd into mubiomics/tests and enter the following on the command line, followed by return:

$ tests.sh

For the tests to run to completion, usearch v4.0 or greater must be installed. Once downloaded, the binary must be placed on the shell $PATH.

If the tests fail, you will get error messages which may indicate the reason for the failure(s). The most likely causes are that the dependencies are not installed or configured correctly, or that the programs are not on your shell $PATH variable. To see $PATH, type echo $PATH in a terminal. To add programs to your shell $PATH and $PYTHONPATH, enter the following in ~/.bash_profile:

export PATH=${PATH}:/path/to/directory
export PYTHONPATH=${PYTHONPATH}:/path/to/directory

The tests.sh script also provides an example workflow. Open it in a text editor for explanations of each step.

QIIME compatibility:

The programs in this package were initially written to provide compatibility with QIIME, which, at the time, couldn't demultiplex FASTQ data from the Hi-Seq platform. Output from demultiplexer.py is thus compatible with all downstream QIIME workflows. Output from ucstripper.py and pool_otus.py is in a format identical to the OTU tables produced by QIIME and is thus compatible with the analysis scripts that take them as input.

Notes:

Development and testing were performed on Mac OS X 10.6 using Python v2.7 and Biopython v1.5.8. We can't guarantee that it will work with other setups, but feel free to email with any issues.

The patricia.py class was obtained from a post on Stack Overflow. Thank you to John Peel for posting this; we hope you don't mind us using it!

About

a python-based set of tools for processing next-gen sequencing reads for microbiome analysis
