tprK pipeline

This pipeline was designed to take Illumina and PacBio files straight off the sequencer to a final comparison table of all the different variable regions with their relative frequencies, as well as various pretty plots along the way.

Setup

Install nextflow.
- Make sure you move nextflow to a directory in your PATH variable.
Install docker. The first run will take a while because the docker image has to be built, but this only happens once!

Input Files

Put the following things in one folder:

All the sequence files to run analysis on
- PacBio Q20 reads, gzipped
- Single-end Illumina reads trimmed and run through Trimmomatic, gzipped
- By default the pipeline expects both PacBio and Illumina files for every sample. Running the pipeline with just PacBio or just Illumina files is possible with the --PACBIO and --ILLUMINA flags respectively. However, some plots need both file types and will not be generated in that case.
Metadata file. This should be a .csv with three columns: SampleName, Illumina, PacBio, as shown in the table below. Make sure to include absolute paths to PacBio and Illumina files!
- This file should be placed in the same folder as your files to be analyzed.
- There MUST be a newline character at the end of this file for it to be read as a valid csv. Simply hit enter at the end of the last row to make sure the file ends with a newline.
- Ensure that there are no special characters, including hyphens! Underscores are okay.
- If running just Illumina or just PacBio, simply leave the unused column blank (but make sure to keep the commas so that every row still has three fields).
- Example metadata files (for both, just Illumina, and just PacBio) are provided in the example/ folder. The general format of the metadata file should be three columns, separated by commas, as shown:

SampleName	Illumina	PacBio
The name of the sample. This will largely be the name used for generating tables and plots.	The Illumina file for this sample, in the format Ill_[sample name].fastq.gz. This must exactly match the name of the corresponding file in the folder. This should be a trimmed file run through Trimmomatic.	The PacBio file for this sample, in the format PB_[sample name].fastq.gz. This must exactly match the name of the corresponding file in the folder. This should be a Q20 file.
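
The metadata rules above are easy to get wrong by hand, so a quick check before launching the pipeline may save a failed run. The following is a minimal sketch and not part of the pipeline: `check_metadata` is a hypothetical helper, it assumes the header order shown in the table, and it assumes the "no special characters" rule applies to the SampleName column (letters, digits, and underscores only).

```python
import csv
import io
import re

# Assumption: column order follows the table above.
EXPECTED_HEADER = ["SampleName", "Illumina", "PacBio"]
# Assumption: sample names may contain letters, digits and underscores only.
SAFE_NAME = re.compile(r"[A-Za-z0-9_]+")


def check_metadata(text):
    """Return a list of human-readable problems found in metadata CSV text."""
    problems = []
    # The file must end with a newline to be read as a valid csv.
    if not text.endswith("\n"):
        problems.append("file does not end with a newline")
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or rows[0] != EXPECTED_HEADER:
        problems.append("header should be " + ",".join(EXPECTED_HEADER))
        return problems
    for line_no, row in enumerate(rows[1:], start=2):
        # Blank columns are fine, but the commas must still be there.
        if len(row) != 3:
            problems.append(f"line {line_no}: expected 3 columns, found {len(row)}")
            continue
        sample, illumina, pacbio = row
        if not SAFE_NAME.fullmatch(sample):
            problems.append(f"line {line_no}: sample name {sample!r} has special characters")
        if not illumina and not pacbio:
            problems.append(f"line {line_no}: no Illumina or PacBio file given")
    return problems


if __name__ == "__main__":
    example = "SampleName,Illumina,PacBio\nsample_1,Ill_sample_1.fastq.gz,PB_sample_1.fastq.gz\n"
    print(check_metadata(example) or "metadata looks fine")
```

An empty list means none of the checked rules were violated; it does not confirm that the listed files exist.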

Usage

Example command for just Illumina files in current directory on a laptop without many CPUs:
    nextflow run michellejlin/tprk -r nextflow --INPUT ./ --OUTDIR output/ --ILLUMINA --METADATA metadata.csv -resume -with-docker ubuntu:18.04 -with-trace -profile laptop
Example command for comparing PacBio and Illumina files with specified cutoffs on the cloud with a large dataset:
    AWS_PROFILE=covid nextflow run michellejlin/tprk -r nextflow --INPUT example/ --OUTDIR example/output/ --METADATA metadata.csv --LARGE -resume -with-docker ubuntu:18.04 -with-trace -c ~/nextflow.covid.config -profile Cloud
Example command for just Illumina files in current directory with a specified reference sample for variable region comparisons:
    nextflow run michellejlin/tprk -r nextflow --INPUT ./ --OUTDIR output/ --ILLUMINA --METADATA metadata.csv -resume -with-docker ubuntu:18.04 -with-trace --LARGE --REFERENCE inoculum_S168_trim

For a list of arguments, you can also run `nextflow run michellejlin/tprk -r nextflow --help`.

Command	Description
--INPUT	Input folder where gzipped fastqs are located. For current directory, `./` can be used.
--OUTDIR	Output folder where .bam files and consensus fastas will be written.
--METADATA	Path to metadata file with specific format.
--PACBIO	Specify that there are only PacBio files to be read.
--ILLUMINA	Specify that there are only Illumina files to be read.
--LARGE	Specify that this is a large dataset. Splitting of visualizations will be done.
--REFERENCE	Specify the Illumina sample name (not the file name) that the other samples are compared to in dot-line plots. Can be used in tandem with --LARGE.
--RF_FILTER	Specify relative frequency filter. Default is 0.2.
--COUNT_FILTER	Specify count filter. Default is 5.
--ILLUMINA_FILTER	Specify whether PacBio reads should be filtered to only include files supported by Illumina reads that reach the cutoff.
-resume	nextflow will pick up where it left off if the previous command was interrupted for some reason.
-with-docker ubuntu:18.04	Runs command with Ubuntu docker.
-with-trace	Outputs a trace.txt that shows which processes end up in which work/ folders.
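
When the pipeline is launched from a script, it can help to assemble the command from the options in the table rather than editing a long string. The following is a rough sketch under that assumption: `build_command` and its parameter names are hypothetical, only flags listed in the table are used, and profile or config options are left out.

```python
import shlex


def build_command(input_dir, outdir, metadata, mode="both", large=False, reference=None):
    """Assemble a nextflow command line for the tprK pipeline as a list of arguments.

    mode is "both", "illumina" or "pacbio" (hypothetical names for this sketch).
    """
    if mode not in ("both", "illumina", "pacbio"):
        raise ValueError("mode must be 'both', 'illumina' or 'pacbio'")
    cmd = [
        "nextflow", "run", "michellejlin/tprk", "-r", "nextflow",
        "--INPUT", input_dir,
        "--OUTDIR", outdir,
        "--METADATA", metadata,
    ]
    if mode == "illumina":
        cmd.append("--ILLUMINA")
    elif mode == "pacbio":
        cmd.append("--PACBIO")
    if large:
        cmd.append("--LARGE")
    if reference is not None:
        cmd += ["--REFERENCE", reference]
    # nextflow options taken from the example commands above.
    cmd += ["-resume", "-with-docker", "ubuntu:18.04", "-with-trace"]
    return cmd


if __name__ == "__main__":
    args = build_command("./", "output/", "metadata.csv", mode="illumina")
    # shlex.join (Python 3.8+) gives a copy-pasteable shell string.
    print(shlex.join(args))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting problems with paths that contain spaces.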

Common Errors

incomplete final line found by readTableHeader on '/Users/uwvirongs/Documents/tprk/metadata.csv'
- Make sure your metadata file has a new line at the end. You can do this by simply pressing enter on the last line of your file and saving.
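
If the metadata file is generated by a script or a spreadsheet export, the missing newline can also be fixed programmatically. This is a small sketch rather than part of the pipeline; `ensure_trailing_newline` is a hypothetical helper name.

```python
from pathlib import Path


def ensure_trailing_newline(path):
    """Append a newline if the file is non-empty and does not end with one.

    Returns True if the file was changed, False otherwise.
    """
    p = Path(path)
    data = p.read_bytes()
    # Empty files are left alone; files that already end in a newline are fine.
    if not data or data.endswith(b"\n"):
        return False
    p.write_bytes(data + b"\n")
    return True
```

Calling `ensure_trailing_newline("metadata.csv")` before starting the run is safe to repeat, since a file that already ends with a newline is not modified.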
