Skip to content

Latest commit

 

History

History
59 lines (49 loc) · 5.07 KB

README.md

File metadata and controls

59 lines (49 loc) · 5.07 KB

tprK pipeline

This pipeline was designed to take Illumina and PacBio files straight off the sequencer to a final comparison table of all the different variable regions with their relative frequencies, as well as various pretty plots along the way.

Table of Contents

Setup

  1. Install nextflow.
    • Make sure you move nextflow to a directory in your PATH variable.
  2. Install docker. The first time running this program will take a while, as the docker image will take some time to build, but this is a one time thing!

Input Files

Put the following things in one folder:

  • All the sequence files to run analysis on
    • PacBio Q20 reads, gzipped
    • Single-end Illumina reads trimmed and run through Trimmomatic, gzipped
    • By default the pipeline expects both PacBio and Illumina files for every sample. Running the pipeline with just PacBio or just Illumina files is possible with the --pacbio and --illumina flags respectively. However, some plots require both files to be generated and these plots will not be output.
  • Metadata file. This should be a .csv with three columns: SampleName, PacBio, Illumina, shown in the table below. Make sure to include absolute paths to PacBio and Illumina files!
    • This file should be placed in the same folder as your files to be analyzed.
    • There MUST be a newline character at the end of this file to be read as a valid csv. Simply hit enter in the last row to ensure there is a valid new line.
    • Ensure that there are no special characters, including hyphens! Underscores are okay.
    • If running just Illumina or just PacBio, simply leave those columns blank (but make sure to have commas as appropriate).
    • Example metadata files (for both, just Illumina, and just PacBio) are provided in the example/ folder. The general format of the metadata file should be three columns, separated by commas, as shown:
SampleName Illumina PacBio
This will largely be the name used for generating tables and plots. Should be in format Ill_[sample name].fastq.gz. The Illumina file specified for the sample name. This must match exactly the name of the matching file in the folder. This should be a trimmed file run through Trimmomatic. Should be in format PB_[sample name].fastq.gz. The PacBio file specified for the sample name. This must match exactly the name of the matching file in the folder. This should be a Q20 file.

Usage

  • Example command for just Illumina files in current directory on a laptop without many CPUs: nextflow run michellejlin/tprk -r nextflow --INPUT ./ --OUTDIR output/ --ILLUMINA --METADATA metadata.csv -resume -with-docker ubuntu:18.04 -with-trace -profile laptop
  • Example command for comparing PacBio and Illumina files with specified cutoffs on the cloud with a large dataset: AWS_PROFILE=covid nextflow run michellejlin/tprk -r nextflow --INPUT example/ --OUTDIR example/output/ --METADATA metadata.csv --LARGE -resume -with-docker ubuntu:18.04 -with-trace -c ~/nextflow.covid.config -profile Cloud
  • Example command for just Illumina files in current directory with a specified reference sample for variable region comparisons: nextflow run michellejlin/tprk -r nextflow --INPUT ./ --OUTDIR output/ --ILLUMINA --METADATA metadata.csv -resume -with-docker ubuntu:18.04 -with-trace --LARGE --REFERENCE inoculum_S168_trim

For a list of arguments, you can also run nextflow run michellejlin/tprk -r nextflow --help .

Command Description
--INPUT Input folder where gzipped fastqs are located. For current directory, ./ can be used.
--OUTDIR Output folder where .bams and consensus fastas will be piped into.
--METADATA Path to metadata file with specific format.
--PACBIO Specify that there are only PacBio files to be read.
--ILLUMINA Specify that there are only Illumina files to be read.
--LARGE Specify that this is a large dataset. Splitting of visualizations will be done.
--REFERENCE Specify Illumina sample name (not file), to compare others to for dot-line plots. Can be used in tandem with --LARGE.
--RF_FILTER Specify relative frequency filter. Default is 0.2.
--COUNT_FILTER Specify count filter. Default is 5.
--ILLUMINA_FILTER Specify whether PacBio reads should be filtered to only include files supported by Illumina reads that reach the cutoff.
-resume nextflow will pick up where it left off if the previous command was interrupted for some reason.
-with-docker ubuntu:18.04 Runs command with Ubuntu docker.
-with-trace Outputs a trace.txt that shows which processes end up in which work/ folders.

Common Errors

  1. incomplete final line found by readTableHeader on '/Users/uwvirongs/Documents/tprk/metadata.csv' Make sure your metadata file has a new line at the end. You can do this by simply pressing enter on the last line of your file and saving.