Wastewater analysis

Introduction

ww_minimal is a bioinformatics analysis pipeline that performs initial quality control and variant analysis on wastewater sequencing samples. It supports Illumina short reads prepared with the Nimagen primer scheme on several platforms (NovaSeq, NextSeq, MiSeq).

Pipeline summary

Merge sequencing FASTQ files (pigz)
Adapter trimming (fastp)
Variant calling
1. Read alignment (bwa mem)
2. Sort and index alignments (Samtools)
3. Primer sequence removal (BAMClipper)
4. Genome-wide and amplicon coverage (mosdepth, Samtools ampliconstats)
5. Variant calling (freyja variants/demix). Samples may fail at this step because of low coverage; failed samples are omitted from the remaining pipeline steps, but are not excluded overall.
6. Extract WHO and pango lineages (collate_results.py, collate_lineages.py)
7. Aggregate all sample outputs (xsv)

Quickstart

This pipeline uses conda for environment and package management (Miniconda is recommended).

Initialise environment

With [mini]conda installed:

git clone https://github.com/LooseLab/ww_nf_minimal
cd ww_nf_minimal
conda env create -f environment.yaml

Run test profile

conda activate ww_minimal
nextflow run main.nf -profile test

Running an actual run

After the test subset has run successfully, you can run the pipeline on other samples. See the Input section for how to set up the FASTQ directory and sample sheet. Once these are in place, run the pipeline like so:

nextflow run /path/to/main.nf --readsdir <FASTQ INPUT DIRECTORY> --sample_sheet <SAMPLE SHEET CSV> -with-report report.html

If Nextflow crashes while running, add the -resume flag to the previous command. Nextflow will then reuse cached jobs, so the entire pipeline does not need to be re-run.

Input and Output

Input

There are two required user-supplied inputs: the sample sheet and the FASTQ reads directory. These can be set either by adding the sample_sheet and readsdir attributes to params in nextflow.config, or on the command line with --sample_sheet and --readsdir. In addition, three static inputs are provided with the workflow (these may change in the future as the primer scheme changes): the reference genome, the paired-end primer file, and the amplicon primer file.

FASTQ

This pipeline expects FASTQ files to be organised inside an input directory, with a subfolder for each sequencing lab and, within that, a subfolder for each run ID. For most labs the share directory can be used directly; however, samples from Exeter require symlinking. An example input directory structure is shown below:

input
├── <LAB1>
│  ├── <RUN1>
│  │  ├── SAMPLE_R1_L002_001.fastq.gz
│  │  └── SAMPLE_R2_L002_001.fastq.gz
│  └── <RUN2>
│     ├── SAMPLE_R1_L002_001.fastq.gz
│     └── SAMPLE_R2_L002_001.fastq.gz
└── <LAB2>
   └── <RUN1>
      ├── SAMPLE_L001_R1_001.fastq.gz
      ├── SAMPLE_L001_R2_001.fastq.gz
      ├── ...
      ├── SAMPLE_L004_R1_001.fastq.gz
      └── SAMPLE_L004_R2_001.fastq.gz
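
As a rough check of the layout above, the following Python sketch groups FASTQ files by lab and run. It is not part of the pipeline: the function name `list_fastqs` is hypothetical, and it assumes that every file of interest ends in `.fastq.gz` and sits exactly two directory levels below the input directory.

```python
from collections import defaultdict
from pathlib import Path


def list_fastqs(readsdir):
    """Group FASTQ file names by (lab, run).

    Assumes the layout <readsdir>/<LAB>/<RUN>/*.fastq.gz described above.
    Hypothetical helper for illustration; the pipeline does not ship it.
    """
    grouped = defaultdict(list)
    # Sorting makes the R1/R2 order within each run deterministic.
    for fastq in sorted(Path(readsdir).glob("*/*/*.fastq.gz")):
        run_dir = fastq.parent
        lab_dir = run_dir.parent
        grouped[(lab_dir.name, run_dir.name)].append(fastq.name)
    return dict(grouped)
```

Calling `list_fastqs("input")` returns a dictionary keyed by `(lab, run)` tuples, which makes it easy to spot runs with no files or an odd number of files before starting the pipeline.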

Sample sheet CSV

The CSV sample sheet is required as this informs the pipeline which samples should be analysed. It currently requires 6 fields:

sample_id
sample_site_code
timestamp_sample_collected
sequencing_lab_code
sequencing_sample_id
sequencing_run_id

These are used to find the input FASTQ files in the readsdir. All fields are passed through to the aggregation steps at the end of the pipeline.
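
Before launching a run it can help to confirm that the sample sheet carries all six fields. The sketch below is one way to do that in Python. It assumes the fields appear as column names in a header row; `missing_fields` is a hypothetical helper, not something provided by the pipeline.

```python
import csv

# The six fields listed above.
REQUIRED_FIELDS = [
    "sample_id",
    "sample_site_code",
    "timestamp_sample_collected",
    "sequencing_lab_code",
    "sequencing_sample_id",
    "sequencing_run_id",
]


def missing_fields(sample_sheet):
    """Return the required column names absent from the sample sheet header.

    Assumes the first row of the CSV is a header (an assumption, not
    something the pipeline documentation states explicitly).
    """
    with open(sample_sheet, newline="") as handle:
        header = csv.DictReader(handle).fieldnames or []
    return [name for name in REQUIRED_FIELDS if name not in header]
```

An empty list means every required column is present; any names returned are the ones to add.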

Output

By default, output files are written to a results directory in the location the pipeline is called from. Within it, each lab and run has a subfolder for every step that emits files, like so:

results
├── aggregated.csv
├── all_lineages.csv
├── <LAB1>
│  ├── <RUN1>
│  │  ├── alignments
│  │  ├── ampliconstats
│  │  ├── bamclipper
│  │  ├── freyja
│  │  ├── mosdepth
│  │  ├── stats_csv
│  │  └── trimmed
│  └── <RUN2>
│     ├── alignments
│     ├── ampliconstats
│     ├── bamclipper
│     ├── freyja
│     ├── mosdepth
│     ├── stats_csv
│     └── trimmed
└── <LAB2>
   └── <RUN1>
      ├── alignments
      ├── ampliconstats
      ├── bamclipper
      ├── freyja
      ├── mosdepth
      ├── stats_csv
      └── trimmed

Outputs organised in directories under a <RUN ID> are the raw outputs from the steps in the Pipeline summary. The aggregated outputs are placed at the top level, as they combine data from all of the sequencing labs and runs.

`aggregated.csv`

This CSV file aggregates the WHO lineages, their frequencies, and sequencing depths for all samples that complete analysis. As multiple lineages may be present, multiple rows can be returned for a single sample.

Column	Description
amplicon_mean	Mean coverage over all amplicons including zeros
non_zero_amplicon_mean	Mean coverage over amplicons excluding zeros
amplicon_median	Median coverage over all amplicons including zeros
non_zero_amplicon_median	Median coverage over amplicons excluding zeros
count_gte_20	Count of amplicons with at least (≥) 20× coverage
count_lt_20	Count of amplicons with less than (<) 20× coverage
stdev	Standard deviation of coverage over all amplicons
non_zero_stdev	Standard deviation of coverage over amplicons excluding zeros
lineage	WHO lineage assigned by Freyja
abundance	Abundance of this WHO lineage
mean_genome_coverage	Mean coverage over whole genome from mosdepth
sample_id	Sample ID used in the pipeline
sample_site_code	Sample site location code
timestamp_sample_collected	Timestamp sample collected
sequencing_lab_code	Sequencing lab
original_sample_id	Original metadata sample id
sequencing_sample_id	Sample ID used in the pipeline
sequencing_run_id	Run ID for the sample
amplicon_001_mean_depth	Coverage over this individual amplicon
...	...
amplicon_154_mean_depth	repeated for all amplicons
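
To make the difference between the "including zeros" and "excluding zeros" columns concrete, here is a small Python sketch that computes them from a list of per-amplicon depths. It only reflects our reading of the column descriptions above; how the pipeline itself computes these values is not shown here, and `summarise_amplicons` is a hypothetical name. The standard deviation columns are left out because the table does not say whether the sample or population form is used.

```python
from statistics import mean, median


def summarise_amplicons(depths, threshold=20):
    """Summarise per-amplicon depths following the column descriptions.

    Illustrative only: this is an interpretation of the table, not the
    pipeline's own code.
    """
    non_zero = [d for d in depths if d > 0]
    return {
        "amplicon_mean": mean(depths),
        # Fall back to 0.0 when every amplicon has zero coverage.
        "non_zero_amplicon_mean": mean(non_zero) if non_zero else 0.0,
        "amplicon_median": median(depths),
        "non_zero_amplicon_median": median(non_zero) if non_zero else 0.0,
        "count_gte_20": sum(d >= threshold for d in depths),
        "count_lt_20": sum(d < threshold for d in depths),
    }
```

For depths of 0, 10, 20 and 30, the mean over all amplicons is 15 while the non-zero mean is 20, and two amplicons fall on each side of the 20× threshold.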

`all_lineages.csv`

This CSV file aggregates Pango lineages that are assigned by Freyja. It is a more fine-grained breakdown of the sample composition than the WHO lineages.

Column	Description
lineage	Pango lineage assigned by Freyja
abundance	Abundance of this lineage
sample_id	Sample ID used in the pipeline
sample_site_code	Sample site location code
timestamp_sample_collected	Timestamp sample collected
sequencing_lab_code	Sequencing lab
original_sample_id	Original metadata sample id
sequencing_sample_id	Sample ID used in the pipeline
sequencing_run_id	Run ID for the sample
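
Because a sample can have several rows in this file, a common first step is to reduce it to the most abundant lineage per sample. The Python sketch below does this using only the `lineage`, `abundance` and `sample_id` columns documented above. It assumes `abundance` parses as a number; `top_lineage_per_sample` is a hypothetical helper and not part of the pipeline.

```python
import csv


def top_lineage_per_sample(path):
    """Return {sample_id: (lineage, abundance)} for the most abundant lineage.

    Assumes the CSV has a header row with the columns listed above and
    that abundance values are numeric (an assumption for this sketch).
    """
    best = {}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            sample = row["sample_id"]
            abundance = float(row["abundance"])
            # Keep the first row seen on ties.
            if sample not in best or abundance > best[sample][1]:
                best[sample] = (row["lineage"], abundance)
    return best
```

The same function should work on `aggregated.csv` for WHO lineages, since that file documents the same three columns.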
