Skip to content

RIVM-bioinformatics/SARS2seq

Repository files navigation

SARS2seq

SARS2seq is a pipeline designed to process raw FastQ data from targeted SARS2-CoV-2 sequencing and generate biologically correct consensus sequences of the SARS-CoV-2 genome.

SARS2seq performs high speed data quality control and data cleanup, and high accuracy removal of primer-sequences from NGS reads. As well as alignment of reads and generation of a consensus sequence using a custom consensus caller which accounts for sequencing errors and alignment artefacts.

SARS2seq is able to run both on a standalone (linux) computer, as well as High-Performance Computing (HPC) infrastructures.


Outputs

SARS2seq outputs the following results:

  • An easy to use QC overview, generated by MultiQC.
  • Consensus sequences at various coverage thresholds.
  • Amino-acid sequences of every open reading frame at various coverage thresholds.
  • Tables with mutations relative to the reference genome (MN908947.3) at various coverage thresholds.
  • A width-of-coverage table showing the percentage of the reference genome (MN908947.3) which is covered at various depth-of-coverage thresholds.
  • A table showing the average depth of coverage per amplicon of each sample
  • An easy to use table with Pangolin and NextClade typing results for every sample at an optimal coverage threshold.

SARS2seq uses the following depth-of-coverage threshold by default: 1x, 5x, 10x, 50x, and 100x

Additional files for further analysis can be found in the data/ once processing is finished. These files include:

  • Alignment files
  • VCF-files
  • Per-position coverage files
  • GFF-files that have been corrected to match the created consensus genome
  • Found (and removed) coordinates of primer-sequences

Installation instructions

Before you install SARS2seq, make sure Conda is installed on your system and functioning properly!

  1. Download the latest version of SARS2seq by "cloning" this repository and navigating to the newly created directory.
    Copy and paste the code-snippet below in order to do so.

    git clone https://github.com/RIVM-bioinformatics/SARS2seq.git && cd SARS2seq/
    
  2. Create the required conda-environment and install the necessary dependencies.
    Copy and paste the code-snippet below in order to create the new conda-environment and directly activate it.

    conda create --name SARS2seq -c conda-forge mamba python=3.7 && conda activate SARS2seq && mamba env update -f mamba-env.yaml
    

    You can also use the following snippet if the code-snippet above didn't work for you:

    conda env create -f env.yml && conda activate SARS2seq
    

    The "SARS2seq" environment should now be active

  3. Now actually install SARS2seq to your system with the following command:

    pip install .
    

    If that command didn't work for you then use the following:

    python setup.py install
    

SARS2seq is now installed! You can start SARS2seq from anywhere on your system as long as the SARS2seq conda-environment is active.
You can also use SARS2seq in a different conda-environment as long as the software dependencies match.


Preparing input primer file

In order to get optimal results, make sure the fasta headers in your fasta file with primers is formatted properly. Please make sure the fasta headers for your primers are formatted in the following format:

>{primer-name}_{primer-number}_{orientation}

Orientation keywords for forward primers are: LEFT/PLUS/POSITIVE/FORWARD
Orientation keywords for reverse primers are: RIGHT/MINUS/NEGATIVE/REVERSE

Below is an example of formatted primer names from the ArticV3 sequencing protocol:

>nCoV-2019_1_LEFT
ACCAACCAACTTTCGATCTCTTGT
>nCoV-2019_1_RIGHT
CATCTTTAAGATGTTGACGTGCCTC
>nCoV-2019_2_LEFT
CTGTTTTACAGGTTCGCGACGT
>nCoV-2019_2_RIGHT
TAAGGATCAGTGCCAAGCTCGT

If your protocol has alternative primers then make sure the fasta header contains the "alt" keyword in the following format:

>{primer-name}_{primer-number}_alt_{orientation}

Make sure the "alt" keyword is in the middle and not at the end of the fasta header.

Below is an example of formatted primer names from the ArticV3 sequencing protocol with alternative primers included:

>nCoV-2019_13_LEFT
TCGCACAAATGTCTACTTAGCTGT
>nCoV-2019_13_RIGHT
ACCACAGCAGTTAAAACACCCT
>nCoV-2019_14_LEFT
CATCCAGATTCTGCCACTCTTGT
>nCoV-2019_14_alt_LEFT
TGGCAATCTTCATCCAGATTCTGC
>nCoV-2019_14_RIGHT
AGTTTCCACACAGACAGGCATT
>nCoV-2019_14_alt_RIGHT
TGCGTGTTTCTTCTGCATGTGC
>nCoV-2019_15_LEFT
ACAGTGCTTAAAAAGTGTAAAAGTGCC
>nCoV-2019_15_alt_LEFT
AGTGCTTAAAAAGTGTAAAAGTGCCT
>nCoV-2019_15_RIGHT
AACAGAAACTGTAGCTGGCACT
>nCoV-2019_15_alt_RIGHT
ACTGTAGCTGGCACTTTGAGAGA

Formatting your input primer file as above will yield the best results for your analysis.


Running an analysis

Please see the command line help for a quick explanation of every possible argument: sars2seq -h

You can start an analysis with a command such as the following:

sars2seq \
    --input {path/to/FastQ-files} \
    --output {path/to/desired-output} \
    --primers {path/to/primers.fasta/NONE} \
    --platform {nanopore/illumina/iontorrent} \
    --amplicon-type {end-to-end/end-to-mid} \
    --threads {threads}

If your protocol does not use primers then set "NONE" for the primers flag as: --primers NONE

The --amplicon-type flag is used to clarify the length of of the sequenced read in regards to the amplicon.
"end-to-end" means that the sequenced read covers the full length of the amplicon, meaning that the primer-sequence is present at both ends of a read. "end-to-mid" means that the sequenced read partially covers the length of the amplicon, meaning that the primer-sequence is present at only one end of a read.

Please check your sequencing and laboratory setup to ensure the best results.


Authors

  • Florian Zwagemaker
  • Dennis Schmitz
  • Karim Hajji
  • Annelies Kroneman

Acknowledgements

  • Harry Vennema
  • Dirk Eggink
  • Jeroen Cremer
  • Jeroen Laros
  • Robert Verhagen
  • Erwin van Wieringen