SARS2seq

SARS2seq is a pipeline designed to process raw FastQ data from targeted SARS2-CoV-2 sequencing and generate biologically correct consensus sequences of the SARS-CoV-2 genome.

SARS2seq performs high speed data quality control and data cleanup, and high accuracy removal of primer-sequences from NGS reads. As well as alignment of reads and generation of a consensus sequence using a custom consensus caller which accounts for sequencing errors and alignment artefacts.

SARS2seq is able to run both on a standalone (linux) computer, as well as High-Performance Computing (HPC) infrastructures.

Outputs

SARS2seq outputs the following results:

An easy to use QC overview, generated by MultiQC.
Consensus sequences at various coverage thresholds.
Amino-acid sequences of every open reading frame at various coverage thresholds.
Tables with mutations relative to the reference genome (MN908947.3) at various coverage thresholds.
A width-of-coverage table showing the percentage of the reference genome (MN908947.3) which is covered at various depth-of-coverage thresholds.
A table showing the average depth of coverage per amplicon of each sample
An easy to use table with Pangolin and NextClade typing results for every sample at an optimal coverage threshold.

SARS2seq uses the following depth-of-coverage threshold by default: 1x, 5x, 10x, 50x, and 100x

Additional files for further analysis can be found in the data/ once processing is finished. These files include:

Alignment files
VCF-files
Per-position coverage files
GFF-files that have been corrected to match the created consensus genome
Found (and removed) coordinates of primer-sequences

Installation instructions

Before you install SARS2seq, make sure Conda is installed on your system and functioning properly!

Download the latest version of SARS2seq by "cloning" this repository and navigating to the newly created directory.
Copy and paste the code-snippet below in order to do so.
```
git clone https://github.com/RIVM-bioinformatics/SARS2seq.git && cd SARS2seq/
```
Create the required conda-environment and install the necessary dependencies.
Copy and paste the code-snippet below in order to create the new conda-environment and directly activate it.
```
conda create --name SARS2seq -c conda-forge mamba python=3.7 && conda activate SARS2seq && mamba env update -f mamba-env.yaml
```
You can also use the following snippet if the code-snippet above didn't work for you:
```
conda env create -f env.yml && conda activate SARS2seq
```
The "SARS2seq" environment should now be active
Now actually install SARS2seq to your system with the following command:
```
pip install .
```
If that command didn't work for you then use the following:
```
python setup.py install
```

SARS2seq is now installed! You can start SARS2seq from anywhere on your system as long as the SARS2seq conda-environment is active.
You can also use SARS2seq in a different conda-environment as long as the software dependencies match.

Preparing input primer file

In order to get optimal results, make sure the fasta headers in your fasta file with primers is formatted properly. Please make sure the fasta headers for your primers are formatted in the following format:

>{primer-name}_{primer-number}_{orientation}

Orientation keywords for forward primers are: LEFT/PLUS/POSITIVE/FORWARD
Orientation keywords for reverse primers are: RIGHT/MINUS/NEGATIVE/REVERSE

Below is an example of formatted primer names from the ArticV3 sequencing protocol:

>nCoV-2019_1_LEFT
ACCAACCAACTTTCGATCTCTTGT
>nCoV-2019_1_RIGHT
CATCTTTAAGATGTTGACGTGCCTC
>nCoV-2019_2_LEFT
CTGTTTTACAGGTTCGCGACGT
>nCoV-2019_2_RIGHT
TAAGGATCAGTGCCAAGCTCGT

If your protocol has alternative primers then make sure the fasta header contains the "alt" keyword in the following format:

>{primer-name}_{primer-number}_alt_{orientation}

Make sure the "alt" keyword is in the middle and not at the end of the fasta header.

Below is an example of formatted primer names from the ArticV3 sequencing protocol with alternative primers included:

>nCoV-2019_13_LEFT
TCGCACAAATGTCTACTTAGCTGT
>nCoV-2019_13_RIGHT
ACCACAGCAGTTAAAACACCCT
>nCoV-2019_14_LEFT
CATCCAGATTCTGCCACTCTTGT
>nCoV-2019_14_alt_LEFT
TGGCAATCTTCATCCAGATTCTGC
>nCoV-2019_14_RIGHT
AGTTTCCACACAGACAGGCATT
>nCoV-2019_14_alt_RIGHT
TGCGTGTTTCTTCTGCATGTGC
>nCoV-2019_15_LEFT
ACAGTGCTTAAAAAGTGTAAAAGTGCC
>nCoV-2019_15_alt_LEFT
AGTGCTTAAAAAGTGTAAAAGTGCCT
>nCoV-2019_15_RIGHT
AACAGAAACTGTAGCTGGCACT
>nCoV-2019_15_alt_RIGHT
ACTGTAGCTGGCACTTTGAGAGA

Formatting your input primer file as above will yield the best results for your analysis.

Running an analysis

Please see the command line help for a quick explanation of every possible argument: sars2seq -h

You can start an analysis with a command such as the following:

sars2seq \
    --input {path/to/FastQ-files} \
    --output {path/to/desired-output} \
    --primers {path/to/primers.fasta/NONE} \
    --platform {nanopore/illumina/iontorrent} \
    --amplicon-type {end-to-end/end-to-mid} \
    --threads {threads}

If your protocol does not use primers then set "NONE" for the primers flag as: --primers NONE

The --amplicon-type flag is used to clarify the length of of the sequenced read in regards to the amplicon.
"end-to-end" means that the sequenced read covers the full length of the amplicon, meaning that the primer-sequence is present at both ends of a read. "end-to-mid" means that the sequenced read partially covers the length of the amplicon, meaning that the primer-sequence is present at only one end of a read.

Please check your sequencing and laboratory setup to ensure the best results.

Authors

Florian Zwagemaker
Dennis Schmitz
Karim Hajji
Annelies Kroneman

Acknowledgements

Harry Vennema
Dirk Eggink
Jeroen Cremer
Jeroen Laros
Robert Verhagen
Erwin van Wieringen

Name		Name	Last commit message	Last commit date
Latest commit History 239 Commits
.github/workflows		.github/workflows
SARS2seq		SARS2seq
docs		docs
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
docs-requirements.txt		docs-requirements.txt
env.yml		env.yml
mamba-env.yaml		mamba-env.yaml
mkdocs.yml		mkdocs.yml
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.github/workflows

.github/workflows

SARS2seq

SARS2seq

docs

docs

.gitignore

.gitignore

CHANGELOG.md

CHANGELOG.md

LICENSE

LICENSE

README.md

README.md

docs-requirements.txt

docs-requirements.txt

env.yml

env.yml

mamba-env.yaml

mamba-env.yaml

mkdocs.yml

mkdocs.yml

setup.py

setup.py

Repository files navigation

SARS2seq

Outputs

Installation instructions

Preparing input primer file

Running an analysis

Authors

Acknowledgements

About

Releases 18

Packages

Contributors 4

Languages

License

RIVM-bioinformatics/SARS2seq

Folders and files

Latest commit

History

Repository files navigation

SARS2seq

Outputs

Installation instructions

Preparing input primer file

Running an analysis

Authors

Acknowledgements

About

Topics

Resources

License

Stars

Watchers

Forks

Languages