Mutfunc: SARS-CoV-2 (Pipeline)

This repository contains the data generation pipeline for Mutfunc: SARS-CoV-2, which is a resource containing variant effect predictions and annotations for all possible SARS-CoV-2 amino acid substitutions. The source code for the web interface is in a separate repository. The dataset and methods are described in detail in the Mutfunc: SARS-CoV-2 preprint.

Citation

Alistair Dunham, Gwendolyn M Jang, Monita Muralidharan, Danielle Swaney & Pedro Beltrao. 2021. A missense variant effect prediction and annotation resource for SARS-CoV-2 (bioRxiv)

Installation

Clone the repository.
Install the required dependencies (see below).
Download any required additional data (see Additional Datasets).
Run snakemake setup_directories to initialise any missing directories.

Dependencies

The tools and Python modules listed below are required to run the data generation pipeline, and the R packages are used by the data analysis scripts (analysis/). The analysis can be run without running the data generation pipeline by downloading the Mutfunc: SARS-CoV-2 dataset and several additional datasets. I used Python 3.8.2 and R 3.6.3, but any version supporting the required packages should work.

Tools

SIFT4G (I used a slightly modified version that outputs scores to 5 decimal places instead of 2)
FoldX 5
Naccess
Singularity (to run the VEP container)
Ensembl VEP
MMseqs2
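
Before starting a long run it can help to confirm that the command line tools are reachable. The following is a minimal sketch using only the Python standard library. The executable names are assumptions (they depend on how each tool was installed, and FoldX binaries in particular are often renamed), so adjust the mapping to your setup. Ensembl VEP is not checked because it runs inside a container through Singularity.

```python
import shutil

# Assumed executable names; these are guesses and depend on the local install.
ASSUMED_EXECUTABLES = {
    "SIFT4G": "sift4g",
    "FoldX 5": "foldx",
    "Naccess": "naccess",
    "Singularity": "singularity",
    "MMseqs2": "mmseqs",
}


def find_missing_tools(executables=ASSUMED_EXECUTABLES):
    """Return the names of tools whose executable is not found on PATH."""
    return [name for name, exe in executables.items() if shutil.which(exe) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH")
```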

Python

Snakemake
Numpy
Pandas
Biopython
ruamel.yaml
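
A similar check can be made for the Python modules. This sketch looks up each package by its import name without importing it fully; note that Biopython is imported as Bio. The helper name find_missing_modules is hypothetical and not part of the repository.

```python
import importlib.util

# Import names for the packages listed above (Biopython is imported as "Bio").
REQUIRED_MODULES = ["snakemake", "numpy", "pandas", "Bio", "ruamel.yaml"]


def find_missing_modules(modules=REQUIRED_MODULES):
    """Return the import names that cannot be located in this environment."""
    missing = []
    for name in modules:
        try:
            found = importlib.util.find_spec(name) is not None
        except ModuleNotFoundError:
            # Raised for dotted names when the parent package is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("Missing modules:", find_missing_modules() or "none")
```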

R

tidyverse
broom
ggpubr
ggtext
ggrepel
plotlistr

Additional Datasets

Various additional data files are required for parts of the pipeline and analysis:

An aligned SARS-CoV-2 variant VCF file, placed in a location defined in snakemake.yaml. This is used to calculate variant frequencies. I use a version of the VCF used by sarscov2phylo, pruned to only include public sequences.
Supplementary data file S2 from Bouhaddou et al. (2020), saved as data/ptms/SuppTable_annotated_viral_phosphosites_revised.tsv. This is used to source PTM data.
Table S1 from Greaney et al. (2021), saved as data/greaney_spike_antibody.csv. This is used for antibody escape data.
media-3.csv from Starr et al. (2020), saved as data/starr_ace2_spike.csv. This is used to compare predictions to the Spike DMS study in the analysis and is not used in the pipeline.
EVCouplings predictions, saved in data/evcouplings. This is only used for analysis/evcouplings.R.
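
The fixed paths listed above can be checked with a short script before running the pipeline or analysis. This is a sketch assuming it is run from the repository root. The VCF is left out because its location is set in snakemake.yaml, and the function name missing_datasets is hypothetical.

```python
from pathlib import Path

# Paths as given in the list above, relative to the repository root.
EXPECTED_PATHS = [
    "data/ptms/SuppTable_annotated_viral_phosphosites_revised.tsv",
    "data/greaney_spike_antibody.csv",
    "data/starr_ace2_spike.csv",
    "data/evcouplings",
]


def missing_datasets(root=".", paths=EXPECTED_PATHS):
    """Return the expected paths that do not exist under root."""
    root = Path(root)
    return [p for p in paths if not (root / p).exists()]


if __name__ == "__main__":
    for path in missing_datasets():
        print("Missing:", path)
```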

Running the Pipeline

The pipeline manages generation of the dataset, including downloading most source data from online repositories. It is managed by Snakemake, with a master Snakefile and additional rulesets for each section in the pipeline/ directory. Scripts for various sections of the pipeline are found in bin/ and modules in src/. The configuration file (snakemake.yaml) specifies the parameters used to run the pipeline, including paths to local files and a flag telling the pipeline whether to look for online updates to data files. The pipeline can be run using the snakemake command, but running the complete pipeline in practice requires access to a computer cluster and a Snakemake cluster configuration suited to your environment. Running on a single machine, even a very powerful one, would take a prohibitive amount of time (e.g. multiple days).
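
As an illustration of how invocations might look, the sketch below builds snakemake command lines without executing them. It assumes a Snakemake version that still accepts the --cluster and --cluster-config options (they were removed in Snakemake 8) and uses the cluster.yaml file from the repository root. The submit string "bsub -n {threads}" is a hypothetical placeholder for your scheduler, and the helper build_snakemake_command is not part of the repository.

```python
import shlex


def build_snakemake_command(target=None, jobs=1, cluster_submit=None,
                            cluster_config="cluster.yaml", dry_run=False):
    """Build a snakemake argument list; nothing is executed here."""
    cmd = ["snakemake", "--jobs", str(jobs)]
    if dry_run:
        # Show what would be run without running it.
        cmd.append("--dry-run")
    if cluster_submit is not None:
        # Assumed pre-Snakemake-8 style cluster submission options.
        cmd += ["--cluster-config", cluster_config, "--cluster", cluster_submit]
    if target is not None:
        cmd.append(target)
    return cmd


if __name__ == "__main__":
    # Local run of the directory setup rule named in the installation steps.
    print(shlex.join(build_snakemake_command("setup_directories")))
    # Dry run of a cluster submission; the submit string is a placeholder.
    print(shlex.join(build_snakemake_command(
        jobs=100, cluster_submit="bsub -n {threads}", dry_run=True)))
```

The resulting list can be passed to subprocess.run once the printed command looks right for your environment.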

Running Analysis

The analysis R scripts found in analysis/ generate figures summarising the data. They are not automatically run by the data generation pipeline and must be manually executed to generate figures. Most of them can be run without running the data generation pipeline if the Mutfunc: SARS-CoV-2 dataset is downloaded and placed in data/output and the specified additional datasets are downloaded.
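
One way to run the scripts in bulk is to list them and call Rscript on each. The sketch below only builds the commands. It assumes the scripts sit directly in analysis/ with a .R extension and are meant to be run from the repository root, neither of which is confirmed here, and the helper analysis_commands is hypothetical.

```python
from pathlib import Path


def analysis_commands(root="."):
    """Return one Rscript command per analysis script found under root."""
    root = Path(root)
    if not (root / "data" / "output").is_dir():
        raise FileNotFoundError(
            "data/output not found; download the dataset or run the pipeline first")
    # Assumes scripts are stored directly in analysis/ with a .R extension.
    return [["Rscript", str(path)] for path in sorted((root / "analysis").glob("*.R"))]


if __name__ == "__main__":
    try:
        for cmd in analysis_commands():
            print(" ".join(cmd))
    except FileNotFoundError as err:
        print(err)
```

Some scripts also need the additional datasets listed earlier, so a script may still fail even when data/output is present.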
