SARS-CoV-2-HPDA-evolutionary-analysis

@Author: Arnaud N'Guessan

Overview

This repository contains a script for analyzing SARS-CoV-2 evolution in epitopes during the first two waves of the COVID-19 pandemic. The immunological data come from a high protein density array analysis of SARS-CoV-2 epitopes in 15 patients (N'Guessan A. et al., 2022). This script and the related data can be updated manually to integrate data from other waves or other sets of epitopes.

Dependencies

R (version 3.5.2+) packages: "ggplot2", "seqinr", "grid", "RColorBrewer", "randomcoloR", "gplots", "lmPerm", "ggpubr", "gridExtra", "indicspecies", "tidyr", "Cairo", "parallel", "foreach", "doParallel", "infotheo", "VennDiagram", "Biostrings", "session"

The script

a) Inputs:

-->OUTPUT_WORKSPACE: The absolute path of the "SARS-CoV-2-HPDA-evolutionary-analysis/" directory on your system. This directory must contain a sub-directory named "depth_report_NCBI_SRA_amplicon/" holding the depth-of-coverage file of every sample (a .csv file generated by "samtools depth", or a .csv file with 3 columns/fields: the sample, the position of the site in the reference genome MN908947_3, and the depth at that site). Each sample needs its own depth report file; an example is provided in "SARS-CoV-2-HPDA-evolutionary-analysis/depth_report_NCBI_SRA_amplicon/" to show the expected format. The "SARS-CoV-2-HPDA-evolutionary-analysis/" directory must also contain the script (high_confidence_epitopes_analysis.r) and the related data (Epitopes_mapped.csv, MN908947_3.fasta, Table_signature_mutations.csv, df_high_confidence_epitope_metrics.rds, df_sars_cov_2_epitopes.rds, df_variants_SRA_amplicon_first_wave.rds and df_variants_SRA_amplicon_second_wave.rds).

-->NB_CPUS: The number of CPUs to use for the analyses that run in parallel (R doParallel).
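Before launching the analysis, it can help to confirm that the workspace has the layout described above. The following is a minimal sketch in Python, independent of the R script; the helper name `missing_workspace_items` is hypothetical, and the check that depth reports end in ".csv" is an assumption based on the description above.

```python
from pathlib import Path

# File names are the ones listed in this README.
REQUIRED_FILES = [
    "high_confidence_epitopes_analysis.r",
    "Epitopes_mapped.csv",
    "MN908947_3.fasta",
    "Table_signature_mutations.csv",
    "df_high_confidence_epitope_metrics.rds",
    "df_sars_cov_2_epitopes.rds",
    "df_variants_SRA_amplicon_first_wave.rds",
    "df_variants_SRA_amplicon_second_wave.rds",
]
DEPTH_DIR = "depth_report_NCBI_SRA_amplicon"


def missing_workspace_items(workspace):
    """Return the required files/directories that are absent from `workspace`.

    Hypothetical helper: it only checks that items exist, not their content.
    """
    root = Path(workspace)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    depth_dir = root / DEPTH_DIR
    if not depth_dir.is_dir():
        missing.append(DEPTH_DIR + "/")
    elif not any(depth_dir.glob("*.csv")):
        # Assumption: depth reports are stored with a .csv extension.
        missing.append(DEPTH_DIR + "/*.csv")
    return missing
```

Calling `missing_workspace_items("/path/to/SARS-CoV-2-HPDA-evolutionary-analysis")` returns an empty list when every expected item is present, and the names of the missing items otherwise.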

b) Outputs: Various plots showing the evolutionary profile of SARS-CoV-2 epitopes during waves 1 and 2 + comparisons between lineages / variants.

c) Running the script: To run the script from a terminal (command line), you must have R (version 3.5.2+) installed or loaded (for example as a module on a Slurm cluster), then run the command: Rscript high_confidence_epitopes_analysis.r $OUTPUT_WORKSPACE $NB_CPUS

d) Updating lineage signature mutation data: To update or edit the lineage signature mutation data, open the file "Table_signature_mutations.csv" in Excel and add the new signature mutation and its lineage as a new entry in the table. Only these 2 columns are mandatory; you can set the other fields/columns to "NA" or leave them empty. Don't forget to save the table as a .csv file. You can also make the edits in your favorite text editor: the new line for a signature mutation X of lineage Z would be X,Z,NA,NA,NA or X,Z,"","","" or X,Z,,, (any of the three). The signature mutation name must follow the format <ORF_name>:<Old_amino_acid><Residue_position_in_ORF_protein_sequence><New_amino_acid> (e.g. ORF8:L84S).
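The naming format and the new table line described above can be sketched as follows. This is an illustrative Python helper, not part of the R script: the name `signature_mutation_row` is hypothetical, the regular expression is an assumption about what counts as a valid name (including "*" for a stop codon), and the 5-column layout is inferred from the example line X,Z,NA,NA,NA.

```python
import re

# Assumed pattern: ORF name, ":", old amino acid (one letter),
# position in the ORF protein sequence, new amino acid (one letter).
# "*" as the new residue is an assumption for nonsense (stop) mutations.
MUTATION_RE = re.compile(r"^([A-Za-z0-9]+):([A-Z])([1-9][0-9]*)([A-Z*])$")


def signature_mutation_row(mutation, lineage, n_columns=5):
    """Build one CSV line "mutation,lineage,NA,..." after checking the name.

    n_columns=5 mirrors the README example X,Z,NA,NA,NA; adjust it if the
    table has a different number of columns.
    """
    if MUTATION_RE.match(mutation) is None:
        raise ValueError(f"unexpected mutation name: {mutation!r}")
    return ",".join([mutation, lineage] + ["NA"] * (n_columns - 2))


# Example with a placeholder lineage name "Z":
print(signature_mutation_row("ORF8:L84S", "Z"))
```

The returned string can be appended as a new line at the end of "Table_signature_mutations.csv".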

References

We defined the signature mutations of each variant (see "Table_signature_mutations.csv") as substitutions that are present in >=90% of the sequences assigned to that lineage. We calculated the prevalence of substitutions in thousands of publicly available consensus sequences collected from NCBI during 2020, and added data from CoV-Spectrum for lineages that are under-represented in the database or that emerged during 2021 (Chen et al., 2021). The signature mutation dataset is therefore a mix of mutation prevalence data from our own NCBI consensus sequences database (for the earlier lineages) and GISAID data obtained from CoV-Spectrum (for more recent lineages like Omicron). Thus, multiple PANGO versions are involved (v.2.1.7 for the earliest 2020 lineages and v.3.1.20 for recent variants like Omicron). The signature mutation prevalence dataset is provided here as a JSON file named "Database_Missense_and_Nonsense_signature_mutations_prevalence_in_SC2_lineages_consensus_sequences_as_of_2021_01_16_plus_VOCs.json".
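The >=90% prevalence rule can be illustrated with a small Python sketch. It is not the code used to build the dataset: the function name `signature_mutations` and the input layout (one set of substitutions per consensus sequence) are assumptions made for illustration, and the data below are toy values.

```python
from collections import Counter


def signature_mutations(lineage_sequences, threshold=0.90):
    """Return the substitutions found in >= threshold of a lineage's sequences.

    lineage_sequences: list of sets, one set of substitution names per
    consensus sequence assigned to the lineage (assumed layout).
    """
    n = len(lineage_sequences)
    if n == 0:
        return []
    counts = Counter()
    for substitutions in lineage_sequences:
        counts.update(set(substitutions))  # count each substitution once per sequence
    return sorted(m for m, c in counts.items() if c / n >= threshold)


# Toy data: 10 sequences of one hypothetical lineage.
toy = [{"ORF8:L84S", "S:D614G"}] * 8 + [{"ORF8:L84S"}] * 2
print(signature_mutations(toy))  # ORF8:L84S is in 10/10 sequences, S:D614G in 8/10
```

With these toy values only the substitution present in all 10 sequences passes the threshold; the one present in 8 of 10 (80%) does not.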

Chen, C., Nadeau, S., Yared, M., Voinov, P., Ning, X., Roemer, C. & Stadler, T. "CoV-Spectrum: Analysis of globally shared SARS-CoV-2 data to Identify and Characterize New Variants" Bioinformatics (2021); doi: 10.1093/bioinformatics/btab856.
