This is a project to contribute utilities for ViruSurf, a viral db centered in viral sequences.

damianosmel/VirusGenoUtil

VirusGenoUtil

This is a project to contribute utilities for a viral genomic data explorer.

Dependencies

  • Python 3.7.6
  • BioPython 1.74
  • Pandas 1.0.1
  • NumPy 1.19.1
  • Requests 2.24.0

Tested functionalities

  1. Convert nucleic-acid ORF multiple sequence alignments, produced by the virulign aligner and saved as multi-FASTA alignments (MFA), to variants per target sequence

  2. Convert amino-acid ORF multiple sequence alignments, produced by the virulign aligner and saved as multi-FASTA alignments (MFA), to variants per target sequence

  3. Identify homology-based immunodominant regions for the structural proteins of SARS-CoV-2, following "Grifoni et al."

  4. Extract epitopes from the IEDB data portal

Convert virulign MFA to variants csv - How to use

a. install virulign (on a server if you need to run it for many sequences)

b. depending on the sequence type:

  • for amino-acid sequences: see the example script virulign_msa_aa_run.sh, which was used to run virulign on all NCBI SARS-CoV-2 sequences available as of May 15
  • adapt this bash script to your own paths and run it

c. put all output MFA files (one per ORF) into an alignments folder

d. open main.py and adjust the input and output paths:

  • input is the path of the alignments folder
  • set the reference NCBI id
  • create a file that contains your email address (needed to access the Entrez service programmatically and fetch the GenBank file of the reference sequence)
  • output is a folder containing, for each target sequence, a variant csv file covering all of its ORFs; for an example, see output
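
As a rough illustration of what "variants per target sequence" means, the sketch below compares one aligned target against the aligned reference, column by column. It is not the repository's implementation: the function name `call_variants`, the tuple layout and the toy sequences are assumptions made for this example, and insertions and deletions are reported per column rather than merged into longer events.

```python
from typing import List, Tuple

GAP = "-"


def call_variants(ref_aln: str, target_aln: str) -> List[Tuple[int, str, str, str]]:
    """Return (ref_position, ref_residue, target_residue, kind) per differing column.

    ref_position is 1-based on the ungapped reference. For an insertion it is
    the last reference position seen before the inserted residue.
    """
    if len(ref_aln) != len(target_aln):
        raise ValueError("aligned sequences must have equal length")
    variants = []
    ref_pos = 0
    for ref_res, tgt_res in zip(ref_aln.upper(), target_aln.upper()):
        if ref_res != GAP:
            ref_pos += 1  # only reference residues advance the coordinate
        if ref_res == tgt_res:
            continue  # identical residue, or a gap in both rows
        if ref_res == GAP:
            kind = "insertion"
        elif tgt_res == GAP:
            kind = "deletion"
        else:
            kind = "substitution"
        variants.append((ref_pos, ref_res, tgt_res, kind))
    return variants


# Hypothetical aligned rows, as they might appear in one MFA file
print(call_variants("ATG-CCA", "ATGTC-T"))
```

Each tuple could then be written as one row of the variant csv for that target sequence.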

Identify immunodominant regions using homology - How to use

a. download the spike, membrane and nucleoprotein proteins of SARS-CoV (NC_004718.3):

  • create input root: mkdir data/exp_epitopes_input
  • create SARS-CoV proteins folder: mkdir data/exp_epitopes_input/sars_cov1_proteins
  • save each protein in a separate fasta file named after its protein_id

b. download the same proteins for SARS-CoV-2 (NC_045512.2):

  • create SARS-CoV-2 proteins folder: mkdir data/exp_epitopes_input/sars_cov2_proteins
  • save all in sars_cov2_proteins.fasta

c. prepare the BLAST db for the SARS-CoV-2 structural proteins:

  • create blast folder: mkdir data/exp_epitopes_input/blast
  • concatenate all SARS-CoV-2 proteins in one fasta file: cat data/exp_epitopes_input/sars_cov2_proteins/*.fasta > data/exp_epitopes_input/blast/sars_cov2_proteins.fasta
  • build BLAST db (run from inside the blast folder): makeblastdb -in sars_cov2_proteins.fasta -out sars_cov2_SMN -dbtype prot -title "Sars Cov2 Spike Membrane Nucleoprotein" -parse_seqids

d. prepare response frequency data:

  • get the response frequency (RF) assays for B cells and T cells from "IEDB"
  • save them to data/exp_epitopes_input/Bcells and data/exp_epitopes_input/Tcells respectively

e. open exp_epitopes_run.py:

  • prepare and run the Immunodominance class
  • the immunodominant regions for SARS-CoV are saved in the specified output folder, optionally together with the sliding-average RF plots per protein

f. run BLASTP with these regions as queries against the SARS-CoV-2 proteins

g. open exp_epitopes_run.py again:

  • prepare and run the HomologyBasedEpitopes class
  • the homology-based epitopes are saved as a tsv file in the specified output folder
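
Step e mentions sliding-average RF plots. The sketch below shows one common way to smooth per-residue response frequency values with a centred window. It is a minimal illustration, not the Immunodominance class itself: the function name, the window size and the choice to truncate windows at the sequence ends are all assumptions.

```python
from typing import List


def sliding_average(values: List[float], window: int) -> List[float]:
    """Average each position with its neighbours in a centred window.

    Windows are truncated at the sequence ends, so the output has the same
    length as the input.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    averaged = []
    for i in range(len(values)):
        start = max(0, i - half)
        end = min(len(values), i + half + 1)
        chunk = values[start:end]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


# Hypothetical per-residue RF values for a short protein stretch
rf = [0.0, 0.0, 0.3, 0.6, 0.3, 0.0]
print(sliding_average(rf, window=3))
```

Positions whose smoothed value stays above a chosen cutoff over a contiguous stretch would then be candidates for an immunodominant region; the cutoff itself is not specified here.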

Extract epitopes from prediction tools - How to use

Bepipred 2.0

  1. concatenate the structural proteins into one fasta file (a time-efficient way to use the BepiPred predictor):

    cd sars_cov2_data/pred_epitopes_input/sars_cov2_proteins
    cat *.fasta > sars_cov2_struct_prot.fasta

  2. run BepiPred 2.0 for the SARS-CoV-2 structural proteins:

    cd sars_cov2/sars_cov2_data/pred_epitopes_input
    BepiPred-2.0 -t 0.55 sars_cov2_proteins/sars_cov2_struct_prot.fasta > bepipred_out/bepipred_sars_cov2.tsv

  3. set the parameters in pred_epitopes_run.py and run the section for BepipredEpitopes:

  • the output tsv file will be placed in your specified output folder:

    ls -lh sars_cov2/pred_epitopes/bepipred/bepipred_sars_cov2_epi.tsv
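
To make the thresholding step concrete, the sketch below groups consecutive residues whose score reaches the threshold into candidate epitope segments. It assumes the per-residue scores have already been read from the predictor output (parsing that file is not shown), and the function name, the `min_length` parameter and the toy data are hypothetical; only the 0.55 threshold comes from the command above.

```python
from typing import List, Tuple


def epitope_segments(sequence: str, scores: List[float], threshold: float = 0.55,
                     min_length: int = 1) -> List[Tuple[int, int, str]]:
    """Return (start, end, peptide) for runs of residues scoring >= threshold.

    start and end are 1-based and inclusive.
    """
    if len(sequence) != len(scores):
        raise ValueError("sequence and scores must have equal length")
    segments = []
    run_start = None  # 0-based index where the current run began
    for i, score in enumerate(scores):
        if score >= threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_length:
                segments.append((run_start + 1, i, sequence[run_start:i]))
            run_start = None
    # close a run that extends to the last residue
    if run_start is not None and len(scores) - run_start >= min_length:
        segments.append((run_start + 1, len(scores), sequence[run_start:]))
    return segments


# Hypothetical sequence and scores
print(epitope_segments("MKTAYIAK", [0.1, 0.6, 0.7, 0.2, 0.56, 0.9, 0.8, 0.3], min_length=2))
```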

Extract epitopes from IEDB

IEDB data

  • go to the IEDB database export "page"
  • download the B cell, T cell and mhc_ligand_full (single file) csv.zip files
  • decompress each of them and then compress it again using gzip
  • finally, place the three gzip files in the iedb_data folder
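
The decompress-then-gzip step can also be scripted. The sketch below re-compresses every file inside a .zip archive as its own .gz file using only the Python standard library; the function name and the paths in the usage comment are placeholders, not names taken from the repository.

```python
import gzip
import shutil
import zipfile
from pathlib import Path
from typing import List


def zip_to_gzip(zip_path: Path, out_dir: Path) -> List[Path]:
    """Re-compress every file inside a .zip archive as an individual .gz file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            if member.is_dir():
                continue
            gz_path = out_dir / (Path(member.filename).name + ".gz")
            # stream the member so large exports are never fully held in memory
            with archive.open(member) as src, gzip.open(gz_path, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(gz_path)
    return written


# Usage (placeholder paths):
# zip_to_gzip(Path("downloads/some_export.csv.zip"), Path("iedb_data"))
```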

Viruses data

  • create a folder containing the viruses of interest. Save each virus as a separate subfolder named taxon_ncbi_id, where ncbi_id is the NCBI taxonomy id of the virus. For example, for SARS-CoV-2 the subfolder should be named taxon_2697049.
  • in each subfolder, add each viral protein as a separate .fa file named after the UniProt id of the protein. For example, name the Spike protein file P59594.fa. For more information on how to create a viral folder, consult an example download log file.
  • the result is a viruses_proteins folder like the example one.
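
A quick way to check that a viruses folder follows this layout is to walk it and list what was found. The sketch below is a hypothetical helper, not part of the repository: it maps each taxon id to the UniProt ids taken from the .fa file names and silently skips anything that does not match the taxon_<id> pattern.

```python
from pathlib import Path
from typing import Dict, List


def scan_viruses_folder(root: Path) -> Dict[int, List[str]]:
    """Map each NCBI taxon id to the sorted UniProt ids found in its subfolder."""
    viruses = {}
    for sub in sorted(root.iterdir()):
        if not sub.is_dir() or not sub.name.startswith("taxon_"):
            continue
        taxon_text = sub.name[len("taxon_"):]
        if not taxon_text.isdigit():
            continue  # not a taxon_<number> folder
        viruses[int(taxon_text)] = sorted(fa.stem for fa in sub.glob("*.fa"))
    return viruses


# Usage (placeholder path):
# print(scan_viruses_folder(Path("viruses_proteins")))
```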

ONTIE downloads

  • re-use the ONTIE .ttl files from the ONTIE "page" that were already downloaded by previous executions of the code.

Email for Entrez services

  • please add your email address to email.txt. It is needed to run Entrez services through Biopython.
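
For orientation, the sketch below shows how an email file of this kind is typically passed to Biopython's Entrez module. The helper names `read_email` and `fetch_genbank` are assumptions for this example and may not match how the repository reads email.txt; `fetch_genbank` needs Biopython and network access.

```python
from pathlib import Path


def read_email(path: Path) -> str:
    """Return the first non-empty line of the email file, stripped."""
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line:
            if "@" not in line:
                raise ValueError("email file does not contain an email address")
            return line
    raise ValueError("email file is empty")


def fetch_genbank(accession: str, email: str) -> str:
    """Fetch a GenBank record as text through NCBI Entrez."""
    from Bio import Entrez  # imported lazily so read_email works without Biopython

    Entrez.email = email  # NCBI asks every Entrez client to identify itself
    handle = Entrez.efetch(db="nucleotide", id=accession, rettype="gb", retmode="text")
    try:
        return handle.read()
    finally:
        handle.close()


# Usage (requires network access):
# record_text = fetch_genbank("NC_045512.2", read_email(Path("email.txt")))
```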

Extracting IEDB epitopes

a. extract epitopes, save them in csv files

  • set up the paths for your machine in iedb_epitopes_run.py
  • run iedb_epitopes_run.py
  • two files will be created in the specified output folder: imported_iedb_epitopes.tsv is the realization of the Epitope table and imported_epitope_fragments.tsv is the realization of the EpitopeFragment table in the ViruSurf db.

b. extract epitopes, access them through a list of tuples
