Protein_Normalized_SW_Similarity_Score

An R script that calculates a similarity matrix for a list of protein sequences with the aid of Bleakley-Yamanishi Normalized Smith-Waterman Similarity Score.

Installation

Make sure to have installed the following packages : Biostrings, and here.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("Biostrings")

install.packages("here")

More Details

In the area of Drug-Target Interaction (DTI) prediction and Computer-Aided Drug Design (CADD) , most of the times, it is necessary to express the proteins and the drugs as a series of features and attributes that help us train our prediction model more efficiently. Bleakley-Yamanishi Normalized Smith-Waterman Similarity Score is one of the ways to rate and cluster a pair of proteins based on the similarity of their sequences. The classic Smith-Waterman Similarity Score is higher for larger sequences and lower for small sequences. The normalization process, helps to neutralize the influence of the length of the pairs of the sequence on the similarity score, hence the name normalized similarity score.

Bleakley-Yamanishi Normalized Smith-Waterman Similarity Score was first described in this paper. and was subsequently used by other researchers in this field(see this paper.).

The settings for the pairwise alignment are similar to the default paramters of EMBOSS Water PSA program hosted at European Bioinformatic Institute(EBI) website :

Substitution Matrix = BLOSUM62, Gap open = 10, Gap extend = 0.5

This particular formula taken from Bleakley et al (2009), is the heart of this script :

Further Readings :

Kevin Bleakley, Yoshihiro Yamanishi, Supervised prediction of drug–target interactions using bipartite local models, Bioinformatics, Volume 25, Issue 18, 15 September 2009, Pages 2397–2403, https://doi.org/10.1093/bioinformatics/btp433

Seal, A., Ahn, YY. & Wild, D.J. Optimizing drug–target interaction prediction based on random walk on heterogeneous networks. J Cheminform 7, 40 (2015). https://doi.org/10.1186/s13321-015-0089-z

Maryam Bagherian, Elyas Sabeti, Kai Wang, Maureen A Sartor, Zaneta Nikolovska-Coleska, Kayvan Najarian, Machine learning approaches and databases for prediction of drug–target interaction: a survey paper, Briefings in Bioinformatics, Volume 22, Issue 1, January 2021, Pages 247–269, https://doi.org/10.1093/bib/bbz157

The script takes about a minute or so provided that the number of sequences are below 50. For fasta files which contain 1000 unique sequences and more, it takes around 24 hours for the tasks to complete!
For larger sequences, one might use parallelization. I have to use parallel package of R, if I want to implement parallelization in the future.
I have provided five datasets, containing the sequences of various proteins of human proteome. All of these five files are taken from Uniprotkb. "16_Selected_Proteins.txt" and "38_Selected_Proteins.txt" can be used as demo datasets to test the script. The other three datasets, contain much more sequences and it takes more than a day for my system to execute them. Parallelization might be helpful to reduce the needed time.
I first calculated the similarity matrix, then used (res)^(-1) as a measure of dissimilarity. There must exist a better way to calculate dissimilarity scores from similarity scores...

NOTE: As always, I have put the Expected_Results folder which contains the files that the R script generates once you run it.

NOTE: ~~Make sure to edit setwd command in the R script to point at your desired directory.~~ In this particular repository, it is not necessary because of the here package. However you can consider to do it manually in case the script somehow does not work for you.

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
Expected_Results		Expected_Results
16_Selected_Proteins.txt		16_Selected_Proteins.txt
38_Selected_Proteins.txt		38_Selected_Proteins.txt
LICENSE		LICENSE
Protein_Similarity_Normalized_Smith_Waterman.R		Protein_Similarity_Normalized_Smith_Waterman.R
README.md		README.md
Uniprot_GPCR_Human_Reviewed_Canocial_and_Isoform.txt		Uniprot_GPCR_Human_Reviewed_Canocial_and_Isoform.txt
Uniprot_Ion_Channel_Human_Reviewed_Canocial_and_Isoform.txt		Uniprot_Ion_Channel_Human_Reviewed_Canocial_and_Isoform.txt
Uniprot_Proteome_Human_Reviewed.txt		Uniprot_Proteome_Human_Reviewed.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Expected_Results

Expected_Results

16_Selected_Proteins.txt

16_Selected_Proteins.txt

38_Selected_Proteins.txt

38_Selected_Proteins.txt

LICENSE

LICENSE

Protein_Similarity_Normalized_Smith_Waterman.R

Protein_Similarity_Normalized_Smith_Waterman.R

README.md

README.md

Uniprot_GPCR_Human_Reviewed_Canocial_and_Isoform.txt

Uniprot_GPCR_Human_Reviewed_Canocial_and_Isoform.txt

Uniprot_Ion_Channel_Human_Reviewed_Canocial_and_Isoform.txt

Uniprot_Ion_Channel_Human_Reviewed_Canocial_and_Isoform.txt

Uniprot_Proteome_Human_Reviewed.txt

Uniprot_Proteome_Human_Reviewed.txt

Repository files navigation

Protein_Normalized_SW_Similarity_Score

Installation

More Details

About

Releases

Packages

Languages

License

Amirreza-Mousavi/Protein_Normalized_SW_Similarity_Score

Folders and files

Latest commit

History

Repository files navigation

Protein_Normalized_SW_Similarity_Score

Installation

More Details

About

Topics

Resources

License

Stars

Watchers

Forks

Languages