Protein Secondary Structure Prediction

Protein secondary structure prediction is an essential problem in bioinformatics. The secondary structure depends mostly on the primary amino acid sequence of the protein, and its prediction belongs to the group of pattern recognition and classification problems: the secondary structure of a given instance is predicted based on its sequence features. One of the known solutions uses a Support Vector Machine (SVM) to predict the secondary structure, as described in [1] and [2]. The aim of this work was to implement a protein secondary structure predictor based on a logistic regression model, following the algorithm described in the mentioned articles. The project was implemented in the R programming language.

Methods

Dataset

The dataset consists of three text files: the training, testing and validation datasets. Each file has the following structure: the first line contains the sequence identification code, the second line the amino acid sequence, and the third line the secondary structure. Proteins are separated by an empty line. There are no missing values in this dataset.
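
A minimal sketch of how such a file could be parsed in R is shown below; the file name and the helper name are illustrative assumptions, not the exact code used in the project.

```r
# Read a dataset file in which each protein occupies three lines
# (identifier, amino-acid sequence, secondary structure) separated by an empty line.
read_protein_file <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(trimws(lines))]            # drop the empty separator lines
  ids        <- lines[seq(1, length(lines), by = 3)]
  sequences  <- lines[seq(2, length(lines), by = 3)]
  structures <- lines[seq(3, length(lines), by = 3)]
  data.frame(id = ids, sequence = sequences, structure = structures,
             stringsAsFactors = FALSE)
}

train <- read_protein_file("train.txt")   # hypothetical file name
```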

Working with the large dataset

Due to the large size of the dataset, we decided to use R libraries that allow calculations to be run on multiple cores: parallel and doParallel. In addition, we saved the trained binary classifiers to RDS files; this allowed us to remove the models from the workspace and free memory, which was important for performing further calculations. We used Google Drive to store the models, which also allowed us to transfer data between two computers.
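
A hedged sketch of this pattern (cluster setup with parallel/doParallel, and saving each trained model to an RDS file so it can be dropped from memory); the training helper and file names are assumptions made for illustration.

```r
library(parallel)
library(doParallel)
library(foreach)

# Register a cluster that uses all but one core.
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

# Train each binary classifier in parallel and persist it to disk,
# so the fitted model can be removed from the workspace afterwards.
classifier_names <- c("H_vs_E", "C_vs_E", "C_vs_H", "C_vs_rest", "E_vs_rest", "H_vs_rest")
foreach(name = classifier_names) %dopar% {
  model <- train_binary_classifier(name)        # hypothetical training helper
  saveRDS(model, file = paste0(name, ".rds"))
  NULL                                          # return nothing to keep memory use low
}

stopCluster(cl)

# Later, a stored model can be loaded back on demand:
model_HE <- readRDS("H_vs_E.rds")
```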

Measures

To evaluate the achieved results, we used two commonly used measures: Q3 and SOV.

Q3

Secondary structure prediction is usually evaluated with Q3 accuracy, which measures the percentage of residues for which the 3-state secondary structure is correctly predicted.
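
As an illustration, Q3 can be computed directly from the observed and predicted per-residue states; this small helper is a sketch, not the project's exact evaluation code.

```r
# Q3: percentage of residues whose 3-state label (H, E or C) is predicted correctly.
q3_score <- function(observed, predicted) {
  stopifnot(length(observed) == length(predicted))
  100 * sum(observed == predicted) / length(observed)
}

q3_score(c("H", "H", "E", "C", "C"), c("H", "E", "E", "C", "H"))  # 60
```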

SOV

The segment overlap score (SOV) is used to evaluate the predicted protein secondary structures, a sequence composed of helix (H), strand (E), and coil (C), by comparing it with the native or reference secondary structures. The main advantage of SOV is that it can consider the size of continuous overlapping segments and assign extra allowance to longer continuous overlapping segments instead of only judging from the percentage of overlapping individual positions as Q3 score does.

Logistic Regression

Logistic regression analyzes the relationship between multiple independent variables and a categorical dependent variable and estimates the probability of occurrence of an event by fitting data to a logistic curve. Binary logistic regression is commonly used when the outcome variable is binary and the predictor variables are either continuous or categorical.
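
In R, a binary logistic regression model can be fitted with glm using the binomial family. The sketch below assumes the encoded windows are stored as columns of a data frame together with a 0/1 label; the data frame and column names are illustrative.

```r
# 'train_df' is assumed to hold one row per window: the flattened orthogonal
# encoding in feature columns and a binary label in column 'label'
# (e.g. 1 = helix, 0 = not helix for the H/~H classifier).
model <- glm(label ~ ., data = train_df, family = binomial)

# Predicted probabilities on a test set with the same feature columns:
probs <- predict(model, newdata = test_df, type = "response")
pred  <- ifelse(probs > 0.5, 1, 0)
```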

Algorithm Description

After investigating the dataset, we created the inputs for the logistic regression classifiers, following the instructions described in the articles. First, we implemented the sliding window scheme, which preserves information about the local interactions among neighbouring residues. A window size has to be chosen; to predict the structure of the residue in the middle of the window, the whole window of surrounding residues is used as input. At the ends of a sequence some window positions fall outside the sequence, so we padded them with the empty character "-" to keep the correct window size. The scheme for a window of size 5 is presented in the image below.

Sliding window coding scheme


[Figure: sliding window coding scheme for a window of size 5]
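
A possible R implementation of the padded sliding window described above; the function name is an assumption made for illustration.

```r
# Extract all windows of a given (odd) size from a sequence, padding the ends
# with "-" so that every residue can sit in the middle of a full window.
sliding_windows <- function(sequence, window_size = 5) {
  half <- (window_size - 1) / 2
  residues <- c(rep("-", half), strsplit(sequence, "")[[1]], rep("-", half))
  n <- nchar(sequence)
  t(sapply(seq_len(n), function(i) residues[i:(i + window_size - 1)]))
}

sliding_windows("MKV", window_size = 5)
# each row is one window, e.g. "-" "-" "M" "K" "V" for the first residue
```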

Orthogonal Input profile

The next step was to use orthogonal encoding to assign a unique binary vector to each residue. For each position in the window, the entry corresponding to the residue at that position is set to 1 and all remaining entries are set to 0. As a result, we obtain an input matrix with 21 columns (20 different amino acids plus the empty character) and a number of rows equal to the window size. Next, we reshaped the orthogonal input into a one-dimensional input vector whose length equals the number of rows multiplied by the number of columns: the rows of the matrix are written to the vector one after another. The encoding for an input window of size 5 is presented in the image below.

[Figure: orthogonal encoding of an input window of size 5]
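
A minimal sketch of the orthogonal encoding in R, assuming a 21-symbol alphabet (the 20 amino acids plus the padding character "-"); the symbol order and function name are illustrative.

```r
# 21-symbol alphabet: 20 amino acids plus the "-" padding character.
alphabet <- c("A","C","D","E","F","G","H","I","K","L",
              "M","N","P","Q","R","S","T","V","W","Y","-")

# Encode one window (a character vector of residues) as a single binary vector:
# each residue contributes 21 entries with a single 1, and the rows are
# concatenated one after another into a vector of length window_size * 21.
encode_window <- function(window, alphabet) {
  rows <- lapply(window, function(res) as.integer(alphabet == res))
  unlist(rows)
}

length(encode_window(c("-", "-", "M", "K", "V"), alphabet))  # 105 for a window of size 5
```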

Constructing the binary classifiers

We constructed six binary classifiers: three one-versus-one classifiers (H/E, C/E, C/H) and three one-versus-rest classifiers (C/~C, E/~E, H/~H). For each classifier, we trained the logistic regression model.
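
The six binary problems differ only in how the 3-state labels are turned into 0/1 targets. Below is a hedged sketch of that step with glm as the logistic regression fitter; the variable names (features, observed_states) are assumptions.

```r
# 'features' is the matrix of encoded windows and 'observed_states' the vector of
# observed states ("H", "E", "C") for the central residue of each window.

# One-versus-rest example: H/~H
y_H_vs_rest <- as.integer(observed_states == "H")
model_H_vs_rest <- glm(y ~ ., data = data.frame(y = y_H_vs_rest, features),
                       family = binomial)

# One-versus-one example: H/E (only windows whose true state is H or E are used)
keep <- observed_states %in% c("H", "E")
y_H_vs_E <- as.integer(observed_states[keep] == "H")
model_H_vs_E <- glm(y ~ ., data = data.frame(y = y_H_vs_E, features[keep, ]),
                    family = binomial)
```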

Constructing tertiary classifier

The binary classifiers were used to build different tertiary classifiers. We created the three tree classifiers described in the articles (C/~C & H/E, E/~E & C/H, H/~H & C/E). For example, in the first classifier, when the C/~C binary classifier classifies a sample as C, the predicted value is C; otherwise the class is predicted by the one-versus-one classifier H/E. Their structures are presented in the figures below. We also tested a classifier built from the three one-versus-rest classifiers (C/~C & E/~E & H/~H), in which a sample is assigned the class with the highest predicted probability.

[Figures: structures of the three tree classifiers]
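
A sketch of how the binary outputs can be combined, assuming each fitted model returns a probability via predict(..., type = "response"); the model and function names are assumptions made for illustration.

```r
# Tree classifier C/~C & H/E: if the one-versus-rest classifier says "C",
# predict C; otherwise let the H/E one-versus-one classifier decide.
predict_tree <- function(model_C_vs_rest, model_H_vs_E, newdata) {
  p_C <- predict(model_C_vs_rest, newdata = newdata, type = "response")
  p_H <- predict(model_H_vs_E,    newdata = newdata, type = "response")
  ifelse(p_C > 0.5, "C", ifelse(p_H > 0.5, "H", "E"))
}

# Classifier built from the three one-versus-rest models:
# the class with the highest predicted probability wins.
predict_one_vs_rest <- function(model_C, model_E, model_H, newdata) {
  probs <- cbind(C = predict(model_C, newdata = newdata, type = "response"),
                 E = predict(model_E, newdata = newdata, type = "response"),
                 H = predict(model_H, newdata = newdata, type = "response"))
  colnames(probs)[max.col(probs)]
}
```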

Results

We tested different window sizes (from 5 to 13 amino acids, odd sizes only). Table 3.1 presents the accuracy scores obtained for each binary classifier on the test dataset. The best results were obtained for the window of size 13.

[Table 3.1: accuracy of each binary classifier on the test dataset for the tested window sizes]

Then we compared the Q3 and SOV results obtained for each of the tertiary classifiers. To do that, we saved the predicted structures in FASTA format. For each classifier, we used the window size that provided the best accuracy for the binary classification. The results are presented in Table 3.2. The best results were achieved by the H/~H & C/~C & E/~E classifier.

[Table 3.2: Q3 and SOV scores for each tertiary classifier]
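
For reference, the predicted structures can be written in a simple FASTA form; this is a hedged sketch with assumed names, not the exact export code used in the project.

```r
# Write predicted secondary structures in FASTA format:
# a ">" header line with the protein id followed by the predicted H/E/C string.
write_fasta <- function(ids, predictions, path) {
  lines <- as.vector(rbind(paste0(">", ids), predictions))
  writeLines(lines, con = path)
}

write_fasta(c("1abc", "2xyz"), c("CCHHHHCC", "CEEEECC"), "predicted.fasta")
```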

Conclusions

The project allowed us to learn about the problem of protein secondary structure prediction. The obtained results are worse than the results achieved with the method described in [2]. It should be noted that the dataset used in the project is not exactly the same as the one used by the authors of the articles. However, the Q3 and SOV measures are better than 33.33%, so our classifiers perform better than random guessing. Testing different window sizes allowed us to find the binary classifiers that provide better accuracy. Building the tertiary classifier from the three one-versus-rest classifiers gave the highest Q3 and SOV values. To develop the project further, it would be worth checking whether window sizes larger than 13 improve model accuracy, since for the tested window sizes accuracy increases with the window size. One could also try to build more complex classifiers for recognizing the three classes, which might improve prediction accuracy. Current solutions based on deep learning provide better SOV and Q3 measures than our solution based on logistic regression.

Bibliography

[1] Mayuri Patel and Hitesh Shah. ‘Protein Secondary Structure Prediction Using Support Vector Machines (SVMs)’. In: 2013 International Conference on Machine Intelligence and Research Advancement. 2013, pp. 594–598. doi: 10.1109/ICMIRA.2013.124.

[2] Sujun Hua and Zhirong Sun. ‘A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach’. In: Journal of Molecular Biology (2001), pp. 397–407.

[3] Tong Liu and Zheng Wang. ‘A further refined definition of segment overlap score and its significance for protein structure similarity’. In: Source Code for Biology and Medicine 13 (2018).

