jahnl/binding_in_disorder

IDBindT5

Prediction of Binding Residues in Disordered Regions Based on Protein Embeddings

We present IDBindT5, a novel machine learning (ML) model trained to predict binding regions specifically in intrinsically disordered protein regions (IDPRs). IDBindT5 uses embeddings from the protein language model (pLM) ProtT5 [1] and reaches a balanced accuracy of 57.2±3.6% (95% confidence interval). This is numerically slightly higher than the performance of the state-of-the-art (SOTA) methods ANCHOR2 (52.4±2.7%) and DeepDISOBind (56.9±5.6%), which rely on expert-crafted features and/or evolutionary information from multiple sequence alignments (MSAs). IDBindT5 also predicts much faster than other methods, which makes full-proteome analyses easy.
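
For readers unfamiliar with the metric, the following is a minimal sketch of how balanced accuracy is commonly computed for a binary per-residue task: the mean of the recall on binding residues and the recall on non-binding residues. The function name and the toy labels are illustrative assumptions, not code or data from this repository, and the confidence-interval estimation is not shown.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = binding)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must occur in y_true")
    sensitivity = tp / (tp + fn)  # recall on binding residues
    specificity = tn / (tn + fp)  # recall on non-binding residues
    return (sensitivity + specificity) / 2


# Toy example (made-up labels): 2 binding and 6 non-binding residues.
toy_true = [1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 0, 0, 0, 0, 0, 1, 0]
toy_score = balanced_accuracy(toy_true, toy_pred)  # (1/2 + 5/6) / 2, about 0.667
```

Because both classes contribute equally, a predictor that labels every residue as non-binding scores 50%, regardless of how rare binding residues are.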

How to use

The repository has two public branches: main and prediction. The main branch is for active development and provides reproducibility and additional information for the thesis. To use our model for binding residue prediction, switch to the prediction branch and clone this reduced repository to your local machine. The prediction branch includes all scripts needed for prediction, five selected ML models, and example input data.

Input files

You will need:

  • a FASTA file of your protein sequence(s)
  • an H5 file of your protein embeddings, generated by ProtT5. We recommend using the bio-embeddings pipeline for the task (https://github.com/sacdallago/bio_embeddings/).
  • a disorder annotation. You can either input curated MobiDB [2] annotations (click 'download entry/entries > fasta'), or use SETH [3] to predict disordered regions instead (https://embed.predictprotein.org/)
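
Since the three inputs must describe the same proteins, it can help to check the identifiers before running a prediction. The sketch below is an assumption-laden illustration, not part of the repository: it parses a FASTA file with the standard library and reports which sequence IDs are absent from a set of embedding IDs. How to list the IDs stored in your H5 file depends on how the embeddings were generated, so that step is left out and the set is passed in directly. The names `read_fasta`, `protA`, and `protB` are hypothetical.

```python
def read_fasta(lines):
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines.

    The identifier is the first whitespace-separated token of the header.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            parts = line[1:].split()
            header, chunks = (parts[0] if parts else ""), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up example input; real use would pass an open file handle instead.
EXAMPLE_FASTA = """>protA example protein
MKT
AYI
>protB
MSDE
"""
records = dict(read_fasta(EXAMPLE_FASTA.splitlines()))

# Hypothetical set of IDs for which embeddings exist.
embedding_ids = {"protA"}
missing = sorted(set(records) - embedding_ids)
```

Here `missing` lists the sequences that still need embeddings. The same comparison can be repeated against the IDs in the disorder annotation.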

config_prediction.ini

Fill out the config file so that the model can run with your preferred settings:

  • You can leave the parameters 'model_name', 'fold' and 'cutoff' blank to use model 'FNN_all' with default settings.
  • Enter any other model name from the ./results/models folder to use it instead. You can also download additional models from the repository's main branch for your prediction.
  • For high precision (at the cost of recall), we recommend using model 'FNN_disorder' with a cutoff of 0.55.
  • Enter the paths to your input files into the corresponding configuration items: annotations (MobiDB FASTA or SETH prediction + IDs), FASTA sequences and H5 embeddings.
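
To illustrate why a higher cutoff trades recall for precision, here is a small sketch with made-up scores. It assumes that a residue is called binding when its score is greater than or equal to the cutoff; the exact comparison used by the prediction script may differ. The lower cutoff of 0.5 is only an illustrative comparison point, not necessarily the tool's default.

```python
def apply_cutoff(scores, cutoff):
    """Label a residue as binding (1) if its score reaches the cutoff (assumed rule)."""
    return [1 if s >= cutoff else 0 for s in scores]


def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels; 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up per-residue scores and true labels, for illustration only.
scores = [0.90, 0.70, 0.53, 0.52, 0.51, 0.30]
labels = [1, 1, 1, 0, 0, 0]

low = precision_recall(labels, apply_cutoff(scores, 0.50))   # precision 0.6, recall 1.0
high = precision_recall(labels, apply_cutoff(scores, 0.55))  # precision 1.0, recall 2/3
```

Raising the cutoff removes the two borderline false positives, but it also drops one true binding residue.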

Refer to the configuration file's comments for more detailed instructions.
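
As a rough illustration of the "blank means default" behaviour described above, the sketch below reads an INI-style configuration with Python's `configparser`. Only the parameter names `model_name`, `fold`, and `cutoff` and the default model `FNN_all` come from this README. The section name, the path keys, the example paths, and the `load_settings` helper are hypothetical; the real config_prediction.ini may be organised differently.

```python
import configparser

# Hypothetical layout: section name, path keys, and paths are assumptions.
EXAMPLE_CONFIG = """
[prediction]
model_name =
fold =
cutoff =
annotations = ./data/annotations.fasta
fasta = ./data/sequences.fasta
embeddings = ./data/embeddings.h5
"""


def load_settings(text):
    """Parse the config text; blank values fall back to defaults."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["prediction"]
    cutoff_raw = section.get("cutoff", fallback="").strip()
    return {
        # Blank model name selects 'FNN_all', as stated in the README.
        "model_name": section.get("model_name", fallback="").strip() or "FNN_all",
        # None stands for "use the tool's default"; its value is not assumed here.
        "fold": section.get("fold", fallback="").strip() or None,
        "cutoff": float(cutoff_raw) if cutoff_raw else None,
        "annotations": section.get("annotations", fallback="").strip(),
        "fasta": section.get("fasta", fallback="").strip(),
        "embeddings": section.get("embeddings", fallback="").strip(),
    }


default_settings = load_settings(EXAMPLE_CONFIG)

# High-precision variant recommended in the README.
precise_config = (
    EXAMPLE_CONFIG
    .replace("model_name =", "model_name = FNN_disorder")
    .replace("cutoff =", "cutoff = 0.55")
)
precise_settings = load_settings(precise_config)
```

The first call resolves to model 'FNN_all' with no explicit cutoff, while the second selects 'FNN_disorder' with cutoff 0.55.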

Execution

  • change directory to .../binding_in_disorder/src
  • run python main_prediction.py

References

[1] Elnaggar, A., Heinzinger, M., Dallago, C., Rehawi, G., Wang, Y., Jones, L., Gibbs, T., Feher, T., Angerer, C., Steinegger, M., Bhowmik, D., & Rost, B. (2021). ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing. - under review -. https://doi.org/10.1101/2020.07.12.199554

[2] Piovesan, D., Del Conte, A., Clementel, D., Monzon, A. M., Bevilacqua, M., Aspromonte, M. C., Iserte, J. A., Orti, F. E., Marino-Buslje, C., & Tosatto, S. C. E. (2022). MobiDB: 10 years of intrinsically disordered proteins. Nucleic Acids Res. https://doi.org/10.1093/nar/gkac1065

[3] Ilzhöfer, D., Heinzinger, M., & Rost, B. (2022). SETH predicts nuances of residue disorder from protein embeddings [Original Research]. Frontiers in Bioinformatics, 2. https://doi.org/10.3389/fbinf.2022.1019597

Citation

@article{jahn2024IDBindT5,
  title={Protein Embeddings Predict Binding Residues in Disordered Regions},
  author={Jahn, Laura R and Marquet, Celine and Heinzinger, Michael and Rost, Burkhard},
  journal={bioRxiv},
  pages={2024--03},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}

About

Prediction of Binding Residues in Disordered Regions Based on Protein Embeddings; TUM Master Praktikum Bioinformatics 2022 (Project #3) and Master's Thesis
