Creation of custom sidechainnet entries for .cif files #53

Open
alpha-omega-labs opened this issue Nov 19, 2022 · 4 comments

Comments

@alpha-omega-labs

Hello,
How can I create custom SidechainNet-like (and ProteinNet-like) dataset entries from a list of .cif files?
How can I create a SidechainNet-formatted dataset from 1000 custom .cif proteins from RCSB?
Can sidechainnet entries be used along with OpenProtein https://github.com/biolib/openprotein?
Thank you!

@alpha-omega-labs
Author

Hello,
Could you please give me some pointers on how best to use SidechainNet for the following goal:

Develop a model that predicts a specific class of proteins (by structure), 100-400 aa in length.
This class is divided into two subclasses, which are structurally about as similar as a cat and a bobcat.
The model should be able to predict both subclasses accurately.
ProteinNet CASP12 has fewer than 150 chains related to class 1 and fewer than 150 chains related to class 2.
RCSB currently has ~800 chains of class 1 and ~500 of class 2.

Questions:

  1. Is it better to develop two models, one for class 1 and another for class 2, or a single model for both classes? (That is, train on 800 and 500 separately, or train on all 1300 with proper splits in the validation subset.)
  2. How should I construct a dataset containing both the chains from RCSB and the existing ProteinNet records (~300)? Is it right to load ProteinNet with the existing IDs (~300) and add the missing chains via the provided commands? Or is it better to load a single ID from ProteinNet and the other 1299 from RCSB?
  3. Which model is best suited to such structures (~300 aa) with about 1300 items in the dataset? (The real-world data to predict with the model is about 35k sequences.)
  4. Is there a Discord or any other place to chat about general SidechainNet questions?

Thank you

@jonathanking
Owner

Hello,
How can I create custom SidechainNet-like (and ProteinNet-like) dataset entries from a list of .cif files?
How can I create a SidechainNet-formatted dataset from 1000 custom .cif proteins from RCSB?
Can sidechainnet entries be used along with OpenProtein https://github.com/biolib/openprotein?
Thank you!

I'm sorry, but at the moment SidechainNet cannot parse cif files. This issue shows a way to load PDB files, however. To convert CIF to PDB files, you may try using ProDy. Let me know if you need more assistance with that.
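
As a rough illustration of the CIF-to-PDB conversion suggested above, here is a minimal sketch using ProDy's `parseMMCIF` and `writePDB`. The helper names (`pdb_path_for`, `convert_cif_to_pdb`, `convert_directory`) and the directory layout are assumptions made for this example; they are not part of SidechainNet or ProDy. Note that very large structures may not fit in the PDB format (for example, more than 99,999 atoms), so some conversions may fail or lose information.

```python
from pathlib import Path


def pdb_path_for(cif_path, out_dir):
    """Return the output .pdb path for a given .cif path (hypothetical helper)."""
    return Path(out_dir) / (Path(cif_path).stem + ".pdb")


def convert_cif_to_pdb(cif_path, out_dir):
    """Convert one mmCIF file to PDB format with ProDy."""
    # Imported lazily so the path helper above works without ProDy installed.
    from prody import parseMMCIF, writePDB

    atoms = parseMMCIF(str(cif_path))
    if atoms is None:
        raise ValueError(f"ProDy could not parse {cif_path}")
    out_path = pdb_path_for(cif_path, out_dir)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    writePDB(str(out_path), atoms)
    return out_path


def convert_directory(cif_dir, out_dir):
    """Convert every .cif file in cif_dir; collect failures instead of stopping."""
    converted, failed = [], []
    for cif_path in sorted(Path(cif_dir).glob("*.cif")):
        try:
            converted.append(convert_cif_to_pdb(cif_path, out_dir))
        except Exception as err:  # keep going so one bad file does not abort the batch
            failed.append((cif_path, err))
    return converted, failed
```

Calling `convert_directory("cif_files", "pdb_files")` (both directory names are placeholders) would return the list of written PDB paths and a list of `(path, error)` pairs for files that could not be converted. The resulting PDB files could then be loaded with the PDB-based approach mentioned above.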

I am unfamiliar with OpenProtein, unfortunately, so I'm not sure how to answer your question.

@jonathanking
Owner

First of all, I want to thank you so much for your interest in using SidechainNet! I hope I can help you out here.

Questions:

  1. Is it better to develop two models, one for class 1 and another for class 2, or a single model for both classes? (That is, train on 800 and 500 separately, or train on all 1300 with proper splits in the validation subset.)

I think this is a great research question that I do not have the answer for myself. My personal hunch would be to use a single model.
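
If the single-model route is taken, the "proper splits in validation subset" mentioned in the question could be approximated with a per-class (stratified) split, so that both subclasses are represented in validation in the same proportion as in training. The sketch below is only an illustration: the function name, the placeholder IDs, and the 10% validation fraction are assumptions, and it does not account for sequence similarity between training and validation chains, which may also matter for this kind of data.

```python
import random


def stratified_split(ids_by_class, val_fraction=0.1, seed=0):
    """Split each class separately so both appear in train and validation."""
    rng = random.Random(seed)
    train, val = [], []
    for label, ids in ids_by_class.items():
        shuffled = list(ids)
        rng.shuffle(shuffled)
        # Hold out at least one chain per class for validation.
        n_val = max(1, round(len(shuffled) * val_fraction))
        val.extend((pid, label) for pid in shuffled[:n_val])
        train.extend((pid, label) for pid in shuffled[n_val:])
    return train, val


# Placeholder IDs standing in for the ~800 class 1 and ~500 class 2 chains.
ids_by_class = {
    "class1": [f"c1_{i}" for i in range(800)],
    "class2": [f"c2_{i}" for i in range(500)],
}
train, val = stratified_split(ids_by_class, val_fraction=0.1)
```

With these placeholder counts, the validation set holds 80 class 1 and 50 class 2 chains, and the remaining 1170 chains go to training.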

  2. How should I construct a dataset containing both the chains from RCSB and the existing ProteinNet records (~300)? Is it right to load ProteinNet with the existing IDs (~300) and add the missing chains via the provided commands? Or is it better to load a single ID from ProteinNet and the other 1299 from RCSB?

I'm sorry, I am not sure what you are suggesting. Could you please clarify?
A) Are you trying to load proteinnet directly, or use SidechainNet?
B) What do you mean by "load one single ID from ProteinNet"?

If I understand you correctly, I think you are just trying to load these proteins into the SidechainNet format so you can train a model. In that case, I would simply use the custom SidechainNet constructors that are demonstrated in the Colab notebook.

  3. Which model is best suited to such structures (~300 aa) with about 1300 items in the dataset? (The real-world data to predict with the model is about 35k sequences.)

Unfortunately, exactly what model to use is again really hard to suggest. New papers are being released every week or so.

  4. Is there a Discord or any other place to chat about general SidechainNet questions?

No, there is no Discord for SidechainNet. Feel free to open discussions here, though!

@alpha-omega-labs
Author

alpha-omega-labs commented Dec 5, 2022

I am trying two models first :)
Question 2 was answered in another issue by your explanation of TBM (and other prefixes), thank you!
