Sidechainnet for CASP 13 to CASP 15 #57

Open
harshagrawal13 opened this issue Mar 7, 2023 · 6 comments
Labels
enhancement New feature or request

Comments

@harshagrawal13

Hi!
I am trying to do masked modelling on sequence and structure data using your curated dataset. I was wondering whether you could add the data for CASP 13 to CASP 15, or share how I could do that on my own.

Kind regards,
Harsh

@jonathanking
Owner

Hi Harsh,

Thanks for your interest! This is something that I would love to do (and I'm sure other users would be interested in), but it's unfortunately delayed and I don't have info on when I can add this. I'm working on adding slightly different functionality to SidechainNet at the moment.

Why? The trouble is that SidechainNet directly extends ProteinNet (and thereby uses ProteinNet's pretty sophisticated protein sequence clustering and filtering methods). Since ProteinNet does not, to my knowledge, support CASPs newer than CASP 12 (specifically the clustering info), I am prevented from adding later CASP datasets to SidechainNet for now. I must either develop the code to split the training data in the same way as AlQuraishi et al. have done, ask the authors for access to that code, or hope that the authors would be willing to generate the same kind of dataset splits for CASPs > 12 and share them.

The good news is that you can manually specify proteins for a custom SidechainNet dataset. See Section 5 of the Colab Walkthrough linked to in the README. You'd just need to define a list of train, validation, and test set proteins using the SidechainNet naming scheme, and those protein chains will be acquired and parsed into SidechainNet's data structures. For the CASP test set proteins, however, you would need to identify the RCSB PDB IDs that they correspond to, so that SidechainNet can download them correctly from the RCSB PDB.

Please let me know if you have any questions or concerns, and I'd be happy to help as much as I can.

Best,
Jonathan

@jonathanking added the enhancement (New feature or request) label Mar 7, 2023
@harshagrawal13
Author

Hey Jonathan,
Thanks for your swift reply. I really appreciate all the effort you've put into sidechainnet. It's been incredibly useful. As you mentioned, I was trying to use the create_custom function, but I'm unsure how exactly to proceed. I simply require all ~210,000 PDB entries in the SCN format, without any test or validation splits. I found this endpoint to query all the PDB IDs: https://data.rcsb.org/rest/v1/holdings/current/entry_ids. I also understood that I need to format these IDs in the ProteinNet format, but I'm unsure where to query the <chain/model_number> and <chain_id>, so I was setting all of them to 1 and A by default. When I passed a list of all PDB IDs formatted like this to the create_custom function, it threw an error: need at least one array to concatenate. I'm attaching a screenshot. Kindly let me know if I'm doing something wrong or how I should proceed.

[Image: Screenshot 2023-03-08 at 7 22 31 PM]

@jonathanking
Owner

jonathanking commented Mar 8, 2023

I'm really glad it has been helpful to you! Let's see, let me try to break this down a bit.

1

To begin (apologies if you already know this), you should be aware that SidechainNet, like many other models and datasets such as ProteinNet or even AlphaFold, does not treat proteins as multi-chain entities but instead operates on each protein chain independently. So, in SidechainNet, we use a naming scheme that includes not only the 4-character RCSB PDB ID, but also a "model number" (1 is usually appropriate unless you have a reason to use something else) and the very important chain ID.

What you're effectively doing is trying to download model 1 and chain A from all of those proteins. Model 1 probably exists for all of them, as does chain A, but neither is guaranteed.
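
To make the naming scheme concrete, here is a minimal sketch of how such per-chain identifiers could be assembled. It assumes the `<PDB ID>_<model number>_<chain ID>` layout with underscore separators and an upper-cased PDB ID, which is my understanding of the ProteinNet convention for training entries; `make_scn_id` and `ids_for_entry` are hypothetical helpers, not part of SidechainNet.

```python
from typing import Iterable, List


def make_scn_id(pdb_id: str, chain_id: str, model_number: int = 1) -> str:
    """Build one per-chain identifier, e.g. '1A9U_1_A'.

    Assumption: IDs look like <PDB ID>_<model number>_<chain ID>.
    Chain IDs are case-sensitive, so they are left untouched.
    """
    pdb_id = pdb_id.strip().upper()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError(f"Not a 4-character PDB ID: {pdb_id!r}")
    if not chain_id or "_" in chain_id or "#" in chain_id:
        raise ValueError(f"Unusable chain ID: {chain_id!r}")
    if model_number < 1:
        raise ValueError("Model numbers start at 1")
    return f"{pdb_id}_{model_number}_{chain_id}"


def ids_for_entry(pdb_id: str, chain_ids: Iterable[str],
                  model_number: int = 1) -> List[str]:
    """One identifier per chain that is known to exist in the entry."""
    return [make_scn_id(pdb_id, chain, model_number) for chain in chain_ids]
```

The point of `ids_for_entry` is that the chain list comes from the entry itself rather than from a blanket default of chain A.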

2

I'm not positive, but I think your code is not running on the Colab notebook because some of the IDs you've provided are not valid. To me it looks like your code doesn't bother downloading sidechainnet data for any of the items you requested (it says 0it). I tried running the Colab notebook as it is written and it works there (see below):

Downloading pre-parsed ProteinNet data (~3.5 GB compressed).
Downloading file chunks (estimated): 57257chunk [02:03, 463.84chunk/s]                        
Re-initializing validation set splits ([10, 90]).
Loading complete ProteinNet data (100% thinning) from /usr/local/lib/python3.9/dist-packages/sidechainnet/resources/proteinnet_parsed.
Raw ProteinNet files already preprocessed (/usr/local/lib/python3.9/dist-packages/sidechainnet/resources/proteinnet_parsed/training_100.pkl).
Preparing to download requested proteins via their ProteinNet IDs.
Downloading SidechainNet specific data from RSCB PDB.
141 IDs OK for parallel downloading.
  0%|          | 0/141 [00:00<?, ?it/s]DEBUG:.prody:Connecting wwPDB FTP server RCSB PDB (USA).
...
100%|██████████| 147/147 [00:09<00:00, 15.27it/s]
Finished unifying sidechain information with ProteinNet data.
0 IDs failed to combine successfully.
147 included in CASP User-specified (User-specified% thinning).
User-specified SidechainNet written to ./sidechainnet_data/custom01.pkl.
To load the data in a different format, use sidechainnet.load with the desired
options and set 'local_scn_path=./sidechainnet_data/custom01.pkl'.

If you want me to look closer at your error, can you please expand the error traceback where it says "3 Frames"?

3

I think I understand what you want, but SidechainNet doesn't have all the tools to get you there at the moment. If you can come up with a way to properly generate all of the SidechainNet-formatted IDs that you need, then my code should be able to handle that. SidechainNet marks proteins as belonging to the validation or test sets through its naming convention (e.g. 10# and FM# mean validation set number 10 and the Free Modeling test set, respectively), so if you're not planning to have SidechainNet construct validation or test sets, then none of your identifiers should contain #.
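
One possible way to generate those IDs is to ask the RCSB Data API which protein chains each entry actually contains. The sketch below is only an illustration under several assumptions: that the `core/entry` and `core/polymer_entity` endpoints return the fields named in the comments, that author chain IDs (`auth_asym_ids`) are the ones SidechainNet expects, and that model 1 is wanted. All function names are hypothetical, and nothing here is part of SidechainNet.

```python
import json
import urllib.request
from typing import List

ENTRY_URL = "https://data.rcsb.org/rest/v1/core/entry/{}"
ENTITY_URL = "https://data.rcsb.org/rest/v1/core/polymer_entity/{}/{}"


def fetch_json(url: str, timeout: float = 30.0) -> dict:
    """Download and decode one JSON document."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def protein_chain_ids(entity: dict) -> List[str]:
    """Author chain IDs of a polymer entity, or [] if it is not a protein."""
    poly_type = entity.get("entity_poly", {}).get("rcsb_entity_polymer_type")
    if poly_type != "Protein":
        return []
    idents = entity.get("rcsb_polymer_entity_container_identifiers", {})
    return list(idents.get("auth_asym_ids", []))


def scn_ids_for_entry(pdb_id: str, model_number: int = 1) -> List[str]:
    """All '<PDB ID>_<model>_<chain>' IDs for the protein chains of one entry."""
    entry = fetch_json(ENTRY_URL.format(pdb_id))
    entity_ids = entry.get("rcsb_entry_container_identifiers", {}).get(
        "polymer_entity_ids", [])
    ids = []
    for entity_id in entity_ids:
        entity = fetch_json(ENTITY_URL.format(pdb_id, entity_id))
        for chain in protein_chain_ids(entity):
            ids.append(f"{pdb_id.upper()}_{model_number}_{chain}")
    return ids


def without_split_ids(ids: List[str]) -> List[str]:
    """Drop identifiers carrying a validation/test marker such as '10#' or 'FM#'."""
    return [i for i in ids if "#" not in i]
```

Calling `scn_ids_for_entry` once per entry means at least two requests per PDB ID, so for the full archive some batching or caching would probably be needed.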

There is also functionality, not yet fully tested, for loading a protein into an SCNProtein directly from a PDB file. However, this doesn't work for proteins with gaps in their sequences, and the PDB file must contain only a single chain.
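
Those two restrictions can be checked before attempting a load. The following is a rough sketch that reads the fixed-column ATOM records of a PDB file (chain ID in column 22, residue number in columns 23-26) and reports which chains are present and whether residue numbering jumps. Treating a numbering jump as a sequence gap is my own heuristic and may not match what SidechainNet considers a gap; `summarize_pdb` and `looks_loadable` are hypothetical names.

```python
from typing import Dict, Iterable, List, Tuple


def summarize_pdb(lines: Iterable[str]) -> Tuple[List[str], bool]:
    """Return (chain IDs in order of appearance, whether numbering jumps).

    Only ATOM records of the first model are inspected.
    """
    chains: List[str] = []
    last_resseq: Dict[str, int] = {}
    has_gap = False
    for line in lines:
        if line.startswith("ENDMDL"):
            break  # stop after the first model
        if not line.startswith("ATOM") or len(line) < 26:
            continue
        chain = line[21]            # column 22: chain identifier
        resseq = int(line[22:26])   # columns 23-26: residue sequence number
        if chain not in chains:
            chains.append(chain)
        previous = last_resseq.get(chain)
        if previous is not None and resseq - previous > 1:
            has_gap = True          # heuristic: skipped residue numbers
        last_resseq[chain] = resseq
    return chains, has_gap


def looks_loadable(lines: Iterable[str]) -> bool:
    """True if the file has exactly one chain and no numbering jump."""
    chains, has_gap = summarize_pdb(lines)
    return len(chains) == 1 and not has_gap
```

With a file on disk this could be used as `looks_loadable(open(path))`, where `path` is whatever PDB file you intend to load.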

Please let me know if I can help any more!

@harshagrawal13
Author

Hey, thanks for your reply. Here's a copy of the Colab notebook: https://colab.research.google.com/drive/1X-Z7qcDUyQxIXnBYyWd042BcQr3UsF-z?usp=sharing. Kindly let me know if you can find the issue or suggest how it can be fixed :)

@jonathanking
Owner

I get the same error when running your notebook. I think it's because of the reasons I mentioned above (improper SidechainNet IDs). Please let me know if I can clarify further.

@harshagrawal13
Author

Gotcha. Thanks. I'll try to fetch the correct IDs and post if I encounter any other issues.


2 participants