
5.1.1 Creating a list of ProteinNet IDs - error #54

Open

alpha-omega-labs opened this issue Nov 26, 2022 · 5 comments

@alpha-omega-labs
Hello,
There is an issue with the 5.1.1 instructions.
First, they contain an actual error: the prose refers to "the testing set from CASP11," but the code example reads:

```python
test_ids = scn.get_proteinnet_ids(casp_version=12, split="test")
```
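For reference, here are the two inconsistent variants side by side (a sketch based on the docs' usage of `get_proteinnet_ids`; the `casp_version=11` call is what the prose implies, not what the docs actually show):

```python
import sidechainnet as scn

# What the prose describes ("the testing set from CASP11"):
test_ids_casp11 = scn.get_proteinnet_ids(casp_version=11, split="test")

# What the accompanying code example actually requests:
test_ids_casp12 = scn.get_proteinnet_ids(casp_version=12, split="test")
```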

Also, following the instructions, after running `d = scn.create_custom` no custom/additional proteins are included in the train dataset.

Is there any up-to-date instruction on including custom PDB IDs in the train/validation/test sets, or on constructing new sets from scratch using only new PDB IDs?
Thank you

@jonathanking (Owner)

Thank you for your interest and your patience as I try to address your concerns.

To begin, can you please provide the code you are trying to run?

Also, have you seen my example on creating a custom dataset in the Google Colab notebook linked in the README?

@alpha-omega-labs (Author) commented Dec 4, 2022

Hello,
First of all, thank you very much for sidechainnet!

Here are the issues (not critical):

1. Custom dataset construction: "test" sets are not working.

   a. Working:

   ```
   training_ids = [list of hundreds of entries]
   valid32_ids = [list with 8 entries prefixed with 32#]
   valid96_ids = [list with 8 entries prefixed with 96#]

   d = scn.create_custom(pnids=training_ids + valid32_ids + valid96_ids,
   ```

   This works fine.

   b. Not working: nothing changed from (a.) except that the valid32_ids prefix was changed to TBM# (and the other "test" prefixes suggested by the Colab notebook) in all entries:

   ```
   training_ids = [list]
   valid32_ids = [list with "TBM#" instead of "32#"]
   valid96_ids = [list with 96#]
   ```

   Error:

   ```
   Traceback (most recent call last):
     File "test_test.py", line 35, in <module>
       d = scn.create_custom(pnids=training_ids + valid32_ids + valid96_ids,
     File "/home/groot/.local/lib/python3.8/site-packages/sidechainnet/create.py", line 354, in create_custom
       sc_only_data, sc_filename = download_sidechain_data(
     File "/home/groot/.local/lib/python3.8/site-packages/sidechainnet/utils/download.py", line 130, in download_sidechain_data
       sc_data, pnids_errors = get_sidechain_data(new_pnids, limit)
     File "/home/groot/.local/lib/python3.8/site-packages/sidechainnet/utils/download.py", line 189, in get_sidechain_data
       list(
     File "/home/groot/.local/lib/python3.8/site-packages/tqdm/std.py", line 1195, in __iter__
       for obj in iterable:
     File "/usr/lib/python3.8/multiprocessing/pool.py", line 868, in next
       raise value
   OSError: /home/groot/.local/lib/python3.8/site-packages/sidechainnet/resources/proteinnet_parsed/targets/7PBC_1_A.pdb is not a valid filename or a valid PDB identifier.
   ```

   This error occurs every time, for all "test" dataset prefixes and any PDB ID.

   It can be easily worked around by assigning the test proteins to a validation set that is unused during training.

2. There is also probably a bug in batch formation for the "train" set. For some reason,

   ```python
   d = scn.load(local_scn_path="sidechainnet_data/800p.pkl",
                with_pytorch="dataloaders",
                batch_size=32,
                dynamic_batching=False)

   training_d = d['train']
   train_batch = next(iter(training_d))
   print(train_batch.pids)
   ```

   returns the same first PDB ID of the train set, repeated batch_size times.

   The same code with any "valid-n" dataset returns actual, different PDB IDs.

   Question: is there a way to query batch 1, 2, 3, etc., not just the first, to verify the data is different every time?

3. Could you please clarify which is the "better" way to query chains: by PDB ID or by ProteinNet ID? Will the SidechainNet entries constructed by the scripts eventually be identical either way? For example, 1TZH_d1tzhl1 of ProteinNet matches 1TZH_2_A of RCSB.

4. Is there a possibility to construct SidechainNet entries from custom pdb/cif files without a PDB ID, e.g. an AlphaFold high-confidence prediction, or a pdb "cleaned" with pdb-tools?

5. Can SidechainNet produce text-format entries "per pdb" as in ProteinNet? And can the SidechainNet scripts (maybe at some step of their work) produce valid ProteinNet entries in text format?

6. Could you please give a hint on adjusting model example 4.2 from the Colab notebook, "Training with Sequences, PSSMs, Secondary Structures, and Information Content", for the following task: the input is 800 chains, ~350 aa each, with very similar structures. What are the best specs to try for:

   a. train/validation split ratio
   b. number of layers
   c. size (and how to determine the best size in that case)
   d. learning rate (or rates at various steps)
   e. other specs

7. Is there a simple tutorial or script on how to apply a trained model to a FASTA sequence to "predict" the tertiary structure and produce a .pdb file?

Thank you!

@alpha-omega-labs (Author) commented Dec 4, 2022

Train batches contain a single PDB ID every time. Valid-n batches contain different PDB IDs, but do not shuffle between epochs.

PDB IDs returned per epoch:

1. The "train" set returns the same single PDB ID, batch_size times, every epoch (e.g. 1TZH_2_A 8 times for batch_size 8); there is no shuffling from epoch to epoch, and it is the same PDB ID every time.
2. The "valid-n" sets return batch_size various PDB IDs each epoch, but the same ones every epoch.

@jonathanking (Owner)

Thanks for following up! Okay, there are a lot of things here to address. I'm going to go through your messages and insert my comments inline.

> Hello, First of all, thank you very much for sidechainnet!
>
> Here are the issues (not critical):
>
> 1. Custom dataset construction: "test" sets are not working.
>
>    a. Working:
>
>    ```
>    training_ids = [list of hundreds of entries]
>    valid32_ids = [list with 8 entries prefixed with 32#]
>    valid96_ids = [list with 8 entries prefixed with 96#]
>
>    d = scn.create_custom(pnids=training_ids + valid32_ids + valid96_ids,
>    ```
>
>    This works fine.
>
>    b. Not working: nothing changed from (a.) except that the valid32_ids prefix was changed to TBM# (and the other "test" prefixes suggested by the Colab notebook) in all entries:
>
>    ```
>    training_ids = [list]
>    valid32_ids = [list with "TBM#" instead of "32#"]
>    valid96_ids = [list with 96#]
>    ```
>
>    Error:
>
>    ```
>    Traceback (most recent call last):
>      ...
>    OSError: /home/groot/.local/lib/python3.8/site-packages/sidechainnet/resources/proteinnet_parsed/targets/7PBC_1_A.pdb is not a valid filename or a valid PDB identifier.
>    ```
>
>    This error occurs every time, for all "test" dataset prefixes and any PDB ID.
>
>    It can be easily worked around by assigning the test proteins to a validation set that is unused during training.

I think what is happening here is that by providing a PDB ID with the prefix "TBM", you are signifying that the protein is a test set protein. These are targets used in the CASP competition, and I currently have no programmatic way to download them via SidechainNet. Simply put, by using this prefix you are asking to make a SidechainNet dataset containing a protein that SidechainNet does not know how to add.
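Following the workaround suggested above, one option is to reassign would-be test proteins to an otherwise unused validation split (a hypothetical sketch; the `40#` prefix, the ID lists, and the `output_filename` argument are illustrative, building on the `create_custom` usage shown earlier in this thread):

```python
import sidechainnet as scn

training_ids = ["1ABC_1_A"]  # placeholder training IDs

# IDs given a CASP test-set prefix, which SidechainNet cannot build:
test_like_ids = ["TBM#7PBC_1_A"]  # placeholder ID from the traceback above

# Workaround: move them to a validation split that is unused during training,
# e.g. "40#", which SidechainNet does know how to construct.
reassigned_ids = [pnid.replace("TBM#", "40#") for pnid in test_like_ids]

d = scn.create_custom(pnids=training_ids + reassigned_ids,
                      output_filename="custom_dataset.pkl")
```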


> 2. There is also probably a bug in batch formation for the "train" set. For some reason,
>
>    ```python
>    d = scn.load(local_scn_path="sidechainnet_data/800p.pkl",
>                 with_pytorch="dataloaders",
>                 batch_size=32,
>                 dynamic_batching=False)
>
>    training_d = d['train']
>    train_batch = next(iter(training_d))
>    print(train_batch.pids)
>    ```
>
>    returns the same first PDB ID of the train set, repeated batch_size times.
>
>    The same code with any "valid-n" dataset returns actual, different PDB IDs.

Are you using a custom dataset? If so, how many proteins are in it?


> Question: is there a way to query batch 1, 2, 3, etc., not just the first, to verify the data is different every time?

Currently retrieving the i'th batch is not supported.
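Since the returned loaders are standard PyTorch DataLoaders, the i-th batch can still be pulled out manually with itertools (a diagnostic sketch; it assumes only that `d['train']` is an ordinary iterable of batches):

```python
import itertools

def get_ith_batch(dataloader, i):
    """Advance one iterator pass and return the i-th batch (0-indexed)."""
    return next(itertools.islice(iter(dataloader), i, None))

# e.g. inspect the third batch to verify the data differs batch to batch:
third_batch = get_ith_batch(d['train'], 2)
print(third_batch.pids)
```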

> 3. Could you please clarify which is the "better" way to query chains: by PDB ID or by ProteinNet ID? Will the SidechainNet entries constructed by the scripts eventually be identical either way? For example, 1TZH_d1tzhl1 of ProteinNet matches 1TZH_2_A of RCSB.

First, please note that 1TZH_d1tzhl1 signifies this is an ASTRAL ID (this is a subset of the corresponding RCSB entry, not the entire thing). The ProteinNet repository discusses this in greater detail.

I'm not exactly sure what you would like to know. PDB IDs and ProteinNet/SidechainNet IDs are simply two different ways to name protein entries. Can you please clarify your question?

> 4. Is there a possibility to construct SidechainNet entries from custom pdb/cif files without a PDB ID, e.g. an AlphaFold high-confidence prediction, or a pdb "cleaned" with pdb-tools?

Yes, though this is not implemented in the main branch. Please see this issue for an example function you can use for the time being (PDB files only). Note that if the structure has gaps, you will need to carefully handle the protein for several nuanced reasons.

> 5. Can SidechainNet produce text-format entries "per pdb" as in ProteinNet? And can the SidechainNet scripts (maybe at some step of their work) produce valid ProteinNet entries in text format?

SidechainNet does not support exporting proteins in the ProteinNet text format. It does, however, support exporting SCNProtein objects to pdb files (see SCNProtein.to_pdb).
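For instance, exporting every entry of a loaded dataset could look like the following (a sketch; it assumes iterating the dataset yields SCNProtein objects and that SCNProtein.to_pdb accepts an output path, which may differ across SidechainNet versions):

```python
import sidechainnet as scn

d = scn.load(casp_version=12)  # dataset of SCNProtein objects
for protein in d:
    # Write each protein to its own PDB file, named by its SidechainNet ID.
    protein.to_pdb(f"{protein.id}.pdb")
```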

> 6. Could you please give a hint on adjusting model example 4.2 from the Colab notebook, "Training with Sequences, PSSMs, Secondary Structures, and Information Content", for the following task: the input is 800 chains, ~350 aa each, with very similar structures. What are the best specs to try for:
>
>    a. train/validation split ratio
>    b. number of layers
>    c. size (and how to determine the best size in that case)
>    d. learning rate (or rates at various steps)
>    e. other specs

I love this question, as it is very closely related to my current research :) However, I simply don't know the answer to any of these components. I wish I did!

> 7. Is there a simple tutorial or script on how to apply a trained model to a FASTA sequence to "predict" the tertiary structure and produce a .pdb file?

At the moment, no, I'm sorry. The code is currently written to start from a SidechainNet dataset object and then make predictions from there. You can browse the examples directory for some model examples if you would like to know more than what the Colab notebook shows. I'm sorry that this is not more helpful, but at the moment SidechainNet has more functionality regarding the data handling and less functionality regarding specific models/training setups.

Something like this would be the idea (assuming you've parsed the FASTA files into strings, and that you have a trained model that takes a SCNProtein as input and produces a protein as output):

```python
def make_scn_from_seq(seq, name):
    return SCNProtein(seq=seq, id=name)

def predict(model, protein):
    return model(protein)

my_proteins = [make_scn_from_seq(s, name) for (s, name) in my_sequences]
for p in my_proteins:
    pred = predict(model, p)
    pred.to_pdb(f"{p.id}.pdb")
```

> Train batches contain a single PDB ID every time. Valid-n batches contain different PDB IDs, but do not shuffle between epochs.
>
> PDB IDs returned per epoch:
>
> 1. The "train" set returns the same single PDB ID, batch_size times, every epoch (e.g. 1TZH_2_A 8 times for batch_size 8); there is no shuffling from epoch to epoch.
> 2. The "valid-n" sets return batch_size various PDB IDs each epoch, but the same ones every epoch.

I think these issues may be related to your item 2. above. Can you share more of your code? It is possible that shuffling may not be working as expected.
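A quick check for whether shuffling is active between epochs (a diagnostic sketch; it assumes only that each batch exposes a .pids attribute, as used above):

```python
def first_batch_pids(dataloader):
    """Return the protein IDs of the first batch of a fresh iteration."""
    return list(next(iter(dataloader)).pids)

# With shuffling enabled, two fresh passes should usually start differently:
print("Shuffled:", first_batch_pids(d['train']) != first_batch_pids(d['train']))
```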

@alpha-omega-labs (Author) commented Dec 5, 2022

Thank you very much for the detailed answers!

The main issue now is batch shuffling, so here is the setup:

1. There are 800 proteins of approximately the same length (length probably matters for batching).
2. 768 of them are in a 100% custom "train" set, and a copy of the same proteins is in a set named "valid-10". Both are exactly the same but produce different batching (checked with `print("Protein IDs\n ", batch.pids)`).
3. If the "train" set is used for training (with 768 proteins), then no matter what the batch size is, every batch is always one single PDB ID repeated batch_size times.
4. If the "valid-10" set is used for training (with 768 proteins), each batch contains different proteins, but every batch has the same set of proteins (and the count sometimes seems random).

So although "train" and "valid-10" are identical, they produce different batches.

```python
for epoch in range(1000):
    # print(f"Epoch {epoch}")
    # progress_bar = tqdm(total=len(d['train']), smoothing=0)
    i = 5
    for batch in d['valid-10']:
        i -= 1
        if i <= 0:
            break
        print(f"Model Input = {tuple(batch.seq_evo_sec.shape)}; Total residues = "
              f"{batch.seq_evo_sec.shape[0] * batch.seq_evo_sec.shape[1]}.")
        # Prepare variables and create a mask of missing angles (padded with zeros).
        # Note the mask is repeated in the last dimension to match the sin/cos representation.
        seq_evo_sec = batch.seq_evo_sec.to(device)
        true_angles_sincos = scn.structure.trig_transform(batch.angs).to(device)
        mask = (batch.angs.ne(0)).unsqueeze(-1).repeat(1, 1, 1, 2)

        # Make predictions and optimize
        pred_angles_sincos = pssm_model(seq_evo_sec)
        loss = mse_loss(pred_angles_sincos[mask], true_angles_sincos[mask])
        loss.backward()
        torch.nn.utils.clip_grad_norm_(pssm_model.parameters(), 2)
        optimizer.step()

        # Housekeeping
        batch_losses.append(float(loss))
        # progress_bar.update(1)
        # progress_bar.set_description(f"\rRMSE Loss = {np.sqrt(float(loss)):.4f}")

        valid_d = d['valid-10']
        valid10_batch = next(iter(valid_d))
        print(valid10_batch._fields)
        print(valid10_batch.pids)
        print("Protein IDs\n   ", batch.pids)
```

Can the SidechainNet batch loader based on sequence length be disabled, especially if the input proteins are approximately the same size? How can a custom dataset be shuffled directly based on batch size, e.g., if the batch size is 32, how can the batches cover entries 1-32, 33-64, 65-96, etc.?
What are the consequences of highly variable chain lengths within a batch vs. approximately equal lengths?

Should the code above show one batch per epoch, or just the first batch every time? There is probably a code error here, but in any case the "train" batches contain one single protein chain repeated n times.
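One likely culprit in the snippet above: `next(iter(valid_d))` constructs a brand-new iterator on every call, so it always returns the first batch. To walk through batch 1, 2, 3, etc., iterate a single pass instead (a minimal sketch, assuming only that the loader is a standard iterable with .pids on each batch):

```python
# Always the first batch -- a new iterator is created each time:
# valid10_batch = next(iter(valid_d))

# Instead, iterate one pass (one epoch) batch by batch:
for batch_idx, batch in enumerate(d['valid-10']):
    print(f"Batch {batch_idx}: protein IDs = {list(batch.pids)}")
```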
