
5.1.1 Creating a list of ProteinNet IDs - error #54

Open

alpha-omega-labs opened this issue Nov 26, 2022 · 5 comments

@alpha-omega-labs
Hello,
There is an issue with the 5.1.1 instructions.
First, they contain an actual error: the prose refers to "the testing set from CASP11," but the code example reads:

```python
test_ids = scn.get_proteinnet_ids(casp_version=12, split="test")
```
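For reference, here are the two inconsistent variants side by side (a sketch based on the docs' usage of `get_proteinnet_ids`; the `casp_version=11` call is what the prose implies, not what the docs actually show):

```python
import sidechainnet as scn

# What the prose describes ("the testing set from CASP11"):
test_ids_casp11 = scn.get_proteinnet_ids(casp_version=11, split="test")

# What the accompanying code example actually requests:
test_ids_casp12 = scn.get_proteinnet_ids(casp_version=12, split="test")
```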

Also, following the instructions, after running `d = scn.create_custom` no custom/additional proteins are included in the train dataset.

Is there any up-to-date instruction on including custom PDB IDs in the train/validation/test sets, or on constructing new sets from scratch using only new PDB IDs?
Thank you

@jonathanking (Owner)

Thank you for your interest and your patience as I try to address your concerns.

To begin, can you please provide the code you are trying to run?

Also, have you seen my example on creating a custom dataset in the Google Colab notebook linked in the README?

@alpha-omega-labs (Author) commented Dec 4, 2022

Hello,
First of all, thank you very much for sidechainnet!

Here are the issues (not critical):

1. Custom dataset construction: "test" sets are not working.

   a. Working:

   ```
   training_ids = [list of hundreds of entries]
   valid32_ids = [list with 8 entries prefixed with 32#]
   valid96_ids = [list with 8 entries prefixed with 96#]

   d = scn.create_custom(pnids=training_ids + valid32_ids + valid96_ids,
   ```

   This works fine.

   b. Not working: nothing changed from (a.) except that the valid32_ids prefix was changed to TBM# (and the other "test" prefixes suggested by the Colab notebook) in all entries:

   ```
   training_ids = [list]
   valid32_ids = [list with "TBM#" instead of "32#"]
   valid96_ids = [list with 96#]
   ```

   Error:

   ```
   Traceback (most recent call last):
     File "test_test.py", line 35, in <module>
       d = scn.create_custom(pnids=training_ids + valid32_ids + valid96_ids,
     File "/home/groot/.local/lib/python3.8/site-packages/sidechainnet/create.py", line 354, in create_custom
       sc_only_data, sc_filename = download_sidechain_data(
     File "/home/groot/.local/lib/python3.8/site-packages/sidechainnet/utils/download.py", line 130, in download_sidechain_data
       sc_data, pnids_errors = get_sidechain_data(new_pnids, limit)
     File "/home/groot/.local/lib/python3.8/site-packages/sidechainnet/utils/download.py", line 189, in get_sidechain_data
       list(
     File "/home/groot/.local/lib/python3.8/site-packages/tqdm/std.py", line 1195, in __iter__
       for obj in iterable:
     File "/usr/lib/python3.8/multiprocessing/pool.py", line 868, in next
       raise value
   OSError: /home/groot/.local/lib/python3.8/site-packages/sidechainnet/resources/proteinnet_parsed/targets/7PBC_1_A.pdb is not a valid filename or a valid PDB identifier.
   ```

   This error occurs every time, for all "test" dataset prefixes and any PDB ID.

   It can be easily worked around by assigning the test proteins to a validation set that is unused during training.

2. There is also probably a bug in batch formation for the "train" set. For some reason,

   ```python
   d = scn.load(local_scn_path="sidechainnet_data/800p.pkl",
                with_pytorch="dataloaders",
                batch_size=32,
                dynamic_batching=False)

   training_d = d['train']
   train_batch = next(iter(training_d))
   print(train_batch.pids)
   ```

   returns the same first PDB ID of the train set, repeated batch_size times.

   The same code with any "valid-n" dataset returns actual, different PDB IDs.

   Question: is there a way to query batch 1, 2, 3, etc., not just the first, to verify the data is different every time?

3. Could you please clarify which is the "better" way to query chains: by PDB ID or by ProteinNet ID? Will the SidechainNet entries constructed by the scripts eventually be identical either way? For example, 1TZH_d1tzhl1 of ProteinNet matches 1TZH_2_A of RCSB.

4. Is there a possibility to construct SidechainNet entries from custom pdb/cif files without a PDB ID, e.g. an AlphaFold high-confidence prediction, or a pdb "cleaned" with pdb-tools?

5. Can SidechainNet produce text-format entries "per pdb" as in ProteinNet? And can the SidechainNet scripts (maybe at some step of their work) produce valid ProteinNet entries in text format?

6. Could you please give a hint on adjusting model example 4.2 from the Colab notebook, "Training with Sequences, PSSMs, Secondary Structures, and Information Content", for the following task: the input is 800 chains, ~350 aa each, with very similar structures. What are the best specs to try for:

   a. train/validation split ratio
   b. number of layers
   c. size (and how to determine the best size in that case)
   d. learning rate (or rates at various steps)
   e. other specs

7. Is there a simple tutorial or script on how to apply a trained model to a FASTA sequence to "predict" the tertiary structure and produce a .pdb file?

Thank you!

@alpha-omega-labs (Author) commented Dec 4, 2022

Train batches contain a single PDB ID every time. Valid-n batches contain different PDB IDs, but do not shuffle between epochs.

PDB IDs returned per epoch:

1. The "train" set returns the same single PDB ID, batch_size times, every epoch (e.g. 1TZH_2_A 8 times for batch_size 8); there is no shuffling from epoch to epoch, and it is the same PDB ID every time.
2. The "valid-n" sets return batch_size various PDB IDs each epoch, but the same ones every epoch.

@jonathanking (Owner)

Thanks for following up! Okay, there are a lot of things here to address. I'm going to go through your messages and insert my comments inline.

> Hello, First of all, thank you very much for sidechainnet!
>
> Here are the issues (not critical):
>
> 1. Custom dataset construction: "test" sets are not working.
>
>    a. Working:
>
>    ```
>    training_ids = [list of hundreds of entries]
>    valid32_ids = [list with 8 entries prefixed with 32#]
>    valid96_ids = [list with 8 entries prefixed with 96#]
>
>    d = scn.create_custom(pnids=training_ids + valid32_ids + valid96_ids,
>    ```
>
>    This works fine.
>
>    b. Not working: nothing changed from (a.) except that the valid32_ids prefix was changed to TBM# (and the other "test" prefixes suggested by the Colab notebook) in all entries:
>
>    ```
>    training_ids = [list]
>    valid32_ids = [list with "TBM#" instead of "32#"]
>    valid96_ids = [list with 96#]
>    ```
>
>    Error:
>
>    ```
>    Traceback (most recent call last):
>      ...
>    OSError: /home/groot/.local/lib/python3.8/site-packages/sidechainnet/resources/proteinnet_parsed/targets/7PBC_1_A.pdb is not a valid filename or a valid PDB identifier.
>    ```
>
>    This error occurs every time, for all "test" dataset prefixes and any PDB ID.
>
>    It can be easily worked around by assigning the test proteins to a validation set that is unused during training.

I think what is happening here is that by providing a PDB ID with the prefix "TBM", you are signifying that the protein is a test set protein. These are targets used in the CASP competition, and I currently have no programmatic way to download them via SidechainNet. Simply put, by using this prefix you are asking to make a SidechainNet dataset containing a protein that SidechainNet does not know how to add.
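Following the workaround suggested above, one option is to reassign would-be test proteins to an otherwise unused validation split (a hypothetical sketch; the `40#` prefix, the ID lists, and the `output_filename` argument are illustrative, building on the `create_custom` usage shown earlier in this thread):

```python
import sidechainnet as scn

training_ids = ["1ABC_1_A"]  # placeholder training IDs

# IDs given a CASP test-set prefix, which SidechainNet cannot build:
test_like_ids = ["TBM#7PBC_1_A"]  # placeholder ID from the traceback above

# Workaround: move them to a validation split that is unused during training,
# e.g. "40#", which SidechainNet does know how to construct.
reassigned_ids = [pnid.replace("TBM#", "40#") for pnid in test_like_ids]

d = scn.create_custom(pnids=training_ids + reassigned_ids,
                      output_filename="custom_dataset.pkl")
```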


> 2. There is also probably a bug in batch formation for the "train" set. For some reason,
>
>    ```python
>    d = scn.load(local_scn_path="sidechainnet_data/800p.pkl",
>                 with_pytorch="dataloaders",
>                 batch_size=32,
>                 dynamic_batching=False)
>
>    training_d = d['train']
>    train_batch = next(iter(training_d))
>    print(train_batch.pids)
>    ```
>
>    returns the same first PDB ID of the train set, repeated batch_size times.
>
>    The same code with any "valid-n" dataset returns actual, different PDB IDs.

Are you using a custom dataset? If so, how many proteins are in it?


> Question: is there a way to query batch 1, 2, 3, etc., not just the first, to verify the data is different every time?

Currently retrieving the i'th batch is not supported.
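Since the returned loaders are standard PyTorch DataLoaders, the i-th batch can still be pulled out manually with itertools (a diagnostic sketch; it assumes only that `d['train']` is an ordinary iterable of batches):

```python
import itertools

def get_ith_batch(dataloader, i):
    """Advance one iterator pass and return the i-th batch (0-indexed)."""
    return next(itertools.islice(iter(dataloader), i, None))

# e.g. inspect the third batch to verify the data differs batch to batch:
third_batch = get_ith_batch(d['train'], 2)
print(third_batch.pids)
```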

> 3. Could you please clarify which is the "better" way to query chains: by PDB ID or by ProteinNet ID? Will the SidechainNet entries constructed by the scripts eventually be identical either way? For example, 1TZH_d1tzhl1 of ProteinNet matches 1TZH_2_A of RCSB.

First, please note that 1TZH_d1tzhl1 signifies this is an ASTRAL ID (this is a subset of the corresponding RCSB entry, not the entire thing). The ProteinNet repository discusses this in greater detail.

I'm not exactly sure what you would like to know. PDB IDs and ProteinNet/SidechainNet IDs are simply two different ways to name protein entries. Can you please clarify your question?

> 4. Is there a possibility to construct SidechainNet entries from custom pdb/cif files without a PDB ID, e.g. an AlphaFold high-confidence prediction, or a pdb "cleaned" with pdb-tools?

Yes, though this is not implemented in the main branch. Please see this issue for an example function you can use for the time being (PDB files only). Note that if the structure has gaps, you will need to carefully handle the protein for several nuanced reasons.

> 5. Can SidechainNet produce text-format entries "per pdb" as in ProteinNet? And can the SidechainNet scripts (maybe at some step of their work) produce valid ProteinNet entries in text format?

SidechainNet does not support exporting proteins in the ProteinNet text format. It does, however, support exporting SCNProtein objects to pdb files (see SCNProtein.to_pdb).
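For instance, exporting every entry of a loaded dataset could look like the following (a sketch; it assumes iterating the dataset yields SCNProtein objects and that SCNProtein.to_pdb accepts an output path, which may differ across SidechainNet versions):

```python
import sidechainnet as scn

d = scn.load(casp_version=12)  # dataset of SCNProtein objects
for protein in d:
    # Write each protein to its own PDB file, named by its SidechainNet ID.
    protein.to_pdb(f"{protein.id}.pdb")
```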

> 6. Could you please give a hint on adjusting model example 4.2 from the Colab notebook, "Training with Sequences, PSSMs, Secondary Structures, and Information Content", for the following task: the input is 800 chains, ~350 aa each, with very similar structures. What are the best specs to try for:
>
>    a. train/validation split ratio
>    b. number of layers
>    c. size (and how to determine the best size in that case)
>    d. learning rate (or rates at various steps)
>    e. other specs

I love this question, as it is very closely related to my current research :) However, I simply don't know the answer to any of these components. I wish I did!

> 7. Is there a simple tutorial or script on how to apply a trained model to a FASTA sequence to "predict" the tertiary structure and produce a .pdb file?

At the moment, no, I'm sorry. The code is currently written to start from a SidechainNet dataset object and then make predictions from there. You can browse the examples directory for some model examples if you would like to know more than what the Colab notebook shows. I'm sorry that this is not more helpful, but at the moment SidechainNet has more functionality regarding the data handling and less functionality regarding specific models/training setups.

Something like this would be the idea (assuming you've parsed the FASTA files into strings, and that you have a trained model that takes a SCNProtein as input and produces a protein as output):

```python
def make_scn_from_seq(seq, name):
    return SCNProtein(seq=seq, id=name)

def predict(model, protein):
    return model(protein)

my_proteins = [make_scn_from_seq(s, name) for (s, name) in my_sequences]
for p in my_proteins:
    pred = predict(model, p)
    pred.to_pdb(f"{p.id}.pdb")
```

> Train batches contain a single PDB ID every time. Valid-n batches contain different PDB IDs, but do not shuffle between epochs.
>
> PDB IDs returned per epoch:
>
> 1. The "train" set returns the same single PDB ID, batch_size times, every epoch (e.g. 1TZH_2_A 8 times for batch_size 8); there is no shuffling from epoch to epoch.
> 2. The "valid-n" sets return batch_size various PDB IDs each epoch, but the same ones every epoch.

I think these issues may be related to your item 2. above. Can you share more of your code? It is possible that shuffling may not be working as expected.
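A quick check for whether shuffling is active between epochs (a diagnostic sketch; it assumes only that each batch exposes a .pids attribute, as used above):

```python
def first_batch_pids(dataloader):
    """Return the protein IDs of the first batch of a fresh iteration."""
    return list(next(iter(dataloader)).pids)

# With shuffling enabled, two fresh passes should usually start differently:
print("Shuffled:", first_batch_pids(d['train']) != first_batch_pids(d['train']))
```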

@alpha-omega-labs (Author) commented Dec 5, 2022

Thank you very much for the detailed answers!

The main issue now is batch shuffling, so here is the setup:

1. There are 800 proteins of approximately the same length (length probably matters for batching).
2. 768 of them are in a 100% custom "train" set, and a copy of the same proteins is in a set named "valid-10". Both are exactly the same but produce different batching (checked with `print("Protein IDs\n ", batch.pids)`).
3. If the "train" set is used for training (with 768 proteins), then no matter what the batch size is, every batch is always one single PDB ID repeated batch_size times.
4. If the "valid-10" set is used for training (with 768 proteins), each batch contains different proteins, but every batch has the same set of proteins (and the count sometimes seems random).

So although "train" and "valid-10" are identical, they produce different batches.

```python
for epoch in range(1000):
    # print(f"Epoch {epoch}")
    # progress_bar = tqdm(total=len(d['train']), smoothing=0)
    i = 5
    for batch in d['valid-10']:
        i -= 1
        if i <= 0:
            break
        print(f"Model Input = {tuple(batch.seq_evo_sec.shape)}; Total residues = "
              f"{batch.seq_evo_sec.shape[0] * batch.seq_evo_sec.shape[1]}.")
        # Prepare variables and create a mask of missing angles (padded with zeros).
        # Note the mask is repeated in the last dimension to match the sin/cos representation.
        seq_evo_sec = batch.seq_evo_sec.to(device)
        true_angles_sincos = scn.structure.trig_transform(batch.angs).to(device)
        mask = (batch.angs.ne(0)).unsqueeze(-1).repeat(1, 1, 1, 2)

        # Make predictions and optimize
        pred_angles_sincos = pssm_model(seq_evo_sec)
        loss = mse_loss(pred_angles_sincos[mask], true_angles_sincos[mask])
        loss.backward()
        torch.nn.utils.clip_grad_norm_(pssm_model.parameters(), 2)
        optimizer.step()

        # Housekeeping
        batch_losses.append(float(loss))
        # progress_bar.update(1)
        # progress_bar.set_description(f"\rRMSE Loss = {np.sqrt(float(loss)):.4f}")

        valid_d = d['valid-10']
        valid10_batch = next(iter(valid_d))
        print(valid10_batch._fields)
        print(valid10_batch.pids)
        print("Protein IDs\n   ", batch.pids)
```

Can the SidechainNet batch loader based on sequence length be disabled, especially if the input proteins are approximately the same size? How can a custom dataset be shuffled directly based on batch size, e.g., if the batch size is 32, how can the batches cover entries 1-32, 33-64, 65-96, etc.?
What are the consequences of highly variable chain lengths within a batch vs. approximately equal lengths?

Should the code above show one batch per epoch, or just the first batch every time? There is probably a code error here, but in any case the "train" batches contain one single protein chain repeated n times.
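One likely culprit in the snippet above: `next(iter(valid_d))` constructs a brand-new iterator on every call, so it always returns the first batch. To walk through batch 1, 2, 3, etc., iterate a single pass instead (a minimal sketch, assuming only that the loader is a standard iterable with .pids on each batch):

```python
# Always the first batch -- a new iterator is created each time:
# valid10_batch = next(iter(valid_d))

# Instead, iterate one pass (one epoch) batch by batch:
for batch_idx, batch in enumerate(d['valid-10']):
    print(f"Batch {batch_idx}: protein IDs = {list(batch.pids)}")
```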
