Model not training #19

Open
Thomas191 opened this issue Jul 17, 2019 · 5 comments

Comments

@Thomas191

Hello! I have been attempting to run this code for a couple of weeks but seem to have hit a dead end.

I am running the model on Ubuntu 18.04, with TensorFlow GPU installed (and verified with other code), CUDA 10.0, and cuDNN 7.6.1.

My end goal is to use CASP12 to predict the structure of around 1000 proteins.

At the moment I am using CASP10 (to save space) and trying to predict the structure of just one sequence to test the model.

Here is my folder structure:

WD/hmmer-3.2.1

WD/rgn/data_processing/
WD/rgn/model/

WD/proteinnet10

WD/RGN10/data/ProteinNet10Thinning90/testing
WD/RGN10/data/ProteinNet10Thinning90/training
WD/RGN10/data/ProteinNet10Thinning90/validation

WD/RGN10/runs/CASP10/ProteinNet10Thinning90/1
WD/RGN10/runs/CASP10/ProteinNet10Thinning90/2
...
WD/RGN10/runs/CASP10/ProteinNet10Thinning90/logs
WD/RGN10/runs/CASP10/ProteinNet10Thinning90/checkpoints
WD/RGN10/runs/CASP10/ProteinNet10Thinning90/configuration

WD/RGN10/logs

This is the last command I run:

rgn/model/protling.py RGN10/runs/CASP10/ProteinNet10Thinning90/configuration -d RGN10 -p -e weighted_testing

When the model runs, there don't appear to be any errors. However, the prediction is placed in folder 1 and not in the highest-numbered folder, as would be expected.

Following the comments in another issue, I have tried deleting all the numbered folders and just training the model using the following command:

rgn/model/protling.py RGN10/runs/CASP10/ProteinNet10Thinning90/configuration -d RGN10

This only creates folder 1, logs, and checkpoints folders.

Likewise for the following command:

rgn/model/protling.py RGN10/runs/CASP10/ProteinNet10Thinning90/configuration -d RGN10 -p -e weighted_testing

Once again, only folder 1 and the logs and checkpoints folders are created, and the prediction for our sequence is placed in folder 1.

We have looked at this prediction and converted it to a PDB file to view in PyMOL. However, the output is a single helical structure, completely different from what a folded protein would look like.

We would appreciate any suggestions you have about how to fix this issue.

@alquraishi
Contributor

Hi @Thomas191,

I just tried recreating your directory structure on my system and it worked fine. The predictions should definitely be placed in the highest-numbered folder. Otherwise you're making predictions from an untrained model, which will be junk. Training a model from scratch is also quite time-intensive.
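A quick way to check which run folder is the highest-numbered one, and therefore where predictions would be expected to land, is sketched below. This is only an illustration: the helper name `highest_numbered_run` is hypothetical and not part of RGN, and it assumes run folders are plain integer names sitting next to `logs` and `checkpoints`, as in the layout described above.

```python
from pathlib import Path


def highest_numbered_run(run_dir):
    """Return the numerically largest all-digit subdirectory of run_dir.

    Hypothetical helper (not part of RGN). Non-numeric folders such as
    'logs' or 'checkpoints' are ignored. Returns None if no numbered
    folder exists.
    """
    numbered = [
        p for p in Path(run_dir).iterdir()
        if p.is_dir() and p.name.isdecimal()
    ]
    # Compare as integers so that '10' sorts after '2'.
    return max(numbered, key=lambda p: int(p.name), default=None)
```

For example, `highest_numbered_run("RGN10/runs/CASP10/ProteinNet10Thinning90")` would return the path of folder `1` if that is the only numbered folder present.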

I noticed that you don't have a GPU assigned. Did you try an option like -g0?

@Thomas191
Author

Hi!
Adding -g0 worked like a charm. However, I now have the issue that it doesn't seem to work for FASTA files with more than one chain. Should it? Or is the model not suited to multiple chains?
This is the error I get when I try to fold a protein with more than one chain:

Input file contains >1 alignments, but UCSC A2M formatted output file can only contain 1
WARNING: Logging before flag parsing goes to stderr.
W0801 20:46:52.664010 140230345348992 deprecation_wrapper.py:119] From rgn/data_processing/convert_to_tfrecord.py:120: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

Traceback (most recent call last):
  File "rgn/data_processing/convert_to_tfrecord.py", line 123, in <module>
    dict_ = read_record(input_file, num_evo_entries)
  File "rgn/data_processing/convert_to_tfrecord.py", line 68, in read_record
    primary = letter_to_num(file_.readline()[:-1], _aa_dict)
  File "rgn/data_processing/convert_to_tfrecord.py", line 53, in letter_to_num
    num = [int(i) for i in num_string.split()]
ValueError: invalid literal for int() with base 10: '>2X7'

Again, any help is much appreciated.

@alquraishi
Contributor

Yes, unfortunately it doesn't support multiple chains at the moment. You'd have to input them separately.
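One way to input the chains separately is to split the multi-record FASTA file into one file per chain before running the pipeline. The sketch below is only an illustration under stated assumptions: the function names `split_fasta` and `write_single_chain_files` and the output naming scheme are hypothetical and not part of RGN, and each chain is assumed to be its own `>` record.

```python
from pathlib import Path


def split_fasta(text):
    """Return a list of (header, sequence) pairs, one per '>' record."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def write_single_chain_files(fasta_path, out_dir):
    """Write each record of fasta_path to its own file in out_dir.

    Output names (<stem>_<index>.fasta) are an arbitrary choice made
    for this sketch. Returns the list of paths written.
    """
    fasta_path, out_dir = Path(fasta_path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for i, (header, seq) in enumerate(split_fasta(fasta_path.read_text()), 1):
        out = out_dir / f"{fasta_path.stem}_{i}.fasta"
        out.write_text(f">{header}\n{seq}\n")
        written.append(out)
    return written
```

Each resulting single-chain file can then be processed on its own, in the same way as the single sequence that already worked.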

@gszwabowski

@Thomas191 How did you convert the tertiary file to a PDB file? I have my output but have no idea how to interpret it.
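The thread does not record how the conversion was done, so the following is only a rough sketch of one possible approach, not the method used above. It rests on assumptions that should be checked against your own output: that the tertiary file holds optional `#` comment lines followed by three whitespace-separated rows (x, y, z), with one column per backbone atom in N, CA, C order, and that coordinates are in picometers (hence the default `scale=0.01` to get angstroms). The function name `tertiary_to_pdb` is hypothetical and not part of RGN.

```python
# Three-letter residue names for the 20 standard amino acids.
THREE_LETTER = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}
# Assumed backbone atom order per residue: (PDB atom name, element).
BACKBONE = [(" N  ", "N"), (" CA ", "C"), (" C  ", "C")]
ATOM_FMT = ("ATOM  {:5d} {:4s} {:>3s} A{:4d}    "
            "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}          {:>2s}")


def tertiary_to_pdb(tertiary_text, sequence=None, scale=0.01):
    """Convert tertiary text to backbone-only PDB text.

    ASSUMED layout: '#' comment lines, then three rows (x, y, z) with
    one column per atom. scale=0.01 ASSUMES picometer input.
    """
    rows = [[float(v) for v in line.split()]
            for line in tertiary_text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")]
    if len(rows) != 3 or len({len(r) for r in rows}) != 1:
        raise ValueError("expected three coordinate rows of equal length")
    n_atoms = len(rows[0])
    if n_atoms % 3:
        raise ValueError("atom count is not a multiple of 3 (N, CA, C)")
    if sequence is not None and len(sequence) != n_atoms // 3:
        raise ValueError("sequence length does not match residue count")
    lines = []
    for i in range(n_atoms):
        name, element = BACKBONE[i % 3]
        res = i // 3
        res_name = THREE_LETTER.get(sequence[res], "UNK") if sequence else "UNK"
        x, y, z = (rows[k][i] * scale for k in range(3))
        lines.append(ATOM_FMT.format(i + 1, name, res_name, res + 1,
                                     x, y, z, 1.0, 0.0, element))
    lines.append("END")
    return "\n".join(lines) + "\n"
```

Writing the returned text to a `.pdb` file should give something a viewer such as PyMOL can open as a backbone trace. If the structure looks far too large or too small, the unit assumption behind `scale` is the first thing to revisit.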

@OsamaGhandour

@Thomas191 Can you share the exact command that solved this problem for you?
("This only creates folder 1, logs, and checkpoints folders")


4 participants