No secondary structure data in CASP12 TFRecord files. I didn't check others... #5

Open
ufimtsev opened this issue Dec 21, 2018 · 14 comments

Comments

@ufimtsev

No description provided.

@mircare

mircare commented Feb 15, 2019

"* CASP12 test set is incomplete due to embargoed structures. Once the embargo is lifted we will release all structures." https://github.com/aqlaboratory/proteinnet

@ufimtsev
Author

This isn't limited to CASP12. None of the files contain secondary structure data.

@basantab

Hi,

Yes, it looks like the text-based records of CASP11 are also missing the secondary structure entries.

@alquraishi
Contributor

Thanks for bringing this to my attention. I will update the files soon with secondary structure information.

@crvineeth97

@alquraishi Thank you for the amazing resource. May I please know when this issue will be fixed? Thanks!

@AlexeyG

AlexeyG commented Jun 3, 2019

I checked a number of splits for a number of CASPs, in both the TFRecord and the text formats. I wasn't exhaustive, but it seems that secondary structure data is missing from all of them.

Can the information still be (easily) added to the datasets?
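
One way to run this kind of check on a text-format split is to count the section headers it contains. This is a minimal sketch, assuming that each record in the text files is a series of bracketed header lines (such as `[ID]` and `[PRIMARY]`) and that secondary structure, if present, would sit under a `[SECONDARY]` header. The header names and the sample record are assumptions for illustration and are not taken from this thread.

```python
from collections import Counter
from typing import Iterable


def count_sections(lines: Iterable[str]) -> Counter:
    """Count bracketed section headers such as [ID] or [PRIMARY]."""
    counts: Counter = Counter()
    for raw in lines:
        line = raw.strip()
        # Treat a whole line of the form [NAME] as a section header.
        if len(line) > 2 and line.startswith("[") and line.endswith("]"):
            counts[line[1:-1]] += 1
    return counts


# Tiny made-up record for illustration; real files are far larger.
sample = """[ID]
1ABC_1_A
[PRIMARY]
MKV
[MASK]
+++
"""

counts = count_sections(sample.splitlines())
print(dict(counts))
# Assumed header name: a count of zero means no such section was found.
print("SECONDARY sections:", counts.get("SECONDARY", 0))
```

To scan a real split, pass an open file handle instead of `sample.splitlines()`; the function reads line by line, so it does not need to load the whole file.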

@alquraishi
Contributor

Hi @AlexeyG, yes the information can be added easily. It's mostly already there, I just need to expose it. Stay tuned.

@oskar-taubert

Hi @alquraishi, can you estimate when this is going to happen?
I would like to use ProteinNet to compare a couple of predictor architectures and a standardized dataset would be useful. Thanks.

@uoda

uoda commented Nov 19, 2019

I am Harun Or Rashid, doing my master's thesis on protein sequence, structure and function analysis at the University of Wuerzburg, Germany, under Prof. Dr. Thomas Dandekar, who chairs the Department of Bioinformatics.

I have been trying to implement your RGN network to predict protein 3D structure from sequence. I followed the instructions in your GitHub repository: https://github.com/aqlaboratory/rgn

I installed the CPU version of TensorFlow 1.10.0, along with Python 2.7 and setproctitle, in a conda environment.

I created the directories as you described:
RGN7/data/ProteinNet7Thinning90/(testing,training,validation) folders
RGN7/runs/CASP7/ProteinNet7Thinning90/CASP7.config

I ran the script:
python protling.py RGN7/runs/CASP7/ProteinNet7Thinning90/CASP7.config -d RGN7 -p

But the only output I got was a CASP7.log file, which I have attached here.

I do not understand the error or where the problem is.
Could you please help me solve this issue and run this properly?

@MoZZez

MoZZez commented May 2, 2020

Hello, I also encountered this issue (specifically in CASP11). Are there still plans to add secondary structures in the foreseeable future?

@deepgradient

deepgradient commented Oct 19, 2020

> Hi @AlexeyG, yes the information can be added easily. It's mostly already there, I just need to expose it. Stay tuned.

Any progress in resolving this issue?

@spetti

spetti commented Nov 18, 2020

I am also very interested in using the secondary structure information. Are there still plans to release this info? Thanks!

@alquraishi
Contributor

As an interim solution, I added JSON files for the secondary structure data. I say interim because there are a few caveats: the data is not currently integrated with the rest of ProteinNet. Instead, these JSON files stand on their own and use an ad hoc file format. There are two files, one corresponding to single-domain entries coming from ASTRAL and the other to whole proteins coming from the PDB. The IDs of these entries match those of the original ProteinNet files, so it should be easy to cross-reference them. The only other wrinkle is that not all ProteinNet entries have secondary structure information, but the vast majority do. The files are linked from the main README page.
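
The cross-referencing described above could be sketched as follows. Because the JSON format is ad hoc and not specified in this thread, the sketch assumes, purely for illustration, that each JSON file is a single object mapping an entry ID to a secondary structure string, and that a text-format record gives its ID on the line after an `[ID]` header. The function names, IDs and strings are hypothetical.

```python
import json
from typing import Dict, Iterable, List, Tuple


def read_ids(lines: Iterable[str]) -> List[str]:
    """Collect the line that follows each [ID] header (assumed layout)."""
    ids: List[str] = []
    expect_id = False
    for raw in lines:
        line = raw.strip()
        if expect_id:
            ids.append(line)
            expect_id = False
        elif line == "[ID]":
            expect_id = True
    return ids


def cross_reference(
    ids: Iterable[str], ss_by_id: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Split IDs into those with a secondary structure entry and those without."""
    matched: Dict[str, str] = {}
    missing: List[str] = []
    for entry_id in ids:
        if entry_id in ss_by_id:
            matched[entry_id] = ss_by_id[entry_id]
        else:
            # Not every entry is expected to have secondary structure data.
            missing.append(entry_id)
    return matched, missing


# Made-up data standing in for a ProteinNet text file and a JSON file.
records = "[ID]\n1ABC_1_A\n[PRIMARY]\nMKV\n[ID]\n2XYZ_1_B\n[PRIMARY]\nGSH\n"
ss_json = '{"1ABC_1_A": "HHE"}'

ids = read_ids(records.splitlines())
matched, missing = cross_reference(ids, json.loads(ss_json))
print("matched:", matched)
print("missing:", missing)
```

Keeping the unmatched IDs in a separate list makes it easy to see how many entries lack secondary structure data, rather than dropping them silently.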

@deepgradient

deepgradient commented Nov 22, 2020

@alquraishi
Dear Mohammed,

I've just checked the IDs of the CASP11 validation and test datasets against the JSON files you added. Unfortunately, I cannot match any IDs between the CASP11 datasets and the JSON files.
Is it possible that I am doing something wrong?

I would be grateful if you could let me know whether the secondary structure information could be added directly to the original CASP datasets.

Thanks in advance.
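
When no IDs match at all, it can help to inspect how the two ID sets differ before concluding that entries are missing. The sketch below assumes, hypothetically, that some dataset IDs carry a `prefix#` in front of the entry ID that the JSON keys lack. All IDs here are made up, and whether this explains the CASP11 mismatch is not confirmed anywhere in this thread.

```python
from typing import Dict, Iterable, Set


def strip_prefix(entry_id: str, sep: str = "#") -> str:
    """Drop everything up to and including the last `sep`, if present."""
    return entry_id.rsplit(sep, 1)[-1]


def overlap_report(
    dataset_ids: Iterable[str], json_ids: Iterable[str]
) -> Dict[str, Set[str]]:
    """Group dataset IDs by how (or whether) they match the JSON keys."""
    dataset = set(dataset_ids)
    known = set(json_ids)
    exact = dataset & known
    # IDs that only match once the assumed prefix is removed.
    normalized = {i for i in dataset - exact if strip_prefix(i) in known}
    unmatched = dataset - exact - normalized
    return {"exact": exact, "after_normalizing": normalized, "unmatched": unmatched}


# Hypothetical IDs for illustration only.
report = overlap_report(
    ["70#1ABC_1_A", "2XYZ_1_B", "TBM#T0000"],
    ["1ABC_1_A", "2XYZ_1_B"],
)
for group, members in report.items():
    print(group, sorted(members))
```

Printing a few raw IDs from each side next to this report is usually enough to tell a formatting difference apart from data that is actually absent.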
