Unmasked zeroed tertiary data in text-based CASP7 #30

Open
memoryleak47 opened this issue Mar 15, 2022 · 5 comments

@memoryleak47

When implementing an RGN for a university project, we stumbled upon a few apparent irregularities in the text-based CASP7 dataset provided here.
That is, quite a few atoms in the tertiary data were positioned at (0,0,0) even though the mask was +, i.e. the atom was considered to be 'valid'.

Example taken from CASP7/validation.

[ID]
70#1MLI_1_A
[PRIMARY]
...
[EVOLUTIONARY]
...
[TERTIARY]
0	1562.5	0	0	1571.2	0	0	1458.2	0	0	1371.3	0	0	1078.5	0	0	953.8 ...
0	1363.	0	0	1492.5	0	0	1226.9	0	0	1303.3	0	0	1229.4	0	0	1255.1 ...
0	4743.1	0	0	4394.3	0	0	4152.2	0	0	3792.3	0	0	3597.2	0	0	3246.3 ...
[MASK]
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++                           

In this example two thirds of the atoms are positioned at (0, 0, 0).
Is this a bug, or am I simply misinterpreting the given data somehow?

Thanks in advance!

@jonathanking

I believe there are a handful of structures that only contain alpha-carbon information. If you inspect the RCSB entry, you'll find this is the case for this structure. You can also see the pattern of (N, Calpha, C) in the tertiary data, where N and C are missing.
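
To check for this pattern in a record, here is a minimal sketch. It assumes the three [TERTIARY] rows hold the x, y and z coordinates and that the columns run atom by atom in (N, Calpha, C) order per residue, as the excerpt above suggests. The helper names `backbone_atoms` and `zeroed_atoms` are made up for illustration and are not part of ProteinNet.

```python
# Hypothetical helpers, not part of ProteinNet.
# Assumption: the three [TERTIARY] rows are x, y, z, and the columns are
# ordered atom by atom as N, CA, C for each residue.

def backbone_atoms(tertiary_rows):
    """Regroup three coordinate rows into per-residue lists of (x, y, z) tuples."""
    xs, ys, zs = tertiary_rows
    if not (len(xs) == len(ys) == len(zs)) or len(xs) % 3 != 0:
        raise ValueError("expected three equal-length rows with 3 atoms per residue")
    coords = list(zip(xs, ys, zs))  # one (x, y, z) tuple per atom
    return [coords[i:i + 3] for i in range(0, len(coords), 3)]


def zeroed_atoms(residues, names=("N", "CA", "C")):
    """For each residue, list the names of atoms sitting exactly at the origin."""
    return [
        [name for name, xyz in zip(names, atoms) if all(v == 0.0 for v in xyz)]
        for atoms in residues
    ]


# First two residues of the 70#1MLI_1_A excerpt quoted in this issue.
rows = [
    [0, 1562.5, 0, 0, 1571.2, 0],
    [0, 1363.0, 0, 0, 1492.5, 0],
    [0, 4743.1, 0, 0, 4394.3, 0],
]
residues = backbone_atoms(rows)
print(zeroed_atoms(residues))  # [['N', 'C'], ['N', 'C']]
```

If every residue reports N and C as zeroed while Calpha is populated, the record most likely comes from a Calpha-only structure.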

Hopefully Mohammed can correct me if I am mistaken, but I hope my comment can help for now.

@memoryleak47

memoryleak47 commented Mar 18, 2022

I see! So sometimes individual atoms can be missing in spite of a "+" mask.

But can we assume that each (0, 0, 0) atom is in fact just missing data?
Or is there some other procedure to know which atoms are valid?

@jonathanking

jonathanking commented Mar 19, 2022 via email

@memoryleak47

> Correct. I believe the mask is on the residue level and not the atom level.

Ah, true!
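
If the mask really is per residue, one possible workaround is to derive an atom-level mask by combining it with a check for all-zero coordinates. The sketch below is a heuristic and not something the thread or the ProteinNet documentation confirms: it assumes an atom at exactly (0, 0, 0) is missing, which would also discard a real atom that happened to lie at the origin. The function name `atom_mask` is hypothetical.

```python
# Hypothetical helper, not part of ProteinNet.
# Assumption: an atom at exactly (0, 0, 0) is treated as missing data.

def atom_mask(residue_mask, residues):
    """Return one [N, CA, C] list of booleans per residue.

    residue_mask: string of '+' / '-' characters, one per residue.
    residues: per-residue lists of three (x, y, z) tuples.
    """
    if len(residue_mask) != len(residues):
        raise ValueError("mask length must match the number of residues")
    return [
        [flag == "+" and any(v != 0.0 for v in xyz) for xyz in atoms]
        for flag, atoms in zip(residue_mask, residues)
    ]


# The first residue is taken from the excerpt in this issue; the second is a
# made-up masked-out residue, included only to show both cases.
residues = [
    [(0, 0, 0), (1562.5, 1363.0, 4743.1), (0, 0, 0)],
    [(0, 0, 0), (0, 0, 0), (0, 0, 0)],
]
print(atom_mask("+-", residues))
# [[False, True, False], [False, False, False]]
```

A loss function could then skip every atom whose entry is False, so that zero-filled N and C positions do not contribute.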

If I'm not overlooking something, this doesn't seem to be mentioned in the documentation at https://github.com/aqlaboratory/proteinnet/blob/master/docs/proteinnet_records.md or anywhere else on this GitHub page.

Is there some external resource where I could read up on that?

@jonathanking

I'm afraid I don't have more information. I'm not affiliated with ProteinNet, though I use the provided data and dataset splits in my own research.
