New biomol fields #400

d-beltran · 2022-02-23T12:15:35Z

This is the third option for including biomolecular data in OPTIMADE.
It has been discussed in issue #389.

It introduces two new main fields: biomol_chains and biomol_residues. These fields describe how atoms are grouped in "chains" and "residues", two classifiers widely used in the biomolecular field.

In addition, two more fields are suggested: biomol_sequences and biomol_sequence_types. These fields describe sequences of residues and are useful for queries.

The new fields are placed in the appendix, as @JPBergsma did in the previous options, PR #395 and PR #396.

JPBergsma · 2022-02-24T11:58:58Z

optimade.rst

+biomol_sequences
+~~~~~~~~~~~~~~~~
+
+- **Description**: A list of residue sequences in current structure. It may be any type of sequence, as this type is further specified in :field:`biomol_sequence_types`.


Are you not duplicating data here? It seems that the sequences already occur in the biomol_chains field. It would perhaps be better to make a reference from the biomol_chains to these sequences.

Is the main point of this field not to enable querying on the sequences?
If that is the case, it may be better to set the query support level to SHOULD.

You are right, data would be duplicated.

I was thinking that it could be better to specify which chains (and even which residues) are included in each sequence, by their indices, and then remove the sequences and sequence_types fields from biomol_chains.

Actually, instead of making new fields for this, we could reshape sequences as a list of dictionaries, to make it consistent with the other biomol field formats. What do you think?

biomol_sequences
~~~~~~~~~~~~~~~~

- **Description**: A list of residue sequences in current structure.
  Every sequence is a dictionary which includes the sequence itself and the type of sequence it is.
  Every sequence may include a list of chain and residue indices.
  Sequences may be grouped and ordered in any form (e.g. by chains, by fragments of covalently bonded atoms, etc.) as long as they make sense when querying structures by sequence.
- **Type**: list of dictionaries with the properties:

  - :property:`sequence`: string (REQUIRED)
  - :property:`type`: string (REQUIRED)
  - :property:`chains`: list of integers
  - :property:`residues`: list of integers

- **Requirements/Conventions**:

  - **Query**: Queries on this property SHOULD be supported.
  - **sequence**: A string with a letter for each residue in the sequence.
    Letters SHOULD be capital letters.
  - **type**: The type of a sequence is defined by its components (e.g. 'aminoacids').
  - **chains**: A list of integers referring to indices in :field:`biomol_chains` for chains which include this sequence totally or partially.
  - **residues**: A list of integers referring to indices in :field:`biomol_residues` for residues included in the sequence.
    Indices start the count at 0.
    There MUST NOT be repeated indices both in :property:`chains` and :property:`residues`.

- **Examples**:

  .. code:: jsonc

     {
       "biomol_sequences":[
         {
           sequence: 'MSHHWGYG',
           type: 'aminoacids'
         },
         {
           sequence: 'GATTACA',
           type: 'nucleotides'
         }
       ]
     }
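To make the proposed layout easier to discuss, here is a minimal Python sketch of how a client might check a ``biomol_sequences`` value against the draft rules above. The function name ``check_biomol_sequences`` and the wording of the messages are hypothetical; only the property names and the REQUIRED / MUST NOT rules come from the draft text.

```python
def check_biomol_sequences(sequences):
    """Return a list of problems found in a draft ``biomol_sequences`` value."""
    problems = []
    for i, entry in enumerate(sequences):
        # `sequence` and `type` are the two REQUIRED string properties.
        for key in ("sequence", "type"):
            if not isinstance(entry.get(key), str):
                problems.append(f"entry {i}: missing or non-string '{key}'")
        # `chains` and `residues` are optional lists of 0-based indices
        # that MUST NOT contain repeated values.
        for key in ("chains", "residues"):
            indices = entry.get(key, [])
            if len(indices) != len(set(indices)):
                problems.append(f"entry {i}: repeated index in '{key}'")
    return problems


example = [
    {"sequence": "MSHHWGYG", "type": "aminoacids", "chains": [0, 1]},
    {"sequence": "GATTACA", "type": "nucleotides", "chains": [2, 2]},
]
print(check_biomol_sequences(example))
```

The second entry is flagged because chain index 2 is listed twice. The capital-letter convention is a SHOULD, so this sketch does not enforce it.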

We could indeed turn the sequences into a dictionary.

I was still wondering whether there is a clear hierarchical structure, i.e. that a chain can have one or more sequences, but a sequence cannot contain more than one chain. In that case, we would not need to duplicate the residues.

I was also wondering how the situation is handled where there are multiple chains with the same sequence.
Will there be multiple sequence entries with the same sequence? My idea was to make a sequence unique, so it only occurs once in the biomol_sequences field. But if you include the residues, you would need to have a separate sequence entry for each chain.

> I was still wondering whether there is a clear hierarchical structure, i.e. that a chain can have one or more sequences, but a sequence cannot contain more than one chain. In that case, we would not need to duplicate the residues.

In my previous suggestion the chains property is a list of integers, so a sequence may span more than one chain. This is important since a polymer may be split into several chains. On the other hand, a chain may contain the whole sequence of a polymer and other things at the same time, so not all residues in the chain would be part of the sequence. In that case, listing residues in the chain makes sense.

> I was also wondering how the situation is handled where there are multiple chains with the same sequence.
> Will there be multiple sequence entries with the same sequence? My idea was to make a sequence unique, so it only occurs once in the biomol_sequences field. But if you include the residues, you would need to have a separate sequence entry for each chain.

I had not thought about this, and you are right. We can then drop the residues to keep everything simpler.

merkys · 2022-02-24T12:06:04Z

optimade.rst

-   - **types**: A list of tags specifying the type of molecules this chain contains.
+   - **types**: A list of custom tags/labels specifying the type of molecules this chain contains (e.g. 'protein').
+     This field is useful as an overview of every chain and as a query target for the structure.
+     Labels in this field are non-standard. Every implementation may use different labels according to its needs.


I would advise standardizing labels at least for the most common molecule types to benefit the queryability of the field. Implementations could use their own labels, but prefixed with their own database-specific prefixes.

Good idea. This way we could save a lot of work.

I have been searching for references for our labels in the current mmCIF format (the future PDB standard) and this is the best I have found. They are meant for assemblies and do not entirely fit our needs, but I will try to stay close to them.

So I suggest the standard labels to be the following: 'PROTEIN', 'NUCLEIC ACID', 'CARBOHYDRATES', 'LIPID', 'MEMBRANE', 'LIGAND', 'ION', 'SOLVENT', 'OTHER'.

If you agree I will commit changes soon.

Yes, it seems like a good idea to define these labels.

I very much like basing this on mmCIF. Maybe there is already a JSON representation for mmCIF data?

> So I suggest the standard labels to be the following: 'PROTEIN', 'NUCLEIC ACID', 'CARBOHYDRATES', 'LIPID', 'MEMBRANE', 'LIGAND', 'ION', 'SOLVENT', 'OTHER'.
>
> If you agree I will commit changes soon.

These labels sound very good. I would just render them in lowercase and describe the use of prefixes for custom labels in the form of <prefix>:<label>.

> Maybe there is already a JSON representation for mmCIF data?

In a quick search I found several mmCIF-to-JSON parsers, and one of them seems to be the official one.

> I would just render them in lowercase and describe the use of prefixes for custom labels in the form of <prefix>:<label>.

All right.
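As an illustration of the convention agreed on in this thread (lowercase standard labels, custom labels written as ``<prefix>:<label>``), a label check could be sketched in Python as follows. The helper name ``classify_label``, the regular expression, and the ``exmpl`` prefix are assumptions made for this example; the proposal itself only fixes the list of standard labels and the ``<prefix>:<label>`` form.

```python
import re

# Standard labels as proposed in this thread, rendered in lowercase.
STANDARD_LABELS = {
    "protein", "nucleic acid", "carbohydrates", "lipid", "membrane",
    "ligand", "ion", "solvent", "other",
}

# Assumed shape of a custom label: non-empty prefix, a colon, non-empty label.
CUSTOM_LABEL = re.compile(r"^[^:\s]+:.+$")


def classify_label(label):
    """Classify a chain type label as 'standard', 'custom' or 'invalid'."""
    if label in STANDARD_LABELS:
        return "standard"
    if CUSTOM_LABEL.match(label):
        return "custom"
    return "invalid"


for label in ("protein", "exmpl:glycoprotein", "PROTEIN"):
    print(label, "->", classify_label(label))
```

With this check, an uppercase ``PROTEIN`` or an unprefixed custom label is rejected, which is what keeps the field queryable across providers.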

JPBergsma · 2022-05-19T17:16:18Z

I just came across another issue. I am trying to implement the standard we described here to aid our discussion. To have some example data, I downloaded a random trajectory from the internet. This trajectory has a non-standard amino acid in it.
How do you suggest we handle this case? It seems a one-letter code is not sufficient to describe all amino acids in a sequence.

d-beltran · 2022-05-20T09:16:52Z

Usually, non-standard amino acids are tagged as 'X' in the one-letter code.

JPBergsma · 2022-05-23T10:13:57Z

Yes, I can do that. It would make it harder to search for sequences with non-standard amino acids, but those are probably quite rare anyway.
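A minimal sketch of that fallback, assuming residue names are given as three-letter codes: the 20 standard amino acids are mapped to their one-letter codes and anything else becomes 'X'. The function name ``to_one_letter_sequence`` is hypothetical.

```python
# One-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def to_one_letter_sequence(residue_names):
    """Build a one-letter sequence; unknown residue names become 'X'."""
    return "".join(THREE_TO_ONE.get(name.upper(), "X") for name in residue_names)


# MSE (selenomethionine) is not in the standard table, so it maps to 'X'.
print(to_one_letter_sequence(["MET", "MSE", "GLY"]))
```

As noted above, the price of this approach is that all non-standard residues become indistinguishable in sequence queries.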

JPBergsma · 2022-05-04T13:10:33Z

optimade.rst

+      },
+      {
+        sequence: 'GATTACA',
+        type: 'nucleotides'


Would it not be better to use something other than "nucleotides" so we can distinguish between DNA and RNA?

JPBergsma · 2022-05-05T14:44:00Z

optimade.rst

+     Standard labels for this field are the following: 'protein', 'nucleic acid', 'carbohydrates', 'lipid', 'membrane', 'ligand', 'ion', 'solvent' and 'other'.
+     The list SHOULD contain values within the standard labels.
+     Additional custom labels MAY be used. These labels MUST include the database-provider-specific prefix with the following format: <prefix>:<label>.
+   - **sequences**: A list of residue sequences in current chain.


If the same sequence occurs twice in a chain, should the sequence be listed here twice or just once?

JPBergsma · 2022-09-13T08:44:58Z

optimade.rst

+        "name": "PHE",
+	      "number": 17,
+	      "insertion_code": null,
+        "sites":[0,1,2,3, ...]
+      },
+      {
+        "name": "ASP",
+	      "number": 18,
+	      "insertion_code": null,
+        "sites":[17,18,19,20, ...]
+      },
+      {
+        "name": "LEU",
+	      "number": 18,
+        "insertion_code": "A",
+        "sites":[29,30,31, ...]


Suggested change

        "name": "PHE",
        "number": 17,
        "insertion_code": null,
        "sites":[0,1,2,3, ...]
      },
      {
        "name": "ASP",
        "number": 18,
        "insertion_code": null,
        "sites":[17,18,19,20, ...]
      },
      {
        "name": "LEU",
        "number": 18,
        "insertion_code": "A",
        "sites":[29,30,31, ...]

For me the indentation levels were not consistent, so I have tried to correct them here.

JPBergsma · 2022-09-13T09:01:10Z

optimade.rst

+     This field is useful as an overview of every chain and as a query target for the structure.
+     Standard labels for this field are the following: 'protein', 'nucleic acid', 'carbohydrates', 'lipid', 'membrane', 'ligand', 'ion', 'solvent' and 'other'.
+     The list SHOULD contain values within the standard labels.
+     Additional custom labels MAY be used. These labels MUST include the database-provider-specific prefix with the following format: <prefix>:<label>.


I think it would be good to add a custom label to the example.

JPBergsma · 2022-09-13T09:27:04Z

optimade.rst

+   - Values in :property:`name` SHOULD be in capital letters.
+   - Values in :property:`name` SHOULD NOT be longer than 1 character when the number of chains is not greater than the number of letters in English alphabet (26).
+   - Values in :property:`sequences` SHOULD be in capital letters.
+   - Number of values in :property:`sequences` and :property:`sequence_types` MUST match.


Maybe add here that the order of the values should also be the same.

Suggested change

-   - Number of values in :property:`sequences` and :property:`sequence_types` MUST match.
+   - The number of values and their order in :property:`sequences` and :property:`sequence_types` MUST match.
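The reason the order matters is that a consumer has to pair the two lists element by element. A small Python sketch of that pairing, with a hypothetical helper name ``pair_sequences`` and a chain dictionary shaped like the draft ``biomol_chains`` entries:

```python
def pair_sequences(chain):
    """Pair each sequence of a chain with its type, position by position."""
    sequences = chain.get("sequences", [])
    types = chain.get("sequence_types", [])
    # The draft says the number of values MUST match; otherwise the
    # positional pairing below would silently drop entries.
    if len(sequences) != len(types):
        raise ValueError("sequences and sequence_types differ in length")
    return list(zip(sequences, types))


chain = {
    "sequences": ["MSHHWGYG", "GATTACA"],
    "sequence_types": ["aminoacids", "nucleotides"],
}
print(pair_sequences(chain))
```

A length check can be automated like this, but a swapped order cannot be detected by a client, which is why the requirement needs to be stated explicitly.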

merkys · 2023-06-08T09:06:50Z

optimade.rst

+   - :property:`name`: string (REQUIRED)
+   - :property:`number`: integer (REQUIRED)
+   - :property:`icode`: string or null (REQUIRED)
+   - :property:`chain`: string (OPTIONAL)


What does it mean when chain is missing?

It may happen in a regular PDB file that the chain column is blank, and this is not necessarily wrong. I don't think it has any physical or chemical meaning. Chains are quite arbitrary and there are no strict criteria for setting them.

In our database, when chains are missing we set them automatically using a chain-per-fragment logic, but this is just to have the data standardized. Some tools just assign all atoms to chain 'X', and some tools simply respect the blank and leave the structure without chains.

Thanks for the explanation. But maybe then it would make sense to make chain mandatory and faithfully retain the space character ( ) as its value?

Sure, that also works for me.

We also talked about not being constrained by the limits of the PDB format regarding the 1-character string in the chain name, so the missing chain could also be 'Not defined', '', null or many others. As you prefer.
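For reference, here is a rough Python sketch of the "retain the space character" option, reading the residue identifiers from a fixed-width PDB ATOM record. The column positions follow the PDB format (residue name in columns 18-20, chain in column 22, residue number in columns 23-26, insertion code in column 27); the function name ``residue_fields`` and the choice to map a blank insertion code to ``None`` are assumptions of this example, not something settled in this PR.

```python
def residue_fields(atom_line):
    """Extract residue identifiers from a fixed-width PDB ATOM/HETATM record."""
    icode = atom_line[26]
    return {
        "name": atom_line[17:20].strip(),
        "number": int(atom_line[22:26]),
        # A blank insertion code is mapped to None (null in JSON).
        "icode": None if icode == " " else icode,
        # The chain column is kept verbatim, so a blank chain stays " ".
        "chain": atom_line[21],
    }


# Example record assembled column by column; the chain column is left blank.
line = "ATOM  " + "    1" + " " + " CA " + " " + "LEU" + " " + " " + "  18" + "A"
print(residue_fields(line))
```

With this reading, ``chain`` is always present and a blank chain column round-trips as a single space instead of being dropped.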

Dani Beltrán added 2 commits February 23, 2022 11:40

added new biomol fields

eca24df

added new biomol sequence fields

572a29f

JPBergsma reviewed Feb 23, 2022

d-beltran and others added 3 commits February 23, 2022 16:53

Update optimade.rst

173172f

Co-authored-by: Johan Bergsma <29785380+JPBergsma@users.noreply.github.com>

Title underlines fit

5911d53

More explained biomol_chain types

05a8fed

JPBergsma reviewed Feb 24, 2022

merkys reviewed Feb 24, 2022

Added standard labels for biomol_chain types

76f690c

JPBergsma reviewed Feb 28, 2022

d-beltran added 4 commits February 28, 2022 14:59

Added underscore to new fieldnames

1f5650d

Amend

a8781ac

New species property: biomol atom name

ac6ed65

Restructured biomol sequences

55f71e2

ml-evs added the type/proposal Proposal for addition/removal of features. May need broad discussion to reach consensus. label Jun 1, 2022

rartino added the PR/waiting-for-update This PR has been reviewed and is waiting for the author to response or update the PR label Jun 29, 2022

d-beltran added 2 commits September 13, 2022 15:31

Merge branch 'develop' into iss389_biomol

534ec8d

Laussane discussions update

f865a5a

JPBergsma reviewed Sep 13, 2022

JPBergsma reviewed Sep 13, 2022

JPBergsma reviewed Sep 13, 2022

JPBergsma reviewed Sep 13, 2022

d-beltran and others added 4 commits September 13, 2022 18:17

Update optimade.rst

af3c817

Co-authored-by: Johan Bergsma <29785380+JPBergsma@users.noreply.github.com>

Update optimade.rst

616019c

Update optimade.rst

bd3e9e1

Update optimade.rst

ce93c9d

Co-authored-by: Johan Bergsma <29785380+JPBergsma@users.noreply.github.com>

added a few breaklines with correct indent

a875aaf

JPBergsma reviewed Sep 13, 2022

JPBergsma reviewed Sep 13, 2022

JPBergsma reviewed Sep 13, 2022

JPBergsma reviewed Sep 13, 2022

JPBergsma reviewed Sep 13, 2022

d-beltran and others added 6 commits September 14, 2022 12:33

Update optimade.rst

a1ddd88

Co-authored-by: Johan Bergsma <29785380+JPBergsma@users.noreply.github.com>

Update optimade.rst

a0ee16a

Co-authored-by: Johan Bergsma <29785380+JPBergsma@users.noreply.github.com>

Update optimade.rst

c0ea1ac

Co-authored-by: Johan Bergsma <29785380+JPBergsma@users.noreply.github.com>

Update optimade.rst

af14d72

Co-authored-by: Johan Bergsma <29785380+JPBergsma@users.noreply.github.com>

insertion_code renamed as icode

7610146

Merge branch 'develop' into iss389_biomol

c38075c

ml-evs mentioned this pull request Dec 5, 2022

OPTIMADE v1.2 release planning #429

Open

merkys reviewed Jun 8, 2023

rartino mentioned this pull request Jan 10, 2024

InChIKey property #466

Open

d-beltran commented Feb 23, 2022

JPBergsma Feb 24, 2022

d-beltran Feb 24, 2022

JPBergsma Feb 28, 2022

d-beltran Feb 28, 2022

merkys Feb 24, 2022

d-beltran Feb 24, 2022

JPBergsma Feb 25, 2022

merkys Feb 25, 2022

d-beltran Feb 25, 2022

JPBergsma commented May 19, 2022

d-beltran commented May 20, 2022

JPBergsma commented May 23, 2022

JPBergsma May 4, 2022

JPBergsma May 5, 2022

JPBergsma Sep 13, 2022

JPBergsma Sep 13, 2022

JPBergsma Sep 13, 2022

merkys Jun 8, 2023

d-beltran Jun 8, 2023 •

edited

merkys Jun 8, 2023

d-beltran Jun 8, 2023
