Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distinguish experimental structures from theoretical #406

Open
merkys opened this issue May 31, 2022 · 42 comments · May be fixed by #455
Open

Distinguish experimental structures from theoretical #406

merkys opened this issue May 31, 2022 · 42 comments · May be fixed by #455
Labels
topic/property-standardization The specification of the precise data representation of properties and entries type/proposal Proposal for addition/removal of features. May need broad discussion to reach consensus.

Comments

@merkys
Copy link
Member

merkys commented May 31, 2022

As suggested by @BobHanson, there should be standard means to distinguish between experimental and theoretical structures. This could be a property with boolean/enum values. I would suggest "MUST" level of support (maybe even for queries), as I believe this bit of information should always be available.

@ml-evs
Copy link
Member

ml-evs commented May 31, 2022

Could we get away with defining a new enum value in structure_features for this?

@merkys
Copy link
Member Author

merkys commented May 31, 2022

Could we get away with defining a new enum value in structure_features for this?

Sounds good to me. Should it be theoretical for theoretical structures? Or rather experimental for experimental ones? Which is more natural? For me it is theoretical, but I come from experimental background, hence my bias 😄

@merkys merkys added topic/property-standardization The specification of the precise data representation of properties and entries type/proposal Proposal for addition/removal of features. May need broad discussion to reach consensus. labels May 31, 2022
@JPBergsma
Copy link
Contributor

JPBergsma commented May 31, 2022

I think we will need to have both theoretical and experimental. That way we can implicitly also have the case where it is undefined, which may be useful for old entries for which it was not recorded whether it is a theoretical or experimental structure, or simply for databases that have not been updated.

What you are proposing does stretch the meaning of the structure_features field, which is defined as: A list of strings that flag which special features are used by the structure.

Perhaps this is also a good moment to think about how we want to include more detailed information about how the structure was generated. Especially information that would be interesting for generating querying structures. X-ray scattering or neutron scattering, ab initio calculations. The software package that was used. etc.

@BobHanson
Copy link

BobHanson commented May 31, 2022 via email

@merkys
Copy link
Member Author

merkys commented May 31, 2022

Reading @JPBergsma and @BobHanson responses I am now leaning towards separate property. It could actually provide more information about the origin of a structure. In the COD, we have a CIF data item _cod_struct_determination_method with the following possible values: single crystal, powder diffraction and theoretical. Maybe something similar could be introduced into OPTIMADE.

@BobHanson
Copy link

This sounds great to me. But can you have theoretical PD? Re there two concepts here?

data[i].nature: {"experimental"|"theoretical"}
data[i].method: {"single crystal diffraction"|"powder diffraction"}

@merkys
Copy link
Member Author

merkys commented May 31, 2022

This sounds great to me. But can you have theoretical PD? Re there two concepts here?

data[i].nature: {"experimental"|"theoretical"} data[i].method: {"single crystal diffraction"|"powder diffraction"}

Right. Then these should be separate properties.

@merkys
Copy link
Member Author

merkys commented Jun 1, 2022

How should we name such a property? Some suggestions:

  • nature
  • origin
  • determination_method.

Personally, nature does not sound immediately clear to me, origin might also be quite ambiguous.

@JPBergsma
Copy link
Contributor

Personally, nature does not sound immediately clear to me, origin might also be quite ambiguous.

Yes, I also would not know a good name for this distinction. From the suggestions above I found determination_method the clearest. But perhaps we can also name it simply experimental_or_theoretical .

@BobHanson
Copy link

BobHanson commented Jun 1, 2022 via email

@merkys
Copy link
Member Author

merkys commented Jun 2, 2022

@JPBergsma: experimental_or_theoretical will need renaming should more enumerator values be introduced (i.e., mixed or something else).

@BobHanson: "experimental_method": "experimental" sounds slightly wonky to me.

@BobHanson
Copy link

Ah, right. This was in reference to

experimental_method: {single crystal diffraction | powder diffraction|...}
investigation_type: {experimental | theoretical}

brainstorming...

@blokhin
Copy link
Member

blokhin commented Jun 3, 2022

Following the discussion with @sauliusg I can also point out many edge cases where the experimental or theoretical nature is not immediately clear. An example is de-novo crystal structure refinement, see [1], [2], and many more.

@blokhin
Copy link
Member

blokhin commented Jun 3, 2022

cf. computational experiments vs. experimental modeling

@BobHanson
Copy link

BobHanson commented Jun 3, 2022 via email

@JPBergsma
Copy link
Contributor

experimental_or_theoretical will need renaming should more enumerator values be introduced (i.e., mixed or something else).

Yes, you are right, that is not convenient.

How about method_class or method_category ? That allows another field named method that holds a more specific term for the procedure used to generate the data.

@merkys
Copy link
Member Author

merkys commented Jun 6, 2022

Indeed, there is a whole spectrum of methods ranging from purely experimental (can we actually get coordinates without any theoretical assumptions?) to purely theoretical. We probably would need a separate ontology just to identify where a structure sits in that spectrum.

@BobHanson
Copy link

BobHanson commented Jun 6, 2022 via email

@merkys
Copy link
Member Author

merkys commented Jun 6, 2022

@BobHanson Is there a link for the said ICSD conception?

@rartino
Copy link
Contributor

rartino commented Jun 7, 2022

I understand the desire to add something ASAP to help distinguish experimental and theoretical structural data. However, I'd suggest to be careful to not over-design this interface, since it is debatable if this info even belongs in this endpoint. Going forward, we won't be able to stuff all possible experimental and theoretical details related to a structure into the structure endpoint and, at least for theoretical structures, I believe our consensus is that things like "which method", etc., belongs in the calculations endpoint with relationships to "input" and "output" structures.

Hence, I suggest this to just be a simple boolean field: experimental that is defined to be True if, and only if, the structural data, including the atomic coordinates, represented by the structure have been obtained more or less directly out of an experiment and thus the crystal structure reasonably can be understood to have been observed in nature.

The alternative, False, just means that the structure has been obtained some other way. E.g., hypothetical structures through substitutions (perhaps including DFT relaxations, etc., but not necessarily), structure prediction algorithms, just random initialization, etc. No guarantees that these structures "make sense".

(Or is there a very strong desire to also distinguish theoretical structures that the database strongly believes are at, or very close, to the convex hull of stability? This, I believe, is the ICSD criterion for inclusion.)

@blokhin
Copy link
Member

blokhin commented Jun 7, 2022

Unfortunately it starts to be complicated here. Imagine we took an experimental structure and relax it fully with the DFT, ending up with the different cell, symmetry, atomic positions, etc. Is the structure still experimental?

@rartino
Copy link
Contributor

rartino commented Jun 7, 2022

@blokhin

Unfortunately it starts to be complicated here. Imagine we took an experimental structure and relax it fully with the DFT, ending up with the different cell, symmetry, atomic positions, etc. Is the structure still experimental?

It was my intent to mostly avoid this complexity by a single stringent definition separating everything into "directly from experiment" vs. other. My definition above was meant to say that your example is not an experimental structure.

@ml-evs
Copy link
Member

ml-evs commented Jun 7, 2022

Following the discussion with @sauliusg I can also point out many edge cases where the experimental or theoretical nature is not immediately clear. An example is de-novo crystal structure refinement, see [1], [2], and many more.

I think my suggestion was for actual vs hypothetical which maybe makes this slightly clearer (though shifts the vagueness elsewhere, e.g. whether a DFT database that simply took experimental structures and calculated band gaps without relaxing should report itself as hypothetical or actual).

The two relevant axes for filtering seem to me to be whether something has actually been made, and whether the structure is simply the result of minimising or sampling of a Hamiltonian

@BobHanson
Copy link

BobHanson commented Jun 8, 2022 via email

@merkys
Copy link
Member Author

merkys commented Jun 9, 2022

Having read the discussion, I tend to agree with those of you favoring single boolean flag. The question now is where to draw the line. However, neither ICSD paper nor related discussion on matsci.org does provide clear criteria (thanks @BobHanson for links, though). @vaitkus, maybe IUCr has put up any criteria?

I am a bit skeptical regarding the structures relationships with calculations though. In TCOD we have theoretically calculated structures from journal publications, but usually machine-readable metadata related to actual calculations is scarce (but reported in human-readable publications). Thus if calculations entries become mandatory for theoretical structures, we would not be able to return much meaningful data in them.

@vaitkus
Copy link
Contributor

vaitkus commented Jun 27, 2022

@merkys, as far as I know, the IUCr does not have any such criteria.

However, the ICSD paper lists three types of subclasses of theoretical structures:

  • Predicted (non-existing) crystal structure.
  • Optimized (existing) crystal structure.
  • Combination of theoretical and experimental structure.

Based on this, I would say that according to them anything that is not purely experimental is classified as theoretical.

@merkys
Copy link
Member Author

merkys commented Jun 27, 2022

Based on this, I would say that according to them anything that is not purely experimental is classified as theoretical.

I think there might be difficulties in drawing the line between refinement with statistical potentials, forcefields and DFT.

@rartino
Copy link
Contributor

rartino commented Jun 28, 2022

@merkys

I am a bit skeptical regarding the structures relationships with calculations though. In TCOD we have theoretically calculated structures from journal publications, but usually machine-readable metadata related to actual calculations is scarce (but reported in human-readable publications). Thus if calculations entries become mandatory for theoretical structures, we would not be able to return much meaningful data in them.

I don't think anyone proposed to make them mandatory for theoretical entries? Just that if you have data or metadata related to the calculation itself for, say, a calculation that started from one structure, and resulted into a couple of output structures, that data would better belong under the calculations endpoint than being stored alongside the structures. For example, if you want to provide details on cutoffs, k-point grids, DFT functionals, etc.

@merkys
Copy link
Member Author

merkys commented Jun 28, 2022

I don't think anyone proposed to make them mandatory for theoretical entries?

No. I mistakenly assumed this was the suggested solution for telling experimental structures from theoretical.

Just that if you have data or metadata related to the calculation itself for, say, a calculation that started from one structure, and resulted into a couple of output structures, that data would better belong under the calculations endpoint than being stored alongside the structures. For example, if you want to provide details on cutoffs, k-point grids, DFT functionals, etc.

Agree.

@BobHanson
Copy link

BobHanson commented Oct 11, 2022 via email

@ml-evs
Copy link
Member

ml-evs commented Feb 17, 2023

Revisiting this ahead of today's meeting... we are currently very open to (and in fact should be encouraging!) huge databases of ML structures (not even necessarily ab initio/MM refined) suddenly swamping out all of our multi-provider queries.

I still think the lowest friction way of adding this would be with specific structure_feature tags that could be eventually be promoted to standalone properties (in say, v2...). Some concepts that we've discussed that could be added as tags:

  • experimental: atomic coordinates arise directly from an experimental technique
  • asdfasdfsaf/computed?: calculation models a known experimental structure (somehow the opposite of hypothetical, which should be the default, but cannot think of a good name)
  • stable: a database-local assessment of thermodynamic stability -- strictly I would say this should be the exact structures that form the convex hull that other structures are assessed against (bundling this in here feels like cheating but we are really making no progress on getting people to add their stability data to OPTIMADE when they have it). (I still want this but we can discuss in another issue)

This would render our default structure with structure_features = [] as purely hypothetical, "unstable" and not experimentally derived.

Potential nasty side effects:

  • databases that do not control what gets put into them (very roughly speaking), e.g., NOMAD, MaterialsCloud etc. would be "stuck" at this default level without extra work (e.g., structure matching, a la NOMAD encylopedia).
  • additional work needed by those who do serve data that goes beyond the "default" above, e.g., every COD entry gets set to experimental, every MP entry that has an associated ICSD code becomes computed (or whatever).
  • we slightly trample on the current definition of structure_features which is purely an implementation detail of: which exotic OPTIMADE features does this structure require to be conveyed through the OPTIMADE structure type. This definition would have to be significantly broadened to accommodate the above, though personally I think the field name is vague enough that this wouldn't be too much of a headache.

I really think that this is the biggest added-value change we could make to OPTIMADE right now, and its something we need to do if we want to scale out to new databases beyond our existing community

@merkys
Copy link
Member Author

merkys commented Feb 17, 2023

Thanks for reviving this issue. Admittedly, structure_features approach has its appeal, but side effects seem way too serious to me. Having a property taking values experimental or theoretical with "do not assume anything" as a default has less drawbacks (I think). Then there is the whole spectrum in between, but I do not think we can tackle them anytime soon.

@rartino
Copy link
Contributor

rartino commented Feb 17, 2023

I agree with that we should rush this feature.

I'd rather not place it in structure_features, since the idea was that if a client sees anything in there that it doesn't recognize, it should not assume it understand how to represent the structure. This info is not that. But, why not just under a new name?

We can do a list the same way as structure_features works, e.g., structure_classes, structure_types, structure_tags? On the other hand, experiment vs computed (refined, processed?) seems mutually exclusive, so isn't it better to just represent this info as a single string? We define what we have now, but remain open to additional strings in the future.

So, here is a concrete proposal:


Field name: structure_origin is a string that SHOULD take one of the following values:

  • experimental [as defined by @ml-evs] all the structural information (coordinates, elements, assemblies, etc.) is a direct and unmodified representation of the outcome of an experimental technique for structure determination.
  • processed: the structural information originates from experimental data, but has been further processed in such a way that the resulting structure still is recognizable as the experimental one it was based on. For example, structures that have undergone relaxation using ab-inito calculations with the experimental structure as the starting point are meant to be placed in this category. Structures that have substituted one or more elements (but otherwise kept the experimental coordinates the same) are not meant to be in this category. Ultimately there is a level of subjectivity in this definition which is to be decided by the database provider.
  • predicted: the structural information has no direct connection to the outcome of an experimental technique on an existing material, but has undergone theoretical processing that motivates the database provider to suggest it as a candidate for a synthesizable structure. This category is intended for, e.g., theoretically invented structures relaxed using ab-inito calculations and found to end up close to the convex hull of stability, or structures generated out of an AI model demonstrated to have a reasonable predictive power. This category definition has a rather high level of subjectivity which has to be decided by the database provider.
  • hypothetical: the structure comes with no guarantee of having any relationship to synthesizable structures or structures that can be found in nature. This category is encouraged for, e.g., randomly placed atoms used as a starting point for further processing or the outcomes of AI models for which the predictive power is not deemed sufficient for predicted.

If the field is missing or equal to an empty string or null, it means no guarantees are given with respect to the origin of the structure (which can be interpreted as similar to hypothetical).

Database-specific strings using a database provider prefix (e.g., _exmpl_experimental_at_extreme_pressure MAY be used but are strongly discouraged.


However, writing the above up, I realize that some kind of info on whether the above classification refers to structures existing at NTP or other conditions would also be useful here. If the reason you want this is because you want to filter on "reasonable structures" to use for your AI model, I'm not sure you want to include someones database of 1M structures relaxed to the convex hull but at one billion bar and 10000 K.

@merkys
Copy link
Member Author

merkys commented Feb 17, 2023

I agree with @rartino's proposal, but would change the following:

If the field is missing or equal to an empty string or null, it means no guarantees are given with respect to the origin of the structure (which can be interpreted as similar to hypothetical).

into the following:

If the field is missing or equal to an empty string or null, it means no guarantees are given with respect to the origin of the structure.

Some structures in the COD have so little accompanying metadata that no one can really tell to which class such structures belong. I think "do not assume anything" is a reasonable default value.

@JPBergsma
Copy link
Contributor

I think we should see this point as part of the larger issue, that we have not standardized anything yet about the methods used to generate the structure. So I think we should think about how we want to include such information in general. (We do not need to discuss which properties should be included. That could be addressed in a later issue.)

@rartino
Copy link
Contributor

rartino commented Feb 17, 2023

I agree with @rartino's proposal, but would change the following:

If the field is missing or equal to an empty string or null, it means no guarantees are given with respect to the origin of the structure (which can be interpreted as similar to hypothetical).

into the following:

If the field is missing or equal to an empty string or null, it means no guarantees are given with respect to the origin of the structure.

The question here, I think, is: should hypothetical only mean there are "no guarantees" (which applies also to COD), or that one knows that the coordinates are "not experimental"? In the write-up I meant the former, but I think @merkys has a point that it would be nice to be able to specifically say "I don't know" and have that be the default.

So, how about we add unknown to the list of options, and say that no value/null is the same as unknown? (And maybe slightly update the hypothetical definition to say that unknown is the better choice if it is not known which category it is.)

@ml-evs
Copy link
Member

ml-evs commented Feb 17, 2023

Thanks @rartino @sauliusg @merkys and @JPBergsma for the extended discussion after today's meeting. I will try my best to summarize the issues with adding this and the options we discussed.

  1. We add this field as a new property, say structure_origin that follows an enum of some TBD tags experimental, predicted etc. and make queries on it mandatory with a default value of unknown.
    • The concept of this field has broad support, modulo implementation details.
    • Adding this field in this way would mean that no v1.1 implementation could support these queries without updating to 1.2 (due to the specific way unknown fields are meant to be handled), hence delaying the usefulness of this field.
  2. We overload the structure_features field an extend its definition to include non-overlapping tags, the omission of which (the default) would indicate unknown structure source.
    • This is admittedly bad API design for many reasons, and would need a change in the definition of the structure features field that may break it for future use cases. The question would be whether it is worth sacrificing our design to pragmatically enable these queries to both v1.1 and v1.2 implementations, from the second v1.2 is released.
  3. We add explicit support for filtering on properties_that_might_not_be_in_this_version via a prefix, say _future_ or _optimade_next or _optimade_v1.2 that means that the field from 1) can still be sent to v1.1 URLs and receive a sensible default response (null)
    • This introduces some complexity to both users and implementers: users have to know to use this weird additional syntax when querying what otherwise appears to be a normal OPTIMADE property and v1.2 implementations have to know to strip this prefix from the field and check if it is known to them.
    • A client with full support for version negotiation could get around this anyway, though no generic multi-provider, multi-version client exists yet that knows which fields are allowed between different versions (afaik).
  4. We finally disentangle prefixes from providers and serve an OPTIMADE-consortia namespace that provides this property, in a slight modification to 3).
    • In this way, it remains possible to query v1.1 implementations with this field, but it becomes hard for it to re-enter into the proper specification.
    • This would require significant coordination effort for setting up a namespace, best practices etc. and words to that effect in the specification, HOWEVER this is absolutely something we will want to do in the case of other properties, so this effort won't be "wasted".
    • In fact, this could become the "proper" mechanism for adding new fields between minor versions of OPTIMADE. The prefix could become _optimade_2.0 and the assurance could be made that any fields in the _optimade_2.0 prefix will be added at the next version, so supporting them "early" is not wasted effort. (This was not discussed but came to me whilst writing this up). I don't know how hard a requirement we would want to make this on ourselves -- it should be treated in the same way we would treat removing/changing a field between major versions (i.e., we're allowed to, but good design would suggest that we keep some similarity at least).
    • The backwards-compatible queries leading to the right answers nature of this query only works if the default value of the field carries some semantic information. If we standardized e.g., band gap in this optimade_2.0 prefix (not that we ever would), then a query for _optimade_2.0_bandgap > 0 will return anything with a null band gap in a non-supporting databases, because we want structure_origin = null to mean something, this could be a useful mechanism.

I hope that adequately summarizes our discussion. I am personally leaning towards 4 and would be happy to help set this up.

@merkys
Copy link
Member Author

merkys commented Feb 20, 2023

Having read @ml-evs proposal, option 1 still sounds the most elegant to me. I get that a new specification version (or an RC) has to be released and implemented to support such structure_origin property, but I would argue that so would the rest of the options.

According to the specification, implementations can serve preview (or RC) versions of OPTIMADE:

API versions that are published with a suffix, e.g., :val:-rc<number> to indicate a release candidate version, SHOULD be served on versioned base URLs without this suffix.

So once structure_origin reaches RC, clients may immediately start querying /v1.2 with this new feature. If implementations return HTTP error 553 Version Not Supported, a client drops the structure_origin and fallbacks to /v1.1 or /v1.

@rartino
Copy link
Contributor

rartino commented Feb 20, 2023

After thinking a bit more on this since that discussion: I sympathize with what @ml-evs wants to do by introducing this feature without having older servers return errors, and going forward the situation with forwards-compatibility needs to be improved. However, for this one change, I still come down on the side that we should be able to accept the breakage:

  • We still don't have that many separate servers implementing OPTIMADE in production yet.
  • The primary use case we are trying to address here is: "Give me all structures that "make sense"; exclude arbitrary made-up structures, intermediate steps, etc."
  • For this use case, if you send a structure_origin = ... query to 10 servers and 5 of them return an error, you can just re-send the same query to the 5 that recognize structure_origin and get the data you would have received with @ml-evs's approach (2).
  • Now, the user can stop here (with the same results as in @ml-evs (2)), but will probably find it to be better to go through the databases giving errors and make a manual decision on the database level about whether to include their structures or not. (For example, the user probably still wants to include all structures from COD, given how their database is created from primarily experimental structures, even if this provider do not yet implement structure_origin).

Now, for the general issue: going forward, I think the most direct solution would be to just demote the error for unrecognized properties without prefixes into a mandatory warning. How about the following changes to section 3.9?:


3.9 Handling unknown property names

When an implementation receives a request with a query filter that refers to an unknown property name it is handled differently depending on the database-specific prefix:

  • If the property name has no database-specific prefix, or if it has the database-specific prefix that belongs to the implementation itself, the error 400 Bad Request MUST be returned with a message indicating the offending property name.
  • If the property name has a database-specific prefix that does not belong to the implementation itself, it The implementation MUST NOT treat this as an error, but rather MUST evaluate the query with the property treated as unknown, i.e., comparisons are evaluated as if the property has the value null.
    • If the property does not have a database-specific prefix the implementation MUST issue a warning about the unrecognized property name.
    • Furthermore, If the property has a database-specific prefix that the implementation does not recognize the prefix at all, it SHOULD return a warning that indicates that the property has been handled as unknown.
    • On the other hand, If the property has a recognized prefix, the prefix is recognized, i.e., as belonging to a known database provider, the implementation SHOULD NOT issue a warning but MAY issue diagnostic output with a note explaining how the request was handled.

@ml-evs
Copy link
Member

ml-evs commented Feb 21, 2023

I think I have been talked off the proverbial ledge here, I must admit that writing out options 1-4 above did seem like overkill. My only concern now with adjusting the unknown property names directly to avoid breakages is that results returned from a unknown property filter will be "wrong" and will require all clients to strictly check warnings (somehow this feels like more of a breaking change now than queries not working across versions 😅). I was fine with this in the specific case of adding a field with a default value that can be sensibly applied backwards (although still concede this is poor API design), but think we should exercise some caution...

@merkys merkys linked a pull request Jun 8, 2023 that will close this issue
@merkys
Copy link
Member Author

merkys commented Dec 7, 2023

We may use crystal structure prediction outputs as a possible use case, for example this paper by Reilly et al.. Table 2 of the paper gives a variety of methods used. Structure aggregators like TCOD will have to fit them into the categories proposed for OPTIMADE.

@ml-evs
Copy link
Member

ml-evs commented Dec 7, 2023

Indeed, CSP was my main initial motivation too. Hopefully we can revisit #455 in 2024...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic/property-standardization The specification of the precise data representation of properties and entries type/proposal Proposal for addition/removal of features. May need broad discussion to reach consensus.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants