Add SMILES property #368

Open
JPBergsma opened this issue Jul 5, 2021 · 59 comments · May be fixed by #392 or #436
Labels
type/proposal Proposal for addition/removal of features. May need broad discussion to reach consensus.

Comments

@JPBergsma
Contributor

Do we want to allow SMILES strings in the chemical_formula_descriptive field?
The SMILES notation for molecular formulas uses '#' and '$' to indicate triple and quadruple bonds,
the characters '/' and '\' to indicate whether bonds are in the cis or trans orientation, and '@' and '@@' to differentiate enantiomers. Finally, ring numbers with more than one digit have to be preceded by a '%' sign.
It therefore seems reasonable to me to add these to the allowed characters for the chemical_formula_descriptive field.

Or do you think we should add a separate SMILES field instead?

@merkys
Member

merkys commented Jul 6, 2021

Or do you think we should add a separate SMILES field instead?

I would suggest so. chemical_formula_descriptive has its own purpose and semantics, and they should not change.

@rartino
Contributor

rartino commented Jul 6, 2021

@JPBergsma the topic of SMILES has come up a few times and a standardization for SMILES use in OPTIMADE would likely be very useful. If you are familiar with SMILES usage, could you perhaps describe a few "search scenarios" of SMILES data? E.g., what would you be searching for? How do you envision such a search could be expressed, etc.?

@JPBergsma
Contributor Author

Sorry, I did not read the specification for chemical_formula_descriptive well enough the first time and I overlooked that it is already defined by the IUPAC's Nomenclature. I, therefore, had already closed the issue but unfortunately, I did not have sufficient privileges to remove it.

It would indeed be better to add a separate field for the SMILES string, although we could also think about other ways to add topological information, as SMILES strings cannot be compared directly.

@JPBergsma JPBergsma reopened this Jul 6, 2021
@rartino rartino changed the title Add '@','%','/','\','#' and '$' to allowed characters in the chemical_formula_descriptive field Add property for SMILES that allows '@','%','/','\','#' and '$' Jul 6, 2021
@rartino
Contributor

rartino commented Jul 6, 2021

(I took the liberty of editing your issue title to match - feel free to adjust it)

@JPBergsma
Contributor Author

First of all, defining the topology of a molecule allows you to distinguish between molecules with the same elemental composition but a different structure. Perhaps the current IUPAC definition is also able to do so, but via the link in optimade.rst https://www.qmul.ac.uk/sbcs/iupac/bibliog/blue.html I only found information about how to name chemical compounds and not how to write the structural formula. (IUPAC did define the InChI format which does contain the molecular structure, but that is different from the example fields in OPTIMADE.)

Ideally, having the structural data of a molecule would also allow you to find molecules with a mostly similar structure but some small differences. For example, a structure where a hydrogen atom has been replaced by a methyl group or a bromine atom has been replaced by a chlorine atom. While this would be quite useful, it may be difficult to implement such a search.

I am not sure whether SMILES is the best option for this. It has the advantage that the strings are relatively human-readable but multiple SMILES strings can encode for the same molecule. So you first have to convert the string to a structure before you know whether they are identical, or you have to agree on which algorithm to use to generate SMILES strings.

There are other ways to store the structure of a molecule, like InChI, and another option would be to use a connectivity matrix.

@ml-evs ml-evs added the type/proposal Proposal for addition/removal of features. May need broad discussion to reach consensus. label Jul 8, 2021
@JPBergsma
Contributor Author

During OMDI I talked with someone from the Ocelot database.
Their database has crystal structures of organic molecules.
They use SMILES strings to search for and select structures, as one structure can have many names and a simple structural formula is not descriptive enough. So I think there would definitely be a use for a SMILES field within OPTIMADE.
In the original SMILES notation, multiple strings can encode the same molecule. Therefore they first convert the string to a structure and then convert it back to a SMILES string with a known algorithm, so the SMILES strings are guaranteed to be the same. They also match chemical groups; for example, when I searched for benzene, they also returned molecules containing a benzene ring. They have a git repository, so perhaps we could reuse some of their code to implement this in the OPTIMADE Python tools.

@JPBergsma JPBergsma reopened this Oct 17, 2021
@merkys
Member

merkys commented Oct 18, 2021

I support standardizing a separate property for SMILES. However, there are some issues related both to its definition and usability.

  1. There is a bunch of competing SMILES specifications. I like the OpenSMILES as it is quite well-defined, albeit somewhat limited and unmaintained. Competing specifications mean that different software suites usually support one or another specification, but usually without clearly stating which one.
  2. The same molecule can yield different SMILES. Canonicalization algorithms exist, but again there are many, without a prevalent one.
  3. SMILES matching is not string matching. While identical SMILES almost always mean identical molecules, this is pretty much the only comparison one can do with plain strings. There are tools like Mychem which implement substructure search using SMILES strings in MySQL, but the general SMILES comparison usually boils down to subgraph isomorphism. Fingerprinting techniques are a viable alternative.
  4. SMILES are directed mostly at organics. Therefore, compounds beyond organics are not trivial to represent, resulting in the need for additional conventions on representing them. We have contributed to an article about that, Quirós et al. 2018.

InChI is an alternative representation, however, it does not solve the matching problem. Moreover, it has licensing issues impeding its convenient usage.

@rartino
Contributor

rartino commented Oct 18, 2021

There is also the question of how we handle this type of extension into string-like complex properties in the OPTIMADE filter language (and otherwise in our type system). Far back I wrote up my thoughts on this here: #157 (comment)

But, in short, we probably need to have some way to tell a normal string and a smiles string apart since they will have different comparison semantics.

@JPBergsma
Contributor Author

@merkys

There is a bunch of competing SMILES specifications. I like the OpenSMILES as it is quite well-defined, albeit somewhat limited and unmaintained. Competing specifications mean that different software suites usually support one or another specification, but usually without clearly stating which one.

1 The OpenSMILES standard is definitely an option. It seems practically the same as the SMILES definition on the Daylight website, so if necessary we could switch. Ideally, we would also use the SMARTS extension, which is specifically focused on querying structures, although it is not included in the OpenSMILES standard.

The same molecule can yield different SMILES. Canonicalization algorithms exist, but again there are many, without a prevalent one.

2 Either the server would have to canonicalize the input from the client or we would have to agree on a canonicalization algorithm that should be used by all clients and servers.
At the moment I prefer canonicalization by the server as this does not put canonicalization requirements on the client and the server would need to do some processing anyway to handle queries using SMARTS.
Internally the server may also store structure information in a different format than SMILES so it would need to do a conversion anyway.
Another question would be whether we want to canonicalize the output.

SMILES matching is not string matching. While identical SMILES almost always mean identical molecules, this is pretty much the only comparison one can do with plain strings. There are tools like Mychem which implement substructure search using SMILES strings in MySQL, but the general SMILES comparison usually boils down to subgraph isomorphism. Fingerprinting techniques are a viable alternative.

3 I think it will indeed be necessary to generate a molecular graph. Although a preselection could be made using fingerprinting, for example, by looking at the atom composition of the searched fragment, or by comparing which common structural elements are present.
This way the full structures would only need to be compared for a relatively small number of structures.
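
The preselection idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `heavy_atom_counts` and `may_contain` are made up here, the regular expression only covers organic-subset and simple bracket atoms, and hydrogens are ignored. A candidate that passes this check would still need a real graph comparison.

```python
import re
from collections import Counter

# Bracket atoms such as [OH2], [Na+], [13CH4], [nH]; then the unbracketed
# organic subset (two-letter symbols first); then aromatic lowercase atoms.
_ATOM = re.compile(
    r"\[\d*([A-Z][a-z]?|[a-z]{1,2})[^\]]*\]|(Cl|Br|[BCNOPSFI])|([bcnops])"
)


def heavy_atom_counts(smiles: str) -> Counter:
    """Count non-hydrogen element symbols that appear in a SMILES string."""
    counts = Counter()
    for match in _ATOM.finditer(smiles):
        symbol = match.group(1) or match.group(2) or match.group(3)
        if symbol != "H":
            # Aromatic atoms are written in lowercase; fold them to the element.
            counts[symbol.capitalize()] += 1
    return counts


def may_contain(candidate: str, fragment: str) -> bool:
    """Cheap necessary (not sufficient) test for 'candidate contains fragment'."""
    have = heavy_atom_counts(candidate)
    need = heavy_atom_counts(fragment)
    return all(have[element] >= n for element, n in need.items())


# Toluene has enough carbons to contain a benzene ring; ethanol does not.
print(may_contain("Cc1ccccc1", "c1ccccc1"))
print(may_contain("CCO", "c1ccccc1"))
```

Only the structures for which `may_contain` returns True would then be passed on to the expensive subgraph comparison.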

SMILES are directed mostly at organics. Therefore, compounds beyond organics are not trivial to represent, resulting in the need for additional conventions on representing them. We have contributed to an article about that, Quirós et al. 2018.

4 At first I was thinking about limiting the requirement for SMILES structures to organic compounds, but after reading your article we could perhaps expand the SMILES definition to a broader range of compounds. In that case, we should formalize the method further than is currently described in the article. There may still be some arbitrariness in describing the atomistic structures though, as some arbitrary cut-off point has to be chosen for defining a bond.

InChI is an alternative representation, however, it does not solve the matching problem. Moreover, it has licensing issues impeding its convenient usage.

5 It seems that the discussion about the InChI licensing issue you refer to is still ongoing, so perhaps it will be resolved. I do not think using InChI for our database would go against the intention of the InChI Trust.

Standard InChI has the limitation that tautomers have the same InChI code. In a laboratory setting, it is usually not possible to separate the tautomers so this would not be a problem. But in computational chemistry, the timescales are usually so short that no conversion takes place. There is an extension for this so I think we should implement it if we would want to use InChI. That way each InChI should belong to exactly one structure.
Personally, I find InChI less intuitive and human-readable than SMILES, so simply typing in an InChI code would be more difficult than with SMILES.

A final option would be to use a molecular graph for searching.

@rartino

Unless we decide on a canonicalization algorithm, the SMILES field should indeed not have the string type as a direct comparison of uncanonicalized SMILES strings is not possible.

@merkys
Member

merkys commented Oct 22, 2021

(For brevity, I am not citing and explicitly responding to @JPBergsma sentences with which I completely agree)

Ideally, we would also use the SMARTS extension, which is specifically focused on querying structures, although it is not included in the OpenSmiles standard.

This can already be implemented by using the custom extension endpoint mechanism.

2 Either the server would have to canonicalize the input from the client or we would have to agree on a canonicalization algorithm that should be used by all clients and servers. At the moment I prefer canonicalization by the server as this does not put canonicalization requirements on the client

Yes, this makes sense.

and the server would need to do some processing anyway to handle queries using SMARTS.

Not necessarily. The server, for example, may just pass user input to Open Babel which either reconstructs molecular graphs or does fingerprint matching.

Another question would be whether we want to canonicalize the output.

Preferably yes.

4 At first I was thinking about limiting the requirement for SMILES structures to organic compounds, but after reading your article we could perhaps expand the SMILES definition to a broader range of compounds. In that case, we should formalize the method further than is currently described in the article. There may still be some arbitrariness in describing the atomistic structures though, as some arbitrary cut-off point has to be chosen for defining a bond.

This would be nice, but again, all providers should use conventions as similar as possible.

5 It seems that the discussion about the InChI licensing issue you refer to is still ongoing, so perhaps it will be resolved. I do not think using InChI for our database would go against the intention of the InChI Trust.

Strictly speaking, this is true only if providers manage to use InChI library without modifying its code.

@JPBergsma
Contributor Author

Ideally, we would also use the SMARTS extension, which is specifically focused on querying structures, although it is not included in the OpenSmiles standard.

This can already be implemented by using the custom extension endpoint mechanism.

I am not sure what you mean by the custom extension endpoint mechanism. There is a custom extension endpoint in the OPTIMADE standard, but I do not see why that would be relevant here. You would want to use SMARTS/SMILES to find particular structures. Creating a separate endpoint to do this seems cumbersome. You would also want to standardize the way this works across multiple databases, which would be difficult if each database created its own custom endpoint.

the server would need to do some processing anyway to handle queries using SMARTS.
Not necessarily. The server, for example, may just pass user input to Open Babel which either reconstructs molecular graphs or does fingerprint matching.

In that case the SMILES string would still be processed on the server (as in the physical computer that deals with the request).

@JPBergsma JPBergsma changed the title Add property for SMILES that allows '@','%','/','\','#' and '$' Add SMILES property Nov 19, 2021
@merkys
Member

merkys commented Nov 26, 2021

Ideally, we would also use the SMARTS extension, which is specifically focused on querying structures, although it is not included in the OpenSmiles standard.

This can already be implemented by using the custom extension endpoint mechanism.

I am not sure what you mean by the custom extension endpoint mechanism. There is a custom extension endpoint in the OPTIMADE standard, but I do not see why that would be relevant here. You would want to use SMARTS/SMILES to find particular structures. Creating a separate endpoint to do this seems cumbersome. You would also want to standardize the way this works across multiple databases, which would be difficult if each database created its own custom endpoint.

Sorry, I misparsed the term "extension".

I believe the SMARTS were originally described by Daylight. I am not sure about the state of other parallel SMARTS specifications, though.

the server would need to do some processing anyway to handle queries using SMARTS.
Not necessarily. The server, for example, may just pass user input to Open Babel which either reconstructs molecular graphs or does fingerprint matching.

In that case the SMILES string would still be processed on the server (as in the physical computer that deals with the request).

Yes, that is true.

@merkys
Member

merkys commented Nov 26, 2021

Looking back at my discussion checklist, I think we at least agree on using OpenSMILES. However, other issues still need more discussion. My suggestions to speed up the introduction of SMILES property would be the following:

  1. Server-provided SMILES need not be canonical. Since there are many canonicalization methods and we probably cannot select one from them all, servers should just provide any SMILES representation of a structure. Then it is up to the client to canonicalize them or not.
  2. Comparisons of SMILES with other SMILES or strings must not be supported, as well as querying. We may introduce this support later.

This would make the SMILES property a descriptive one. Thus, the client will be able to retrieve SMILES values alongside other structural data, but would not be able to query on them.

For dealing with inorganics I could propose adhering to Quirós et al. 2018 (disclaimer: I am one of the authors), but this would not be convenient for providers using their own conventions, or producing SMILES by Open Babel or some other software.

@JPBergsma
Contributor Author

I agree on point 1, that databases are allowed to use their own canonicalization method.

Part of the reason to implement this though is to make it easier to search for organic molecules, as these can have the same chemical formula. For that to work, it should be possible to search for SMILES strings.
This should not be that difficult to implement. The database provider can turn the SMILES string of the query into a structure and turn it back into a SMILES string with the canonicalization method of choice. The generated SMILES string can then be used for a simple string comparison with the SMILES fields in the database. Searching for fragments can still be added later on if necessary.

Quirós et al. 2018 could indeed be useful for describing metal complexes and the like, insofar as they are not covered by the OpenSMILES standard.

@rartino
Contributor

rartino commented Dec 3, 2021

Aren't we landing in that we should just standardize a SMILES field to be a normal OPTIMADE String which is specified to contain an OpenSMILES representation of the implementer's choice (much like chemical_formula_descriptive, which had similar normalization issues with competing standards), and then put the requirement on MUST or SHOULD level that all partial string matching filter operators are supported?

(I realize it was said above that it cannot be a String because uncanonicalized SMILES "cannot be compared", but, the same issue technically holds for chemical_formula_descriptive and we were ok with that...)

The database provider can turn the SMILES string of the query into a structure and turn it back into a smiles string with the canonicalization method of choice.

I'm not sure why you mean such conversions would be needed (?), but if so, then this query support can only be on MAY level since it goes far beyond what can be handled efficiently by a typical query layer.

@merkys
Member

merkys commented Dec 3, 2021

I agree that we can define SMILES as a regular OPTIMADE String with all string handling operations. Thus for the time being "O" != "[OH2]" is true as these strings are not equal, despite molecules with SMILES of O and [OH2] being actually the same.

So it seems we have consensus on most of the SMILES-related issues. Let us prepare a PR then? I have opened #392 from the consensus (IMO) we achieved here.

merkys added a commit to merkys/OPTIMADE that referenced this issue Dec 3, 2021
@merkys merkys linked a pull request Dec 3, 2021 that will close this issue
@JPBergsma
Contributor Author

If we define the SMILES field as a normal OPTIMADE string we should define the canonicalization method that should be used with OPTIMADE. Otherwise, it does not make sense to put the requirement on MUST or SHOULD level for the (partial) string matching filter operators, as one molecule can have multiple different SMILES strings.

One of the main reasons to implement the SMILES notation is to enable searching on molecular structures.
Without this, sharing data on structures composed of individual molecules would be inefficient. More structures would need to be returned than needed, since you can only select on the chemical formula and many molecules can have the same chemical formula.
I can imagine that for people who want to set up a database with molecular structures, not being able to search for molecules could be a reason to not use OPTIMADE.

I'm not sure why you mean such conversions would be needed (?), but if so, then this query support can only be on MAY level since it goes far beyond what can be handled efficiently by a typical query layer.

The conversion would be needed if we do not agree on a canonicalization method. If you start generating the SMILES string from different atoms within a molecule, you would get a valid SMILES string for each starting atom, but they would all be different.
Because of this, you can not do a simple string comparison to see if two SMILES strings refer to the same molecule.
So you would first need to generate the structure from the SMILES string and then turn it back into a SMILES string with the same method that has been used to generate the SMILES strings in the database.

There are already Python packages that can convert SMILES strings into structures and back. RDKit can do this, and it also guarantees the created SMILES string is canonicalized, i.e. you will always get the same string regardless of the SMILES string you originally used.

A simple way to make your structures with SMILES strings searchable is to convert your SMILES into structures and then back into SMILES strings with RDKit. This way, you can be sure all strings have the same canonicalization method.
If you do the same for any SMILES string that is entered as a search term, it is guaranteed that two structures are the same if the SMILES strings match and are different when they do not match.
This means a simple string comparison, which most database backends should be able to do quickly, is sufficient to find identical molecules.
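
The round trip described above could look roughly like the sketch below. The class `CanonicalSmilesIndex` and the toy canonicalizer are invented for illustration; a real provider would plug in its toolkit of choice (with RDKit, for instance, something like `Chem.MolToSmiles(Chem.MolFromSmiles(smiles))`). The lookup table here only stands in for that step so the example runs without extra packages.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class CanonicalSmilesIndex:
    """Maps canonical SMILES strings to entry IDs (hypothetical helper)."""

    def __init__(self, canonicalize: Callable[[str], str]) -> None:
        self._canonicalize = canonicalize
        self._ids: Dict[str, List[str]] = defaultdict(list)

    def add(self, entry_id: str, smiles: str) -> None:
        # Stored values are canonicalized once, when the entry is added.
        self._ids[self._canonicalize(smiles)].append(entry_id)

    def find(self, query_smiles: str) -> List[str]:
        # The query goes through the same canonicalizer, so a plain
        # dictionary (string equality) lookup is enough afterwards.
        return list(self._ids.get(self._canonicalize(query_smiles), []))


# Toy stand-in for a real canonicalizer: a fixed table of equivalent strings.
_TOY_CANONICAL = {
    "O": "O", "[OH2]": "O",                      # water
    "CCO": "CCO", "OCC": "CCO", "C(O)C": "CCO",  # ethanol
}


def toy_canonicalize(smiles: str) -> str:
    return _TOY_CANONICAL[smiles]


index = CanonicalSmilesIndex(toy_canonicalize)
index.add("entry-1", "[OH2]")
index.add("entry-2", "OCC")

print(index.find("O"))      # water, written differently from the stored value
print(index.find("C(O)C"))  # ethanol, written differently from the stored value
```

The design choice is that only the provider needs to know which canonicalization method is in use; clients can send any valid SMILES string.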

One issue that we have not yet discussed is how we are going to handle structures with multiple molecules.
Within a normal SMILES string these molecules are separated by ".". This would, however, require partial string matching to find the molecules. I suspect that this is relatively inefficient for databases, so I think it would be better to implement this as a list.
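
As a rough illustration of the difference between the two options, the sketch below splits a multi-molecule SMILES string into a list. The function name `split_components` is made up, and the naive split is an assumption that breaks down in edge cases: as far as I understand OpenSMILES, ring-closure digits may pair up across a dot (e.g. "C1.C1"), in which case the parts are actually bonded.

```python
from typing import List


def split_components(smiles: str) -> List[str]:
    """Naively split a SMILES string into its dot-separated components."""
    return [part for part in smiles.split(".") if part]


components = split_components("O.[Na+].[Cl-]")
print(components)

# List membership is an exact match on one whole component ...
print("O" in components)
# ... whereas a substring test on the full string also hits the O of ethanol.
print("O" in "CCO.[Na+]")
print("O" in split_components("CCO.[Na+]"))
```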

@merkys
Member

merkys commented Dec 5, 2021

I agree that to implement reliable querying of exact structures we have to define a canonicalization method. This will most likely boil down to choosing a common software package to produce canonical SMILES for OPTIMADE output, be it RDKit, Open Babel or something else. In addition, if we want to support inorganics, all providers will have to select a common set of rules to describe them.

As for reliable partial molecular matching, IMO we will never get by with simple substring matching. Imagine for example patterns to match rings.

Here I would like to draw attention to the distinction between database querying and screening. The first one expects the database to perform entry selection, whereas the second one downloads the whole database and performs entry selection locally. I do not believe it is feasible to push all the providers to implement exact querying mechanisms. Thus IMO it is better to provide descriptive data in some common format and let the users perform the screening. With OPTIMADE provisions to include only specific fields in the response, downloads should not be too large.

Thus I very much would want to avoid forcing all the providers to use the same canonicalization method. I am afraid that instead of being a useful descriptive property, SMILES would be supported by only a few providers.
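
A minimal sketch of the screening workflow, for illustration only: the base URL is a placeholder, `smiles` is the property name proposed in this issue (not yet part of the specification), and the helper names `screening_url` and `screen` are invented. `response_fields` and `page_limit` are the existing OPTIMADE URL query parameters. No network request is made here; the downloaded entries are made-up values in the usual JSON:API shape.

```python
from typing import Callable, Iterable, List
from urllib.parse import urlencode


def screening_url(base_url: str, fields: Iterable[str], page_limit: int) -> str:
    """Build a structures request that asks only for the listed fields."""
    query = urlencode({"response_fields": ",".join(fields), "page_limit": page_limit})
    return f"{base_url}/structures?{query}"


def screen(entries: Iterable[dict], keep: Callable[[str], bool]) -> List[str]:
    """Return the IDs of entries whose SMILES value satisfies `keep`."""
    selected = []
    for entry in entries:
        smiles = entry.get("attributes", {}).get("smiles")
        if smiles is not None and keep(smiles):
            selected.append(entry["id"])
    return selected


url = screening_url("https://example.org/optimade/v1", ["smiles"], 100)
print(url)

# Entries as they might appear in the "data" list of a response (made up).
downloaded = [
    {"id": "s1", "attributes": {"smiles": "c1ccccc1"}},
    {"id": "s2", "attributes": {"smiles": "CC#N"}},
    {"id": "s3", "attributes": {"smiles": None}},
]

# Local selection: here simply "contains a triple bond"; a real client
# would more likely parse each string with a cheminformatics toolkit.
print(screen(downloaded, lambda s: "#" in s))
```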

@merkys
Member

merkys commented Dec 5, 2021

One issue that we have not yet discussed is how we are going to handle structures with multiple molecules. Within a normal Smiles string these molecules are separated by ".", This would however require partial string matching to find the molecules. I suspect that this is relatively inefficient for databases, so I think it would be better to implement this as a list.

Right. I would prefer sticking to string, not list because of how I imagine SMILES property to be used (screening instead of querying). In addition to that, the only list member comparison operator for string is equality (i.e., smiles HAS "O", would match water molecules). Others (CONTAINS, STARTS WITH, ENDS WITH) are not supported even on grammar level.

@JPBergsma
Contributor Author

As for reliable partial molecular matching, IMO we will never get by with simple substring matching. Imagine for example patterns to match rings.

Indeed, matching substructures is much more complicated and beyond the scope of PR#392.

Here I would like to draw attention to the distinction between database querying and screening. The first one expects the database to perform entry selection, whereas the second one downloads the whole database and performs entry selection locally. I do not believe it is feasible to push all the providers to implement exact querying mechanisms. Thus IMO it is better to provide descriptive data in some common format and let the users perform the screening. With OPTIMADE provisions to include only specific fields in the response, downloads should not be too large.

Screening would be less efficient for both the client and the server:
The database would have to send the SMILES strings of many structures to the client. (based on the elements in the SMILES string/molecule, some preselection can be made)
Then the client would have to convert all these SMILES strings to structures so that they can be compared with the molecular structure that the client is searching for. Once the SMILES strings that encode the desired molecule have been found, the client would have to send a query to the database for the records with these SMILES strings. And the database would then have to loop over all SMILES values to check which contain these SMILES strings, before returning the desired structures.
This takes much more computing time than the method I suggested.
I am therefore convinced that we should not force databases to use the screening method you described.

Right. I would prefer sticking to string, not list because of how I imagine SMILES property to be used (screening instead of querying). In addition to that, the only list member comparison operator for string is equality (i.e., smiles HAS "O", would match water molecules). Others (CONTAINS, STARTS WITH, ENDS WITH) are not supported even on grammar level.

There are not many useful substring queries you can do on SMILES strings. You could check whether triple and quadruple bonds or charges are present, but that's about it. So we would not lose that much by converting the field to a list.
And it would of course also be possible to expand the queryability of strings in a list, although that's best left for a different PR.

@merkys
Member

merkys commented Jan 21, 2022

@rartino

However, I have some reluctance to smiles SMARTS "<SMARTS>", because strictly speaking, SMARTS is a query language for 'structures', it isn't inherently connected to the smiles field. One could imagine a database that does not populate the smiles field but still can be queried with SMARTS.

Agree, this does not look elegant.

Furthermore, trying to think ahead, this issue is bound to come up again with other query languages; and I'm not sure we want to try to embed everything into our own language.

Completely agree.

So, maybe the most uncomplicated solution is to see this as an alternative "filter". I.e., your option (4) but maybe naming the parameter filter_smarts. We can then say that supporting multiple different filter-type URL parameters is OPTIONAL, but if supported the construct MUST be interpreted as the intersection of the filter results. This at least supports an outermost "AND" combination (since the outermost "OR" can be done as consecutive queries).

Agree with every word here!

So it seems we are arriving at these properties (all OPTIONAL):

  • smiles: string representation of structure contents, without an attempt to canonicalize; SHOULD support all string query features;
  • smiles_substructures: a list of strings representing sensible substructures in the structure (not yet sure how to express them in canonical manner - to be discussed).

Plus filter_smarts URL parameter to select chemical structures matching SMARTS. My worry that SMARTS is not strictly defined language remains.

How about this? Still we have some homework to do regarding smiles_substructures and filter_smarts. Of course we may as well ignore the possible ambiguity and leave it for the future.
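
The "intersection of the filter results" rule quoted above could be sketched on the server side as below. This is a hedged illustration, not a proposed implementation: `combine_filter_results` is a made-up name, `filter_smarts` is only the parameter discussed in this thread, and each filter-type parameter is assumed to have already been evaluated to a set of entry IDs (or None when the parameter was not supplied).

```python
from typing import Iterable, Optional, Set


def combine_filter_results(
    all_ids: Iterable[str],
    filter_ids: Optional[Set[str]] = None,
    smarts_ids: Optional[Set[str]] = None,
) -> Set[str]:
    """Intersect the results of every filter-type parameter that was given.

    A parameter that was not supplied (None) places no restriction.
    """
    result = set(all_ids)
    for ids in (filter_ids, smarts_ids):
        if ids is not None:
            result &= ids
    return result


everything = {"s1", "s2", "s3", "s4"}
matches_filter = {"s1", "s2"}         # e.g. result of the regular `filter`
matches_filter_smarts = {"s2", "s3"}  # e.g. result of the proposed `filter_smarts`

print(combine_filter_results(everything, matches_filter, matches_filter_smarts))
```

A backend that can only evaluate one of the two parameters natively can still support the other one as a separate pass and intersect afterwards.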

@merkys
Member

merkys commented Jan 21, 2022

In today's Web meeting @JPBergsma advocated for specific handling of string comparisons on smiles property: the provider may optionally canonicalize the value queried with = or CONTAINS operator before performing the actual string comparison. @rartino, @ml-evs and I advocated against specific handling of queried values, as none of the string-valued properties currently in the spec mandates/suggests any specific queried value treatment.

I would be happy to include filter_smarts in #392. Could the advocates for smiles_substructures property provide a description for it, if they believe it is not superseded by filter_smarts URL parameter?

@JPBergsma
Contributor Author

I think it would be best to create a separate issue/PR for the filter_smarts. Adding it to the current PR could again lead to a lot of new discussion, which would postpone the acceptance of the SMILES field.

I would prefer it if SMARTS queries could be a part of the filter, just as any other condition. This would give the user the maximum freedom to create queries. And it would also be more efficient for the server. In the proposal of @rartino some structures would be returned twice because OR could only be executed by doing two separate queries. A structure that fulfils both conditions would thus be returned twice.

@merkys
Member

merkys commented Jan 24, 2022

I think it would be best to create a separate issue/PR for the filter_smarts. Adding it to the current PR could again lead to a lot of new discussion, which would postpone the acceptance of the SMILES field.

Agree. It makes sense to have separate PRs. I will open a separate PR for filter_smarts.

I would prefer it if SMARTS queries could be a part of the filter, just as any other condition. This would give the user the maximum freedom to create queries. And it would also be more efficient for the server.

These are very valid arguments. However, @rartino's post and Friday's Web meeting convinced me otherwise.

In the proposal of @rartino some structures would be returned twice because OR could only be executed by doing two separate queries. A structure that fulfils both conditions would thus be returned twice.

I do not see this as a problem. All entries have IDs and they can be used to pick only the unique structures.
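
On the client side, picking the unique structures by ID after two consecutive queries (the outermost "OR" case) might look like this sketch. The function name `merge_unique` and the sample entries are made up for illustration.

```python
from typing import Dict, Iterable, List


def merge_unique(*result_sets: Iterable[dict]) -> List[dict]:
    """Merge several query results, keeping the first copy of each entry ID."""
    seen: Dict[str, dict] = {}
    for results in result_sets:
        for entry in results:
            seen.setdefault(entry["id"], entry)
    # Dictionaries keep insertion order, so the first-seen order is preserved.
    return list(seen.values())


smarts_results = [{"id": "s1"}, {"id": "s2"}]
filter_results = [{"id": "s2"}, {"id": "s3"}]

merged = merge_unique(smarts_results, filter_results)
print([entry["id"] for entry in merged])
```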

@rartino
Contributor

rartino commented Feb 10, 2022

I would prefer it if SMARTS queries could be a part of the filter, just as any other condition. This would give the user the maximum freedom to create queries. And it would also be more efficient for the server.

These are very valid arguments. However, @rartino's post and Friday's Web meeting convinced me otherwise.

I agree that allowing intermixed queries gives more flexibility in what queries can be expressed. I do not agree about the efficiency. Edit: (from the discussion below I see that I misunderstood the efficiency part.)

There are two different possible solutions on the backend for backends that can handle SMILES:

  1. A backend that actually supports efficiently executing an intermixed OPTIMADE + SMILES query. In that case, if we go with allowing intermixed queries, the query is just executed. If we instead go with filter_smarts the backend just combines the two queries to: <SMARTS QUERY> AND <OPTIMADE QUERY> (edited; originally written as OR). There is no loss in efficiency, but there is indeed loss in the flexibility of what can be expressed.

  2. A backend that cannot do a mixed query. In that case it has to do some form of unwrapping of the mixed query to handle it - and in most cases reject it. However, if we go with filter_smarts then it will be trivial to support at least a SMILES-only query on a backend that supports that.

Now, the question is - will (1) or (2) be the more common one? My somewhat unfounded suspicion is that there is no backend today that can efficiently do (1).
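For a backend of type (2), one rough way to still honour both parameters under the AND semantics is to run each query on its own engine and intersect the resulting IDs. The sketch below assumes exactly that; `run_filter` and `run_smarts` are hypothetical callables standing in for the backend's two query engines, and nothing here is taken from an actual implementation.

```python
def combined_query(run_filter, run_smarts, filter_expr=None, smarts_expr=None):
    """Return sorted IDs matching filter AND filter_smarts.

    A parameter that is None imposes no constraint.
    """
    id_sets = []
    if filter_expr is not None:
        id_sets.append(set(run_filter(filter_expr)))
    if smarts_expr is not None:
        id_sets.append(set(run_smarts(smarts_expr)))
    if not id_sets:
        # This sketch has no full ID list to fall back on.
        raise ValueError("at least one of filter or filter_smarts is required")
    return sorted(set.intersection(*id_sets))


# Toy stand-ins for the two engines; the expressions are not interpreted here.
def fake_filter(expr):
    return ["a", "b", "c"]


def fake_smarts(expr):
    return ["b", "c", "d"]


ids = combined_query(fake_filter, fake_smarts, "nelements=3", "c1ccccc1")
```

Intersecting on the server avoids any query-string unwrapping, but it materialises both ID sets, so it would only suit modestly sized result sets.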

In the proposal of @rartino some structures would be returned twice because OR could only be executed by doing two separate queries. A structure that fulfils both conditions would thus be returned twice.

No, if a backend that actually supports intermixed queries just translates a two filter-argument query into (<SMARTS QUERY>) OR (<OPTIMADE QUERY>) there is no loss in efficiency and no risk for duplicates. Edit: @JPBergsma was right here, then I agree with @merkys that you'd have to remove duplicates by ids.

@merkys
Member

merkys commented Feb 10, 2022

@rartino

There are two different possible solutions on the backend for backends that can handle SMILES:

  1. A backend that actually supports executing efficiently an intermixed OPTIMADE + SMILES query. In that case, if we go with allowing intermixed queries, the query is just executed. If we instead go with filter_smarts the backend just combines the two queries to: <SMARTS QUERY> OR <OPTIMADE QUERY>. There is no loss in efficiency, but there is indeed loss in the flexibility of what can be expressed.

Didn't you suggest before that given filter and filter_smarts they should be joined as <SMARTS QUERY> AND <OPTIMADE QUERY>, or am I just misunderstanding the cited paragraph?

  2. A backend that cannot do a mixed query. In that case it has to do some form of unwrapping of the mixed query to handle it - and in most cases reject it. However, if we go with filter_smarts then it will be trivial to support at least a SMILES-only query on a backend that supports that.

Now, the question is - will (1) or (2) be the more common one? My somewhat unfounded suspicion is that there is no backend today that can efficiently do (1).

I have the same feeling about (1).

By the way, I have opened PR #398 introducing filter_smarts.

@rartino
Contributor

rartino commented Feb 11, 2022

@merkys

Didn't you suggest before that given filter and filter_smarts they should be joined as <SMARTS QUERY> AND <OPTIMADE QUERY>, or am I just misunderstanding the cited paragraph?

Right, sorry - this was just a mistype - replace every "OR" in that reply with "AND" other than that I stand by what I said.

Edit: Eh - I see that my confusion runs deeper. @JPBergsma is right in that OR queries are less efficient in that you'd need to run two queries and will get duplicates; but @merkys is right that they can be matched by ID. Even so, I think this is a less important point than choosing the construct that the majority of backends can support without having to parse and unwrap the query string.

@ml-evs
Member

ml-evs commented Feb 14, 2022

Just saw this blogpost on Twitter and thought it would nicely complement the SMARTS discussion for those who don't already know what it is: Easy way to visualize SMARTS

@merkys
Member

merkys commented Feb 15, 2022

Citing myself:

Lastly, I wonder whether the syntax and interpretation of SMARTS is the same across these three packages [Indigo Toolkit, Open Babel and RDKit]. As I do not use SMARTS much, I cannot comment, however.

Saubern et al., 2011 present some evidence that SMARTS are understood differently by different cheminformatics packages, a fact I was almost sure about. Nevertheless, we will have to live with that - I hope the differences are minimal.

@BobHanson

I'll weigh in here.

Canonicalization. This is a much misunderstood term. "Canonicalization" is a local database strategy that can be used to do a rapid string search for a specific compound. Databases should not/do not require that a user use any particular canonicalization. Maybe they use OpenSmiles v. 2.0.5; maybe they use something else. It doesn't matter in the OPTIMADE context, because nobody cares what canonicalization the implementer used. What the database does is to convert the SMILES query to their specifically implemented canonicalization so that they can do a direct string match. That's all. Think of "canonicalization" as similar to "software name and version." It's just a given algorithm written at a specific time.

Point: Don't worry about canonicalization.

SMARTS. This is the real power of the SMILES business. The goal is to find substructures within a database - all the compounds that have six-membered aromatic rings with adjacent OH groups, for example a1aa(O[H])a(O[H])aa1. Some databases can do this; others cannot. Again, no relevance of canonicalization. This is a model search, not a string search. But not every database can do this sort of thing.

I agree completely that any SMILES needs to be its own property (perhaps as chemical_SMILES).

@merkys
Member

merkys commented Jun 2, 2022

@BobHanson Thanks for your opinion here! Could you please also check out the related PR #392 and maybe approve if you agree?

@ml-evs
Member

ml-evs commented Jul 4, 2022

Canonicalization. This is a much misunderstood term. "Canonicalization" is a local database strategy that can be used to do a rapid string search for a specific compound. Databases should not/do not require that a user use any particular canonicalization. Maybe they use OpenSmiles v. 2.0.5; maybe they use something else. It doesn't matter in the OPTIMADE context, because nobody cares what canonicalization the implementer used. What the database does is to convert the SMILES query to their specifically implemented canonicalization so that they can do a direct string match. That's all. Think of "canonicalization" as similar to "software name and version." It's just a given algorithm written at a specific time.

Point: Don't worry about canonicalization.

Bit of a tangent, but this same reasoning also worried me a bit about the way we standardized chemical formulae to be alphabetical in elements: should we really return zero results for ?filter=chemical_formula_reduced="SiO2", or should we ask databases to handle this themselves (provided the returned formulae are in the canonical order)? Does adding this feature suggest we need to rethink how we handle our string fields (chemical_formula_reduced being the only really important one, I would argue)?

@merkys
Member

merkys commented Jul 4, 2022

@ml-evs

Bit of a tangent, but this same reasoning also worried me a bit about the way we standardized chemical formulae to be alphabetical in elements: should we really return zero results for ?filter=chemical_formula_reduced="SiO2", or should we ask databases to handle this themselves (provided the returned formulae are in the canonical order)? Does adding this feature suggest we need to rethink how we handle our string fields (chemical_formula_reduced being the only really important one, I would argue)?

There is an ongoing discussion in #416 regarding symmetry properties which I believe may be related as well. I think that canonicalization may be delegated to providers, but if so, it has to be well-specified. Otherwise databases will differ in the way they do it, and we risk returning to pre-OPTIMADE state. Also, query canonicalization will put a strain on providers, not sure if negligible.

@BobHanson

BobHanson commented Jul 4, 2022 via email

@ml-evs
Member

ml-evs commented Jul 4, 2022

I think the real question is on query. MUST a repository be able to process a SMILES query in a meaningful noncanonical sense, or MAY it treat it as an exact string? Apologies if this has already been decided and I am repeating myself. Probably have missed a few clicks of this discussion. [1] http://opensmiles.org/opensmiles.html

Hi @BobHanson, I think this has been decided for SMILES, my comment is about whether we should adopt the same approach for simpler fields like chemical formula too.

@BobHanson

BobHanson commented Jul 4, 2022 via email

@rartino
Contributor

rartino commented Jul 6, 2022

this same reasoning also worried me a bit about the way we standardized chemical formulae to be alphabetical in elements: should we really return zero results for ?filter=chemical_formula_reduced="SiO2", or should we ask databases to handle this themselves (provided the returned formulae are in the canonical order)? Does adding this feature suggest we need to rethink how we handle our string fields (chemical_formula_reduced being the only really important one, I would argue)?

My take on this as an implementer is that I really want fields to have clear data types with strict comparison operator semantics. So, if chemical_formula is a string, then I want = to always mean normal string comparison - no: "but for this field equality also holds if the string has the same elements in a different order". Early drafts of OPTIMADE headed in this direction with each field describing its own operator rules, and IMO that leads to madness (and highly non-interoperable implementations).

Nevertheless, chemical formulas are obviously a major thing for us. So, if unordered element-wise comparison is useful, I see no issue with redefining chemical_formula_reduced to be a new chemical formula data type with its own clear comparison semantics, i.e., with = meaning unordered comparison over elements. But are < and > allowed, and what would they mean? Furthermore, if it is also used for chemical_formula_descriptive, we need to figure out how = works for constructs with parentheses, brackets, etc.
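To make the idea of "= meaning unordered comparison over elements" concrete, here is a small sketch under stated assumptions: formulas are flat sequences of element symbols with optional integer counts, and the helper names `parse_formula` and `formulas_equal` are invented for this example. It deliberately does not handle parentheses or brackets, which is exactly the open question for chemical_formula_descriptive.

```python
import re


def parse_formula(formula):
    """Parse a flat formula such as 'SiO2' into {'Si': 1, 'O': 2}."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts


def formulas_equal(a, b):
    """Unordered comparison: element order in the string is ignored."""
    return parse_formula(a) == parse_formula(b)


same = formulas_equal("SiO2", "O2Si")
different = formulas_equal("SiO2", "OSi")
```

Under such semantics equality is well defined, but it is not obvious what an ordering over the parsed dictionaries would mean, which is the concern about < and >.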

@merkys
Member

merkys commented Sep 23, 2022

Nevertheless, chemical formulas are obviously a major thing for us. So, if unordered element-wise comparison is useful, I see no issue with redefining chemical_formula_reduced to be a new chemical formula data type with its own clear comparison semantics, i.e., with = meaning unordered comparison over elements. But are < and > allowed, and what would they mean? Furthermore, if it is also used for chemical_formula_descriptive, we need to figure out how = works for constructs with parentheses, brackets, etc.

I agree with @rartino here, but I would really prefer keeping things simple. My main concern is that both defining the new semantics and implementing them (properly) would require much effort.

@BobHanson

BobHanson commented Sep 23, 2022 via email

@JPBergsma
Contributor Author

this same reasoning also worried me a bit about the way we standardized chemical formulae to be alphabetical in elements: should we really return zero results for ?filter=chemical_formula_reduced="SiO2", or should we ask databases to handle this themselves (provided the returned formulae are in the canonical order)? Does adding this feature suggest we need to rethink how we handle our string fields (chemical_formula_reduced being the only really important one, I would argue)?

I think we should return an error message in this case, stating that the value for the chemical elements should be in alphabetical order.

@BobHanson

BobHanson commented Oct 11, 2022 via email

@JPBergsma
Contributor Author

JPBergsma commented Oct 11, 2022

The code for checking whether each value is smaller than the next value is much simpler than that for a sorting algorithm. However, higher-level programming languages usually provide their own sorting routines, so in terms of programming work it may not make much difference.

I think it would be good if a server gives an error when a query is malformed. It is easy to make a typo, and this way we can at least in some cases inform the user about this.
This does not only apply to the chemical formula fields, but to all other fields as well.
So, it would be good for consistency to return an error when a user gives an invalid chemical formula. In this case, it may be easy and unambiguous to convert it to a valid query, but this is not possible for many of the other query fields.
If we do want to accept SiO2, we must in my opinion update the OPTIMADE specification.

P.S. If the user queries for SC, did they mean to search for CS or Sc?
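A minimal sketch of such an order check, assuming flat formulas made of element symbols and optional counts; the function name `check_alphabetical` and the error wording are made up here, and the sketch does not validate the rest of the formula syntax:

```python
import re


def check_alphabetical(formula):
    """Raise ValueError unless element symbols appear in strictly alphabetical order."""
    symbols = re.findall(r"[A-Z][a-z]?", formula)
    for current, following in zip(symbols, symbols[1:]):
        # Strict comparison also rejects a repeated element.
        if current >= following:
            raise ValueError(
                f"elements must be in alphabetical order: "
                f"{current!r} appears before {following!r} in {formula!r}"
            )
    return symbols


ordered = check_alphabetical("O2Si")

try:
    check_alphabetical("SC")
    sc_rejected = False
except ValueError:
    sc_rejected = True
```

With this check, SC is rejected (it parses as S followed by C), while CS and Sc both pass, so the server reports the ambiguous input instead of guessing.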

@merkys
Member

merkys commented Oct 11, 2022

I completely agree with @JPBergsma on reporting malformed queries as errors and on the possibility of relaxing the specification in the future. I would not hurry with the latter, though.

@rartino
Contributor

rartino commented Oct 11, 2022

it seems to me that "reduced" here is a fine qualifier that can specify "O2Si" and not "SiO2". No one will know what "reduced" means unless they read the information anyway, and that information can explicitly say, "for example, 'O2Si', not 'SiO2' " to make it absolutely clear what is required.

Maybe I misunderstand you, but as far as I know the word "reduced" in chemical formula is rather meant to refer to the following requirement (quoted from the specification): "For structures with no partial occupation, the chemical proportion numbers are the smallest integers for which the chemical proportion is exactly correct." I think this is a fairly standard use of "reduced"?

There is no word in the field name meant to state the need to order elements. That is "just" a part of the specification ("elements MUST be placed in alphabetical order, followed by their integer chemical proportion number.")

I think one ends up with rather different viewpoints here if one views OPTIMADE as "the user interface" for materials data queries, or "just" an underlying standardized communication protocol. I see no problem with, e.g., Jmol sorting the elements of a chemical formula on behalf of a user before sending the query to an OPTIMADE database.
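A client could do that normalization in a few lines. The sketch below is only an illustration under stated assumptions: it handles flat formulas without parentheses or partial occupation, the name `to_reduced_formula` is invented, and it follows the specification's rules of alphabetical element order, smallest integer proportions, and omitting a proportion number of 1.

```python
import re
from functools import reduce
from math import gcd


def to_reduced_formula(formula):
    """Turn a flat formula such as 'Si2O4' into the reduced, sorted form 'O2Si'."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        value = int(number or 1)
        if value == 0:
            raise ValueError(f"zero proportion for {symbol!r} in {formula!r}")
        counts[symbol] = counts.get(symbol, 0) + value
    divisor = reduce(gcd, counts.values())
    parts = []
    for symbol, value in sorted(counts.items()):
        proportion = value // divisor
        parts.append(symbol + (str(proportion) if proportion != 1 else ""))
    return "".join(parts)


query_value = to_reduced_formula("SiO2")
```

The resulting string can then be placed in a plain string-equality filter on chemical_formula_reduced, leaving the server-side comparison semantics untouched.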

Totally with the idea that a string is a string. (Except in the case of SMILES, which I would argue is a special case.) Machines will not care.

We probably need to pick up the discussion again in the SMILES thread on what semantics people who want to filter on SMILES want. I would argue that if they are different from strings, there should be a SMILES datatype.

@merkys
Member

merkys commented Oct 12, 2022

I think one ends up with rather different viewpoints here if one views OPTIMADE as "the user interface" for materials data queries, or "just" an underlying standardized communication protocol. I see no problem with, e.g., Jmol sorting the elements of a chemical formula on behalf of a user before sending the query to an OPTIMADE database.

Well put. I view OPTIMADE as "just" an underlying standardized communication protocol, hence my animosity towards some of the provider-intensive extensions.

We probably need to pick up the discussion again in the SMILES thread on what semantics people who want to filter on SMILES want. I would argue that if they are different from strings, there should be a SMILES datatype.

Maybe this is the right thread to do so? Or probably an even better way would be to put together an alternative to PR #392 defining SMILES as a datatype with its own query semantics. Admittedly, I am not a fan (the COD will not be able to handle such queries, and I cannot see a way to elegantly introduce a SMILES datatype at the grammar level, although we do have a timestamp datatype), but I can start a draft.

This was referenced Dec 6, 2022
@merkys
Member

merkys commented Dec 7, 2022

As promised, I have created PR #436 introducing a SMILES data type.
