Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interpreting malformed chemical formulas as substances #202

Closed
bertiewooster opened this issue Jan 17, 2022 · 32 comments · Fixed by #205
Closed

Interpreting malformed chemical formulas as substances #202

bertiewooster opened this issue Jan 17, 2022 · 32 comments · Fixed by #205

Comments

@bertiewooster
Copy link

bertiewooster commented Jan 17, 2022

Hi, thanks for the ChemPy package! As I was using Substance.from_formula, I noticed that one could easily make a mistake in the formula string and not be notified about it. For example, if I wanted to add methanol, if I don't make any mistakes it works great:

>>> from chempy import Substance
>>> methanol = Substance.from_formula('CH3OH')
>>> methanol.name
'CH3OH'
>>> methanol.composition
{6: 1, 1: 4, 8: 1}

If I make a minor mistake, for example forgetting to capitalize the first H for hydrogen, ChemPy gives no warning and simply stops at the last valid element. So the formula string is interpreted as simply C for carbon, even though the name is the entire supplied formula string Ch3OH:

>>> c = Substance.from_formula('Ch3OH')
>>> c.name
'Ch3OH'
>>> c.composition
{6: 1}

Is there something like a strict flag (option) to throw a warning or even exception (error) if the entire formula string cannot be interpreted as a substance? If the entire formula string cannot be interpreted as a substance, should the substance's name have only the part that was interpreted as a substance?

@bjodah
Copy link
Owner

bjodah commented Jan 17, 2022

Hi @bertiewooster, glad to hear you're finding ChemPy useful, and thanks for this report!
No unfortunately, there's currently no such strict flag. But I agree that we really should have one, and perhaps the default should even be "strict" mode. And I agree that any partial parsing of substance name as a formula should not be accepted by default.

Currently my time available for ChemPy is somewhat limited, but we have a few interested parties so hopefully someone will find the time to implement this soon (you're of course welcome to submit a patch yourself if you wish/can but just reporting issues is already much appreciated!).

@bertiewooster
Copy link
Author

bertiewooster commented Jan 22, 2022

Hi @bjodah thanks for the response! If no one else expresses interest in taking this on, I can try.

Regardless, I thought it'd be useful to establish the parameters of a chemical formula that ChemPy is designed to interpret. Is there an existing standard for chemical formulas (I couldn't find one easily from IUPAC)? If not, here's my understanding:

ElaqElbq(ElcqEldq)r...Elnqsc

where

  • Ela, Elb,...Eln is the symbol (for example, C or Ni) for an element in the periodic table
    • The first letter of an element symbol is always capitalized
    • No subsequent letter of an element symbol is capitalized
  • (ElcqEldq)r indicates that the quantity of the component inside parentheses, ElcqEldq, is r
>>> f = Substance.from_formula('Fe[CN]6-3')
>>> f.composition
{26: 1, 0: -3}
>>> fb = Substance.from_formula('Fe{CN}6-3')
>>> fb.composition
{26: 1, 0: -3}

whereas parentheses are currently supported, for example

>>> ferricyanide = Substance.from_formula('Fe(CN)6-3')
>>> ferricyanide.composition
{26: 1, 6: 6, 7: 6, 0: -3}
  • q is the quantity of of that element in the substance
    • q can be any floating-point number greater than zero, for example 0.003, 1, 2, 2.3, etc.
  • optional: sc are the sign and charge, expressing that the substance is not electrically neutral
    • s is either + (for positive, fewer electrons than the neutral substance) or - (negative, more electrons than the neutral substance)
    • c is the charge, which must be an integer--or are there any instances of non-integer charge?

Expanded formulas

Should we support any of the following?

  • Dot notation for other molecules in a crystalline framework, for example CuSO4⋅5H2O? Supporting the dot per se would be easy--we'd just ignore it--but allowing a quantity (5) before a molecule (H2O) would be an addition to the framework and parsing.
  • @ At notation for one substance encaged within another, for example Li@C60 indicates Li in caged within C60. I suggest yes because we would simply ignore the @ symbol.
    • Are there other symbols for related concepts?

@jeremyagray
Copy link
Collaborator

I think I actually have a version of the formula parser (or at least the grammar) in chempy that handles dot notation and brackets for complexes. As I recall it actually broke a fair amount of the current tests and time constraints stopped my inquiries in that direction.

While writing parsers for other situations I have used pyparsing's parse exceptions to signal errors so it can be done (easily). I suppose the questions are what features are desirable in the formula parser (dot notation for hydrated crystals, square brackets for complexes, @ caged symbol, etc.), what behavior is desired for the strict parser flag, and whether these are separate issues.

If I can get some clarity on these issues, I will patch the relevant bits of my parser onto the current version and investigate further.

@bertiewooster
Copy link
Author

To summarize what I believe are the questions, and offer my suggestions:

  • [square brackets] and {curly braces}: Accept, with same meaning as (parentheses)
  • charge: must be an integer
  • @ at notation for one caged symbol: Accept. (Simply ignore the @ symbol.)
  • ⋅ dot notation for hydrated crystals: Accept; if this proves difficult to parse, could leave it as a future improvement
  • strict parser flag: Do not add; instead, always parse strictly, and throw an easy-to-understand exception if formula string does not meet criteria.
    • Alternatively, for the sake of backward compatibility, we could have a strict flag which invokes the new parsing rules; if strict is not specified, the old, lax parsing rules would be used. My guess is that any user code that relies on the old parsing scheme to not reject malformed formulas is quite likely misinterpreting the formula. So I'm not in favor of this alternative because I think the stricter parsing is more likely to solve unnoticed problems rather than create problems, and I think users should by default get the new parsing which is more accurate and more likely to find errors in formulas.

@jeremyagray
Copy link
Collaborator

  • [square brackets] and {curly braces}: Accept, with same meaning as (parentheses)

I have this in a local parser, but I intended to treat brackets as complexes. PyParsing lets you treat matches individually, so I think in general you could treat brackets as grouping symbols and also have the complex information available. Is there a special meaning for braces or are they simply a grouping alternative?

  • charge: must be an integer

This is part of the grammar already.

  • @ at notation for one caged symbol

This should be easy to add.

  • ⋅ dot notation for hydrated crystals

This is already in the parser and tests. I can't remember if it is documented.

  • strict parser flag

I think this is easily doable. I believe it would reduce the number of errors as well because it does seem unexpected that anything else would run after a failed parse.

Let me work on my local branch some and I'll report back with some details.

@bertiewooster
Copy link
Author

bertiewooster commented Jan 26, 2022

Is there a special meaning for braces or are they simply a grouping alternative?

IUPAC provisional recommendation (see IR-2.2.1 at bottom of page 3) lists braces as a grouping alternative, to be used alternately with parentheses to help avoid confusion:

In formulae, square brackets, parentheses and braces are used in the following nesting order: [], [( )], [{( )}], [({( )})], [{({( )})}], etc. Square brackets are normally used only to enclose formulae; parentheses and braces are then used alternately (see also Sections IR- 4.2.3 and IR-9.2.3.2).

By the way, how would treating square brackets as complexes (rather than simply a grouping alternative) affect the resulting substance?

@jeremyagray
Copy link
Collaborator

IUPAC provisional recommendation (see IR-2.2.1 at bottom of page 3) lists braces as a grouping alternative, to be used alternately with parentheses to help avoid confusion:

I assumed that was the case. But nomenclature is such a wide and specialized area, sometimes it's hard to keep track of it all.

By the way, how would treating square brackets as complexes (rather than simply a grouping alternative) affect the resulting substance?

Hopefully, it wouldn't. I was looking more to add information as opposed to treating them differently. Things that depend on elemental composition should never know the difference. But, if something is in square brackets, then code could (optionally) treat it as a coordination complex (as opposed to a polyatomic ion or other group) and do appropriate stuff with it like deducing oxidation states, naming, decomposing the group into ligands, etc. I think my goal at the time was to use the parser with some additional nomenclature code to be able to interconvert formulas and names.

@bertiewooster
Copy link
Author

But nomenclature is such a wide and specialized area, sometimes it's hard to keep track of it all.

I haven't been able to find an approved IUPAC standard for formula nomenclature. The closest thing I found is this story, Nomenclature Notes:

The principal enclosing marks are parentheses, ( ), sometimes simply called brackets or round brackets, curly brackets, { }, also often called braces in U.S. texts, and square brackets, [ ]. These are the principal marks used, and though some others may be found in specialized literature, these are those used by chemists. However, the order in which they are used depends upon the specific context.

@jeremyagray
Copy link
Collaborator

I haven't been able to find an approved IUPAC standard for formula nomenclature.

I'm sure there is one since they have a standard for everything, but I haven't looked. I have looked at the one for organic nomenclature and it makes an excellent doorstop.

Regardless, I've done a little work and hacked in everything except the @ and strict parsing (and decimal subscripts from another issue), with some tests and some other things. All the grouping symbols are working, but the brackets don't yet have the metadata indicating they are complexes and the braces had to have a hack in the latex formatter to work.

I'll try to finish the remaining two (or three) in the next few days in the above referenced branch. I have a feeling that avoiding formatter hacks and adding metadata will require more extensive use of PyParsing's parse actions and some (hopefully internal only) reformatting of the parsing functions.

jeremyagray added a commit to jeremyagray/chempy that referenced this issue Jan 27, 2022
Parse species like `Li@C60` indicated one species trapped by another,
such as a lithium atom inside a buckminsterfullerene.

Closes: closes bjodah#202
Signed-off-by: Jeremy A Gray <jeremy.a.gray@gmail.com>
@jeremyagray
Copy link
Collaborator

My branch should have everything provisionally implemented now, so have a look. I think it should have more tests probably some more reorganization of the tests.

I really think the choice of operators bears some thought and discussion. In this branch, I'm using .. for hydrates, . for decimal subscripts/coefficients, [*']+ for primes/excited states, and a prefix @ for caged species. It makes more sense to me to use * as the hydrate operator than as an excited state indicator, but that's a typical use in typeset equations and it's already in one of the existing tests. I didn't see an easy way to use . as both the hydrate indicator and the decimal point without it only working for water. I suppose we could go Romance language style and use a comma for the decimal point as well. I really just like the * as the text version of the hydrate dot.

These changes don't affect most of the other tests, but accomodating rational subscripts did require some additional sympy.nsimplify() calls in balance_stoichiometry() in chemistry.py.

@bertiewooster
Copy link
Author

@jeremyagray Thanks. Could you create a (draft) pull request to highlight the changes?

@bertiewooster
Copy link
Author

How about using a raised dot for hydrates?

@bertiewooster
Copy link
Author

Suggest a couple additional tests that the following parse to substances correctly, unless I missed you already adding similar tests:

  • Ag[NH3]+2
    • Checking ] sign charge
  • Fe[CN]6-3
    • Checking ] quantity sign charge
  • [Ni(NH3)6]+2[PtCl6]-2
    • I believe people sometimes keep the charge on each coordination compound even when paired to form a neutral compound

@jeremyagray
Copy link
Collaborator

How about using a raised dot for hydrates?

I was thinking in ASCII only. I know everything is in utf-8 now, but the parser and tests are still written as if they expect ASCII. I don't know when that paradigm shift should be made. If the switch is made to parse utf-8, then it doesn't matter as the subscripts, coefficients, and charges are all different characters and become easy to parse.

  • Ag[NH3]+2: Added; passing.
  • Fe[CN]6-3: Already passing.
  • [Ni(NH3)6]+2[PtCl6]-2: This fails by design, raising ValueError; try with Na+Cl- (which I added as a negative test). This happens in the charge string post processing and is something that could be changed, but since the correct charges can be inferred from the correct formula I feel this would also need validation to ensure the user added the correct charges. This may also have an overall charge that would have to be parsed.

@bertiewooster
Copy link
Author

Both of those decisions seem fine:

  • using .. for hydrate to stick with ASCII rather than needing utf-8
  • disallowing []+[]-

Just to check, there's no currently-supported symbol for hydrates, correct?

@jeremyagray
Copy link
Collaborator

Just to check, there's no currently-supported symbol for hydrates, correct?

Right now it is .. I don't know if it's documented or not, but it's built into the parser post-processing functions and it's tested in the tests. I forget it's there all the time because it's not explicitly in the parser grammar.

@bertiewooster
Copy link
Author

bertiewooster commented Jan 30, 2022

While it would be nice to keep the hydrate symbol as the period . it seems like that could make parsing ambiguous, for example 'X3.6H2O' could mean X3.6H2O or X3•6H2O. So that makes sense to use two periods .. as the hydrate symbol.

How about tests to verify that capitalization is being taken into account, for example

  • NI = nitrogen iodine
  • Ni = nickel

And test that a charged caged item is valid, for example Li+@C60 = Li+ @ C60.

@jeremyagray
Copy link
Collaborator

I checked the NI/Ni examples and they passed without modification. PyParsing is case-sensitive by default so it handles cases like that.

As for the latter example, the parser will not handle that as it is currently constructed since the actual parser is only handling formula groups and their subscripts and helper functions are processing the rest in _get_charge() and friends. So even something like Li+Cl fails as it tries to convert Cl to an integer and Li+Cl- throws and exception for both signs being present. Even once I coded around the former case, the positive charge appeared in the composition and broke tests there. So, trying to parse Li+@C60 resulted in trying to convert @C60 to an integer, working around and parsing Li+Cl fails the composition, Li+Cl- fails with both signs. (Li@C60)+ works as a workaround with the current parser and isn't too ambiguous.

To actually allow flexibility like this, the parser itself needs to generate a useful data structure that holds all this information and the helper functions need to just query this data structure and not do any text processing at all. That way, parsing Li+@C60 returns something like

{
  "composition": {0: 1, 3: 1, 6: 60},
  "charge": 1,
  "cage": True,
  "cations": ("Li+",),
  "terms": ("Li+", "@C60",),
  ...
}

instead of ("Li", "+@C60") and then trying to grab the charges from each item. But this would be a wide ranging, breaking change because everything that depends on reactions or formulas depends on the parser. Once you have a good data structure to represent a term (I'm thinking a class with methods or dict with helper functions), the terms can be organized in nested lists much like how the current parser is recursively finding terms in a formula.

@bertiewooster
Copy link
Author

Thanks for all your work on this as I throw edge cases at you :-)

(Li@C60)+ works as a workaround with the current parser and isn't too ambiguous.

That works for me--I think it's best to merge in (and perhaps release?) your much-improved parser with its current capabilities, and leave that Li+@C60 edge case for the future because addressing it would cause a wide ranging, breaking change.

Where should we document the symbols such as .. and @ -- in the README.md file, or elsewhere?

Other than that, unless you have any more little improvements in the works, recommend you create a pull request so it can be reviewed and merged.

@jeremyagray
Copy link
Collaborator

The documentation is in the README, the example Jupyter notebooks, and in autogenerated API documentation. I have to look at the autogen setup to see where and how to document these changes for the API docs, but some documentation will need to be added to the README as well.

The only other things to do are to decide what to do about the break caused by changing from . to .. as the hydrate indicator (major or minor version bump, split and wait, etc.) and to run a full set of tests with the numerical tests included, but I need to do some work on a separate issue first to sort that.

@bertiewooster
Copy link
Author

Would you like me to try updating the Parsing formulae section of README.rst?

@bjodah
Copy link
Owner

bjodah commented Jan 31, 2022

Great work guys! And thank you @jeremyagray for making this happen. I agree with your direction, and I too think we should mint a new release once we've merged the changes to the parser. I'm fine with breaking backwards compatibility if needed (unless we can do a deprecation cycle with very little extra work).

@jeremyagray
Copy link
Collaborator

Sure, @bertiewooster, just let me get things together and push out the pull request, probably later this week or this weekend.

As for the breaking change on the hydrate operator, it would probably be nice to the users to warn them about the change. I'll look at it some more to see if there is a less hacky way to integrate rational subscripts that avoids the breaking change.

@bertiewooster
Copy link
Author

The breaking change is that . used to be the hydrate operator, but we're changing it to .. correct?

The parser did not previously support rational subscripts, for example Al0.02Cr0.5, correct? (I can't get the released version of ChemPy to correctly parse something like that.) The parser will support rational subscripts after this change, correct? So for example Substance.from_formula('Al0.02Cr0.5').composition should be {13: 0.02, 24: 0.5}? Do we have tests for that?

@jeremyagray
Copy link
Collaborator

Yes, this does add rational subscripts, with some tests, and changes the operators as you describe. This was requested back in issue #176. But I am still trying to avoid the new operator so as to not break anything.

I've got my installation issues sorted and have found some more parsing induced test failures in some skipped tests in test_chemistry.py that I need to investigate, so my PR is pending that fix.

@jeremyagray
Copy link
Collaborator

I've finally got the PR finished for those interested. I found no good way to avoid using the new hydrate operator without presupposing some knowledge of the formula and structure so that will still be a breaking change. There are alternatives, such as requiring unicode input, that would vastly simplify parsing.

I did need to make a few more modifications to the parser internals, but I don't think there are any external changes other than the intended ones and the added tests seem to support this conclusion as far as they go.

@bertiewooster
Copy link
Author

@bjodah I'm happy to work on updating the README for the new . hydrate .. and caged @. Does the README use doctests?

Side note: I was thinking of updating the format to demonstrate what ChemPy produces to clarify a new user doesn't need to create a dictionary, for example instead of

>>> ferricyanide.composition == {0: -3, 26: 1, 6: 6, 7: 6}  # 0 for charge
True

using

>>> ferricyanide.composition
>>> # Composition dictionary keys: 0 for charge, positive integers for atomic numbers
>>> # Composition dictionary values: Quantities of charge or atoms
{0: -3, 26: 1, 6: 6, 7: 6}

@bjodah
Copy link
Owner

bjodah commented Mar 29, 2022

Hi @bertiewooster, sorry for the delay. Sure those updates look great!
Yes, I believe theres a doctest for the README, but if it's causing troubles we can always disable it.

@bertiewooster
Copy link
Author

bertiewooster commented Mar 29, 2022

I have not tried the doctest yet. How do I run it?

@bjodah
Copy link
Owner

bjodah commented Mar 29, 2022

Let's see if I remember:

$ git grep doctest
...
scripts/run_tests.sh:MPLBACKEND=Agg ${PYTHON:-python3} -m doctest README.rst

So python -m doctest README.rst might be enough?

@bertiewooster
Copy link
Author

@bjodah That doctest command ran as it should! I asked partly because one can't run doctest directly in SymPy, which is the only other project I've used doctest in.

I got all the doctests to pass except the known issue that @jeremyagray is coding a fix for:

    Substance.from_formula("Ca2.832Fe0.6285Mg5.395(CO3)6").composition
    Traceback (most recent call last):
    ...
    KeyError: '.'

So as soon as he pushes that fix, I should be able to pull it and get all the README.rst doctests passing.

P.S. Minor annoyance: For some reason, when I run python -m doctest README.rst, git claims that an image file, examples/kinetics.png, got changed. (I undid the "changes" so they won't be committed to the repo.) Wondering if this is a known issue, or maybe for some reason now occurring because I added the following lines to .gitignore?

.vscode/
.vs/

(.vscode/ is to exclude config files for Visual Studio Code, which I'm using. I figured I might as well add .vs/ in case any contributor uses Visual Studio Pro or perhaps Community Edition).

@jeremyagray
Copy link
Collaborator

I can't replicate the image getting changed on running the readme doctests, so it may be environment dependent.

I didn't know those tests were there and not running with the rest. When I did run them on my branch for #209, the redox tests around line 225 had to be reversed to work. I think that this reversal doesn't really matter since there is no implication that the reaction is spontaneous as produced since we're not using any equilibrium constant or reaction potential data to produce the reaction. If this is the case, we probably need to document that the reaction is balanced but not necessarily spontaneous as written.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants