Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add gdb format molecule reading #291

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open

Conversation

loriab
Copy link
Collaborator

@loriab loriab commented Jun 11, 2022

See description and purpose and proposed tests at #288. This is a separate implementation of the parsing.

Status

  • Code base linted
  • Ready to go

@codecov
Copy link

codecov bot commented Jun 11, 2022

Codecov Report

Merging #291 (829dd44) into master (60934f3) will increase coverage by 0.02%.
The diff coverage is 98.11%.

@coltonbh
Copy link
Collaborator

coltonbh commented Jun 12, 2022

For completeness adding my commends from Slack here...

We've lost the desired high-level API from my implementation in #288. i.e., Molecule.from_file("path_to_gdb.xyz"). I think we should maintain that API before merging in the PR 🙂

One can test the implementation against the whole dataset using the following gdb.py script python gdb.py path_to_unzipped_gdb_dataset. One should see only 68 failures on the files that contain funny float values (these files should fail).

from qcelemental.models import Molecule
from qcelemental.exceptions import MoleculeFormatError
from pathlib import Path


if __name__ == "__main__":
    import sys

    path = Path(sys.argv[1])
    failures = []
    for i, p in enumerate(path.iterdir()):
        full_path = p.resolve()
        try:
            Molecule.from_file(full_path)
        except MoleculeFormatError as e:
            print(full_path.name)
            failures.append(full_path.name)
        if i % 1000 == 0:
            print(i)

    print(failures)
    print(f"Total Failures: {len(failures)}")

The changed test implementation from

unprocessed, processed = _filter_xyz(string, strict=True)

to

final = qcelemental.molparse.from_string(string, return_processed=False, dtype="gdb")

is what makes this PR still "pass" the tests I wrote, but we've lost the Molecule.from_file API.

@loriab
Copy link
Collaborator Author

loriab commented Jun 12, 2022

Sorry, saw this after Slack, so I'll repeat here :-)

A near-high-level API should work now as Molecule.from_file("path_to_gdb.xyz", dtype="gdb")

For anyone following along, the key difference is that this PR parses gdb as a separate dtype, whereas #288 parses gdb under "xyz" dtype with some regex relaxations. Maybe that's ok, as gdb is a correct superset of xyz, but I do worry about less guidance/errors being returned to the user. e.g., the below could pass, when it probably wasn't the user's intended geometry.

3

O 8   0 0 0
H 1   1 0 0
H 1   0 1 0

@coltonbh
Copy link
Collaborator

Cool! Thanks for the update :)

I worry about the alternative case, i.e., end users see all the .xyz files in gdb, try to open them with Molecule.from_file("path_to_gdb.xyz") and see failures. Will they know to look for a dtype=gdb kwarg? My suspicion is that the only people who would are you and I ;) Is there a way we can make this obvious for end users? I'd prefer all .xyz files to "just work" for end users and under the hood we handle the nuances.

Is there a reason you prefer requiring the extra kwarg that I'm missing? :) Possible to fall back on a gdb parsing scheme if others fail rather than requiring the end user to declare it explicitly?

@coltonbh
Copy link
Collaborator

Also, I still see many more failures with the current code. Better than before, but I get 613 failures on the gdb dataset instead of 68. I don't think we've quite hit the general case yet with this code :)

@coltonbh
Copy link
Collaborator

Ideal scenario for this PR:

  1. Molecule.from_file("path_to_gdb.xyz") parses correctly. The gdb file are .xyz files, I think it's easiest for users if they parse as such without additional keywords. I suspect most probably wouldn't find the dtype='gdb' keyword or think to look for a different datatype other than .xyz since that is the file extension. Perhaps we could nest the logic under the "parse_xyz" stuff as another xyz filetype to try?
  2. The code works on all gdb.xyz files except the 68 with strange float values in their geometries.

Can you help me to understand this scenario you are concerned about?

3

O 8   0 0 0
H 1   1 0 0
H 1   0 1 0

Would this be a format we expect users to encounter in regular use or more a hypothetical that concerns you?

Thanks for your time on this. I'm happy to help finish the implementation if you can point out the concerns you have with #288 that may have undesired behavior. I found the xyz parsing implementation a bit tricky to insert new logic into so I understand that this seemingly simple feature may be taking a lot of your time. Thanks for the back-and-forth to get something that works while respecting the code design you'd like to keep in qcel :)

@lgtm-com
Copy link
Contributor

lgtm-com bot commented Jun 13, 2022

This pull request introduces 1 alert and fixes 1 when merging 508817f into cb04079 - view on LGTM.com

new alerts:

  • 1 for Unused local variable

fixed alerts:

  • 1 for Unused local variable

@lgtm-com
Copy link
Contributor

lgtm-com bot commented Jun 13, 2022

This pull request introduces 1 alert and fixes 1 when merging 829dd44 into cb04079 - view on LGTM.com

new alerts:

  • 1 for Unused local variable

fixed alerts:

  • 1 for Unused local variable

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants