Add gdb format molecule reading #291

loriab · 2022-06-11T06:52:38Z

See description and purpose and proposed tests at #288. This is a separate implementation of the parsing.

Status

Code base linted
Ready to go

codecov · 2022-06-11T06:55:25Z

Codecov Report

Merging #291 (829dd44) into master (60934f3) will increase coverage by 0.02%.
The diff coverage is 98.11%.

coltonbh · 2022-06-12T02:42:46Z

For completeness, adding my comments from Slack here...

We've lost the desired high-level API from my implementation in #288, i.e., Molecule.from_file("path_to_gdb.xyz"). I think we should maintain that API before merging this PR 🙂

One can test the implementation against the whole dataset by saving the script below as gdb.py and running python gdb.py path_to_unzipped_gdb_dataset. One should see only 68 failures, all on files that contain funny float values (these files should fail).

from qcelemental.models import Molecule
from qcelemental.exceptions import MoleculeFormatError
from pathlib import Path


if __name__ == "__main__":
    import sys

    path = Path(sys.argv[1])
    failures = []
    for i, p in enumerate(path.iterdir()):
        full_path = p.resolve()
        try:
            Molecule.from_file(full_path)
        except MoleculeFormatError:
            print(full_path.name)
            failures.append(full_path.name)
        if i % 1000 == 0:
            print(i)

    print(failures)
    print(f"Total Failures: {len(failures)}")

The change in the test implementation from

unprocessed, processed = _filter_xyz(string, strict=True)

to

final = qcelemental.molparse.from_string(string, return_processed=False, dtype="gdb")

is what makes this PR still "pass" the tests I wrote, but we've lost the Molecule.from_file API.

loriab · 2022-06-12T05:11:55Z

Sorry, saw this after Slack, so I'll repeat here :-)

A near-high-level API should work now as Molecule.from_file("path_to_gdb.xyz", dtype="gdb")

For anyone following along, the key difference is that this PR parses gdb as a separate dtype, whereas #288 parses gdb under "xyz" dtype with some regex relaxations. Maybe that's ok, as gdb is a correct superset of xyz, but I do worry about less guidance/errors being returned to the user. e.g., the below could pass, when it probably wasn't the user's intended geometry.

3

O 8   0 0 0
H 1   1 0 0
H 1   0 1 0
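One reading of this concern can be sketched in a few lines. The two patterns below are hypothetical and written only for illustration; they are not the regexes used in qcelemental or in #288. The strict pattern accepts a symbol followed by exactly three numbers, while the relaxed one also tolerates a single trailing numeric column. Under the relaxed pattern, the first atom line above is accepted with the 8 read as the x coordinate, which is probably not what someone who wrote an atomic number there had in mind.

```python
import re

# Hypothetical patterns for illustration only; not taken from qcelemental.
NUM = r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?"
# Strict: symbol followed by exactly three numbers.
STRICT_XYZ = re.compile(rf"^\s*([A-Za-z]+)\s+({NUM})\s+({NUM})\s+({NUM})\s*$")
# Relaxed: same, but one extra trailing numeric column is tolerated and ignored.
RELAXED_XYZ = re.compile(
    rf"^\s*([A-Za-z]+)\s+({NUM})\s+({NUM})\s+({NUM})(?:\s+{NUM})?\s*$"
)


def parse_atom(line, pattern):
    """Return (symbol, (x, y, z)) if the line matches the pattern, else None."""
    match = pattern.match(line)
    if match is None:
        return None
    symbol, x, y, z = match.groups()
    return symbol, (float(x), float(y), float(z))


line = "O 8   0 0 0"
print("strict: ", parse_atom(line, STRICT_XYZ))   # rejected: five columns
print("relaxed:", parse_atom(line, RELAXED_XYZ))  # accepted, with x read as 8.0
```

A strict parser turns the extra column into an error the user can act on; a relaxed one silently shifts the columns.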

coltonbh · 2022-06-12T05:27:13Z

Cool! Thanks for the update :)

I worry about the alternative case, i.e., end users see all the .xyz files in gdb, try to open them with Molecule.from_file("path_to_gdb.xyz") and see failures. Will they know to look for a dtype="gdb" kwarg? My suspicion is that the only people who would are you and me ;) Is there a way we can make this obvious for end users? I'd prefer all .xyz files to "just work" for end users and under the hood we handle the nuances.

Is there a reason you prefer requiring the extra kwarg that I'm missing? :) Possible to fall back on a gdb parsing scheme if others fail rather than requiring the end user to declare it explicitly?
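A minimal sketch of the fallback idea, assuming nothing about qcelemental's internals: every name here (FormatError, parse_xyz, parse_gdb, parse_with_fallback) is hypothetical, and the toy parsers only count columns, with a fifth column standing in for whatever extra per-atom data a gdb file carries. The stricter parser is tried first, the looser one only if it fails, and the messages from every attempt are kept so the user still gets guidance when nothing matches.

```python
class FormatError(ValueError):
    """Raised by the toy parsers below when a line does not fit the format."""


def _parse_columns(text, ncols):
    # Toy parser: each line is "symbol x y z [extra]" with exactly ncols fields.
    atoms = []
    for lineno, line in enumerate(text.strip().splitlines(), start=1):
        fields = line.split()
        if len(fields) != ncols:
            raise FormatError(
                f"line {lineno}: expected {ncols} columns, got {len(fields)}"
            )
        try:
            xyz = tuple(float(f) for f in fields[1:4])
        except ValueError as exc:
            raise FormatError(f"line {lineno}: {exc}") from exc
        atoms.append((fields[0], xyz))
    return atoms


def parse_xyz(text):
    # Hypothetical strict format: symbol plus three coordinates.
    return _parse_columns(text, 4)


def parse_gdb(text):
    # Hypothetical looser format: one extra per-atom column, which is ignored.
    return _parse_columns(text, 5)


def parse_with_fallback(text, parsers=(("xyz", parse_xyz), ("gdb", parse_gdb))):
    """Try each parser in order; return (format_name, atoms) from the first success."""
    errors = []
    for name, parser in parsers:
        try:
            return name, parser(text)
        except FormatError as exc:
            errors.append(f"{name}: {exc}")
    # Nothing matched: report why each format was rejected.
    raise FormatError("no parser accepted the input (" + "; ".join(errors) + ")")
```

Returning the name of the format that matched lets the caller log or warn when the fallback was used, which goes some way toward the concern about users getting less feedback.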

coltonbh · 2022-06-12T05:30:07Z

Also, I still see many more failures with the current code. Better than before, but I get 613 failures on the gdb dataset instead of 68. I don't think we've quite hit the general case yet with this code :)

coltonbh · 2022-06-12T18:50:01Z

Ideal scenario for this PR:

Molecule.from_file("path_to_gdb.xyz") parses correctly. The gdb files are .xyz files, so I think it's easiest for users if they parse as such without additional keywords. I suspect most users wouldn't find the dtype="gdb" keyword or think to look for a datatype other than xyz, since that is the file extension. Perhaps we could nest the logic under the "parse_xyz" stuff as another xyz filetype to try?
The code works on all gdb.xyz files except the 68 with strange float values in their geometries.

Can you help me to understand this scenario you are concerned about?

3

O 8   0 0 0
H 1   1 0 0
H 1   0 1 0

Would this be a format we expect users to encounter in regular use or more a hypothetical that concerns you?

Thanks for your time on this. I'm happy to help finish the implementation if you can point out the concerns you have with #288 that may have undesired behavior. I found the xyz parsing implementation a bit tricky to insert new logic into so I understand that this seemingly simple feature may be taking a lot of your time. Thanks for the back-and-forth to get something that works while respecting the code design you'd like to keep in qcel :)

lgtm-com · 2022-06-13T05:28:40Z

This pull request introduces 1 alert and fixes 1 when merging 508817f into cb04079 - view on LGTM.com

new alerts:

1 for Unused local variable

fixed alerts:

1 for Unused local variable

lgtm-com · 2022-06-13T05:52:33Z

This pull request introduces 1 alert and fixes 1 when merging 829dd44 into cb04079 - view on LGTM.com

new alerts:

1 for Unused local variable

fixed alerts:

1 for Unused local variable

add gdb parsing

98e21e7

format and changelog

03a1380

allow gdb to work through Molecule

a72c726

more gdb and error tests

508817f

format

829dd44
