Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

QS: Migrate to pkgutil for databases #985

Open
wants to merge 3 commits into
base: quantumstrand
Choose a base branch
from

Conversation

ooprathamm
Copy link
Contributor

Closes #759

@mr-tz
Copy link
Collaborator

mr-tz commented Apr 9, 2024

overall this appears to add complexity vs. reducing it. can you walk me through the benefits of this approach?

@ooprathamm
Copy link
Contributor Author

pkgutil.get_data is supposed to be better at getting a resource from a package.

Profiling load_databases:
image
image
(Similar perf across multiple runs though)

  • As QS is distributed as a bundled .exe pkgutil references directly to these embedded files, rather than relative to floss.qs.db path.
  • Directly reading bytes, rather than 2 step approach to find path and then read_bytes()

Also pyinstaller .spec guide mentions both the approaches without specifing any added benefits [using .spec]

@williballenthin could you provide us with your concerns for opening the issue .

@williballenthin
Copy link
Collaborator

williballenthin commented Apr 10, 2024

I understand pkgutil to expose an interface that Python packagers will implement when building single-file executables, like PyInstaller. The get_data routine lets you access supporting data (non-.py files) found within a module.

This doesn't really matter for packagers that extract a copy of the module to the file system during loading, like PyInstaller. For these packagers, we can use pathlib to access file system paths reletive to Python module contents. For PyInstaller, pkgutil is probably a small shim over file system operations, anyways.

However, there are packagers that don't write module content to disk. They might keep Python byte code in memory, having read it from a compressed resource or even from the network. In this case, we can't rely on using pathlib to access data files. We should use pkgutil instead.

But! We might decide that we don't want to support those packagers (which seem to be uncommon) and consider this out of scope. Or, we could follow the spec and be prepared to switch to something other than PyInstaller.

Does that make sense? @mr-tz @ooprathamm

Copy link
Collaborator

@williballenthin williballenthin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should also try to find one of these packagers and see if it works with the proposed changes. I suspect Nuitka might work for this.

@@ -51,13 +49,13 @@ def query(self, s: str) -> Set[str]:
return ret

@classmethod
def from_file(cls, path: pathlib.Path) -> "ExpertStringDatabase":
def from_file(cls, package: str, resource: str) -> "ExpertStringDatabase":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's keep from_file around since it's really useful for dev and testing. but let's add a from_pkgutil to address the feature request. and please factor out the common code into a sub routine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which routine will the get_default_databases use for now ? Or we intend to keep it as a future utility.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please prefer pkgutil

@ooprathamm
Copy link
Contributor Author

For simplicity, the code for using from_file is commented out.
Or we could leave something like BASE_PATH = pathlib.Path(floss.qs.db.__file__).parent constant to infer on own.

Copy link
Collaborator

@williballenthin williballenthin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these changes look great!

I still think we should try to find a package that uses the pkgutil interface. Do you have some time to find one @ooprathamm?

@ooprathamm
Copy link
Contributor Author

Current Scenario -
Pyinstaller - Uses a bootloader that uncompresses data files to a temp folder. [Limitation]

Other Executable Builder Packages-

  • py2exe - Windows Only, data files bundled alongside the .exe file in the directory structure
  • cx_Freeze - Doesn't embed data files to executables. Keeps data files in a directory structure
  • Pyoxidizer [unmaintained now] - pkgutil compatability mentioned for importing modules from memory, for loading data files encourages the use of importlib.resources, doesn't mention pkgutil

Viable Option -

@williballenthin
Copy link
Collaborator

first off, great research @ooprathamm!

next, the pyoxidizer documentation is outstanding. it does nice job of summarizing the options. and since it doesn't touch on pkgutil, I wonder if we're using that wrong? or is it the new interface alluded to at the end?

based on that article, what do you think the best strategy is, @ooprathamm? and, would you consider trying out pyoxidizer on FLOSS to see if it works with our code?

@ooprathamm
Copy link
Contributor Author

ooprathamm commented Apr 14, 2024

On further looking into PyOxidizer.
pkgutil.get_data() enhancement for in_memory rsrc still pending github

The incompatibility is further confirmed here. (comparing different utilities ).

For now, It might not be good to consider PyOxidizer

@williballenthin
Copy link
Collaborator

ok sounds good

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants