(issue 724): bypass pandas using pytables directly to work with HDF5 files #761

Open
alixdamman wants to merge 4 commits into master from 724_use_pytables_instead_of_pandas

Conversation

alixdamman (Collaborator)

Not yet finished, but I've started the PR to open the discussion.

@gdementen (Contributor)

I know you didn't ask for it, but FWIW, I will not have time to review this initial POC before the 26th of April. I need a clear head to do this and that's not the case these days... My initial gut feeling is that you are putting too much into the LArray HDF layer (e.g. sort_rows, sort_columns & fill_value). Those are only useful when we are loading HDF files not produced via larray, which is nice to have but is not what I expected this issue to be about. I just wanted a simple binary save of larray objects and a way to load them back that would be as fast as possible, and thus bypass pandas entirely. Loading arbitrary HDF objects (not produced via larray) is much more complex and I wouldn't tackle it now, unless you have a clear idea of what is needed.

@alixdamman (Collaborator, Author)

PytablesHDFStore is not yet implemented. The class is almost empty. My first objective is to get backward compatibility with previously produced HDF files. Currently, the PR shows a fully implemented PandasHDFStore which simply reproduces what we had before. Nothing else.

I started the PR to let you know that I'm working on it. I don't expect you to review this soon, but maybe, after the 26/4, you could take a first look at the class structure.

@gdementen (Contributor)

I know. I just wanted to clarify my vision. Not that you have to follow it, but I wanted to avoid any misunderstanding.

@alixdamman (Collaborator, Author)

Just to clarify my own view as well: my wish now is to implement a PytablesHDFStore with limited features, but also to implement a LArrayHDFStore that works in a similar way to Workbook (open_excel) and to make it accessible via the public API.
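For illustration, the existing Workbook/open_excel pattern referenced here, followed by what the proposed LArrayHDFStore could look like if it mirrors it (the HDF class does not exist yet; its name, constructor and behaviour below are assumptions, not the final API):

# the real larray Workbook API (requires Excel via xlwings)
from larray import open_excel, ndtest

with open_excel('output.xlsx', overwrite_file=True) as wb:
    wb['arr1'] = ndtest((2, 3))     # write an array to the sheet 'arr1'
    arr1 = wb['arr1'].load()        # read it back while the file is open

# hypothetical HDF analogue discussed above (not implemented yet):
# with LArrayHDFStore('output.h5', mode='w') as store:
#     store['arr1'] = ndtest((2, 3))
#     arr1 = store['arr1']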

@alixdamman alixdamman added this to the 0.31 milestone Apr 26, 2019
        return array.astype(str)
    try:
        array = np.asarray(array)
        if array.dtype == np.object_ or (not PY2 and array.dtype.kind == 'S'):
@gdementen (Contributor), Apr 26, 2019

Converting object arrays to strings seems like a bad idea (imagine an object array containing floats or ints), or mixed arrays with both strings and numbers (I think most of our real-life object arrays are of that kind).
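To illustrate the concern with a standalone numpy example (not code from this PR):

import numpy as np

# a mixed object array, as produced by real-life data with both labels and numbers
mixed = np.array(['a', 1, 2.5], dtype=object)
as_str = mixed.astype(str)
print(as_str)        # ['a' '1' '2.5'] -> the numeric values have silently become text
print(as_str.dtype)  # a unicode dtype such as '<U3'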

@alixdamman (Collaborator, Author)

reminder: add a test with an array of dtype=np.object_

@alixdamman alixdamman force-pushed the master branch 3 times, most recently from 9264074 to 01669f2 Compare May 10, 2019 10:03
@alixdamman alixdamman modified the milestones: 0.31, 0.32 Aug 1, 2019
@alixdamman alixdamman force-pushed the 724_use_pytables_instead_of_pandas branch 2 times, most recently from df28aea to 03e2fc4 Compare August 23, 2019 14:41
@gdementen (Contributor)

The pandas bypass and the internals cleanup both seem great.

However, I am not sure about making this (LHDFStore) part of the public API yet because this overlaps heavily with a lazy Session backed by an HDF file.

At this point I am unsure what the best API is to offer our users for opening an HDF file, loading/writing some arrays when needed, and then closing the file, but I fear that having both lazy sessions and LHDFStore in the public API could confuse our users, because that would essentially be two ways to do the same thing (in addition to the current confusion about read_X and sessions).

But do not revert anything (except maybe the changes to api.rst); in the worst case LHDFStore will be used by lazy sessions. We might also decide this is a better API than lazy sessions, or that it is worth having both APIs, but I simply cannot tell for now. I would like to avoid releasing 0.32 with this new API being advertised and then adding another similar API in the next release, confusing our users. This might be what we do anyway in the end, but it at least needs to be thought through.

- moved LHDFStore to inout/hdf.py
- implemented PandasStorer and PytablesStorer
- updated LArray/Axis/Group.to_hdf
- removed Metadata.to_hdf and Metadata.from_hdf
- renamed PandasHDFHandler as HDFHandler
@alixdamman alixdamman force-pushed the 724_use_pytables_instead_of_pandas branch from 03e2fc4 to 559d0c0 Compare August 26, 2019 07:23
        engine = 'tables'
    else:
        import tables
        handle = tables.open_file(filepath, mode='r')
Contributor

wouldn't it be better to use a context manager here?
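For reference, a minimal sketch of the context-manager form being suggested (assuming the surrounding engine-detection logic; PyTables File objects support the with statement; the filepath is a placeholder):

import tables

filepath = 'data.h5'  # placeholder path
with tables.open_file(filepath, mode='r') as handle:
    # inspect the file here to decide which engine to use;
    # the handle is closed automatically, even if an exception is raised
    engine = 'tables'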

@gdementen (Contributor) left a comment

I just had a good look at this again with fresh eyes. And I think there are too many abstraction layers in there:

# -> means "uses"
# ( ) means inherits from
Session -> HDFHandler(FileHandler) -> LHDFStore -> PytablesStorer(AbstractStorer)
                                                -> PandasStorer(AbstractStorer)

I imagined something much "flatter":

Session -> PytablesHDFHandler(FileHandler)
        -> PandasHDFHandler(FileHandler)

or (more likely) to avoid a bit of code duplication:

Session -> PytablesHDFHandler(HDFHandler(FileHandler))
        -> PandasHDFHandler(HDFHandler(FileHandler))

Note that nothing forbids us from having extra methods in HDFHandler and/or P*HDFHandler for HDF-specific stuff (if any is actually necessary). We could probably also make a few of the methods in FileHandler public, and add a few extra methods so that it can be used directly as a context manager like you do with LHDFStore. I don't understand why we need those two extra abstraction layers. I could understand one extra layer if we cannot accommodate the HDF specificities with extra methods/attributes (but off the top of my head, I don't see why that would be the case).
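A minimal sketch of that flatter layout, for concreteness (class and method names below are illustrative assumptions, simplified from the FileHandler protocol, not code from the PR):

class FileHandler:
    # simplified stand-in for the existing larray FileHandler protocol
    def __init__(self, fname):
        self.fname = fname

    def _open_for_read(self):
        raise NotImplementedError()

    def close(self):
        raise NotImplementedError()


class HDFHandler(FileHandler):
    # shared HDF logic (key normalization, metadata handling, ...)
    def _normalize_key(self, key):
        return key if key.startswith('/') else '/' + key


class PytablesHDFHandler(HDFHandler):
    def _open_for_read(self):
        import tables
        self.handle = tables.open_file(self.fname, mode='r')

    def close(self):
        self.handle.close()


class PandasHDFHandler(HDFHandler):
    def _open_for_read(self):
        import pandas as pd
        self.handle = pd.HDFStore(self.fname, mode='r')

    def close(self):
        self.handle.close()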

I know this has been a while, but do you remember why you did it this way instead of implementing specific FileHandlers (and enhancing the FileHandler class/protocol as needed)?

PS: The LazySession stuff can come on top of the FileHandler paradigm I think, so here I am happy I didn't go with Session subclasses for each type of file.

@@ -6714,6 +6714,9 @@ def to_hdf(self, filepath, key):
Path where the hdf file has to be written.
key : str or Group
Key (path) of the array within the HDF file (see Notes below).
engine: {'auto', 'tables', 'pandas'}, optional
Dump using `engine`. Use 'pandas' to update an HDF file generated with a LArray version previous to 0.31.
Defaults to 'auto' (use default engine if you don't know the LArray version used to produced the HDF file).
Contributor

change "used to produced" to either "used to produce" or "which produced"

        if group is not None:
            self._handle.remove_node(group, recursive=True)
        paths = key.split('/')
        # recursively create the parent groups
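For illustration, a standalone sketch (an assumed helper, not the PR's actual implementation) of what recursively creating the parent groups for a key could look like with PyTables:

import tables

def ensure_parent_groups(handle, key):
    # walk the key path and create each missing intermediate group
    parent = handle.root
    for name in key.strip('/').split('/')[:-1]:
        if name in parent:
            parent = getattr(parent, name)
        else:
            parent = handle.create_group(parent, name)
    return parent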
@alixdamman alixdamman removed this from the 0.33 milestone Jun 30, 2021