(issue 724): bypass pandas using pytables directly to work with HDF5 files #761

Open
alixdamman wants to merge 4 commits into master from 724_use_pytables_instead_of_pandas

Conversation

alixdamman (Collaborator)

Not yet finished, but I've started the PR to open the discussion.

@gdementen (Contributor)

I know you didn't ask for it, but FWIW, I will not have time to review this initial POC before the 26th of April. I need a clear head to do this and that's not the case these days... My initial gut feeling is that you are putting too much into the LArray HDF layer (e.g. sort_rows, sort_columns & fill_value). Those are only useful when we are loading HDF files not produced via larray, which is nice to have but is not what I expected this issue to be about. I just wanted a simple binary save of larray objects and a way to load them back that would be as fast as possible, and thus bypass pandas entirely. Loading arbitrary HDF objects (not produced via larray) is much more complex and I wouldn't tackle it now, unless you have a clear idea of what is needed.

@alixdamman (Collaborator, Author)

PytablesHDFStore is not yet implemented. The class is almost empty. My first objective is to get backward compatibility with previously produced HDF files. Currently, the PR shows a fully implemented PandasHDFStore which simply reproduces what we had before. Nothing else.

I started the PR to let you know that I'm working on it. I don't expect you to review this soon, but maybe, after the 26/4, you could take a first look at the class structure.

@gdementen (Contributor)

I know. I just wanted to clarify my vision. Not that you have to follow it, but I wanted to avoid any misunderstanding.

@alixdamman (Collaborator, Author)

Just to clarify my own view as well: my wish now is to implement a PytablesHDFStore with limited features, but also to implement a LArrayHDFStore that works in a similar way to Workbook (open_excel) and to make it accessible via the public API.
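For illustration, the existing Workbook/open_excel pattern referenced here, followed by what the proposed LArrayHDFStore could look like if it mirrors it (the HDF class does not exist yet; its name, constructor and behaviour below are assumptions, not the final API):

# the real larray Workbook API (requires Excel via xlwings)
from larray import open_excel, ndtest

with open_excel('output.xlsx', overwrite_file=True) as wb:
    wb['arr1'] = ndtest((2, 3))     # write an array to the sheet 'arr1'
    arr1 = wb['arr1'].load()        # read it back while the file is open

# hypothetical HDF analogue discussed above (not implemented yet):
# with LArrayHDFStore('output.h5', mode='w') as store:
#     store['arr1'] = ndtest((2, 3))
#     arr1 = store['arr1']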

@alixdamman alixdamman added this to the 0.31 milestone Apr 26, 2019
        return array.astype(str)
    try:
        array = np.asarray(array)
        if array.dtype == np.object_ or (not PY2 and array.dtype.kind == 'S'):
@gdementen (Contributor), Apr 26, 2019

Converting object arrays to strings seems like a bad idea (imagine an object array containing floats or ints), or mixed arrays with both strings and numbers (I think most of our real-life object arrays are of that kind).
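To illustrate the concern with a standalone numpy example (not code from this PR):

import numpy as np

# a mixed object array, as produced by real-life data with both labels and numbers
mixed = np.array(['a', 1, 2.5], dtype=object)
as_str = mixed.astype(str)
print(as_str)        # ['a' '1' '2.5'] -> the numeric values have silently become text
print(as_str.dtype)  # a unicode dtype such as '<U3'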

@alixdamman (Collaborator, Author)

reminder: add a test with an array of dtype=np.object_

@alixdamman alixdamman force-pushed the master branch 3 times, most recently from 9264074 to 01669f2 Compare May 10, 2019 10:03
@alixdamman alixdamman modified the milestones: 0.31, 0.32 Aug 1, 2019
@alixdamman alixdamman force-pushed the 724_use_pytables_instead_of_pandas branch 2 times, most recently from df28aea to 03e2fc4 Compare August 23, 2019 14:41
@gdementen (Contributor)

The pandas bypass and the internals cleanup both seem great.

However, I am not sure about making this (LHDFStore) part of the public API yet because this overlaps heavily with a lazy Session backed by an HDF file.

At this point I am unsure what the best API is to offer our users for opening an HDF file, loading/writing some arrays when needed, and then closing the file, but I fear that having both lazy sessions and LHDFStore in the public API could confuse our users, because that would essentially be two ways to do the same thing (in addition to the current confusion about read_X and sessions).

But do not revert anything (except maybe the changes to api.rst); in the worst case LHDFStore will be used by lazy sessions. We might also decide this is a better API than lazy sessions, or that it is worth having both APIs, but I simply cannot tell for now. I would like to avoid releasing 0.32 with this new API being advertised and then adding another similar API in the next release, confusing our users. This might be what we do anyway in the end, but it at least needs to be thought through.

- moved LHDFStore to inout/hdf.py
- implemented PandasStorer and PytablesStorer
- updated LArray/Axis/Group.to_hdf
- removed Metadata.to_hdf and Metadata.from_hdf
- renamed PandasHDFHandler as HDFHandler
@alixdamman alixdamman force-pushed the 724_use_pytables_instead_of_pandas branch from 03e2fc4 to 559d0c0 Compare August 26, 2019 07:23
        engine = 'tables'
    else:
        import tables
        handle = tables.open_file(filepath, mode='r')
Contributor

wouldn't it be better to use a context manager here?
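For reference, a minimal sketch of the context-manager form being suggested (assuming the surrounding engine-detection logic; PyTables File objects support the with statement; the filepath is a placeholder):

import tables

filepath = 'data.h5'  # placeholder path
with tables.open_file(filepath, mode='r') as handle:
    # inspect the file here to decide which engine to use;
    # the handle is closed automatically, even if an exception is raised
    engine = 'tables'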

@gdementen (Contributor) left a comment

I just had a good look at this again with fresh eyes. And I think there are too many abstraction layers in there:

# -> means "uses"
# ( ) means inherits from
Session -> HDFHandler(FileHandler) -> LHDFStore -> PytablesStorer(AbstractStorer)
                                                -> PandasStorer(AbstractStorer)

I imagined something much "flatter":

Session -> PytablesHDFHandler(FileHandler)
        -> PandasHDFHandler(FileHandler)

or (more likely) to avoid a bit of code duplication:

Session -> PytablesHDFHandler(HDFHandler(FileHandler))
        -> PandasHDFHandler(HDFHandler(FileHandler))

Note that nothing forbids us from having extra methods in HDFHandler and/or P*HDFHandler for HDF-specific stuff (if any is actually necessary). We could probably also make a few of the methods in FileHandler public, and add a few extra methods so that it can be used directly as a context manager like you do with LHDFStore. I don't understand why we need those two extra abstraction layers. I could understand one extra layer if we cannot accommodate the HDF specificities with extra methods/attributes (but off the top of my head, I don't see why that would be the case).
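A minimal sketch of that flatter layout, for concreteness (class and method names below are illustrative assumptions, simplified from the FileHandler protocol, not code from the PR):

class FileHandler:
    # simplified stand-in for the existing larray FileHandler protocol
    def __init__(self, fname):
        self.fname = fname

    def _open_for_read(self):
        raise NotImplementedError()

    def close(self):
        raise NotImplementedError()


class HDFHandler(FileHandler):
    # shared HDF logic (key normalization, metadata handling, ...)
    def _normalize_key(self, key):
        return key if key.startswith('/') else '/' + key


class PytablesHDFHandler(HDFHandler):
    def _open_for_read(self):
        import tables
        self.handle = tables.open_file(self.fname, mode='r')

    def close(self):
        self.handle.close()


class PandasHDFHandler(HDFHandler):
    def _open_for_read(self):
        import pandas as pd
        self.handle = pd.HDFStore(self.fname, mode='r')

    def close(self):
        self.handle.close()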

I know this has been a while, but do you remember why you did it this way instead of implementing specific FileHandlers (and enhancing the FileHandler class/protocol as needed)?

PS: The LazySession stuff can come on top of the FileHandler paradigm I think, so here I am happy I didn't go with Session subclasses for each type of file.

@@ -6714,6 +6714,9 @@ def to_hdf(self, filepath, key):
Path where the hdf file has to be written.
key : str or Group
Key (path) of the array within the HDF file (see Notes below).
engine: {'auto', 'tables', 'pandas'}, optional
Dump using `engine`. Use 'pandas' to update an HDF file generated with a LArray version previous to 0.31.
Defaults to 'auto' (use default engine if you don't know the LArray version used to produced the HDF file).
Contributor

change "used to produced" to either "used to produce" or "which produced"

        if group is not None:
            self._handle.remove_node(group, recursive=True)
        paths = key.split('/')
        # recursively create the parent groups
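For illustration, a standalone sketch (an assumed helper, not the PR's actual implementation) of what recursively creating the parent groups for a key could look like with PyTables:

import tables

def ensure_parent_groups(handle, key):
    # walk the key path and create each missing intermediate group
    parent = handle.root
    for name in key.strip('/').split('/')[:-1]:
        if name in parent:
            parent = getattr(parent, name)
        else:
            parent = handle.create_group(parent, name)
    return parent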
@alixdamman alixdamman removed this from the 0.33 milestone Jun 30, 2021