User-guide - pandas : Add alternative to xarray.Dataset.from_dataframe #9020

loco-philippe · 2024-05-10T08:07:05Z

This PR follows up on issue #9015, as proposed by @max-sixty.

I added an additional section in the pandas.rst file to provide a third-party pandas interface alternative that is lossless and reversible.

The main contribution is the ability to find the multidimensional structure hidden by the tabular structure.
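
To make "hidden multidimensional structure" concrete, here is a minimal sketch using only the standard library. It is not the algorithm used by ntv-pandas; the helper name `is_full_grid` and the sample table are hypothetical. It checks one simple criterion: a group of columns can serve as dimensions when every combination of their values occurs exactly once.

```python
from math import prod


def is_full_grid(table, columns):
    """Return True if `columns` enumerate a complete grid over the rows.

    `table` maps column names to equally long lists of hashable values.
    (Hypothetical helper, for illustration only.)
    """
    n_rows = len(next(iter(table.values())))
    # Number of cells in the grid spanned by the unique values of each column.
    n_cells = prod(len(set(table[name])) for name in columns)
    # Distinct value combinations that actually occur in the rows.
    combos = set(zip(*(table[name] for name in columns)))
    return n_cells == n_rows and len(combos) == n_rows


table = {
    "city": ["Paris", "Paris", "Lyon", "Lyon"],
    "year": [2020, 2021, 2020, 2021],
    "population": [2.1, 2.2, 0.5, 0.6],
}

print(is_full_grid(table, ["city", "year"]))  # the two columns span a 2 x 2 grid
print(is_full_grid(table, ["city"]))          # a single column does not index the rows
```

With such a check, `population` can be read as a two-dimensional variable indexed by `city` and `year` rather than as a flat column.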

welcome · 2024-05-10T08:07:08Z

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.


max-sixty · 2024-05-10T16:22:53Z

Overall I think this looks reasonable! I don't have strong views on the substance, others should comment if they have views...

We'll need to add ntv_pandas to the docs dependencies for the docs tests to pass.

loco-philippe · 2024-05-10T20:04:05Z

@max-sixty

Thanks for the answer to the question I was going to ask!

Would it also be useful to add ntv_pandas to the ecosystem.rst file (for example in the 'Extend xarray capabilities' category)?

mathause · 2024-05-10T20:16:33Z

> Would it also be useful to add ntv_pandas in the ecosystem.rst file (for example in the 'Extend xarray capabilities' category) ?

Yes, please go ahead.

mathause

Some comments. And as mentioned you will need to add the new dependency to https://github.com/pydata/xarray/blob/main/ci/requirements/doc.yml


loco-philippe · 2024-05-13T22:11:43Z

Read the Docs build error:

ModuleNotFoundError                       Traceback (most recent call last)
Cell In[11], line 1
----> 1 import ntv_pandas as npd

ModuleNotFoundError: No module named 'ntv_pandas'

<<<-------------------------------------------------------------------------

Is it because the package name (ntv-pandas) is different from the module name (ntv_pandas)?
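
For context, a distribution name that differs from its import name is common and is not an error in itself. A minimal sketch, using only the standard library, that checks the two separately; the helper name `describe_install` is hypothetical:

```python
import importlib.util
from importlib import metadata


def describe_install(distribution, module):
    """Report whether a distribution is installed and its module importable.

    (Hypothetical helper, for illustration only.)
    """
    try:
        version = metadata.version(distribution)
    except metadata.PackageNotFoundError:
        version = None  # the distribution is not installed in this environment
    importable = importlib.util.find_spec(module) is not None
    return {
        "distribution": distribution,
        "version": version,
        "module": module,
        "importable": importable,
    }


# The distribution is looked up by its install name, the module by its import name.
print(describe_install("ntv-pandas", "ntv_pandas"))
```

If both `version` and `importable` come back empty, the package is simply missing from the environment that runs the code.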

max-sixty · 2024-05-13T23:11:12Z

> ModuleNotFoundError: No module named 'ntv_pandas'

Is it available on conda? If it is only on pip, place it at the bottom, in the pip section.

mathause · 2024-05-14T09:03:29Z

Now you get a syntax error because arr[*idx] is only valid in Python 3.11 and later, while our docs are built with Python 3.10 (and your ntv-numpy package declares Python 3.9+; you should add tests for your minimum Python version and dependencies).
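
A minimal sketch of the portable spelling, assuming the intent is to index with all entries of `idx` at once. The `Grid` class is a hypothetical stand-in for an array that only echoes the key it receives, so the example runs without NumPy:

```python
class Grid:
    """Tiny stand-in for an array; it only echoes the key it receives."""

    def __getitem__(self, key):
        return key


arr = Grid()
idx = [0, 2, 1]

# Portable form: build the tuple explicitly. This parses on older Pythons too.
portable = arr[tuple(idx)]

# The starred form is kept as a string so this file still parses before 3.11.
NEWER_SPELLING = "arr[*idx]"
try:
    newer = eval(NEWER_SPELLING, {"arr": arr, "idx": idx})
except SyntaxError:
    newer = None  # interpreter older than Python 3.11

print(portable, newer)
```

On interpreters that accept the starred form, both spellings pass the same tuple to `__getitem__`.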

https://github.com/loco-philippe/ntv-numpy/blob/4b57b8cc1bfab749c01ddf7edbc38a9ef53623df/ntv_numpy/xconnector.py#L268-L269

mathause · 2024-05-14T09:09:55Z

Checking again - you absolutely need to start testing your packages continuously. I am reluctant to 'endorse' a package that does not have a CI pipeline. Let us know if you need support with that.

loco-philippe · 2024-05-14T12:15:10Z

@mathause

This bug is clearly unacceptable (I'm ashamed) !!

This demonstrates that I now need to take the time to have a robust CI process.
I will start with GitHub Actions to get a build and test pipeline (if you have minimum requirements for endorsing packages in Xarray, or any other advice, I'm interested!)

For the current PR, two solutions:

Solution 1: I fix the identified bug and check that all tests pass on each supported Python version before making a new commit.
Solution 2: I integrate those checks into a CI pipeline before making a new commit.

Solution 2 seems preferable to me (unless you want to move quickly, in which case I will go with solution 1).

loco-philippe · 2024-05-15T22:41:48Z

The issue is fixed (new version of the ntv_numpy package) and the GitHub Actions CI pipeline now runs the tests with Python 3.10 and 3.11.

I will then add xarray accessors as defined and build a new version.

mathause · 2024-05-15T23:16:21Z

Cool, great to hear!

Thinking about this, I would second @keewis's idea. Featuring a prominent section without code would give it visibility, limit the maintenance burden, and keep the docs environment smaller. (I maintain another, smaller package where a third-party package is featured in the docs. While I find it super cool that someone created an extension, it has caused the issues above for me.)

Maybe this could be something along these lines:

Lossless and reversible conversion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The previous example shows that the conversion is not reversible (lossy roundtrip) and
that the size of the ``datasets`` increases. To avoid these problems, the third-party 
`ntv-pandas`__ library offers lossless and reversible conversions between 
``Dataset``/ ``DataArray`` and pandas ``DataFrame`` objects.

__ https://github.com/loco-philippe/ntv-pandas

If you have not done so yet, you can showcase the examples in the docs of ntv-pandas.
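
As a rough illustration of why the size can increase: a dense array needs one cell per combination of index values, while a table only stores the combinations that actually occur. A minimal standard-library sketch; the helper name `dense_size` and the sample rows are hypothetical:

```python
from math import prod

# Three observed rows: (city, year, population). ("Lyon", 2020) is absent.
rows = [
    ("Paris", 2020, 2.1),
    ("Paris", 2021, 2.2),
    ("Lyon", 2021, 0.5),
]


def dense_size(rows, n_index):
    """Cells in the full grid spanned by the first `n_index` columns."""
    return prod(len({row[i] for row in rows}) for i in range(n_index))


cells = dense_size(rows, 2)   # 2 cities x 2 years
padding = cells - len(rows)   # cells that must be filled with a missing value
print(cells, padding)
```

Converting back to a table then yields one row per cell, including the padded ones, unless they are dropped explicitly.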


loco-philippe · 2024-05-19T22:23:05Z

The proposal to have Xarray documentation without code linked to a third party package seems logical to me.

So I modified the PR taking into account @mathause's proposal (with some modifications).

I also added the Xarray accessors in the ntv_numpy package. We now have symmetric methods (Dataset.nxr.to_dataframe and DataFrame.npd.to_xarray).

Another question: I haven't found any other tools or methodologies that analyze the structure of tabular data to extract the multidimensional structure. Do you know of any?

mathause

Thanks for implementing our suggestions. Need to remove the packages from the doc.yml again. I have some optional minor suggestions, but looks good to go.


mathause

Looks good!

loco-philippe · 2024-05-26T20:48:28Z

Thank you @max-sixty and @mathause for your approval of this PR.

I finally added another example (use case) which I think should address new needs (but I don't know Xarray well enough and I'm interested in your opinion).

Note: JSON, Pandas and Xarray interfaces are built on a neutral format as defined in this notebook. I think this structure is consistent with your roadmap (I'm also interested in your opinion).

mathause · 2024-05-29T18:53:41Z

Closing and reopening to trigger the tests again.

mathause · 2024-05-30T07:46:06Z

No idea why this does not work, but I don't see how it could have anything to do with the PR itself. I'll merge manually.

Thanks for your PR and your willingness to adapt to our changes!

welcome · 2024-05-30T07:46:20Z

Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again!

loco-philippe · 2024-05-30T13:58:44Z

Thanks also for spending time on this PR.

The changes were useful and relevant, so it was only natural to take them into account!

loco-philippe added 2 commits May 9, 2024 19:16

Update pandas.rst

fcedc2d

Update pandas.rst

09d3e93

pre-commit-ci bot and others added 5 commits May 10, 2024 08:07

[pre-commit.ci] auto fixes from pre-commit.com hooks

763e6d4


Merge branch 'main' into main

d4e0b8c

Update pandas.rst

6ca399d

Merge branch 'main' of https://github.com/loco-philippe/xarray

c8e2c3b

[pre-commit.ci] auto fixes from pre-commit.com hooks

878b683


Update ecosystem.rst

56a488f

loco-philippe mentioned this pull request May 10, 2024

(Blessed) JSON serializable format numpy/numpy#12481

Open

mathause reviewed May 13, 2024


mathause and others added 5 commits May 13, 2024 10:12

Merge branch 'main' into main

15eff3c

Update doc/user-guide/pandas.rst

c1a3ff5

Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>

Update doc/user-guide/pandas.rst

84b476d

Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>

Update doc/user-guide/pandas.rst

ffe3a73

Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>

review comments

5a6009f

loco-philippe added 2 commits May 14, 2024 00:13

Update doc.yml

1082288

Update doc.yml

dd7970d

loco-philippe added 4 commits May 14, 2024 08:56

Update doc.yml

0113b96

Update doc.yml

5f9468a

Update doc.yml

77345bc

Update doc.yml

06d98d3

loco-philippe and others added 2 commits May 19, 2024 23:36

remove code

356f031

[pre-commit.ci] auto fixes from pre-commit.com hooks

f571bda


loco-philippe marked this pull request as ready for review May 19, 2024 21:50

mathause requested changes May 21, 2024


loco-philippe and others added 5 commits May 21, 2024 15:00

Update doc/user-guide/pandas.rst

a069c11

Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>

Update doc/user-guide/pandas.rst

e214029

Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>

Update ci/requirements/doc.yml

4e8aede

Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>

Update doc/user-guide/pandas.rst

d8db8e9

Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>

Update doc/user-guide/pandas.rst

2b331dc

Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>

loco-philippe requested a review from mathause May 21, 2024 13:06

mathause approved these changes May 22, 2024


mathause added the plan to merge Final call for comments label May 22, 2024

Merge branch 'main' into main

890cb27

mathause enabled auto-merge (squash) May 22, 2024 15:42

Merge branch 'main' into main

8059010

mathause closed this May 29, 2024

auto-merge was automatically disabled May 29, 2024 18:53
Pull request was closed

mathause reopened this May 29, 2024

mathause enabled auto-merge (squash) May 29, 2024 18:53

mathause disabled auto-merge May 30, 2024 07:46

mathause merged commit 9e8ea74 into pydata:main May 30, 2024
14 checks passed

mathause mentioned this pull request Jun 3, 2024

[pre-commit.ci] pre-commit autoupdate #9061

Open
