Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

User-guide - pandas : Add alternative to xarray.Dataset.from_dataframe #9020

Merged
merged 28 commits into from
May 30, 2024

Conversation

loco-philippe
Copy link
Contributor

This PR follow the issue #9015 as proposed by @max-sixty.

I added an additional section in the pandas.rst file to provide a third-party pandas interface alternative that is lossless and reversible.

The main contribution is the ability to find the multidimensional structure hidden by the tabular structure.

Copy link

welcome bot commented May 10, 2024

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

@max-sixty
Copy link
Collaborator

Overall I think this looks reasonable! I don't have strong views on the substance, others should comment if they have views...

We'll need to add ntv_pandas to the docs dependencies for the docs tests to pass.

@loco-philippe
Copy link
Contributor Author

@max-sixty

Thanks for the answer to the question I was going to ask!

Would it also be useful to add ntv_pandas in the ecosystem.rst file (for example in the 'Extend xarray capabilities' category) ?

@mathause
Copy link
Collaborator

Would it also be useful to add ntv_pandas in the ecosystem.rst file (for example in the 'Extend xarray capabilities' category) ?

Yes, please go ahead.

Copy link
Collaborator

@mathause mathause left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments. And as mentioned you will need to add the new dependency to https://github.com/pydata/xarray/blob/main/ci/requirements/doc.yml

doc/user-guide/pandas.rst Outdated Show resolved Hide resolved
doc/user-guide/pandas.rst Outdated Show resolved Hide resolved
doc/user-guide/pandas.rst Outdated Show resolved Hide resolved
doc/user-guide/pandas.rst Outdated Show resolved Hide resolved
doc/user-guide/pandas.rst Outdated Show resolved Hide resolved
doc/user-guide/pandas.rst Outdated Show resolved Hide resolved
doc/user-guide/pandas.rst Outdated Show resolved Hide resolved
mathause and others added 5 commits May 13, 2024 10:12
Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>
Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>
Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>
@loco-philippe
Copy link
Contributor Author

Error Read the Docs :

ModuleNotFoundError                       Traceback (most recent call last)
Cell In[11], line 1
----> 1 import ntv_pandas as npd

ModuleNotFoundError: No module named 'ntv_pandas'

<<<-------------------------------------------------------------------------

Is it because the package name (ntv-pandas) is different from the module name (ntv_pandas) ?

@max-sixty
Copy link
Collaborator

ModuleNotFoundError: No module named 'ntv_pandas'

Is it available on conda? Otherwise if only on pip then place it at the bottom in the pip section

@mathause
Copy link
Collaborator

Now you get a syntax error because arr[*idx] is only available in python 3.11 while our docs are in python 3.10 (and your ntv-numpy package is python 3.9+ - you should add tests for your minimal python version and dependencies).

https://github.com/loco-philippe/ntv-numpy/blob/4b57b8cc1bfab749c01ddf7edbc38a9ef53623df/ntv_numpy/xconnector.py#L268-L269

@mathause
Copy link
Collaborator

Checking again - you absolutely need to start testing your packages continuously. I am reluctant to 'endorse' a package that does not have a CI pipeline. Let us know if you need support with that.

@loco-philippe
Copy link
Contributor Author

@mathause

This bug is clearly unacceptable (I'm ashamed) !!

This demonstrates that I now need to take the time to have a robust CI process.
I will first use GitHub Actions to have a build and test pipeline (if you have minimum requirements to respect to endorse packages in Xarray or other advice, I'm interested!)

For the current PR, two solutions:

  • solution 1: I correct the identified bug and I check that all the tests are validated on each of the python versions before making a new commit,
  • solution 2: I integrate the above actions into a CI pipeline before making a new commit

It seems to me that solution 2 is preferable (unless you want to go quickly and in which case I will follow solution 1).

@loco-philippe
Copy link
Contributor Author

The issue is fixed (new version of the ntv_numpy package) and the CI GHA pipeline integrates the tests with python versions 3.10 and 3.11.

I will then add xarray accessors as defined and build a new version.

@mathause
Copy link
Collaborator

mathause commented May 15, 2024

Cool great to hear!

Thinking about this - I would second @keewis idea. Featuring a prominent section without code would give it visibility, limit maintenance burden, and keep the docs environment smaller. (I maintain another smaller package where a third party package is featured in the docs. While I find it super cool that someone created an extension, it has caused above issues for me.)


Maybe this could be something along the lines:

Lossless and reversible conversion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The previous example shows that the conversion is not reversible (lossy roundtrip) and
that the size of the ``datasets`` increases. To avoid these problems, the third-party 
`ntv-pandas`__ library offers lossless and reversible conversions between 
``Dataset``/ ``DataArray`` and pandas ``DataFrame`` objects.

__ https://github.com/loco-philippe/ntv-pandas

If you have not done so yet, you can showcase the examples in the docs of ntv-pandas.

@loco-philippe loco-philippe marked this pull request as ready for review May 19, 2024 21:50
@loco-philippe
Copy link
Contributor Author

The proposal to have Xarray documentation without code linked to a third party package seems logical to me.

So I modified the PR taking into account @mathause's proposal (with some modifications).

I also added the Xarray accessors in the ntv_numpy package. We now have symmetric methods Dataset.nxr.to_dataframe and DataFrame.npd.to_xarray).

Another question: I haven't found any other tools or methodologies that analyze the structure of a tabular data to extract the multidimensional structure. Do you know any?

Copy link
Collaborator

@mathause mathause left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for implementing our suggestions. Need to remove the packages from the doc.yml again. I have some optional minor suggestions, but looks good to go.

doc/user-guide/pandas.rst Outdated Show resolved Hide resolved
doc/user-guide/pandas.rst Outdated Show resolved Hide resolved
doc/user-guide/pandas.rst Outdated Show resolved Hide resolved
doc/user-guide/pandas.rst Outdated Show resolved Hide resolved
ci/requirements/doc.yml Outdated Show resolved Hide resolved
loco-philippe and others added 5 commits May 21, 2024 15:00
Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>
Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>
Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>
Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>
Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>
Copy link
Collaborator

@mathause mathause left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@mathause mathause added the plan to merge Final call for comments label May 22, 2024
@mathause mathause enabled auto-merge (squash) May 22, 2024 15:42
@loco-philippe
Copy link
Contributor Author

Thank you @max-sixty and @mathause for your approval of this PR.

I finally added another example (use case) which I think should address new needs (but I don't know Xarray well enough and I'm interested in your opinion).

Note: JSON, Pandas and Xarray interfaces are built on a neutral format as defined in this notebook. I think this structure is consistent with your roadmap (I'm also interested in your opinion).

@mathause
Copy link
Collaborator

close/ open to trigger the tests again

@mathause mathause closed this May 29, 2024
auto-merge was automatically disabled May 29, 2024 18:53

Pull request was closed

@mathause mathause reopened this May 29, 2024
@mathause mathause enabled auto-merge (squash) May 29, 2024 18:53
@mathause
Copy link
Collaborator

No idea why this does not work, but I don't see this could have something to do with the PR itself. I'll merge manually.

Thanks for your PR and willingness to adopt to our changes!

@mathause mathause disabled auto-merge May 30, 2024 07:46
@mathause mathause merged commit 9e8ea74 into pydata:main May 30, 2024
14 checks passed
Copy link

welcome bot commented May 30, 2024

Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again! celebration gif

@loco-philippe
Copy link
Contributor Author

Thanks also for spending time on this PR.

The changes were useful and relevant, so it was normal to take them into account !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
plan to merge Final call for comments
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants