Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor swap dims #8911

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from
Draft

Refactor swap dims #8911

wants to merge 2 commits into from

Conversation

benbovy
Copy link
Member

@benbovy benbovy commented Apr 5, 2024

I've tried here re-implementing swap_dims using rename_dims, drop_indexes and set_xindex. This fixes the example in #8646 but unfortunately this fails at handling the pandas multi-index special case (i.e., a single non-dimension coordinate wrapping a pd.MultiIndex that is promoted to a dimension coordinate in swap-dims auto-magically results in a PandasMultiIndex with both dimension and level coordinates).

@dcherian
Copy link
Contributor

dcherian commented Apr 5, 2024

the pandas multi-index special case

I've been starting to think of multiindex as a model for a user defined custom index wrapping multiple variables.

Would we want to recreate a user's custom index with this swap_dims op? Probably not right?

@dcherian
Copy link
Contributor

dcherian commented Apr 5, 2024

Does this fix #8914
(found by hypothesis)

@benbovy
Copy link
Member Author

benbovy commented Apr 5, 2024

I've been starting to think of multiindex as a model for a user defined custom index wrapping multiple variables.

Yes this is a good model. That's what causes a lot of trouble since the index refactor as we transitioned from a dimension coordinate - with virtual level coordinates - to the model that you describe here.

Would we want to recreate a user's custom index with this swap_dims op? Probably not right?

No I don't think we want this. In both main and this PR no multi-index is recreated:

import pandas as pd
import xarray as xr

midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=["foo", "bar"])
coords = xr.Coordinates.from_pandas_multiindex(midx, "x")
coords["y"] = ("x", [1, 2, 3, 4])
ds = xr.Dataset(coords=coords)

ds.swap_dims(x="y")
# main: multi-index "x", "foo", "bar" is dropped (although the repr still shows "MultiIndex" for "x" values)
# this PR: ValueError, cannot remove index "x" which would corrupt index built from coordinates "x", "foo", "bar"

The error raised in this PR might make sense, or alternatively we can easily get all coordinates ("x", "foo", "bar") and drop the index.

However, like the Dataset constructor and a few other methods, swap_dims supports auto-detecting / auto-promoting pd.MultiIndex objects:

ds2 = xr.Dataset(coords={"y": ("x", midx), "x": [1, 2, 3, 4]})

ds2
# <xarray.Dataset> Size: 64B
# Dimensions:  (x: 4)
# Coordinates:
#     y        (x) object 32B ('a', 0) ('a', 1) ('b', 0) ('b', 1)
#   * x        (x) int64 32B 1 2 3 4
# Data variables:
#     *empty*


ds2.swap_dims(x="y")
# (in main branch)
# <xarray.Dataset> Size: 128B
# Dimensions:  (y: 4)
# Coordinates:
#   * y        (y) object 32B MultiIndex
#   * foo      (y) object 32B 'a' 'a' 'b' 'b'
#   * bar      (y) int64 32B 0 1 0 1
#     x        (y) int64 32B 1 2 3 4
# Data variables:
#    *empty*

This feels too magical to me, and makes the internal logic unnecessarily complicated. I'm not sure if there is an easy way to support that with the refactoring in this PR.

I'd be in favor of treating any multi-index passed directly as a single coordinate with tuples (#8140), i.e.,

ds2.swap_dims(x="y")
# Dimensions:  (x: 4)
# Coordinates:
#   * y        (y) object 32B ('a', 0) ('a', 1) ('b', 0) ('b', 1)
#     x        (y) int64 32B 1 2 3 4
# Data variables:
#     *empty*

with a warning raised.

@dcherian
Copy link
Contributor

dcherian commented Apr 5, 2024

On main i see no warning:
image

which looks like

or alternatively we can easily get all coordinates ("x", "foo", "bar") and drop the index.

I think this behaviour is good ^: we should preserve all variables, just destroy any fancy structure

swap_dims supports auto-detecting / auto-promoting pd.MultiIndex objects:

I support deprecating and deleting this behaviour

@max-sixty
Copy link
Collaborator

max-sixty commented Apr 17, 2024

However, like the Dataset constructor and a few other methods, swap_dims supports auto-detecting / auto-promoting pd.MultiIndex objects:

ds2 = xr.Dataset(coords={"y": ("x", midx), "x": [1, 2, 3, 4]})

ds2
# <xarray.Dataset> Size: 64B
# Dimensions:  (x: 4)
# Coordinates:
#     y        (x) object 32B ('a', 0) ('a', 1) ('b', 0) ('b', 1)
#   * x        (x) int64 32B 1 2 3 4
# Data variables:
#     *empty*


ds2.swap_dims(x="y")
# (in main branch)
# <xarray.Dataset> Size: 128B
# Dimensions:  (y: 4)
# Coordinates:
#   * y        (y) object 32B MultiIndex
#   * foo      (y) object 32B 'a' 'a' 'b' 'b'
#   * bar      (y) int64 32B 0 1 0 1
#     x        (y) int64 32B 1 2 3 4
# Data variables:
#    *empty*

This feels too magical to me, and makes the internal logic unnecessarily complicated. I'm not sure if there is an easy way to support that with the refactoring in this PR.

I've been a fan of MultiIndexes and simple pandas compatibility, but definitely agree that the "a 1D array of tuples magically becomes a MultiIndex" behavior would be fine / good to remove...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants