Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

to_base_variable: coerce multiindex data to numpy array #8888

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

benbovy
Copy link
Member

@benbovy benbovy commented Mar 29, 2024

@slevang this should also make work your test case added in #8809. I haven't added it here, instead I added a basic check that should be enough.

I don't really understand why the serialization backends (zarr?) do not seem to work with the PandasMultiIndexingAdapter.__array__() implementation, which should normally coerce the multi-index levels into numpy arrays as needed. Anyway, I guess that coercing it early like in this PR doesn't hurt and may avoid the confusion of a non-indexed, isolated coordinate variable that still wraps a pandas.MultiIndex.

Ensure that reset index variable doesn't wrap any multiindex anymore.
@benbovy benbovy requested a review from dcherian March 29, 2024 10:12
@benbovy benbovy added this to In progress in Explicit Indexes via automation Mar 29, 2024
@slevang
Copy link
Contributor

slevang commented Mar 29, 2024

Thanks @benbovy, this seems good, but still doesn't fix my original issue in #8809. See comment there for more detail.

@dcherian
Copy link
Contributor

This consistency check is still broken though, I pushed it to this branch.

import numpy as np
import xarray as xr

# ND DataArray that gets stacked along a multiindex
da = xr.DataArray(np.ones((3, 3)), coords={"dim1": [1, 2, 3], "dim2": [4, 5, 6]})
da = da.stack(feature=["dim1", "dim2"])

# Extract just the stacked coordinates for saving in a dataset
ds = xr.Dataset(data_vars={"feature": da.feature})
xr.testing.assertions._assert_internal_invariants(ds.reset_index(["feature", "dim1", "dim2"]), check_default_indexes=False) # succeeds
xr.testing.assertions._assert_internal_invariants(ds.reset_index(["feature"]), check_default_indexes=False) # fails, but no warning either

@benbovy
Copy link
Member Author

benbovy commented Mar 29, 2024

Wow it took me some time to figure that out:

ds = xr.Dataset(data_vars={"feature": da.feature})

So it detects the multi-index from da.feature, then assigns it to the feature variable, auto-promotes the later to a coordinate and finally auto-creates coordinates and indexes for the multi-index levels. That's a lot happening under the hood! The internal logic for handling this is complicated, very fragile and actually still buggy (in this case Xarray wrongly creates two Xarray indexes for the level coordinates and for the "feature" dimension coordinate respectively, so reset_index won't work as expected).

This is being addressed / discussed in #8140.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Explicit Indexes
  
In progress
Development

Successfully merging this pull request may close these issues.

resetting multiindex may be buggy
3 participants