Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pass variable name to encode_zarr_variable #8809

Closed
wants to merge 6 commits into from

Conversation

slevang
Copy link
Contributor

@slevang slevang commented Mar 6, 2024

The change from #8672 mostly fixed the issue of serializing a reset multiindex in the backends, but there was an additional niche issue that turned up in xeofs that was causing serialization to still fail on the zarr backend.

The issue is that zarr is the only backend that uses a custom version of encode_cf_variable called encode_zarr_variable, and the way this gets called we don't pass through the name of the variable before running ensure_not_multiindex.

As a minimal fix, this PR just passes name through as an additional arg to the general encode_variable function. See @benbovy's comment that maybe we should actually unwrap the level coordinate in reset_index and clean up the checks in ensure_not_multiindex, but I wasn't able to get that working easily.

The exact workflow this turned up in involves DataTree and looks like this:

import numpy as np
import xarray as xr
from datatree import DataTree

# ND DataArray that gets stacked along a multiindex
da = xr.DataArray(np.ones((3, 3)), coords={"dim1": [1, 2, 3], "dim2": [4, 5, 6]})
da = da.stack(feature=["dim1", "dim2"])

# Extract just the stacked coordinates for saving in a dataset
ds = xr.Dataset(data_vars={"feature": da.feature})

# Reset the multiindex, which should make things serializable
ds = ds.reset_index("feature")
dt1 = DataTree()
dt2 = DataTree(name="feature", data=ds)
dt1["foo"] = dt2
# Somehow in this step, dt1.foo.feature.dim1.variable becomes an IndexVariable again
print(type(dt1.foo.feature.dim1.variable))

# Works
dt1.to_netcdf("test.nc", mode="w")
# Fails
dt1.to_zarr("test.zarr", mode="w")

But we can reproduce in xarray with the test added here.

Copy link
Collaborator

@max-sixty max-sixty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I don't know as much as others about the code, this looks good to me

@slevang
Copy link
Contributor Author

slevang commented Mar 26, 2024

Anyone else familiar with the backends have any objections to this, @dcherian @jhamman ?

Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @slevang! One comment on backward compat but otherwise looks good.

xarray/backends/common.py Outdated Show resolved Hide resolved
@dcherian
Copy link
Contributor

dcherian commented Mar 28, 2024

I think the real error is that resetting the MultiIndex doesn't clear out the index for the levels:

Using

ds = ds.reset_index(["feature", "dim1", "dim2"])

makes it all work. I don't think this change is needed.

See #8887. I don't understand why only Zarr errors out though.

@slevang
Copy link
Contributor Author

slevang commented Mar 29, 2024

Here's an approximate stack of function calls that leads to this being an issue, and why it only applies to zarr:

# ZarrStore inherits from:
class AbstractWritableDataStore:
    def encode():
        variables = {k: self.encode_variable(v) for k, v in variables.items()}
            encode_zarr_variable(variable) # <- NO NAME PASSED HERE
                conventions.encode_cf_variable(var, name=name) # <- name is now always None
                    ensure_not_multiindex(var, name=name) # <- fails

# NetCDF4DataStore/H5NetCDFArrayWrapper inherit from:
class WritableCFDataStore(AbstractWritableDataStore):
    # which overrides encode:
    def encode():
        conventions.cf_encoder(variables, attributes)
            new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()} # <- NAME IS PASSED HERE
                ensure_not_multiindex(var, name=name) # <- all good

I don't know the background on why this structure exists, but it does lead to a subtle but fundamental difference in the multiindex check specifically.

#8888 interestingly fixes the test I added here, but still doesn't fix the zarr serialization issue with DataTree in my original post. Not sure why.

@benbovy
Copy link
Member

benbovy commented Mar 29, 2024

but still doesn't fix the zarr serialization issue with DataTree in my original post. Not sure why.

The bug likely happens when creating a new dataset by passing a DataArray object wrapping a multi-index (i.e., ds = xr.Dataset(data_vars={"feature": da.feature})), as explained in #8888 (comment).

I would advise against doing that if the goal is to pass the whole thing, i.e., the multi-index and both its dimension and level coordinate variables (I'd like to deprecate it in #8140). Instead I'd strongly suggest passing all multi-index coordinates explicitly with ds = xr.Dataset(coords=da.coords).

@slevang
Copy link
Contributor Author

slevang commented Mar 29, 2024

Ah that does it, thanks!

Happy to either leave this up or close it then - understand this was a pretty niche issue.

@slevang
Copy link
Contributor Author

slevang commented Apr 3, 2024

I'll close this since it's probably unnecessary but seems like we should try to push #8140 through to prevent this sort of confusion in the future.

@slevang slevang closed this Apr 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants