Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle extra indexes for zarr region writes #8904

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

slevang
Copy link
Contributor

@slevang slevang commented Apr 2, 2024

  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst

Small follow up to #8877. If we're going to drop the indices anyways for region writes, we may as well not raise if they are still in the dataset. This makes the user experience of region writes simpler:

ds = xr.tutorial.open_dataset("air_temperature")
ds.to_zarr("test.zarr")
region = {"time": slice(0, 10)}
# This fails unless we remember to ds.drop_vars(["lat", "lon"])
ds.isel(**region).to_zarr("test.zarr", region=region)

I find this annoying because I often have a dataset with a bunch of unrelated indexes and have to remember which ones to drop, or use some verbose set logic. I thought #8877 might have already done this, but not quite. By just reordering the point at which we drop indices, we can now skip this. We still raise if data vars are passed that don't overlap with the region.

cc @dcherian

@slevang slevang changed the title allow indices unrelated to the region since we drop them anyways Handle extra indexes for zarr region writes Apr 2, 2024
@slevang slevang mentioned this pull request Apr 2, 2024
4 tasks
@dcherian
Copy link
Contributor

dcherian commented Apr 2, 2024

Alternatively perhaps #8877 is too liberal and we should only drop indexes that are in region.

In general, I think we end up regretting this sort of implicit behaviour. It confuses a /lot/ of people.

What do you think of keeping the error but adding a nice copy-pasteable .drop_vars([...]) call to the error message?

cc @max-sixty

@slevang
Copy link
Contributor Author

slevang commented Apr 2, 2024

Alternatively perhaps #8877 is too liberal and we should only drop indexes that are in region.

Perhaps. Curious to hear other perspectives on this, and whether there are applications out there where people actually want to write indexes while doing region writes. I can't think of any. It's also always unsafe for parallel writes with region="auto" because we have workers reading from these same indexes as they're being overwritten.

Personally I find it more confusing that I have to drop the otherwise essential coordinate labels on my arrays just to get a region write to work. My MO with all other xarray ops is to keep and utilize these whenever possible, so this feels like a departure from that.

The current error message already returns the suggestion to add .drop_vars(['lat', 'lon']), but I don't really like running code until it breaks to extract this. Instead I try to remember to add something like ds.drop_vars([dim for dim in ds.dims if dim not in region]).to_zarr(region=region)

@max-sixty
Copy link
Collaborator

It's also always unsafe for parallel writes with region="auto" because we have workers reading from these same indexes as they're being overwritten.

Yes this!


I both agree that the existing error is annoying and that magic behavior can be really confusing.

At the cost of a small amount of perf and some additional code, would checking that the indexes match be a good balance? So it's only raising an error if the index on disk differs from the version in memory.

@slevang
Copy link
Contributor Author

slevang commented Apr 2, 2024

Or what about an opt-in kwarg, clean_for_region (someone please come up with a better name if we go this route)? If provided, we assume the user knows what they want and we smoothly handle dropping components of the dataset that don't overlap with the region, potentially data vars as well as indexes.

@max-sixty
Copy link
Collaborator

Actually I think #8460 might dominate this — by returning the dataset to write, it's moving the proposed magic out of .to_zarr into a single place where the inputs & outputs are very clear.

So my preference would be to try and push that through.

If that were to merge, would this still be helpful?

@slevang
Copy link
Contributor Author

slevang commented Apr 3, 2024

I agree that #8460 should generally be the recommended way once that's merged. I think I do have some patterns where that doesn't totally apply though and I'm still going to need the .drop_vars line. Pseudocode something like:

x = DataArray(dims=["time", "lat", "lon"])
y = DataArray(dims=["time", "lat", "lon"])
template = DataArray(dims=["lat", "lon", "parameter"])
xr.initialize_zarr(template, path)

# logic similar to:
# https://github.com/pydata/xarray/blob/a529f1d5b03279e88e3703b5a02a784838533d2c/xarray/tests/test_backends.py#L5708
blocks = get_block_slices() 
for region in blocks:
    result = model.fit(x.sel(region), y.sel(region))
    # I could assign result.data to its region in the template here, 
    # but doesn't really make sense because I want to flush to disk and discard
    result.to_zarr(path, region=region)

@slevang
Copy link
Contributor Author

slevang commented Apr 3, 2024

I guess my main point with this PR is that #8877 already introduced the majority of the "magic" to drop all indexes. If we're doing that, I don't see any reason to raise when non-region indexes are passed.

But if we think #8877 went overboard, then indeed this change doesn't make sense and we should probably revert to doing something like:

only drop indexes that are in region

@max-sixty
Copy link
Collaborator

I guess my main point with this PR is that #8877 already introduced the majority of the "magic" to drop all indexes. If we're doing that, I don't see any reason to raise when non-region indexes are passed.

Yes fair!

In that case, I would be +0.5 on merging this, at least until #8460 merges...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants