Handle extra indexes for zarr region writes #8904

slevang · 2024-04-02T14:34:00Z

Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst

Small follow up to #8877. If we're going to drop the indices anyways for region writes, we may as well not raise if they are still in the dataset. This makes the user experience of region writes simpler:

ds = xr.tutorial.open_dataset("air_temperature")
ds.to_zarr("test.zarr")
region = {"time": slice(0, 10)}
# This fails unless we remember to ds.drop_vars(["lat", "lon"])
ds.isel(**region).to_zarr("test.zarr", region=region)

I find this annoying because I often have a dataset with a bunch of unrelated indexes and have to remember which ones to drop, or use some verbose set logic. I thought #8877 might have already done this, but not quite. By just reordering the point at which we drop indices, we can now skip this. We still raise if data vars are passed that don't overlap with the region.

cc @dcherian

dcherian · 2024-04-02T15:11:24Z

Alternatively perhaps #8877 is too liberal and we should only drop indexes that are in region.

In general, I think we end up regretting this sort of implicit behaviour. It confuses a /lot/ of people.

What do you think of keeping the error but adding a nice copy-pasteable .drop_vars([...]) call to the error message?

cc @max-sixty

slevang · 2024-04-02T17:17:24Z

Alternatively perhaps #8877 is too liberal and we should only drop indexes that are in region.

Perhaps. Curious to hear other perspectives on this, and whether there are applications out there where people actually want to write indexes while doing region writes. I can't think of any. It's also always unsafe for parallel writes with region="auto" because we have workers reading from these same indexes as they're being overwritten.

Personally I find it more confusing that I have to drop the otherwise essential coordinate labels on my arrays just to get a region write to work. My MO with all other xarray ops is to keep and utilize these whenever possible, so this feels like a departure from that.

The current error message already returns the suggestion to add .drop_vars(['lat', 'lon']), but I don't really like running code until it breaks to extract this. Instead I try to remember to add something like ds.drop_vars([dim for dim in ds.dims if dim not in region]).to_zarr(region=region)

max-sixty · 2024-04-02T17:30:08Z

It's also always unsafe for parallel writes with region="auto" because we have workers reading from these same indexes as they're being overwritten.

Yes this!

I both agree that the existing error is annoying and that magic behavior can be really confusing.

At the cost of a small amount of perf and some additional code, would checking that the indexes match be a good balance? So it's only raising an error if the index on disk differs from the version in memory.

slevang · 2024-04-02T18:01:36Z

Or what about an opt-in kwarg, clean_for_region (someone please come up with a better name if we go this route)? If provided, we assume the user knows what they want and we smoothly handle dropping components of the dataset that don't overlap with the region, potentially data vars as well as indexes.

max-sixty · 2024-04-03T00:05:17Z

Actually I think #8460 might dominate this — by returning the dataset to write, it's moving the proposed magic out of .to_zarr into a single place where the inputs & outputs are very clear.

So my preference would be to try and push that through.

If that were to merge, would this still be helpful?

slevang · 2024-04-03T14:20:37Z

I agree that #8460 should generally be the recommended way once that's merged. I think I do have some patterns where that doesn't totally apply though and I'm still going to need the .drop_vars line. Pseudocode something like:

x = DataArray(dims=["time", "lat", "lon"])
y = DataArray(dims=["time", "lat", "lon"])
template = DataArray(dims=["lat", "lon", "parameter"])
xr.initialize_zarr(template, path)

# logic similar to:
# https://github.com/pydata/xarray/blob/a529f1d5b03279e88e3703b5a02a784838533d2c/xarray/tests/test_backends.py#L5708
blocks = get_block_slices() 
for region in blocks:
    result = model.fit(x.sel(region), y.sel(region))
    # I could assign result.data to its region in the template here, 
    # but doesn't really make sense because I want to flush to disk and discard
    result.to_zarr(path, region=region)

slevang · 2024-04-03T14:23:46Z

I guess my main point with this PR is that #8877 already introduced the majority of the "magic" to drop all indexes. If we're doing that, I don't see any reason to raise when non-region indexes are passed.

But if we think #8877 went overboard, then indeed this change doesn't make sense and we should probably revert to doing something like:

only drop indexes that are in region

max-sixty · 2024-04-03T19:20:35Z

I guess my main point with this PR is that #8877 already introduced the majority of the "magic" to drop all indexes. If we're doing that, I don't see any reason to raise when non-region indexes are passed.

Yes fair!

In that case, I would be +0.5 on merging this, at least until #8460 merges...

allow indices unrelated to the region since we drop them anyways

523f18e

slevang changed the title ~~allow indices unrelated to the region since we drop them anyways~~ Handle extra indexes for zarr region writes Apr 2, 2024

slevang mentioned this pull request Apr 2, 2024

Add initialize_zarr #8460

Draft

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle extra indexes for zarr region writes #8904

Handle extra indexes for zarr region writes #8904

slevang commented Apr 2, 2024

dcherian commented Apr 2, 2024

slevang commented Apr 2, 2024

max-sixty commented Apr 2, 2024

slevang commented Apr 2, 2024

max-sixty commented Apr 3, 2024

slevang commented Apr 3, 2024 •

edited by max-sixty

slevang commented Apr 3, 2024

max-sixty commented Apr 3, 2024

Handle extra indexes for zarr region writes #8904

Are you sure you want to change the base?

Handle extra indexes for zarr region writes #8904

Conversation

slevang commented Apr 2, 2024

dcherian commented Apr 2, 2024

slevang commented Apr 2, 2024

max-sixty commented Apr 2, 2024

slevang commented Apr 2, 2024

max-sixty commented Apr 3, 2024

slevang commented Apr 3, 2024 • edited by max-sixty

slevang commented Apr 3, 2024

max-sixty commented Apr 3, 2024

slevang commented Apr 3, 2024 •

edited by max-sixty