Optimize writes to existing Zarr stores. #8875

dcherian · 2024-03-25T15:32:47Z

We need to read existing variables to make sure we append or write to a region with the right encoding. Currently we decode all arrays in a Zarr group. Instead only decode those arrays for which we require encoding information.

We need to read existing variables to make sure we append or write to a region with the right encoding. Currently we request all arrays in a Zarr group. Instead only request those arrays for which we require encoding information.
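The idea can be sketched without xarray itself. The snippet below is a minimal illustration under stated assumptions, not the code in this PR: `FakeStore`, `read_array`, and `existing_vars_for_write` are hypothetical names, and the store is a plain dict that counts how many arrays are requested.

```python
from __future__ import annotations

from typing import Any


class FakeStore:
    """Hypothetical stand-in for a Zarr group; counts array requests."""

    def __init__(self, arrays: dict[str, dict[str, Any]]) -> None:
        self._arrays = arrays
        self.reads = 0

    def names(self) -> list[str]:
        # Assumption: listing names is cheap compared to requesting each array.
        return list(self._arrays)

    def read_array(self, name: str) -> dict[str, Any]:
        # Each call models one request for an array's metadata/encoding.
        self.reads += 1
        return self._arrays[name]


def existing_vars_for_write(
    store: FakeStore, names_to_write: list[str]
) -> dict[str, dict[str, Any]]:
    """Request only the existing arrays that are about to be written."""
    wanted = set(names_to_write)
    return {name: store.read_array(name) for name in store.names() if name in wanted}


store = FakeStore(
    {
        "temperature": {"dtype": "int16", "scale_factor": 0.01},
        "pressure": {"dtype": "float32"},
        "time": {"dtype": "int64", "units": "days since 2000-01-01"},
    }
)

# Appending only "temperature" should touch one array, not all three.
existing = existing_vars_for_write(store, ["temperature"])
print(sorted(existing), store.reads)
```

In this toy model the cost of a write scales with the number of variables being written rather than with the size of the group.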

dcherian · 2024-03-25T17:52:58Z

xarray/backends/zarr.py

@@ -623,7 +623,12 @@ def store(
            # avoid needing to load index variables into memory.
            # TODO: consider making loading indexes lazy again?
            existing_vars, _, _ = conventions.decode_cf_variables(
-                self.get_variables(), self.get_attrs()
+                {


feels like we should also be skipping this for mode="w" to allow overwriting the existing encoding?

Can you expand on what you mean? Is that because we would have just written these vars so already know their values/schema?

mode="w" means overwrite so we shouldn't care about encoding on disk, no?

I added a test, the encoding does get updated as expected with mode="w", so presumably zarr is nuking the store with mode="w".
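A toy model of that presumption, assuming (as the comment above does) that opening with mode="w" clears the store before anything is read back. `open_group` here is a hypothetical helper operating on a plain dict, not a zarr or xarray function.

```python
def open_group(store: dict, mode: str) -> dict:
    """Toy model: mode="w" clears the store; other modes keep its contents."""
    if mode == "w":
        store.clear()
    return store


# Encoding already "on disk" versus the encoding of the dataset being written.
on_disk = {"temperature": {"dtype": "int16"}}
new_encoding = {"temperature": {"dtype": "float32"}}

group = open_group(on_disk, mode="w")

# After clearing, there are no existing variables whose encoding could conflict.
existing_encoding = dict(group)

# The new encoding is written as-is.
group.update(new_encoding)
print(existing_encoding, on_disk)
```

Under this assumption, reading existing variables with mode="w" finds nothing, so the on-disk encoding cannot override the new one.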

* main: (26 commits)
  [pre-commit.ci] pre-commit autoupdate (pydata#8900)
  Bump the actions group with 1 update (pydata#8896)
  New empty whatsnew entry (pydata#8899)
  Update reference to 'Weighted quantile estimators' (pydata#8898)
  2024.03.0: Add whats-new (pydata#8891)
  Add typing to test_groupby.py (pydata#8890)
  Avoid in-place multiplication of a large value to an array with small integer dtype (pydata#8867)
  Check for aligned chunks when writing to existing variables (pydata#8459)
  Add dt.date to plottable types (pydata#8873)
  Optimize writes to existing Zarr stores. (pydata#8875)
  Allow multidimensional variable with same name as dim when constructing dataset via coords (pydata#8886)
  Don't allow overwriting indexes with region writes (pydata#8877)
  Migrate datatree.py module into xarray.core. (pydata#8789)
  warn and return bytes undecoded in case of UnicodeDecodeError in h5netcdf-backend (pydata#8874)
  groupby: Dispatch quantile to flox. (pydata#8720)
  Opt out of auto creating index variables (pydata#8711)
  Update docs on view / copies (pydata#8744)
  Handle .oindex and .vindex for the PandasMultiIndexingAdapter and PandasIndexingAdapter (pydata#8869)
  numpy 2.0 copy-keyword and trapz vs trapezoid (pydata#8865)
  upstream-dev CI: Fix interp and cumtrapz (pydata#8861)
  ...

Optimize writes to existing Zarr stores.

eb37aed

We need to read existing variables to make sure we append or write to a region with the right encoding. Currently we request all arrays in a Zarr group. Instead only request those arrays for which we require encoding information.

dcherian force-pushed the optimize-zarr-appends branch from 9bc43dd to eb37aed Compare March 25, 2024 15:44

dcherian requested review from shoyer and max-sixty and removed request for shoyer March 25, 2024 17:48

max-sixty approved these changes Mar 26, 2024

dcherian and others added 2 commits March 28, 2024 09:39

Add test

e0a3e10

Merge branch 'main' into optimize-zarr-appends

0dc71d4

dcherian added the plan to merge Final call for comments label Mar 28, 2024

fix test

6d89099

dcherian merged commit 5bf2cf4 into pydata:main Mar 29, 2024
29 checks passed

dcherian deleted the optimize-zarr-appends branch March 29, 2024 14:35

dcherian Mar 25, 2024

jhamman Mar 25, 2024

dcherian Mar 28, 2024

dcherian Mar 28, 2024
