Zarr: Optimize appending #8998

Merged
merged 7 commits into from May 10, 2024

dcherian (Contributor) commented May 3, 2024

Builds on #8997

Continues moving checks one level down into ZarrStore.store, which already performs several checks and already loops over the existing variables in the store.

return region


def _validate_datatypes_for_zarr_append(zstore, dataset):
Contributor Author

This is moved to backends/zarr.py

existing dtype.
"""

existing_vars = zstore.get_variables()
Contributor Author

This PR removes this get_variables call. The check now runs in ZarrStore.store, which already requests the existing variables and already runs other checks on them.

if self._mode in ["a", "a-", "r+"]:
    _validate_datatypes_for_zarr_append(
        vn, existing_vars[vn], variables[vn]
    )
Contributor Author

The check now runs here, where we have both the variables already in the store and the new variables to be written.
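
To make the saving concrete, here is a minimal sketch of the refactor's shape, not xarray's actual code. `FakeStore`, `check_dtype`, `store_before`, and `store_after` are hypothetical names, dtypes are plain strings, and the exact-match rule is a simplification (the real check may be more permissive). It only illustrates that validating inside the existing loop avoids a second fetch of the store's variables.

```python
class FakeStore:
    """Hypothetical stand-in for a Zarr-backed store; counts metadata reads."""

    def __init__(self, dtypes):
        self._dtypes = dict(dtypes)  # variable name -> dtype string
        self.reads = 0

    def get_variables(self):
        self.reads += 1  # each call models one round of requests to the store
        return dict(self._dtypes)


def check_dtype(name, existing_dtype, new_dtype):
    # Simplified rule for illustration only: require an exact dtype match.
    if existing_dtype != new_dtype:
        raise ValueError(
            f"dtype mismatch for {name!r}: store has {existing_dtype}, got {new_dtype}"
        )


def store_before(store, new_dtypes, mode):
    # Old layout: a separate validation pass fetches the existing variables ...
    if mode in ["a", "a-", "r+"]:
        existing = store.get_variables()
        for name, dtype in new_dtypes.items():
            if name in existing:
                check_dtype(name, existing[name], dtype)
    # ... and the write path then fetches them a second time.
    existing = store.get_variables()
    return sorted(set(new_dtypes) & set(existing))


def store_after(store, new_dtypes, mode):
    # New layout: fetch once, then validate inside the loop that already exists.
    existing = store.get_variables()
    overlap = sorted(set(new_dtypes) & set(existing))
    for name in overlap:
        if mode in ["a", "a-", "r+"]:
            check_dtype(name, existing[name], new_dtypes[name])
    return overlap


old_store = FakeStore({"x": "int64", "temp": "float64"})
new_store = FakeStore({"x": "int64", "temp": "float64"})
store_before(old_store, {"x": "int64"}, mode="a")
store_after(new_store, {"x": "int64"}, mode="a")
print(old_store.reads, new_store.reads)
```

In this toy model the old layout reads the store's variables twice per append and the new layout reads them once.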

@@ -1721,28 +1685,12 @@ def to_zarr(
)

if mode in ["a", "a-", "r+"]:
    _validate_datatypes_for_zarr_append(zstore, dataset)
Contributor Author

moved to ZarrStore.store

raise ValueError(
    f"variable {var_name!r} already exists, but encoding was provided"
)
if mode == "r+":
Contributor Author

moved to ZarrStore.store

@@ -612,26 +640,58 @@ def store(
import zarr

existing_keys = tuple(self.zarr_group.array_keys())

if self._mode == "r+":
Contributor Author

checks moved from backends/api.py

xarray/backends/zarr.py
@@ -696,7 +756,6 @@ def set_variables(self, variables, check_encoding_set, writer, unlimited_dims=No

for vn, v in variables.items():
    name = _encode_variable_name(vn)
    check = vn in check_encoding_set
Contributor Author

reducing some indirection.

@dcherian dcherian marked this pull request as ready for review May 7, 2024 15:53
"``mode='r+'``. To allow writing new variables, set ``mode='a'``."
)

if self._append_dim is not None and self._append_dim not in existing_keys:
Contributor Author

The condition self._append_dim not in existing_keys is a new addition. When the append dimension is already known to exist in the store as an array, we needn't parse every array in the store to look for it.
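
A minimal sketch of that short-circuit, under stated assumptions: `arrays` is a hypothetical mapping from array name to its dimension names, and `find_existing_dims` stands in for the expensive step of opening every array. Neither is xarray's real API. The error text mirrors the message quoted in this PR.

```python
def find_existing_dims(arrays):
    # Hypothetical expensive step: inspect every array to collect its dimensions.
    dims = set()
    for dim_names in arrays.values():
        dims.update(dim_names)
    return dims


def validate_append_dim(append_dim, arrays):
    """Return "skipped" or "scanned" depending on whether the full scan ran."""
    existing_keys = tuple(arrays)
    # Only scan when the dimension is not already a known array name.
    if append_dim is not None and append_dim not in existing_keys:
        existing_dims = find_existing_dims(arrays)
        if append_dim not in existing_dims:
            raise ValueError(
                f"append_dim={append_dim!r} does not match any existing "
                f"dataset dimensions {existing_dims}"
            )
        return "scanned"
    return "skipped"


arrays = {"x": ("x",), "temp": ("x", "y")}
print(validate_append_dim("x", arrays))  # "x" is an array name, so no scan
print(validate_append_dim("y", arrays))  # "y" has no array of its own, so scan
```

The scan still runs for a dimension that has no coordinate array of its own, so the validation itself is unchanged. Only the common case gets cheaper.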

@dcherian dcherian requested review from max-sixty and jhamman May 7, 2024 16:04
f"append_dim={append_dim!r} does not match any existing "
f"dataset dimensions {existing_dims}"
)
if encoding and mode in ["a", "a-", "r+"]:
Contributor Author

When no encoding is provided, we can now skip the request for existing array names entirely.
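
The guard can be sketched as below. This is an illustration rather than the code in backends/api.py: `check_encoding_on_append` and `list_array_names` are hypothetical names, and the callable stands in for whatever request lists the store's arrays.

```python
def check_encoding_on_append(encoding, mode, list_array_names):
    """Raise if encoding is given for a variable that already exists."""
    # Short-circuit: with no encoding there is nothing to conflict with,
    # so the store is never asked for its array names.
    if encoding and mode in ["a", "a-", "r+"]:
        for var_name in list_array_names():
            if var_name in encoding:
                raise ValueError(
                    f"variable {var_name!r} already exists, but encoding was provided"
                )


requests = []


def list_array_names():
    requests.append("array_keys")  # record each simulated request to the store
    return ["x", "temp"]


check_encoding_on_append({}, "a", list_array_names)  # no request made
check_encoding_on_append(None, "r+", list_array_names)  # no request made
check_encoding_on_append({"new_var": {"dtype": "f4"}}, "a", list_array_names)
print(len(requests))
```

Because `encoding` is tested first, an empty or missing encoding never triggers the listing.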

jhamman (Member) commented May 9, 2024

@dcherian -- this looks great! The logic makes sense and things seem to work. Could you share some diagnostics on what this will mean in terms of traffic between Xarray and the Zarr store? Also, appreciating that this is just a refactor, do you think a new test would help avoid "unnecessary" traffic in the future?

max-sixty (Collaborator) left a comment

Looks good! (Though I don't have that much context down at this level...)

@dcherian dcherian mentioned this pull request May 9, 2024
@TomNicholas TomNicholas added the topic-zarr Related to zarr storage library label May 9, 2024
dcherian (Contributor Author) commented May 9, 2024

Pushed a regression test that covers #8997, #8893, and this PR.

It'd be nice to add this as an asv benchmark, but my concern is that we won't catch a regression, since the benchmarks don't always run.
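
For readers unfamiliar with this style of regression test, the idea can be sketched with a dict-like store that counts how often each operation is called. `CountingStore` is a hypothetical helper, not the one used in xarray's test suite. The counter names mirror the `iter`, `contains`, `setitem`, and `getitem` keys that appear in the test, while `listdir` and `list_prefix` are omitted because a plain mapping has no such methods.

```python
from collections import Counter
from collections.abc import MutableMapping


class CountingStore(MutableMapping):
    """In-memory key-value store that records how often each operation runs."""

    def __init__(self):
        self._data = {}
        self.counts = Counter()

    def __getitem__(self, key):
        self.counts["getitem"] += 1
        return self._data[key]

    def __setitem__(self, key, value):
        self.counts["setitem"] += 1
        self._data[key] = value

    def __delitem__(self, key):
        self.counts["delitem"] += 1
        del self._data[key]

    def __iter__(self):
        self.counts["iter"] += 1
        return iter(self._data)

    def __len__(self):
        return len(self._data)

    def __contains__(self, key):
        # Overridden so membership tests are counted separately from reads.
        self.counts["contains"] += 1
        return key in self._data


store = CountingStore()
store["a/.zarray"] = b"{}"
store["a/0"] = b"\x00"
_ = store["a/.zarray"]
_ = "a/0" in store
_ = list(store)

# A regression test pins the counts; any extra request changes the dict.
expected = {"setitem": 2, "getitem": 1, "contains": 1, "iter": 1}
assert dict(store.counts) == expected
```

Pinning exact counts makes the test fail on any extra request, which is what catches traffic regressions that a benchmark run only occasionally would miss.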

modified.to_zarr(store, mode="a", append_dim="x")
# v2024.03.0: {'iter': 6, 'contains': 2, 'setitem': 5, 'getitem': 10, 'listdir': 6, 'list_prefix': 0}
# 6057128b: {'iter': 5, 'contains': 2, 'setitem': 5, 'getitem': 10, "listdir": 5, "list_prefix": 0}
expected = {
dcherian (Contributor Author), May 9, 2024

@jhamman 👁️ This compares the v2024.03.0 release to current main (6057128).

ds.to_zarr(store, region={"x": slice(None)})
# v2024.03.0: {'iter': 5, 'contains': 2, 'setitem': 1, 'getitem': 6, 'listdir': 5, 'list_prefix': 0}
# 6057128b: {'iter': 4, 'contains': 2, 'setitem': 1, 'getitem': 5, 'listdir': 4, 'list_prefix': 0}
expected = {
Contributor Author

@jhamman 👁️

jhamman (Member) left a comment

Good stuff @dcherian!

@dcherian dcherian merged commit 6fe1234 into pydata:main May 10, 2024
25 of 28 checks passed
@dcherian dcherian deleted the cleanup-zarr-append branch May 10, 2024 16:58
andersy005 added a commit that referenced this pull request May 12, 2024
* main:
  Add whatsnew entry for #8974 (#9022)
  Zarr: Optimize appending (#8998)