Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement group.members #1726

Merged
merged 20 commits into from Apr 24, 2024
Merged

implement group.members #1726

merged 20 commits into from Apr 24, 2024

Conversation

d-v-b
Copy link
Contributor

@d-v-b d-v-b commented Mar 27, 2024

Implements Group.children for v3.

Tests are passing with the LocalStore. I still need to mock up a remote store for testing RemoteStore, which will be easy to do with the pre-existing v2 tests fixtures (when this is done I will remove the in progress label).

I would appreciate a critical eye on how I implemented children, because I don't have a lot of experience with async iteration in python and maybe I did something that could be easily improved on.

A few questions, that don't necessarily need answers in this PR:

  • Do we have to call this property children? For a variety of reasons I prefer members, and that's what I settled on over in the zom zep.
  • Should we guarantee a return order for children? I think natural sort order on the keys of the subarrays / subgroups is good. This requires that we get all the sub-array / subgroup keys at once, but I think that's how the storage APIs work anyway.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

closes #1753

@d-v-b d-v-b requested a review from jhamman March 27, 2024 11:50
@d-v-b d-v-b added in progress Someone is currently working on this V3 Related to compatibility with V3 spec labels Mar 27, 2024
@pep8speaks
Copy link

pep8speaks commented Mar 27, 2024

Hello @d-v-b! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2024-04-24 13:19:26 UTC

@d-v-b
Copy link
Contributor Author

d-v-b commented Mar 27, 2024

Seems like RemoteStore isn't finished yet, so it is excluded from the set of fixtures used for tests in this PR (currently, LocalStore and MemoryStore).

   def remote_store():
>       return RemoteStore()
E       TypeError: Can't instantiate abstract class RemoteStore with abstract methods get_partial_values, list, list_dir, list_prefix, set_partial_values

Implementing RemoteStore completely is out of scope for this PR, so I'm removing the in progress label.

@d-v-b d-v-b removed the in progress Someone is currently working on this label Mar 27, 2024
Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this on @d-v-b. Left you some open-ended questions but this seems like its off to a good start.

Comment on lines 309 to 310
# and scoped to specific zarr versions
subkeys_filtered = filter(lambda v: v not in ("zarr.json", ".zgroup", ".zattrs"), subkeys)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# and scoped to specific zarr versions
subkeys_filtered = filter(lambda v: v not in ("zarr.json", ".zgroup", ".zattrs"), subkeys)
# and scoped to specific zarr versions
subkeys_filtered = filter(lambda v: v not in ("zarr.json", ".zgroup", ".zarray"), subkeys)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when will .zarray be an immediate subkey of a group? By contrast, we do expect to find .zattrs under the prefix of any v2 group with attributes.

)

raise ValueError(msg)
subkeys = await self.store_path.store.list_dir(self.store_path.path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following #1686 (comment), this method should also return an async generator. Just calling that out as a change that will come here soon, what you've done is fine for now.

What I would love to see is something like:

async for subkey in self.store_path.store.list_regex(self.store_path.path, pattern=pattern):
    yield await self.getitem(subkey)

except KeyError:
# keyerror is raised when `subkey``names an object in the store
# in which case `subkey` cannot be the name of a sub-array or sub-group.
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be missing something here but if subkey was just found in the store, why would we be getting a KeyError? This pass seems like it could silently ignore something that we care about.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If "foo" is the name of a random non-zarr-related file in the store, group.getitem("foo") raises KeyError, because "foo" is not the name of a sub-array or sub-group. So in this case, a KeyError just means there's a random file in there, that we want to ignore. I'm not sure that there are sources of KeyError (or any other exception) that we would want to handle here?

tests/conftest.py Outdated Show resolved Hide resolved
src/zarr/v3/group.py Outdated Show resolved Hide resolved
@d-v-b
Copy link
Contributor Author

d-v-b commented Apr 4, 2024

Should we guarantee a return order for children? I think natural sort order on the keys of the subarrays / subgroups is good. This requires that we get all the sub-array / subgroup keys at once, but I think that's how the storage APIs work anyway.

I think this is answered (in the negative) by the need to support asynchronous iteration over keys, and the fact that cloud backends (e.g., s3) will reasonably paginate requests for lists of keys.

…(name, AsyncArray / AsyncGroup) pairs; Group.members repackages these into a dict.
@d-v-b d-v-b changed the title implement group.children implement ---group.children--- group.members Apr 11, 2024
@d-v-b d-v-b changed the title implement ---group.children--- group.members implement group.members Apr 11, 2024
@d-v-b
Copy link
Contributor Author

d-v-b commented Apr 11, 2024

the name of this method was discussed in an offline discussion in the biweekly refactoring meeting, and the conclusion was that we will switch the name from children to members. The v2 codebase uses both terms internally, but only the term members is used in the Group API. The terms "member" or "members" appears 13 times there, while "children" or "child" are not used.

async_iterator = await coroutine
return [item async for item in async_iterator]
# async_iterator = await coroutine
return [item async for item in coroutine()]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DahnJ I had to revert some of your changes from #1692 here in order to get the code to run. The type signatures are currently incorrect, but the code works, so we should either adjust the annotations to match the actual types, or adjust the implementation (and the actual types) to match the annotation.

@jhamman, I know you have some ideas about the direction for this

@d-v-b
Copy link
Contributor Author

d-v-b commented Apr 12, 2024

a design question: in the latest changes to this PR, Group.members returns a tuple of (str, Group | Array) pairs. This is a departure from returning dict[str, Group | Array], but it facilitates streaming individual members instead of prematurely collecting them. I am wondering if "members returns a dict" vs "members returns a tuple of tuples" is a false dichotomy -- can we just do both, by making Group.members a MutableMapping? (this relates to the discussion over in #1787)

@jhamman jhamman added this to the 3.0.0.alpha milestone Apr 22, 2024
@d-v-b d-v-b requested a review from jhamman April 24, 2024 15:15
Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get this in! It unlocks a bunch of work on the hierarchy. I left a few comments around future optimizations / directions but this is a good first step.

async def group_keys(self) -> AsyncGenerator[str, None]:
async for key, value in self.members():
if isinstance(value, AsyncGroup):
yield key
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A future optimization here would be to avoid constructing the AsyncGroup and instead just generate the key based on filtering the metadata doc. As it is currently written, we construct all the members as AsyncGroup | AsyncArray, only to return their name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, we could allow Group.members to filter to only='groups' so that we never need construct arrays in these cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, i think checking the metadata should be much cheaper than instantiating AsyncArray / AsyncGroup. I will have to think about what those changes would mean for the AsyncGroup.members API, since that's what group_keys relies on.

# self._async_group.children()
# )
# return [Array(obj) if isinstance(obj, AsyncArray) else Group(obj) for obj in _children]
def members(self) -> tuple[tuple[str, Array | Group], ...]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's continue to think about the return type here. I could see a (frozen) dict being a more ergonomic return type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also a fan of something immutable and dict-like here. I think we can use #1787 to continue brainstorming

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@d-v-b d-v-b merged commit 57d6ace into zarr-developers:v3 Apr 24, 2024
14 of 15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
V3 Related to compatibility with V3 spec
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

None yet

3 participants