Rechunk group to group #79

Open
valpesendorfer opened this issue Jan 15, 2021 · 2 comments

@valpesendorfer

Hello,

I'm experimenting with multi-group zarrs, where each group represents a separate tile. The tiles themselves share the same structure (same dimensions, variables, etc.) and are created with xarray.

Here's a small example of what it can look like:

/
 ├── h19v07
 │   ├── band (1200, 1200, 5) int16
 │   ├── time (5,) int64
 │   ├── x (1200,) float64
 │   └── y (1200,) float64
 ├── h19v08
 │   ├── band (1200, 1200, 5) int16
 │   ├── time (5,) int64
 │   ├── x (1200,) float64
 │   └── y (1200,) float64
 ├── h20v07
 │   ├── band (1200, 1200, 5) int16
 │   ├── time (5,) int64
 │   ├── x (1200,) float64
 │   └── y (1200,) float64
 └── h20v08
     ├── band (1200, 1200, 5) int16
     ├── time (5,) int64
     ├── x (1200,) float64
     └── y (1200,) float64

I was wondering if there's a way to either

  • rechunk the band array of each group into a new zarr that has the same structure, or
  • rechunk a whole group to another group in some standard way.

The only way I've managed to rechunk the groups so far is to iterate over them and run rechunk on each one individually, with the group path (i.e. target.zarr/group) set as target_store.
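For readers following along, the per-group workaround described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the code actually used in this issue: the helper names `per_group_stores` and `rechunk_each_group`, the per-group temp store layout, and the `"1GB"` memory limit are all hypothetical choices.

```python
def per_group_stores(target_root, temp_root, group_names):
    """Map each group name to its (target_store, temp_store) paths."""
    return {
        name: (f"{target_root}/{name}", f"{temp_root}/{name}")
        for name in group_names
    }


def rechunk_each_group(source_path, target_root, temp_root, target_chunks, max_mem="1GB"):
    """Build and run one rechunker plan per top-level group (assumed workflow)."""
    # Imported lazily so the path helper above works without these packages.
    import zarr
    from rechunker import rechunk

    source = zarr.open(source_path, mode="r")
    stores = per_group_stores(target_root, temp_root, source.group_keys())
    for name, (target_store, temp_store) in stores.items():
        # Each group is treated as its own flat group, which rechunker supports.
        plan = rechunk(source[name], target_chunks, max_mem, target_store,
                       temp_store=temp_store)
        plan.execute()


# Hypothetical usage:
# rechunk_each_group("raw.zarr", "target.zarr", "temp.zarr", {"band": (256, 256, 5)})
```

Giving every group its own temp store is an assumption made here to keep intermediate data from different plans apart.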

Thanks

@rabernat
Member

Hi @valpesendorfer! 👋 Thanks for this interesting issue.

You're correct: right now, we only support Zarr arrays or flat groups (no groups within groups). The challenge here is that Zarr supports arbitrarily deep nesting of groups, so it's hard for me to think of the right behavior.

So I have a question for you. What API syntax (call to rechunk) would you like to see here? Specifically, how would you specify target_chunks for these nested groups?

@valpesendorfer
Author

valpesendorfer commented Jan 15, 2021

👋 and may I add, thanks for all your work.

First, it's good to know I'm not missing something essential; I've only recently started working with zarrs.

To answer your question: I don't know exactly.

Ideally, though, rechunker would take care of the manual work of iterating over groups. Since all of this is expected to run on a dask cluster (in my case), that could also be more efficient, as all the tasks could be submitted at once. I haven't yet tried generating all the plans first and then executing them on the cluster; I was too worried about existing temporary storage killing the processes.

In this specific case, I don't want to do anything fancy, meaning I want to keep the same structure, just with the band array re-chunked.

So to specify target_chunks, this is what I currently do for each group:

target_chunks = {
    'band': (256, 256, 5),
}

This could be extended to each group that has an array band.

Or, more verbosely, using a nested dictionary with a group : array : chunk syntax, something like

target_chunks = {

    "/full/path/to/group1": {
        "band": (256, 256, 5),
    },

    "/full/path/to/group2": {
        "band": (256, 256, 5),
    }
}

Which could be generated dynamically:

import zarr

# Open the source store read-only and build one entry per top-level group.
zarr_raw = zarr.open("raw.zarr", mode="r")

target_chunks = {group: {"band": (256, 256, 5)} for group in zarr_raw.group_keys()}
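The dynamic generation above assumes every group holds a band array. A slightly more defensive sketch could expand the flat per-array spec only into groups that actually contain the named arrays. The helper `broadcast_chunks` and its plain-dict input are hypothetical and not part of rechunker; with zarr, the input mapping could presumably be built from each group's `array_keys()`.

```python
def broadcast_chunks(group_arrays, per_array_chunks):
    """Expand a flat {array: chunks} spec into {group: {array: chunks}}.

    group_arrays maps a group name to the names of the arrays it contains.
    Groups that contain none of the requested arrays are left out.
    """
    nested = {}
    for group, arrays in group_arrays.items():
        matching = {
            name: chunks
            for name, chunks in per_array_chunks.items()
            if name in arrays
        }
        if matching:
            nested[group] = matching
    return nested


# Example layout mirroring the tiles in this issue, plus one hypothetical
# group ("metadata") that has no band array.
layout = {
    "h19v07": {"band", "time", "x", "y"},
    "h19v08": {"band", "time", "x", "y"},
    "metadata": {"notes"},
}
target_chunks = broadcast_chunks(layout, {"band": (256, 256, 5)})
```

Here `target_chunks` has entries for h19v07 and h19v08 only, each mapping band to (256, 256, 5).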

Edit

Ok, I see now the code above doesn't work for nested groups ... that would require something smarter. I've only worked with groups as exemplified above, and I don't think there's any reason for me to go deeper.
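For nested groups, one possible "smarter" approach is a recursive walk that records the full path of every group containing the array. The sketch below is only an illustration: `nested_target_chunks` and the `FakeGroup` stand-in are hypothetical, and the walker merely assumes the group object offers `array_keys()` and a `groups()` method yielding (name, subgroup) pairs, as zarr groups are expected to.

```python
def nested_target_chunks(group, array_name, chunks, prefix=""):
    """Collect {group_path: {array_name: chunks}} for every group, at any
    depth, that holds an array called array_name."""
    spec = {}
    if array_name in set(group.array_keys()):
        spec[prefix or "/"] = {array_name: chunks}
    for name, subgroup in group.groups():
        # Recurse with the child's path so keys are full paths from the root.
        spec.update(
            nested_target_chunks(subgroup, array_name, chunks, f"{prefix}/{name}")
        )
    return spec


class FakeGroup:
    """Tiny in-memory stand-in for a zarr group, used only for demonstration."""

    def __init__(self, arrays=(), groups=None):
        self._arrays = list(arrays)
        self._groups = dict(groups or {})

    def array_keys(self):
        return iter(self._arrays)

    def groups(self):
        return iter(self._groups.items())


# Hypothetical hierarchy: one tile at the top level, one nested a level deeper.
root = FakeGroup(groups={
    "h19v07": FakeGroup(arrays=["band", "time", "x", "y"]),
    "africa": FakeGroup(groups={"h20v08": FakeGroup(arrays=["band", "time"])}),
})
spec = nested_target_chunks(root, "band", (256, 256, 5))
```

The resulting `spec` is keyed by "/h19v07" and "/africa/h20v08", which matches the group : array : chunk syntax proposed above.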
