Test out Xarray Tensorstore Backend #198

Open
jacobbieker opened this issue May 15, 2023 · 12 comments
Labels
enhancement New feature or request

Comments

@jacobbieker
Member

https://github.com/google/xarray-tensorstore

Detailed Description

We have been trying to speed up access from Zarr for quite a while. TensorStore might help, and Google recently released a backend for xarray that uses TensorStore.

@JackKelly

@jacobbieker added the enhancement label May 15, 2023
@JackKelly
Member

@assafshouval

I would like to take this issue. I'm new here. How does this work?
Where should I start looking?

@jacobbieker
Member Author

Hi! That's great! For this, the primary place that things would need to be updated would be the files in ocf_datapipes/load, as this issue is primarily concerned with opening up and reading the Zarrs with tensorstore. The two main ones you would want to look at would be

for satellite data, which is available on GCP here if you want some more to test with, other than the unit tests which include a bit of satellite data. And for NWP data sources, primarily the ICON data:
def open_icon_eu(zarr_path) -> xr.Dataset:
We have an archive of ICON data in Zarr format on HuggingFace here, where you can download an example or two to try things out. There is also some data included in this repo for the unit tests, in case you want to try with that instead.

Ideally, it should be possible to just swap xr.open_zarr for xarray_tensorstore.open_zarr with minimal changes. There are some caveats, though:

  1. From some initial testing, TensorStore possibly doesn't work very well with our compression algorithm, ocf_blosc2. @dfulu might have more info.
  2. We don't want to call xarray_tensorstore.read() until we have cropped and picked out the examples to use; otherwise it seems like it might read the whole dataset into memory, which for most of our data sources is multiple TBs.
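
The ordering in caveat 2 can be sketched as below. This is a minimal sketch, not the ocf_datapipes implementation: `load_example`, `load_example_with_tensorstore`, and the `crop` callable are hypothetical names, and the only xarray-tensorstore calls assumed are `open_zarr` and `read`. The "time" dimension in the comment is also an assumption about the data.

```python
from typing import Any, Callable


def load_example(
    zarr_path: str,
    crop: Callable[[Any], Any],
    open_fn: Callable[[str], Any],
    read_fn: Callable[[Any], Any],
) -> Any:
    """Open lazily, crop lazily, and only then read the cropped selection."""
    dataset = open_fn(zarr_path)  # lazy open: no data variables loaded yet
    example = crop(dataset)       # e.g. lambda ds: ds.isel(time=slice(0, 12))
    return read_fn(example)       # read is only ever given the cropped selection


def load_example_with_tensorstore(zarr_path: str, crop: Callable[[Any], Any]) -> Any:
    """Same ordering, wired to xarray-tensorstore (hypothetical wrapper)."""
    # Imported here so the sketch can be loaded without the optional dependency.
    import xarray_tensorstore

    return load_example(
        zarr_path,
        crop,
        open_fn=xarray_tensorstore.open_zarr,
        read_fn=xarray_tensorstore.read,
    )
```

Passing the opener and reader as arguments keeps the crop-before-read ordering in one place, so the existing xr.open_zarr path and the TensorStore path can share it.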

@assafshouval

@jacobbieker, just an update: I switched to xarray_tensorstore.open_zarr(path), but I hit a problem on the next line, at sortby('time'):
dataset = dataset.drop_duplicates("time").sortby("time")
I've encountered the same problem and opened an issue here: https://github.com/google/xarray-tensorstore/issues/1#issue-1855187534
I'm wondering whether I don't have the dependencies right...

@jacobbieker
Member Author

> @jacobbieker, just updating that I've tried to switch to xarray_tensorstore.open_zarr(path), but I've got the following problem in the line after when sortby('time'). dataset = dataset.drop_duplicates("time").sortby("time") I've encountered the same problem and opened an issue here: https://github.com/google/xarray-tensorstore/issues/1#issue-1855187534 I'm wondering whether I don't have the dependencies right...

Hi, thanks for looking into this! If you can open the Zarr, which you can since you got to that line, I don't think the dependencies are the problem. The TensorStore implementation might not be mature enough for this yet, but I'm not really sure. I thought xarray-tensorstore uses zarr-python for the metadata, so the sorting should work with that. Sorry it's not much help.

@assafshouval

> @jacobbieker, just updating that I've tried to switch to xarray_tensorstore.open_zarr(path), but I've got the following problem in the line after when sortby('time'). dataset = dataset.drop_duplicates("time").sortby("time") I've encountered the same problem and opened an issue here: https://github.com/google/xarray-tensorstore/issues/1#issue-1855187534 I'm wondering whether I don't have the dependencies right...

> I thought xarray-tensorstore uses Zarr-python for the metadata, so the sorting should work with that, but yeah, sorry its not much help.

Yeah, they do use zarr-python, but the problem occurs when trying to deep copy the dataset.
I'll spend a little more time exploring this to see if I can take it further; if not, I'll leave it for now.
Thanks

@shoyer

shoyer commented Aug 23, 2023

> @jacobbieker, just updating that I've tried to switch to xarray_tensorstore.open_zarr(path), but I've got the following problem in the line after when sortby('time'). dataset = dataset.drop_duplicates("time").sortby("time") I've encountered the same problem and opened an issue here: https://github.com/google/xarray-tensorstore/issues/1#issue-1855187534 I'm wondering whether I don't have the dependencies right...

This should be fixed in the latest 0.1.1 release of Xarray-Tensorstore.

@assafshouval

Indeed, it works. Thanks, @shoyer.
@jacobbieker, I have more questions:
a. Do you have a good example that would be useful for benchmarking it?
b. It doesn't support opening a multi-file dataset, as in:
dataset = xr.open_mfdataset(zarr_path, **openmf_kwargs)
(for example in satellite.py, line 45). I don't know how much this scenario is worth investing time in; it may be worth benchmarking the first case before deciding.

@jacobbieker
Member Author

Hi, yeah, a good benchmark would be a single satellite zarr, like this one: gs://public-datasets-eumetsat-solar-forecasting/satellite/EUMETSAT/SEVIRI_RSS/v4/2023_hrv.zarr

And okay, thanks. If it is a lot faster, we can probably find a workaround for that issue.
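
A benchmark along these lines could be as simple as the sketch below. It is a rough harness under stated assumptions, not a measured result: `time_call`, `compare`, and `build_loaders` are hypothetical helper names, the Zarr is assumed to have a "time" dimension, and both backends are assumed to be able to open the store at all.

```python
import time
from statistics import median
from typing import Callable, Dict


def time_call(fn: Callable[[], object], repeats: int = 3) -> float:
    """Return the median wall-clock time, in seconds, of calling fn."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return median(durations)


def compare(loaders: Dict[str, Callable[[], object]], repeats: int = 3) -> Dict[str, float]:
    """Time every named loader and return {name: median seconds}."""
    return {name: time_call(fn, repeats) for name, fn in loaders.items()}


def build_loaders(zarr_path: str, n_times: int = 12) -> Dict[str, Callable[[], object]]:
    """Hypothetical loaders that read the same small crop through both backends."""
    # Imported here so the timing helpers work without the optional dependencies.
    import xarray as xr
    import xarray_tensorstore

    def crop(ds):
        # Assumption: the dataset has a "time" dimension to slice on.
        return ds.isel(time=slice(0, n_times))

    return {
        "xr.open_zarr": lambda: crop(xr.open_zarr(zarr_path)).compute(),
        "xarray_tensorstore.open_zarr": lambda: xarray_tensorstore.read(
            crop(xarray_tensorstore.open_zarr(zarr_path))
        ).compute(),
    }
```

Calling `compare(build_loaders(path))` with the satellite Zarr path would give one number per backend. Repeated reads can be served from caches, so the first run and later runs may differ noticeably.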

@jacobbieker
Member Author

Update from my testing: TensorStore does not support compressors that are not on this list: https://google.github.io/tensorstore/driver/zarr/index.html#json-driver/zarr/Compressor, so it can't open most of the OCF Zarrs, which are compressed with Blosc2.
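
One way to check a Zarr ahead of time is to read the compressor id from each array's `.zarray` metadata and compare it against the supported list. The sketch below assumes a local Zarr v2 directory store; `compressor_id` and `unsupported_arrays` are hypothetical helpers, and `ASSUMED_SUPPORTED` is a placeholder that should be checked against the list linked above rather than trusted as-is.

```python
import json
from pathlib import Path
from typing import Dict, FrozenSet, Optional, Union

# Assumption: placeholder ids only; verify against the TensorStore zarr driver docs.
ASSUMED_SUPPORTED: FrozenSet[str] = frozenset({"blosc", "bz2", "zlib", "zstd"})


def compressor_id(array_dir: Union[str, Path]) -> Optional[str]:
    """Return the compressor id stored in a Zarr v2 array's .zarray, or None."""
    meta = json.loads((Path(array_dir) / ".zarray").read_text())
    compressor = meta.get("compressor")
    return None if compressor is None else compressor.get("id")


def unsupported_arrays(
    store_dir: Union[str, Path],
    supported: FrozenSet[str] = ASSUMED_SUPPORTED,
) -> Dict[str, str]:
    """Map each array path (relative to the store) to its unsupported compressor id."""
    root = Path(store_dir)
    result = {}
    for zarray in sorted(root.rglob(".zarray")):
        cid = compressor_id(zarray.parent)
        # Uncompressed arrays (compressor is null) are not flagged.
        if cid is not None and cid not in supported:
            result[str(zarray.parent.relative_to(root))] = cid
    return result
```

An empty result from `unsupported_arrays(path)` would suggest the compressors are not the blocker for that store; a non-empty one names the arrays that would need recompressing.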

@peterdudfield
Contributor

Should we close this?

@jacobbieker
Member Author

I think we should leave it open for now, though I'll close the issues related to adding support. @dfulu was going to add a small example notebook showing the testing results, so that they are recorded here.


5 participants