Test out Xarray Tensorstore Backend #198

jacobbieker · 2023-05-15T09:51:56Z

https://github.com/google/xarray-tensorstore

Detailed Description

We have been trying to speed up access from zarr for quite a while. TensorStore might help, and Google recently released an xarray backend that uses TensorStore.

@JackKelly

JackKelly · 2023-05-15T10:13:25Z

SGTM!

Relevant links:

And reasons to believe that TensorStore might be faster than zarr-python:

Documentation: Performance Comparison google/tensorstore#49 (comment)

assafshouval · 2023-08-16T15:40:12Z

I would like to take this issue. I'm new here. How does this stuff work?
Where should I start looking?

jacobbieker · 2023-08-16T15:55:10Z

Hi! That's great! For this, the primary place that things would need to be updated would be the files in ocf_datapipes/load, as this issue is primarily concerned with opening up and reading the Zarrs with tensorstore. The two main ones you would want to look at would be

ocf_datapipes/ocf_datapipes/load/satellite.py

Line 56 in 0871498

def open_sat_data(

for satellite data, which is available on GCP here if you want some more to test with, other than the unit tests which include a bit of satellite data. And for NWP data sources, primarily the ICON data:

ocf_datapipes/ocf_datapipes/load/nwp/providers/icon.py

Line 8 in 0871498

def open_icon_eu(zarr_path) -> xr.Dataset:

We have an archive of ICON data in Zarr format on HuggingFace here where you can download an example or two to try things out. There is also some data included in this repo for the unit tests, in case you want to try with that instead.

Ideally, it should be possible to just swap out the xr.open_zarr with xarray_tensorstore.open_zarr with minimal changes. There are some caveats though:

From some initial testing, TensorStore possibly doesn't work very well with our compression algorithm, ocf_blosc2. @dfulu might have more info.
We don't want to call xarray_tensorstore.read() until we have cropped and picked out the examples to use; otherwise it seems like it might read the whole dataset into memory, which for most of our data sources is multiple TBs.
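
A minimal sketch of the ordering the second caveat asks for: open lazily, crop, and only then start reading. The helper name `open_cropped` and its `crop` callback are hypothetical and not part of ocf_datapipes; the `open_zarr` and `read` parameters exist only so the ordering can be exercised with stand-ins, and default to the `xarray_tensorstore` functions of the same name.

```python
from typing import Any, Callable, Optional


def open_cropped(
    zarr_path: str,
    crop: Callable[[Any], Any],
    open_zarr: Optional[Callable[[str], Any]] = None,
    read: Optional[Callable[[Any], Any]] = None,
) -> Any:
    """Open a Zarr lazily, crop it, and only then start reading the selection.

    Hypothetical helper: `crop` is any function that takes the opened dataset
    and returns a smaller, still lazy, selection.
    """
    if open_zarr is None or read is None:
        # Imported here so the helper can be tried with stand-in functions
        # when xarray_tensorstore is not installed.
        import xarray_tensorstore

        open_zarr = open_zarr or xarray_tensorstore.open_zarr
        read = read or xarray_tensorstore.read

    dataset = open_zarr(zarr_path)  # no chunk data is read yet
    example = crop(dataset)  # indexing stays lazy
    return read(example)  # reading starts only for the cropped selection
```

With the real backend this might be called as `open_cropped(path, crop=lambda ds: ds.isel(time=slice(0, 12)))`, where the `time` slice is only an illustration, followed by `.compute()` on the result once the NumPy arrays are needed.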

assafshouval · 2023-08-20T09:40:51Z

@jacobbieker, just an update: I tried switching to xarray_tensorstore.open_zarr(path), but I hit a problem on the next line, at sortby('time'):
dataset = dataset.drop_duplicates("time").sortby("time")
I've encountered the same problem and opened an issue here: https://github.com/google/xarray-tensorstore/issues/1#issue-1855187534
I'm wondering whether I don't have the dependencies right...

jacobbieker · 2023-08-21T10:00:54Z

> @jacobbieker, just updating that I've tried to switch to xarray_tensorstore.open_zarr(path), but I've got the following problem in the line after when sortby('time'). dataset = dataset.drop_duplicates("time").sortby("time") I've encountered the same problem and opened an issue here: https://github.com/google/xarray-tensorstore/issues/1#issue-1855187534 I'm wondering whether I don't have the dependencies right...

Hi, thanks for looking into this! Since you got to that line, you can open the Zarr, so I don't think the dependencies are the problem. The TensorStore implementation might not be mature enough for this yet, but I'm not really sure. I thought xarray-tensorstore uses zarr-python for the metadata, so the sorting should work with that, but yeah, sorry it's not much help.

assafshouval · 2023-08-21T10:08:41Z

> @jacobbieker, just updating that I've tried to switch to xarray_tensorstore.open_zarr(path), but I've got the following problem in the line after when sortby('time'). dataset = dataset.drop_duplicates("time").sortby("time") I've encountered the same problem and opened an issue here: https://github.com/google/xarray-tensorstore/issues/1#issue-1855187534 I'm wondering whether I don't have the dependencies right...

> I thought xarray-tensorstore uses Zarr-python for the metadata, so the sorting should work with that, but yeah, sorry its not much help.

Yeah, they do use zarr-python, but the problem is when trying to deep copy the dataset.
I'll invest a little more time exploring this, and see if I can advance this further, if not I'll leave it for now.
Thanks

shoyer · 2023-08-23T21:16:09Z

> @jacobbieker, just updating that I've tried to switch to xarray_tensorstore.open_zarr(path), but I've got the following problem in the line after when sortby('time'). dataset = dataset.drop_duplicates("time").sortby("time") I've encountered the same problem and opened an issue here: https://github.com/google/xarray-tensorstore/issues/1#issue-1855187534 I'm wondering whether I don't have the dependencies right...

This should be fixed in the latest 0.1.1 release of Xarray-Tensorstore.

assafshouval · 2023-09-08T09:01:52Z

Indeed, it works. Thanks @shoyer.
@jacobbieker I have more questions:
a. Do you have a good example that would be useful for benchmarking it?
b. It doesn't support opening a multi-file dataset, as in:
dataset = xr.open_mfdataset(zarr_path, **openmf_kwargs), for example in satellite.py, line 45. I don't know how much this scenario is worth investing time in, but it's probably worth benchmarking the first case first ...

jacobbieker · 2023-09-08T09:10:56Z

Hi, yeah, a good benchmark would be a single satellite zarr, like this one: gs://public-datasets-eumetsat-solar-forecasting/satellite/EUMETSAT/SEVIRI_RSS/v4/2023_hrv.zarr

And okay, thanks. Yeah, if it is a lot faster, we can probably find a workaround for that issue.
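
For the benchmark itself, a small timing harness is probably enough. This is a rough sketch, not an established OCF benchmark: the function names `time_call` and `compare` are made up for illustration, and it only measures wall-clock time of whatever callables it is given.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_call(fn: Callable[[], object], repeats: int = 5) -> List[float]:
    """Return the wall-clock duration in seconds of each of `repeats` calls."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def compare(
    candidates: Dict[str, Callable[[], object]], repeats: int = 5
) -> Dict[str, float]:
    """Return the median duration in seconds for each named candidate."""
    return {
        name: statistics.median(time_call(fn, repeats))
        for name, fn in candidates.items()
    }
```

The two candidates could then be something like `lambda: xr.open_zarr(path).isel(time=slice(0, 12)).load()` and `lambda: xarray_tensorstore.read(xarray_tensorstore.open_zarr(path).isel(time=slice(0, 12))).compute()`, with the same crop in both so that only the backend differs. The slice is an arbitrary example, and caching between repeats can flatter whichever backend runs later.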

jacobbieker · 2023-11-17T06:06:22Z

Update from my testing: TensorStore does not support compressors that are not on this list https://google.github.io/tensorstore/driver/zarr/index.html#json-driver/zarr/Compressor and so it can't open most of the OCF Zarrs, which are compressed with Blosc2.
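
A quick way to check a store before trying TensorStore is to look at the compressor id in an array's .zarray metadata. A minimal sketch, assuming Zarr v2 metadata: the set of supported ids below is my reading of the linked list and may be out of date, and the function name is hypothetical.

```python
import json
from typing import Optional

# Assumption: compressor ids accepted by the TensorStore zarr driver,
# taken from the documentation linked above. Check the docs before relying on it.
TENSORSTORE_ZARR_COMPRESSORS = frozenset({"blosc", "bz2", "zlib", "zstd"})


def unsupported_compressor(zarray_text: str) -> Optional[str]:
    """Return the compressor id if it is not in the supported set, else None.

    `zarray_text` is the JSON content of a Zarr v2 `.zarray` file.
    """
    metadata = json.loads(zarray_text)
    compressor = metadata.get("compressor")
    if compressor is None:
        return None  # uncompressed arrays need no codec
    codec_id = compressor.get("id")
    return None if codec_id in TENSORSTORE_ZARR_COMPRESSORS else codec_id
```

For example, `unsupported_compressor('{"compressor": {"id": "blosc2"}}')` returns `"blosc2"`. The id used here is illustrative, so check the actual .zarray files for what the OCF codec writes.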

peterdudfield · 2023-11-27T11:01:13Z

Should we close this?

jacobbieker · 2023-11-27T11:12:21Z

I think we should leave it open for now, though I'll close the issues related to adding support. @dfulu was going to add a small example notebook showing the testing results, so they are recorded here.

jacobbieker added the enhancement New feature or request label May 15, 2023

jacobbieker mentioned this issue Aug 15, 2023

Add support for TensorStore for Zarr opening #224

Closed

jacobbieker mentioned this issue Nov 17, 2023

Add Tensorstore support #242

Closed
