New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
An experimental rasterio-based Zarr storage class #2623
base: main
Are you sure you want to change the base?
Conversation
def __init__(self, dataset): | ||
self.dataset = dataset | ||
chunk_height, chunk_width = self.dataset.block_shapes[0] | ||
self._data = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self._data
is the "static" part of this store's data mapping.
self.dataset.width, | ||
), | ||
"chunks": ( | ||
1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this first revision, I chose to put each band in its own chunk. Not ideal for all situations, I'm sure. I've got a lot to learn about zarr in production.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For uncompressed flat binary data, you can break the bytes into chunks however you want. Perhaps this could be a user configurable parameter.
rasterio/zarr.py
Outdated
def __getitem__(self, key): | ||
if key in self._data: | ||
return self._data[key] | ||
elif key.startswith(Path(self.dataset.name).name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's where we intercept raster chunk lookups and translate them into rasterio windowed reads. A bit like kerchunk's byte offsets.
boundless=True, | ||
) | ||
return chunk | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This else clause is so important! Without it, you get a sorta baffling error about path '' contains an array
😅
with rasterio.open(path_rgb_byte_tif) as dataset: | ||
store = RasterioStore(dataset) | ||
z = zarr.group(store) | ||
assert (z["RGB.byte.tif"][:] == dataset.read()).all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect fidelity is always reassuring 😄
Ugh, |
My initial thought is that this functionality is targeted towards a specific file format and is not generally useful for rasterio users or require a lower level interface to GDAL. Maybe this would be a better fit in a new python package: Side note: Not hard set on the opinion, just sharing thoughts to see what you think. |
Thought this information may be helpful as a reference: xarray provides mechanisms for writing to Zarr ref. With rioxarray installed, you can use the Simple example: import xarray
xds = xarray.open_dataset("file.tif", engine="rasterio")
xds.to_zarr(...) |
@snowman2 yeah, a new package might be appropriate. This is only ~60 lines of code now, but could grow to support subdatasets, etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super exciting! A few superficial comments.
self.dataset.width, | ||
), | ||
"chunks": ( | ||
1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For uncompressed flat binary data, you can break the bytes into chunks however you want. Perhaps this could be a user configurable parameter.
rasterio/zarr.py
Outdated
"filters": None, | ||
} | ||
).encode("utf-8"), | ||
Path(self.dataset.name).name + "/.zattrs": json.dumps({}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about metadata? Is there any metadata that would be useful to expose as Zarr attrs?
pass | ||
|
||
def __len__(self): | ||
return len(self._data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably incorrect, since it doesn't include the keys.
return len(self._data) | ||
|
||
def __iter__(self): | ||
return iter(self._data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same. A store should expose the metadata keys AND the chunk keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, chunk keys could be generated as needed. But it looked like array access was going to work without these, so I skipped that for now.
Note that WarpedVRTs must be padded to multiples of 512
Here's something fairly novel (I think, though rio-xarray probably also provides this) and not supported by kerchunk, a raster reprojected lazily from EPSG:32618 (UTM) to EPSG:4326 (long/lat), and expressed as a Zarr array. import math
import matplotlib.pyplot as plt
import numpy
import rasterio
from rasterio.vrt import WarpedVRT
from rasterio.zarr import RasterioStore
import zarr
with rasterio.open("tests/data/RGB.byte.tif") as dataset:
# WarpedVRTs don't support unbounded reads and must be padded.
vrt_height = 512 * int(math.ceil(dataset.height / 512))
vrt_width = 512 * int(math.ceil(dataset.width / 512))
with WarpedVRT(dataset, crs="EPSG:4326", height=vrt_height, width=vrt_width) as vrt:
store = RasterioStore(vrt)
print(vrt.name)
z = zarr.group(store)
print(list(store))
rgb = z[vrt.name][:]
plt.imshow(numpy.moveaxis(rgb, 0, -1))
plt.show() |
For my future self, the xarray example (with rioxarray): import rasterio
from rasterio.vrt import WarpedVRT
import xarray
with rasterio.open("tests/data/RGB.byte.tif") as dataset:
with WarpedVRT(dataset, crs="EPSG:4326") as vrt:
ds = xarray.open_dataset(vrt, engine="rasterio", mask_and_scale=False)
ds.band_data.plot.imshow(rgb="band") |
@sgillies Hope you don't mind I tweaked the xarray example. |
Access to good ole flat rasters via a hierarchical data API could be handy if you want to zarr-ify some GeoTIFFs, JP2s, or VRTs. That's what this Zarr storage adapter does. This work is related to #1759.
RasterioStore represents a flat raster as a group with one array and translates flat raster blocks or strips into zarr chunks.