Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

An experimental rasterio-based Zarr storage class #2623

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from
Draft

Conversation

sgillies
Copy link
Member

@sgillies sgillies commented Oct 27, 2022

Access to good ole flat rasters via a hierarchical data API could be handy if you want to zarr-ify some GeoTIFFs, JP2s, or VRTs. That's what this Zarr storage adapter does. This work is related to #1759.

RasterioStore represents a flat raster as a group with one array and translates flat raster blocks or strips into zarr chunks.

import matplotlib.pyplot as plt
import numpy
import rasterio
from rasterio.zarr import RasterioStore
import zarr

with rasterio.open("tests/data/RGB.byte.tif") as dataset:
    store = RasterioStore(dataset)
    z = zarr.group(store)
    rgb = z["RGB.byte.tif"][:]
    plt.imshow(numpy.moveaxis(rgb, 0, -1))
    plt.show()

Figure_1

@sgillies sgillies self-assigned this Oct 27, 2022
def __init__(self, dataset):
self.dataset = dataset
chunk_height, chunk_width = self.dataset.block_shapes[0]
self._data = {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self._data is the "static" part of this store's data mapping.

self.dataset.width,
),
"chunks": (
1,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this first revision, I chose to put each band in its own chunk. Not ideal for all situations, I'm sure. I've got a lot to learn about zarr in production.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For uncompressed flat binary data, you can break the bytes into chunks however you want. Perhaps this could be a user configurable parameter.

rasterio/zarr.py Outdated
def __getitem__(self, key):
if key in self._data:
return self._data[key]
elif key.startswith(Path(self.dataset.name).name):
Copy link
Member Author

@sgillies sgillies Oct 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's where we intercept raster chunk lookups and translate them into rasterio windowed reads. A bit like kerchunk's byte offsets.

boundless=True,
)
return chunk
else:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This else clause is so important! Without it, you get a sorta baffling error about path '' contains an array 😅

with rasterio.open(path_rgb_byte_tif) as dataset:
store = RasterioStore(dataset)
z = zarr.group(store)
assert (z["RGB.byte.tif"][:] == dataset.read()).all()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect fidelity is always reassuring 😄

@sgillies
Copy link
Member Author

sgillies commented Oct 27, 2022

Ugh, Failed to build asciitree. I swear this worked fine on my machine.

@snowman2
Copy link
Member

My initial thought is that this functionality is targeted towards a specific file format and is not generally useful for rasterio users or require a lower level interface to GDAL. Maybe this would be a better fit in a new python package: rio-zarr.

Side note: Not hard set on the opinion, just sharing thoughts to see what you think.

@snowman2
Copy link
Member

Thought this information may be helpful as a reference:

xarray provides mechanisms for writing to Zarr ref. With rioxarray installed, you can use the rasterio engine to open rasters as an xarray.Dataset ref and can write to Zarr from there.

Simple example:

import xarray

xds = xarray.open_dataset("file.tif", engine="rasterio")
xds.to_zarr(...)

@sgillies
Copy link
Member Author

@snowman2 yeah, a new package might be appropriate. This is only ~60 lines of code now, but could grow to support subdatasets, etc.

Copy link

@rabernat rabernat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super exciting! A few superficial comments.

self.dataset.width,
),
"chunks": (
1,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For uncompressed flat binary data, you can break the bytes into chunks however you want. Perhaps this could be a user configurable parameter.

rasterio/zarr.py Outdated
"filters": None,
}
).encode("utf-8"),
Path(self.dataset.name).name + "/.zattrs": json.dumps({}),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about metadata? Is there any metadata that would be useful to expose as Zarr attrs?

pass

def __len__(self):
return len(self._data)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably incorrect, since it doesn't include the keys.

return len(self._data)

def __iter__(self):
return iter(self._data)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same. A store should expose the metadata keys AND the chunk keys.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, chunk keys could be generated as needed. But it looked like array access was going to work without these, so I skipped that for now.

Note that WarpedVRTs must be padded to multiples of 512
@sgillies
Copy link
Member Author

sgillies commented Oct 31, 2022

Here's something fairly novel (I think, though rio-xarray probably also provides this) and not supported by kerchunk, a raster reprojected lazily from EPSG:32618 (UTM) to EPSG:4326 (long/lat), and expressed as a Zarr array.

import math

import matplotlib.pyplot as plt
import numpy
import rasterio
from rasterio.vrt import WarpedVRT
from rasterio.zarr import RasterioStore
import zarr

with rasterio.open("tests/data/RGB.byte.tif") as dataset:
    # WarpedVRTs don't support unbounded reads and must be padded.
    vrt_height = 512 * int(math.ceil(dataset.height / 512))
    vrt_width = 512 * int(math.ceil(dataset.width / 512))
    with WarpedVRT(dataset, crs="EPSG:4326", height=vrt_height, width=vrt_width) as vrt:
        store = RasterioStore(vrt)
        print(vrt.name)
        z = zarr.group(store)
        print(list(store))
        rgb = z[vrt.name][:]
        plt.imshow(numpy.moveaxis(rgb, 0, -1))
        plt.show()

Figure_2

@sgillies
Copy link
Member Author

sgillies commented Oct 31, 2022

For my future self, the xarray example (with rioxarray):

import rasterio
from rasterio.vrt import WarpedVRT
import xarray

with rasterio.open("tests/data/RGB.byte.tif") as dataset:
    with WarpedVRT(dataset, crs="EPSG:4326") as vrt:
        ds = xarray.open_dataset(vrt, engine="rasterio", mask_and_scale=False)
        ds.band_data.plot.imshow(rgb="band")

image

@snowman2
Copy link
Member

@sgillies Hope you don't mind I tweaked the xarray example.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants