
Generating references from files in S3 (using kerchunk + fsspec) #61

Closed · TomNicholas opened this issue Mar 27, 2024 · 20 comments
Labels: enhancement (New feature or request) · references generation (Reading byte ranges from archival files)

@TomNicholas
Owner

TomNicholas commented Mar 27, 2024

Raised by @abarciauskas-bgse in #60.

Presumably kerchunk's backend opener functions can already do this? In which case we just need to pass the right kwargs / fsspec options to those inside open_virtual_dataset.

@TomNicholas TomNicholas added the enhancement New feature or request label Mar 27, 2024
@TomNicholas TomNicholas changed the title We need a way to read from S3 - perhaps this should be a separate issue. I wrote in a workaround but probably need a more thought-out solution Reading from S3 Mar 27, 2024
@norlandrhagen
Collaborator

Wondering how we want to package up kwargs for all the different kerchunk file readers so we can pipe in things like storage_options to get S3 reading to work.

The netcdf3, netcdf4, tiff, etc. readers all take a few options. For example:

SingleHdf5ToZarr:

  h5f: 'BinaryIO | str | h5py.File | h5py.Group',
  url: str = None,
  spec=1,
  inline_threshold=500,
  storage_options=None,
  error='warn',
  vlen_encode='embed',
  out=None,


NetCDF3ToZarr:

def __init__(
    self,
    filename,
    storage_options=None,
    inline_threshold=100,
    max_chunk_size=0,
    out=None,
    **kwargs)

tiff_to_zarr:

    urlpath: str
        Location of input TIFF
    remote_options: dict
        pass these to fsspec when opening urlpath
    target: str
        Write JSON to this location. If not given, no file is output
    target_options: dict
        pass these to fsspec when opening target

etc. (One way of bundling these into a single dict is sketched below.)
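
A minimal sketch of how a single bundled dict could be threaded through to a kerchunk reader — reader_options is an illustrative name here, not an existing open_virtual_dataset parameter:

import kerchunk.hdf

# Hypothetical: bundle all reader/fsspec kwargs into one dict so they stay
# cleanly separated from any xarray kwargs.
reader_options = {"storage_options": {"anon": True}, "inline_threshold": 500}

def generate_hdf5_references(url: str, reader_options: dict) -> dict:
    # Pass the bundled kwargs straight through to the matching kerchunk reader.
    return kerchunk.hdf.SingleHdf5ToZarr(url, **reader_options).translate()

refs = generate_hdf5_references("s3://some-bucket/file.nc", reader_options)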

@TomNicholas
Owner Author

Urgh that's messy - I'm not familiar at all with these so would defer to your judgement. But generally they should all go in one dict so that they can be cleanly separated from Xarray kwargs once open_virtual_dataset is replaced by xr.open_dataset(..., engine='virtualizarr').

@TomNicholas
Owner Author

Whilst I think the approach in #67 is fine for now, is it possible to imagine any other implementation of reading byte ranges from remote files that doesn't rely on fsspec / kerchunk?

@rabernat
Collaborator

is it possible to imagine any other implementation of reading byte ranges from remote files that doesn't rely on fsspec / kerchunk?

It's straightforward to abstract this via a wrapper class. Just imagine the API you'd like to use from the virtualizarr perspective, create an ABC interface for it, and then make an fsspec implementation.
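
A rough sketch of that pattern — all names here are hypothetical, not an agreed design:

from abc import ABC, abstractmethod

import fsspec

class ByteRangeReader(ABC):
    # Hypothetical interface: just the API virtualizarr needs, so the
    # backend (fsspec, object-store, ...) is swappable.

    @abstractmethod
    def read_range(self, url: str, offset: int, length: int) -> bytes:
        ...

class FsspecReader(ByteRangeReader):
    # fsspec-backed implementation of the interface.

    def __init__(self, **storage_options):
        self.storage_options = storage_options

    def read_range(self, url: str, offset: int, length: int) -> bytes:
        fs, path = fsspec.core.url_to_fs(url, **self.storage_options)
        with fs.open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)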

@norlandrhagen
Collaborator

@TomNicholas that would be nice! As a stop-gap, I'm happy to continue working on #67, or if you want to pause and look at other file-reader paths, that sounds good too.

@TomNicholas
Owner Author

I really have no idea, since I've not worked with fsspec very much at all. Input welcome.

I guess from the virtualizarr perspective I just want to be able to pass a filepath / S3 URL directly to the open_virtual_dataset function, and not have to create a bunch of weird filesystem mapper objects first. I also don't want to have to pass a bunch of specific reader options or whatever (why doesn't that part "just work"...?). I don't think we need to support reading from anything other than POSIX filesystems and S3 URLs anyway. The main thing to avoid here is just copying kerchunk's fsspec arguments into open_virtual_dataset.

@abarciauskas-bgse
Collaborator

@TomNicholas curious about this comment #61 (comment)

is it possible to imagine any other implementation of reading byte ranges from remote files that doesn't rely on fsspec / kerchunk?

Since the current implementation relies on reading files using kerchunk, are you thinking about changing that implementation entirely?

If keeping the current implementation, which relies on kerchunk modules, I think #67 makes sense as that PR just builds on that integration via the additional parameters kerchunk needs to open remote files.

@TomNicholas
Owner Author

TomNicholas commented Apr 2, 2024

since the current implementation relies on reading files using kerchunk - are you thinking about changing that implementation entirely?

Well, looking forward we want to eventually use zarr chunk manifests for storing the byte ranges on disk, which would leave the kerchunk backends for reading byte ranges as the only part of this library that actually depends on kerchunk. It's at least interesting to imagine what we could implement instead of using kerchunk to read into a format that we have to convert before using anyway.

EDIT: IIUC, with zarr v3 and the rust object-store crate we also wouldn't need fsspec to actually read data from disk/object storage either. (I wonder if it's possible to use the same crate to read just byte ranges?...)

I'm only floating this as an idea, not a concrete plan though.
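
For context, a chunk manifest is essentially a mapping from chunk keys to byte ranges; a minimal illustration (not a final on-disk schema):

# Illustrative only: each chunk's index maps to the file, offset, and length
# of its bytes, so any reader that can fetch byte ranges (fsspec, the rust
# object-store crate, plain HTTP range requests) can load the data.
manifest = {
    "0.0": {"path": "s3://some-bucket/file.nc", "offset": 100, "length": 100},
    "0.1": {"path": "s3://some-bucket/file.nc", "offset": 200, "length": 100},
}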

@abarciauskas-bgse
Collaborator

IIUC with zarr v3 and the rust object-store crate we also wouldn't need fsspec to actually read data from disk/object storage

I would say it's ok to postpone S3 support until some of those things you mentioned (zarr chunk manifests, zarr v3, the rust object-store crate) exist - wdyt @norlandrhagen

Is there already WIP on the rust object-store @TomNicholas?

@TomNicholas
Owner Author

The rust crate exists, but you'll have to ask @jhamman about the status of attempts to use it to load data into python.

@TomNicholas
Owner Author

@abarciauskas-bgse actually see zarr-developers/zarr-python#1661

@TomNicholas
Owner Author

So the rust object-store crate looks like a potential solution to the problem of reading bytes from known byte ranges from S3 without using fsspec. That's great for reading from zarr stores created by virtualizarr.

But to actually use object-store to read the netCDF file to determine the correct byte ranges to put in the manifest is a different requirement. That requires re-creating the logic in kerchunk.hdf.SingleHdf5ToZarr, which uses h5py and fsspec like this:

import fsspec
import h5py

if isinstance(h5f, str):
    # Resolve the URL to an fsspec filesystem and an open file-like object...
    fs, path = fsspec.core.url_to_fs(h5f, **(storage_options or {}))
    input_file = fs.open(path, "rb")
    url = h5f
    # ...which h5py then reads as if it were a local file.
    _h5f = h5py.File(input_file, mode="r")

Is there any way to do this without using fsspec?

@TomNicholas
Owner Author

TomNicholas commented Apr 10, 2024

Answering my own question: how about the ros3 file driver in the h5py docs? It says it enables read-only access to HDF5 files in AWS S3 or S3-compatible object stores. Maybe we could use this to implement our own S3-compatible netCDF opener using h5py but not fsspec...?

https://docs.h5py.org/en/stable/high/file.html#file-drivers
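
A hypothetical example, assuming an h5py/libhdf5 build compiled with ros3 support (the URL is illustrative):

import h5py

# Open an HDF5 file in S3 read-only via the ros3 driver, no fsspec involved.
f = h5py.File(
    "https://some-bucket.s3.us-west-2.amazonaws.com/file.nc",
    mode="r",
    driver="ros3",
)
print(list(f.keys()))  # the root group is browsable like any local HDF5 file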

@rabernat
Collaborator

But to actually use object-store to read the netCDF file to determine the correct byte ranges to put in the manifest is a different requirement.

Is this within scope for VirtualiZarr right now? Generating the references is a very separate problem from storing / manipulating them. I'd recommend continuing to rely on Kerchunk for the reference generation.

No, it would not be easy to use h5py in this way without fsspec. But it is possible to re-implement the logic for finding the chunks from scratch in a much more efficient way. See hidefix for example.

@TomNicholas
Owner Author

Is this within scope for VirtualiZarr right now?

I agree it's fairly orthogonal, and not part of an MVP (which would rely on fsspec and kerchunk both for generating references and reading actual bytes later). But if we're thinking about how to create references that don't require fsspec to read bytes from (i.e. zarr stores with chunk manifests), then it's also interesting to think about how to generate those references without fsspec.

But there is less of an obvious motivation for doing so other than reliability / performance / design simplicity (whereas zarr stores without fsspec get us an actual feature: language-agnostic data access).

@rabernat
Collaborator

From my point of view, generating the references is a one-time cost. Yes, it could be more efficient. But that's generally not the bottleneck for people today. Our goal is to make it more flexible to manipulate those references once they are generated.

@TomNicholas
Owner Author

I'm less concerned about efficiency and more about reliability / understandability.

Our goal is to make it more flexible to manipulate those references once they are generated.

Yes that's the main goal, but also (a) being more confident that the generated references are correct and (b) not using dependencies that are more complex than necessary (i.e. fsspec) would be nice secondary goals too.

@TomNicholas
Owner Author

I think we've gone off track from the original issue here, so I've opened #78 to specifically discuss generating references from data in S3 without using fsspec (as opposed to reading bytes from known references to data in S3).

I'll leave this issue to discuss generating references from files in S3 using kerchunk + fsspec.

@TomNicholas TomNicholas changed the title Reading from S3 Generating references from files in S3 (using kerchunk + fsspec) Apr 10, 2024
@TomNicholas
Owner Author

TomNicholas commented Apr 15, 2024

@abarciauskas-bgse @norlandrhagen it would be great to try to move forward with using kerchunk + fsspec for now (i.e. PR #67) with the understanding that eventually we would like to replace those dependencies with the ideas in #78. That would allow us to make this package useful for real cases faster.

@TomNicholas
Owner Author

I think this was closed by @norlandrhagen in #67, with the other ideas being tracked in #78.

@TomNicholas TomNicholas added the references generation Reading byte ranges from archival files label May 16, 2024