
Generating references from files in S3 (using kerchunk + fsspec) #61

Closed · TomNicholas opened this issue Mar 27, 2024 · 20 comments
Labels: enhancement (New feature or request) · references generation (Reading byte ranges from archival files)

@TomNicholas
Owner

TomNicholas commented Mar 27, 2024

Raised by @abarciauskas-bgse in #60.

Presumably kerchunk's backend opener functions can already do this? In which case we just need to pass the right kwargs / fsspec options to those inside open_virtual_dataset.

@TomNicholas TomNicholas added the enhancement New feature or request label Mar 27, 2024
@TomNicholas TomNicholas changed the title We need a way to read from S3 - perhaps this should be a separate issue. I wrote in a workaround but probably need a more thought-out solution Reading from S3 Mar 27, 2024
@norlandrhagen
Collaborator

Wondering how we want to package up kwargs for all the different kerchunk file readers so we can pipe in things like storage_options to get S3 reading to work.

The netcdf3, netcdf4, tiff, etc. readers all take a few options. For example:

SingleHdf5ToZarr:

  h5f: 'BinaryIO | str | h5py.File | h5py.Group',
  url: str = None,
  spec=1,
  inline_threshold=500,
  storage_options=None,
  error='warn',
  vlen_encode='embed',
  out=None,


NetCDF3ToZarr:

def __init__(
    self,
    filename,
    storage_options=None,
    inline_threshold=100,
    max_chunk_size=0,
    out=None,
    **kwargs)

tiff_to_zarr:

    urlpath: str
        Location of input TIFF
    remote_options: dict
        pass these to fsspec when opening urlpath
    target: str
        Write JSON to this location. If not given, no file is output
    target_options: dict
        pass these to fsspec when opening target

etc. (One way of bundling these into a single dict is sketched below.)
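
A minimal sketch of how a single bundled dict could be threaded through to a kerchunk reader — reader_options is an illustrative name here, not an existing open_virtual_dataset parameter:

import kerchunk.hdf

# Hypothetical: bundle all reader/fsspec kwargs into one dict so they stay
# cleanly separated from any xarray kwargs.
reader_options = {"storage_options": {"anon": True}, "inline_threshold": 500}

def generate_hdf5_references(url: str, reader_options: dict) -> dict:
    # Pass the bundled kwargs straight through to the matching kerchunk reader.
    return kerchunk.hdf.SingleHdf5ToZarr(url, **reader_options).translate()

refs = generate_hdf5_references("s3://some-bucket/file.nc", reader_options)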

@TomNicholas
Owner Author

Urgh that's messy - I'm not familiar at all with these so would defer to your judgement. But generally they should all go in one dict so that they can be cleanly separated from Xarray kwargs once open_virtual_dataset is replaced by xr.open_dataset(..., engine='virtualizarr').

@TomNicholas
Owner Author

Whilst I think the approach in #67 is fine for now, is it possible to imagine any other implementation of reading byte ranges from remote files that doesn't rely on fsspec / kerchunk?

@rabernat
Collaborator

is it possible to imagine any other implementation of reading byte ranges from remote files that doesn't rely on fsspec / kerchunk?

It's straightforward to abstract this via a wrapper class. Just imagine the API you'd like to use from the virtualizarr perspective, create an ABC interface for it, and then make an fsspec implementation.
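
A rough sketch of that pattern — all names here are hypothetical, not an agreed design:

from abc import ABC, abstractmethod

import fsspec

class ByteRangeReader(ABC):
    # Hypothetical interface: just the API virtualizarr needs, so the
    # backend (fsspec, object-store, ...) is swappable.

    @abstractmethod
    def read_range(self, url: str, offset: int, length: int) -> bytes:
        ...

class FsspecReader(ByteRangeReader):
    # fsspec-backed implementation of the interface.

    def __init__(self, **storage_options):
        self.storage_options = storage_options

    def read_range(self, url: str, offset: int, length: int) -> bytes:
        fs, path = fsspec.core.url_to_fs(url, **self.storage_options)
        with fs.open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)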

@norlandrhagen
Collaborator

@TomNicholas that would be nice! As a stop-gap, I'm happy to continue working on #67, or if you want to pause and look at other file-reader paths, that sounds good too.

@TomNicholas
Owner Author

I really have no idea, since I've not worked with fsspec very much at all. Input welcome.

I guess from the virtualizarr perspective I just want to be able to pass a filepath / S3 URL directly to the open_virtual_dataset function, and not have to create a bunch of weird filesystem mapper objects first. I also don't want to have to pass a bunch of specific reader options or whatever (why doesn't that part "just work"...?). I don't think we need to support reading from anything other than POSIX filesystems and S3 URLs anyway. The main thing to avoid here is just copying kerchunk's fsspec arguments into open_virtual_dataset.

@abarciauskas-bgse
Collaborator

@TomNicholas curious about this comment #61 (comment)

is it possible to imagine any other implementation of reading byte ranges from remote files that doesn't rely on fsspec / kerchunk?

Since the current implementation relies on reading files using kerchunk, are you thinking about changing that implementation entirely?

If keeping the current implementation, which relies on kerchunk modules, I think #67 makes sense as that PR just builds on that integration via the additional parameters kerchunk needs to open remote files.

@TomNicholas
Owner Author

TomNicholas commented Apr 2, 2024

since the current implementation relies on reading files using kerchunk - are you thinking about changing that implementation entirely?

Well, looking forward we want to eventually use zarr chunk manifests for storing the byte ranges on disk, which would leave the kerchunk backends for reading byte ranges as the only part of this library that actually depends on kerchunk. It's at least interesting to imagine what we could implement instead of using kerchunk to read into a format that we have to convert before using anyway.

EDIT: IIUC, with zarr v3 and the rust object-store crate we also wouldn't need fsspec to actually read data from disk/object storage either. (I wonder if it's possible to use the same crate to read just byte ranges?...)

I'm only floating this as an idea, not a concrete plan though.
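
For context, a chunk manifest is essentially a mapping from chunk keys to byte ranges; a minimal illustration (not a final on-disk schema):

# Illustrative only: each chunk's index maps to the file, offset, and length
# of its bytes, so any reader that can fetch byte ranges (fsspec, the rust
# object-store crate, plain HTTP range requests) can load the data.
manifest = {
    "0.0": {"path": "s3://some-bucket/file.nc", "offset": 100, "length": 100},
    "0.1": {"path": "s3://some-bucket/file.nc", "offset": 200, "length": 100},
}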

@abarciauskas-bgse
Collaborator

IIUC with zarr v3 and the rust object-store crate we also wouldn't need fsspec to actually read data from disk/object storage

I would say it's ok to postpone S3 support until some of those things you mentioned (zarr chunk manifests, zarr v3, the rust object-store crate) exist - wdyt @norlandrhagen

Is there already WIP on the rust object-store @TomNicholas?

@TomNicholas
Owner Author

The rust crate exists, but you'll have to ask @jhamman about the status of attempts to use it to load data into python.

@TomNicholas
Owner Author

@abarciauskas-bgse actually see zarr-developers/zarr-python#1661

@TomNicholas
Owner Author

So the rust object-store crate looks like a potential solution to the problem of reading bytes from known byte ranges from S3 without using fsspec. That's great for reading from zarr stores created by virtualizarr.

But to actually use object-store to read the netCDF file to determine the correct byte ranges to put in the manifest is a different requirement. That requires re-creating the logic in kerchunk.hdf.SingleHdf5ToZarr, which uses h5py and fsspec like this:

import fsspec
import h5py

if isinstance(h5f, str):
    # Resolve the URL to an fsspec filesystem and an open file-like object...
    fs, path = fsspec.core.url_to_fs(h5f, **(storage_options or {}))
    input_file = fs.open(path, "rb")
    url = h5f
    # ...which h5py then reads as if it were a local file.
    _h5f = h5py.File(input_file, mode="r")

Is there any way to do this without using fsspec?

@TomNicholas
Owner Author

TomNicholas commented Apr 10, 2024

Answering my own question: how about the ros3 file driver in the h5py docs? It says it enables read-only access to HDF5 files in AWS S3 or S3-compatible object stores. Maybe we could use this to implement our own S3-compatible netCDF opener using h5py but not fsspec...?

https://docs.h5py.org/en/stable/high/file.html#file-drivers
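
A hypothetical example, assuming an h5py/libhdf5 build compiled with ros3 support (the URL is illustrative):

import h5py

# Open an HDF5 file in S3 read-only via the ros3 driver, no fsspec involved.
f = h5py.File(
    "https://some-bucket.s3.us-west-2.amazonaws.com/file.nc",
    mode="r",
    driver="ros3",
)
print(list(f.keys()))  # the root group is browsable like any local HDF5 file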

@rabernat
Collaborator

But to actually use object-store to read the netCDF file to determine the correct byte ranges to put in the manifest is a different requirement.

Is this within scope for VirtualiZarr right now? Generating the references is a very separate problem from storing / manipulating them. I'd recommend continuing to rely on Kerchunk for the reference generation.

No, it would not be easy to use h5py in this way without fsspec. But it is possible to re-implement the logic for finding the chunks from scratch in a much more efficient way. See hidefix for example.

@TomNicholas
Owner Author

Is this within scope for VirtualiZarr right now?

I agree it's fairly orthogonal, and not part of an MVP (which would rely on fsspec and kerchunk both for generating references and reading actual bytes later). But if we're thinking about how to create references that don't require fsspec to read bytes from (i.e. zarr stores with chunk manifests), then it's also interesting to think about how to generate those references without fsspec.

But there is less of an obvious motivation for doing so other than reliability / performance / design simplicity (whereas zarr stores without fsspec get us an actual feature: language-agnostic data access).

@rabernat
Collaborator

From my point of view, generating the references is a one-time cost. Yes, it could be more efficient. But that's generally not the bottleneck for people today. Our goal is to make it more flexible to manipulate those references once they are generated.

@TomNicholas
Owner Author

I'm less concerned about efficiency and more about reliability / understandability.

Our goal is to make it more flexible to manipulate those references once they are generated.

Yes that's the main goal, but also (a) being more confident that the generated references are correct and (b) not using dependencies that are more complex than necessary (i.e. fsspec) would be nice secondary goals too.

@TomNicholas
Owner Author

I think we've gone off track from the original issue here, so I've opened #78 to specifically discuss generating references from data in S3 without using fsspec (as opposed to reading bytes from known references to data in S3).

I'll leave this issue to discuss generating references from files in S3 using kerchunk + fsspec.

@TomNicholas TomNicholas changed the title Reading from S3 Generating references from files in S3 (using kerchunk + fsspec) Apr 10, 2024
@TomNicholas
Owner Author

TomNicholas commented Apr 15, 2024

@abarciauskas-bgse @norlandrhagen it would be great to try to move forward with using kerchunk + fsspec for now (i.e. PR #67) with the understanding that eventually we would like to replace those dependencies with the ideas in #78. That would allow us to make this package useful for real cases faster.

@TomNicholas
Owner Author

I think this was closed by @norlandrhagen in #67, with the other ideas being tracked in #78.

@TomNicholas TomNicholas added the references generation Reading byte ranges from archival files label May 16, 2024