h5coro

A cloud optimized Python package for reading HDF5 data stored in S3

Origin and Purpose

h5coro is a pure Python implementation of a subset of the HDF5 specification that has been optimized for reading data out of S3. The project has its roots in the development of an on-demand science data processing system called SlideRule, where a new C++ implementation of the HDF5 specification was developed for performant read access to Earth science datasets stored in AWS S3. Over time, users of SlideRule began requesting the ability to performantly read HDF5 and NetCDF files out of S3 from their own Python scripts. The result is h5coro: a re-implementation in Python of the core HDF5 reading logic that exists in SlideRule. Since then, h5coro has become its own project, which continues to grow and diverge in functionality from its parent implementation. For more information on SlideRule and the organization behind h5coro, see https://slideruleearth.io.

h5coro is optimized for reading HDF5 data in high-latency high-throughput environments. It accomplishes this through a few key design decisions:

  • All reads are concurrent. Each dataset and/or attribute read by h5coro is performed in its own thread.
  • Intelligent range gets are used to read as many dataset chunks as possible in each read operation. This drastically reduces the number of HTTP requests to S3 and removes the need to re-chunk the data (h5coro actually works better on smaller chunk sizes, because they allow finer-grained requests).
  • Block caching is used to minimize the number of GET requests made to S3. S3 has a large first-byte latency (we've measured it at ~60ms on our systems), so there is a large penalty for each read operation performed. h5coro performs all reads to S3 as large block reads and then maintains the data in a local cache, from which smaller reads within those blocks are served (see the sketch after this list).
  • The system is serverless and does not depend on any external services to read the data. This means it scales naturally as the user application scales, and it reduces overall system complexity.
  • No metadata repository is needed. The structure of the file is cached as it is read, so that successive reads of other datasets in the same file do not have to re-read and re-build the directory structure of the file.
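
To illustrate the block-caching idea, here is a minimal sketch. The BlockCachedReader class and the fetch_range callable are hypothetical names for illustration only, not h5coro's actual internals; the point is that small reads are served from previously fetched large blocks, so each region of the file pays the S3 first-byte latency at most once.

    BLOCK_SIZE = 1 << 20  # fetch data from S3 in 1 MiB blocks

    class BlockCachedReader:
        def __init__(self, fetch_range):
            self.fetch_range = fetch_range  # callable: (offset, size) -> bytes
            self.cache = {}                 # block index -> cached block bytes

        def read(self, offset, size):
            # Serve a small read from cached blocks, fetching each block
            # from S3 at most once.
            data = bytearray()
            first = offset // BLOCK_SIZE
            last = (offset + size - 1) // BLOCK_SIZE
            for block in range(first, last + 1):
                if block not in self.cache:
                    self.cache[block] = self.fetch_range(block * BLOCK_SIZE, BLOCK_SIZE)
                data += self.cache[block]
            start = offset - first * BLOCK_SIZE
            return bytes(data[start:start + size])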

Limitations

For a full list of which parts of the HDF5 specification h5coro implements, see the compatibility section at the end of this readme. The major limitations currently present in the package are:

  • The code only implements a subset of the HDF5 specification. h5coro has been shown to work on a number of different datasets, but depending on the version of the HDF5 C library used to write a file, and what options were used during its creation, it is possible that some part of h5coro will need to be updated to support reading it. As more of the specification is implemented over time, this should become less of a problem.
  • It is a read-only library and has no functionality to write HDF5 data.

Installation

The simplest way to install h5coro is by using the conda package manager.

    conda install -c conda-forge h5coro

Alternatively, you can also install h5coro using pip.

    pip install h5coro

xarray backend

To use h5coro as a backend to xarray, simply install both xarray and h5coro in your current environment. h5coro will automatically be recognized by xarray, so you can use it like any other xarray engine:

    import xarray as xr
    h5ds = xr.open_dataset("file.h5", engine="h5coro")

You can see what backends are available in xarray using:

    xr.backends.list_engines()
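
As a hedged illustration (the s3:// URL form and the bucket and key below are assumptions for this example, not guarantees made by this README), reading a file directly from S3 through the xarray backend might look like:

    import xarray as xr

    # Hypothetical example: the bucket name and object key are placeholders.
    h5ds = xr.open_dataset("s3://my-bucket/path/to/file.h5", engine="h5coro")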

Example Usage

    # (1) import
    from h5coro import h5coro, s3driver

    # (2) create (placeholders -- substitute your S3 bucket name and object key)
    my_bucket = 'my-bucket'
    path_to_hdf5_file = 'path/to/file.h5'
    h5obj = h5coro.H5Coro(f'{my_bucket}/{path_to_hdf5_file}', s3driver.S3Driver)

    # (3) read
    datasets = [{'dataset': '/path/to/dataset1', 'hyperslice': []},
                {'dataset': '/path/to/dataset2', 'hyperslice': [324, 374]}]
    promise = h5obj.readDatasets(datasets=datasets, block=True)

    # (4) display
    for variable in promise:
        print(f'{variable}: {promise[variable]}')

(1) Importing h5coro

h5coro: the main module implementing the HDF5 reader object

s3driver: the driver used to read HDF5 data from S3

(2) Create h5coro Object

The call to h5coro.H5Coro creates a reader object that opens up the HDF5 file, reads the start of the file, and is then ready to accept read requests.

The calling application must have credentials to access the object in the specified S3 bucket. h5coro uses boto3, so any credentials supplied via the standard AWS methods will work. If credentials need to be supplied explicitly, pass a credentials argument to h5coro.H5Coro: a dictionary with the following three fields: "aws_access_key_id", "aws_secret_access_key", "aws_session_token".
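
For example (the bucket, key, and credential values below are placeholders; only the credentials keyword and its three fields come from the description above):

    from h5coro import h5coro, s3driver

    credentials = {
        'aws_access_key_id':     'AKIA...',  # placeholder
        'aws_secret_access_key': '...',      # placeholder
        'aws_session_token':     '...',      # placeholder
    }
    h5obj = h5coro.H5Coro('my-bucket/path/to/file.h5', s3driver.S3Driver,
                          credentials=credentials)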

(3) Read with h5coro Object

The H5Coro.readDatasets function takes a list of dictionaries, each describing a dataset to be read; all of the requested datasets are read in parallel.

If the block parameter is set to True, the call waits for all of the datasets to be read before returning; otherwise, it returns immediately, and the code blocks only when a dataset within the returned promise is accessed.
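
For example, a non-blocking read might look like the following sketch (based on the behavior described above, reusing the datasets list and placeholder paths from the example):

    # Returns immediately; the reads proceed in background threads.
    promise = h5obj.readDatasets(datasets=datasets, block=False)

    # ... do other work while the reads are in flight ...

    # Accessing a result blocks until that dataset has been populated.
    values = promise['/path/to/dataset1']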

(4) Display the Datasets

The h5coro promise is a dictionary of numpy arrays containing the values of the variables that were read, along with additional logic that allows callers to block while waiting for the data to be populated.
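
Since each entry is a numpy array, standard numpy operations apply to it directly (an illustrative snippet, reusing the placeholder dataset path from the example above):

    import numpy as np

    values = promise['/path/to/dataset1']  # a numpy array
    print(values.shape, np.nanmean(values))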

Licensing

h5coro is licensed under the 3-clause BSD license found in the LICENSE file at the root of this source tree.

Contribute

We welcome contributions from anyone at any career stage and with any amount of coding experience. All contributions toward the development of the project are appreciated, and you will be recognized for your work by being listed as one of the project contributors.

Ways to Contribute

  • Fixing typographical or coding errors
  • Submitting bug reports or feature requests through the use of GitHub issues
  • Improving documentation and testing
  • Sharing use cases and examples (such as Jupyter Notebooks)
  • Providing code for everyone to use

Requesting a Feature

Check the project issues tab to see if the feature has already been suggested. If not, please submit a new issue describing your requested feature or enhancement. Please give your feature request both a clear title and description. Please let us know in your description if this is something you would like to contribute to the project.

Reporting a Bug

Check the project issues tab to see if the problem has already been reported. If not, please submit a new issue so that we are made aware of the problem. Please provide as much detail as possible when writing the description of your bug report. Providing detailed information and examples will help us resolve issues faster.

Contributing Code or Examples

We follow a standard Forking Workflow for code changes and additions. Submitted code goes through a review and comment process by the project maintainers.

General Guidelines

  • Make each pull request as small and simple as possible
  • Commit messages should be clear and describe the changes
  • Larger changes should be broken down into their basic components and integrated separately
  • Bug fixes should be their own pull requests with an associated GitHub issue
  • Write a descriptive pull request message with a clear title
  • Please be patient as reviews of pull requests can take time

Steps to Contribute

  • Fork the repository to your personal GitHub account by clicking the “Fork” button on the project main page. This creates your own server-side copy of the repository.
  • Either by cloning to your local system or working in GitHub Codespaces, create a work environment to make your changes.
  • Add the original project repository as the upstream remote. While this step isn't necessary, it allows you to keep your fork up to date in the future.
  • Create a new branch to do your work.
  • Make your changes on the new branch.
  • Push your work to GitHub under your fork of the project.
  • Submit a Pull Request from your forked branch to the project repository.

Compatibility

| Format Element                          | Supported | Contains         | Missing                       |
|-----------------------------------------|-----------|------------------|-------------------------------|
| Field Sizes                             | Yes       | 1, 2, 4, 8 bytes |                               |
| Superblock                              | Partial   | Version 0, 2     | Version 1, 3                  |
| Base Address                            | Yes       |                  |                               |
| B-Tree                                  | Partial   | Version 1        | Version 2                     |
| Group Symbol Table                      | Yes       | Version 1        |                               |
| Local Heap                              | Yes       | Version 0        |                               |
| Global Heap                             | No        |                  | Version 1                     |
| Fractal Heap                            | Yes       | Version 0        |                               |
| Shared Object Header Message Table      | No        |                  | Version 0                     |
| Data Object Headers                     | Yes       | Version 1, 2     |                               |
| Shared Message                          | No        |                  | Version 1                     |
| NIL Message                             | Yes       | Unversioned      |                               |
| Dataspace Message                       | Yes       | Version 1        |                               |
| Link Info Message                       | Yes       | Version 0        |                               |
| Datatype Message                        | Partial   | Version 1        | Version 0, 2, 3               |
| Fill Value (Old) Message                | No        |                  | Unversioned                   |
| Fill Value Message                      | Partial   | Version 2, 3     | Version 1                     |
| Link Message                            | Yes       | Version 1        |                               |
| External Data Files Message             | No        |                  | Version 1                     |
| Data Layout Message                     | Partial   | Version 3        | Version 1, 2                  |
| Bogus Message                           | No        |                  | Unversioned                   |
| Group Info Message                      | No        |                  | Version 0                     |
| Filter Pipeline Message                 | Yes       | Version 1, 2     |                               |
| Attribute Message                       | Partial   | Version 1, 2, 3  | Shared message support for v3 |
| Object Comment Message                  | No        |                  | Unversioned                   |
| Object Modification Time (Old) Message  | No        |                  | Unversioned                   |
| Shared Message Table Message            | No        |                  | Version 0                     |
| Object Header Continuation Message      | Yes       | Version 1, 2     |                               |
| Symbol Table Message                    | Yes       | Unversioned      |                               |
| Object Modification Time Message        | No        |                  | Version 1                     |
| B-Tree 'K' Value Message                | No        |                  | Version 0                     |
| Driver Info Message                     | No        |                  | Version 0                     |
| Attribute Info Message                  | No        |                  | Version 0                     |
| Object Reference Count Message          | No        |                  | Version 0                     |
| Compact Storage                         | Yes       |                  |                               |
| Contiguous Storage                      | Yes       |                  |                               |
| Chunked Storage                         | Yes       |                  |                               |
| Fixed Point Type                        | Yes       |                  |                               |
| Floating Point Type                     | Yes       |                  |                               |
| Time Type                               | No        |                  |                               |
| String Type                             | Yes       |                  |                               |
| Bit Field Type                          | No        |                  |                               |
| Opaque Type                             | No        |                  |                               |
| Compound Type                           | No        |                  |                               |
| Reference Type                          | No        |                  |                               |
| Enumerated Type                         | No        |                  |                               |
| Variable Length Type                    | No        |                  |                               |
| Array Type                              | No        |                  |                               |
| Deflate Filter                          | Yes       |                  |                               |
| Shuffle Filter                          | Yes       |                  |                               |
| Fletcher32 Filter                       | No        |                  |                               |
| Szip Filter                             | No        |                  |                               |
| Nbit Filter                             | No        |                  |                               |
| Scale Offset Filter                     | No        |                  |                               |