Use 3 numpy arrays for manifest internally #107
Here is a script that generated a 9GB JSON file covering many years of NWM data: https://gist.github.com/rsignell-usgs/d386c85e02697c5b89d0211371e8b944. I'll see if I can find a parquet version, but you should reckon on roughly a 10x reduction in on-disk size (parquet or compressed numpy). Unfortunately, the references have been deleted, because the whole dataset is now also available as zarr. I may have the chance to regenerate them sometime, if it's important.
Also, a super-simple arrow- or awkward-like string representation can be built from contiguous numpy arrays: one array holding the concatenated bytes and one holding the offsets.
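A minimal sketch of that kind of layout, assuming UTF-8 encoding and one flat `uint8` buffer plus an `int64` offsets array (the Arrow-style convention). The function names and example paths are hypothetical and are not taken from this thread:

```python
import numpy as np


def encode_strings(strings):
    """Pack a list of str into (data, offsets) contiguous numpy arrays."""
    encoded = [s.encode("utf-8") for s in strings]
    lengths = np.array([len(b) for b in encoded], dtype=np.int64)
    # offsets[i]:offsets[i + 1] is the byte range of string i in `data`
    offsets = np.zeros(len(encoded) + 1, dtype=np.int64)
    offsets[1:] = np.cumsum(lengths)
    data = np.frombuffer(b"".join(encoded), dtype=np.uint8)
    return data, offsets


def get_string(data, offsets, i):
    """Decode the i-th string from the packed representation."""
    return data[offsets[i]:offsets[i + 1]].tobytes().decode("utf-8")


# Made-up paths, for illustration only
paths = ["s3://bucket/a.nc", "s3://bucket/b.nc", ""]
data, offsets = encode_strings(paths)
```

Both arrays are plain numeric buffers, so they can be saved, compressed, or memory-mapped like any other numpy array, at the cost of decoding on access.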
That's useful context for #104, thanks Martin!
…rlying numpy arrays
…s/VirtualiZarr into numpy_arrays_manifest
Supersedes #39 as a way to close #33, the difference being that this uses 3 separate numpy arrays to store the path strings, byte offsets, and byte range lengths (rather than trying to put them all in one numpy array with a structured dtype). Effectively implements (2) in #104.
Relies on numpy 2.0 (which is currently only available as a release candidate).
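To illustrate the idea, here is a rough sketch of a manifest backed by three same-shaped arrays. It is not the code in this PR: the class name `ArrayManifest`, its attributes, and the example values are hypothetical. It assumes the path array uses numpy 2.0's variable-width `StringDType`, and falls back to `object` dtype on older numpy only so that the sketch stays runnable:

```python
import numpy as np

# StringDType is new in numpy 2.0; fall back to object dtype on older versions.
_dtypes = getattr(np, "dtypes", None)
STRING_DTYPE = _dtypes.StringDType() if hasattr(_dtypes, "StringDType") else object


class ArrayManifest:
    """Hypothetical chunk manifest: one entry per chunk, spread over 3 arrays."""

    def __init__(self, paths, offsets, lengths):
        self.paths = np.asarray(paths, dtype=STRING_DTYPE)
        self.offsets = np.asarray(offsets, dtype=np.uint64)
        self.lengths = np.asarray(lengths, dtype=np.uint64)
        # All three arrays share the shape of the chunk grid.
        if not (self.paths.shape == self.offsets.shape == self.lengths.shape):
            raise ValueError("paths, offsets and lengths must have the same shape")

    def __getitem__(self, chunk_index):
        """Return the reference for one chunk as a plain dict."""
        return {
            "path": str(self.paths[chunk_index]),
            "offset": int(self.offsets[chunk_index]),
            "length": int(self.lengths[chunk_index]),
        }


# Made-up 1x2 chunk grid whose chunks both live in the same file
manifest = ArrayManifest(
    paths=[["a.nc", "a.nc"]],
    offsets=[[0, 100]],
    lengths=[[100, 50]],
)
entry = manifest[0, 1]
```

Keeping the arrays separate means each has an ordinary dtype, so operations such as concatenating or stacking manifests reduce to the same numpy call applied to each of the three arrays.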