Use 3 numpy arrays for manifest internally #107

Draft: wants to merge 25 commits into main
Conversation

TomNicholas (Owner)

Supersedes #39 as a way to close #33, the difference being that this uses 3 separate numpy arrays to store the path strings, byte offsets, and byte range lengths (rather than trying to put them all in one numpy array with a structured dtype). Effectively implements (2) in #104.

Relies on numpy 2.0 (which is currently only available as a release candidate).
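As a rough sketch of the three-array layout described above (the names `paths`, `offsets`, `lengths`, and `chunk_entry` are illustrative only, not the PR's actual attributes or API):

```python
import numpy as np

# Illustrative three-array manifest: one entry per chunk.
# The PR relies on numpy 2.0's variable-width StringDType for the path
# array; dtype=object is used here only as a stand-in that also works
# on older numpy versions.
paths = np.array(["s3://bucket/a.nc", "s3://bucket/b.nc"], dtype=object)
offsets = np.array([0, 4096], dtype=np.uint64)    # byte offset of each chunk
lengths = np.array([4096, 8192], dtype=np.uint64)  # byte length of each chunk

def chunk_entry(i):
    # hypothetical lookup: reassemble one chunk's (path, offset, length)
    return paths[i], int(offsets[i]), int(lengths[i])
```

Looking up chunk 1 this way yields `("s3://bucket/b.nc", 4096, 8192)`; keeping the three columns in homogeneous arrays is what avoids a structured dtype.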

@TomNicholas TomNicholas added enhancement New feature or request performance labels May 10, 2024
@TomNicholas TomNicholas marked this pull request as draft May 10, 2024 16:55
@martindurant

Here is a script which generated a 9GB JSON file across many years of NWM data: https://gist.github.com/rsignell-usgs/d386c85e02697c5b89d0211371e8b944 . I'll see if I can find a parquet version, but you should reckon on roughly a 10x reduction in on-disk size (parquet or compressed numpy).

Unfortunately, the references have been deleted, because the whole dataset is now also available as zarr. I may have the chance sometime to regenerate them, if it's important.

martindurant commented May 10, 2024

Also, a super-simple arrow- or awkward-like string representation as contiguous numpy arrays could look something like

import numpy as np

class String:
    """Arrow/awkward-style strings: one contiguous byte buffer plus offsets."""

    def __init__(self, offsets, data) -> None:
        self.offsets = offsets  # offsets[i]:offsets[i+1] delimit string i in data
        self.data = data

    def __getitem__(self, item):
        if isinstance(item, int):
            return self.data[self.offsets[item]: self.offsets[item + 1]].decode()
        else:
            # assumes item is a slice; keep one extra offset so the end of
            # the last selected string is retained
            start, stop, _ = item.indices(len(self.offsets) - 1)
            return String(self.offsets[start:stop + 1], self.data)

>>> s = String(np.array([0, 5, 10]), b"HelloWorld")
>>> s[1]
'World'
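Building the `(offsets, data)` pair from ordinary Python strings can be sketched with `np.cumsum` (the `build_offsets` helper name is mine, not from the thread):

```python
import numpy as np

def build_offsets(strings):
    # hypothetical helper: pack a list of str into one contiguous byte
    # buffer plus an arrow-style offsets array
    encoded = [s.encode() for s in strings]
    offsets = np.concatenate(([0], np.cumsum([len(b) for b in encoded])))
    return offsets, b"".join(encoded)

offsets, data = build_offsets(["Hello", "World"])
# offsets is [0, 5, 10]; element 1 lives at data[5:10]
```

This reproduces the `np.array([0, 5, 10]), b"HelloWorld"` pair from the example above, so n strings need n + 1 offsets.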

TomNicholas (Owner, Author)

> Here is a script which generated a 9GB JSON file across many years of NWM data: https://gist.github.com/rsignell-usgs/d386c85e02697c5b89d0211371e8b944 . I'll see if I can find a parquet version, but you should reckon on 10x in on-disk size (parquet or compressed numpy).

That's useful context for #104, thanks Martin!

@TomNicholas TomNicholas mentioned this pull request May 13, 2024
Successfully merging this pull request may close these issues.

In-memory representation of chunks: array instead of a dict?
2 participants