Use 3 numpy arrays for manifest internally #107

Draft: wants to merge 25 commits into main
Conversation

TomNicholas (Owner)

Supersedes #39 as a way to close #33, the difference being that this uses 3 separate numpy arrays to store the path strings, byte offsets, and byte range lengths (rather than trying to put them all in one numpy array with a structured dtype). Effectively implements (2) in #104.

Relies on numpy 2.0 (which is currently only available as a release candidate).
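As a rough sketch of the three-array layout described above (the names `paths`, `offsets`, `lengths`, and `chunk_entry` are illustrative only, not the PR's actual attributes or API):

```python
import numpy as np

# Illustrative three-array manifest: one entry per chunk.
# The PR relies on numpy 2.0's variable-width StringDType for the path
# array; dtype=object is used here only as a stand-in that also works
# on older numpy versions.
paths = np.array(["s3://bucket/a.nc", "s3://bucket/b.nc"], dtype=object)
offsets = np.array([0, 4096], dtype=np.uint64)    # byte offset of each chunk
lengths = np.array([4096, 8192], dtype=np.uint64)  # byte length of each chunk

def chunk_entry(i):
    # hypothetical lookup: reassemble one chunk's (path, offset, length)
    return paths[i], int(offsets[i]), int(lengths[i])
```

Looking up chunk 1 this way yields `("s3://bucket/b.nc", 4096, 8192)`; keeping the three columns in homogeneous arrays is what avoids a structured dtype.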

@TomNicholas TomNicholas added enhancement New feature or request performance labels May 10, 2024
@TomNicholas TomNicholas marked this pull request as draft May 10, 2024 16:55
@martindurant

Here is a script which generated a 9GB JSON file across many years of NWM data: https://gist.github.com/rsignell-usgs/d386c85e02697c5b89d0211371e8b944 . I'll see if I can find a parquet version, but you should reckon on roughly a 10x reduction in on-disk size (parquet or compressed numpy).

Unfortunately, the references have been deleted, because the whole dataset is now also available as zarr. I may have the chance sometime to regenerate them, if it's important.

martindurant commented May 10, 2024

Also, a super-simple arrow- or awkward-like string representation as contiguous numpy arrays could look something like

import numpy as np

class String:
    """Arrow/awkward-style strings: one contiguous byte buffer plus offsets."""

    def __init__(self, offsets, data) -> None:
        self.offsets = offsets  # offsets[i]:offsets[i+1] delimit string i in data
        self.data = data

    def __getitem__(self, item):
        if isinstance(item, int):
            return self.data[self.offsets[item]: self.offsets[item + 1]].decode()
        else:
            # assumes item is a slice; keep one extra offset so the end of
            # the last selected string is retained
            start, stop, _ = item.indices(len(self.offsets) - 1)
            return String(self.offsets[start:stop + 1], self.data)

>>> s = String(np.array([0, 5, 10]), b"HelloWorld")
>>> s[1]
'World'
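Building the `(offsets, data)` pair from ordinary Python strings can be sketched with `np.cumsum` (the `build_offsets` helper name is mine, not from the thread):

```python
import numpy as np

def build_offsets(strings):
    # hypothetical helper: pack a list of str into one contiguous byte
    # buffer plus an arrow-style offsets array
    encoded = [s.encode() for s in strings]
    offsets = np.concatenate(([0], np.cumsum([len(b) for b in encoded])))
    return offsets, b"".join(encoded)

offsets, data = build_offsets(["Hello", "World"])
# offsets is [0, 5, 10]; element 1 lives at data[5:10]
```

This reproduces the `np.array([0, 5, 10]), b"HelloWorld"` pair from the example above, so n strings need n + 1 offsets.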

TomNicholas (Owner, Author)

> Here is a script which generated a 9GB JSON file across many years of NWM data: https://gist.github.com/rsignell-usgs/d386c85e02697c5b89d0211371e8b944 . I'll see if I can find a parquet version, but you should reckon on 10x in on-disk size (parquet or compressed numpy).

That's useful context for #104, thanks Martin!

@TomNicholas TomNicholas mentioned this pull request May 13, 2024
Successfully merging this pull request may close these issues.

In-memory representation of chunks: array instead of a dict?
2 participants