
Chunked sparse arrays (for v2) #16

Open
ivirshup opened this issue Mar 2, 2023 · 1 comment

ivirshup commented Mar 2, 2023

I would like to bring up the discussion of chunking again, as something to be addressed in v2.

I see two main strategies here:

  • Chunk the storage level arrays
  • Chunk the sparse arrays

I think chunking the sparse arrays themselves is the way to go – as it enables more flexibility. I think this is also a pretty safe option, as others have gone this route in the past (see: "prior art" at bottom).

I'm going to go through what the current set-up looks like and how these two approaches could work. I haven't thought much about how these approaches interact with n-dimensional or hypersparse formats.

@eriknw and @jim22k, would be interested to hear your perspective from experience with metagraph and dask-graphblas especially.

(Image: comparison of the chunking strategies described below)

Above is a representation of the different chunking strategies. Each strategy depicts the storage arrays for a CSR or CSC matrix. Color denotes a "logical chunk": a range of values along the compressed dimension, e.g. a contiguous set of rows for CSR. The solid lines illustrate how the arrays are subdivided into chunks.

Fixed size chunks

First is fixed size chunks in storage, basically what is being done right now. Here, chunk sizes are fixed for each storage array. This means that "logical chunks" have no set correspondence with storage chunks.

Advantages

  • It already exists; no further work is needed

Disadvantages

  • For any selection of rows, I need to read the pointer array before I can know which chunks of the indices and data arrays to retrieve.
  • For any selection of rows, it is very unlikely the storage chunk boundaries will coincide with logical chunk boundaries.
    • For distributed processing, this means each node will almost always be retrieving more data than it needs. How much data is tunable with chunk size – though this tuning is not straightforward.
    • For single node iterative processing, this significantly complicates the code. In practice I have needed to manage my own chunk cache. Knowing that each row sits within one chunk would remove that need.
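A minimal sketch of the over-fetch problem, using `scipy.sparse` (the fixed chunk size `CHUNK` and the matrix here are illustrative assumptions, not values from any real store):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
X = sparse.random(100, 50, density=0.2, format="csr", random_state=rng)
CHUNK = 256  # assumed fixed storage chunk size for the indices/data arrays

rows = slice(10, 20)
# Step 1: the pointer array must be read first to locate the value range.
start, stop = X.indptr[rows.start], X.indptr[rows.stop]

# Step 2: map that range onto fixed-size storage chunks of indices/data.
first_chunk, last_chunk = start // CHUNK, (stop - 1) // CHUNK
fetched = (last_chunk - first_chunk + 1) * CHUNK  # elements actually read
needed = stop - start                              # elements actually wanted
print(f"need {needed} elements, fetch {fetched}")
```

Because chunk boundaries almost never coincide with row boundaries, `fetched` is typically larger than `needed`, and shrinking `CHUNK` to reduce the waste trades off against per-chunk read overhead.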

Logical chunking of storage arrays

One solution here is to make the storage chunk boundaries match up with the logical boundaries. A major downside is that this requires the storage backend to allow variable sized chunks. This exists for arrow, and for hdf5 via virtual datasets (example). I would like this ability to exist in zarr v3 (see: ZEP 3), but it doesn't yet.

Advantages

  • Very similar to fixed-size chunk storage; a lot of code that works on one representation should "just work" on the other
  • This could be orthogonal to this specification, though in practice I think you'd want to tell people if they have simple access to their chunked data.

Disadvantages

  • Limited by chunking model of underlying store. Possible in HDF5 via virtual, not yet possible in zarr.
  • Keeps the limitations of existing formats (e.g. still optimized for "entire row" / "entire column" selection)
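To make the idea concrete: the variable storage-chunk boundaries follow directly from the pointer array. A sketch, assuming an illustrative logical chunk of `ROWS_PER_CHUNK` contiguous rows:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
X = sparse.random(1000, 200, density=0.05, format="csr", random_state=rng)
ROWS_PER_CHUNK = 100  # assumed logical chunk: contiguous rows

# Row boundaries of each logical chunk.
row_bounds = np.arange(0, X.shape[0] + 1, ROWS_PER_CHUNK)
# The matching element boundaries in indices/data come straight from indptr.
elem_bounds = X.indptr[row_bounds]
# These sizes vary per chunk, hence the need for variable-sized storage
# chunks (HDF5 virtual datasets today; ZEP 3 for zarr, not yet available).
chunk_sizes = np.diff(elem_bounds)
print(chunk_sizes)
```

With storage chunks sized this way, reading logical chunk `i` is exactly one chunk read per storage array, with no over-fetch.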

Logical chunking of sparse arrays

The third solution is to copy the chunking solutions from dense arrays. That is, we do not rely on the array store's native chunking mechanisms, and instead manage chunks of sparse arrays ourselves.

In this model, a chunked sparse array is a collection of other sparse arrays which would need to be concatenated to recreate an in memory un-chunked representation.

This solution isn't necessarily mutually exclusive with "Logical chunking of storage arrays", and the two could complement each other if each sparse array chunk is large enough to have storage array level chunking.
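A minimal sketch of this model with `scipy.sparse`, where each chunk is an independent CSR array (the `get_rows` helper and `rows_per_chunk` layout are illustrative assumptions, not part of any existing spec):

```python
from scipy import sparse

# Four row-wise logical chunks, each a self-contained sparse array
# (its own indptr/indices/data).
chunks = [
    sparse.random(100, 50, density=0.1, format="csr", random_state=i)
    for i in range(4)
]

def get_rows(chunks, start, stop, rows_per_chunk=100):
    """Read rows [start, stop) by touching only the overlapping chunks."""
    first, last = start // rows_per_chunk, (stop - 1) // rows_per_chunk
    parts = []
    for i in range(first, last + 1):
        lo = max(start - i * rows_per_chunk, 0)
        hi = min(stop - i * rows_per_chunk, rows_per_chunk)
        parts.append(chunks[i][lo:hi])
    # Concatenation recreates the un-chunked in-memory representation.
    return sparse.vstack(parts, format="csr")

sub = get_rows(chunks, 90, 210)  # touches chunks 0, 1, and 2 only
```

Note that the concatenation step is where the "more expensive conversion to in-memory arrays" cost below comes from: each read must rebuild a combined indptr rather than slicing one contiguous array.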

Advantages

  • "Free" stacking. I can create a group with appropriate metadata, then symlink (or kerchunk) in the appropriate sparse array groups, and now I have a chunked sparse array, created without needing to duplicate or read from arrays.
  • Can represent index spaces larger than index dtype.
  • Not just optimized for row/column selection. E.g. one could use CSR but still have chunks along the column dimension, which is useful for sparse arrays with block structure or a spatial interpretation.

(Image: example of chunking along the column dimension)

Disadvantages

  • More complicated implementation
  • More expensive conversion to in-memory arrays
  • Enough rope to hang ourselves with
    • Easy to make it even more complicated (e.g. hilbert curve, kd-tree like), enough that implementation becomes difficult
    • Different "chunks" are different kinds of sparse array

Prior art (far from complete)

@ivirshup
Contributor Author

@eriknw, any chance you took a picture of the white board from our discussion yesterday?
