
Chunked sparse arrays (for v2) #16

Open
ivirshup opened this issue Mar 2, 2023 · 1 comment

ivirshup commented Mar 2, 2023

I would like to bring up the discussion of chunking again, as something to be addressed in v2.

I see two main strategies here:

  • Chunk the storage level arrays
  • Chunk the sparse arrays

I think chunking the sparse arrays themselves is the way to go – as it enables more flexibility. I think this is also a pretty safe option, as others have gone this route in the past (see: "prior art" at bottom).

I'm going to go through what the current set-up looks like and how these two approaches could work. I haven't thought much about how these approaches interact with n-dimensional or hypersparse formats.

@eriknw and @jim22k, would be interested to hear your perspective from experience with metagraph and dask-graphblas especially.

(Image: comparison of the chunking strategies described below)

Above is a representation of the different chunking strategies. Each strategy depicts the storage arrays for a CSR or CSC matrix. Color denotes a "logical chunk": a range of values along the compressed dimension, e.g. a contiguous set of rows for CSR. The solid lines illustrate how the arrays are subdivided into chunks.

Fixed size chunks

First is fixed size chunks in storage, basically what is being done right now. Here, chunk sizes are fixed for each storage array. This means that "logical chunks" have no set correspondence with storage chunks.

Advantages

  • It already exists; no further work is needed

Disadvantages

  • For any selection of rows, I need to read the pointer array before I can know which chunks of the indices and data arrays to retrieve.
  • For any selection of rows, it is very unlikely the storage chunk boundaries will coincide with logical chunk boundaries.
    • For distributed processing, this means each node will almost always be retrieving more data than it needs. How much data is tunable with chunk size – though this tuning is not straightforward.
    • For single node iterative processing, this significantly complicates the code. In practice I have needed to manage my own chunk cache. Knowing that each row sits within one chunk would remove that need.
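A minimal sketch of the over-fetch problem, using `scipy.sparse` (the fixed chunk size `CHUNK` and the matrix here are illustrative assumptions, not values from any real store):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
X = sparse.random(100, 50, density=0.2, format="csr", random_state=rng)
CHUNK = 256  # assumed fixed storage chunk size for the indices/data arrays

rows = slice(10, 20)
# Step 1: the pointer array must be read first to locate the value range.
start, stop = X.indptr[rows.start], X.indptr[rows.stop]

# Step 2: map that range onto fixed-size storage chunks of indices/data.
first_chunk, last_chunk = start // CHUNK, (stop - 1) // CHUNK
fetched = (last_chunk - first_chunk + 1) * CHUNK  # elements actually read
needed = stop - start                              # elements actually wanted
print(f"need {needed} elements, fetch {fetched}")
```

Because chunk boundaries almost never coincide with row boundaries, `fetched` is typically larger than `needed`, and shrinking `CHUNK` to reduce the waste trades off against per-chunk read overhead.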

Logical chunking of storage arrays

One solution here is to make the storage chunk boundaries match up with the logical boundaries. A major downside is that this requires the storage backend to allow variable sized chunks. This exists for arrow, and for hdf5 via virtual datasets (example). I would like this ability to exist in zarr v3 (see: ZEP 3), but it doesn't yet.

Advantages

  • Very similar to fixed-size chunk storage; a lot of code that works on one representation should "just work" on the other
  • This could be orthogonal to this specification, though in practice I think you'd want to tell people if they have simple access to their chunked data.

Disadvantages

  • Limited by chunking model of underlying store. Possible in HDF5 via virtual, not yet possible in zarr.
  • Keeps the limitations of existing formats (e.g. still optimized for "entire row" / "entire column" selection)
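To make the idea concrete: the variable storage-chunk boundaries follow directly from the pointer array. A sketch, assuming an illustrative logical chunk of `ROWS_PER_CHUNK` contiguous rows:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
X = sparse.random(1000, 200, density=0.05, format="csr", random_state=rng)
ROWS_PER_CHUNK = 100  # assumed logical chunk: contiguous rows

# Row boundaries of each logical chunk.
row_bounds = np.arange(0, X.shape[0] + 1, ROWS_PER_CHUNK)
# The matching element boundaries in indices/data come straight from indptr.
elem_bounds = X.indptr[row_bounds]
# These sizes vary per chunk, hence the need for variable-sized storage
# chunks (HDF5 virtual datasets today; ZEP 3 for zarr, not yet available).
chunk_sizes = np.diff(elem_bounds)
print(chunk_sizes)
```

With storage chunks sized this way, reading logical chunk `i` is exactly one chunk read per storage array, with no over-fetch.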

Logical chunking of sparse arrays

The third solution is to copy the chunking solutions from dense arrays. That is, we do not rely on the array store's native chunking mechanisms, and instead manage chunks of sparse arrays ourselves.

In this model, a chunked sparse array is a collection of other sparse arrays which would need to be concatenated to recreate an in memory un-chunked representation.

This solution isn't necessarily mutually exclusive with "Logical chunking of storage arrays", and the two could complement each other if each sparse array chunk is large enough to have storage array level chunking.
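A minimal sketch of this model with `scipy.sparse`, where each chunk is an independent CSR array (the `get_rows` helper and `rows_per_chunk` layout are illustrative assumptions, not part of any existing spec):

```python
from scipy import sparse

# Four row-wise logical chunks, each a self-contained sparse array
# (its own indptr/indices/data).
chunks = [
    sparse.random(100, 50, density=0.1, format="csr", random_state=i)
    for i in range(4)
]

def get_rows(chunks, start, stop, rows_per_chunk=100):
    """Read rows [start, stop) by touching only the overlapping chunks."""
    first, last = start // rows_per_chunk, (stop - 1) // rows_per_chunk
    parts = []
    for i in range(first, last + 1):
        lo = max(start - i * rows_per_chunk, 0)
        hi = min(stop - i * rows_per_chunk, rows_per_chunk)
        parts.append(chunks[i][lo:hi])
    # Concatenation recreates the un-chunked in-memory representation.
    return sparse.vstack(parts, format="csr")

sub = get_rows(chunks, 90, 210)  # touches chunks 0, 1, and 2 only
```

Note that the concatenation step is where the "more expensive conversion to in-memory arrays" cost below comes from: each read must rebuild a combined indptr rather than slicing one contiguous array.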

Advantages

  • "Free" stacking. I can create a group with appropriate metadata, then symlink (or kerchunk) in the appropriate sparse array groups, and now I have a chunked sparse array, created without needing to duplicate or read from arrays.
  • Can represent index spaces larger than index dtype.
  • Not just optimized for row/column selection. E.g. one could use CSR but still have chunks along the column dimension, which is useful for sparse arrays with block structure or a spatial interpretation.

(Image: example of chunking along the column dimension)

Disadvantages

  • More complicated implementation
  • More expensive conversion to in-memory arrays
  • Enough rope to hang ourselves with
    • Easy to make it even more complicated (e.g. hilbert curve, kd-tree like), enough that implementation becomes difficult
    • Different "chunks" are different kinds of sparse array

Prior art (far from complete)

@ivirshup
Contributor Author

@eriknw, any chance you took a picture of the white board from our discussion yesterday?
