Adding sparse array support #245

Open
elyall opened this issue May 25, 2023 · 15 comments

@elyall

elyall commented May 25, 2023

Zarr would benefit from formal support for storing sparse arrays, though I'm not sure what this would look like or what the process for getting there would be (e.g. placing the functionality in the core spec, creating a numcodec, and/or writing a ZEP). But from some of the linked comments below it does seem very achievable.

If I'm wrong anywhere, please correct me. I'm not an expert on these packages.

Example use case: parallel processing of single cell RNA sequencing (scRNA-seq) data across a cluster.

Single cell sequencing produces large arrays that are often sparse and often too large to hold in memory. They would benefit from chunked, sparse, on-disk storage to enable efficient chunked parallel processing on a Dask or Ray cluster.

The current scRNA-seq analysis stack uses anndata to wrap the data and scanpy to analyze it. anndata was built on HDF5 files, but there has been an effort to utilize zarr and dask (with the current caveat seemingly that the central matrix is a dense np.ndarray due to zarr's limitation). scanpy doesn't fully support dask arrays at the moment (e.g. scverse/scanpy#2491), but since dask arrays follow the numpy API that shouldn't be hard to fix. Dask supports sparse arrays, and Ray can be used as a plug-in scheduler for dask. In practice I've found anndata is able to load a scipy sparse CSR matrix from an HDF5 file.
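
As a quick illustration of the Dask piece, here is a minimal sketch of a dask array backed by sparse chunks, following the pattern from dask's sparse documentation; the chunk sizes and threshold are arbitrary:

import dask.array as da
import sparse

# build a dask array whose chunks are sparse.COO instead of np.ndarray
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
x[x < 0.99] = 0               # make it mostly zeros
s = x.map_blocks(sparse.COO)  # convert each chunk to sparse.COO
s.sum(axis=0).compute()       # reductions proceed chunk by chunk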

Some previous efforts and discussions I've found

Proposed test code to set the goalpost

We can try to implement all five of the array formats I've come across (included below), or stick to sparse.COO, which follows the numpy API most completely, is more powerful by being n-dimensional, and is easy to chunk since the coordinates are stored explicitly (though it is less space efficient than CSR or CSC). sparse.COO is also easily converted in memory to the other sparse array types.

import zarr
import sparse
import numpy as np

# generate data
sparse_array = sparse.random((100, 100), density=0.1, random_state=42)
sparse_arrays = [
    sparse_array,
    sparse.DOK.from_coo(sparse_array),
    sparse.GCXS.from_coo(sparse_array),
    sparse_array.tocsr(),
    sparse_array.tocsc(),
]

for sparse_array in sparse_arrays:
    zarr.save("test.zarr", sparse_array, chunks=(20, 20))  # write
    read_array = zarr.load("test.zarr")  # read

    # test the round trip
    assert type(read_array) == type(sparse_array)
    assert read_array.shape == sparse_array.shape
    if isinstance(read_array, np.ndarray):
        assert (sparse_array == read_array).all()  # np.ndarray doesn't have .nnz
    else:
        assert (sparse_array != read_array).nnz == 0  # not all sparse formats implement .all()

My ultimate use case

I'd like to analyze this dataset on a (local/remote) Ray cluster using scanpy's default workflow as a starting point.

Some relevant stakeholders

@ivirshup, @alimanfoo, @rabernat, @jakirkham, @daletovar, @MSanKeys963

@elyall
Author

elyall commented May 25, 2023

I also meant to include @joshua-gould

@rabernat
Contributor

I think this is both a great idea and pretty straightforward to implement. Before thinking about any spec changes or extensions, I would work on a prototype implementation.

I think this can be implemented in two steps in zarr-python relatively simply.

  1. Implement a sparse codec in numcodecs. The codec's job would be to take an array as input (sparse or dense) and serialize it to bytes using any of the sparse formats you described, and then to reverse this operation on the way out of storage. Ideally the codec could be pipelined with other codecs such as zstd for compression. (A rough sketch of such a codec follows this list.)
  2. Ensure you can wire that through to a sparse array type using the newly introduced meta_array parameter (see Getitems: support meta_array zarr-python#1131)
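
A rough, untested sketch of what step 1 could look like is below; the codec id, byte layout, and hardcoded dtypes are invented purely for illustration (a real codec would also record the dtypes and fill value in its configuration):

import numpy as np
import sparse
from numcodecs import register_codec
from numcodecs.abc import Codec

class SparseCOOCodec(Codec):
    codec_id = "sparse_coo"  # hypothetical id

    def encode(self, buf):
        arr = sparse.as_coo(buf)
        # hardcode int64 coords and float64 data for simplicity; fill_value assumed to be 0
        coords = np.ascontiguousarray(arr.coords, dtype="<i8")
        data = np.ascontiguousarray(arr.data, dtype="<f8")
        header = np.array([arr.ndim, data.size, *arr.shape], dtype="<i8")
        return header.tobytes() + coords.tobytes() + data.tobytes()

    def decode(self, buf, out=None):
        buf = memoryview(buf)
        ndim, nnz = np.frombuffer(buf[:16], dtype="<i8")
        shape = tuple(np.frombuffer(buf[16:16 + 8 * ndim], dtype="<i8"))
        off = 16 + 8 * ndim
        coords = np.frombuffer(buf[off:off + 8 * ndim * nnz], dtype="<i8").reshape(ndim, nnz)
        data = np.frombuffer(buf[off + 8 * ndim * nnz:off + 8 * (ndim + 1) * nnz], dtype="<f8")
        return sparse.COO(coords, data, shape=shape)

register_codec(SparseCOOCodec)

Because the codec emits a plain byte string, a bytes -> bytes codec such as zstd could in principle be chained after it.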

Based on what is learned from this exercise, we can think about what, if any, spec changes or extensions would be required to "formally" support sparse.

@rabernat
Contributor

rabernat commented May 25, 2023

p.s. I actually think that the sparse GCXS format is an excellent sparse encoding, superseding both CSR and CSC.
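
(For reference, in pydata/sparse CSR and CSC are essentially special cases of GCXS selected via the compressed_axes option — a quick sketch, assuming a current pydata/sparse release:)

import sparse

coo = sparse.random((100, 100), density=0.1, random_state=0)
csr_like = sparse.GCXS.from_coo(coo, compressed_axes=[0])  # CSR-like: axis 0 compressed
csc_like = sparse.GCXS.from_coo(coo, compressed_axes=[1])  # CSC-like: axis 1 compressed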

@ivirshup

ivirshup commented May 25, 2023

Hey! I will try to write up a longer response later, but am at a hackathon. Just wanted to add a few points.

(with the current caveat seemingly that the central matrix is a dense np.ndarray due to zarr's limitation).

This is not the case, anndata sparse storage works well with zarr. The limitation here has been dask support for sparse arrays, which is also something that is being worked on.
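
For concreteness, a minimal (untested) sketch of that round trip with current anndata; the file name is arbitrary:

import anndata as ad
import scipy.sparse as sp

adata = ad.AnnData(X=sp.random(100, 50, density=0.1, format="csr"))
adata.write_zarr("example.zarr")          # X is stored sparse, not densified
roundtrip = ad.read_zarr("example.zarr")  # X comes back as a scipy CSR matrix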

However, some promising developments: https://scverse.zulipchat.com/#narrow/stream/315789-data-structures/topic/dask.20in.20anndata/near/355745689

@rabernat, I would be interested in talking about this in more detail, but don't think sparse arrays will work well with a codec. A lot of this has previously been discussed in #62. My current opinion is that structures like sparse, ragged, or dataframes make much more sense as conventions than in the core zarr spec.

I'd also note that anndata is collaborating with the GraphBLAS team (and others) to generate a more general binary sparse representation built on top of things like zarr and hdf5. v1 won't contain n-dimensional CSF/GCXS support, but a spec for those kinds of arrays is looking pretty mature. You can find it here:

https://github.com/GraphBLAS/binsparse-specification

@ivirshup

Ideally the codec could be pipelined with other codecs such as zstd for compression.

I think this is one of the key problems with having sparse as a codec. Codecs make a lot of sense for arrays which are one contiguous chunk of memory that needs to be read with a single dtype. A CSR array is really 3 distinct chunks of memory, each of which may need to be interpreted with a different dtype. I think this will be a problem for any codec more complicated than "decompress arbitrary buffer", which seems to be the direction codecs are headed in v3, e.g. endianness, transpose, delta...
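
For concreteness, here are the three buffers of a scipy CSR matrix; the dtypes shown are typical, not guaranteed:

import scipy.sparse as sp

m = sp.random(1000, 1000, density=0.01, format="csr", random_state=0)
print(m.data.dtype, m.indices.dtype, m.indptr.dtype)  # e.g. float64 int32 int32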

@rabernat
Contributor

rabernat commented May 25, 2023

I see your point @ivirshup and I recognize you have thought much more deeply about this than me! 😀

I think this will be a problem for any codec more complicated than "decompress arbitrary buffer"

In the ratified version of the V3 spec there are three types of codecs (see #222):

Each codec has an *encoded representation* and a *decoded representation*;
each of these two representations are defined to be either:
- a multi-dimensional array of some shape and data type, or
- a byte string.
Based on the input and output representations for the encode transform,
codecs can be classified as one of three kinds:
- ``array -> array``
- ``array -> bytes``
- ``bytes -> bytes``

The spec does not explicitly say that the array has to be a dense numpy-style array. So in principle we can have a codec that takes an arbitrary buffer (however you want to serialize the sparse array) and returns a sparse array (e.g. sparse.COO or sparse.GCXS). This seems like an implementation detail which would be language specific. In my mind, using such a sparse codec, together with meta_array, should be enough to trigger this sort of path in zarr-python.

@jbms
Contributor

jbms commented May 25, 2023

The spec does not explicitly say that the array has to be a dense numpy-style array. So in principle we can have a codec that takes an arbitrary buffer (however you want to serialize the sparse array) and returns a sparse array (e.g. sparse.COO or sparse.GCXS). This seems like an implementation detail which would be language specific. In my mind, using such a sparse codec, together with meta_array, should be enough to trigger this sort of path in zarr-python.

Yes, I think the sparse codec could be handled as an "array -> bytes" codec under the current v3 spec with no changes required. The in-memory representation of arrays is an implementation detail and not part of the spec.

Exactly how easily this can be implemented in zarr-python or zarrita remains to be seen, though.

One potential challenge with simply using meta_array is that if the read or write request does not exactly correspond to a single chunk, then zarr-python needs to manipulate the chunks with array indexing operations. For example, if a read request intersects multiple chunks, zarr-python creates an empty array for the result, and then copies in the relevant portion of each chunk. As far as I can see, of the sparse formats supported by the sparse library, only "dictionary of keys" (DOK) supports writes. I expect, therefore, that zarr-python would need some custom logic to assemble multiple sparse arrays together in order to support in-memory formats other than DOK.
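
A quick (untested) illustration with pydata/sparse — DOK accepts item assignment, while the other in-memory formats are effectively read-only:

import sparse

d = sparse.DOK((10, 10))
d[2, 3] = 1.0   # DOK supports element assignment
c = d.to_coo()  # convert to COO for computation
# c[2, 3] = 2.0 would raise: COO does not support item assignment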

@elyall
Author

elyall commented May 26, 2023

@ivirshup you are indeed the expert on this subject. While I'm very new to this, I'm happy to help where I can.

anndata sparse storage works well with zarr. The limitation here has been dask support for sparse arrays, which is also something that is being worked on.

News to me. The website really makes it sound like sparse implementing a subset of the NumPy API solved this problem. I will look into this more.

My current opinion is that structures like sparse, ragged, or dataframes make much more sense as conventions than in the core zarr spec.

I agree completely, though that's a much taller task; I posted here thinking this was a good place to start. binsparse-specification looks promising. Is there a repo for writing/reading arrays to/from zarr or hdf5 files with this spec yet?


@rabernat, naively, writing a codec seems approachable to me as the process seems to have good documentation. There are implementation questions, though, that make me wonder whether it is actually doable:

  • As @ivirshup said, the pointers, indices, and values often have different dtypes and therefore should be stored as independent (chunked) objects. This does seem similar to an xarray DataArray with multiple indexes along an axis, and I know their Dataset has a to_zarr method... (see the sketch after this list)
  • Since the array won't be uniformly sparse, you either have to allow the chunks to have variable length on disk, or save a separate object of pointers that gets read in first and specifies the starting offset of each chunk so that the slice operation knows which chunk(s) to load in.
  • As @jbms points out, writing to sparse arrays is not trivial considering the whole point is to not allocate memory for values equal to the fill value. If a value needs to be added I imagine it would cause a number of issues for on-disk allocation unless some extra space is included in each chunk (though this does have parallels to expanding an array along a non-contiguous dimension in memory).
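
Regarding the first bullet: storing the components as separate arrays inside a zarr group already works with plain zarr-python today. A minimal sketch (the group layout and attribute names here are made up, not any standard):

import scipy.sparse as sp
import zarr

m = sp.random(1000, 1000, density=0.01, format="csr", random_state=0)

g = zarr.open_group("csr.zarr", mode="w")
g.array("data", m.data)        # values (float dtype)
g.array("indices", m.indices)  # column indices (int dtype)
g.array("indptr", m.indptr)    # row pointers (int dtype)
g.attrs["shape"] = list(m.shape)
g.attrs["format"] = "csr"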

@elyall
Author

elyall commented May 26, 2023

Also @ivirshup, I'm still processing "anndata sparse storage works well with zarr". I need to play with it more and dive deeper into the code.

@jbms
Contributor

jbms commented May 26, 2023

  • As @ivirshup said, the pointers, indices, and values often have different dtypes and therefore should be stored as independent (chunked) objects. This does seem similar to an xarray DataArray with multiple indexes along an axis and I know their Dataset has a to_zarr method...

In zarr v3, a sparse "array -> bytes" codec would be responsible for generating a sequence of bytes to represent the sparse array. It might do this by first converting the sparse array into a collection of dense arrays using one of the standard sparse array representations, then encoding each of these dense arrays, and then concatenating them together. It may make sense for this sparse array codec to have as configuration options separate sequences of codecs that can be applied to each of the dense arrays. Zarr v3 also allows you to specify additional "bytes -> bytes" codecs at the top-level that would apply to the concatenated result. For example:

"codecs": 
   [{"name": "sparse_gcxs",
     "configuration": {
       "ro_codecs": [{"name": "endian", "endian": "little"}, {"name": "blosc"}],
       "co_codecs": [...],
       "vl_codecs": [...]
     }
  }]

Normally when working with a sparse array you will need all of the dense arrays, so storing them all together in a single chunk makes sense. However, I suppose in some cases you might only need to access e.g. the values array and not the coordinates array, for example if applying the same operation to every value independently. That would not be possible if they are all stored in a single chunk. To be able to split the chunk over multiple keys in the underlying storage would require some changes to the zarr v3 codec model, though.
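
A rough illustration of that layout in plain Python; the per-component compression and concatenation here are ad hoc, not any spec'd encoding, and in a real codec the component lengths and dtypes would live in the chunk header or the codec configuration:

import numpy as np
import numcodecs
import sparse

x = sparse.GCXS.from_coo(sparse.random((100, 100), density=0.1, random_state=0))
zstd = numcodecs.Zstd(level=3)

# GCXS holds three dense 1-D arrays (values, indices, pointers): encode each, then concatenate
parts = [np.ascontiguousarray(x.data), np.ascontiguousarray(x.indices), np.ascontiguousarray(x.indptr)]
encoded = [zstd.encode(p) for p in parts]
lengths = np.array([len(e) for e in encoded], dtype="<u8")
chunk_bytes = lengths.tobytes() + b"".join(encoded)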

  • Since the array won't be uniformly sparse, you either have to allow the chunks to have variable length on disk, or save a separate object of pointers that gets read in first and specifies the starting offset of each chunk so that the slice operation knows which chunk(s) to load in.

We already have chunks taking up a variable amount of space on disk when using compression with regular dense arrays.

  • As @jbms points out, writing to sparse arrays is not trivial considering the whole point is to not allocate memory for values equal to the fill value. If a value needs to be added I imagine it would cause a number of issues for on-disk allocation unless some extra space is included in each chunk (though this does have parallels to expanding an array along a non-contiguous dimension in memory).

To be clear, there is a difference between in-memory representation and on-disk representation. The existing sparse array formats are well-suited for on-disk representation, since even with dense arrays, zarr has to re-write chunks in their entirety when they are modified (assuming any sort of compression is used, and in practice it likely will have to even if no compression is being used).

The in-memory representation that makes sense would depend on specific details of the implementation. I don't think there are any fundamental challenges there, but it may be necessary to make changes to the zarr-python code to support it.

@rabernat
Contributor

I expect, therefore, that zarr-python would need some custom logic to assemble multiple sparse arrays together in order to support in-memory formats other than DOK

Just a [slightly off topic] thought... it would be very cool if pydata/sparse could implement a compressed-sparse-blocks format. There are several examples of this out there, although it looks like they always assume 2D matrices.

This would map very well to a Zarr array, with each chunk forming a block.

@joshmoore
Member

@ivirshup can say more but my understanding from SciPy was that https://github.com/GraphBLAS/binsparse-specification should be ready for investigations in Zarr-land soon.

@taddyb

taddyb commented Feb 15, 2024

Has there been any progress made towards this in the past several months? @rabernat @joshua-gould @elyall @jbms

Based on the above links to https://github.com/ivirshup/binsparse-python, there hasn't been a binsparse-python commit in ~6 months, so it may be wise to proceed without it. As mentioned above, there is good dask support for sparse arrays, which is a starting point.

While there are complications with the variable sizing of some sparse formats (CSR, for example), it makes sense to at least start with an implementation that supports sparse.COO since it's relatively straightforward.

I'm hoping to take a stab at this over the coming weeks.

@ivirshup

ivirshup commented Feb 19, 2024

You should check out https://github.com/graphblas/binsparse-specification, which is more up to date. I haven't had time recently to update my initial python code, but hopefully will get to it in March.

Currently we are working more on doing stuff with sparse in dask, but I would not call the level of support "good".

@taddyb

taddyb commented Apr 3, 2024

As an update to this thread, I've been working with @ivirshup on updating the initial binsparse-py code for this use case.
