Expose a Zarr interface to Tiled #562

Open
danielballan opened this issue Aug 31, 2023 · 15 comments

Comments

@danielballan
Member

We could add a dedicated route, like /zarr/v3/ that exposes Tiled's contents as Zarr, such that fsspec would "just work" with it.

Open questions:

  • What to do about structures like table, sparse, and awkward that do not (yet, at least) fit cleanly into Zarr's data model?
  • While small container structures map perfectly onto Zarr groups, Tiled also supports extremely wide containers (~1M entries) and exposes a paginated API. Is there an analog in Zarr?

To start, one option is to simply filter out nodes that do not map cleanly into Zarr, exposing only those nodes that do.
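A minimal sketch of that filtering idea, assuming each node carries a structure-family string. The `Node` class, the `ZARR_COMPATIBLE` set, and `zarr_visible` are hypothetical names for illustration, not Tiled's actual API.

```python
from dataclasses import dataclass

# Assumption: only these structure families map cleanly onto Zarr today.
ZARR_COMPATIBLE = {"array", "container"}


@dataclass
class Node:
    """Hypothetical stand-in for a Tiled node."""
    key: str
    structure_family: str


def zarr_visible(nodes):
    """Return only the nodes that a Zarr route would expose."""
    return [n for n in nodes if n.structure_family in ZARR_COMPATIBLE]


nodes = [
    Node("image", "array"),
    Node("readings", "table"),
    Node("scan_1", "container"),
    Node("events", "awkward"),
    Node("mask", "sparse"),
]
visible_keys = [n.key for n in zarr_visible(nodes)]
print(visible_keys)
```

Widening `ZARR_COMPATIBLE` later (for example once a table layout is agreed on) would expose more nodes without changing the route itself.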

@danielballan
Member Author

Inspired by discussion with @joshmoore at SciPy 2023, but I think not written down until now.

@joshmoore

  • What to do about structures like table, sparse, and awkward that do not (yet, at least) fit cleanly into Zarr's data model?

My assumption is that these would be "layouts" of zarr arrays. @ivirshup can say more on sparse and likely awkward. table I assume could be columnar style with the number of arrays matching the number of columns.
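A rough sketch of the columnar idea, assuming one array per column and an equal-length check. A plain dict stands in for a Zarr group and lists stand in for arrays, so nothing here is real Zarr or anndata API; `table_to_columns` and `validate_table_layout` are hypothetical helpers.

```python
def table_to_columns(rows, column_names):
    """Split row-oriented records into one array (here, a list) per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(column_names)}


def validate_table_layout(group):
    """Check that every column has the same length and return the row count."""
    lengths = {len(column) for column in group.values()}
    if len(lengths) > 1:
        raise ValueError(f"columns have unequal lengths: {sorted(lengths)}")
    return lengths.pop() if lengths else 0


rows = [(1, 0.5, "a"), (2, 1.5, "b"), (3, 2.5, "c")]
group = table_to_columns(rows, ["id", "value", "label"])
n_rows = validate_table_layout(group)
print(sorted(group), n_rows)
```

The number of entries in `group` matches the number of columns, and a reader only needs the equal-length guarantee to interpret the group as a table.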

  • While small container structures map perfectly onto Zarr groups, Tiled also supports extremely wide containers (~1M entries) and exposes a paginated API. Is there an analog in Zarr?

I don't think I understand well enough what you mean by wide containers. If this would be 1M zarrays in a single zgroup, that's doable, but you would never want to list the zgroup. Is this where pagination comes in? If it's a wide array, then I don't think there's a particular issue, because the pagination would map to chunking, right?

@ivirshup

ivirshup commented Sep 1, 2023

I think I actually talked with @danielballan a bit about this in Seattle in May. In anndata, we've got on disk formats for tables and sparse built on top of hdf5/ zarr (description). What Josh describes is basically how we do tables, plus a little bit for more complicated column dtypes like categoricals and nullable bool/ int.

Tools like vitessce can provide nice visualizations based on these formats.

IIRC, tiled's sparse support is based off of chunks which each contain sparse data in COO format? That's not quite how we've been doing sparse in anndata, but I would like to support something more like this, with support for at least CSR and CSC. I've written up a bit of my thoughts on this here: GraphBLAS/binsparse-specification#16.

Our awkward storage isn't too developed. We only have read and write, no partial access.

@trevormanz has also been interested in the idea of a server-backed zarr store.

@danielballan
Member Author

OK, so a "layout" is a standard Zarr on-disk format with a special interpretation layered on top---i.e. "You should interpret this group of arrays as a table and assume that they have equal length?" Does Zarr have an official way to encode a layout?

Yes, "wide containers" would be like 1M zarrays in a single zgroup---too large to list in a single request. Tiled provides filtering (search) and paginated access to make that tractable.
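A minimal sketch of filtered, paginated listing over a wide container, assuming offset/limit paging and a simple key-prefix filter. `list_page` and its parameters are hypothetical and are not Tiled's actual search API; a real service would push the filter down to a database rather than scan a list.

```python
def list_page(keys, offset=0, limit=100, prefix=""):
    """Return one page of matching keys plus the next offset (None when done)."""
    matching = [k for k in keys if k.startswith(prefix)]
    page = matching[offset:offset + limit]
    next_offset = offset + limit if offset + limit < len(matching) else None
    return page, next_offset


# A container far too wide to list in a single request.
keys = [f"scan_{i:07d}" for i in range(1_000_000)]

first_page, next_offset = list_page(keys, offset=0, limit=3)

# Walk a filtered subset page by page; each request returns at most `limit` keys.
collected = []
offset = 0
while offset is not None:
    chunk, offset = list_page(keys, offset=offset, limit=4, prefix="scan_000001")
    collected.extend(chunk)
print(first_page, len(collected))
```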

Yes, tiled's sparse support has chunked-array semantics (like dask.array) where each chunk is transmitted as a table of COO data. We have built in a path for other sparse layouts (CSR, CSC) but not yet implemented them.

Isaac meant to tag @manzt, I think.

@manzt
Contributor

manzt commented Sep 7, 2023

@trevormanz has also been interested in the idea of a server-backed zarr store.

Yes! Thanks for tagging me. I made simple-zarr-server to expose any Python Zarr store (MutableMapping) over HTTP. Would be really valuable to have an even wider Zarr-compat layer with something like tiled. (I bet you could also easily go the other way: expose an existing Zarr store as one of tiled's other endpoints.)
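A minimal stdlib sketch of the general idea of serving a mapping-style store over HTTP. This is not how simple-zarr-server is implemented; `get_from_store` and `make_handler` are hypothetical names, and the store here is a plain dict of bytes.

```python
from http import HTTPStatus
from http.server import BaseHTTPRequestHandler


def get_from_store(store, url_path):
    """Resolve a URL path to a key in a mapping of str -> bytes."""
    key = url_path.lstrip("/")
    if key in store:
        return HTTPStatus.OK, store[key]
    # A missing key is a normal outcome for chunked stores, so answer 404.
    return HTTPStatus.NOT_FOUND, b""


def make_handler(store):
    """Build a request handler class bound to one store."""

    class StoreHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            status, body = get_from_store(store, self.path)
            self.send_response(status)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return StoreHandler


store = {".zarray": b'{"shape": [4]}', "0": b"\x00\x01\x02\x03"}
found = get_from_store(store, "/0")
missing = get_from_store(store, "/1")
handler_class = make_handler(store)
```

Passing `handler_class` to `http.server.HTTPServer(("127.0.0.1", 8000), handler_class).serve_forever()` would serve the dict locally; the host and port are arbitrary choices.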

@danielballan
Member Author

Nice! I don't think I'd seen that one yet. I'm working on a branch to update our docs on "How Tiled Fits Into the Ecosystem". I'll include simple-zarr-server in the list of data services we know about. This looks like a great lightweight option. (Aside: Do you use websockets? I see it in the CLI help string but could not immediately find it in the source. We should chat.)

Yes, we do go the other way: exposing a Zarr store with Tiled's existing endpoints, layering on the things that Tiled gets you for the extra weight:

  • Fast search on top-level metadata (which is copied into a SQL database)
  • AuthN/AuthZ if you need it
  • Transcoding --- e.g. access a slice of array as PNG or JSON

@manzt
Contributor

manzt commented Sep 7, 2023

Aside: Do you use websockets? I see it in the CLI help string but could not immediately find it in the source. We should chat.

It doesn't, oops! I just copied the CLI from uvicorn since it just forwards args to creating the server. Sorry for the confusion.

We should chat.

Definitely! Let's find a time.

  • Fast search on top-level metadata (which is copied into a SQL database)
  • AuthN/AuthZ if you need it
  • Transcoding --- e.g. access a slice of array as PNG or JSON

Wow, awesome. It's been a while since I checked in on things here. The transcoding is something I've always wanted access to from a web app.

One thing I've been thinking about for a while is letting a zarr client "request" preferred encodings through something like request headers or query params.

# cat a basic zarr store
curl -sL https://my-zarr-service.com/data.zarr/.zarray 
# { "dtype": "<u8", "shape": [10000, 10000], "chunks": [1024, 1024], "compression": ... }

# provide "preferred" overrides
curl -sL 'https://my-zarr-service.com/data.zarr/.zarray?dtype=%3Cu2&chunk_x=256&chunk_y=256&compression=gzip'
# { "dtype": "<u2", "shape": [10000, 10000], "chunks": [256, 256], "compression": { "codec_id": "gzip" ...} }
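A server-side sketch of how such "preferred" overrides could be applied to stored metadata. It mirrors the query parameters and metadata keys from the curl example above; `apply_preferences` is a hypothetical helper, and treating `chunk_y`/`chunk_x` as the first and second axes of a 2D array is an assumption. It only rewrites the metadata document; actually re-encoding chunks is out of scope here.

```python
from urllib.parse import parse_qs, urlsplit


def apply_preferences(zarray, url):
    """Return a copy of the metadata with client-requested overrides applied."""
    params = {k: v[-1] for k, v in parse_qs(urlsplit(url).query).items()}
    result = dict(zarray)
    if "dtype" in params:
        result["dtype"] = params["dtype"]
    # Assumption: chunk_y / chunk_x address axes 0 and 1 of a 2D array.
    chunks = list(zarray["chunks"])
    if "chunk_y" in params:
        chunks[0] = int(params["chunk_y"])
    if "chunk_x" in params:
        chunks[1] = int(params["chunk_x"])
    result["chunks"] = chunks
    if "compression" in params:
        result["compression"] = {"codec_id": params["compression"]}
    return result


stored = {"dtype": "<u8", "shape": [10000, 10000], "chunks": [1024, 1024], "compression": None}
url = (
    "https://my-zarr-service.com/data.zarr/.zarray"
    "?dtype=%3Cu2&chunk_x=256&chunk_y=256&compression=gzip"
)
preferred = apply_preferences(stored, url)
print(preferred)
```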

@ivirshup

ivirshup commented Sep 7, 2023

Isaac meant to tag @manzt, I think.

😅, yes. Thanks for figuring that out.

OK, so a "layout" is a standard Zarr on-disk format with a special interpretation layered on top---i.e. "You should interpret this group of arrays as a table and assume that they have equal length?" Does Zarr have an official way to encode a layout?

anndata defines this for itself. This has been discussed a bit around zarr, but nothing has been formalized yet: https://zarr.dev/zeps/draft/ZEP0004.html.

@danielballan
Member Author

@manzt We have exactly the same vision. Choosing the format and compression encoding works now, via HTTP content negotiation headers, e.g.

curl -H 'Accept: application/json' 'https://tiled-demo.blueskyproject.io/api/v1/array/full/generated/medium_image?slice=:5,:5'

We also support a custom query parameter ?format for contexts where headers cannot be set, like shareable links.

curl 'https://tiled-demo.blueskyproject.io/api/v1/array/full/generated/medium_image?format=json&slice=:5,:5'
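A simplified sketch of the two selection paths just described: a `format` query parameter and an `Accept` header. This is not Tiled's implementation. The alias table, the default media type, the choice to let the query parameter win over the header, and ignoring `q` weights are all assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# Assumption: short names accepted by ?format=, mapped to media types.
FORMAT_ALIASES = {
    "json": "application/json",
    "png": "image/png",
    "octet-stream": "application/octet-stream",
}
DEFAULT_MEDIA_TYPE = "application/octet-stream"


def choose_media_type(url, accept_header=None):
    """Pick a response media type from ?format= first, then the Accept header."""
    query = parse_qs(urlsplit(url).query)
    if "format" in query:
        requested = query["format"][-1]
        return FORMAT_ALIASES.get(requested, requested)
    if accept_header:
        # Take the first listed media type that the server can produce.
        for item in accept_header.split(","):
            media_type = item.split(";")[0].strip()
            if media_type in FORMAT_ALIASES.values():
                return media_type
    return DEFAULT_MEDIA_TYPE


base = "https://tiled-demo.blueskyproject.io/api/v1/array/full/generated/medium_image"
from_query = choose_media_type(base + "?format=json&slice=:5,:5")
from_header = choose_media_type(base + "?slice=:5,:5", "application/json")
query_wins = choose_media_type(base + "?format=png", "application/json")
print(from_query, from_header, query_wins)
```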

We have not addressed re-chunking or requesting a coarser dtype, but these ideas have been raised and are certainly in scope.

I've sent you an email.

Thanks for the reference, @ivirshup. That ZEP was not yet on our radar.

I like the idea of specified higher-level interpretations for Zarr data. It would cover some of our use cases, though not all of them. For example, we sometimes handle very wide tables---a snapshot of the state of a large amount of scientific hardware before and after an experiment---which can be 2 rows long and hundreds of columns wide. I believe this is not a good fit for a Zarr group, performance-wise, but it's a great fit for Arrow or Parquet.

@ivirshup

ivirshup commented Sep 7, 2023

For example, we sometimes handle very wide tables---a snapshot of the state of a large amount of scientific hardware before and after an experiment---which can be 2 rows long and hundreds of columns wide. I believe this is not a good fit for a Zarr group, performance-wise, but it's a great fit for Arrow or Parquet.

This is like pd.DataFrame(np.random.rand(2, 10_000))?

With my understanding of parquet and arrow, this would also be a bad case since there's an overhead per column. However, using a DirectoryStore with anndata's tabular encoding would definitely be far worse. I think if it was a ZipStore or something it may not be so different. I have also been wondering if we could shard across zarr arrays, letting anndata's table representation have something more like parquet's row-groups.

This may be a bit off topic though, so happy to refocus.

@danielballan
Member Author

Actually I think this may be relevant. If we expose Tiled as "a Zarr", i.e. add a /zarr/v3/{path} endpoint that is Zarr-like, what should we do with tables? If we expose them as Zarr groups, will they be so slow it would be better just to omit them?

import numpy as np
import pandas as pd
import zarr

df = pd.DataFrame(np.random.rand(2, 1000))

%timeit df.to_parquet('test.parquet')
66.9 ms ± 1.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

g = zarr.open_group("test.zarr", "w")
%timeit for column in df: g[str(column)] = np.asarray(df[column])
740 ms ± 50.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

But I don't want to dwell too much on what may be an edge case. A best-effort that we can kick the tires on may be the place to begin.

Thanks for GraphBLAS/binsparse-specification#16, by the way, a great overview of options. I believe Tiled is doing "Logical chunking of sparse arrays" at the API level. Internally, for the one mode of storage we currently support, we are also doing "Logical chunking of storage arrays", but this is not a strongly-held design choice, and alternatives could be added without any major changes.

@ivirshup

ivirshup commented Sep 8, 2023

Honestly, a 10x performance drop is way better than I was expecting 😆

But also, this case is another order of magnitude faster to write to json, so I strongly think it's an edge case.

In [10]: %time df.to_json("test.json")
CPU times: user 1.69 ms, sys: 1.63 ms, total: 3.32 ms
Wall time: 3.53 ms

If we expose them as Zarr groups, will they be so slow it would be better just to omit them?

I think it would be fine to expose them. The way anndata does tables will only be reasonable for columnar access (or pretty large chunks of contiguous rows) once you have a larger number of rows.


Thanks for GraphBLAS/binsparse-specification#16, by the way, a great overview of options.

Thanks!

I believe Tiled is doing "Logical chunking of sparse arrays" at the API level.

Am I remembering correctly that you were doing global indices as opposed to chunk local indices + an offset? And was that at storage or API level?

@danielballan
Member Author

danielballan commented Sep 8, 2023

Yeah, that's fair enough, for sure.


At an API level, we support:

/array/full/{path}?slice=...  # global indices

or

/array/block/{path}?block=0,0,0&slice=...  # block-local indices

where block is a block index like dask array has.

Storage uses block-local indices within each Parquet file. There is a convenience constructor for building local blocks from a global reference frame.
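A minimal sketch of building block-local COO data from a global reference frame, assuming regular chunk sizes along each axis. `to_local_blocks` and `to_global` are hypothetical helpers for illustration, not Tiled's convenience constructor.

```python
from collections import defaultdict


def to_local_blocks(coords, data, chunk_shape):
    """Partition global COO entries into blocks with block-local indices.

    coords: list of global index tuples, one per stored value
    data: list of values, same length as coords
    chunk_shape: block size along each axis
    """
    blocks = defaultdict(lambda: {"coords": [], "data": []})
    for index, value in zip(coords, data):
        block = tuple(i // c for i, c in zip(index, chunk_shape))
        local = tuple(i % c for i, c in zip(index, chunk_shape))
        blocks[block]["coords"].append(local)
        blocks[block]["data"].append(value)
    return dict(blocks)


def to_global(block, local, chunk_shape):
    """Invert the mapping: block index plus local index gives the global index."""
    return tuple(b * c + i for b, i, c in zip(block, local, chunk_shape))


coords = [(0, 1), (3, 7), (5, 2), (9, 9)]
data = [1.0, 2.0, 3.0, 4.0]
blocks = to_local_blocks(coords, data, chunk_shape=(5, 5))
print(sorted(blocks))
```

Each key of `blocks` plays the role of the `block` index in the block route, and the stored coordinates are relative to that block's origin.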

Did that answer your question? It's been a while since sparse has been on the "front burner" of my brain.


Reading more through the anndata links, I see how it shows clear patterns for presenting all of the structure families currently supported by Tiled as Zarr. This seems like a great place to start. Thanks for doing all the work. :-D

@dylanmcreynolds
Contributor

As we dip our toes into zarr, we would love to have an endpoint that could serve zarr directories and files the way a plain old web server could and enforce tiled authN/authZ, while still exposing those data sets in the tiled ArrayClient.

@danielballan
Member Author

Great. I think that is exactly what we intend with this issue.

5 participants