Is Zarr an OLAP Database? #290

Open
alxmrs opened this issue Mar 18, 2024 · 3 comments
@alxmrs
alxmrs commented Mar 18, 2024

I've been doing some background research for a project I've been working on. I came across the definition of an OLAP DB and OLAP Cube, and I can't help but see the similarities to Zarr.

https://en.wikipedia.org/wiki/OLAP_cube

Consider the operations section of this Wikipedia page:

  • slicing & dicing is just like array indexing and slicing.
  • roll-ups seem like the overviews Zarr extension considered in GeoZarr and other bioinformatics projects.
  • drill down/up doesn't seem to have a mapping to Zarr -- maybe xarray-datatree is most similar?
  • pivots are either queries on Zarr (say, with Xarray) or else rechunk operations to support faster queries.
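
To make the mapping above concrete, here is a minimal sketch using plain NumPy arrays as a stand-in for a Zarr array. The cube layout (`sales[month, region, product]`), its shape, and the variable names are assumptions made up for illustration; nothing here is Zarr-specific, and the roll-up is shown as a simple block sum rather than a real overview pyramid.

```python
import numpy as np

# Hypothetical cube: sales[month, region, product]; names and shape are illustrative.
rng = np.random.default_rng(0)
cube = rng.integers(0, 100, size=(12, 4, 5))  # 12 months, 4 regions, 5 products

# Slice: fix one dimension to a single value, leaving a 2-D sub-cube.
january = cube[0]                              # shape (4, 5)

# Dice: select sub-ranges along several dimensions at once.
diced = cube[0:6, 1:3, :]                      # shape (6, 2, 5)

# Roll-up: aggregate to a coarser level (months -> quarters),
# similar in spirit to building a lower-resolution overview.
quarterly = cube.reshape(4, 3, 4, 5).sum(axis=1)   # shape (4, 4, 5)

# Pivot: reorder the dimensions to view the same data from another angle.
pivoted = cube.transpose(1, 0, 2)              # shape (4, 12, 5)
```

With a chunked store, the pivot is where the analogy gets interesting: a transpose is cheap as a view, but making the new leading dimension fast to read is what a rechunk would be for.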

This makes me wonder if Zarr could be (mis)used as a traditional DB, say, to handle analytics and business use cases. Furthermore, maybe the literature around OLAP DBs could inspire improvements to Zarr as a format.

xref: alxmrs/xarray-sql#47

@d-v-b
Contributor

d-v-b commented Mar 18, 2024

I'm not too familiar with the OLAP conceptualization, but I tend to think of the N-dimensional array API as a special case of a table, where one column contains numeric values and another column contains N-tuples (the indices of the data), which are used to index the values. Array indexing can then be expressed as querying the values based on the index, and many other array operations can be expressed as transformations of the results of such queries. If we want to extend the analogy beyond the API, most N-dimensional array libraries store array data in contiguous buffers, which would make the array a "column-oriented table".
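
As a rough sketch of the "array as a table" view described above, the snippet below loads a tiny 2-D array into an in-memory SQLite table and expresses indexing and a reduction as queries. The table name `arr`, the column names, and the sample values are all made up for illustration, and the index is stored as one integer column per dimension rather than a single N-tuple column, which is an assumption chosen because it is easier to query.

```python
import itertools
import sqlite3

# A small 2x3 array written as nested lists (values are illustrative).
array = [[10, 11, 12],
         [20, 21, 22]]

# Flatten it into rows of (i, j, value): one index column per dimension.
rows = [(i, j, array[i][j])
        for i, j in itertools.product(range(2), range(3))]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE arr (i INTEGER, j INTEGER, value INTEGER)")
conn.executemany("INSERT INTO arr VALUES (?, ?, ?)", rows)

# array[1][2] expressed as a point query on the index columns.
point = conn.execute(
    "SELECT value FROM arr WHERE i = 1 AND j = 2").fetchone()[0]

# array[:, 1:3] expressed as a range query; ORDER BY restores array order.
window = conn.execute(
    "SELECT value FROM arr WHERE j BETWEEN 1 AND 2 ORDER BY i, j").fetchall()

# A sum over axis 1 (row sums) expressed as a GROUP BY.
row_sums = conn.execute(
    "SELECT i, SUM(value) FROM arr GROUP BY i ORDER BY i").fetchall()
```

The table form stores every index explicitly, whereas an array leaves the indices implicit in the buffer layout, which is one way to read the "special case" in the comment above.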

With this in mind, Zarr can be described as a "columnar database" with a few performance / storage optimizations for large columns, but it's not a database that competes with something like DuckDB, since Zarr doesn't have good support for variable-length types.

All that being said, if the world of databases is a superset of the world of N-dimensional arrays, then it's almost certainly the case that we can use tools from database theory / software to advance Zarr.

@alxmrs
Author

alxmrs commented Mar 18, 2024 via email

@d-v-b
Contributor

d-v-b commented Mar 18, 2024

> since Zarr doesn't have good support for variable-length types.
>
> Can you explain this a bit more? Would this be like string or varchar support?

Zarr is designed for fixed-size numeric types, e.g. uint8. There's some effort towards supporting variable-length strings in Zarr, but it's not a common use case. Compared to the types supported by a typical relational database, the numeric types Zarr focuses on are a tiny subset; for comparison, look at the types supported by Postgres.
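
One general reason fixed-size types are convenient for array storage can be sketched with the standard library alone. This is an illustration of the idea, not a description of how Zarr encodes anything: the buffers, the helper names `read_fixed` and `read_variable`, and the sample values are all hypothetical.

```python
import struct

# Fixed-size case: five uint16 values packed back to back (little-endian).
values = [7, 42, 1000, 65535, 3]
fixed_buf = struct.pack("<5H", *values)
itemsize = struct.calcsize("<H")  # 2 bytes per element


def read_fixed(buf, index):
    # The byte offset follows directly from the index and the item size.
    return struct.unpack_from("<H", buf, index * itemsize)[0]


# Variable-length case: UTF-8 strings need a separate table of offsets.
strings = ["a", "zarr", "olap cube"]
encoded = [s.encode("utf-8") for s in strings]
var_buf = b"".join(encoded)
offsets = [0]
for item in encoded:
    offsets.append(offsets[-1] + len(item))


def read_variable(buf, offsets, index):
    # Without the offsets table there is no way to locate element `index`.
    return buf[offsets[index]:offsets[index + 1]].decode("utf-8")
```

Here `read_fixed(fixed_buf, 2)` finds its element by arithmetic alone, while `read_variable(var_buf, offsets, 1)` depends on extra bookkeeping that has to be stored and kept in sync with the data.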
