Support extendable datasets #186

Open
tannerbitz opened this issue Oct 5, 2021 · 3 comments
No description provided.

constantinpape (Owner) commented Oct 5, 2021

Thanks for opening this, @tannerbitz. Copying the relevant part from the emails:

> I have a question regarding the use of your z5 library as an alternative to hdf5. Is there a way to extend a dataset during runtime, in my case for logging? Or possibly to link datasets together? Any help would be greatly appreciated!
> I'm looking to use both languages: C++ for embedded device use, and Python for later data analysis.
> I'm open to using either data format, n5 or zarr. Based on your docs, it looks like I may be limited to little-endian format for zarr, so maybe I should be leaning towards n5, as I'm not sure whether the embedded devices in question will be little- or big-endian natively?

In theory, both extending and linking datasets are possible (and not very complex, due to the simple layout of zarr / n5 files).
However, this is not yet implemented natively in z5 (mostly because I am not quite sure what the best interface for it would be).
There are different routes to implement this:

  • extending datasets:
    • simple: just make the dimensions large enough that your data will always fit; thanks to chunking this does not incur any performance or storage downsides, because chunks that are never written do not exist on disk. If you need to know the actual extent, keep track of it in your code and serialize it in the attributes; once you stop extending the dataset, you can also crop the dimensions to the actual size.
    • advanced: extend the shape in the metadata whenever a write access goes past the current dimensions (more complex, because you may need to rewrite cropped border chunks); see the sketch right after this list.
  • linking datasets:
    • simple: load all the data into memory and write a new dataset; see the second sketch below.
    • advanced: add up the shapes of the initial datasets and link or move the chunks (again, this might need some extra treatment for cropped border chunks).
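
To make the "advanced" extending route concrete, here is a minimal sketch at the n5 format level, editing the dataset's attributes.json directly rather than going through z5 (which does not expose this yet). `extend_n5_dimensions` is a made-up helper name, and instead of rewriting a cropped border block it conservatively refuses that case, matching the caveat above:

```python
import json
import os

def extend_n5_dimensions(ds_path, axis, new_size):
    """Grow an n5 dataset along `axis` by rewriting the 'dimensions'
    entry in its attributes.json. `axis` refers to the axis order as
    stored in the metadata."""
    meta_path = os.path.join(ds_path, 'attributes.json')
    with open(meta_path) as f:
        meta = json.load(f)

    dims, blocks = meta['dimensions'], meta['blockSize']
    if new_size < dims[axis]:
        raise ValueError("only growing the dataset is supported here")
    if dims[axis] % blocks[axis] != 0:
        # n5 stores cropped border blocks; this one would have to be
        # rewritten before the shape can simply be enlarged
        raise NotImplementedError("cropped border block needs rewriting")

    meta['dimensions'][axis] = new_size
    with open(meta_path, 'w') as f:
        json.dump(meta, f)
```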

I don't have much time to integrate any of this into the library right now, but I could quickly code up a prototype for any of these options in Python; it should also be relatively easy to translate to C++.
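
For the "simple" linking route, such a prototype might look like this (a sketch only; `concatenate_datasets` is a made-up name, and I'm assuming z5py's h5py-style file/dataset API):

```python
import numpy as np
import z5py  # z5's python bindings, h5py-like API assumed

def concatenate_datasets(out_path, out_key, inputs, axis=0, chunks=None):
    """The 'simple' linking route: load every input dataset fully into
    memory, concatenate, and write the result as one new dataset.
    `inputs` is a list of (container_path, dataset_key) pairs."""
    arrays = [z5py.File(path, mode='r')[key][:] for path, key in inputs]
    data = np.concatenate(arrays, axis=axis)

    f = z5py.File(out_path, mode='a')
    if chunks is None:  # fallback chunking, purely illustrative
        chunks = tuple(min(64, s) for s in data.shape)
    ds = f.create_dataset(out_key, shape=data.shape, dtype=data.dtype,
                          chunks=chunks)
    ds[:] = data
    return ds
```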

Let me know what you think.

clbarnes (Contributor) commented Oct 6, 2021

I forget, do the zarr/N5 specs support dataset reshaping? They'd handle the edge chunks quite differently.

Otherwise, on the application side you could set aside a metadata key like "followed_by" which tells you the address of the next array, and you could fill them up like rotating log files.
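
A sketch of that chaining, assuming z5py's dict-style attribute access (the attribute name, helper names, and naming scheme are all made up for illustration):

```python
import z5py  # assuming z5py's h5py-style attribute access

def tail_array(f, key):
    """Follow 'followed_by' links from `key` to the array currently
    being filled, like finding the active file among rotated logs."""
    ds = f[key]
    while True:
        try:
            key = ds.attrs['followed_by']
        except KeyError:
            return key, ds
        ds = f[key]

def roll_over(f, key, shape, chunks, dtype='float64'):
    """Create the next array in the chain once `key` is full and
    link it via the 'followed_by' attribute."""
    next_key = key + '.1'  # naming scheme is purely illustrative
    f.create_dataset(next_key, shape=shape, chunks=chunks, dtype=dtype)
    f[key].attrs['followed_by'] = next_key
    return next_key
```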

constantinpape (Owner) commented

> I forget, do the zarr/N5 specs support dataset reshaping? They'd handle the edge chunks quite differently.

I think there are some discussions going on in zarr, but there is no canonical way to do it yet.

> Otherwise, on the application side you could set aside a metadata key like "followed_by" which tells you the address of the next array, and you could fill them up like rotating log files.

Good point, that would also work.
