Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor to use transcodingstreams.jl API ? #135

Open
bjarthur opened this issue Feb 9, 2024 · 2 comments
Open

refactor to use transcodingstreams.jl API ? #135

bjarthur opened this issue Feb 9, 2024 · 2 comments

Comments

@bjarthur
Copy link
Member

bjarthur commented Feb 9, 2024

i see in src/compressors.jl that support for zlib exists. have you considered refactoring zarr.jl to access this compressor via the transcodingstream API instead? then you could get lz4, zstd, etc. too for free.

i ask, because blosc1.jl is not thread safe (in my hands anyway), and i'd like to read from multiple zarr chunks simultaneously. codecZstd.jl is thread safe i believe, and i'd like to switch my zarr data set to use it. i'm not sure about codecZlib.jl.

note that currently transcodingstreams.jl does not support blosc, however there is an issue suggesting it do so, with code in a link: JuliaIO/Blosc.jl#79. so zarr.jl wouldn't have to drop support for blosc1 were that issue followed through.

i actually attempted to add zstd myself to zarr.jl this morning through a package extension directly using codecZstd. first started refactoring the existing blosc and zlib code to use extensions, but then realized that structs defined in extension modules cannot be exported. so the user would not have access to e.g. BloscCompressor. but i think a refactoring to use transcodingstreams.jl wouldn't even need an extension. i'm filing this issue because i want to get a zarr.jl dev's opinion before proceeding further.

@nhz2
Copy link
Member

nhz2 commented Feb 9, 2024

The underlying blosc c library can be used in a thread safe way using the
blosc_compress_ctx and blosc_decompress_ctx functions.
That's what I am using in my experimental zarr implementation.
https://github.com/medyan-dev/SmallZarrGroups.jl/blob/b68316868caa543c3170d64d9828a072989acd8f/src/compression.jl#L19

@meggart
Copy link
Collaborator

meggart commented Feb 23, 2024

see in src/compressors.jl that support for zlib exists. have you considered refactoring zarr.jl to access this compressor via the transcodingstream API instead? then you could get lz4, zstd, etc. too for free.

I have considered this in the very beginning, but as you mention the main road blocker for this was the missing blosc support and blosc was what was used by almost all zarr dataset I was working with. A re-factoring towards transcodingstream api would be great and could lead to speedups as we could already start decompressing while the data is downloaded. This should be made easier now that we rely on Channels to pass data down the chain from storage -> decompression -> dump into output.

Regarding using blosc1 context api: I have experimented with this myself but never got this working in multiple threads. For some reason I was still running into frequent segmentation faults despite being sure to only use ctx versions of the c api. Probably I was doing something awfully wrong and it would be interesting to look at the implementation in SmallZarrGroups.jl

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants