
Blosc Compression-Performance Tips #231

Open
dwerner95 opened this issue Feb 7, 2023 · 3 comments
Comments

@dwerner95
Hey All,

I've been playing around with the Blosc compression implemented in 0.8.0, and I have some questions about its performance.

For my 400 MB CSV file that I convert to an HDF5 file, I see compression of ~84%, which is amazing; however, the throughput doesn't seem to be affected at all.
Looking at this graph of the official HDF5 website:
[graph from the HDF5 website: performance vs. size]

I should see an enormous boost in throughput.
Looking at my CPU usage, the program uses only a single thread, despite me setting the number of Blosc threads with blosc_set_nthreads.
Additionally, blosc_get_nthreads returns only a single thread, which makes me wonder whether there is an additional flag that needs to be set.

Overall, I wish there were some kind of performance guide on this topic. Is that something that could be included in the documentation?

Best wishes,
Dominik

@mulimoen
Collaborator

mulimoen commented Feb 8, 2023

Performance tuning is difficult, and what works for one dataset might not work for another. Is your data similar to the data used for the graph?

I'll have a look at the blosc bug when I am back at a computer.

@dwerner95
Author

All my datasets have an identical structure. Each HDF5 file contains at least three datasets: one 1D ndarray and two 2D ndarrays, plus additional 1D ndarrays; all of them have the same size in the first dimension. I'm not entirely sure what data is plotted in the image, but I would imagine arrays are the easiest to compress (?).

I found an issue in the h5py GitHub repository about this. It seems that even if the number of threads is set, the program falls back to serial compression if the chunk size is not sufficiently large. However, even if I set the chunk size to the size of the array, I still don't see any improvement.

@mulimoen
Collaborator

mulimoen commented May 29, 2024

I can trace it back to https://github.com/Blosc/c-blosc/blob/d306135aaf378ade04cd4d149058c29036335758/blosc/blosc.c#L913. One can force a block size by calling e.g. blosc_sys::blosc_set_blocksize(256), which enables parallel compression. (No idea if such a small block size makes sense; it should likely be much, much larger.)
