Data compression metrics for FMRI? #2
-
@satra at November 12, 2021, 2:02pm

I haven't seen any on neuroimaging data. There are some nice blog posts on compression comparisons (e.g., on genotype compression); indeed, a domain/data-specific evaluation would be great. I also did this recently for some microscopy data (in both Zarr and HDF5) and found that the optimal settings from those posts did not apply to the data I had at hand: in my use case, blosc+zstd did significantly better than blosc+lz4.

I would worry a little about read/write speeds, as those depend heavily on the storage backend (NFS, Lustre, RDMA over IB, S3, etc.). Of course, given a backend like a local NVMe disk, one could look at relative I/O rates for different types of compression, but those differences may not hold across backends. At this point it would be easy to run such a comparison on the data in OpenNeuro. Another thing to keep in mind is how much MATLAB support exists.
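A minimal sketch of that kind of codec comparison, using numcodecs with a synthetic array and an arbitrary compression level (none of these settings are taken from the microscopy benchmark mentioned above):

```python
import numpy as np
from numcodecs import Blosc

rng = np.random.default_rng(0)
# Synthetic stand-in for a smooth 16-bit volume; real data will behave differently.
vol = (100 * np.cumsum(rng.standard_normal((64, 64, 64)), axis=0)).astype(np.int16)
raw = vol.tobytes()

for cname in ("zstd", "lz4"):
    codec = Blosc(cname=cname, clevel=5, shuffle=Blosc.SHUFFLE)
    compressed = codec.encode(vol)
    print(f"blosc+{cname}: compression ratio {len(raw) / len(compressed):.2f}")
```

Swapping in real NIfTI or Zarr chunks and timing encode/decode would give the kind of read/write numbers discussed above.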
-
@effigies at November 12, 2021, 4:09pm

Another consideration is random access. We obviously have …
-
@neurolabusc at November 16, 2021, 2:55pm

For raw data, CT scans often store only 12 bits (Bits Stored, DICOM tag 0028,0101), while for MRI 16-bit is becoming increasingly common. For these datatypes, neighboring voxels show much less variability in the most significant byte than in the least significant byte. With the exception of Blosc, this redundancy is not exploited by the compression formats you describe. See my comments here. Therefore, swizzling the data can make a dramatic impact. While most scientists prefer scalar values, scanner manufacturers use RGB triplets for derived perfusion and diffusion metrics, and these would also benefit from Analyze-style planar storage (RRR...R GGG...G BBB...B) versus NIfTI triplets (RGBRGB...). While the original question was regarding MRI/PET, this issue also applies to indexed triangle meshes (e.g. GIfTI).

You may want to look at my pigz bench Python scripts, which let you compare all sorts of compressors for both compression speed/ratio and decompression speed, generating graphs of the Pareto frontier. By default they use the Silesia corpus, but you can specify any corpus you want; my earlier Perl script provides an MRI corpus.

indexed_gzip is very nice for 4D NIfTI data, and gzip is really ubiquitous. The classic zlib is not optimized for modern hardware, but both CloudFlare zlib and zlib-ng leverage modern instructions to double single-threaded performance. If you want to retain gzip but need really good performance, you should consider libdeflate for compression and either libdeflate or Intel's igzip for decompression; both demand a lot more RAM. The libdeflate API is simple but inflexible; the Intel API is flexible but alien. For Python users, mgzip provides parallel decompression (though only for gzip files it generates).
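To make the byte-swizzling point concrete, here is a small sketch (synthetic smooth int16 data and zlib level 6 are assumptions, not numbers from the pigz bench results) comparing DEFLATE compression of interleaved 16-bit voxels versus the same voxels split into a low-byte plane followed by a high-byte plane:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# Smooth synthetic "volume": neighboring voxels differ mostly in the low byte.
vol = (100 * np.cumsum(rng.standard_normal((64, 64, 64)), axis=-1)).astype(np.int16)

interleaved = vol.tobytes()                        # lo, hi, lo, hi, ... (little-endian)
planes = vol.view(np.uint8).reshape(-1, 2).T       # row 0: low bytes, row 1: high bytes
shuffled = np.ascontiguousarray(planes).tobytes()  # all low bytes, then all high bytes

print("interleaved:", len(interleaved) / len(zlib.compress(interleaved, 6)))
print("shuffled:   ", len(interleaved) / len(zlib.compress(shuffled, 6)))
```

This is essentially what Blosc's byte-shuffle filter does per block, and the same idea applies to planar versus interleaved RGB storage.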
-
November 12, 2021, 12:32pm
Has anyone done, or come across, metrics for different compression methods on standard FMRI / MRI / PET data?

I was thinking of comparisons of Deflate / zip, Zlib / gz, LZMA / LZMA2 / xz (see [1] below), Blosc, Bzip2 / bz2, szip (via libaec), LZ4, ZFP, and Zstandard.

It would also be very useful to get some metrics for read / write speed with these same methods.

I'm betting we could do better than .gz, but I'm wondering how much better, and how important it is that we allow flexibility in compression methods in a data access API.
[1] [Xz format inadequate for long-term archiving](http://lzip.nongnu.org/xz_inadequate.html)
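For what it's worth, a rough harness for this kind of comparison might look like the sketch below; the file name is a placeholder, only stdlib codecs (zlib, bz2, lzma) are shown, and Blosc, LZ4, ZFP, and Zstandard would each need their own Python bindings:

```python
import bz2
import lzma
import time
import zlib

import nibabel as nib
import numpy as np

img = nib.load("example_bold.nii")  # placeholder: any uncompressed NIfTI volume
raw = np.asanyarray(img.dataobj).tobytes()

codecs = {
    "zlib/gz": (lambda d: zlib.compress(d, 6), zlib.decompress),
    "bz2": (lambda d: bz2.compress(d, 9), bz2.decompress),
    "lzma/xz": (lambda d: lzma.compress(d, preset=6), lzma.decompress),
}

for name, (compress, decompress) in codecs.items():
    t0 = time.perf_counter()
    blob = compress(raw)
    t1 = time.perf_counter()
    decompress(blob)
    t2 = time.perf_counter()
    print(f"{name:8s} ratio {len(raw) / len(blob):6.2f}  "
          f"compress {t1 - t0:5.2f}s  decompress {t2 - t1:5.2f}s")
```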