Data compression metrics for FMRI? #2
-
@satra at November 12, 2021, 2:02pm

I haven't seen any on neuroimaging data. There are some nice blog posts on compression comparisons (e.g., on genotype compression); indeed, a domain/data-specific evaluation would be great. I also did this recently for some microscopy data (in both Zarr and HDF5) and found that the optimal settings from those posts did not apply to the data I had at hand: in my use case, blosc+zstd did significantly better than blosc+lz4.

I would worry a little about read/write speeds, as those depend heavily on the storage backend (NFS, Lustre, RDMA over IB, S3, etc.). Of course, given a backend like a local NVMe disk, one could look at relative I/O rates for different types of compression, but those differences may not hold across backends. At this point it would be easy to run such a comparison on the data in OpenNeuro. Another thing to keep in mind is how much MATLAB support exists.
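A minimal sketch of that kind of codec comparison, using numcodecs with a synthetic array and an arbitrary compression level (none of these settings are taken from the microscopy benchmark mentioned above):

```python
import numpy as np
from numcodecs import Blosc

rng = np.random.default_rng(0)
# Synthetic stand-in for a smooth 16-bit volume; real data will behave differently.
vol = (100 * np.cumsum(rng.standard_normal((64, 64, 64)), axis=0)).astype(np.int16)
raw = vol.tobytes()

for cname in ("zstd", "lz4"):
    codec = Blosc(cname=cname, clevel=5, shuffle=Blosc.SHUFFLE)
    compressed = codec.encode(vol)
    print(f"blosc+{cname}: compression ratio {len(raw) / len(compressed):.2f}")
```

Swapping in real NIfTI or Zarr chunks and timing encode/decode would give the kind of read/write numbers discussed above.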
-
@effigies at November 12, 2021, 4:09pm

Another consideration is random access. We obviously have …
-
@neurolabusc at November 16, 2021, 2:55pm

For raw data, CT scans often store only 12 bits (Bits Stored, DICOM tag 0028,0101), while for MRI 16-bit is becoming increasingly common. For these datatypes, neighboring voxels show much less variability in the most significant byte than in the least significant byte. With the exception of Blosc, this redundancy is not exploited by the compression formats you describe. See my comments here. Therefore, swizzling the data can make a dramatic impact. While most scientists prefer scalar values, scanner manufacturers use RGB triplets for derived perfusion and diffusion metrics, and these would also benefit from Analyze-style planar storage (RRR...R GGG...G BBB...B) versus NIfTI triplets (RGBRGB...). While the original question was regarding MRI/PET, this issue also applies to indexed triangle meshes (e.g. GIfTI).

You may want to look at my pigz bench Python scripts, which let you compare all sorts of compressors for both compression speed/ratio and decompression speed, generating graphs of the Pareto frontier. By default they use the Silesia corpus, but you can specify any corpus you want; my earlier Perl script provides an MRI corpus.

indexed_gzip is very nice for 4D NIfTI data, and gzip is really ubiquitous. The classic zlib is not optimized for modern hardware, but both CloudFlare zlib and zlib-ng leverage modern instructions to double single-threaded performance. If you want to retain gzip but need really good performance, you should consider libdeflate for compression and either libdeflate or Intel's igzip for decompression; both demand a lot more RAM. The libdeflate API is simple but inflexible; the Intel API is flexible but alien. For Python users, mgzip provides parallel decompression (though only for gzip files it generates).
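To make the byte-swizzling point concrete, here is a small sketch (synthetic smooth int16 data and zlib level 6 are assumptions, not numbers from the pigz bench results) comparing DEFLATE compression of interleaved 16-bit voxels versus the same voxels split into a low-byte plane followed by a high-byte plane:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# Smooth synthetic "volume": neighboring voxels differ mostly in the low byte.
vol = (100 * np.cumsum(rng.standard_normal((64, 64, 64)), axis=-1)).astype(np.int16)

interleaved = vol.tobytes()                        # lo, hi, lo, hi, ... (little-endian)
planes = vol.view(np.uint8).reshape(-1, 2).T       # row 0: low bytes, row 1: high bytes
shuffled = np.ascontiguousarray(planes).tobytes()  # all low bytes, then all high bytes

print("interleaved:", len(interleaved) / len(zlib.compress(interleaved, 6)))
print("shuffled:   ", len(interleaved) / len(zlib.compress(shuffled, 6)))
```

This is essentially what Blosc's byte-shuffle filter does per block, and the same idea applies to planar versus interleaved RGB storage.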
-
November 12, 2021, 12:32pm
Has anyone done, or come across, metrics for different compression methods on standard FMRI / MRI / PET data?

I was thinking of comparisons of Deflate / zip, Zlib / gz, LZMA / LZMA2 / xz (see [1] below), Blosc, Bzip2 / bz2, szip (via libaec), LZ4, ZFP, and Zstandard.

It would also be very useful to get some metrics for read / write speed with these same methods.

I'm betting we could do better than .gz, but I'm wondering how much better, and how important it is that we allow flexibility in compression methods in a data access API.
[1] [Xz format inadequate for long-term archiving](http://lzip.nongnu.org/xz_inadequate.html)
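For what it's worth, a rough harness for this kind of comparison might look like the sketch below; the file name is a placeholder, only stdlib codecs (zlib, bz2, lzma) are shown, and Blosc, LZ4, ZFP, and Zstandard would each need their own Python bindings:

```python
import bz2
import lzma
import time
import zlib

import nibabel as nib
import numpy as np

img = nib.load("example_bold.nii")  # placeholder: any uncompressed NIfTI volume
raw = np.asanyarray(img.dataobj).tobytes()

codecs = {
    "zlib/gz": (lambda d: zlib.compress(d, 6), zlib.decompress),
    "bz2": (lambda d: bz2.compress(d, 9), bz2.decompress),
    "lzma/xz": (lambda d: lzma.compress(d, preset=6), lzma.decompress),
}

for name, (compress, decompress) in codecs.items():
    t0 = time.perf_counter()
    blob = compress(raw)
    t1 = time.perf_counter()
    decompress(blob)
    t2 = time.perf_counter()
    print(f"{name:8s} ratio {len(raw) / len(blob):6.2f}  "
          f"compress {t1 - t0:5.2f}s  decompress {t2 - t1:5.2f}s")
```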