Improvements to floating point compression following examples of HDF5 and S-102 data #31
So far, the results are more a puzzle than a benefit. It turns out that splitting the bytes into separate groups was not as effective with the S-102 data as simply taking the data in big-endian order. Also, when processing the original ETOPO1, GEBCO, and SRTM data sets, the experimental Deflate compressor is almost never selected as the most effective; but when processing the S-102 sample data, it was always the most effective. I suspect the reason for this unexpected behavior is that the S-102 samples I have used so far are all taken from near-shore areas and thus include a mix of no-data values (land) and valid values (water). Most of the data sets I used during development either did not include no-data values or included only a small number of them.
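For reference, the byte-splitting experiment described above can be sketched in a few lines of Python. The sample values are hypothetical (a smoothly varying series, as in open-ocean bathymetry), and zlib's DEFLATE stands in for both the HDF5 codec and the experimental compressor; the real Gridfour code differs in detail.

```python
import struct
import zlib

# Hypothetical sample grid: smoothly varying values, as in open-ocean data.
values = [1000.0 + 0.25 * i for i in range(4096)]

# Layout 1: big-endian IEEE-754 bytes, as an S-102/HDF5 product stores them.
big_endian = b"".join(struct.pack(">f", v) for v in values)

# Layout 2: byte-plane split -- byte 0 of every value, then byte 1, and so on,
# so that bytes with similar statistical behavior are grouped together.
planes = b"".join(big_endian[k::4] for k in range(4))

# Compare how well DEFLATE does on each layout.
print("big-endian:", len(zlib.compress(big_endian, 9)))
print("byte-plane:", len(zlib.compress(planes, 9)))
```

Which layout wins depends on the data: grouping the nearly constant high-order bytes into long runs often helps on smooth data, but, as noted above, it did not pay off on the S-102 samples.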
After quite a bit of investigation, I've identified a couple of issues. First off, the data compression statistics I reported in the original post were incorrect for the original S-102 bathymetry (HDF5 format) files. My calculations for the size of the compressed data were wrong, making HDF5 compression look better than it actually was. It turns out that the Gridfour floating-point compression was somewhat better than the HDF5 results.

One thing worth noting is that the S-102 standard restricts the author of a data product to a single HDF5 compression method: plain DEFLATE applied to four-byte floating-point values (IEEE-754 format) given in big-endian order. HDF5 supports other data compression schemes, and some of those might have been more effective had they been authorized for use in S-102 products.

I have argued that by splitting the bits out into separate sections based on their meaning, we could obtain better compression. To do so, we separate the sign bit, the exponent (8 bits), and the mantissa (23 bits) into separate sections. Each section has distinct statistical behavior. My argument was that because the high-order byte combines the sign bit with part of the exponent, and the next byte combines the low-order bit of the exponent with the high-order bits of the mantissa, the byte-oriented layout conflates the statistical properties of the groups and weakens the compressibility of the data.

Anyway, I've experimented with a few different approaches to the Gridfour compression that improve the compression ratios further. They start by treating the sign bit and exponent part of the floating-point format the same way as the existing floating-point compressor (see Lossless Compression for Floating Point Data). However, they take different approaches to treating the mantissas.
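The field separation described above can be illustrated with a small Python sketch. The function name and the list-based streams are my own for illustration; Gridfour's actual encoding of each section differs.

```python
import struct

def split_ieee754(values):
    """Split 32-bit floats into sign, exponent, and mantissa streams.

    Illustrative only: each IEEE-754 binary32 value is 1 sign bit,
    8 exponent bits, and 23 mantissa bits, taken here as integers.
    """
    signs, exponents, mantissas = [], [], []
    for v in values:
        # Reinterpret the float's bit pattern as an unsigned 32-bit integer.
        bits = struct.unpack(">I", struct.pack(">f", v))[0]
        signs.append(bits >> 31)               # 1 bit
        exponents.append((bits >> 23) & 0xFF)  # 8 bits
        mantissas.append(bits & 0x7FFFFF)      # 23 bits
    return signs, exponents, mantissas

signs, exps, mants = split_ieee754([1.0, -2.5, 20.0])
```

Each of the three streams can then be compressed with a method suited to its own statistics, rather than compressing the interleaved bytes.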
The alternate versions tended to work better than the standard Gridfour floating-point compressor in cases where the bathymetry data for the S-102 files was collected over shallow water. In those cases, the data values tended to fall in the range 0 to 20 meters. Gridfour's floating-point compressor was originally developed using global oceanographic data sets (GEBCO 2019) that feature larger-magnitude values (typically greater than 1000 meters). Clearly, the two kinds of data products differ in vertical scale. I hope to look at how this difference relates to the compressibility of the mantissa elements in the future. Finally, I've attached a picture showing the layout of a typical S-102 data set. The orange pixels represent "no-data" areas (land); the grayscale pixels represent bathymetry. I hope this helps illustrate some of the ideas involved in the review of the S-102 data.
The GVRS compression implementation for floating-point data usually does better than the standard compression supported by HDF5. Recently, I was working with some S-102 format bathymetry products that did not compress as well when I transcribed their data to GVRS (HDF5 is the underlying format used for data in an S-102 product).
Based on preliminary inspection, I believe that HDF5 compressed better than GVRS for the particular data sets I examined because the bathymetry products contained a large number of "discontinuities" between adjacent cells in their raster fields. The existing GVRS compressor assumes that neighboring points tend to have values that are close together, and it is less effective when this assumption does not apply. The S-102 bathymetry products used the value 1000000 to indicate "no data" or "land". So, when the data transitioned from water to land, there would be a sudden jump in data value. This configuration was not consistent with the expectations of the GVRS compressor. Consequently, the output from the GVRS compressor tended to be about 15 percent larger than the output from the HDF5 compressor.
My proposal is to extend the floating-point compressor currently implemented: