
Improvements to floating point compression following examples of HDF5 and S-102 data #31

gwlucastrig opened this issue Apr 20, 2023

The GVRS compression implementation for floating-point data usually achieves better compression ratios than the standard compression format supported by HDF5. Recently, I was working with some S-102 format bathymetry products that did not compress as well when I transcribed their data to GVRS (HDF5 is the underlying format used for data in an S-102 product).

Based on preliminary inspection, I believe that HDF5 compressed better than GVRS for the particular data sets I examined because the bathymetry products contained a large number of "discontinuities" between adjacent cells in their raster fields. The existing GVRS compressor assumes that neighboring points tend to have values that are close together, and it is less effective when this assumption does not apply. The S-102 bathymetry products used the value 1000000 to indicate "no data" or "land", so when the data transitioned from water to land, there would be a sudden jump in data value. This configuration was not consistent with the expectations of the GVRS compressor. Consequently, the output from the GVRS compressor tended to be about 15 percent larger than the output from the HDF5 compressor.
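To illustrate the point, here is a minimal sketch (plain Java, not the Gridfour API; the values are invented for illustration) showing how a neighbor-difference predictor behaves at a water/land transition:

```java
public class DeltaJumpDemo {
    public static void main(String[] args) {
        float noData = 1000000f; // S-102 "no data"/"land" sentinel
        float[] row = {12.5f, 12.6f, 12.4f, noData, noData, 11.9f};
        // A predictor that assumes neighboring cells are close produces
        // tiny residuals over water, but huge ones at each transition.
        for (int i = 1; i < row.length; i++) {
            System.out.printf("delta[%d] = %.1f%n", i, row[i] - row[i - 1]);
        }
    }
}
```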

My proposal is to extend the floating-point compressor currently implemented:

  1. HDF5 splits the data into separate groups of bytes before compressing it with the Deflate compressor: all the high-order bytes from each floating-point value are grouped together, then the next-highest bytes, et cetera, until the low-order bytes are grouped together (see the sketch after this list).
  2. The GVRS compressor will be extended to try both its current method and the HDF5-style byte-grouping format; the smaller result will be used.
  3. Currently, the second byte of the compressed packing for floating-point values is a "reserved-for-future-use" value that is always set to zero. This value will be re-interpreted to indicate the data format used for compression. The new format will be indicated by the code value of 1.
  4. The GVRS data format document is going to need a significant update to its description of the floating-point compression format.
  5. The floating-point compressor will need to be extended somewhat to collect and report compression statistics for the alternate compressor. This approach will be similar to that used for the integer-based compressors.
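For reference, here is a hedged sketch of what the byte-grouping (shuffle) step in item 1 might look like, using only java.util.zip.Deflater; it is illustrative and is not the actual HDF5 or GVRS code:

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class ByteShuffleSketch {
    // Regroup bytes by significance: all high-order bytes first, then the
    // next-highest bytes, and so on down to the low-order bytes.
    static byte[] shuffle(float[] values) {
        int n = values.length;
        byte[] out = new byte[n * 4];
        for (int i = 0; i < n; i++) {
            int bits = Float.floatToRawIntBits(values[i]);
            out[i]         = (byte) (bits >>> 24); // high-order bytes
            out[n + i]     = (byte) (bits >>> 16);
            out[2 * n + i] = (byte) (bits >>> 8);
            out[3 * n + i] = (byte) bits;          // low-order bytes
        }
        return out;
    }

    // Standard java.util.zip Deflate pass over the shuffled bytes.
    static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (!deflater.finished()) {
            result.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return result.toByteArray();
    }

    public static void main(String[] args) {
        float[] tile = {11.2f, 11.3f, 11.1f, 1000000f, 1000000f, 11.9f};
        System.out.println("compressed size: "
                + deflate(shuffle(tile)).length + " bytes");
    }
}
```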
gwlucastrig commented

So far, the results are more a puzzle than a benefit. It turns out that splitting the bytes into separate groups was not as effective with the S-102 data as simply taking the data in big-endian order. Also, when processing the original ETOPO1, GEBCO, and SRTM data sets, the experimental Deflate compressor was almost never selected as the most effective. But when processing the S-102 sample data, it was always the most effective.

I suspect that the reason for the unexpected behavior is that the S-102 samples I have used so far are all taken from near-shore areas and, thus, include a mix of no-data values (land) and valid values (water). Most of the data sets I used during development either did not include no-data values or included only a small number of them.


gwlucastrig commented Dec 6, 2023

After quite a bit of investigation, I've identified a couple of issues. First off, the data compression statistics I reported in the original post were incorrect for the original S-102 bathymetry (HDF5 format) files. My calculation of the compressed data size was wrong, making HDF5 compression look better than it actually was. It turns out that the Gridfour floating-point compression was somewhat better than the HDF5 results.

One thing worth noting is that the S-102 standard restricts the data author's choice of HDF5 compression methods to simple DEFLATE applied to four-byte floating-point values (IEEE-754 format) given in big-endian order. HDF5 does support other data compression schemes, and some of those might have been more effective had they been authorized for use in S-102 products. I have argued that we could obtain better compression by splitting the bits into separate sections based on their meaning: the sign bit, the exponent (8 bits), and the mantissa (23 bits) each have distinct statistical behaviors. In the raw byte layout, the high-order byte combines the sign with part of the exponent, and the next byte combines the last exponent bit with the high-order bits of the mantissa. Mixing fields this way conflates their statistical properties and weakens the compressibility of the data.
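As a sketch of the field-splitting idea (illustrative only, not the Gridfour implementation), the three IEEE-754 fields can be extracted like this:

```java
public class FieldSplitSketch {
    public static void main(String[] args) {
        float sample = -37.25f;
        int bits = Float.floatToRawIntBits(sample);
        int sign     = bits >>> 31;           // 1 bit
        int exponent = (bits >>> 23) & 0xFF;  // 8 bits
        int mantissa = bits & 0x7FFFFF;       // 23 bits
        System.out.printf("sign=%d exponent=0x%02X mantissa=0x%06X%n",
                sign, exponent, mantissa);
        // Note: the raw high-order byte (bits >>> 24) mixes the sign with
        // seven exponent bits, and the next byte mixes the last exponent
        // bit with the top seven mantissa bits, conflating fields that
        // have different statistical behavior.
    }
}
```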

Anyway, I've experimented with a few different approaches to the Gridfour compression that improve the compression ratios further. They start by treating the sign bit and exponent part of the floating-point format the same way as the existing floating-point compressor (see Lossless Compression for Floating Point Data). However, they take different approaches to treating the mantissas (sketched after the list below):

  1. Use a method similar to the Gridfour integer predictor, storing the differences between subsequent mantissa values in the Gridfour M32 encoding and post-processing the results using DEFLATE.
  2. Perform simple DEFLATE on the mantissa bytes (23 bits per mantissa, given as 3 bytes in big-endian order).
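Here is a hedged sketch of the two approaches (simplified: approach 1 is shown as a plain predecessor difference rather than the full M32 predictor coding, and none of this is the actual Gridfour code):

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class MantissaSketch {

    // Approach 1 (simplified): differences between successive mantissas.
    // Gridfour's real implementation would encode these with its M32
    // variable-length integer format before the Deflate pass.
    static int[] mantissaDeltas(float[] values) {
        int[] deltas = new int[values.length];
        int prior = 0;
        for (int i = 0; i < values.length; i++) {
            int m = Float.floatToRawIntBits(values[i]) & 0x7FFFFF; // 23 bits
            deltas[i] = m - prior;
            prior = m;
        }
        return deltas;
    }

    // Approach 2: pack each 23-bit mantissa into 3 big-endian bytes.
    static byte[] packMantissas(float[] values) {
        byte[] out = new byte[values.length * 3];
        for (int i = 0; i < values.length; i++) {
            int m = Float.floatToRawIntBits(values[i]) & 0x7FFFFF;
            out[3 * i]     = (byte) (m >>> 16);
            out[3 * i + 1] = (byte) (m >>> 8);
            out[3 * i + 2] = (byte) m;
        }
        return out;
    }

    // Both approaches finish with a Deflate pass over the resulting bytes.
    static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (!deflater.finished()) {
            result.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return result.toByteArray();
    }

    public static void main(String[] args) {
        float[] tile = {5.25f, 5.26f, 5.24f, 5.30f};
        System.out.println("approach 2 size: "
                + deflate(packMantissas(tile)).length + " bytes");
    }
}
```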

The alternate versions tended to work better than the standard Gridfour floating-point compressor in cases where the bathymetry data for the S-102 files was collected over shallow water. In those cases, the data values tended to be in the range 0 to 20 meters. Gridfour's floating-point compressor was originally developed using global oceanographic data sets (GEBCO 2019) that feature larger-magnitude data values (typically greater than 1000 meters). Clearly, the two kinds of data products differ in vertical scale. I hope to look at how this difference relates to the compressibility of the mantissa elements in the future.

Finally, I've attached a picture showing the layout of a typical S-102 data set. The orange pixels represent the "no-data" area (land); the grayscale pixels represent bathymetry. I hope this helps illustrate some of the ideas involved in the review of the S-102 data.

[Image: 102US00_US4NJ1FF_view — sample S-102 data set with no-data (land) in orange and bathymetry in grayscale]
