binaryData: Supplied checksum does not match checksum calculated from serialized slice #22598

jagill · 2024-04-24T15:10:07Z

There is a discrepancy between the checksum supplied in the serialized Presto Slice returned by binaryData (introduced in #20932). According to the documentation (https://prestodb.io/docs/current/develop/serialized-page.html), bytes 13..21 (exclusive end index) should be a little-endian encoded int64 that encodes the checksum, calculated from after the header (ie, starting byte offset 21) to the end.

The declared checksum (1893016217) does not match the calculated checksum, or any variant I can find. I've checked the checksums returned by this library with an online tool (https://crccalc.com/). The library is also widely used and thus unlikely to have a significant bug. I've checked with the responses to several different queries.

There's an immediate ambiguity: The stated size (after the header) of the slice (46) does not always equal the size of the slice; in particular for small slices it seems to be padded to 1000 bytes with 0s. I am assuming the checksum is only applied to the size bytes after the header, where size is the payload size from bytes 9..13 (little-endian int32), but it does not equal this, nor when applied to bytes 21..1000 or 0..1000 or 0..(21+46).

Note that all the examples I've come across also parse fine (up to the issue in #22601 ), so I do not think the data is corrupted.

cc @arhimondr @mbasmanova

Your Environment

Presto version used: 0.287-edge18.1
Storage (HDFS/S3/GCS..): None, literal query
Data source and connector used: None, literal query
Deployment (Cloud or On-prem): Meta internal

Expected Behavior

The provided checksum (int64 little-endian in bytes 13..21) should match the crc32 checksum of the size bytes after the header 21..(21 + size).

Current Behavior

The provided checksum does not match any checksum calculation I can find.

Steps to Reproduce

Gist https://gist.github.com/jagill/666925f2833723ca567e1aa487500872 shows how to reproduce it, plus all the various checksums calculated above.

Context

This issue means I cannot verify that the supplied binary data is correct, which could lead to panics in production code and/or silent data corruption.

The text was updated successfully, but these errors were encountered:

jagill · 2024-05-01T01:03:32Z

This bug report is incorrect. The documentation states that the checksum is not just for the bytes after the header, but also has contributions from the codec, the number of rows, and the uncompressed size. Using these (and truncating the bytes after the header by the size header field) gives me the correct checksum.

jagill added the bug label Apr 24, 2024

jagill closed this as completed May 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

binaryData: Supplied checksum does not match checksum calculated from serialized slice #22598

binaryData: Supplied checksum does not match checksum calculated from serialized slice #22598

jagill commented Apr 24, 2024 •

edited

jagill commented May 1, 2024

binaryData: Supplied checksum does not match checksum calculated from serialized slice #22598

binaryData: Supplied checksum does not match checksum calculated from serialized slice #22598

Comments

jagill commented Apr 24, 2024 • edited

Your Environment

Expected Behavior

Current Behavior

Steps to Reproduce

Context

jagill commented May 1, 2024

jagill commented Apr 24, 2024 •

edited