Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

binaryData: Supplied checksum does not match checksum calculated from serialized slice #22598

Closed
jagill opened this issue Apr 24, 2024 · 1 comment
Labels

Comments

@jagill
Copy link
Contributor

jagill commented Apr 24, 2024

There is a discrepancy between the checksum supplied in the serialized Presto Slice returned by binaryData (introduced in #20932). According to the documentation (https://prestodb.io/docs/current/develop/serialized-page.html), bytes 13..21 (exclusive end index) should be a little-endian encoded int64 that encodes the checksum, calculated from after the header (ie, starting byte offset 21) to the end.

The declared checksum (1893016217) does not match the calculated checksum, or any variant I can find. I've checked the checksums returned by this library with an online tool (https://crccalc.com/). The library is also widely used and thus unlikely to have a significant bug. I've checked with the responses to several different queries.

There's an immediate ambiguity: The stated size (after the header) of the slice (46) does not always equal the size of the slice; in particular for small slices it seems to be padded to 1000 bytes with 0s. I am assuming the checksum is only applied to the size bytes after the header, where size is the payload size from bytes 9..13 (little-endian int32), but it does not equal this, nor when applied to bytes 21..1000 or 0..1000 or 0..(21+46).

Note that all the examples I've come across also parse fine (up to the issue in #22601 ), so I do not think the data is corrupted.

cc @arhimondr @mbasmanova

Your Environment

  • Presto version used: 0.287-edge18.1
  • Storage (HDFS/S3/GCS..): None, literal query
  • Data source and connector used: None, literal query
  • Deployment (Cloud or On-prem): Meta internal

Expected Behavior

The provided checksum (int64 little-endian in bytes 13..21) should match the crc32 checksum of the size bytes after the header 21..(21 + size).

Current Behavior

The provided checksum does not match any checksum calculation I can find.

Steps to Reproduce

Gist https://gist.github.com/jagill/666925f2833723ca567e1aa487500872 shows how to reproduce it, plus all the various checksums calculated above.

Context

This issue means I cannot verify that the supplied binary data is correct, which could lead to panics in production code and/or silent data corruption.

@jagill jagill added the bug label Apr 24, 2024
@jagill
Copy link
Contributor Author

jagill commented May 1, 2024

This bug report is incorrect. The documentation states that the checksum is not just for the bytes after the header, but also has contributions from the codec, the number of rows, and the uncompressed size. Using these (and truncating the bytes after the header by the size header field) gives me the correct checksum.

@jagill jagill closed this as completed May 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Done
Development

No branches or pull requests

1 participant