Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not try to uncompress pages that are not compressed #103

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

papanikge
Copy link

[This is a sync of a fix already done and tested in a fork]

Relates to #102

Context

Panther is running fraugster/parguet-go in production for some months now ingesting TBs of data.

Some customers reported that they got the following error:

snappy: corrupt input

After some investigation we can see that in the read function of the the V2 DataPage, the flag (already present in DataPageHeaderV2) was not checked.

More context: Parquet files - when compressed - are so in the page layer. Parquet supports compression per page, (as shown from the DataPageHeaderV2 IsCompressed field, which comes directly from the thrift definition). The library detects the compression type (called CompressionCodec) and passes that down to the newBlockReader level. However it still needs to check if that specific page is indeed compressed, and that was missing.

Checks

FWIW, I doubled check this with parquet-go/parquet-goparquet-go/parquet-go and confirmed that they don't try to decompress that.

  • Unit tests added
  • Full test run (and screenshot)
  • Run all unit tests with the race detector on
  • Run the linters locally via golangci-lint run

Ran all the tests with https://github.com/apache/parquet-testing

image

[Note: I can try adding a file into https://github.com/apache/parquet-testing before trying merging this into upstream]

Ran all unit tests with the race detector on

image

Added a unit tests
... that passes with false, but breaks if I do IsCompressed => true because the input is not snappy

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant