
Evaluate binary row/column formats Parquet and Avro #74

Open
theferrit32 opened this issue Dec 18, 2023 · 0 comments


The main advantage of Parquet and Avro over JSON is that they are binary formats with a schema defined at the file level, so each record doesn't need to re-serialize all of its field names. Their file sizes are smaller than the equivalent gzip-compressed NDJSON. (In my testing Parquet was also smaller than Avro, likely benefiting a lot from columnar compression.)
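
For reference, here's a minimal sketch of the kind of size comparison described above, written with `fastavro` and `pyarrow` (the library choices, field names, and output paths are just placeholders, not decisions):

```python
import gzip
import json
import os

import fastavro
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical records standing in for the real pipeline's rows.
records = [{"id": i, "label": f"item-{i}"} for i in range(100_000)]

# gzip-compressed NDJSON: field names are repeated in every record.
with gzip.open("records.ndjson.gz", "wt") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Avro: schema stored once at the file level, rows in DEFLATE blocks.
avro_schema = fastavro.parse_schema({
    "type": "record",
    "name": "Record",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "label", "type": "string"},
    ],
})
with open("records.avro", "wb") as f:
    fastavro.writer(f, avro_schema, records, codec="deflate")

# Parquet: columnar layout with SNAPPY-compressed column chunks.
pq.write_table(pa.Table.from_pylist(records), "records.parquet", compression="snappy")

for path in ("records.ndjson.gz", "records.avro", "records.parquet"):
    print(path, os.path.getsize(path), "bytes")
```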

Avro natively supports compressing rows into DEFLATE or SNAPPY blocks. Parquet supports GZIP or SNAPPY compression. BigQuery supports reading both formats with those compression types.

Avro is row-based, and thus better supports streaming mode, as each row can be streamed out to the output file one at a time. Compression is applied and flushed to the file once the write buffer / compression block is full.
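
A rough sketch of that streaming pattern, assuming `fastavro` and a made-up schema (the real schema would come from our data model): records come from a generator and are flushed to the file in compressed blocks as the writer's buffer fills.

```python
import fastavro

schema = fastavro.parse_schema({
    "type": "record",
    "name": "Record",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "label", "type": "string"},
    ],
})

def generate_records():
    # Stand-in for the real record stream; rows are yielded one at a time
    # instead of being materialized in memory.
    for i in range(1_000_000):
        yield {"id": i, "label": f"item-{i}"}

with open("records.avro", "wb") as out:
    # fastavro consumes the iterable lazily and flushes DEFLATE-compressed
    # blocks to the file as its write buffer fills.
    fastavro.writer(out, schema, generate_records(), codec="deflate")
```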

Parquet is column-based, and thus does not support streaming mode at all; batching writes to output files has to be implemented in application code. But it does significantly beat Avro's compression, because the Parquet writer has full knowledge of all the data the file will contain, and because values within a column tend to be similar to each other, so run-length compression over column values is much more effective than compression over full rows.

The downside to Parquet is that it's hard to tell how many records to put in each output file: hold too many and you keep a lot of data in memory; put too few into each file and you end up with a lot of output files (e.g. with 1 million records, putting 1000 records into each file results in 1000 output files, each very small). Native filesystems don't like having lots of files in one directory, and file space usage is measured in blocks, generally around 4 KiB each, so if each output Parquet file is smaller than 4 KiB, the total space used is actually much higher than the number of bytes the files contain, because of internal block fragmentation.
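
A rough sketch of the batching that would have to live in application code, using `pyarrow` (the library, batch size, and file naming are assumptions): rows are buffered up to a batch size and each full batch becomes its own Parquet file, which is exactly the memory-vs-file-count tradeoff described above.

```python
import pyarrow as pa
import pyarrow.parquet as pq

BATCH_SIZE = 100_000  # arbitrary; tuning this is the tradeoff described above

def write_parquet_batches(rows, out_prefix):
    """Buffer rows in memory and flush each full batch to its own Parquet file."""
    buffer, file_index = [], 0
    for row in rows:
        buffer.append(row)
        if len(buffer) >= BATCH_SIZE:
            table = pa.Table.from_pylist(buffer)
            pq.write_table(table, f"{out_prefix}-{file_index:05d}.parquet",
                           compression="snappy")
            buffer, file_index = [], file_index + 1
    if buffer:  # flush the final partial batch
        table = pa.Table.from_pylist(buffer)
        pq.write_table(table, f"{out_prefix}-{file_index:05d}.parquet",
                       compression="snappy")

# Hypothetical usage with made-up rows:
write_parquet_batches(
    ({"id": i, "label": f"item-{i}"} for i in range(1_000_000)),
    "records",
)
```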

The other downside to these formats is that even though they embed a schema in the file, so we could maybe use BigQuery's schema detection (https://cloud.google.com/bigquery/docs/schema-detect), I think we'd still need to specify the BigQuery schema for each one, because the schema inference probably won't be perfect, and we still want some validation that the BigQuery schema is exactly what we want. Parquet supports more types than Avro (like date and time).
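
On the loading side, a minimal sketch with the `google-cloud-bigquery` client, using a hypothetical table ID and GCS URI: the destination table is created with an explicit schema rather than relying on autodetect, so the load job acts as the validation step described above.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical project/dataset/table and GCS path.
table_id = "my-project.my_dataset.records"
source_uri = "gs://my-bucket/records-*.parquet"

# Create (or assert) the destination table with the schema we expect,
# instead of trusting inference from the files.
schema = [
    bigquery.SchemaField("id", "INTEGER", mode="REQUIRED"),
    bigquery.SchemaField("label", "STRING", mode="NULLABLE"),
]
client.create_table(bigquery.Table(table_id, schema=schema), exists_ok=True)

# Load the Parquet files; the job fails if their embedded schema is not
# compatible with the table schema defined above.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
load_job = client.load_table_from_uri(source_uri, table_id, job_config=job_config)
load_job.result()  # block until done; raises on schema mismatch or other errors
```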
