The advantage of Parquet and Avro over JSON is that they are binary formats with a schema defined at the file level, so each record doesn't need to re-serialize all the field names. Their file sizes are smaller than the equivalent gzip-compressed NDJSON. (In my testing Parquet was also smaller than Avro, likely benefitting a lot from columnar compression.)
Avro natively supports compressing rows into DEFLATE or SNAPPY blocks. Parquet supports compressing with GZIP or SNAPPY. And BigQuery supports reading those compression types.
Avro is row-based, and thus better supports streaming mode, as each row can be streamed out to the output file one at a time. Compression is applied and flushed to the file once the write buffer / compression block is full.
Parquet is column-based, and thus does not support streaming mode at all; batching writes to output files has to be implemented in application code. But it does significantly beat Avro on compression, because the Parquet writer has full knowledge of all the data the file will contain, and because values within a column tend to resemble each other, so run-length compression over column values is much more effective than compression over full rows. The downside to Parquet is that you can't tell how many records you should put in each output file, so you end up holding a lot of data in memory. If you put too few records into each file, you end up with a lot of output files (e.g. with 1 million records, putting 1,000 records into each file results in 1,000 output files, each very small). Native filesystems don't like having lots of files in one directory, and file space usage is measured in blocks, each of which is generally 4 KiB. So if each output Parquet file is smaller than 4 KiB, your total disk usage is actually way higher than the number of bytes the files contain, because of internal block fragmentation.
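To put a number on the fragmentation point (the 4 KiB block size and the ~1 KiB file size below are illustrative assumptions, not measurements):

```python
BLOCK_SIZE = 4096  # 4 KiB, a common filesystem block size

def on_disk_bytes(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    # A non-empty file's on-disk footprint is its size rounded up
    # to the next whole block.
    return -(-file_size // block_size) * block_size

# 1,000 tiny Parquet files of ~1 KiB each:
logical = 1000 * 1024                    # bytes of actual data
physical = 1000 * on_disk_bytes(1024)    # bytes actually consumed on disk
print(logical, physical, physical / logical)
```

With these assumptions each 1 KiB file still occupies a full 4 KiB block, so disk usage is 4x the logical data size.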
The other downside to both formats is that even though they carry a schema in the file itself, so we could maybe use BigQuery's schema detection (https://cloud.google.com/bigquery/docs/schema-detect), I think we'd still need to specify the BigQuery schema for each: the schema inference probably won't be perfect, and we still want some validation that the BigQuery schema is exactly what we want. Parquet supports more types than Avro (like date and time).