Add bzip2 support #21

robvadai · 2017-03-29T10:32:27Z

bzip2 support is needed because Hadoop/Spark systems cannot distribute a job when the data is loaded from GZip files: GZip is not a 'splittable' format. In Spark, for example, after loading a GZip file one has to repartition the RDD to distribute its lines across the cluster.
With the bzip2 format, the input is split automatically.
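
To illustrate the splittability point, here is a minimal sketch using only the Python standard library (not Spark or Hadoop code). As far as I understand it, a bzip2 stream consists of independently compressed blocks, each starting with the 48-bit marker 0x314159265359, and that is what lets a reader start decoding in the middle of a file. The helper name `count_block_markers` and the sample data are made up for this example.

```python
import bz2

# 48-bit pattern that opens every bzip2 block (the BCD digits of pi).
# Blocks are bit-aligned, so the search is done on a bit string.
BLOCK_MAGIC = format(0x314159265359, "048b")


def count_block_markers(compressed: bytes) -> int:
    """Count bzip2 block-start markers in a compressed buffer."""
    bits = "".join(format(byte, "08b") for byte in compressed)
    return bits.count(BLOCK_MAGIC)


# Hypothetical sample data: several hundred KB of text lines.
raw = "".join(f"record {i}\n" for i in range(50_000)).encode()

# compresslevel=1 selects the smallest block size (about 100 KB of input),
# so this input is spread over several blocks.
compressed = bz2.compress(raw, compresslevel=1)

print("block markers found:", count_block_markers(compressed))
```

Each marker is a candidate split point for a distributed reader. GZip has no comparable per-block marker, which is why a GZip file ends up being read by a single task.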

S3 is a common data source for Hadoop/Spark jobs (a straightforward use case with AWS EMR), so bzip2 support would be essential. Other data ingestion tools, such as Apache Flume, support bzip2 compression.

robvadai linked a pull request Apr 3, 2017 that will close this issue

Added BZip2 support #24

Open
