Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add bzip2 support #21

Open
robvadai opened this issue Mar 29, 2017 · 0 comments · May be fixed by #24
Open

Add bzip2 support #21

robvadai opened this issue Mar 29, 2017 · 0 comments · May be fixed by #24

Comments

@robvadai
Copy link

robvadai commented Mar 29, 2017

This is because Hadoop/Spark systems can not distribute a job when data is loaded from GZip files. GZip is not a 'splittable' format. So for example in Spark, after loading a GZip file one has to repartition the RDD to split it line-by-line.
This is done automatically using the bzip2 format.

S3 is a common data source for Hadoop/Spark jobs (straightforward use case with AWS EMR) so having bzip2 support would be essential. Other data ingestion tools like Apache Flume supports bzip2 compression.

@robvadai robvadai linked a pull request Apr 3, 2017 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant