
Compression in reduce side combine #686

Open
wants to merge 1 commit into master

Conversation


@lyogavin lyogavin (Contributor) commented Jul 8, 2013

One of the scalability problems we saw is that processing huge data requires a large number of reduce splits, which makes the memory overhead of the shuffle writers the bottleneck, as stated in #685. The more reduce splits we have, the more stream objects the shuffle writers hold open, and the memory overhead of the streams' internal buffers and file descriptors becomes the bottleneck. Also, as stated in https://spark-project.atlassian.net/browse/SPARK-751, a large number of small blocks makes performance much worse.

The essential reason we need to break the data into so many pieces is that, during the reduce-side combine, all the data for one partition must be put into a hash map, and this map is held in memory through the whole process. In this patch, we compress that hash map. The compression ratio for our production data can be around 30x, so enabling compression in the reduce-side combine significantly reduces the memory footprint and thus the number of reduce splits needed.
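
A rough sketch of the idea (an illustration, not the actual patch): each combiner is stored in the hash map as a compressed byte array and only decompressed when it has to be merged or iterated. `CompressedCombinerMap`, the Java serialization, and the `java.util.zip` codec are all illustrative assumptions.

```scala
// Hypothetical sketch (not the actual patch): keep each combiner in the reduce-side
// hash map as a compressed byte array instead of a live object graph, trading CPU
// for a much smaller memory footprint.
import java.io._
import java.util.zip.{DeflaterOutputStream, InflaterInputStream}
import scala.collection.mutable

class CompressedCombinerMap[K, C] {
  private val map = mutable.HashMap[K, Array[Byte]]()

  // Serialize and deflate a combiner; Java serialization is used only to keep the
  // sketch self-contained, the real code could use Spark's configured serializer/codec.
  private def compress(c: C): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(new DeflaterOutputStream(bytes))
    out.writeObject(c.asInstanceOf[AnyRef])
    out.close()
    bytes.toByteArray
  }

  private def decompress(bytes: Array[Byte]): C = {
    val in = new ObjectInputStream(new InflaterInputStream(new ByteArrayInputStream(bytes)))
    try in.readObject().asInstanceOf[C] finally in.close()
  }

  // Merge a new combiner into the compressed combiner stored for this key.
  def insert(key: K, value: C, mergeCombiners: (C, C) => C): Unit = {
    val updated = map.get(key) match {
      case Some(existing) => mergeCombiners(decompress(existing), value)
      case None           => value
    }
    map(key) = compress(updated)
  }

  // Decompress only when the combined results are finally iterated.
  def iterator: Iterator[(K, C)] =
    map.iterator.map { case (k, bytes) => (k, decompress(bytes)) }
}
```

Compressing on every merge as written here would cost far more than the small overhead measured below, so a real implementation would presumably amortize the compression; the sketch only illustrates where the memory saving comes from.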

We tested the overhead of compression:
540MB of data on a 4-node cluster with 1GB of RAM per node, running a simple groupBy followed by a map (s => s + " "). The job took 399s without compression and 414s with compression, i.e. roughly 3.8% overhead.

@AmplabJenkins commented

Thank you for your pull request. An admin will review this request soon.

@rxin rxin (Member) commented Oct 24, 2013

Hi Gavin,

Thanks - I think this will be useful for scenarios that use a lot of memory (i.e., where the reduce doesn't really reduce the amount of data).

I think this can be done entirely without changing any existing code. All we need to do is implement a new RDD transformation that performs this compressed reduce, and use that transformation instead of combineByKey.
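
One possible shape of that suggestion, as a minimal sketch: a standalone compressedGroupByKey built on top of partitionBy and mapPartitions, so the existing combineByKey path stays untouched. The method name, the Deflater-based serialization, and the use of the current org.apache.spark API are all assumptions made purely for illustration.

```scala
// Hypothetical sketch of the suggested approach: express the compressed reduce as a
// separate transformation instead of modifying combineByKey. Names are made up.
import java.io._
import java.util.zip.{DeflaterOutputStream, InflaterInputStream}
import scala.collection.mutable
import scala.reflect.ClassTag
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

object CompressedCombine {
  private def compress[T](value: T): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(new DeflaterOutputStream(bytes))
    out.writeObject(value.asInstanceOf[AnyRef])
    out.close()
    bytes.toByteArray
  }

  private def decompress[T](bytes: Array[Byte]): T = {
    val in = new ObjectInputStream(new InflaterInputStream(new ByteArrayInputStream(bytes)))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  // A groupByKey-like transformation that keeps each group compressed while the
  // per-partition hash map is being built, reusing the existing shuffle machinery.
  def compressedGroupByKey[K: ClassTag, V: ClassTag](
      rdd: RDD[(K, V)], numPartitions: Int): RDD[(K, Seq[V])] = {
    val shuffled = rdd.partitionBy(new HashPartitioner(numPartitions))
    shuffled.mapPartitions { iter =>
      val groups = mutable.HashMap[K, Array[Byte]]()
      iter.foreach { case (k, v) =>
        val current = groups.get(k).map(b => decompress[Seq[V]](b)).getOrElse(Seq.empty[V])
        groups(k) = compress(current :+ v)
      }
      // Decompress each group only when the downstream stage consumes it.
      groups.iterator.map { case (k, bytes) => (k, decompress[Seq[V]](bytes)) }
    }
  }
}
```

A caller could then opt in with CompressedCombine.compressedGroupByKey(pairs, numPartitions) only for memory-heavy jobs, leaving the default groupByKey/combineByKey behavior unchanged.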
