
Compression in reduce side combine #686

Open
wants to merge 1 commit into master

Conversation


@lyogavin lyogavin (Contributor) commented Jul 8, 2013

One of the scalability problems we saw is that processing huge data requires a large number of reduce splits, which makes the memory overhead of the shuffle writers the bottleneck, as stated in #685. The more reduce splits we have, the more stream objects the shuffle writers hold open, and the memory overhead of the streams' internal buffers and file descriptors becomes the bottleneck. Also, as stated in https://spark-project.atlassian.net/browse/SPARK-751, a large number of small blocks makes performance much worse.

The essential reason we need to break the data into so many pieces is that, during the reduce-side combine, all the data for one partition must be put into a hash map, and this map is held in memory through the whole process. In this patch, we compress that hash map. The compression ratio for our production data can be around 30x, so enabling compression in the reduce-side combine significantly reduces the memory footprint and thus the number of reduce splits needed.
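
A rough sketch of the idea (an illustration, not the actual patch): each combiner is stored in the hash map as a compressed byte array and only decompressed when it has to be merged or iterated. `CompressedCombinerMap`, the Java serialization, and the `java.util.zip` codec are all illustrative assumptions.

```scala
// Hypothetical sketch (not the actual patch): keep each combiner in the reduce-side
// hash map as a compressed byte array instead of a live object graph, trading CPU
// for a much smaller memory footprint.
import java.io._
import java.util.zip.{DeflaterOutputStream, InflaterInputStream}
import scala.collection.mutable

class CompressedCombinerMap[K, C] {
  private val map = mutable.HashMap[K, Array[Byte]]()

  // Serialize and deflate a combiner; Java serialization is used only to keep the
  // sketch self-contained, the real code could use Spark's configured serializer/codec.
  private def compress(c: C): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(new DeflaterOutputStream(bytes))
    out.writeObject(c.asInstanceOf[AnyRef])
    out.close()
    bytes.toByteArray
  }

  private def decompress(bytes: Array[Byte]): C = {
    val in = new ObjectInputStream(new InflaterInputStream(new ByteArrayInputStream(bytes)))
    try in.readObject().asInstanceOf[C] finally in.close()
  }

  // Merge a new combiner into the compressed combiner stored for this key.
  def insert(key: K, value: C, mergeCombiners: (C, C) => C): Unit = {
    val updated = map.get(key) match {
      case Some(existing) => mergeCombiners(decompress(existing), value)
      case None           => value
    }
    map(key) = compress(updated)
  }

  // Decompress only when the combined results are finally iterated.
  def iterator: Iterator[(K, C)] =
    map.iterator.map { case (k, bytes) => (k, decompress(bytes)) }
}
```

Compressing on every merge as written here would cost far more than the small overhead measured below, so a real implementation would presumably amortize the compression; the sketch only illustrates where the memory saving comes from.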

We tested the overhead of compression:
540MB of data on a 4-node cluster with 1GB of RAM per node, running a simple groupBy followed by a map (s => s + " "). The job took 399s without compression and 414s with compression, i.e. roughly 3.8% overhead.

@AmplabJenkins commented

Thank you for your pull request. An admin will review this request soon.

@rxin rxin (Member) commented Oct 24, 2013

Hi Gavin,

Thanks - I think this will be useful for scenarios that use a lot of memory (i.e., where the reduce doesn't really reduce the amount of data).

I think this can be done entirely without changing any existing code. All we need to do is implement a new RDD transformation that performs this compressed reduce, and use that transformation instead of combineByKey.
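
One possible shape of that suggestion, as a minimal sketch: a standalone compressedGroupByKey built on top of partitionBy and mapPartitions, so the existing combineByKey path stays untouched. The method name, the Deflater-based serialization, and the use of the current org.apache.spark API are all assumptions made purely for illustration.

```scala
// Hypothetical sketch of the suggested approach: express the compressed reduce as a
// separate transformation instead of modifying combineByKey. Names are made up.
import java.io._
import java.util.zip.{DeflaterOutputStream, InflaterInputStream}
import scala.collection.mutable
import scala.reflect.ClassTag
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

object CompressedCombine {
  private def compress[T](value: T): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(new DeflaterOutputStream(bytes))
    out.writeObject(value.asInstanceOf[AnyRef])
    out.close()
    bytes.toByteArray
  }

  private def decompress[T](bytes: Array[Byte]): T = {
    val in = new ObjectInputStream(new InflaterInputStream(new ByteArrayInputStream(bytes)))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  // A groupByKey-like transformation that keeps each group compressed while the
  // per-partition hash map is being built, reusing the existing shuffle machinery.
  def compressedGroupByKey[K: ClassTag, V: ClassTag](
      rdd: RDD[(K, V)], numPartitions: Int): RDD[(K, Seq[V])] = {
    val shuffled = rdd.partitionBy(new HashPartitioner(numPartitions))
    shuffled.mapPartitions { iter =>
      val groups = mutable.HashMap[K, Array[Byte]]()
      iter.foreach { case (k, v) =>
        val current = groups.get(k).map(b => decompress[Seq[V]](b)).getOrElse(Seq.empty[V])
        groups(k) = compress(current :+ v)
      }
      // Decompress each group only when the downstream stage consumes it.
      groups.iterator.map { case (k, bytes) => (k, decompress[Seq[V]](bytes)) }
    }
  }
}
```

A caller could then opt in with CompressedCombine.compressedGroupByKey(pairs, numPartitions) only for memory-heavy jobs, leaving the default groupByKey/combineByKey behavior unchanged.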
