Group by upload: use repartition to increase parallelism #601
base: main
Conversation
@@ -190,6 +190,8 @@ abstract class JoinBase(joinConf: api.Join,
    // all lazy vals - so evaluated only when needed by each case.
    lazy val partitionRangeGroupBy = genGroupBy(unfilledRange)

    println(s"debug count ${partitionRangeGroupBy.inputDf.count()}")
nit: remove this?
ah good catch
// shuffle point: the input rdd has fewer partitions because its rows are compact
// when rows are converted to chronon rows, their size increases
// so we repartition to reduce memory overhead and improve performance
val keyedInputRddRepartitioned = if (inputPartition < (parallelism / 10)) {
do we need to make this 10 configurable?
yeah we can make it configurable
We should not make this the default behavior
// so we repartition to reduce memory overhead and improve performance
val keyedInputRddRepartitioned = if (inputPartition < (parallelism / 10)) {
  keyedInputRdd
    .repartition(parallelism)
I think this needs to be configurable (opt-in) before merging - we are going to add a shuffle step to ALL the upload jobs.
By default it should be off.
Sounds good. Let me make it configurable.
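The agreed-upon change can be sketched as a pure decision helper: the hard-coded divisor of 10 from the diff becomes a `factor` parameter, and an `enabled` flag makes the whole behavior opt-in as the reviewers requested. The names `shouldRepartition`, `enabled`, and `factor` are illustrative, not Chronon's actual config keys.

```scala
object RepartitionDecision {
  // Decide whether the keyed input RDD should be repartitioned before the
  // row conversion step. Opt-in: returns false unless explicitly enabled.
  // `factor` replaces the hard-coded 10 in the diff above: we repartition
  // only when the input has far fewer partitions than the target parallelism.
  def shouldRepartition(enabled: Boolean,
                        inputPartitions: Int,
                        parallelism: Int,
                        factor: Int = 10): Boolean =
    enabled && factor > 0 && inputPartitions < parallelism / factor

  def main(args: Array[String]): Unit = {
    // 50 compact input partitions vs. a default parallelism of 1000:
    // repartition only when the flag is on.
    println(shouldRepartition(enabled = true, inputPartitions = 50, parallelism = 1000))  // true
    println(shouldRepartition(enabled = false, inputPartitions = 50, parallelism = 1000)) // false
    // Input already reasonably partitioned: skip the extra shuffle.
    println(shouldRepartition(enabled = true, inputPartitions = 200, parallelism = 1000)) // false
  }
}
```

The caller would then wrap the existing `keyedInputRdd.repartition(parallelism)` in this check, so jobs that do not set the flag pay no extra shuffle cost.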
Summary
The group by upload input RDD has fewer partitions because its rows are compact. This can lead to executor OOMs when the rows are converted to Chronon rows, which are larger.
Repartitioning to the default parallelism improves scalability.
Tested with the Relevance team's upload job: running time dropped from 40+ minutes to under 15 minutes.
The downside is that repartition triggers a shuffle.
Why / Goal
Improve performance.
Test Plan
Checklist
Reviewers
@nikhilsimha @hzding621