Improve the scalability of the join between the LHS and GroupBys by breaking up the join #621

mears-stripe · 2023-11-22T01:53:21Z

Summary

Improve the scalability of the join between the LHS and GroupBys by breaking up the join. Previously, when joining together a large number of GroupBys, the Spark job could get stuck.

Why / Goal

Prevent the Spark job from getting stuck when joining the LHS with a large number of GroupBys.

Test Plan

Checklist

N.A.

Reviewers

nikhilsimha · 2023-11-22T07:02:40Z

spark/src/main/scala/ai/chronon/spark/Join.scala

@@ -67,6 +67,7 @@ class Join(joinConf: api.Join,
    extends JoinBase(joinConf, endPartition, tableUtils, skipFirstHole, mutationScan, showDf) {

  private val bootstrapTable = joinConf.metaData.bootstrapTable
+  private val joinsAtATime = 8


Can we make this configurable through a Spark conf param, read via tableUtils?
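A minimal sketch of what a configurable value could look like, assuming the setting is read as a string with a fallback to the `8` hard-coded in this diff. The key name `spark.chronon.join.joins_at_a_time` and the `JoinConf` object are hypothetical and not part of the PR. The `lookup` function stands in for a conf getter such as `sparkSession.conf.getOption`.

```scala
object JoinConf {
  // Hypothetical key name; the PR does not specify one.
  val JoinsAtATimeKey = "spark.chronon.join.joins_at_a_time"

  // The value hard-coded in this diff, used as the fallback.
  val DefaultJoinsAtATime = 8

  // `lookup` stands in for a conf getter (for example, sparkSession.conf.getOption).
  // Missing, non-numeric, or non-positive values fall back to the default.
  def joinsAtATime(lookup: String => Option[String]): Int =
    lookup(JoinsAtATimeKey)
      .flatMap(v => scala.util.Try(v.trim.toInt).toOption)
      .filter(_ > 0)
      .getOrElse(DefaultJoinsAtATime)
}
```

With this shape, `Join` would call something like `JoinConf.joinsAtATime(tableUtils.sparkSession.conf.getOption)` instead of declaring a constant. Whether the value lives on `TableUtils` itself is a design choice left to the PR.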

nikhilsimha · 2023-11-22T07:04:46Z

spark/src/main/scala/ai/chronon/spark/TableUtils.scala

@@ -324,6 +324,9 @@ case class TableUtils(sparkSession: SparkSession) {
    df
  }

+  def addJoinBreak(dataFrame: DataFrame): DataFrame =
+    dataFrame.cache()


TableUtils has a cache_level param and a wrap-with-cache method that handles exceptions and releases the resources claimed by the cache. I think we should use that here.
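The pattern described in this comment can be sketched roughly as follows. This is not the actual `TableUtils` method. `Cacheable` and `CacheSupport.withCache` are hypothetical names, and `Cacheable` is a minimal stand-in for the `persist` / `unpersist` calls available on a Spark `Dataset`, so the example runs without Spark.

```scala
import scala.util.Try

// Minimal stand-in for the part of Spark's Dataset API used here
// (a Dataset can be persisted and later unpersisted).
trait Cacheable {
  def persist(): Unit
  def unpersist(): Unit
}

object CacheSupport {
  // Persist `data`, run `body` against it, and always release the cache,
  // even when `body` throws. The failure is captured in the returned Try.
  def withCache[D <: Cacheable, T](data: D)(body: D => T): Try[T] =
    Try {
      data.persist()
      try body(data)
      finally data.unpersist()
    }
}
```

The `finally` block is what guarantees the cached data is released on the failure path. A bare `dataFrame.cache()` as in `addJoinBreak` leaves that cleanup to the caller.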

nikhilsimha · 2023-11-22T07:04:56Z

spark/src/main/scala/ai/chronon/spark/Join.scala

-              case (partialDf, (rightPart, rightDf)) => joinWithLeft(partialDf, rightDf, rightPart)
+              case (partialDf, ((rightPart, rightDf), i)) =>
+                val next = joinWithLeft(partialDf, rightDf, rightPart)
+                if (((i + 1) % joinsAtATime) == 0) {


If we have 24 parts, there will be 3 cache points: at 8, 16, and 24.

The cache at 16 should evict the cache from 8. 24 shouldn't cache, since it is the last one.
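The schedule suggested in this comment can be written down as a small pure function. This is a sketch of the suggestion, not code from the PR. `JoinBreaks`, `cachePoints`, and `evictions` are hypothetical names, and part counts are 1-based to match the `(i + 1) % joinsAtATime` check in the diff.

```scala
object JoinBreaks {
  // 1-based part counts after which the partial join result should be cached.
  // A break that would land on the final part is skipped, because nothing
  // is joined after it and the cache would never be reused.
  def cachePoints(numParts: Int, joinsAtATime: Int): List[Int] = {
    require(joinsAtATime > 0, "joinsAtATime must be positive")
    (joinsAtATime until numParts by joinsAtATime).toList
  }

  // Pairs of (newCachePoint, cachePointToEvict): once the newer cache exists,
  // the previous one is no longer needed.
  def evictions(points: List[Int]): List[(Int, Int)] =
    points.zip(points.drop(1)).map { case (prev, next) => (next, prev) }
}
```

For 24 parts with `joinsAtATime = 8`, `cachePoints` yields 8 and 16, and `evictions` yields the single pair (16, 8). Since Spark's `cache()` is lazy, an implementation would likely need the newer cache to be materialized before unpersisting the older one, otherwise the earlier joins may be recomputed.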

qiyang0221 · 2023-11-22T18:22:04Z

Does this PR mean we will break up the batch request into mini-batch requests and fetch them in parallel? @nikhilsimha

nikhilsimha · 2023-11-22T21:58:50Z

> Does this PR mean we will break up the batch request into mini-batch requests and fetch them in parallel? @nikhilsimha

This basically only applies to Spark offline jobs, Yang.

mears-stripe · 2023-11-23T01:19:59Z

> > Does this PR mean we will break up the batch request into mini-batch requests and fetch them in parallel? @nikhilsimha
>
> This basically only applies to Spark offline jobs, Yang.

I added some details to the PR description.

And sorry, the PR is still a WIP. I'm working on getting the CI setup to work.

mears-stripe added 2 commits November 21, 2023 17:51

Break up joins

eeb0ef6

Fix addJoinBreak

9b787bc

nikhilsimha reviewed Nov 22, 2023

mears-stripe added 2 commits November 22, 2023 08:38

PR feedback

5096e4a

Empty-Commit

d1ec5d8

Empty-Commit

93ecd2c

mears-stripe changed the title ~~Break up joins~~ Improve the scalability of the join between the LHS and GroupBys by breaking up the join Nov 23, 2023
