New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
To add random exchange in NLJ and SPF when right child is DAS #1484
base: master
Are you sure you want to change the base?
Conversation
563a09f
to
9f44290
Compare
@@ -2792,6 +2792,7 @@ int ObStaticEngineCG::generate_basic_transmit_spec( | |||
spec.is_wf_hybrid_ = op.is_wf_hybrid(); | |||
spec.sample_type_ = op.get_sample_type(); | |||
spec.repartition_table_id_ = op.get_repartition_table_id(); | |||
spec.is_related_pair_ = op.is_related_pair(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
插入 random shuffle 后,会将原来的 dfo 分割成两个 dfo。那么 nlj 所在的 dfo 就会变成纯计算 dfo,它会被调度到 qc 所在到机器上。qc 的机器很可能和左表数据分布在不同的机器上,这会导致左表的所有数据都需要经过网络才能传输给 nlj 算子(比如左表为单分区表,本来不用网络传输就能把数据送给nlj算子)。为了减少网络传输,我们仍然让 nlj 所在的 dfo 和左表的 dfo 分布在同样的机器上。这样,尽管做了 global random shuffle,但还是有一部分数据可以走 local channel。
注释我加在了 exchange info 的结构体里:
/**
When data volume of left table in NLJ is small, there is a task skew due to small granule count.
Therefore, we insert random shuffle above left table to achieve load balancing. Exchange operator
splits DFO into two, and DFO with NLJ is purely computing and scheduled to QC machine. So, data
in left table will be sent to NLJ operator via network. To minimize network transmission, we
schedule DFO with NLJ on machines where left table data is located.
*/
efc3c8b
to
e933fae
Compare
e933fae
to
91f69a9
Compare
@@ -4535,6 +4537,8 @@ int ObLogPlan::compute_join_exchange_info(JoinPath &join_path, | |||
} else { | |||
left_exch_info.dist_method_ = ObPQDistributeMethod::HASH; | |||
right_exch_info.dist_method_ = ObPQDistributeMethod::PARTITION_HASH; | |||
// The hash join dfo should rely on the child of the local shuffle side. | |||
left_exch_info.is_related_child_ = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
slave mapping 中也可以用 is_related_child_ 这个标记,表示父 dfo 依赖子 dfo 进行构造
LOG_WARN("fail to assign p2p dh map info", K(ret)); | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里之前是个bug。父dfo依赖子dfo进行构造时,没有生成p2p id。这会导致slave mapping和join filter一起使用时报4016。在这里生成p2p id之后可以解决这个bug
zhangyuqin1998 seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
Task Description
When the data volume in the pipeline is small, the GI operator can only divide a small number of granules, resulting in some workers being idle because they cannot get the granules. This solution inserts a random shuffle operator on GI, and threads that grab granules randomly distribute data to various threads in the cluster, enabling all threads to work and achieving load balancing.
#1494
Solution Description
For now, only Nested Loop Join and SubPlan Filter are considered. Since they both drive right tables with the left table and the right table's query is slower, we need to enable as many threads as possible to lead the right table getting rows in parallel. So we add a random shuffle operator on top of the left child to shuffle the data and enable more threads to work.
Passed Regressions
N/A
Upgrade Compatibility
N/A
Other Information
This is the origin plan. There is obvious task skew.
After adding the random shuffle operator, task skew has been eliminated and and the total query time is reduced
### Release Note