I think support for Spark sort-merge joins between two Hudi tables with bucket optimization is an important feature.
Currently, if we join two Hudi tables, the bucket index's bucket information is not usable by Spark, so a shuffle is always needed. As explained in #8657, the hashing, file naming, file numbering, and file sorting all differ.
Unfortunately, according to https://issues.apache.org/jira/browse/SPARK-19256, Spark bucketing is not yet compatible with Hive bucketing. So if we have to choose between Spark and Hive, I think Spark should take priority.
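To make the incompatibility concrete, here is a minimal sketch in plain Python of why the two schemes assign the same integer key to different buckets. It assumes Spark-style bucketing (Murmur3 x86 32-bit hash with seed 42, then a positive modulo) and Hive-style bucketing (`(hashCode & Integer.MAX_VALUE) % numBuckets`, where an int's hashCode is the value itself); it is an illustration, not code from Hudi, Spark, or Hive.

```python
# Illustration only: Spark-style vs Hive-style bucket assignment for an
# integer column. Modeled on Murmur3_x86_32 (seed 42) for Spark and on
# (hashCode & Integer.MAX_VALUE) % numBuckets for Hive.

def _rotl32(x, r):
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF

def murmur3_hash_int(value, seed=42):
    """Murmur3 x86 32-bit hash of a single 32-bit int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    k = ((value & 0xFFFFFFFF) * c1) & 0xFFFFFFFF
    k = (_rotl32(k, 15) * c2) & 0xFFFFFFFF
    h = seed ^ k
    h = _rotl32(h, 13)
    h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    # Finalization mix; input length is 4 bytes.
    h ^= 4
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

def spark_bucket(value, num_buckets):
    """Spark-style: positive modulo of the signed 32-bit Murmur3 hash."""
    h = murmur3_hash_int(value)
    if h >= 0x80000000:
        h -= 0x100000000  # reinterpret as signed int32
    return h % num_buckets  # Python % is already non-negative here

def hive_bucket(value, num_buckets):
    """Hive-style: (hashCode & Integer.MAX_VALUE) % numBuckets."""
    return (value & 0x7FFFFFFF) % num_buckets

if __name__ == "__main__":
    keys = range(10)
    print("hive :", [hive_bucket(k, 4) for k in keys])
    print("spark:", [spark_bucket(k, 4) for k in keys])
```

Because the bucket assignments disagree (on top of the file-naming and sorting differences), a reader of one layout cannot trust the other's bucket metadata, and the shuffle cannot be elided.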
This is a really useful feature to have.
We want to use Hudi at work, but unfortunately we have a couple of bucketed/sorted tables, and this is definitely a blocker for our migration to Hudi.
According to parisni in [HUDI-6150] Support bucketing for each hive client (https://github.com//pull/8657):
"So I assume Hudi's way of doing things (which is compliant with neither Hive nor Spark) cannot be used to improve query-engine operations such as joins and filters. That would mean all of the following are wrong:
- the current config https://hudi.apache.org/docs/configurations/#hoodiedatasourcehive_syncbucket_sync
- this current PR
- the RFC statement about support of Hive bucketing https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index"
Do you have any update on this?