Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GLUTEN-5757][CORE] Remove unnecessary ProjectExecTransformer for Generate #5782

Merged

Conversation

xumingming
Copy link
Contributor

@xumingming xumingming commented May 17, 2024

What changes were proposed in this pull request?

If generator function's input is already Attribute reference, we omit the
introduction of the ProjectExec.

Previous implementation always assume there is Project under Generate. In
the new implementation we added a metadata(injectedProject) in Substrait
plan to tell us whether there is a dedicated Project under Generate.

(Fixes: #5757)

Copy link

#5757

Copy link

Run Gluten Clickhouse CI

@xumingming xumingming force-pushed the remove-unnecessary-project-for-generator-2 branch from cc95e19 to 5378fd4 Compare May 17, 2024 05:24
Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

@xumingming xumingming force-pushed the remove-unnecessary-project-for-generator-2 branch from 0130c96 to 49ccf81 Compare May 17, 2024 12:44
Copy link

Run Gluten Clickhouse CI

@xumingming
Copy link
Contributor Author

xumingming commented May 20, 2024

@zhztheplayer Seems the test transform.sql in spark-test-spark32-slow failed, do you happen to know how to run this single test locally on my laptop?

2024-05-17T15:18:18.8398949Z - transform.sql *** FAILED ***
2024-05-17T15:18:18.8400189Z   transform.sql
2024-05-17T15:18:18.8404882Z   Expected "struct<[a:string,b:string,c:string,d:array<int>,e:string]>", but got "struct<[]>" Schema did not match for query #34
2024-05-17T15:18:18.8406267Z   SELECT TRANSFORM(b, MAX(a), CAST(SUM(c) AS STRING), myCol, myCol2)
2024-05-17T15:18:18.8407378Z     USING 'cat' AS (a STRING, b STRING, c STRING, d ARRAY<INT>, e STRING)
2024-05-17T15:18:18.8408098Z   FROM script_trans
2024-05-17T15:18:18.8408682Z   LATERAL VIEW explode(array(array(1,2,3))) myTable AS myCol
2024-05-17T15:18:18.8428916Z   LATERAL VIEW explode(myTable.myCol) myTable2 AS myCol2
2024-05-17T15:18:18.8429675Z   WHERE a <= 4
2024-05-17T15:18:18.8430207Z   GROUP BY b, myCol, myCol2
2024-05-17T15:18:18.8431094Z   HAVING max(a) > 1: -- !query
2024-05-17T15:18:18.8431873Z   SELECT TRANSFORM(b, MAX(a), CAST(SUM(c) AS STRING), myCol, myCol2)
2024-05-17T15:18:18.8432979Z     USING 'cat' AS (a STRING, b STRING, c STRING, d ARRAY<INT>, e STRING)
2024-05-17T15:18:18.8433740Z   FROM script_trans
2024-05-17T15:18:18.8434308Z   LATERAL VIEW explode(array(array(1,2,3))) myTable AS myCol
2024-05-17T15:18:18.8435114Z   LATERAL VIEW explode(myTable.myCol) myTable2 AS myCol2
2024-05-17T15:18:18.8435765Z   WHERE a <= 4
2024-05-17T15:18:18.8436159Z   GROUP BY b, myCol, myCol2
2024-05-17T15:18:18.8436626Z   HAVING max(a) > 1
2024-05-17T15:18:18.8437143Z   -- !query schema
2024-05-17T15:18:18.8437518Z   struct<>
2024-05-17T15:18:18.8437924Z   -- !query output
2024-05-17T15:18:18.8438584Z   org.apache.gluten.exception.GlutenException
2024-05-17T15:18:18.8439452Z   java.lang.RuntimeException: Exception: VeloxRuntimeError
2024-05-17T15:18:18.8440223Z   Error Source: RUNTIME
2024-05-17T15:18:18.8440726Z   Error Code: INVALID_STATE
2024-05-17T15:18:18.8441350Z   Reason: Type mismatch: INTEGER vs. ARRAY<INTEGER>
2024-05-17T15:18:18.8442022Z   Retriable: False
2024-05-17T15:18:18.8442657Z   Expression: type_->kindEquals(vector.type())
2024-05-17T15:18:18.8443399Z   Context: Operator: PartialAggregation[6] 5
2024-05-17T15:18:18.8444040Z   Function: decode

@xumingming xumingming force-pushed the remove-unnecessary-project-for-generator-2 branch from 49ccf81 to 5bf6dfe Compare May 20, 2024 03:03
Copy link

Run Gluten Clickhouse CI

@zhztheplayer
Copy link
Member

spark-test-spark32-sl

I think it's in suite GlutenSQLQueryTestSuite and you could try modifying supportedList to run a single test.

@xumingming xumingming force-pushed the remove-unnecessary-project-for-generator-2 branch from 5bf6dfe to 81f500e Compare May 20, 2024 09:15
Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

…erate

If generator function's input is already Attribute reference, we omit the
introduction of the ProjectExec.

Previous implementation always assume there is Project under Generate. In
the new implementation we added a metadata(injectedProject) in Substrait
plan to tell us whether there is a dedicated Project under Generate
@xumingming xumingming force-pushed the remove-unnecessary-project-for-generator-2 branch from d146786 to 9363d75 Compare May 22, 2024 07:58
Copy link

Run Gluten Clickhouse CI

@xumingming
Copy link
Contributor Author

xumingming commented May 22, 2024

@zhztheplayer Since #5838 is merged, this PR is ready for review now.

Previous implementation always assume there is a dedicated Project under Generate, now this is not true anymore. If there is a Project under Generate it could be a dedicated Project for Generate, or it also could be just a normal Project. So we added a metadata(injectedProject) in Substrait plan to tell us whether there is a dedicated Project under Generate.

@xumingming
Copy link
Contributor Author

@zhztheplayer Gluten ClickHouse CI failed, it needs login, what's the username and password?

@xumingming
Copy link
Contributor Author

All tests passed.

@zhztheplayer
Copy link
Member

cc @marin-ma

Copy link
Contributor

@marin-ma marin-ma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@marin-ma marin-ma merged commit 5dd884e into apache:main May 23, 2024
41 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unnecessary ProjectExec generated for Generate function
3 participants