Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-48322][SPARK-42965][SQL][CONNECT][PYTHON] Drop internal metadata in DataFrame.schema #46636

Closed

Conversation

zhengruifeng
Copy link
Contributor

@zhengruifeng zhengruifeng commented May 17, 2024

What changes were proposed in this pull request?

Drop internal metadata in DataFrame.schema

Why are the changes needed?

Internal metadata might be leaked in both Spark Connect and Spark Classic,

e.g. in Spark Classic

In [9]: spark.range(10).select(sf.lit(1).alias("key"), "id").groupBy("key").agg(sf.max("id")).schema.json()
Out[9]: '{"fields":[{"metadata":{},"name":"key","nullable":false,"type":"integer"},{"metadata":{"__autoGeneratedAlias":"true"},"name":"max(id)","nullable":true,"type":"long"}],"type":"struct"}'

What make it worse is that internal metadata maybe leaked in different cases of Spark Connect and Spark Classic, so need to add additional _drop_meta in Pandas APIs to make assertions work.

Does this PR introduce any user-facing change?

yes, the internal metadata will be dropped in DataFrame.schema, e.g. the __autoGeneratedAlias in the example query

before:

In [9]: spark.range(10).select(sf.lit(1).alias("key"), "id").groupBy("key").agg(sf.max("id")).schema.json()
Out[9]: '{"fields":[{"metadata":{},"name":"key","nullable":false,"type":"integer"},{"metadata":{"__autoGeneratedAlias":"true"},"name":"max(id)","nullable":true,"type":"long"}],"type":"struct"}'

after:

In [2]: spark.range(10).select(sf.lit(1).alias("key"), "id").groupBy("key").agg(sf.max("id")).schema.json()
Out[2]: '{"fields":[{"metadata":{},"name":"key","nullable":false,"type":"integer"},{"metadata":{},"name":"max(id)","nullable":true,"type":"long"}],"type":"struct"}

How was this patch tested?

CI

Was this patch authored or co-authored using generative AI tooling?

No

@zhengruifeng zhengruifeng changed the title [SPARK-48322][SQL][CONNECT][PYTHON] Drop internal metadata in DataFrame.schema [SPARK-48322][SPARK-42965][SQL][CONNECT][PYTHON] Drop internal metadata in DataFrame.schema May 17, 2024
@zhengruifeng
Copy link
Contributor Author

cc @cloud-fan and @HyukjinKwon

@zhengruifeng
Copy link
Contributor Author

cc @cloud-fan and @HyukjinKwon would you mind taking a look? thanks

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan
Copy link
Contributor

can we update the Does this PR introduce any user-facing change? section? This is a user-visible change, as getting StructField metadata is a public API.

@zhengruifeng
Copy link
Contributor Author

can we update the Does this PR introduce any user-facing change? section? This is a user-visible change, as getting StructField metadata is a public API.

done, thanks

@HyukjinKwon
Copy link
Member

Merged to master.

@zhengruifeng zhengruifeng deleted the connect_remove_internal_meta branch May 29, 2024 01:10
@zhengruifeng
Copy link
Contributor Author

zhengruifeng commented May 29, 2024

I am going to revert this PR, since it causes some streaming integration tests fail

will try to fix this issue again

zhengruifeng added a commit that referenced this pull request May 29, 2024
…l metadata in `DataFrame.schema`"

revert #46636

#46636 (comment)

Closes #46790 from zhengruifeng/revert_metadata_drop.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
@@ -561,7 +561,7 @@ class Dataset[T] private[sql](
* @since 1.6.0
*/
def schema: StructType = sparkSession.withActive {
queryExecution.analyzed.schema
removeInternalMetadata(queryExecution.analyzed.schema)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhengruifeng for such a breaking change, we should send an email to the dev list before merging the PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, it has been reverted

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants