-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-48322][SPARK-42965][SQL][CONNECT][PYTHON] Drop internal metadata in DataFrame.schema
#46636
[SPARK-48322][SPARK-42965][SQL][CONNECT][PYTHON] Drop internal metadata in DataFrame.schema
#46636
Conversation
DataFrame.schema
DataFrame.schema
ee0826b
to
0b3eba0
Compare
cc @cloud-fan and @HyukjinKwon |
cc @cloud-fan and @HyukjinKwon would you mind taking a look? thanks |
thanks, merging to master! |
can we update the |
done, thanks |
Merged to master. |
I am going to revert this PR, since it causes some streaming integration tests fail will try to fix this issue again |
…l metadata in `DataFrame.schema`" revert #46636 #46636 (comment) Closes #46790 from zhengruifeng/revert_metadata_drop. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
@@ -561,7 +561,7 @@ class Dataset[T] private[sql]( | |||
* @since 1.6.0 | |||
*/ | |||
def schema: StructType = sparkSession.withActive { | |||
queryExecution.analyzed.schema | |||
removeInternalMetadata(queryExecution.analyzed.schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhengruifeng for such a breaking change, we should send an email to the dev list before merging the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it, it has been reverted
…ta in `DataFrame.schema` ### What changes were proposed in this pull request? Drop internal metadata in `DataFrame.schema` ### Why are the changes needed? Internal metadata might be leaked in both Spark Connect and Spark Classic, e.g. in Spark Classic ``` In [9]: spark.range(10).select(sf.lit(1).alias("key"), "id").groupBy("key").agg(sf.max("id")).schema.json() Out[9]: '{"fields":[{"metadata":{},"name":"key","nullable":false,"type":"integer"},{"metadata":{"__autoGeneratedAlias":"true"},"name":"max(id)","nullable":true,"type":"long"}],"type":"struct"}' ``` What make it worse is that internal metadata maybe leaked in different cases of Spark Connect and Spark Classic, so need to add additional `_drop_meta` in Pandas APIs to make assertions work. ### Does this PR introduce _any_ user-facing change? yes, the internal metadata will be dropped in `DataFrame.schema`, e.g. the `__autoGeneratedAlias` in the example query before: ``` In [9]: spark.range(10).select(sf.lit(1).alias("key"), "id").groupBy("key").agg(sf.max("id")).schema.json() Out[9]: '{"fields":[{"metadata":{},"name":"key","nullable":false,"type":"integer"},{"metadata":{"__autoGeneratedAlias":"true"},"name":"max(id)","nullable":true,"type":"long"}],"type":"struct"}' ``` after: ``` In [2]: spark.range(10).select(sf.lit(1).alias("key"), "id").groupBy("key").agg(sf.max("id")).schema.json() Out[2]: '{"fields":[{"metadata":{},"name":"key","nullable":false,"type":"integer"},{"metadata":{},"name":"max(id)","nullable":true,"type":"long"}],"type":"struct"} ``` ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#46636 from zhengruifeng/connect_remove_internal_meta. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…l metadata in `DataFrame.schema`" revert apache#46636 apache#46636 (comment) Closes apache#46790 from zhengruifeng/revert_metadata_drop. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
What changes were proposed in this pull request?
Drop internal metadata in
DataFrame.schema
Why are the changes needed?
Internal metadata might be leaked in both Spark Connect and Spark Classic,
e.g. in Spark Classic
What make it worse is that internal metadata maybe leaked in different cases of Spark Connect and Spark Classic, so need to add additional
_drop_meta
in Pandas APIs to make assertions work.Does this PR introduce any user-facing change?
yes, the internal metadata will be dropped in
DataFrame.schema
, e.g. the__autoGeneratedAlias
in the example querybefore:
after:
How was this patch tested?
CI
Was this patch authored or co-authored using generative AI tooling?
No