[GLUTEN-5771][VL] Add metrics for ColumnarArrowEvalPythonExec #5772

Merged: 5 commits, May 23, 2024

@yma11 yma11 commented May 16, 2024

What changes were proposed in this pull request?

Add metrics for ColumnarArrowEvalPythonExec.

(Fixes: #5771)

Spark UI

image

How was this patch tested?

We tested the performance of Arrow UDFs with the following script and collected some performance data:

import pandas as pd
from pyspark.sql.functions import pandas_udf


# Series-to-Series pandas UDF: add one to every value.
@pandas_udf('long')
def pandas_plus_one(v: pd.Series) -> pd.Series:
    return v + 1


# Series-to-Series pandas UDF: split on ':' and keep the element at index 1.
@pandas_udf('string')
def pd_get_first(v: pd.Series) -> pd.Series:
    return v.str.split(':').str[1]


# `spark` is an existing SparkSession (e.g. the one provided by the pyspark shell).
# test int
df = spark.read.orc("file:///xx/yy").select("user_id").withColumn("processed_user_id", pandas_plus_one("user_id")).select("processed_user_id")
# test string
df = spark.read.orc("file:///xx/yy").select("url").withColumn("processed_url", pd_get_first("url")).select("processed_url")
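
To make the string UDF's behavior concrete, here is a plain-Python sketch of what `v.str.split(':').str[1]` computes for each row. It uses neither Spark nor pandas; the helper name `second_field` and the sample values are made up for illustration. Note that, despite the UDF's name, index 1 selects the second piece of the split, and pandas yields a missing value for rows that contain no ':'.

```python
from typing import Optional


def second_field(value: Optional[str]) -> Optional[str]:
    """Per-row equivalent of v.str.split(':').str[1] (hypothetical helper)."""
    if value is None:
        # Missing input stays missing.
        return None
    parts = value.split(":")
    # pandas returns a missing value when index 1 does not exist.
    return parts[1] if len(parts) > 1 else None


# Made-up sample rows, not data from the benchmark.
samples = ["http://example.com/a", "no-colon", None]
results = [second_field(s) for s in samples]
print(results)  # ['//example.com/a', None, None]
```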

The results show a ~20% performance gain over vanilla Spark.

image

@yma11 yma11 changed the title [GLUTEN-5771] Add metric for ColumnarArrowEvalPythonExec [GLUTEN-5771] Add metrics for ColumnarArrowEvalPythonExec May 16, 2024
@yma11 yma11 changed the title [GLUTEN-5771] Add metrics for ColumnarArrowEvalPythonExec [GLUTEN-5771][VL] Add metrics for ColumnarArrowEvalPythonExec May 16, 2024
@FelixYBW

@yma11 Can you add a UI chart for the PyArrow UDF, and also some implementation details?

In theory, we can convert Velox data to Arrow inside the Velox pipeline, then pass the Arrow pointer to Spark, which sends it to the Python process. The whole process would then involve no C2R or R2C conversion and no memcpy between Velox and Spark. Can we achieve this?

@yma11

yma11 commented May 17, 2024

Yes. There is no C2R or R2C in the current implementation, only a Velox columnar to Arrow conversion. Whether memcpy is avoided depends on the Arrow bridge: I found that Velox still performs some memory allocation for data types such as string. Let me add the implementation details under the feature tracking issue.
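
As a loose analogy for the memcpy question (this is plain Python, not Velox, Arrow, or Gluten code), the sketch below contrasts handing over a view of an existing buffer with handing over a copy. All names are hypothetical. A view shares the producer's memory, which is what passing an Arrow pointer aims for; a copy allocates new memory, which is what happens whenever a conversion has to materialize the data again.

```python
# Hypothetical stand-in for a column buffer owned by the producer.
producer_buffer = bytearray(b"user_id_column_bytes")

# Zero-copy handoff: the consumer receives a view onto the same memory.
shared_view = memoryview(producer_buffer)

# Copying handoff: the consumer receives its own freshly allocated bytes.
copied = bytes(producer_buffer)

# A later write by the producer is visible through the view but not the copy.
producer_buffer[0] = ord("U")

print(chr(shared_view[0]))  # U
print(chr(copied[0]))       # u
```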

@yma11

yma11 commented May 20, 2024

@FelixYBW The implementation details have now been added in 5461, and the perf data is included there as well. FYI.


@zhztheplayer zhztheplayer left a comment

I just noticed that this file (ColumnarArrowEvalPythonExec.scala) declares its package as org.apache.spark.api.python, which is wrong. Would you like to fix it? @yma11

@yma11 yma11 force-pushed the metric branch 3 times, most recently from 3c20f93 to af02cd6 Compare May 21, 2024 05:38
@yma11

yma11 commented May 21, 2024

Fixed.

@yma11

yma11 commented May 22, 2024

@zhztheplayer Please take another look. Thanks.

@zhouyuan zhouyuan merged commit 621a479 into apache:main May 23, 2024
38 checks passed
@yma11 yma11 deleted the metric branch May 31, 2024 10:16