
feat(pyspark): pandas agg udf support #9206

Closed
wants to merge 1 commit

Conversation

ted0928 (Contributor) commented May 17, 2024

Description of changes

- Introduce a new annotation for aggregate UDFs: @ibis.udf.agg.pandas
- Add an implementation for the PySpark backend
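For illustration, here is a hypothetical usage sketch of the proposed decorator, modeled on ibis's existing scalar UDF decorators. The @ibis.udf.agg.pandas name comes from this PR; the signature and semantics shown are illustrative and not part of the released ibis API:

```python
import ibis

# Hypothetical: @ibis.udf.agg.pandas is the annotation this PR proposes,
# not a released ibis API; the signature here is illustrative only.
@ibis.udf.agg.pandas
def my_mean(x: float) -> float:
    # the column for each group would arrive as a pandas Series
    return x.mean()

t = ibis.table({"g": "string", "x": "float64"}, name="t")
expr = t.group_by("g").agg(avg_x=my_mean(t.x))
```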

Issues closed

- feat(pyspark): support udaf

cpcloud (Member) commented May 31, 2024

@ted0928 Thanks for the PR!

Unfortunately, we're not ready to go down the rabbit hole of non-builtin aggregate functions yet.

The way the implementations of custom aggregates work greatly influences their API design (which isn't really true for scalar UDFs).

For example, in all of our backends that support user-defined aggregate functions, there is no assumption that all the data for a given aggregate call will be in the same place.

The consequence of not making this assumption is that users typically need to implement at least three methods:

  1. Some initialization of the state that will be stored and mutated in the aggregate.
  2. A "step" method that accepts (conceptually) a row, and updates the state according to the specific aggregate.
  3. Some "finalize" method if anything needs to be done to compute the final value of the aggregate.
  4. Optionally, a method that can back out the changes made by the "step" method to support window functions; even this is pretty dicey, since it forces a specific algorithm on the database and a non-optimal worst-case runtime for window functions.

This is what the SQLite API looks like in both Python and C.
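For example, here is a minimal sketch of the Python side, using a hand-rolled average aggregate. The three methods map onto the list above: __init__ initializes the state, step folds in one row, and finalize produces the result.

```python
import sqlite3

class Avg:
    def __init__(self):
        # 1. initialize the state stored for this aggregate call
        self.total = 0.0
        self.count = 0

    def step(self, value):
        # 2. fold one row into the state
        if value is not None:
            self.total += value
            self.count += 1

    def finalize(self):
        # 3. compute the final value from the accumulated state
        return self.total / self.count if self.count else None

con = sqlite3.connect(":memory:")
con.create_aggregate("my_avg", 1, Avg)
con.execute("CREATE TABLE t (x REAL)")
con.executemany("INSERT INTO t VALUES (?)", [(1.0,), (2.0,), (4.0,)])
print(con.execute("SELECT my_avg(x) FROM t").fetchone()[0])  # ~2.333
```

On Python 3.11+ (with SQLite 3.25+) the same shape extends to window functions: a class implementing step, value, inverse, and finalize can be registered via Connection.create_window_function, where inverse is the optional fourth method above.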

This is also typically what a single-node system looks like.

For a distributed system, typically "step" operates on a partial aggregate, and there's an additional "merge" step that combines the partial aggregate state into another bundle of state.
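A minimal sketch of that shape, again using an average; the class and method names here are illustrative, not any particular engine's API:

```python
# Illustrative only: the state machine a distributed engine expects for a
# user-defined aggregate, not any specific backend's API.
class AvgState:
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def step(self, value):
        # runs independently on each partition's local rows
        if value is not None:
            self.total += value
            self.count += 1

    def merge(self, other):
        # combines two partial states, possibly shipped between nodes
        self.total += other.total
        self.count += other.count

    def finalize(self):
        return self.total / self.count if self.count else None
```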

PySpark pandas UDFs are an exception: a distributed aggregate that requires none of the above complexity, sacrificing reliability and scalability for convenience, since all the data for a group must be on a single node for such an aggregate to work. Starting with this part of the API therefore sets a precedent for an API whose design is very uncertain at the moment.
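For contrast, a minimal sketch of the PySpark API in question, a grouped-aggregate pandas UDF: there is no state machine at all, because Spark materializes the whole group's column as one pandas Series on a single executor.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

# The Series -> scalar type hints mark this as a grouped-aggregate pandas
# UDF; the entire group must fit in one executor's memory.
@pandas_udf("double")
def mean_udf(v: pd.Series) -> float:
    return v.mean()

df.groupBy("id").agg(mean_udf(df["v"])).show()
```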

cpcloud closed this May 31, 2024