
feat(pyspark): pandas agg udf support #9206

Closed
wants to merge 1 commit

Conversation

ted0928 (Contributor) commented May 17, 2024

Description of changes

- Introduce a new annotation for aggregate UDFs: @ibis.udf.agg.pandas
- Add an implementation for the PySpark backend
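For illustration, here is a hypothetical usage sketch of the proposed decorator, modeled on ibis's existing scalar UDF decorators. The @ibis.udf.agg.pandas name comes from this PR; the signature and semantics shown are illustrative and not part of the released ibis API:

```python
import ibis

# Hypothetical: @ibis.udf.agg.pandas is the annotation this PR proposes,
# not a released ibis API; the signature here is illustrative only.
@ibis.udf.agg.pandas
def my_mean(x: float) -> float:
    # the column for each group would arrive as a pandas Series
    return x.mean()

t = ibis.table({"g": "string", "x": "float64"}, name="t")
expr = t.group_by("g").agg(avg_x=my_mean(t.x))
```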

Issues closed

- feat(pyspark): support udaf

cpcloud (Member) commented May 31, 2024

@ted0928 Thanks for the PR!

Unfortunately, we're not ready to go down the rabbit hole of non-builtin aggregate functions yet.

The way the implementations of custom aggregates work greatly influences their API design (which isn't really true for scalar UDFs).

For example, in all of our backends that support user-defined aggregate functions, there is no assumption that all the data for a given aggregate call will be in the same place.

The consequence of not making this assumption is that users typically need to implement at least three methods:

  1. Some initialization of the state that will be stored and mutated in the aggregate.
  2. A "step" method that accepts (conceptually) a row, and updates the state according to the specific aggregate.
  3. Some "finalize" method if anything needs to be done to compute the final value of the aggregate.
  4. Optionally, a method that can back out the changes made by the "step" method to support window functions; even this is pretty dicey, since it forces a specific algorithm on the database and a non-optimal worst-case runtime for window functions.

This is what the SQLite API looks like in both Python and C.
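For example, here is a minimal sketch of the Python side, using a hand-rolled average aggregate. The three methods map onto the list above: __init__ initializes the state, step folds in one row, and finalize produces the result.

```python
import sqlite3

class Avg:
    def __init__(self):
        # 1. initialize the state stored for this aggregate call
        self.total = 0.0
        self.count = 0

    def step(self, value):
        # 2. fold one row into the state
        if value is not None:
            self.total += value
            self.count += 1

    def finalize(self):
        # 3. compute the final value from the accumulated state
        return self.total / self.count if self.count else None

con = sqlite3.connect(":memory:")
con.create_aggregate("my_avg", 1, Avg)
con.execute("CREATE TABLE t (x REAL)")
con.executemany("INSERT INTO t VALUES (?)", [(1.0,), (2.0,), (4.0,)])
print(con.execute("SELECT my_avg(x) FROM t").fetchone()[0])  # ~2.333
```

On Python 3.11+ (with SQLite 3.25+) the same shape extends to window functions: a class implementing step, value, inverse, and finalize can be registered via Connection.create_window_function, where inverse is the optional fourth method above.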

This is also typically what a single-node system looks like.

For a distributed system, typically "step" operates on a partial aggregate, and there's an additional "merge" step that combines the partial aggregate state into another bundle of state.
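A minimal sketch of that shape, again using an average; the class and method names here are illustrative, not any particular engine's API:

```python
# Illustrative only: the state machine a distributed engine expects for a
# user-defined aggregate, not any specific backend's API.
class AvgState:
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def step(self, value):
        # runs independently on each partition's local rows
        if value is not None:
            self.total += value
            self.count += 1

    def merge(self, other):
        # combines two partial states, possibly shipped between nodes
        self.total += other.total
        self.count += other.count

    def finalize(self):
        return self.total / self.count if self.count else None
```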

PySpark pandas UDFs are an exception: a distributed aggregate that requires none of the above complexity, sacrificing reliability and scalability for convenience, since all the data for a group must be on a single node for such an aggregate to work. Starting with this part of the API therefore sets a precedent for an API whose design is very uncertain at the moment.
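For contrast, a minimal sketch of the PySpark API in question, a grouped-aggregate pandas UDF: there is no state machine at all, because Spark materializes the whole group's column as one pandas Series on a single executor.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

# The Series -> scalar type hints mark this as a grouped-aggregate pandas
# UDF; the entire group must fit in one executor's memory.
@pandas_udf("double")
def mean_udf(v: pd.Series) -> float:
    return v.mean()

df.groupBy("id").agg(mean_udf(df["v"])).show()
```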

cpcloud closed this May 31, 2024