Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(python): Handle generalized ufunc edge cases better #16086

Closed

Conversation

itamarst
Copy link
Contributor

@itamarst itamarst commented May 6, 2024

Fixes #14811

@github-actions github-actions bot added fix Bug fix python Related to Python Polars labels May 6, 2024
@itamarst itamarst marked this pull request as ready for review May 6, 2024 19:29
@MarcoGorelli
Copy link
Collaborator

thanks for doing this! numba+Polars currently really feels like playing with fire, so I'm happy to see this

@deanm0000 I think you've looked at related code recently? fancy taking a look if this interests you?

@deanm0000
Copy link
Collaborator

@MarcoGorelli I put a couple notes, nothing earth shattering from me.

@itamarst
Copy link
Contributor Author

itamarst commented May 7, 2024

OK I addressed the two comments.

@itamarst itamarst requested a review from deanm0000 May 7, 2024 13:52
result[i] = mean + i


def test_generalized_ufunc() -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not really testing Expr.array_ufunc if you use map_batches. I was thinking tests like df.select(gufunc_mean(pl.col('s')).alias('result') That said, I don't think you need to put those tests in this PR as it's only tangentially related.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.

@ritchie46
Copy link
Member

@MarcoGorelli I'd like to leave this one to you. I don't understand enough of numba to have an opinion here.

@itamarst
Copy link
Contributor Author

itamarst commented May 8, 2024

Just to expand: this isn't really about Numba, Numba is merely a convenient way to write generalized ufunc for unit testing purposes.

Standard ufuncs in NumPy operate on an value by value basis. So if you have missing data on Polars side, that's fine (assuming it's valid bit pattern of data, anyway): you do log10() or whatever on all the data, and then throw away the missing data values at the end, it doesn't impact the result.

Generalized ufunc (https://numpy.org/doc/stable/reference/c-api/generalized-ufuncs.html) operate on the whole array though. And now missing data becomes a problem. E.g. if you calculate a mean, you don't want to include the missing data because those are garbage values (see the example in original issue).

So, you shouldn't pass anything with missing data to a generalized ufunc.

In addition, generalized ufuncs can have output sizes that aren't the same as the input size (normal ufuncs operate value by value, so by definition the output is same size as input). That means the trick origin/main does where it allocates only once can't be used as is, and in fact may lead to memory corruption if the memory Polars allocates is too small. If the memory Polars allocates is too large, the result would have garbage values. So to fix this, we have the ufunc do the allocation in this case.

(Technically, given a parser for the signature mini-language, you could figure out the size in many more cases, but this at least makes the code work, i.e. takes us from "wrong results or potential memory corruption" to "correct but not maximally efficient", so it's a good first step).

@MarcoGorelli
Copy link
Collaborator

As far as I can tell, this fixes two separate issues:

  • missing data
  • input size differs from output size

The first one seems quite high-priority, whereas the second one just errors when it shouldn't - good to fix, but definitely lower-prio than silently giving wrong results

Is it possible to split this into two separate PRs, one for each issue which it fixes?

@itamarst
Copy link
Contributor Author

The second one can lead to memory corruption, I think, which seems pretty high priority... (Polars thinks output is N length, ufunc writes out 2N values).

I can split up this PR next week, if you think it's helpful.

@MarcoGorelli
Copy link
Collaborator

does it always throw a loud error like in #14811 though? If so, I'll maintain that it's lower priority than the missing values one. It's still important, but there's a lot of open issues and PRs, gotta prioritise somehow

The missing values one looks much easier to review, too, so that one can be merged a lot quicker

@itamarst
Copy link
Contributor Author

Going to split this into two new PRs, so closing.

@itamarst itamarst closed this May 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix Bug fix python Related to Python Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

generalized ufuncs that return arrays are broken in a variety of cases
5 participants