fix(python): Handle generalized ufunc edge cases better #16086

itamarst · 2024-05-06T18:40:21Z

Fixes #14811

MarcoGorelli · 2024-05-06T21:24:10Z

thanks for doing this! numba+Polars currently really feels like playing with fire, so I'm happy to see this

@deanm0000 I think you've looked at related code recently? fancy taking a look if this interests you?

py-polars/polars/series/series.py

py-polars/tests/unit/interop/numpy/test_ufunc_series.py

deanm0000 · 2024-05-07T03:58:51Z

@MarcoGorelli I put a couple notes, nothing earth shattering from me.

itamarst · 2024-05-07T13:52:08Z

OK I addressed the two comments.

deanm0000 · 2024-05-07T16:57:07Z

py-polars/tests/unit/interop/numpy/test_ufunc_expr.py

+        result[i] = mean + i
+
+
+def test_generalized_ufunc() -> None:


It's not really testing Expr.__array_ufunc__ if you use map_batches. I was thinking of tests like df.select(gufunc_mean(pl.col('s')).alias('result')). That said, I don't think you need to put those tests in this PR, as it's only tangentially related.

ritchie46 · 2024-05-08T10:59:16Z

@MarcoGorelli I'd like to leave this one to you. I don't understand enough of numba to have an opinion here.

itamarst · 2024-05-08T13:37:31Z

Just to expand: this isn't really about Numba. Numba is merely a convenient way to write a generalized ufunc for unit-testing purposes.

Standard ufuncs in NumPy operate on a value-by-value basis. So if you have missing data on the Polars side, that's fine (assuming it's a valid bit pattern for the data type, anyway): you do log10() or whatever on all the data, and then throw away the results for the missing values at the end, so they don't impact the result.
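A minimal NumPy-only sketch of the element-wise case described above. It assumes an Arrow-style layout (a values buffer plus a separate validity mask); the names `values` and `valid` and the garbage value 12345.0 are made up for illustration, and none of this is Polars's actual code.

```python
import numpy as np

# Hypothetical column: the third slot is missing, but its buffer
# still holds some arbitrary (valid) float bit pattern.
values = np.array([1.0, 10.0, 12345.0, 1000.0])
valid = np.array([True, True, False, True])

# A standard ufunc computes each output from the matching input only,
# so the garbage slot cannot leak into the results for the valid slots.
logged = np.log10(values)

# Discard whatever was computed for the missing slot.
valid_results = logged[valid]
```

Because each output depends on exactly one input, masking after the call gives the same answer as masking before it.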

Generalized ufuncs (https://numpy.org/doc/stable/reference/c-api/generalized-ufuncs.html) operate on the whole array, though. And now missing data becomes a problem. E.g. if you calculate a mean, you don't want to include the missing data because those are garbage values (see the example in the original issue).

So, you shouldn't pass anything with missing data to a generalized ufunc.
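To make the whole-array problem concrete, here is a small sketch that uses `ndarray.mean` as a stand-in for a generalized ufunc. The layout (values buffer plus validity mask), the helper name `checked_reduce`, and the use of `ValueError` are all assumptions for illustration; the commit list in this PR mentions switching to `ComputeError`, which this sketch does not try to reproduce.

```python
import numpy as np

# Hypothetical column: the third slot is missing and holds garbage.
values = np.array([1.0, 2.0, 12345.0, 3.0])
valid = np.array([True, True, False, True])

# A whole-array reduction sees the garbage slot, so its result is wrong.
polluted_mean = values.mean()
# The intended answer uses only the valid slots.
intended_mean = values[valid].mean()


def checked_reduce(func, values, valid):
    """Hypothetical guard: refuse to run a whole-array function on data with missing slots."""
    if not valid.all():
        raise ValueError(
            "cannot apply a whole-array function to data with missing values"
        )
    return func(values)
```

Raising an error, rather than silently dropping the missing slots, leaves it to the caller to decide how the nulls should be handled.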

In addition, generalized ufuncs can have output sizes that aren't the same as the input size (normal ufuncs operate value by value, so by definition the output is same size as input). That means the trick origin/main does where it allocates only once can't be used as is, and in fact may lead to memory corruption if the memory Polars allocates is too small. If the memory Polars allocates is too large, the result would have garbage values. So to fix this, we have the ufunc do the allocation in this case.
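The allocation strategy described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `call_ufunc` is a hypothetical helper, it assumes all inputs share the shape and dtype of the first one, and it relies only on the `signature` attribute that NumPy ufuncs expose (`None` for standard ufuncs, a string for generalized ones).

```python
import numpy as np


def call_ufunc(ufunc, *inputs):
    """Hypothetical dispatcher, not Polars's real implementation."""
    if ufunc.signature is None:
        # Standard ufunc: the output matches the inputs element for element,
        # so the caller can allocate the buffer once up front.
        # (Assumes every input has the shape and dtype of inputs[0].)
        out = np.empty_like(inputs[0])
        ufunc(*inputs, out=out)
        return out
    # Generalized ufunc: the output shape is determined by its signature,
    # so let the ufunc allocate the result itself.
    return ufunc(*inputs)


a = np.array([1.0, 2.0, 3.0])
matrix = np.ones((2, 3))

doubled = call_ufunc(np.add, a, a)            # same length as the input
product = call_ufunc(np.matmul, matrix, a)    # length 2, not 3
```

Here `np.matmul` serves as a readily available generalized ufunc whose output length differs from that of its vector input.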

(Technically, given a parser for the signature mini-language, you could figure out the size in many more cases, but this at least makes the code work, i.e. takes us from "wrong results or potential memory corruption" to "correct but not maximally efficient", so it's a good first step).
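As a rough idea of what such a parser might look like, the sketch below handles only the basic shape of the signature mini-language (parenthesised, comma-separated dimension names on either side of `->`). The function names `parse_signature` and `output_size_known` are hypothetical, and the check is a simplification: it only tells you whether every output dimension name also appears among the inputs.

```python
import re

# Matches one parenthesised argument, e.g. "(m,n)" or "()".
_ARG = re.compile(r"\(([^()]*)\)")


def parse_signature(signature):
    """Split a signature such as "(n),(n)->()" into input and output core dimensions."""
    inputs, outputs = signature.replace(" ", "").split("->")

    def dims(side):
        return [
            tuple(d for d in group.split(",") if d)
            for group in _ARG.findall(side)
        ]

    return dims(inputs), dims(outputs)


def output_size_known(signature):
    """True when each output dimension is a fixed size or is named in some input."""
    inputs, outputs = parse_signature(signature)
    known = {d for arg in inputs for d in arg}
    return all(d in known or d.isdigit() for arg in outputs for d in arg)
```

With this helper, a signature like "(n)->(n)" would allow preallocation, while "(n)->(m)" would fall back to letting the ufunc allocate.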

MarcoGorelli · 2024-05-11T08:24:40Z

As far as I can tell, this fixes two separate issues:

1. missing data
2. input size differs from output size

The first one seems quite high-priority, whereas the second one just errors when it shouldn't. Good to fix, but definitely lower-prio than silently giving wrong results.

Is it possible to split this into two separate PRs, one for each issue which it fixes?

itamarst · 2024-05-11T12:08:24Z

The second one can lead to memory corruption, I think, which seems pretty high priority... (Polars thinks output is N length, ufunc writes out 2N values).

I can split up this PR next week, if you think it's helpful.

MarcoGorelli · 2024-05-11T12:36:27Z

does it always throw a loud error like in #14811 though? If so, I'll maintain that it's lower priority than the missing values one. It's still important, but there's a lot of open issues and PRs, gotta prioritise somehow

The missing values one looks much easier to review, too, so that one can be merged a lot quicker

itamarst · 2024-05-13T16:32:42Z

Going to split this into two new PRs, so closing.

pythonspeed added 8 commits May 6, 2024 12:27

Tests for expected behavior in generalized ufunc edge cases.

e0b505f

Disallow calling generalized ufuncs if data is missing.

04afbeb

WIP sketch of supporting output size different than input.

782998a

ufunc tests pass again.

ff80578

Bit more testing.

ab4f0a1

Lint fixes.

144192e

Delete copy/paste error.

fe58184

Add Numba as a dependency.

70194b0

github-actions bot added the fix (Bug fix) and python (Related to Python Polars) labels May 6, 2024

itamarst marked this pull request as ready for review May 6, 2024 19:29

itamarst requested review from ritchie46, stinodego, c-peters, alexander-beedie, MarcoGorelli and reswqa as code owners May 6, 2024 19:29

deanm0000 reviewed May 7, 2024

py-polars/polars/series/series.py (outdated, resolved)

deanm0000 reviewed May 7, 2024

py-polars/tests/unit/interop/numpy/test_ufunc_series.py (outdated, resolved)

pythonspeed added 3 commits May 7, 2024 09:19

Switch to ComputeError.

d9e555d

A bit more testing.

3672eac

Reformat.

8ff6106

itamarst requested a review from deanm0000 May 7, 2024 13:52

deanm0000 reviewed May 7, 2024

itamarst closed this May 13, 2024
