generalized ufuncs that return arrays are broken in a variety of cases #14811

itamarst · 2024-03-01T15:16:48Z

This is a follow-up to #14748.

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import numpy as np
import polars as pl
from numba import guvectorize, float64


# ISSUE 1: Missing data silently corrupts the results
@guvectorize([(float64[:], float64[:])], "(m)->(m)")
def mean(arr, result):
    value = arr.mean()
    for i in range(len(arr)):
        result[i] = value


series = pl.Series([2, 2, 2, None], dtype=pl.Float64)
print("== ISSUE 1 ==")
print("Mean should be 2, but we get 1.5:", mean(series))
print()

# ISSUE 2: Output arrays with different size than input array are not handled by Polars:
@guvectorize([(float64[:], float64[:], float64[:])], "(n),(m)->(m)")
def different_output_size(arr, arr2, result):
    total = arr.mean()
    for i in range(len(arr2)):
        result[i] = total


arr = np.array([1, 2, 3], dtype=np.float64)
arr2 = np.arange(0, 1_000_000, dtype=np.float64)

assert len(different_output_size(arr, arr2)) == len(arr2)
assert len(different_output_size(arr2, arr)) == len(arr)

print("== ISSUE 2 ==")
series = pl.Series(arr, dtype=pl.Float64)
series2 = pl.Series(arr2, dtype=pl.Float64)
assert len(different_output_size(series, series2)) == len(series2)
assert len(different_output_size(series2, series)) == len(series)

Log output

== ISSUE 1 ==
Mean should be 2, but we get 1.5: shape: (4,)
Series: '' [f64]
[
        1.5
        1.5
        1.5
        null
]

== ISSUE 2 ==
thread '<unnamed>' panicked at crates/polars-arrow/src/compute/boolean_kleene.rs:105:5:
assertion `left == right` failed: lhs and rhs must have the same length
  left: 3
 right: 1000000
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/home/itamarst/devel/polars/py-polars/../guvectorize2.py", line 36, in <module>
    assert len(different_output_size(series, series2)) == len(series2)
  File "/home/itamarst/devel/polars/.venv/lib/python3.10/site-packages/numba/np/ufunc/gufunc.py", line 172, in __call__
    return self.ufunc(*args, **kwargs)
  File "/home/itamarst/devel/polars/py-polars/polars/series/series.py", line 1434, in __array_ufunc__
    validity_mask &= arg.is_not_null()
  File "/home/itamarst/devel/polars/py-polars/polars/series/series.py", line 642, in __and__
    return self._from_pyseries(self._s.bitand(other._s))
pyo3_runtime.PanicException: assertion `left == right` failed: lhs and rhs must have the same length
  left: 3
 right: 1000000

Issue description

The current ufunc support essentially assumes just ufunc, and can't handle the variety of edge cases generalized ufuncs add:

MISSING-DATA: Missing data can give unexpected results, since processing is of the whole array. This is documented but many people won't notice those docs and just get bad data unknowningly. Even if you do read the docs, it can happen by accident.
DIFFERENT-OUTPUT-SIZE: The current code assumes the output array is the same as all input arrays, which is not a requirement for generalized ufuncs. I suspect there is an edge case where different output size can result in memory corruption...

There are also potentially edge cases in the signature syntax that can be rejected early rather than late, with a more informative error message, e.g. multi-dimensional output maybe.

Expected behavior

This is an open question. Here is what I would expect:

MISSING-DATA: Raise an exception, e.g. ValueError("There is missing data in in your Series, which can result in generalized ufuncs giving wrong results").
DIFFERENT-OUTPUT-SIZE: DTRT and return a Series of the correct size. The reason identical input and output size is assumed right now is (a) because it's always true for normal ufunc and (b) to deal with missing data (c) it's a nice optimization. However, the code could be changed to allow the ufunc to allocate the output array, and then that data copied into a Series.

Note that that both these proposals are for generalized ufunc only. Regular ufunc would continue as normal.

Type of `ufunc`	Behavior with missing data	Who allocates memory
Normal	Accepted, since the `ufunc` runs on individual values anyway	Polars, as an optimization
Generalized	Rejected with exception, for correctness	The `ufunc`

Installed versions

--------Version info---------
Polars:               0.20.13
Index type:           UInt32
Platform:             Linux-6.6.10-76060610-generic-x86_64-with-glibc2.35
Python:               3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]

The text was updated successfully, but these errors were encountered:

MarcoGorelli · 2024-04-04T06:56:09Z

MISSING-DATA: Raise an exception, e.g. ValueError("There is missing data in in your Series, which can result in generalized ufuncs giving wrong results").

sounds reasonable - have you already looked into raising a PR for this?

itamarst · 2024-04-04T12:06:25Z

@MarcoGorelli no, I have two PRs on ufuncs already (docs + sketch of partial fix that requires design feedback) and want to get through those first.

itamarst · 2024-04-04T12:07:39Z

And now I see there's a duplicate PR that got opened in interim :/

itamarst · 2024-04-04T12:08:26Z

I can do this first if you think it's a prerequisite for docs improvements.

cmdlineluser · 2024-05-03T02:26:17Z

I just encountered some odd behaviour which may fall under this category.

import polars as pl
from numba import guvectorize, int64 

@guvectorize([(int64[:], int64[:])], "(n)->(n)")
def gufunc_example(col, result):
    for i in range(len(col)):
        print("[DEBUG]:", col[i])
            result[i] = col[i]

df = pl.DataFrame({"a": [12, 13, 14, 15, 16]}).with_row_index()
df = df.with_columns(pl.when(pl.col("index") < 3).then(pl.all()))

>>> df
shape: (5, 2)
┌───────┬──────┐
│ index ┆ a    │
│ ---   ┆ ---  │
│ u32   ┆ i64  │
╞═══════╪══════╡
│ 0     ┆ 12   │
│ 1     ┆ 13   │
│ 2     ┆ 14   │
│ null  ┆ null │
│ null  ┆ null │
└───────┴──────┘

Inside the ufunc, we seem to have access to the original "state":

>>> out = df.with_columns(b = pl.col("a").map_batches(gufunc_example))
[DEBUG]: 12
[DEBUG]: 13
[DEBUG]: 14
[DEBUG]: 15 # <- ???
[DEBUG]: 16 # <- ???

deanm0000 · 2024-05-03T10:38:02Z

That's a characteristic of how when/then works, unrelated to ufuncs. Anytime you do a when/then, it gives all the data to every then/otherwise and the when condition just chooses which result goes where.

cmdlineluser · 2024-05-03T11:11:06Z

@deanm0000 But the when/then was used in a previous step to modify the dataframe - so the values are gone.

df = df.with_columns(pl.when(pl.col("index") < 3).then(pl.all()))

deanm0000 · 2024-05-03T12:14:37Z

oh whoops I guess I should read more carefully.

Arrow memory stores null state as a single bit separate from the data itself while leaving any arbitrary value in the value space. In this case we started with a good value and made it null so it doesn't do anything with the value itself. Since it's meant to be arbitrary it just changes the null state.

Here's a different example:

df = pl.DataFrame({"a": [12, 13, None, 15, 16]}).with_row_index()
df.with_columns(b = pl.col("a").map_batches(gufunc_example))
[DEBUG]: 12
[DEBUG]: 13
[DEBUG]: 0
[DEBUG]: 15
[DEBUG]: 16

There are warnings in the docs about numpy handling missing data differently and this is a manifestation of that.

cmdlineluser · 2024-05-03T13:01:47Z

Yeah, I think I can now understand the problem - thanks.

So if you start with missing data, you get 0

import polars as pl
from numba import guvectorize, int64 

@guvectorize([(int64[:], int64[:])], "(n)->(n)")
def gufunc_example(col, result):
    for i in range(len(col)):
        result[i] = col[2]

df = pl.DataFrame({"a": [12, 13, None, 15, 16]}).with_row_index()

df.with_columns(b = pl.col("a").map_batches(gufunc_example))
# shape: (5, 3)
# ┌───────┬──────┬──────┐
# │ index ┆ a    ┆ b    │
# │ ---   ┆ ---  ┆ ---  │
# │ u32   ┆ i64  ┆ i64  │
# ╞═══════╪══════╪══════╡
# │ 0     ┆ 12   ┆ 0    │
# │ 1     ┆ 13   ┆ 0    │
# │ 2     ┆ null ┆ null │
# │ 3     ┆ 15   ┆ 0    │
# │ 4     ┆ 16   ┆ 0    │
# └───────┴──────┴──────┘

But if missing data is introduced via "expressions", you still get the original value:

df = pl.DataFrame({"a": [12, 13, 14, 15, 16]}).with_row_index()

df = df.with_columns(pl.when(pl.col("index") != 2).then(pl.all()))

df.with_columns(b = pl.col("a").map_batches(gufunc_example))
# shape: (5, 3)
# ┌───────┬──────┬──────┐
# │ index ┆ a    ┆ b    │
# │ ---   ┆ ---  ┆ ---  │
# │ u32   ┆ i64  ┆ i64  │
# ╞═══════╪══════╪══════╡
# │ 0     ┆ 12   ┆ 14   │
# │ 1     ┆ 13   ┆ 14   │
# │ null  ┆ null ┆ null │
# │ 3     ┆ 15   ┆ 14   │
# │ 4     ┆ 16   ┆ 14   │
# └───────┴──────┴──────┘

itamarst · 2024-05-03T13:33:16Z

I will try to do this soon, since it really is an easy pitfall to hit and blocking an already-written docs PR.

itamarst · 2024-05-06T15:46:25Z

OK, starting work on this.

itamarst · 2024-05-13T12:56:12Z

Per @MarcoGorelli's request, going to restrict this just to the first case, missing data. The other issue is here: #16193

itamarst added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Mar 1, 2024

itamarst mentioned this issue Mar 1, 2024

fix(python): Add support for generalized ufuncs that return a scalar #14786

Draft

deanm0000 assigned deanm0000 and itamarst and unassigned deanm0000 Mar 1, 2024

deanm0000 removed the needs triage Awaiting prioritization by a maintainer label Mar 1, 2024

MarcoGorelli mentioned this issue Apr 4, 2024

docs(python): Expand NumPy section in the user guide with ufunc info #13392

Closed

This was referenced May 6, 2024

fix(python): Handle generalized ufunc edge cases better #16086

Closed

Generalized ufuncs won't work in case where input sizes don't match #16193

Closed

itamarst mentioned this issue May 13, 2024

fix(python): Don't allow passing missing data to generalized ufuncs #16198

Merged

ritchie46 closed this as completed in #16198 May 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

generalized ufuncs that return arrays are broken in a variety of cases #14811

generalized ufuncs that return arrays are broken in a variety of cases #14811

itamarst commented Mar 1, 2024 •

edited

MarcoGorelli commented Apr 4, 2024

itamarst commented Apr 4, 2024

itamarst commented Apr 4, 2024

itamarst commented Apr 4, 2024

cmdlineluser commented May 3, 2024

deanm0000 commented May 3, 2024

cmdlineluser commented May 3, 2024

deanm0000 commented May 3, 2024

cmdlineluser commented May 3, 2024

itamarst commented May 3, 2024

itamarst commented May 6, 2024

itamarst commented May 13, 2024

generalized ufuncs that return arrays are broken in a variety of cases #14811

generalized ufuncs that return arrays are broken in a variety of cases #14811

Comments

itamarst commented Mar 1, 2024 • edited

Checks

Reproducible example

Log output

Issue description

Expected behavior

Installed versions

MarcoGorelli commented Apr 4, 2024

itamarst commented Apr 4, 2024

itamarst commented Apr 4, 2024

itamarst commented Apr 4, 2024

cmdlineluser commented May 3, 2024

deanm0000 commented May 3, 2024

cmdlineluser commented May 3, 2024

deanm0000 commented May 3, 2024

cmdlineluser commented May 3, 2024

itamarst commented May 3, 2024

itamarst commented May 6, 2024

itamarst commented May 13, 2024

itamarst commented Mar 1, 2024 •

edited