Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

generalized ufuncs that return arrays are broken in a variety of cases #14811

Closed
2 tasks done
itamarst opened this issue Mar 1, 2024 · 12 comments · Fixed by #16198
Closed
2 tasks done

generalized ufuncs that return arrays are broken in a variety of cases #14811

itamarst opened this issue Mar 1, 2024 · 12 comments · Fixed by #16198
Assignees
Labels
bug Something isn't working python Related to Python Polars

Comments

@itamarst
Copy link
Contributor

itamarst commented Mar 1, 2024

This is a follow-up to #14748.

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import numpy as np
import polars as pl
from numba import guvectorize, float64


# ISSUE 1: Missing data silently corrupts the results
@guvectorize([(float64[:], float64[:])], "(m)->(m)")
def mean(arr, result):
    value = arr.mean()
    for i in range(len(arr)):
        result[i] = value


series = pl.Series([2, 2, 2, None], dtype=pl.Float64)
print("== ISSUE 1 ==")
print("Mean should be 2, but we get 1.5:", mean(series))
print()

# ISSUE 2: Output arrays with different size than input array are not handled by Polars:
@guvectorize([(float64[:], float64[:], float64[:])], "(n),(m)->(m)")
def different_output_size(arr, arr2, result):
    total = arr.mean()
    for i in range(len(arr2)):
        result[i] = total


arr = np.array([1, 2, 3], dtype=np.float64)
arr2 = np.arange(0, 1_000_000, dtype=np.float64)

assert len(different_output_size(arr, arr2)) == len(arr2)
assert len(different_output_size(arr2, arr)) == len(arr)

print("== ISSUE 2 ==")
series = pl.Series(arr, dtype=pl.Float64)
series2 = pl.Series(arr2, dtype=pl.Float64)
assert len(different_output_size(series, series2)) == len(series2)
assert len(different_output_size(series2, series)) == len(series)

Log output

== ISSUE 1 ==
Mean should be 2, but we get 1.5: shape: (4,)
Series: '' [f64]
[
        1.5
        1.5
        1.5
        null
]

== ISSUE 2 ==
thread '<unnamed>' panicked at crates/polars-arrow/src/compute/boolean_kleene.rs:105:5:
assertion `left == right` failed: lhs and rhs must have the same length
  left: 3
 right: 1000000
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/home/itamarst/devel/polars/py-polars/../guvectorize2.py", line 36, in <module>
    assert len(different_output_size(series, series2)) == len(series2)
  File "/home/itamarst/devel/polars/.venv/lib/python3.10/site-packages/numba/np/ufunc/gufunc.py", line 172, in __call__
    return self.ufunc(*args, **kwargs)
  File "/home/itamarst/devel/polars/py-polars/polars/series/series.py", line 1434, in __array_ufunc__
    validity_mask &= arg.is_not_null()
  File "/home/itamarst/devel/polars/py-polars/polars/series/series.py", line 642, in __and__
    return self._from_pyseries(self._s.bitand(other._s))
pyo3_runtime.PanicException: assertion `left == right` failed: lhs and rhs must have the same length
  left: 3
 right: 1000000

Issue description

The current ufunc support essentially assumes just ufunc, and can't handle the variety of edge cases generalized ufuncs add:

  1. MISSING-DATA: Missing data can give unexpected results, since processing is of the whole array. This is documented but many people won't notice those docs and just get bad data unknowningly. Even if you do read the docs, it can happen by accident.
  2. DIFFERENT-OUTPUT-SIZE: The current code assumes the output array is the same as all input arrays, which is not a requirement for generalized ufuncs. I suspect there is an edge case where different output size can result in memory corruption...

There are also potentially edge cases in the signature syntax that can be rejected early rather than late, with a more informative error message, e.g. multi-dimensional output maybe.

Expected behavior

This is an open question. Here is what I would expect:

  1. MISSING-DATA: Raise an exception, e.g. ValueError("There is missing data in in your Series, which can result in generalized ufuncs giving wrong results").
  2. DIFFERENT-OUTPUT-SIZE: DTRT and return a Series of the correct size. The reason identical input and output size is assumed right now is (a) because it's always true for normal ufunc and (b) to deal with missing data (c) it's a nice optimization. However, the code could be changed to allow the ufunc to allocate the output array, and then that data copied into a Series.

Note that that both these proposals are for generalized ufunc only. Regular ufunc would continue as normal.

Type of ufunc Behavior with missing data Who allocates memory
Normal Accepted, since the ufunc runs on individual values anyway Polars, as an optimization
Generalized Rejected with exception, for correctness The ufunc

Installed versions

--------Version info---------
Polars:               0.20.13
Index type:           UInt32
Platform:             Linux-6.6.10-76060610-generic-x86_64-with-glibc2.35
Python:               3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
@itamarst itamarst added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Mar 1, 2024
@deanm0000 deanm0000 assigned deanm0000 and itamarst and unassigned deanm0000 Mar 1, 2024
@deanm0000 deanm0000 removed the needs triage Awaiting prioritization by a maintainer label Mar 1, 2024
@MarcoGorelli
Copy link
Collaborator

MISSING-DATA: Raise an exception, e.g. ValueError("There is missing data in in your Series, which can result in generalized ufuncs giving wrong results").

sounds reasonable - have you already looked into raising a PR for this?

@itamarst
Copy link
Contributor Author

itamarst commented Apr 4, 2024

@MarcoGorelli no, I have two PRs on ufuncs already (docs + sketch of partial fix that requires design feedback) and want to get through those first.

@itamarst
Copy link
Contributor Author

itamarst commented Apr 4, 2024

And now I see there's a duplicate PR that got opened in interim :/

@itamarst
Copy link
Contributor Author

itamarst commented Apr 4, 2024

I can do this first if you think it's a prerequisite for docs improvements.

@cmdlineluser
Copy link
Contributor

I just encountered some odd behaviour which may fall under this category.

import polars as pl
from numba import guvectorize, int64 

@guvectorize([(int64[:], int64[:])], "(n)->(n)")
def gufunc_example(col, result):
    for i in range(len(col)):
        print("[DEBUG]:", col[i])
            result[i] = col[i]
df = pl.DataFrame({"a": [12, 13, 14, 15, 16]}).with_row_index()
df = df.with_columns(pl.when(pl.col("index") < 3).then(pl.all()))
>>> df
shape: (5, 2)
┌───────┬──────┐
│ indexa    │
│ ------  │
│ u32i64  │
╞═══════╪══════╡
│ 012   │
│ 113   │
│ 214   │
│ nullnull │
│ nullnull │
└───────┴──────┘

Inside the ufunc, we seem to have access to the original "state":

>>> out = df.with_columns(b = pl.col("a").map_batches(gufunc_example))
[DEBUG]: 12
[DEBUG]: 13
[DEBUG]: 14
[DEBUG]: 15 # <- ???
[DEBUG]: 16 # <- ???

@deanm0000
Copy link
Collaborator

That's a characteristic of how when/then works, unrelated to ufuncs. Anytime you do a when/then, it gives all the data to every then/otherwise and the when condition just chooses which result goes where.

@cmdlineluser
Copy link
Contributor

@deanm0000 But the when/then was used in a previous step to modify the dataframe - so the values are gone.

df = df.with_columns(pl.when(pl.col("index") < 3).then(pl.all()))

@deanm0000
Copy link
Collaborator

oh whoops I guess I should read more carefully.

Arrow memory stores null state as a single bit separate from the data itself while leaving any arbitrary value in the value space. In this case we started with a good value and made it null so it doesn't do anything with the value itself. Since it's meant to be arbitrary it just changes the null state.

Here's a different example:

df = pl.DataFrame({"a": [12, 13, None, 15, 16]}).with_row_index()
df.with_columns(b = pl.col("a").map_batches(gufunc_example))
[DEBUG]: 12
[DEBUG]: 13
[DEBUG]: 0
[DEBUG]: 15
[DEBUG]: 16

There are warnings in the docs about numpy handling missing data differently and this is a manifestation of that.

@cmdlineluser
Copy link
Contributor

Yeah, I think I can now understand the problem - thanks.

So if you start with missing data, you get 0

import polars as pl
from numba import guvectorize, int64 

@guvectorize([(int64[:], int64[:])], "(n)->(n)")
def gufunc_example(col, result):
    for i in range(len(col)):
        result[i] = col[2]

df = pl.DataFrame({"a": [12, 13, None, 15, 16]}).with_row_index()

df.with_columns(b = pl.col("a").map_batches(gufunc_example))
# shape: (5, 3)
# ┌───────┬──────┬──────┐
# │ index ┆ a    ┆ b    │
# │ ---   ┆ ---  ┆ ---  │
# │ u32   ┆ i64  ┆ i64  │
# ╞═══════╪══════╪══════╡
# │ 0     ┆ 12   ┆ 0    │
# │ 1     ┆ 13   ┆ 0    │
# │ 2     ┆ null ┆ null │
# │ 3     ┆ 15   ┆ 0    │
# │ 4     ┆ 16   ┆ 0    │
# └───────┴──────┴──────┘

But if missing data is introduced via "expressions", you still get the original value:

df = pl.DataFrame({"a": [12, 13, 14, 15, 16]}).with_row_index()

df = df.with_columns(pl.when(pl.col("index") != 2).then(pl.all()))

df.with_columns(b = pl.col("a").map_batches(gufunc_example))
# shape: (5, 3)
# ┌───────┬──────┬──────┐
# │ index ┆ a    ┆ b    │
# │ ---   ┆ ---  ┆ ---  │
# │ u32   ┆ i64  ┆ i64  │
# ╞═══════╪══════╪══════╡
# │ 0     ┆ 12   ┆ 14   │
# │ 1     ┆ 13   ┆ 14   │
# │ null  ┆ null ┆ null │
# │ 3     ┆ 15   ┆ 14   │
# │ 4     ┆ 16   ┆ 14   │
# └───────┴──────┴──────┘

@itamarst
Copy link
Contributor Author

itamarst commented May 3, 2024

I will try to do this soon, since it really is an easy pitfall to hit and blocking an already-written docs PR.

@itamarst
Copy link
Contributor Author

itamarst commented May 6, 2024

OK, starting work on this.

@itamarst
Copy link
Contributor Author

Per @MarcoGorelli's request, going to restrict this just to the first case, missing data. The other issue is here: #16193

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working python Related to Python Polars
Projects
None yet
4 participants