Lazily evaluated error expression #15184

Closed
TimonKnigge opened this issue Mar 20, 2024 · 2 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@TimonKnigge
Description

There are often situations where I want to do some computation on a LazyFrame and conditionally throw an error, but I don't want to collect the LazyFrame yet. I'm curious whether we could have some kind of lazily evaluated error expression that only raises when it is actually evaluated.

For example

assert isinstance(df, pl.LazyFrame)
print(df.schema)
# OrderedDict([('x', Int64), ('y', Int64)])

df = df.with_columns(
    pl.when(pl.col('y') != 0)
    .then(pl.col('x') / pl.col('y'))
    .otherwise(pl.raise("Division by zero"))  # proposed expression, does not exist today
)
# (nothing happens yet)

df = df.collect()
# Iff there are any zeros in column 'y':
# ComputeError: encountered error 'Division by zero'

Just to stress: I understand that the division above produces inf rather than an error; it is only an example. The point is that there are other computations I may want to do, with business logic that isn't really captured by the type system.

@TimonKnigge added the enhancement label Mar 20, 2024
@mcrumiller
Contributor

mcrumiller commented Mar 20, 2024

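I like this idea, but unfortunately the when/then architecture doesn't allow for this: when you supply a when/then chain, every branch is computed in parallel over the whole column, and the results are filtered afterwards. A branch that raises would therefore raise even if no row ends up selecting it.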
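To make the point concrete, here is a toy pure-Python model of "compute every branch, then filter". It is only a sketch of the behaviour described above, not Polars' actual implementation; `when_then_otherwise`, `divide` and `always_raise` are hypothetical names invented for the illustration.

```python
from typing import Callable, List

Column = List[float]


def when_then_otherwise(
    mask: List[bool],
    then_branch: Callable[[], Column],
    otherwise_branch: Callable[[], Column],
) -> Column:
    # Both branches are evaluated over the whole column first...
    then_values = then_branch()
    otherwise_values = otherwise_branch()
    # ...and only afterwards is the mask used to pick values row by row.
    return [t if m else o for m, t, o in zip(mask, then_values, otherwise_values)]


x = [6.0, 8.0]
y = [2.0, 4.0]  # no zeros, so the "otherwise" branch is never selected
mask = [value != 0 for value in y]


def divide() -> Column:
    return [a / b for a, b in zip(x, y)]


def always_raise() -> Column:
    # Stand-in for the proposed raising expression.
    raise ValueError("Division by zero")


try:
    when_then_otherwise(mask, divide, always_raise)
    raised = False
except ValueError:
    raised = True

print(raised)
```

In this model the error fires even though every row takes the `then` branch, because the raising branch is evaluated before the mask is applied.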

In some cases you can use Expr.map_batches or Expr.map_elements:

import polars as pl

# business_threshold = 5
business_threshold = 2

def raise_if_too_high(s):
    if (s > business_threshold).any():
        raise ValueError("My business logic doesn't like this.")
    return s

df = pl.DataFrame({"a": [1, 2, 3]})

df.select(
    pl.col("a").map_batches(raise_if_too_high)
)
# polars.exceptions.ComputeError: ValueError: My business logic doesn't like this.

Note that if we set business_threshold to 5 then no error is raised.

@stinodego
Member

Your example won't work due to the way when/then/otherwise works.

The following issue is related to your request; I will close this one in favor of it:
#11064

@stinodego closed this as not planned Mar 21, 2024