Regression in latest polars, query OOMs #16118

Open
2 tasks done
knl opened this issue May 8, 2024 · 1 comment
Labels
bug (Something isn't working), incomplete (Incomplete issue: needs MWE), python (Related to Python Polars)

Comments

knl commented May 8, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Unfortunately, I can't provide a minimal example, as I deal with large amounts of data I can't share, and this problem is only visible in such cases.

Log output

No response

Issue description

After upgrading to Polars 0.20.24, my large query started getting killed by the OOM killer on a 2 TB machine. Previously, the query ran fine, consuming less than 200 GB. The runs look like this:

The query looks like:

import polars as pl

h_types = (
    pl.from_dicts([
        {'st': 'FOPT',       'tt': 'RETAIL',        'h_type': 'direct'},
        {'st': 'FOPT',       'tt': 'LAST',          'h_type': 'direct2'},
        {'st': 'FIND',    'tt': 'LAST',          'h_type': 'indirect'},
        {'st': 'FIND',    'tt': 'EX', 'h_type': 'external'},
        {'st': 'FIND_M', 'tt': 'EX', 'h_type': 'multi_external'},
    ])
    .with_columns(
        pl.col('st').cast(pl.Categorical),
        pl.col('tt').cast(pl.Categorical),
    )
    .lazy()
)

def assign_h_type(df: pl.LazyFrame) -> pl.LazyFrame:
    return df.join(
        h_types,
        on=['st', 'tt'],
        how='left'
    ).with_columns(pl.col('h_type').fill_null('passive').cast(pl.Categorical))


data = (
    read_parquet(f'/parquet/rawdata/orders/date={date}/', partitioning='hive')
    .filter(
        pl.col('d').str.starts_with('omm')&
        (pl.col('class_') == 'Created')&
        (pl.col('o_num').is_not_null())
    )
    .select(
        pl.col('o_num').cast(pl.UInt64),
        pl.col('id_'),
        pl.col('d').cast(pl.Categorical),
        pl.col('app_').cast(pl.Categorical),
    )
    .unique()
    .sort('o_num')
    .with_columns(
        (pl.col('o_num') // 1e14 - 9).alias('sid').cast(pl.UInt8),
    )
    .join(
        read_parquet(f'/parquet/rawdata/entry/date={date}/', partitioning='hive')
        .filter(
            pl.col('d').str.starts_with('omm')&
            pl.col('blocked').is_null()&
            pl.col('tsn').is_not_null()
            #pl.col('app_').str.starts_with('h_')
        )
        .select(
            pl.col('oid_').alias('id_'),
            pl.col('d').cast(pl.Categorical),
            pl.col('app_').cast(pl.Categorical),
            pl.col('tsn').cast(pl.UInt64),
            pl.col('st').cast(pl.Categorical),
            pl.col('tt').cast(pl.Categorical),
            pl.col('success_'),
            # Raw string so that `\w` reaches the regex engine instead of being treated as a Python escape.
            pl.col('portf').str.strip_suffix('_XK').str.extract(r"^(\w*ZZZ(_BM)?)", 1).fill_null('EO').cast(pl.Categorical),
        )
        .pipe(assign_h_type)
        .drop('st', 'tt'),
        on=['id_', 'd', 'app_'],
        how='inner',
    )
    .select(
        'sid',
        'tsn',
        'o_num',
        'h_type',
        'success_',
        'portf',
    )
    .sort('sid', 'tsn')
    .cache()
)

opps = (
    data
    .filter(pl.col('h_type') == 'direct')
    .select(
        'sid',
        'tsn',
        'portf',
    )
    .unique()
    .sort('sid', 'tsn')
    .collect(streaming=True)
)

If, per #15795, I put the collect before the .filter in opps, I get an OOM even on 0.20.16.

Expected behavior

I would expect recent versions to finish without an OOM.

Installed versions

Replace this line with the output of pl.show_versions(). Leave the backticks in place.
knl added the bug, needs triage, and python labels on May 8, 2024
@ritchie46
Member

Is there a minimal example that shows a memory increase? I do need something with synthetic data to be able to understand what happens. It doesn't have to OOM, just be a similar query.

ritchie46 added the incomplete label and removed the needs triage label on May 14, 2024