Group by column name may conflict with aggregation columns, even if renamed #16170

kevinli1993 · 2024-05-11T18:05:25Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import numpy as np
import polars as pl

np.random.seed(0)
ds = pl.DataFrame(dict(
    A = np.random.randn(1000),
    B = np.random.randn(1000),
    C = np.random.randn(1000),
))


print(ds
 .group_by(pl.col("A").round(0))
 .agg(pl.col("A", "B", "C").mean().name.prefix("mean_"))
)


print(ds
 .group_by(pl.col("A").round(0))
 .agg(pl.col("A").mean().alias("mean_A"), pl.col("B", "C").mean().name.prefix("mean_"))
)

Log output

$ POLARS_VERBOSE=1 python3 repro.py 1>/dev/null
keys/aggregates are not partitionable: running default HASH AGGREGATION
keys/aggregates are not partitionable: running default HASH AGGREGATION

Issue description

The repro prints the following two results:

shape: (7, 3)
┌──────┬───────────┬───────────┐
│ A    ┆ mean_B    ┆ mean_C    │
│ ---  ┆ ---       ┆ ---       │
│ f64  ┆ f64       ┆ f64       │
╞══════╪═══════════╪═══════════╡
│ 2.0  ┆ 0.012071  ┆ -0.359696 │
│ -3.0 ┆ 0.365608  ┆ 0.287775  │
│ 1.0  ┆ -0.102876 ┆ -0.096788 │
│ -1.0 ┆ 0.04637   ┆ -0.108115 │
│ 3.0  ┆ -0.738978 ┆ 0.034889  │
│ 0.0  ┆ 0.045118  ┆ 0.0589    │
│ -2.0 ┆ 0.058728  ┆ -0.15119  │
└──────┴───────────┴───────────┘

shape: (7, 4)
┌──────┬───────────┬───────────┬───────────┐
│ A    ┆ mean_A    ┆ mean_B    ┆ mean_C    │
│ ---  ┆ ---       ┆ ---       ┆ ---       │
│ f64  ┆ f64       ┆ f64       ┆ f64       │
╞══════╪═══════════╪═══════════╪═══════════╡
│ 0.0  ┆ 0.000783  ┆ 0.045118  ┆ 0.0589    │
│ -3.0 ┆ -2.711414 ┆ 0.365608  ┆ 0.287775  │
│ 1.0  ┆ 0.939236  ┆ -0.102876 ┆ -0.096788 │
│ 2.0  ┆ 1.906479  ┆ 0.012071  ┆ -0.359696 │
│ 3.0  ┆ 2.683335  ┆ -0.738978 ┆ 0.034889  │
│ -1.0 ┆ -0.916675 ┆ 0.04637   ┆ -0.108115 │
│ -2.0 ┆ -1.814999 ┆ 0.058728  ┆ -0.15119  │
└──────┴───────────┴───────────┴───────────┘

I would have expected the two results to be the same. The confusion seems to stem from the group-by expression itself being named A. However, that does not explain why using A separately in the agg call works (the second case), while using A together with B and C does not (the first case).

(It may have something to do with how .name.prefix() works, e.g. it activates too late, but that's just a guess.)

Expected behavior

Both calls should return the same DataFrame, with columns A, mean_A, mean_B, and mean_C.

Installed versions

--------Version info---------
Polars:               0.20.25
Index type:           UInt32
Platform:             macOS-14.4.1-arm64-arm-64bit
Python:               3.12.3 (main, Apr 12 2024, 17:16:04) [Clang 15.0.0 (clang-1500.1.0.2.5)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               0.10.0
matplotlib:           <not installed>
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              16.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>


cmdlineluser · 2024-05-14T09:31:54Z

Can reproduce.

It seems in this case, the "A" aggregation doesn't actually happen at all?

import polars as pl

df = pl.DataFrame({
    "A": [4, 4, 3],
    "B": [1, 2, 3],
    "C": [5, 6, 7]
})

df.group_by("A").agg(
    pl.col("B", "C").len().name.prefix("len_")
)

# shape: (2, 3)
# ┌─────┬───────┬───────┐
# │ A   ┆ len_B ┆ len_C │
# │ --- ┆ ---   ┆ ---   │
# │ i64 ┆ u32   ┆ u32   │
# ╞═════╪═══════╪═══════╡
# │ 4   ┆ 2     ┆ 2     │
# │ 3   ┆ 1     ┆ 1     │
# └─────┴───────┴───────┘

df.group_by("A").agg(
    pl.col("A", "C").len().name.prefix("len_")
)

# shape: (2, 2)
# ┌─────┬───────┐
# │ A   ┆ len_C │
# │ --- ┆ ---   │
# │ i64 ┆ u32   │
# ╞═════╪═══════╡
# │ 4   ┆ 2     │
# │ 3   ┆ 1     │
# └─────┴───────┘

.explain() shows the "A" aggregation has silently disappeared:

AGGREGATE
   [col("C").count().alias("len_C")] 
BY [col("A")] 
FROM
DF ["A", "B", "C"]; PROJECT 2/3 COLUMNS; SELECTION: "None"

Sorry to ping @ritchie46 - but it seems this one could do with a label / bump.

kevinli1993 · 2024-05-14T23:26:47Z

I can confirm this bug/behavior still persists in the latest version of polars (0.20.26).

Details

$ python3 -c 'import polars; polars.show_versions()'
--------Version info---------
Polars:               0.20.26
Index type:           UInt32
Platform:             macOS-14.4.1-arm64-arm-64bit
Python:               3.12.3 (main, Apr 12 2024, 17:16:04) [Clang 15.0.0 (clang-1500.1.0.2.5)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              <not installed>
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

ritchie46 · 2024-05-19T07:44:01Z

Taking a look.

kevinli1993 · 2024-05-20T14:11:30Z

It looks great, thanks @ritchie46!

kevinli1993 added the bug, needs triage, and python labels May 11, 2024

ritchie46 self-assigned this May 19, 2024

ritchie46 mentioned this issue May 19, 2024

fix: Don't exclude explicitly named columns in group-by context' expr expansion #16318

Merged

ritchie46 closed this as completed in #16318 May 19, 2024

c-peters added the accepted label May 21, 2024
