Rolling after sort returns weird and different outputs #16145

justin13601 · 2024-05-10T06:24:37Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

from datetime import timedelta, datetime
import polars as pl


df = pl.DataFrame(
    {
        "id": [1, 1, 1, 1, 2, 2],
        "time": [
            # id 1
            datetime(year=1989, month=12, day=1, hour=12, minute=3),
            datetime(year=1989, month=12, day=2, hour=5,  minute=17),
            datetime(year=1989, month=12, day=2, hour=12, minute=3),
            datetime(year=1989, month=12, day=6, hour=11, minute=0),
            # id 2
            datetime(year=1989, month=12, day=1, hour=13, minute=14),
            datetime(year=1989, month=12, day=3, hour=15, minute=17),
        ],
        "A": [1, 0, 1, 0, 0, 0],
        "B": [0, 1, 0, 1, 1, 0],
        "C": [1, 1, 0, 0, 1, 0],
    }
)

df = df.sort(by=["id", "time"])

df.rolling(
            index_column="time",
            group_by="id",
            period=timedelta(days=1),
            offset=timedelta(days=0),
            closed='right',
        ).agg(*[pl.col(c).sum().alias(c) for c in df.columns if c not in ["id", "time"]])

Log output

group_by keys are sorted; running sorted key fast path

Issue description

Running the above snippet returns a strange output and it is different each time the same snippet is run. I noticed that calling sort, regardless if the dataframe is already sorted, causes this issue when running rolling. In the above snippet, I have verified that df is the same before and after sort (via from polars.testing import assert_frame_equal).

Should return:

┌─────┬─────────────────────┬─────┬─────┬─────┐
│ id  ┆ time                ┆ A   ┆ B   ┆ C   │
│ --- ┆ ---                 ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs]        ┆ u16 ┆ u16 ┆ u16 │
╞═════╪═════════════════════╪═════╪═════╪═════╡
│ 1   ┆ 1989-12-01 12:03:00 ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1989-12-02 05:17:00 ┆ 1   ┆ 0   ┆ 0   │
│ 1   ┆ 1989-12-02 12:03:00 ┆ 0   ┆ 0   ┆ 0   │
│ 1   ┆ 1989-12-06 11:00:00 ┆ 0   ┆ 0   ┆ 0   │
│ 2   ┆ 1989-12-01 13:14:00 ┆ 0   ┆ 0   ┆ 0   │
│ 2   ┆ 1989-12-03 15:17:00 ┆ 0   ┆ 0   ┆ 0   │
└─────┴─────────────────────┴─────┴─────┴─────┘

Instead, here are a few examples of returns (almost seems random):

┌─────────────────┬─────────────────────┬──────┬──────┬──────┐
│ id              ┆ time                ┆ A    ┆ B    ┆ C    │
│ ---             ┆ ---                 ┆ ---  ┆ ---  ┆ ---  │
│ i64             ┆ datetime[μs]        ┆ u16  ┆ u16  ┆ u16  │
╞═════════════════╪═════════════════════╪══════╪══════╪══════╡
│ 1               ┆ 1989-12-01 12:03:00 ┆ 1    ┆ 1    ┆ 1    │
│ 1               ┆ 1989-12-02 05:17:00 ┆ 1    ┆ 0    ┆ 0    │
│ 1               ┆ 1989-12-02 12:03:00 ┆ null ┆ null ┆ null │
│ 2               ┆ 1989-12-06 11:00:00 ┆ null ┆ null ┆ null │
│ 2               ┆ 1989-12-01 13:14:00 ┆ null ┆ null ┆ null │
│ 140614846972176 ┆ 1989-12-03 15:17:00 ┆ null ┆ null ┆ null │
└─────────────────┴─────────────────────┴──────┴──────┴──────┘

┌─────┬─────────────────────┬──────┬──────┬──────┐
│ id  ┆ time                ┆ A    ┆ B    ┆ C    │
│ --- ┆ ---                 ┆ ---  ┆ ---  ┆ ---  │
│ i64 ┆ datetime[μs]        ┆ u16  ┆ u16  ┆ u16  │
╞═════╪═════════════════════╪══════╪══════╪══════╡
│ 1   ┆ 1989-12-01 12:03:00 ┆ 1    ┆ 1    ┆ 1    │
│ 1   ┆ 1989-12-02 05:17:00 ┆ 1    ┆ 0    ┆ 0    │
│ 1   ┆ 1989-12-02 12:03:00 ┆ null ┆ null ┆ null │
│ 2   ┆ 1989-12-06 11:00:00 ┆ null ┆ null ┆ null │
│ 2   ┆ 1989-12-01 13:14:00 ┆ null ┆ null ┆ null │
│ 1   ┆ 1989-12-03 15:17:00 ┆ null ┆ null ┆ null │
└─────┴─────────────────────┴──────┴──────┴──────┘

┌─────┬─────────────────────┬──────┬──────┬──────┐
│ id  ┆ time                ┆ A    ┆ B    ┆ C    │
│ --- ┆ ---                 ┆ ---  ┆ ---  ┆ ---  │
│ i64 ┆ datetime[μs]        ┆ u16  ┆ u16  ┆ u16  │
╞═════╪═════════════════════╪══════╪══════╪══════╡
│ 1   ┆ 1989-12-01 12:03:00 ┆ 1    ┆ 1    ┆ 1    │
│ 1   ┆ 1989-12-02 05:17:00 ┆ 1    ┆ 0    ┆ 0    │
│ 1   ┆ 1989-12-02 12:03:00 ┆ null ┆ null ┆ null │
│ 2   ┆ 1989-12-06 11:00:00 ┆ null ┆ null ┆ null │
│ 2   ┆ 1989-12-01 13:14:00 ┆ null ┆ null ┆ null │
│ 4   ┆ 1989-12-03 15:17:00 ┆ null ┆ null ┆ null │
└─────┴─────────────────────┴──────┴──────┴──────┘

┌─────┬─────────────────────┬──────┬──────┬──────┐
│ id  ┆ time                ┆ A    ┆ B    ┆ C    │
│ --- ┆ ---                 ┆ ---  ┆ ---  ┆ ---  │
│ i64 ┆ datetime[μs]        ┆ u16  ┆ u16  ┆ u16  │
╞═════╪═════════════════════╪══════╪══════╪══════╡
│ 1   ┆ 1989-12-01 12:03:00 ┆ 1    ┆ 1    ┆ 1    │
│ 1   ┆ 1989-12-02 05:17:00 ┆ 1    ┆ 0    ┆ 0    │
│ 1   ┆ 1989-12-02 12:03:00 ┆ null ┆ null ┆ null │
│ 2   ┆ 1989-12-06 11:00:00 ┆ null ┆ null ┆ null │
│ 2   ┆ 1989-12-01 13:14:00 ┆ null ┆ null ┆ null │
│ 272 ┆ 1989-12-03 15:17:00 ┆ null ┆ null ┆ null │
└─────┴─────────────────────┴──────┴──────┴──────┘

┌────────────┬─────────────────────┬──────┬──────┬──────┐
│ id         ┆ time                ┆ A    ┆ B    ┆ C    │
│ ---        ┆ ---                 ┆ ---  ┆ ---  ┆ ---  │
│ i64        ┆ datetime[μs]        ┆ u16  ┆ u16  ┆ u16  │
╞════════════╪═════════════════════╪══════╪══════╪══════╡
│ 1          ┆ 1989-12-01 12:03:00 ┆ 1    ┆ 1    ┆ 1    │
│ 1          ┆ 1989-12-02 05:17:00 ┆ 1    ┆ 0    ┆ 0    │
│ 1          ┆ 1989-12-02 12:03:00 ┆ null ┆ null ┆ null │
│ 2          ┆ 1989-12-06 11:00:00 ┆ null ┆ null ┆ null │
│ 2          ┆ 1989-12-01 13:14:00 ┆ null ┆ null ┆ null │
│ 8589934593 ┆ 1989-12-03 15:17:00 ┆ null ┆ null ┆ null │
└────────────┴─────────────────────┴──────┴──────┴──────┘

Expected behavior

Should return:

┌─────┬─────────────────────┬─────┬─────┬─────┐
│ id  ┆ time                ┆ A   ┆ B   ┆ C   │
│ --- ┆ ---                 ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs]        ┆ u16 ┆ u16 ┆ u16 │
╞═════╪═════════════════════╪═════╪═════╪═════╡
│ 1   ┆ 1989-12-01 12:03:00 ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1989-12-02 05:17:00 ┆ 1   ┆ 0   ┆ 0   │
│ 1   ┆ 1989-12-02 12:03:00 ┆ 0   ┆ 0   ┆ 0   │
│ 1   ┆ 1989-12-06 11:00:00 ┆ 0   ┆ 0   ┆ 0   │
│ 2   ┆ 1989-12-01 13:14:00 ┆ 0   ┆ 0   ┆ 0   │
│ 2   ┆ 1989-12-03 15:17:00 ┆ 0   ┆ 0   ┆ 0   │
└─────┴─────────────────────┴─────┴─────┴─────┘

Removing df = df.sort(by=["id", "time"]) provides the correct return.

Installed versions

--------Version info---------
Polars:               0.20.25
Index type:           UInt32
Platform:             Linux-5.15.0-91-generic-x86_64-with-glibc2.35
Python:               3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2023.10.0
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.4
nest_asyncio:         1.5.8
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              12.0.1
pydantic:             2.5.3
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                2.1.2+cu121
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

The text was updated successfully, but these errors were encountered:

cmdlineluser · 2024-05-10T13:55:28Z

Can reproduce.

from datetime import datetime
import polars as pl

df = pl.DataFrame({
    "id": [1, 2],
    "time": [
        datetime(year=1989, month=12, day=1, hour=12, minute=3),
        datetime(year=1989, month=12, day=1, hour=13, minute=14),
    ]
})

(df.sort("id").rolling(
    index_column="time",
    group_by="id",
    period="1d",
    offset="0d",
    closed='right',
).agg().select("id")
)

shape: (2, 1)
┌─────────────────────┐
│ id                  │
│ ---                 │
│ i64                 │
╞═════════════════════╡
│ 2                   │
│ 3539883390149865530 │
└─────────────────────┘

It seems to be related to offset= - if I use -0d for example, it doesn't happen.

So it appears to be an issue when offset is "positive".

MarcoGorelli · 2024-05-10T14:25:23Z

interesting - @cmdlineluser when I use your example I get:

In [1]: from datetime import datetime
   ...: import polars as pl
   ...:
   ...: df = pl.DataFrame({
   ...:     "id": [1, 2],
   ...:     "time": [
   ...:         datetime(year=1989, month=12, day=1, hour=12, minute=3),
   ...:         datetime(year=1989, month=12, day=1, hour=13, minute=14),
   ...:     ]
   ...: })
   ...:
   ...: (df.sort("id").rolling(
   ...:     index_column="time",
   ...:     group_by="id",
   ...:     period="1d",
   ...:     offset="0d",
   ...:     closed='right',
   ...: ).agg().select("id")
   ...: )
Out[1]:
shape: (2, 1)
┌─────┐
│ id  │
│ --- │
│ i64 │
╞═════╡
│ 2   │
│ 0   │
└─────┘

which versions are you using? which OS?

cmdlineluser · 2024-05-10T14:33:17Z

I'm using 0.20.25 on MacOS @MarcoGorelli

I should have included that it is non-deterministic, sometimes I get 0 (but never the correct answer 1)

MarcoGorelli · 2024-05-11T08:11:22Z

The call stack goes like this:

polars/crates/polars-time/src/group_by/dynamic.rs

Lines 555 to 558 in d11da5e

    
           let groups = self 
        
               .0 
        
               .group_by_with_series(group_by.clone(), true, true)? 
        
               .take_groups();

polars/crates/polars-core/src/frame/group_by/mod.rs

Lines 55 to 57 in f992a7a

    
           let groups = if by.len() == 1 { 
        
               let series = &by[0]; 
        
               series.group_tuples(multithreaded, sorted)

polars/crates/polars-core/src/series/implementations/mod.rs

Lines 187 to 190 in 23791bd

    
           #[cfg(feature = "algorithm_group_by")] 
        
           fn group_tuples(&self, multithreaded: bool, sorted: bool) -> PolarsResult<GroupsProxy> { 
        
               IntoGroupsProxy::group_tuples(&self.0, multithreaded, sorted) 
        
           }

polars/crates/polars-core/src/frame/group_by/into_groups.rs

Lines 140 to 148 in 19f0939

    
           // sorted path 
        
           if self.is_sorted_ascending_flag() || self.is_sorted_descending_flag() { 
        
               // don't have to pass `sorted` arg, GroupSlice is always sorted. 
        
               return Ok(GroupsProxy::Slice { 
        
                   groups: self.rechunk().create_groups_from_sorted(multithreaded), 
        
                   rolling: false, 
        
               }); 
        
           }

polars/crates/polars-core/src/frame/group_by/into_groups.rs

Lines 59 to 62 in 19f0939

    
           fn create_groups_from_sorted(&self, multithreaded: bool) -> GroupsSlice { 
        
               if verbose() { 
        
                   eprintln!("group_by keys are sorted; running sorted key fast path"); 
        
               }

MarcoGorelli · 2024-05-11T08:28:20Z

pinging @ritchie46 for visibility

cmdlineluser · 2024-05-11T10:41:14Z

It's probably unrelated to the underlying issue @MarcoGorelli - but I noticed this while trying to figure out why "-0d" would not reproduce the behaviour.

code.py

df = pl.DataFrame({"foo": [1] * 3}).sort("foo").with_row_index()

df.rolling(index_column="foo", period="1i", offset="0i").agg("index")
# shape: (3, 2)
# ┌─────┬───────────┐
# │ foo ┆ index     │
# │ --- ┆ ---       │
# │ i64 ┆ list[u32] │
# ╞═════╪═══════════╡
# │ 1   ┆ []        │
# │ 1   ┆ []        │
# │ 1   ┆ []        │
# └─────┴───────────┘

df.rolling(index_column="foo", period="1i", offset="-0i").agg("index")
# shape: (3, 2)
# ┌─────┬───────────┐
# │ foo ┆ index     │
# │ --- ┆ ---       │
# │ i64 ┆ list[u32] │
# ╞═════╪═══════════╡
# │ 1   ┆ [0, 1, 2] │
# │ 1   ┆ [1, 2]    │
# │ 1   ┆ [2]       │
# └─────┴───────────┘

Edit: Separate issue created - apologies for the noise.

MarcoGorelli · 2024-05-11T10:57:51Z

thanks - could you open a separate issue about that one please?

ritchie46 · 2024-05-13T09:54:49Z

Taking a look.

justin13601 added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels May 10, 2024

MarcoGorelli added P-high Priority: high and removed needs triage Awaiting prioritization by a maintainer labels May 10, 2024

ritchie46 self-assigned this May 13, 2024

ritchie46 mentioned this issue May 13, 2024

fix: Fix rolling empty group OOB #16186

Merged

ritchie46 closed this as completed in #16186 May 13, 2024

c-peters added the accepted Ready for implementation label May 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rolling after sort returns weird and different outputs #16145

Rolling after sort returns weird and different outputs #16145

justin13601 commented May 10, 2024 •

edited

cmdlineluser commented May 10, 2024

MarcoGorelli commented May 10, 2024

cmdlineluser commented May 10, 2024

MarcoGorelli commented May 11, 2024 •

edited

MarcoGorelli commented May 11, 2024

cmdlineluser commented May 11, 2024 •

edited

MarcoGorelli commented May 11, 2024

ritchie46 commented May 13, 2024

Rolling after sort returns weird and different outputs #16145

Rolling after sort returns weird and different outputs #16145

Comments

justin13601 commented May 10, 2024 • edited

Checks

Reproducible example

Log output

Issue description

Expected behavior

Installed versions

cmdlineluser commented May 10, 2024

MarcoGorelli commented May 10, 2024

cmdlineluser commented May 10, 2024

MarcoGorelli commented May 11, 2024 • edited

MarcoGorelli commented May 11, 2024

cmdlineluser commented May 11, 2024 • edited

MarcoGorelli commented May 11, 2024

ritchie46 commented May 13, 2024

justin13601 commented May 10, 2024 •

edited

MarcoGorelli commented May 11, 2024 •

edited

cmdlineluser commented May 11, 2024 •

edited