Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rolling after sort returns weird and different outputs #16145

Closed
2 tasks done
justin13601 opened this issue May 10, 2024 · 8 comments · Fixed by #16186
Closed
2 tasks done

Rolling after sort returns weird and different outputs #16145

justin13601 opened this issue May 10, 2024 · 8 comments · Fixed by #16186
Assignees
Labels
accepted Ready for implementation bug Something isn't working P-high Priority: high python Related to Python Polars

Comments

@justin13601
Copy link

justin13601 commented May 10, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

from datetime import timedelta, datetime
import polars as pl


df = pl.DataFrame(
    {
        "id": [1, 1, 1, 1, 2, 2],
        "time": [
            # id 1
            datetime(year=1989, month=12, day=1, hour=12, minute=3),
            datetime(year=1989, month=12, day=2, hour=5,  minute=17),
            datetime(year=1989, month=12, day=2, hour=12, minute=3),
            datetime(year=1989, month=12, day=6, hour=11, minute=0),
            # id 2
            datetime(year=1989, month=12, day=1, hour=13, minute=14),
            datetime(year=1989, month=12, day=3, hour=15, minute=17),
        ],
        "A": [1, 0, 1, 0, 0, 0],
        "B": [0, 1, 0, 1, 1, 0],
        "C": [1, 1, 0, 0, 1, 0],
    }
)

df = df.sort(by=["id", "time"])

df.rolling(
            index_column="time",
            group_by="id",
            period=timedelta(days=1),
            offset=timedelta(days=0),
            closed='right',
        ).agg(*[pl.col(c).sum().alias(c) for c in df.columns if c not in ["id", "time"]])

Log output

group_by keys are sorted; running sorted key fast path

Issue description

Running the above snippet returns a strange output and it is different each time the same snippet is run. I noticed that calling sort, regardless if the dataframe is already sorted, causes this issue when running rolling. In the above snippet, I have verified that df is the same before and after sort (via from polars.testing import assert_frame_equal).

Should return:

┌─────┬─────────────────────┬─────┬─────┬─────┐
│ id  ┆ time                ┆ A   ┆ B   ┆ C   │
│ --- ┆ ---                 ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs]        ┆ u16 ┆ u16 ┆ u16 │
╞═════╪═════════════════════╪═════╪═════╪═════╡
│ 1   ┆ 1989-12-01 12:03:00 ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1989-12-02 05:17:00 ┆ 1   ┆ 0   ┆ 0   │
│ 1   ┆ 1989-12-02 12:03:00 ┆ 0   ┆ 0   ┆ 0   │
│ 1   ┆ 1989-12-06 11:00:00 ┆ 0   ┆ 0   ┆ 0   │
│ 2   ┆ 1989-12-01 13:14:00 ┆ 0   ┆ 0   ┆ 0   │
│ 2   ┆ 1989-12-03 15:17:00 ┆ 0   ┆ 0   ┆ 0   │
└─────┴─────────────────────┴─────┴─────┴─────┘

Instead, here are a few examples of returns (almost seems random):

┌─────────────────┬─────────────────────┬──────┬──────┬──────┐
│ id              ┆ time                ┆ A    ┆ B    ┆ C    │
│ ---             ┆ ---                 ┆ ---  ┆ ---  ┆ ---  │
│ i64             ┆ datetime[μs]        ┆ u16  ┆ u16  ┆ u16  │
╞═════════════════╪═════════════════════╪══════╪══════╪══════╡
│ 1               ┆ 1989-12-01 12:03:00 ┆ 1    ┆ 1    ┆ 1    │
│ 1               ┆ 1989-12-02 05:17:00 ┆ 1    ┆ 0    ┆ 0    │
│ 1               ┆ 1989-12-02 12:03:00 ┆ null ┆ null ┆ null │
│ 2               ┆ 1989-12-06 11:00:00 ┆ null ┆ null ┆ null │
│ 2               ┆ 1989-12-01 13:14:00 ┆ null ┆ null ┆ null │
│ 140614846972176 ┆ 1989-12-03 15:17:00 ┆ null ┆ null ┆ null │
└─────────────────┴─────────────────────┴──────┴──────┴──────┘
┌─────┬─────────────────────┬──────┬──────┬──────┐
│ id  ┆ time                ┆ A    ┆ B    ┆ C    │
│ --- ┆ ---                 ┆ ---  ┆ ---  ┆ ---  │
│ i64 ┆ datetime[μs]        ┆ u16  ┆ u16  ┆ u16  │
╞═════╪═════════════════════╪══════╪══════╪══════╡
│ 1   ┆ 1989-12-01 12:03:00 ┆ 1    ┆ 1    ┆ 1    │
│ 1   ┆ 1989-12-02 05:17:00 ┆ 1    ┆ 0    ┆ 0    │
│ 1   ┆ 1989-12-02 12:03:00 ┆ null ┆ null ┆ null │
│ 2   ┆ 1989-12-06 11:00:00 ┆ null ┆ null ┆ null │
│ 2   ┆ 1989-12-01 13:14:00 ┆ null ┆ null ┆ null │
│ 1   ┆ 1989-12-03 15:17:00 ┆ null ┆ null ┆ null │
└─────┴─────────────────────┴──────┴──────┴──────┘
┌─────┬─────────────────────┬──────┬──────┬──────┐
│ id  ┆ time                ┆ A    ┆ B    ┆ C    │
│ --- ┆ ---                 ┆ ---  ┆ ---  ┆ ---  │
│ i64 ┆ datetime[μs]        ┆ u16  ┆ u16  ┆ u16  │
╞═════╪═════════════════════╪══════╪══════╪══════╡
│ 1   ┆ 1989-12-01 12:03:00 ┆ 1    ┆ 1    ┆ 1    │
│ 1   ┆ 1989-12-02 05:17:00 ┆ 1    ┆ 0    ┆ 0    │
│ 1   ┆ 1989-12-02 12:03:00 ┆ null ┆ null ┆ null │
│ 2   ┆ 1989-12-06 11:00:00 ┆ null ┆ null ┆ null │
│ 2   ┆ 1989-12-01 13:14:00 ┆ null ┆ null ┆ null │
│ 4   ┆ 1989-12-03 15:17:00 ┆ null ┆ null ┆ null │
└─────┴─────────────────────┴──────┴──────┴──────┘
┌─────┬─────────────────────┬──────┬──────┬──────┐
│ id  ┆ time                ┆ A    ┆ B    ┆ C    │
│ --- ┆ ---                 ┆ ---  ┆ ---  ┆ ---  │
│ i64 ┆ datetime[μs]        ┆ u16  ┆ u16  ┆ u16  │
╞═════╪═════════════════════╪══════╪══════╪══════╡
│ 1   ┆ 1989-12-01 12:03:00 ┆ 1    ┆ 1    ┆ 1    │
│ 1   ┆ 1989-12-02 05:17:00 ┆ 1    ┆ 0    ┆ 0    │
│ 1   ┆ 1989-12-02 12:03:00 ┆ null ┆ null ┆ null │
│ 2   ┆ 1989-12-06 11:00:00 ┆ null ┆ null ┆ null │
│ 2   ┆ 1989-12-01 13:14:00 ┆ null ┆ null ┆ null │
│ 272 ┆ 1989-12-03 15:17:00 ┆ null ┆ null ┆ null │
└─────┴─────────────────────┴──────┴──────┴──────┘
┌────────────┬─────────────────────┬──────┬──────┬──────┐
│ id         ┆ time                ┆ A    ┆ B    ┆ C    │
│ ---        ┆ ---                 ┆ ---  ┆ ---  ┆ ---  │
│ i64        ┆ datetime[μs]        ┆ u16  ┆ u16  ┆ u16  │
╞════════════╪═════════════════════╪══════╪══════╪══════╡
│ 1          ┆ 1989-12-01 12:03:00 ┆ 1    ┆ 1    ┆ 1    │
│ 1          ┆ 1989-12-02 05:17:00 ┆ 1    ┆ 0    ┆ 0    │
│ 1          ┆ 1989-12-02 12:03:00 ┆ null ┆ null ┆ null │
│ 2          ┆ 1989-12-06 11:00:00 ┆ null ┆ null ┆ null │
│ 2          ┆ 1989-12-01 13:14:00 ┆ null ┆ null ┆ null │
│ 8589934593 ┆ 1989-12-03 15:17:00 ┆ null ┆ null ┆ null │
└────────────┴─────────────────────┴──────┴──────┴──────┘

Expected behavior

Should return:

┌─────┬─────────────────────┬─────┬─────┬─────┐
│ id  ┆ time                ┆ A   ┆ B   ┆ C   │
│ --- ┆ ---                 ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs]        ┆ u16 ┆ u16 ┆ u16 │
╞═════╪═════════════════════╪═════╪═════╪═════╡
│ 1   ┆ 1989-12-01 12:03:00 ┆ 1   ┆ 1   ┆ 1   │
│ 1   ┆ 1989-12-02 05:17:00 ┆ 1   ┆ 0   ┆ 0   │
│ 1   ┆ 1989-12-02 12:03:00 ┆ 0   ┆ 0   ┆ 0   │
│ 1   ┆ 1989-12-06 11:00:00 ┆ 0   ┆ 0   ┆ 0   │
│ 2   ┆ 1989-12-01 13:14:00 ┆ 0   ┆ 0   ┆ 0   │
│ 2   ┆ 1989-12-03 15:17:00 ┆ 0   ┆ 0   ┆ 0   │
└─────┴─────────────────────┴─────┴─────┴─────┘

Removing df = df.sort(by=["id", "time"]) provides the correct return.

Installed versions

--------Version info---------
Polars:               0.20.25
Index type:           UInt32
Platform:             Linux-5.15.0-91-generic-x86_64-with-glibc2.35
Python:               3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2023.10.0
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.4
nest_asyncio:         1.5.8
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              12.0.1
pydantic:             2.5.3
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                2.1.2+cu121
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@justin13601 justin13601 added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels May 10, 2024
@cmdlineluser
Copy link
Contributor

Can reproduce.

from datetime import datetime
import polars as pl

df = pl.DataFrame({
    "id": [1, 2],
    "time": [
        datetime(year=1989, month=12, day=1, hour=12, minute=3),
        datetime(year=1989, month=12, day=1, hour=13, minute=14),
    ]
})

(df.sort("id").rolling(
    index_column="time",
    group_by="id",
    period="1d",
    offset="0d",
    closed='right',
).agg().select("id")
)
shape: (2, 1)
┌─────────────────────┐
│ id                  │
│ ---                 │
│ i64                 │
╞═════════════════════╡
│ 2                   │
│ 3539883390149865530 │
└─────────────────────┘

It seems to be related to offset= - if I use -0d for example, it doesn't happen.

So it appears to be an issue when offset is "positive".

@MarcoGorelli
Copy link
Collaborator

interesting - @cmdlineluser when I use your example I get:

In [1]: from datetime import datetime
   ...: import polars as pl
   ...:
   ...: df = pl.DataFrame({
   ...:     "id": [1, 2],
   ...:     "time": [
   ...:         datetime(year=1989, month=12, day=1, hour=12, minute=3),
   ...:         datetime(year=1989, month=12, day=1, hour=13, minute=14),
   ...:     ]
   ...: })
   ...:
   ...: (df.sort("id").rolling(
   ...:     index_column="time",
   ...:     group_by="id",
   ...:     period="1d",
   ...:     offset="0d",
   ...:     closed='right',
   ...: ).agg().select("id")
   ...: )
Out[1]:
shape: (2, 1)
┌─────┐
│ id  │
│ --- │
│ i64 │
╞═════╡
│ 2   │
│ 0   │
└─────┘

which versions are you using? which OS?

@cmdlineluser
Copy link
Contributor

I'm using 0.20.25 on MacOS @MarcoGorelli

I should have included that it is non-deterministic, sometimes I get 0 (but never the correct answer 1)

@MarcoGorelli MarcoGorelli added P-high Priority: high and removed needs triage Awaiting prioritization by a maintainer labels May 10, 2024
@MarcoGorelli
Copy link
Collaborator

MarcoGorelli commented May 11, 2024

The call stack goes like this:

let groups = self
.0
.group_by_with_series(group_by.clone(), true, true)?
.take_groups();

let groups = if by.len() == 1 {
let series = &by[0];
series.group_tuples(multithreaded, sorted)

#[cfg(feature = "algorithm_group_by")]
fn group_tuples(&self, multithreaded: bool, sorted: bool) -> PolarsResult<GroupsProxy> {
IntoGroupsProxy::group_tuples(&self.0, multithreaded, sorted)
}

// sorted path
if self.is_sorted_ascending_flag() || self.is_sorted_descending_flag() {
// don't have to pass `sorted` arg, GroupSlice is always sorted.
return Ok(GroupsProxy::Slice {
groups: self.rechunk().create_groups_from_sorted(multithreaded),
rolling: false,
});
}

fn create_groups_from_sorted(&self, multithreaded: bool) -> GroupsSlice {
if verbose() {
eprintln!("group_by keys are sorted; running sorted key fast path");
}

@MarcoGorelli
Copy link
Collaborator

pinging @ritchie46 for visibility

@cmdlineluser
Copy link
Contributor

cmdlineluser commented May 11, 2024

It's probably unrelated to the underlying issue @MarcoGorelli - but I noticed this while trying to figure out why "-0d" would not reproduce the behaviour.

code.py
df = pl.DataFrame({"foo": [1] * 3}).sort("foo").with_row_index()

df.rolling(index_column="foo", period="1i", offset="0i").agg("index")
# shape: (3, 2)
# ┌─────┬───────────┐
# │ foo ┆ index     │
# │ --- ┆ ---       │
# │ i64 ┆ list[u32] │
# ╞═════╪═══════════╡
# │ 1   ┆ []        │
# │ 1   ┆ []        │
# │ 1   ┆ []        │
# └─────┴───────────┘

df.rolling(index_column="foo", period="1i", offset="-0i").agg("index")
# shape: (3, 2)
# ┌─────┬───────────┐
# │ foo ┆ index     │
# │ --- ┆ ---       │
# │ i64 ┆ list[u32] │
# ╞═════╪═══════════╡
# │ 1   ┆ [0, 1, 2] │
# │ 1   ┆ [1, 2]    │
# │ 1   ┆ [2]       │
# └─────┴───────────┘

Edit: Separate issue created - apologies for the noise.

@MarcoGorelli
Copy link
Collaborator

thanks - could you open a separate issue about that one please?

@ritchie46
Copy link
Member

Taking a look.

@ritchie46 ritchie46 self-assigned this May 13, 2024
@c-peters c-peters added the accepted Ready for implementation label May 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Ready for implementation bug Something isn't working P-high Priority: high python Related to Python Polars
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

5 participants