group_by_dynamic offset ignored #16162

Open
2 tasks done
stout-yeoman opened this issue May 10, 2024 · 2 comments
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@stout-yeoman

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

    import datetime

    import polars as pl

    df = pl.DataFrame(
        {
            "x": [
                datetime.datetime(2023, 1, 1, 0, 0, 0),
                datetime.datetime(2023, 1, 1, 0, 1, 0),
                datetime.datetime(2023, 1, 1, 0, 5, 0),
                datetime.datetime(2023, 1, 1, 0, 10, 0),
                datetime.datetime(2023, 1, 1, 0, 11, 0),
            ],
            "y": [1, 2, 3, 4, 5],
        }
    ).sort("x")

    agg_df = df.group_by_dynamic(
        index_column="x",
        every="5m",
        period="10m",
        offset="5m",
        include_boundaries=True,
    ).agg(
        pl.col("y").max().alias("y_max"),
        pl.col("y").min().alias("y_min"),
    )
    print(agg_df)

Log output

shape: (3, 5)
┌─────────────────────┬─────────────────────┬─────────────────────┬───────┬───────┐
│ _lower_boundary     ┆ _upper_boundary     ┆ x                   ┆ y_max ┆ y_min │
│ ---                 ┆ ---                 ┆ ---                 ┆ ---   ┆ ---   │
│ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ i64   ┆ i64   │
╞═════════════════════╪═════════════════════╪═════════════════════╪═══════╪═══════╡
│ 2023-01-01 00:00:00 ┆ 2023-01-01 00:10:00 ┆ 2023-01-01 00:00:00 ┆ 3     ┆ 1     │
│ 2023-01-01 00:05:00 ┆ 2023-01-01 00:15:00 ┆ 2023-01-01 00:05:00 ┆ 5     ┆ 3     │
│ 2023-01-01 00:10:00 ┆ 2023-01-01 00:20:00 ┆ 2023-01-01 00:10:00 ┆ 5     ┆ 4     │
└─────────────────────┴─────────────────────┴─────────────────────┴───────┴───────┘

Issue description

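It appears that the offset argument of DataFrame.group_by_dynamic is ignored: in the output above, the first window still starts at 2023-01-01 00:00:00 even though offset="5m" was passed.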

Expected behavior

Given that the documentation states that offset is set to negative every, I would expect an offset of "0m" to cover the period [2023-01-01 00:05:00, 2023-01-01 00:15:00), and an offset of "5m" (as in the example above) to cover the period [2023-01-01 00:10:00, 2023-01-01 00:20:00).
I would expect this to be reflected in both the boundaries and the aggregated values.

Installed versions

--------Version info---------
Polars:               0.20.25
Index type:           UInt32
Platform:             macOS-13.6.6-x86_64-i386-64bit
Python:               3.12.2 (main, Feb  6 2024, 20:19:44) [Clang 15.0.0 (clang-1500.1.0.2.5)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               0.9.2
matplotlib:           3.8.4
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              15.0.2
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@stout-yeoman stout-yeoman added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels May 10, 2024
@MarcoGorelli
Collaborator

MarcoGorelli commented May 10, 2024

hey - I think this looks correct, or at least, it respects the docs

"The resulting window is then shifted back until the earliest datapoint is in or in front of it."

However, a few questions do come up about the group-by-dynamic logic, so it may need revisiting
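
To make the quoted rule concrete, here is a minimal plain-Python sketch of one reading of it: truncate the earliest timestamp with every, add offset, then step the window back by every until the earliest datapoint is no longer before it. This is not Polars' implementation. The helper names (truncate, window_bounds, aggregate) are hypothetical, and the stepping-back loop is an assumed interpretation of "in or in front of". Under that assumption, the sketch yields the same boundaries and aggregates as the log output in the report.

```python
import datetime as dt

EPOCH = dt.datetime(1970, 1, 1)


def truncate(ts, every):
    """Floor ts to a multiple of `every`, counted from the Unix epoch."""
    return EPOCH + ((ts - EPOCH) // every) * every


def window_bounds(xs, every, period, offset):
    """Return (lower, upper) pairs under the assumed reading of the docs."""
    earliest, latest = min(xs), max(xs)
    # Step 1: truncate the earliest timestamp with `every`, then add `offset`.
    lower = truncate(earliest, every) + offset
    # Step 2 (assumption): shift back by `every` until the earliest
    # datapoint is no longer before the window start.
    while lower > earliest:
        lower -= every
    # Step 3: one window per `every` step, up to the latest datapoint.
    bounds = []
    while lower <= latest:
        bounds.append((lower, lower + period))
        lower += every
    return bounds


def aggregate(xs, ys, bounds):
    """Max and min of y per left-closed window [lower, upper); skip empty windows."""
    rows = []
    for lower, upper in bounds:
        members = [y for x, y in zip(xs, ys) if lower <= x < upper]
        if members:
            rows.append((lower, upper, max(members), min(members)))
    return rows


# Same data as in the reproducible example.
xs = [dt.datetime(2023, 1, 1, 0, m) for m in (0, 1, 5, 10, 11)]
ys = [1, 2, 3, 4, 5]

bounds = window_bounds(
    xs,
    every=dt.timedelta(minutes=5),
    period=dt.timedelta(minutes=10),
    offset=dt.timedelta(minutes=5),
)
rows = aggregate(xs, ys, bounds)
for row in rows:
    print(row)
```

With offset="5m" the first candidate window starts at 00:05, which is after the earliest datapoint at 00:00, so the loop in step 2 moves it back to 00:00. That would explain why the offset seems to have no visible effect in this example.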

@stout-yeoman
Author

I interpreted the sentence you quote as belonging to the "a day of the week" bullet point of start_by (given the indentation). I don't think it's completely clear from the documentation that offset only affects the start_by of the window. I would have assumed that "offset of the window" pertains to the offset of each window and not just the starting point of the resulting data frame. This would also be more in line with the offset for DataFrame's rolling method, which does just this:

    import datetime

    import polars as pl

    df = pl.DataFrame(
        {
            "x": [
                datetime.datetime(2023, 1, 1, 0, 0, 0),
                datetime.datetime(2023, 1, 1, 0, 1, 0),
                datetime.datetime(2023, 1, 1, 0, 5, 0),
                datetime.datetime(2023, 1, 1, 0, 10, 0),
                datetime.datetime(2023, 1, 1, 0, 11, 0),
            ],
            "y": [1, 2, 3, 4, 5],
        }
    ).sort("x")

    agg_df = df.rolling(
        index_column="x", period="5m", offset="5m", closed="left"
    ).agg(pl.col("y").max().alias("y_max"), pl.col("y").min().alias("y_min"))
    print(agg_df)

Output:


shape: (5, 3)
┌─────────────────────┬───────┬───────┐
│ x                   ┆ y_max ┆ y_min │
│ ---                 ┆ ---   ┆ ---   │
│ datetime[μs]        ┆ i64   ┆ i64   │
╞═════════════════════╪═══════╪═══════╡
│ 2023-01-01 00:00:00 ┆ 3     ┆ 3     │
│ 2023-01-01 00:01:00 ┆ 4     ┆ 4     │
│ 2023-01-01 00:05:00 ┆ 5     ┆ 4     │
│ 2023-01-01 00:10:00 ┆ null  ┆ null  │
│ 2023-01-01 00:11:00 ┆ null  ┆ null  │
└─────────────────────┴───────┴───────┘
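
The rolling output above is consistent with each row at time t aggregating over the left-closed window [t + offset, t + offset + period). A minimal plain-Python sketch of that reading follows. It is inferred from the printed output, not taken from Polars' implementation, and the helper name rolling_rows is hypothetical.

```python
import datetime as dt


def rolling_rows(xs, ys, period, offset):
    """For each row at time t, aggregate y over [t + offset, t + offset + period)."""
    rows = []
    for t in xs:
        lower = t + offset
        upper = lower + period
        members = [y for x, y in zip(xs, ys) if lower <= x < upper]
        # Empty windows yield None, mirroring the nulls in the output above.
        y_max = max(members) if members else None
        y_min = min(members) if members else None
        rows.append((t, y_max, y_min))
    return rows


# Same data as in the example above.
xs = [dt.datetime(2023, 1, 1, 0, m) for m in (0, 1, 5, 10, 11)]
ys = [1, 2, 3, 4, 5]

rolling = rolling_rows(
    xs, ys, period=dt.timedelta(minutes=5), offset=dt.timedelta(minutes=5)
)
for row in rolling:
    print(row)
```

Here the offset moves every per-row window, which is the behavior the comment above expected from group_by_dynamic as well.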
