`DataFrame.pivot` does not work with `index=None` even though function signature implies it is acceptable #11592

henryharbeck · 2023-10-08T12:31:21Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.DataFrame(
    {
        "bar": ["y", "y", "y", "x"],
        "baz": [1, 2, 3, 4],
    }
)
df.pivot(values="baz", index=None, columns="bar", aggregate_function="sum")

Log output

TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_7501/2184148441.py in ?()
      4         "bar": ["y", "y", "y", "x"],
      5         "baz": [1, 2, 3, 4],
      6     }
      7 )
----> 8 df.pivot(values="baz", index=None, columns="bar", aggregate_function="sum")

~/development/polars_help/venv/lib/python3.11/site-packages/polars/dataframe/frame.py in ?(self, values, index, columns, aggregate_function, maintain_order, sort_columns, separator)
   7024         else:
   7025             aggregate_expr = aggregate_function._pyexpr
   7026 
   7027         return self._from_pydf(
-> 7028             self._df.pivot_expr(
   7029                 values,
   7030                 index,
   7031                 columns,

TypeError: argument 'index': 'NoneType' object cannot be converted to 'PyString'

Issue description

DataFrame.pivot does not work with index=None.

The function signature implies it is acceptable by type hinting None as an option.

However, the docstring says

index - One or multiple keys to group by.

potentially implying that None is not valid as that would be grouping by 0 keys.

Expected behavior

In my opinion, there is no reason index=None should not be valid.
It would just mean that the output of the pivot would always be a single row.

For the example provided, the expected output would be

┌─────┬─────┐
│ y   ┆ x   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 6   ┆ 4   │
└─────┴─────┘

The docstring for the index parameter should also be updated to be clear that passing None is valid - or at least not imply that it is invalid.
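
The single-row semantics described above can be modelled in plain Python. This is only an illustrative sketch of the requested behaviour, not Polars' implementation; the function name `pivot_without_index` and the list-of-dicts input are assumptions made for the example.

```python
def pivot_without_index(rows, columns, values):
    """Model a pivot with zero index keys and a "sum" aggregation.

    With no index keys every input row falls into the same group, so the
    result is a single output row with one entry per distinct `columns` key.
    """
    totals = {}
    for row in rows:
        key = row[columns]
        # Dicts keep insertion order, so keys appear in order of first occurrence.
        totals[key] = totals.get(key, 0) + row[values]
    return totals


# Same data as the reproducible example, as plain dicts (hypothetical input format).
rows = [
    {"bar": "y", "baz": 1},
    {"bar": "y", "baz": 2},
    {"bar": "y", "baz": 3},
    {"bar": "x", "baz": 4},
]
single_row = pivot_without_index(rows, columns="bar", values="baz")
print(single_row)  # {'y': 6, 'x': 4}
```

The one resulting row matches the expected table: `y` sums to 6 and `x` to 4.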

Installed versions

--------Version info---------
Polars:              0.19.7
Index type:          UInt32
Platform:            Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Python:              3.11.4 (main, Jun  8 2023, 17:02:11) [GCC 9.4.0]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              <not installed>
gevent:              <not installed>
matplotlib:          <not installed>
numpy:               <not installed>
openpyxl:            <not installed>
pandas:              <not installed>
pyarrow:             <not installed>
pydantic:            <not installed>
pyiceberg:           <not installed>
pyxlsb:              <not installed>
sqlalchemy:          <not installed>
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>


deanm0000 · 2023-10-08T16:30:36Z

What you're intending to get back is the transpose of group_by.agg

df.group_by('bar').agg(pl.col('baz')).transpose()

henryharbeck · 2023-10-09T02:31:38Z

@deanm0000, transpose does not provide header names like pivot does, although the first row can of course be promoted to headers afterwards.

A perhaps simpler workaround would be to create a literal column (with a single unique value), use it as the pivot index, and then drop it afterwards.

(
    df.with_columns(pl.lit(1))
    .pivot(values="baz", index="literal", columns="bar", aggregate_function="sum")
    .drop("literal")
)

The above produces the expected output.

Workarounds and alternative approaches aside, I created this issue because I believe (and the type hints in the function indicate) that index=None should be valid, but it currently raises an error.

deanm0000 · 2023-10-09T15:30:57Z

I wasn't trying to discount the request, just trying to help. That said, what I put in earlier was on mobile without looking at how it worked. A more complete version would be

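(
    df.group_by('bar', maintain_order=True).sum()
    # unique() does not guarantee order by default; keep first-appearance
    # order so the names line up with the maintain_order=True groups
    .select('baz').transpose(column_names=df['bar'].unique(maintain_order=True))
)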

A way that doesn't use transpose and so might be more efficient...

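(
    df
    .group_by('bar', maintain_order=True)
    .agg(pl.col('baz').sum())
    # unique(maintain_order=True) keeps the field names aligned with the group order
    .select(pl.col('baz').implode().list.to_struct().struct.rename_fields(df['bar'].unique(maintain_order=True)))
    .unnest('baz')
)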

gab23r · 2023-10-11T12:00:02Z

This bug reminds me of this one: #10075

henryharbeck · 2024-02-12T21:40:49Z

Hey @MarcoGorelli, as you look to have been in the world of pivot lately, thought you may appreciate a heads up on this one.
I have checked again on 0.20.7, and this still produces the same error.

And on main, there is still the discrepancy for index between the type hint (None as an option) and docstring ("One or multiple keys to group by.", implying that None is not okay.).

MarcoGorelli · 2024-05-03T20:40:08Z

Hey

Just looking at this again, and I'm not sure about adding complexity to pivot; it's already pretty complicated.

The desired functionality can be achieved with

df.group_by('bar').agg(pl.sum('baz')).transpose(column_names='bar')

shape: (1, 2)
┌─────┬─────┐
│ x   ┆ y   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 4   ┆ 6   │
└─────┴─────┘

which is a bit expensive, but then again so is pivot. And it's arguably clearer than

df.pivot(values="baz", index=None, columns="bar", aggregate_function="sum")

?

Something else I feel uneasy about is this:

values=None means "use all remaining columns if index and columns have already been specified"

So, for consistency, I'd expect index=None to mean "use all remaining columns if columns and values have been specified". I have some recollection that this is what @mcrumiller was going for at some point, rather than the 1-row variant suggested here

Either that, or to remove | None from the type hint
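
To make the two candidate meanings of index=None concrete, here is a small plain-Python sketch. The helper `resolve_index` and its `none_means` switch are hypothetical names invented for illustration; nothing here reflects how Polars actually resolves its arguments.

```python
def _as_list(arg):
    """Normalise None, a single column name, or a sequence of names to a list."""
    if arg is None:
        return []
    return [arg] if isinstance(arg, str) else list(arg)


def resolve_index(frame_columns, index, columns, values, none_means="remaining"):
    """Hypothetical helper: decide which columns act as the pivot index.

    none_means="remaining": index=None takes every column not used elsewhere.
    none_means="empty":     index=None means no index keys (single output row).
    An explicit index, including [], is always taken as given.
    """
    if index is not None:
        return _as_list(index)
    if none_means == "empty":
        return []
    used = set(_as_list(columns)) | set(_as_list(values))
    return [c for c in frame_columns if c not in used]


# With a spare column, the two interpretations disagree.
wide = ["foo", "N", "M"]
remaining = resolve_index(wide, None, "foo", "N", none_means="remaining")
empty = resolve_index(wide, None, "foo", "N", none_means="empty")
print(remaining, empty)  # ['M'] []

# For the frame in this issue no column is left over, so both give [].
narrow = ["bar", "baz"]
same = resolve_index(narrow, None, "bar", "baz", none_means="remaining")
print(same)  # []
```

Under these assumptions the two readings only diverge when the frame has columns beyond those named in columns and values, which is not the case in the original example.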

ohines · 2024-05-04T09:01:18Z

@MarcoGorelli
It seems there are two points:

1. Can the user specify that no index columns should be used, so that a single output row is produced, e.g. index=[]?
2. Should index=None correspond to index=[] or to 'all remaining columns'?

I think that the second point is distinct from the first. Perhaps we could allow e.g.

df = pl.DataFrame(
    {
        "foo": ["A", "B", "C"],
        "N": [1, 2, 3],
        "M": [4, 5, 6],
    }
)
df.pivot(index=[], columns="foo", values=None, aggregate_function=None)

shape: (1, 6)
┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ N_foo_A ┆ N_foo_B ┆ N_foo_C ┆ M_foo_A ┆ M_foo_B ┆ M_foo_C │
│ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     │
│ i64     ┆ i64     ┆ i64     ┆ i64     ┆ i64     ┆ i64     │
╞═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ 1       ┆ 2       ┆ 3       ┆ 4       ┆ 5       ┆ 6       │
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

Note that this behaviour is already implemented in #15855, so that PR would just have to be modified to remove the logic that turns None into [].

henryharbeck added the bug and python labels · Oct 8, 2023

deanm0000 mentioned this issue · Oct 25, 2023: Make melt (unpivot) consistent with pivot #11974 (Open)

stinodego added the needs triage label · Jan 13, 2024

adamgreg mentioned this issue · Feb 9, 2024: DataFrame.pivot() with empty list of index columns #3372 (Closed)

MarcoGorelli added the P-low label and removed the needs triage label · Feb 12, 2024

ohines linked a pull request that will close this issue · Apr 24, 2024: fix: Allow index=None in pivot() #15855 (Open)
