Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFrame.pivot does not work with index=None even though function signature implies it is acceptable #11592

Open
2 tasks done
henryharbeck opened this issue Oct 8, 2023 · 7 comments · May be fixed by #15855
Open
2 tasks done
Labels
bug Something isn't working P-low Priority: low python Related to Python Polars

Comments

@henryharbeck
Copy link
Contributor

henryharbeck commented Oct 8, 2023

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

df = pl.DataFrame(
    {
        "bar": ["y", "y", "y", "x"],
        "baz": [1, 2, 3, 4],
    }
)
df.pivot(values="baz", index=None, columns="bar", aggregate_function="sum")

Log output

TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_7501/2184148441.py in ?()
      4         "bar": ["y", "y", "y", "x"],
      5         "baz": [1, 2, 3, 4],
      6     }
      7 )
----> 8 df.pivot(values="baz", index=None, columns="bar", aggregate_function="sum")

~/development/polars_help/venv/lib/python3.11/site-packages/polars/dataframe/frame.py in ?(self, values, index, columns, aggregate_function, maintain_order, sort_columns, separator)
   7024         else:
   7025             aggregate_expr = aggregate_function._pyexpr
   7026 
   7027         return self._from_pydf(
-> 7028             self._df.pivot_expr(
   7029                 values,
   7030                 index,
   7031                 columns,

TypeError: argument 'index': 'NoneType' object cannot be converted to 'PyString'

Issue description

DataFrame.pivot does not work with index=None.

The function signature implies it is acceptable by type hinting None as an option.

However, the docstring says

index - One or multiple keys to group by.

potentially implying that None is not valid as that would be grouping by 0 keys.

Expected behavior

In my opinion, there is no reason index=None should not be valid.
It would just mean that the output of the pivot would always be a single row.

For the example provided, the expected output would be

┌─────┬─────┐
│ y   ┆ x   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 6   ┆ 4   │
└─────┴─────┘

The docstring for the index parameter should also be updated to be clear that passing None is valid - or at least not imply that it is invalid.

Installed versions

--------Version info---------
Polars:              0.19.7
Index type:          UInt32
Platform:            Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Python:              3.11.4 (main, Jun  8 2023, 17:02:11) [GCC 9.4.0]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              <not installed>
gevent:              <not installed>
matplotlib:          <not installed>
numpy:               <not installed>
openpyxl:            <not installed>
pandas:              <not installed>
pyarrow:             <not installed>
pydantic:            <not installed>
pyiceberg:           <not installed>
pyxlsb:              <not installed>
sqlalchemy:          <not installed>
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>
@henryharbeck henryharbeck added bug Something isn't working python Related to Python Polars labels Oct 8, 2023
@deanm0000
Copy link
Collaborator

What you're intending to get back is the transpose of group_by.agg

df.group_by('bar').agg(pl.col('baz')).transpose()

@henryharbeck
Copy link
Contributor Author

@deanm0000, the transpose does not provide header names like pivot does. Obviously can promote the first row as headers afterwards though.

A perhaps simpler workaround for the issue would be to create a literal column (with a single unique value), use that as the index to pivot and then drop it after.

(
    df.with_columns(pl.lit(1))
    .pivot(values="baz", index="literal", columns="bar", aggregate_function="sum")
    .drop("literal")
)

The above produces the expected output.

Workarounds or alternative approaches aside, I created this issue as I believe (and the type hints in the function indicate) that index=None should be valid, but is currently raising an error.

@deanm0000
Copy link
Collaborator

I wasn't trying to discount the request, just trying to help. That said, what I put in earlier was on mobile without looking at how it worked. A more complete version would be

(
    df.group_by('bar', maintain_order=True).sum()
    .select('baz').transpose(column_names=df['bar'].unique())
    )

A way that doesn't use transpose and so might be more efficient...

(
    df
    .group_by('bar', maintain_order=True)
    .agg(pl.col('baz').sum())
    .select(pl.col('baz').implode().list.to_struct().struct.rename_fields(df['bar'].unique()))
    .unnest('baz')
)

@gab23r
Copy link
Contributor

gab23r commented Oct 11, 2023

This bug reminds me of this one: #10075

@henryharbeck
Copy link
Contributor Author

Hey @MarcoGorelli, as you look to have been in the world of pivot lately, thought you may appreciate a heads up on this one.
I have checked again on 0.20.7, and this still produces the same error.

And on main, there is still the discrepancy for index between the type hint (None as an option) and docstring ("One or multiple keys to group by.", implying that None is not okay.).

@MarcoGorelli MarcoGorelli added P-low Priority: low and removed needs triage Awaiting prioritization by a maintainer labels Feb 12, 2024
@ohines ohines linked a pull request Apr 24, 2024 that will close this issue
@MarcoGorelli
Copy link
Collaborator

MarcoGorelli commented May 3, 2024

Hey

Just looking at this again, and I'm not sure about adding complexity to pivot, it's already pretty complicated

The desired functionality can be achieved with

df.group_by('bar').agg(pl.sum('baz')).transpose(column_names='bar')

shape: (1, 2)
┌─────┬─────┐
│ xy   │
│ ------ │
│ i64i64 │
╞═════╪═════╡
│ 46   │
└─────┴─────┘

which is a bit expensive, but then again so is pivot. And it's arguably clearer than

df.pivot(values="baz", index=None, columns="bar", aggregate_function="sum")

?

Something else I feel uneasy about is this:

  • values=None means "use all remaining columns if index and columns have already been specified"

So, for consistency, I'd expect index=None to mean "use all remaining columns if columns and values have been specified". I have some recollection that this is what @mcrumiller was going for at some point, rather than the 1-row variant suggested here

Either that, or to remove | None from the type hint

@ohines
Copy link

ohines commented May 4, 2024

@MarcoGorelli
It seems there are two points:

  • Can the user specify that no index cols should be used and a single output row be produced e.g. index=[].
  • should index=None correspond to index=[] or 'all remaining columns'.

I think that the second point is distinct from the first. Perhaps we could allow e.g.

df = pl.DataFrame(
    {
        "foo": ["A", "B", "C"],
        "N": [1, 2, 3],
        "M": [4, 5, 6],
    }
)
df.pivot(index=[], columns="foo", values=None, aggregate_function=None)

shape: (1, 6)
┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ N_foo_AN_foo_BN_foo_CM_foo_AM_foo_BM_foo_C │
│ ------------------     │
│ i64i64i64i64i64i64     │
╞═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ 123456       │
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

Note that this behaviour is already implemented in #15855
so that PR would just have to be modified to remove the None becomes [] logic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working P-low Priority: low python Related to Python Polars
Projects
Status: Ready
Development

Successfully merging a pull request may close this issue.

6 participants