Allow multiple lambdas in Groupby.aggregate #26905

TomAugspurger · 2019-06-17T16:47:42Z

codecov · 2019-06-17T17:28:52Z

Codecov Report

Merging #26905 into master will decrease coverage by 0.03%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26905      +/-   ##
==========================================
- Coverage   92.04%      92%   -0.04%     
==========================================
  Files         180      180              
  Lines       50714    50783      +69     
==========================================
+ Hits        46679    46724      +45     
- Misses       4035     4059      +24

Flag	Coverage Δ
#multiple	`90.64% <100%> (-0.03%)`	⬇️
#single	`41.83% <16.12%> (-0.12%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/groupby/generic.py	`89.72% <100%> (+0.38%)`	⬆️
pandas/io/gbq.py	`88.88% <0%> (-11.12%)`	⬇️
pandas/core/internals/managers.py	`96.09% <0%> (-0.77%)`	⬇️
pandas/core/internals/blocks.py	`94.61% <0%> (-0.53%)`	⬇️
pandas/core/panel.py	`17.61% <0%> (-0.2%)`	⬇️
pandas/core/frame.py	`96.89% <0%> (-0.12%)`	⬇️
pandas/util/testing.py	`90.94% <0%> (+0.1%)`	⬆️
pandas/core/indexing.py	`93.48% <0%> (+0.18%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e955515...be712d8. Read the comment docs.

zertrin · 2019-06-20T14:29:12Z

Am I correct to understand that this also allows using Keyword aggregation with multiple lambdas as a way to avoid the automatic name mangling if one does wish so?

Something like:

df.groupby('A').agg(
    foo=('B', lambda x: x.max() - x.min()),
    bar=('B', lambda x: x.median() - x.mean()),
)

Which results in a dataframe with 2 columns foo and bar.

TomAugspurger · 2019-06-20T14:31:25Z

Right

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"A": [1, 2]})

In [3]: df.groupby([1, 1]).agg(foo=('A', lambda x: x.max()), bar=("A", lambda x: x.min()))
Out[3]:
   foo  bar
1    2    2

On master, that raises with the SpecificationError about multiple lambdas.

zertrin · 2019-06-20T14:37:37Z

Great news! This issue is one of the main reasons I opened issue #18366 at the first place.

Might I suggest including an example in the documentation as a way to skip automatic name mangling and control the output column names?

Then that would be really great, because users can discover it as a replacement of the already working old behaviour which will disappear.

jreback

looks good. some doc and a question about the output format.

doc/source/whatsnew/v0.25.0.rst

pandas/core/groupby/generic.py

pandas/tests/groupby/aggregate/test_aggregate.py

…mbdas

TomAugspurger · 2019-06-24T16:01:32Z

Updated

Changed mangling for [lambda x: 0, lambda x: 1] to have the names [<lambda_0>, <lambda_1>] rather than [<lambda>, <lambda_1>]. Note that .agg([lambda x: 0]) is still just [<lambda>]
Added a short whatsnew note
Added tests for NamedAgg

zertrin · 2019-06-24T17:07:46Z

1. Changed mangling for `[lambda x: 0, lambda x: 1]` to have the names `[<lambda_0>, <lambda_1>]` rather than `[<lambda>, <lambda_1>]`. Note that `.agg([lambda x: 0])` is still just `[<lambda>]`

This made me suddenly remember a comment from someone in #18366 saying:

there is something deeply queer about mixing the Python's function name-space (something to do with the particular implementation) with the data the column names (something that should surely not know about the implementation). The fact that we are seeing columns (potentially multiple columns) named <lambda> is causes me severe cognitive dissonance.

which lead me to the idea: maybe* when using lambda aggfuncs, using named aggregation could be made mandatory?

This is probably not an acceptable idea now that we are that late in the iteration (I'm fine with the current pull request BTW), but I just wanted to ask your opinion on this (*: maybe for a future deprecation?)

WillAyd · 2019-06-25T01:06:29Z

pandas/core/groupby/generic.py

@@ -1710,3 +1715,97 @@ def _normalize_keyword_aggregation(kwargs):
        order.append((column,
                      com.get_callable_name(aggfunc) or aggfunc))
    return aggspec, columns, order
+
+
+def _make_lambda(


Do we really need this? Seems like this could be done a lot more succinctly with a partial, i.e.:

f = functools.partial(func) f.__name__ = "<lambda_{}>".format(i)

Could be done directly in loop so one less function and has the added benefit of maintaining other function attributes

Do you find it strange to use functools.partial when you aren't partially applying any arguments? It's not where my mind initially went.

Yea I can see where that is strange (effectively using partial to copy the function) but personally find it to be less code and more readable than this wrapper.

I find the currently construction more informative (e.g. it has a doc-string and typing)

pandas/core/groupby/generic.py

* use functools.partial * remove dead or

TomAugspurger

Thank @WillAyd.

TomAugspurger · 2019-06-25T11:31:57Z

pandas/core/groupby/generic.py

@@ -1710,3 +1715,97 @@ def _normalize_keyword_aggregation(kwargs):
        order.append((column,
                      com.get_callable_name(aggfunc) or aggfunc))
    return aggspec, columns, order
+
+
+def _make_lambda(


Do you find it strange to use functools.partial when you aren't partially applying any arguments? It's not where my mind initially went.

pandas/core/groupby/generic.py

jreback

lgtm, minor comments.

doc/source/whatsnew/v0.25.0.rst

pandas/core/groupby/generic.py

jreback · 2019-06-25T12:29:42Z

pandas/core/groupby/generic.py

@@ -1710,3 +1715,97 @@ def _normalize_keyword_aggregation(kwargs):
        order.append((column,
                      com.get_callable_name(aggfunc) or aggfunc))
    return aggspec, columns, order
+
+
+def _make_lambda(


I find the currently construction more informative (e.g. it has a doc-string and typing)

pandas/core/groupby/generic.py

* remove type * remove dead code * remove release note

jreback

small comments on types. otherwise lgtm. merge on green.

jreback · 2019-06-26T12:20:04Z

pandas/core/groupby/generic.py

+#     -> typing.Sequence[Callable[..., ScalarResult]]:
+
+def _managle_lambda_list(aggfuncs):
+    """


can you add Parameters & types

pandas/core/groupby/generic.py

…mbdas

jreback · 2019-06-27T03:24:26Z

pandas/core/groupby/generic.py

@@ -47,6 +48,10 @@
 NamedAgg = namedtuple("NamedAgg", ["column", "aggfunc"])
 # TODO(typing) the return value on this callable should be any *scalar*.
 AggScalar = Union[str, Callable[..., Any]]
+# TODO: validate types on ScalarResult and move to _typing


can you make an issue for this and point to this code after we merge

jreback · 2019-06-27T03:26:18Z

pandas/core/groupby/generic.py

@@ -208,6 +213,8 @@ def aggregate(self, func, *args, **kwargs):
            raise TypeError("Must provide 'func' or tuples of "
                            "'(column, aggfunc).")

+        func = _maybe_mangle_lambdas(func)


hmm I don't think you actually need to do this here, rather put it around https://github.com/pandas-dev/pandas/pull/26905/files#diff-bfee1ba9e7cb79839776fac1a57ed940L810 and pull out the change you have in https://github.com/pandas-dev/pandas/pull/26905/files#diff-bfee1ba9e7cb79839776fac1a57ed940L832

IIUC, the one on L810 is SeriesGroupBy.aggregate. I think it's entirely separate from NDFramGroupBy.aggregate.

pandas/core/groupby/generic.py

…mbdas

jreback · 2019-06-27T20:47:03Z

hmm, can you merge master again, have a conflict now.

…mbdas

jreback · 2019-06-27T23:20:27Z

thanks @TomAugspurger very nice!

badge · 2019-07-22T13:21:44Z

Right

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"A": [1, 2]})

In [3]: df.groupby([1, 1]).agg(foo=('A', lambda x: x.max()), bar=("A", lambda x: x.min()))
Out[3]:
   foo  bar
1    2    2

On master, that raises with the SpecificationError about multiple lambdas.

I am able to run the SeriesGroupBy example in the What’s new in 0.25.0 notes, but Tom's example with DataFrameGroupBy here still fails on Pandas 0.25.0 with KeyError: "None of [MultiIndex([('A', '<lambda>'),\n ('A', '<lambda>')],\n )] are in the [columns]".

I assume this isn't expected behaviour? Shall I create a new issue for this?

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.0
numpy : 1.16.4
pytz : 2019.1
dateutil : 2.8.0
pip : 19.1.1
setuptools : 41.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : 2.0.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.2 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.5.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.0
sqlalchemy : 1.3.3
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None

TomAugspurger · 2019-07-22T13:36:16Z

@badge a new issue would be good.

Can you note that "in DataFrameGroupby.aggregate, order needs to be mangled too"?

ENH

d877ae9

TomAugspurger added the Groupby label Jun 17, 2019

TomAugspurger added this to the 0.25.0 milestone Jun 17, 2019

TomAugspurger mentioned this pull request Jun 20, 2019

Allow Keyword Aggregation in DataFrame.agg and Series.agg #26513

Closed

jreback requested changes Jun 21, 2019

View reviewed changes

TomAugspurger added 2 commits June 24, 2019 10:17

Merge remote-tracking branch 'upstream/master' into 26430-multiple-la…

41b6096

…mbdas

updates

54c36a1

fix assert

9c2bcf2

WillAyd requested changes Jun 25, 2019

View reviewed changes

updates

d549046

* use functools.partial * remove dead or

TomAugspurger commented Jun 25, 2019

View reviewed changes

jreback requested changes Jun 25, 2019

View reviewed changes

TomAugspurger added 2 commits June 25, 2019 14:03

fixups

3ece0a5

* remove type * remove dead code * remove release note

fixup types

dde7610

jreback approved these changes Jun 26, 2019

View reviewed changes

TomAugspurger added 2 commits June 26, 2019 08:48

Merge remote-tracking branch 'upstream/master' into 26430-multiple-la…

b2947bd

…mbdas

doc

6581abc

jreback reviewed Jun 27, 2019

View reviewed changes

jreback requested changes Jun 27, 2019

View reviewed changes

TomAugspurger added 2 commits June 27, 2019 09:03

Merge remote-tracking branch 'upstream/master' into 26430-multiple-la…

868cc60

…mbdas

types

be712d8

Merge remote-tracking branch 'upstream/master' into 26430-multiple-la…

5aa538d

…mbdas

jreback approved these changes Jun 27, 2019

View reviewed changes

jreback merged commit 87d7cdf into pandas-dev:master Jun 27, 2019

TomAugspurger deleted the 26430-multiple-lambdas branch July 22, 2019 13:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow multiple lambdas in Groupby.aggregate #26905

Allow multiple lambdas in Groupby.aggregate #26905

TomAugspurger commented Jun 17, 2019

codecov bot commented Jun 17, 2019 •

edited

zertrin commented Jun 20, 2019

TomAugspurger commented Jun 20, 2019

zertrin commented Jun 20, 2019

jreback left a comment

TomAugspurger commented Jun 24, 2019

zertrin commented Jun 24, 2019

WillAyd Jun 25, 2019

TomAugspurger Jun 25, 2019 •

edited

WillAyd Jun 25, 2019

jreback Jun 25, 2019

TomAugspurger left a comment

TomAugspurger Jun 25, 2019 •

edited

jreback left a comment

jreback Jun 25, 2019

jreback left a comment

jreback Jun 26, 2019

jreback Jun 27, 2019

jreback Jun 27, 2019

TomAugspurger Jun 27, 2019

jreback commented Jun 27, 2019

jreback commented Jun 27, 2019

badge commented Jul 22, 2019

INSTALLED VERSIONS

TomAugspurger commented Jul 22, 2019

Allow multiple lambdas in Groupby.aggregate #26905

Allow multiple lambdas in Groupby.aggregate #26905

Conversation

TomAugspurger commented Jun 17, 2019

codecov bot commented Jun 17, 2019 • edited

Codecov Report

zertrin commented Jun 20, 2019

TomAugspurger commented Jun 20, 2019

zertrin commented Jun 20, 2019

jreback left a comment

Choose a reason for hiding this comment

TomAugspurger commented Jun 24, 2019

zertrin commented Jun 24, 2019

Choose a reason for hiding this comment

TomAugspurger Jun 25, 2019 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TomAugspurger left a comment

Choose a reason for hiding this comment

TomAugspurger Jun 25, 2019 • edited

Choose a reason for hiding this comment

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Jun 27, 2019

jreback commented Jun 27, 2019

badge commented Jul 22, 2019

Output of pd.show_versions()

INSTALLED VERSIONS

TomAugspurger commented Jul 22, 2019

codecov bot commented Jun 17, 2019 •

edited

TomAugspurger Jun 25, 2019 •

edited

TomAugspurger Jun 25, 2019 •

edited

Output of `pd.show_versions()`