Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow multiple lambdas in Groupby.aggregate #26905

Merged
merged 12 commits into from
Jun 27, 2019

Conversation

TomAugspurger
Copy link
Contributor

Closes #26430

@TomAugspurger TomAugspurger added this to the 0.25.0 milestone Jun 17, 2019
@codecov
Copy link

codecov bot commented Jun 17, 2019

Codecov Report

Merging #26905 into master will decrease coverage by 0.03%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26905      +/-   ##
==========================================
- Coverage   92.04%      92%   -0.04%     
==========================================
  Files         180      180              
  Lines       50714    50783      +69     
==========================================
+ Hits        46679    46724      +45     
- Misses       4035     4059      +24
Flag Coverage Δ
#multiple 90.64% <100%> (-0.03%) ⬇️
#single 41.83% <16.12%> (-0.12%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/generic.py 89.72% <100%> (+0.38%) ⬆️
pandas/io/gbq.py 88.88% <0%> (-11.12%) ⬇️
pandas/core/internals/managers.py 96.09% <0%> (-0.77%) ⬇️
pandas/core/internals/blocks.py 94.61% <0%> (-0.53%) ⬇️
pandas/core/panel.py 17.61% <0%> (-0.2%) ⬇️
pandas/core/frame.py 96.89% <0%> (-0.12%) ⬇️
pandas/util/testing.py 90.94% <0%> (+0.1%) ⬆️
pandas/core/indexing.py 93.48% <0%> (+0.18%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e955515...be712d8. Read the comment docs.

@zertrin
Copy link
Contributor

zertrin commented Jun 20, 2019

Am I correct to understand that this also allows using Keyword aggregation with multiple lambdas as a way to avoid the automatic name mangling if one does wish so?

Something like:

df.groupby('A').agg(
    foo=('B', lambda x: x.max() - x.min()),
    bar=('B', lambda x: x.median() - x.mean()),
)

Which results in a dataframe with 2 columns foo and bar.

@TomAugspurger
Copy link
Contributor Author

Right

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"A": [1, 2]})

In [3]: df.groupby([1, 1]).agg(foo=('A', lambda x: x.max()), bar=("A", lambda x: x.min()))
Out[3]:
   foo  bar
1    2    2

On master, that raises with the SpecificationError about multiple lambdas.

@zertrin
Copy link
Contributor

zertrin commented Jun 20, 2019

Great news! This issue is one of the main reasons I opened issue #18366 at the first place.

Might I suggest including an example in the documentation as a way to skip automatic name mangling and control the output column names?

Then that would be really great, because users can discover it as a replacement of the already working old behaviour which will disappear.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. some doc and a question about the output format.

doc/source/whatsnew/v0.25.0.rst Outdated Show resolved Hide resolved
pandas/core/groupby/generic.py Outdated Show resolved Hide resolved
pandas/core/groupby/generic.py Show resolved Hide resolved
pandas/core/groupby/generic.py Outdated Show resolved Hide resolved
pandas/core/groupby/generic.py Show resolved Hide resolved
pandas/core/groupby/generic.py Show resolved Hide resolved
pandas/tests/groupby/aggregate/test_aggregate.py Outdated Show resolved Hide resolved
@TomAugspurger
Copy link
Contributor Author

Updated

  1. Changed mangling for [lambda x: 0, lambda x: 1] to have the names [<lambda_0>, <lambda_1>] rather than [<lambda>, <lambda_1>]. Note that .agg([lambda x: 0]) is still just [<lambda>]
  2. Added a short whatsnew note
  3. Added tests for NamedAgg

@zertrin
Copy link
Contributor

zertrin commented Jun 24, 2019

1. Changed mangling for `[lambda x: 0, lambda x: 1]` to have the names `[<lambda_0>, <lambda_1>]` rather than `[<lambda>, <lambda_1>]`. Note that `.agg([lambda x: 0])` is still just `[<lambda>]`

This made me suddenly remember a comment from someone in #18366 saying:

there is something deeply queer about mixing the Python's function name-space (something to do with the particular implementation) with the data the column names (something that should surely not know about the implementation). The fact that we are seeing columns (potentially multiple columns) named <lambda> is causes me severe cognitive dissonance.

which lead me to the idea: maybe* when using lambda aggfuncs, using named aggregation could be made mandatory?

This is probably not an acceptable idea now that we are that late in the iteration (I'm fine with the current pull request BTW), but I just wanted to ask your opinion on this (*: maybe for a future deprecation?)

@@ -1710,3 +1715,97 @@ def _normalize_keyword_aggregation(kwargs):
order.append((column,
com.get_callable_name(aggfunc) or aggfunc))
return aggspec, columns, order


def _make_lambda(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this? Seems like this could be done a lot more succinctly with a partial, i.e.:

f = functools.partial(func)
f.__name__ = "<lambda_{}>".format(i)

Could be done directly in loop so one less function and has the added benefit of maintaining other function attributes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you find it strange to use functools.partial when you aren't partially applying any arguments? It's not where my mind initially went.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I can see where that is strange (effectively using partial to copy the function) but personally find it to be less code and more readable than this wrapper.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the currently construction more informative (e.g. it has a doc-string and typing)

pandas/core/groupby/generic.py Show resolved Hide resolved
pandas/core/groupby/generic.py Outdated Show resolved Hide resolved
pandas/core/groupby/generic.py Outdated Show resolved Hide resolved
* use functools.partial
* remove dead or
Copy link
Contributor Author

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank @WillAyd.

@@ -1710,3 +1715,97 @@ def _normalize_keyword_aggregation(kwargs):
order.append((column,
com.get_callable_name(aggfunc) or aggfunc))
return aggspec, columns, order


def _make_lambda(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you find it strange to use functools.partial when you aren't partially applying any arguments? It's not where my mind initially went.

pandas/core/groupby/generic.py Show resolved Hide resolved
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, minor comments.

doc/source/whatsnew/v0.25.0.rst Outdated Show resolved Hide resolved
pandas/core/groupby/generic.py Outdated Show resolved Hide resolved
@@ -1710,3 +1715,97 @@ def _normalize_keyword_aggregation(kwargs):
order.append((column,
com.get_callable_name(aggfunc) or aggfunc))
return aggspec, columns, order


def _make_lambda(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the currently construction more informative (e.g. it has a doc-string and typing)

pandas/core/groupby/generic.py Outdated Show resolved Hide resolved
pandas/core/groupby/generic.py Show resolved Hide resolved
* remove type
* remove dead code
* remove release note
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comments on types. otherwise lgtm. merge on green.

# -> typing.Sequence[Callable[..., ScalarResult]]:

def _managle_lambda_list(aggfuncs):
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add Parameters & types

pandas/core/groupby/generic.py Show resolved Hide resolved
@@ -47,6 +48,10 @@
NamedAgg = namedtuple("NamedAgg", ["column", "aggfunc"])
# TODO(typing) the return value on this callable should be any *scalar*.
AggScalar = Union[str, Callable[..., Any]]
# TODO: validate types on ScalarResult and move to _typing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make an issue for this and point to this code after we merge

@@ -208,6 +213,8 @@ def aggregate(self, func, *args, **kwargs):
raise TypeError("Must provide 'func' or tuples of "
"'(column, aggfunc).")

func = _maybe_mangle_lambdas(func)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, the one on L810 is SeriesGroupBy.aggregate. I think it's entirely separate from NDFramGroupBy.aggregate.

pandas/core/groupby/generic.py Show resolved Hide resolved
@jreback
Copy link
Contributor

jreback commented Jun 27, 2019

hmm, can you merge master again, have a conflict now.

@jreback jreback merged commit 87d7cdf into pandas-dev:master Jun 27, 2019
@jreback
Copy link
Contributor

jreback commented Jun 27, 2019

thanks @TomAugspurger very nice!

@badge
Copy link

badge commented Jul 22, 2019

Right

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"A": [1, 2]})

In [3]: df.groupby([1, 1]).agg(foo=('A', lambda x: x.max()), bar=("A", lambda x: x.min()))
Out[3]:
   foo  bar
1    2    2

On master, that raises with the SpecificationError about multiple lambdas.

I am able to run the SeriesGroupBy example in the What’s new in 0.25.0 notes, but Tom's example with DataFrameGroupBy here still fails on Pandas 0.25.0 with KeyError: "None of [MultiIndex([('A', '<lambda>'),\n ('A', '<lambda>')],\n )] are in the [columns]".

I assume this isn't expected behaviour? Shall I create a new issue for this?

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.0
numpy : 1.16.4
pytz : 2019.1
dateutil : 2.8.0
pip : 19.1.1
setuptools : 41.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : 2.0.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.2 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.5.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.0
sqlalchemy : 1.3.3
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None

@TomAugspurger TomAugspurger deleted the 26430-multiple-lambdas branch July 22, 2019 13:35
@TomAugspurger
Copy link
Contributor Author

@badge a new issue would be good.

Can you note that "in DataFrameGroupby.aggregate, order needs to be mangled too"?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Accept multiple lambda in groupby list
5 participants