Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Support nested renaming / selection #26399

Merged
merged 31 commits into from
May 30, 2019

Conversation

TomAugspurger
Copy link
Contributor

This adds special support for controlling the output column names when performing column-specific groupby aggregations. We use kwargs, using the keywords as the output names, and expecting tuples of (selection, aggfunc).

In [2]: df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
   ...:                    'height': [9.1, 6.0, 9.5, 34.0],
   ...:                    'weight': [7.9, 7.5, 9.9, 198.0]})
   ...:

In [3]: df
Out[3]:
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0

In [4]: df.groupby('kind').agg(min_height=('height', 'min'), max_weight=('weight', 'max'))
Out[4]:
      min_height  max_weight
kind
cat          9.1         9.9
dog          6.0       198.0

Closes #18366


Putting this up as a WIP before the meeting on Thursday.

TODO:

  • Update old deprecation message to point to new way
  • Decide if we want to allow this elsewhere (DataFrame.agg? SeriesGroupBy.agg?)

cc @jorisvandenbossche @WillAyd

@TomAugspurger TomAugspurger added Groupby API Design Deprecate Functionality to remove in pandas labels May 15, 2019
@TomAugspurger TomAugspurger added this to the 0.25.0 milestone May 15, 2019
@WillAyd
Copy link
Member

WillAyd commented May 15, 2019

Thanks for bringing this back up as this had certainly gone stale. I might not have been as vocal as I should have and may be in the minority but I rather dislike the mapping of dict keys to labels for the following reasons:

  • It makes our API more difficult to manage going forward; any keyword we reasonably think to add to agg now has to come with some sort of deprecation cycle for user code that may be using that keyword
  • It limits the range of valid labels that can be used
  • It can have very strange side-effects when conflicting with other keywords (slightly duplicative of my first point, but we see this issue with using name in GroupBy ops for example)
  • It seems "unpythonic" to me (this is purely an opinion so not necessarily correct)

I unfortunately have a travel conflict for the meeting Thursday but just providing these points as food for thought. I'll be sure to check the follow ups - thanks again!

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented May 15, 2019

I rather dislike the mapping of dict keys to labels

@WillAyd For clarification: you mean the mapping of "keyword argument names" instead of "dict keys" ? (because it is currently that we use dict keys for it)

@WillAyd
Copy link
Member

WillAyd commented May 15, 2019 via email

@jorisvandenbossche
Copy link
Member

For some of your points about limitation what you can use: note that you can still pass it as a dict (and then do **{..}, although not that nice) if you want to have names that are not valid python identifiers.

doc/source/whatsnew/v0.25.0.rst Outdated Show resolved Hide resolved
doc/source/whatsnew/v0.25.0.rst Outdated Show resolved Hide resolved
doc/source/whatsnew/v0.25.0.rst Show resolved Hide resolved
@TomAugspurger
Copy link
Contributor Author

Thanks @WillAyd

I rather dislike the mapping of dict keys to labels

I agree that it's awkward. In particular, I don't like how idiosyncratic this is; it really is a special mode with its own set of rules. But our set of options (this, nothing, or some other API), I think this is best. I need to read through the discussion again, but do we agree that we need to do something to replace the deprecated feature?

any keyword we reasonably think to add to agg now has to come with some sort of deprecation cycle for user code that may be using that keyword

I think this isn't an issue, or it isn't made any worse by this PR. I believe that in "normal" mode (user passing an aggfunc) pandas promises to pass through kwargs to the aggfunc. So I don't think we'd be able to add keywords to .agg for use by pandas without deprecating things anyway.

It limits the range of valid labels that can be used

I try to document this. You need to use the (ugly) .agg(**{'not an identifier': ('col', 'sum')}) syntax. This is the same limitation for assign.

It can have very strange side-effects when conflicting with other keywords

If you have a chance, can you make an example? We kinda have that with _level, since we use that internally.

In [6]: df.groupby('kind').agg(_level=('height', 'sum'))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-45280b886040> in <module>
----> 1 df.groupby('kind').agg(_level=('height', 'sum'))

~/sandbox/pandas/pandas/core/groupby/generic.py in aggregate(self, arg, *args, **kwargs)
   1335     @Appender(_shared_docs['aggregate'])
   1336     def aggregate(self, arg=None, *args, **kwargs):
-> 1337         return super().aggregate(arg, *args, **kwargs)
   1338
   1339     agg = aggregate

~/sandbox/pandas/pandas/core/groupby/generic.py in aggregate(self, func, *args, **kwargs)
    166         elif func is None:
    167             # nicer error message
--> 168             raise TypeError("Must provide 'func' or tuples of "
    169                             "'(column, aggfunc).")
    170

TypeError: Must provide 'func' or tuples of '(column, aggfunc).

@TomAugspurger
Copy link
Contributor Author

How do people feel about the tuple (selection, aggfunc) convention? Is it too opaque? In the last call we discussed using a dict like {selection: aggfunc}, but I disliked that because it would always be a single-item dict, which felt strange.

We could expose a namedtuple, like

In [16]: NamedAgg = namedtuple("NamedAgg", ["column", "aggfunc"])

In [17]: NamedAgg('height', 'sum')
Out[17]: NamedAgg(column='height', aggfunc='sum')

but I'm not sure where in the pandas namespace that would fit.


One other concern: in the issue someone raise an annoyance with multiple lambdas. Currently we raise a SpecificationError

In [23]: df.groupby('kind').agg(a=('height', lambda x: 1), b=('height', lambda x: 2))
---------------------------------------------------------------------------
SpecificationError                        Traceback (most recent call last)
~/sandbox/pandas/pandas/core/base.py in _aggregate(self, arg, *args, **kwargs)
    483                 try:
--> 484                     result = _agg(arg, _agg_1dim)
    485                 except SpecificationError:

~/sandbox/pandas/pandas/core/base.py in _agg(arg, func)
    434                 for fname, agg_how in arg.items():
--> 435                     result[fname] = func(fname, agg_how)
    436                 return result

~/sandbox/pandas/pandas/core/base.py in _agg_1dim(name, how, subset)
    417                                              "in aggregation")
--> 418                 return colg.aggregate(how, _level=(_level or 0) + 1)
    419

~/sandbox/pandas/pandas/core/groupby/generic.py in aggregate(self, func_or_funcs, *args, **kwargs)
    766             ret = self._aggregate_multiple_funcs(func_or_funcs,
--> 767                                                  (_level or 0) + 1)
    768         else:

~/sandbox/pandas/pandas/core/groupby/generic.py in _aggregate_multiple_funcs(self, arg, _level)
    829                     'Function names must be unique, found multiple named '
--> 830                     '{}'.format(name))
    831

SpecificationError: Function names must be unique, found multiple named <lambda>

Addressing that will require a bit of work; I've tried to keep this PR small by reusing existing machinery where possible.

@WillAyd
Copy link
Member

WillAyd commented May 15, 2019

If it is between this and still doing nothing I'd vote for the latter without offering anything better. I think this will be pretty buggy and not provide that great of a user experience anyway, so hesitant to start down this path

@TomAugspurger
Copy link
Contributor Author

FYI, this seems somewhat hard to support in Python 3.5, since before 3.6 the order of **kwargs was undefined. That means that for 3.5 and earlier we can't reliably know the output column order. I'm in favor of raising a runtime error for 3.5.


If it is between this and still doing nothing I'd vote for the latter without offering anything better.

And how about if I restrict your choices this or nothing? 😄

If you sort our issues by most +1s, this is the 3rd-most upvoted issue right now. So I'm inclined to do something. Right now this seems like the least-bad something to me :)

@WillAyd
Copy link
Member

WillAyd commented May 15, 2019

And how about if I restrict your choices this or nothing? 😄

Ha tough one! I'd still say nothing in that case. From my (very biased) workflow I couldn't imagine ever really using this instead of just assigning the columns I want after the fact. (caveat: I don't use pipe / method chaining)

@jorisvandenbossche
Copy link
Member

Do we want to keep the fundamental discussion on the issue? (so about to do this or not, not the technical details of the PR)

@TomAugspurger
Copy link
Contributor Author

Do we want to keep the fundamental discussion on the issue?

Sure, I'll do a short summary of this PR over there.

@jorisvandenbossche
Copy link
Member

It's mainly Will's arguments that are relevant there.

FYI, this seems somewhat hard to support in Python 3.5, since before 3.6 the order of **kwargs was undefined. That means that for 3.5 and earlier we can't reliably know the output column order. I'm in favor of raising a runtime error for 3.5.

But don't we have the same issue now already with the dict of dicts approach? In those cases we simply sort the keys, so why not here?

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented May 15, 2019

I suppose sorting is best for python 3.5.

@TomAugspurger TomAugspurger marked this pull request as ready for review May 15, 2019 21:51
@codecov
Copy link

codecov bot commented May 15, 2019

Codecov Report

Merging #26399 into master will decrease coverage by <.01%.
The diff coverage is 96.29%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26399      +/-   ##
==========================================
- Coverage   91.69%   91.68%   -0.01%     
==========================================
  Files         174      174              
  Lines       50739    50763      +24     
==========================================
+ Hits        46524    46542      +18     
- Misses       4215     4221       +6
Flag Coverage Δ
#multiple 90.19% <96.29%> (ø) ⬆️
#single 41.15% <14.81%> (-0.19%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/generic.py 89.18% <96.29%> (+0.21%) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97.02% <0%> (-0.12%) ⬇️
pandas/util/testing.py 90.6% <0%> (-0.11%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ff4437e...06a86ec. Read the comment docs.

1 similar comment
@codecov
Copy link

codecov bot commented May 15, 2019

Codecov Report

Merging #26399 into master will decrease coverage by <.01%.
The diff coverage is 96.29%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26399      +/-   ##
==========================================
- Coverage   91.69%   91.68%   -0.01%     
==========================================
  Files         174      174              
  Lines       50739    50763      +24     
==========================================
+ Hits        46524    46542      +18     
- Misses       4215     4221       +6
Flag Coverage Δ
#multiple 90.19% <96.29%> (ø) ⬆️
#single 41.15% <14.81%> (-0.19%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/generic.py 89.18% <96.29%> (+0.21%) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97.02% <0%> (-0.12%) ⬇️
pandas/util/testing.py 90.6% <0%> (-0.11%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ff4437e...06a86ec. Read the comment docs.

@codecov
Copy link

codecov bot commented May 15, 2019

Codecov Report

Merging #26399 into master will decrease coverage by <.01%.
The diff coverage is 97.14%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26399      +/-   ##
==========================================
- Coverage   91.77%   91.76%   -0.01%     
==========================================
  Files         174      174              
  Lines       50649    50677      +28     
==========================================
+ Hits        46483    46505      +22     
- Misses       4166     4172       +6
Flag Coverage Δ
#multiple 90.3% <97.14%> (ø) ⬆️
#single 41.66% <31.42%> (-0.09%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/__init__.py 100% <100%> (ø) ⬆️
pandas/core/base.py 98.19% <100%> (ø) ⬆️
pandas/core/api.py 100% <100%> (ø) ⬆️
pandas/core/groupby/generic.py 89.22% <96.77%> (+0.26%) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97% <0%> (-0.12%) ⬇️
pandas/util/testing.py 90.6% <0%> (-0.11%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a91da0c...25dca1a. Read the comment docs.

@smcinerney
Copy link

smcinerney commented May 15, 2019

@TomAugspurger : thanks for keeping this alive, but I think 'ENH' is underselling it, it was essentially an outright BUG (#18366) to deprecate existing behavior in groupby.agg that users were relying on. (It also made us underperform a useful feature of R dplyr/tidyverse(/plyr) aggregate/summarize/mutate, which have essentially implemented this for 10 years). Should not break core pandas functionality.

@WillAyd : can you please be more specific and quantifiable about your objections? Yes, dict-based-renaming limits to valid ASCII(/Unicode?) names. If a user wants an arbitrary name, they can either name-mangle, or just use a different method than dict-based-renaming. But for 99+% of users this functionality will work out-of-the-box. Why compromise the API for the sake of 0.01%(?) of users, or whatever? Can you actually ballpark-estimate what % of pandas user code has non-compliant names?

Dict-based-renaming names moreover cannot be existing kwargs of that command; so yeah we need a Warning which checks name-collision of the dict with kwargs, the warning should default to on, and be disableable for speed. (That's not too hard, right?). Obviously we should strive for kwarg naming consistency across all commands that would support dict-based-renaming to minimize such collisions. That addresses all your objections, right?

  • It makes our API more difficult to manage going forward; any keyword we reasonably think to add to agg now has to come with some sort of deprecation cycle for user code that may be using that keyword

Yes, that's just life in API-land. This one has huge tangible benefits, it's worth the price.

  • It seems "unpythonic" to me (this is purely an opinion so not necessarily correct)

I personally disagree, anyway that's subjective. And R supports this sort of thing for 10+ years; this behavior is expected and achievable.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

syntax looks reasonable to me. I don't think you should sort for 3.5, though do we for .assign?

doc/source/whatsnew/v0.25.0.rst Outdated Show resolved Hide resolved
pandas/core/groupby/generic.py Outdated Show resolved Hide resolved
@TomAugspurger
Copy link
Contributor Author

I see now in the discussion above that you actually originally proposed NamedAgg, before Agg, and then @mroeschke preferred KeywordAgg.

Yeah, that was my initial preference between the two, but it's not strong. NamedAgg / "named aggregation" emphasizes the "what"l more, KeywordAgg / "keyword aggregation" emphasizes the "how" more. @mroeschke did you have a strong reason for KeywordAgg?

@TomAugspurger
Copy link
Contributor Author

Opened up a couple issues for SeriesGroupBy.agg and NDFrame.agg. I'll do the last round of doc fixups here once we decide on KeywordAgg vs. NamedAgg, (also gives conda a chance to fix itself).

@jreback
Copy link
Contributor

jreback commented May 24, 2019

i think NamedAgg might be more descriptive here

@mroeschke
Copy link
Member

KeywordAgg was just a preference over Agg. In general, I am in favor of any naming that's more descriptive than Agg.

@TomAugspurger
Copy link
Contributor Author

Switched to NamedAgg and updated the docs a bit.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented May 25, 2019 via email

@zertrin
Copy link
Contributor

zertrin commented May 25, 2019

@ previous comment: see answer in the thread here #26399 (comment)

@zertrin
Copy link
Contributor

zertrin commented May 25, 2019

they were split on whether the keyword was the column selection or the output name

Is is arguable whether the NamedTuple really solves that particular confusion: the keyword is still there, and in the namedtuple the first argument is just called column, which in itself does not disambiguate whether it is selection or output. I don't think that the NamedTuple really helps more than good documentation would. (but I am not saying that it should be removed either)

In the end, I think it is much more important that the examples in the documentation are clear as to which is which. The current example are good enough already in my opinion (.agg(min_height=('height', 'min'), max_height=('height', 'max')) shows clearly which col is input and which one is output).

Once the user see and understand the feature from the documentation once, she should not have much problems herself in the future to understand it.

Ultimately, the user is also responsible to give output columns meaningful names that aren't confusing. Using namedtuples doesn't solve every problem if the user gives confusing names (for example something like .agg(X=pd.NamedAgg(column='F', aggfunc='max')) is a little bit but not that much clearer than .agg(X=('F', 'max')))

And lastly but most importantly in my eyes, a huge number of python coders do not know exactly what a namedtuple is (most beginner to intermediate python coders might never have heard of it). At first glance, something like pd.NamedAgg(column='F', aggfunc='max') looks like a class (the user knows about pd.Series and pd.DataFrame which look similar).

My point is: emphasising/encouraging usage of the namedtuple might lead users to think that this pd.NamedAgg() construct is somehow special and required (even if the documentation says otherwise, in one sentence, which the user might overlook), whereas in fact, it is only some namedtuple wrapper, but the user probably won't understand it if she doesn't know what a namedtuple is.

So in summary, I think that we should do the reverse: show the simple keyword-tuple in most examples, and have one section with one example showing the namedtuple as an alternative (clearly stating that it doesn't do anything more than the simple keyword-tuple version).

@zertrin
Copy link
Contributor

zertrin commented May 25, 2019

One last thought: on the other hand, I must concede that the NamedAgg solution has one advantage: easier discovery. If a user stumble upon it and wonders what it is, (in the future) she can google for pd.NamedAgg and find the relevant part of the documentation explaining the keyword aggregation mechanic.

- The keywords are the *output* column names
- The values are tuples whose first element is the column to select
and the second element is the aggregation to apply to that column. Pandas
provides the ``pandas.NamedAgg`` namedtuple with the fields ``['column', 'aggfunc']``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can give a :class:`NamedAgg` ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this added in the reference.rst ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring autogenerated by the namedtuple isn't super helpful

Screen Shot 2019-05-28 at 9 24 37 AM

pandas/core/base.py Outdated Show resolved Hide resolved
@@ -41,6 +44,10 @@

from pandas.plotting._core import boxplot_frame_groupby

NamedAgg = namedtuple("NamedAgg", ["column", "aggfunc"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment on what this is (can we add a doc-string)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could using typing.NamedTuple.

pandas/core/groupby/generic.py Show resolved Hide resolved
@TomAugspurger
Copy link
Contributor Author

I think the remaining comments are about how to included NamedAgg in our API docs. I think our options are to

  1. Not use autodoc, just write out the docstring in the API docs (downside, doesn't work for IPython)
  2. Using typing.NamedTuple (I'm not very familiar with the downsides of this class vs. collections.namedtuple)

I'd prefer to leave that as a followup, along with or after DataFrame.agg.

@TomAugspurger
Copy link
Contributor Author

All green. If we're comfortable with it, I have the following followups to do:

  1. Allow Keyword aggregation in SeriesGroupBy #26512
  2. Allow Keyword Aggregation in DataFrame.agg and Series.agg #26513
  3. Accept multiple lambda in groupby list #26430

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional nitpicky stuff but OK with this as is and with follow ups. Nice work @TomAugspurger

@@ -41,6 +44,10 @@

from pandas.plotting._core import boxplot_frame_groupby

NamedAgg = namedtuple("NamedAgg", ["column", "aggfunc"])
# TODO(typing) the return value on this callable should be any *scalar*.
AggScalar = Union[str, Callable[..., Any]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be a TypeVar instead of a Union

)


If your desired output column names are not valid python keywords, construct a dictionary
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess these are technically identifiers instead of keywords

@jreback jreback merged commit 072408e into pandas-dev:master May 30, 2019
@jreback
Copy link
Contributor

jreback commented May 30, 2019

thanks @TomAugspurger nice work.

yeah happy to merge this and have issues for the followups.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Deprecate Functionality to remove in pandas Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deprecation of relabeling dicts in groupby.agg brings many issues
7 participants