Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API/ENH/DEPR: Series.unique returns Series #24108

Closed
wants to merge 5 commits into from

Conversation

h-vetinari
Copy link
Contributor

@h-vetinari h-vetinari commented Dec 5, 2018

I know this PR is too big, but I need it to initiate discussion before v.0.24 cutoff.

I've been working on adding a return_inverse to unique since mid June (as fast as possible). I know that changing the return type of Series.unique is potentially a controversial issue, but I truly believe that this should be done for v.0.24 (as otherwise the behavior is locked in past 1.0 and who knows if it ever changes then).

IMO, it's even a necessity if pandas ever wants to support return_inverse for unique. Here's a few arguments from working on this for a while:

The reason I'm pushing this WIP for discussion is that this is obviously needs a deprecation cycle, and I really think this should be part of v.0.24. I'm sorry for the late timing, but I've been working as fast as feedback speed allowed (as often as politeness allowed me to ping) - #21645 was lying around mostly finished for 2 month, the cython backend (#22986 / #23400) took about 2 months, and I haven't been able to get an answer at #22824 for about 5 weeks (e.g. @jorisvandenbossche seems to be very busy or simply not available)

As for the PR itself, I only wanted to change the return-type in this PR, but Series.unique touches several important paths:

  1. Series.unique
  2. IndexOpsMixin.unique and hence Index.unique
  3. Categorical.unique
  4. EA.unique

So, as a demo here, I've adapted the first three, and it's a separate issue that the EA contract should support the possibility for return_inverse. This is also something that should IMO needs to make it to v.0.24.

@pep8speaks
Copy link

Hello @h-vetinari! Thanks for submitting the PR.

Copy link
Contributor Author

@h-vetinari h-vetinari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some inline notes

if isinstance(self, ABCSeries):
uniqs = self.unique(raw=True)
else:
uniqs = self.unique()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, raw=True is unfortunately not compatible with the Index case

>>> pd.Series([pd.Timestamp('2016-01-01')
... for _ in range(3)]).unique(raw=False)
0 2016-01-01
dtype: datetime64[ns]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is clearly superior output...

Categories (3, object): [a < b < c]
"""
result = super(Series, self).unique()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The branch for raw=True is exactly the same as before, but the diff is messed up because of the changed indentation.

if isinstance(duplicated, ABCSeries) and method != pd.unique:
result = method(duplicated, raw=True)
else:
result = method(duplicated)
Copy link
Contributor Author

@h-vetinari h-vetinari Dec 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit awkward, I'll admit, but it's the only way I found of keeping the parametrisation while avoiding to raise the FutureWarning.


We see that the values of `animals` get reconstructed correctly, but
the index does not match yet -- consequently, the last step is to
correctly set the index.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point, it's be really neat to have #22225 available to use .set_index...

@codecov
Copy link

codecov bot commented Dec 5, 2018

Codecov Report

Merging #24108 into master will decrease coverage by 0.04%.
The diff coverage is 60%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24108      +/-   ##
==========================================
- Coverage   92.21%   92.16%   -0.05%     
==========================================
  Files         161      161              
  Lines       51684    51723      +39     
==========================================
+ Hits        47658    47673      +15     
- Misses       4026     4050      +24
Flag Coverage Δ
#multiple 90.57% <60%> (-0.04%) ⬇️
#single 42.98% <23.63%> (-0.03%) ⬇️
Impacted Files Coverage Δ
pandas/core/series.py 92.59% <45.45%> (-1.09%) ⬇️
pandas/core/indexes/base.py 96.22% <50%> (-0.11%) ⬇️
pandas/core/base.py 96.75% <66.66%> (-0.89%) ⬇️
pandas/core/arrays/categorical.py 95.17% <75%> (-0.23%) ⬇️
pandas/core/algorithms.py 94.68% <77.77%> (-0.43%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4b5f4d1...6fd279a. Read the comment docs.

1 similar comment
@codecov
Copy link

codecov bot commented Dec 5, 2018

Codecov Report

Merging #24108 into master will decrease coverage by 0.04%.
The diff coverage is 60%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24108      +/-   ##
==========================================
- Coverage   92.21%   92.16%   -0.05%     
==========================================
  Files         161      161              
  Lines       51684    51723      +39     
==========================================
+ Hits        47658    47673      +15     
- Misses       4026     4050      +24
Flag Coverage Δ
#multiple 90.57% <60%> (-0.04%) ⬇️
#single 42.98% <23.63%> (-0.03%) ⬇️
Impacted Files Coverage Δ
pandas/core/series.py 92.59% <45.45%> (-1.09%) ⬇️
pandas/core/indexes/base.py 96.22% <50%> (-0.11%) ⬇️
pandas/core/base.py 96.75% <66.66%> (-0.89%) ⬇️
pandas/core/arrays/categorical.py 95.17% <75%> (-0.23%) ⬇️
pandas/core/algorithms.py 94.68% <77.77%> (-0.43%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4b5f4d1...6fd279a. Read the comment docs.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right this is way too big.

the question about the return value of .unique needs to be answered first; return an .array of the result (for Index and Series) is probably the most reasonable change and is mostly backward compatible

however better to bring this up on the issue (for unique return value)

@h-vetinari
Copy link
Contributor Author

@jreback:

the question about the return value of .unique needs to be answered first;

This is what this PR is about, since the issue had stalled for >1 months despite several pings.

return an .array of the result (for Index and Series) is probably the most reasonable change and is mostly backward compatible

I disagree with this quite strongly:

  • Changing Index.unique from Index->ndarray is as much of a breaking change as changing Series.unique from ndarray->Series (but has no benefits for reconstruction)
  • Series.unique already special-cases Categorical and EA. An ndarray fits even less as the return of a Series method for where pandas is heading.
  • Since .unique strongly advertises that it does not sort, there's an implicit index mapping happening already, only that it's very hard to coax out.
  • If it were to keep returning ndarray, having an inverse is basically impossible without running into several antipatterns.
  • etc.

however better to bring this up on the issue (for unique return value)

Would you mind chiming in there then?

@gfyoung gfyoung added Dtype Conversions Unexpected or buggy dtype conversions Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Dec 6, 2018
@h-vetinari h-vetinari mentioned this pull request Dec 6, 2018
@jreback
Copy link
Contributor

jreback commented May 12, 2019

good idea, but closing as stale

@jreback jreback closed this May 12, 2019
@h-vetinari
Copy link
Contributor Author

@jreback: good idea, but closing as stale

Not stale, but will need precursors, like #24119.

@h-vetinari
Copy link
Contributor Author

@jreback @gfyoung @simonjayhawkins
Can we please reopen this PR, or at least the precursor #24119?

@jreback
Copy link
Contributor

jreback commented Oct 7, 2019

i am not averse to changing the return type of .unique but cannot be linked to return_inverse which likely has less support

PRs need to do exactly 1 thing as have been stated many times
this ape e too complex

@h-vetinari
Copy link
Contributor Author

@jreback
Thanks for the response. I know this PR is too big (see OP), but it is the goal towards which I'd chip off smaller PRs (e.g. #24119).

Unfortunately, the return_inverse parts will have to come first, because it is not possible to return a Series with the correct index without tracking the inverse internally. In fact, I would argue that this is the main reason that Series.unique historically returned an array, because the cython-backend did not support calculating the inverse (luckily that's already been added in #22986 / #23400; the rest is a relatively easy change - see #24119).

@jreback
Copy link
Contributor

jreback commented Oct 7, 2019

@h-vetinari maybe i still don’t see it

why is returning a Series from .unique() actually useful?
why not an .array? then doesn’t .factorize() exactly provide the ability to reconstruct (iirc your main goal); speaking of reconstruction; how is this useful to a user? what is the use case?

@jorisvandenbossche
Copy link
Member

@jreback many of those questions have already seen some discussion in #22824 (including example use cases), so I would suggest we keep the general discussion on overhauling unique there. Can you repeat your comment there?

@h-vetinari h-vetinari mentioned this pull request Dec 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: provide a better way of doing np.unique(return_inverses=True)
5 participants