API: add return_inverse to pd.unique #24119

h-vetinari · 2018-12-05T23:27:05Z

splits off first chunk of API/ENH/DEPR: Series.unique returns Series #24108; progress towards API: provide a better way of doing np.unique(return_inverses=True) #4087 / ENH: adding .unique() to DF (or return_inverse for duplicated) #21357 / BUG: df.duplicated treats None as np.nan in object columns #21720 / API/ENH: overhaul/unify/improve .unique #22824
tests expanded / parametrized / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

This is the first part I'm splitting off of #24108, but now with full test coverage. For the moment, I've added return_inverse to pd.unique and to Categorical.unique, but it's not trivial because of inconsistencies like the following:

>>> import pandas as pd
>>> idx = pd.Index([0, 1, 1, 0])
>>> pd.unique(idx)
array([0, 1], dtype=int64)
>>>
>>> # So pd.unique(Index) yields an array, except if the Index is categorical...?
>>> idx = idx.astype('category')
>>> pd.unique(idx)
CategoricalIndex([0, 1], categories=[0, 1], ordered=False, dtype='category')

I'd be open to splitting off the change for Categorical.unique as well, or just returning NotImplemented for all ExtensionArray types. As mentioned in #24108 already, I believe that the possibility for return_inverse (or maybe even kwargs in general??) is something that should be added to the EA interface. @TomAugspurger @jreback @jbrockmendel
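For readers unfamiliar with the keyword, here is a minimal sketch of what an "inverse" means, using numpy's existing np.unique as the reference point. The return_inverse keyword on pd.unique is what this PR proposes, so it is deliberately not called here, and the sample values are made up for illustration. Note that np.unique sorts its result, whereas pd.unique keeps values in order of first appearance.

```python
import numpy as np

# Illustrative input; any 1-D array works the same way.
values = np.array([3, 1, 1, 3, 2])

# uniques holds the distinct values (sorted by numpy);
# inverse holds, for each input element, its position in uniques.
uniques, inverse = np.unique(values, return_inverse=True)

# Indexing the uniques with the inverse rebuilds the original array.
reconstructed = uniques[inverse]

print(uniques)        # [1 2 3]
print(inverse)        # [2 0 0 2 1]
print(reconstructed)  # [3 1 1 3 2]
```

The inverse is what lets a caller compute something once per distinct value and then broadcast the result back to the original shape.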

pep8speaks · 2018-12-05T23:27:11Z

Hello @h-vetinari! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-10-11 14:19:02 UTC

h-vetinari · 2018-12-06T00:25:36Z

Failure is (thankfully) only a flaky hypothesis test.

codecov · 2019-02-02T00:52:18Z

Codecov Report

Merging #24119 into master will decrease coverage by 0.17%.
The diff coverage is 91.66%.

@@            Coverage Diff             @@
##           master   #24119      +/-   ##
==========================================
- Coverage   92.37%    92.2%   -0.18%     
==========================================
  Files         166      162       -4     
  Lines       52420    51720     -700     
==========================================
- Hits        48423    47688     -735     
- Misses       3997     4032      +35

Flag	Coverage Δ
#multiple	`90.6% <91.66%> (-0.2%)`	⬇️
#single	`43.01% <25%> (+0.13%)`	⬆️

Impacted Files	Coverage Δ
pandas/core/arrays/categorical.py	`95.43% <100%> (-0.55%)`	⬇️
pandas/core/algorithms.py	`94.9% <87.5%> (+0.12%)`	⬆️
pandas/io/s3.py	`0% <0%> (-86.37%)`	⬇️
pandas/io/sas/sasreader.py	`86.2% <0%> (-9.95%)`	⬇️
pandas/io/parquet.py	`76.92% <0%> (-7.7%)`	⬇️
pandas/io/clipboard/clipboards.py	`28.23% <0%> (-2.36%)`	⬇️
pandas/core/arrays/base.py	`96.77% <0%> (-1.49%)`	⬇️
pandas/core/computation/check.py	`90.9% <0%> (-1.4%)`	⬇️
pandas/core/arrays/datetimelike.py	`96.35% <0%> (-1.33%)`	⬇️
pandas/core/indexes/datetimelike.py	`97.29% <0%> (-1.23%)`	⬇️
... and 87 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bb43726...006d7ad. Read the comment docs.

codecov · 2019-02-02T00:52:18Z

Codecov Report

Merging #24119 into master will decrease coverage by 0.69%.
The diff coverage is 91.66%.

@@            Coverage Diff            @@
##           master   #24119     +/-   ##
=========================================
- Coverage   93.07%   92.37%   -0.7%     
=========================================
  Files         192      166     -26     
  Lines       49551    52439   +2888     
=========================================
+ Hits        46119    48441   +2322     
- Misses       3432     3998    +566

Flag	Coverage Δ
#multiple	`90.79% <91.66%> (-1.04%)`	⬇️
#single	`42.86% <25%> (+0.35%)`	⬆️

Impacted Files	Coverage Δ
pandas/core/arrays/categorical.py	`96% <100%> (-1.31%)`	⬇️
pandas/core/algorithms.py	`94.58% <87.5%> (-0.94%)`	⬇️
pandas/io/gbq.py	`25% <0%> (-75%)`	⬇️
pandas/compat/__init__.py	`57.91% <0%> (-36.96%)`	⬇️
pandas/plotting/_misc.py	`38.68% <0%> (-26.18%)`	⬇️
pandas/io/common.py	`72.86% <0%> (-21.25%)`	⬇️
pandas/io/gcs.py	`80% <0%> (-20%)`	⬇️
pandas/io/s3.py	`86.36% <0%> (-13.64%)`	⬇️
pandas/io/formats/console.py	`66.66% <0%> (-11.46%)`	⬇️
pandas/core/computation/expr.py	`88.68% <0%> (-8.84%)`	⬇️
... and 198 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d8f9be7...786159f. Read the comment docs.

h-vetinari · 2019-02-02T00:55:33Z

Things changed around a bit here with DatetimeArray etc., but this commit should work.

As mentioned in the OP, this splits off the first chunk of #24108 and makes some progress towards #4087 / #21357 / #21720 / #22824. I'm sure there'll still be lots of discussion, but having an implementation is a good start (even though there's not much happening; the cython backend has already been there for a few months).

The diff in test_algos.py is completely busted, but I painstakingly created some nicely modular commits to step through the changes in a sane fashion.

h-vetinari · 2019-03-10T12:55:48Z

@jreback @jorisvandenbossche @TomAugspurger
Is it possible to give this PR some initial review? It's been lying around for ~3 months...

TomAugspurger · 2019-03-10T13:58:22Z

I won't have time in the near-term.


jreback · 2019-03-10T16:03:28Z

so again why should we add a method to do this, when we already have one?

In [16]: idx = pd.Index([0, 1, 1, 0])

In [17]: pd.factorize(idx)
Out[17]: (array([0, 1, 1, 0]), Int64Index([0, 1], dtype='int64'))

In [18]: idx.take(pd.factorize(idx)[0])
Out[18]: Int64Index([0, 1, 1, 0], dtype='int64')

True, numpy calls this return_inverse in .unique(), but they also don't have .factorize(). I really don't like having multiple ways of doing things. I am not convinced this is actually that useful. Can you show a use case that is not possible, or is (claimed to be) not performant, with what we have now?
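In the snippet above the index values 0 and 1 happen to coincide with the codes, so taking from idx itself reproduces the input. A sketch of the general factorize round trip, using made-up string labels, takes from the uniques instead:

```python
import pandas as pd

# Illustrative labels chosen so that values and codes cannot be confused.
idx = pd.Index(["b", "a", "a", "b"])

# codes are integer positions into uniques; uniques keep order of appearance.
codes, uniques = pd.factorize(idx)

# Taking from the uniques (not from idx) rebuilds the original index.
roundtrip = uniques.take(codes)

print(codes)      # [0 1 1 0]
print(uniques)
print(roundtrip)
```

This holds as long as the input has no missing values; with the default settings, missing values are given the code -1, which would not round-trip through take.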

jreback

comments

jreback · 2019-05-12T21:25:39Z

closing as stale

h-vetinari · 2019-05-13T05:39:54Z

@jreback, this is not stale, but I do see that I neglected to answer your comment a bit further up.

I've made the case in #22824 several times, but among other things, factorize and unique do different things, namely in the treatment of missing values (and there are several cases where that difference can be crucial).

Then there's the fact that most people who are not knee-deep in stats or R won't know factorize, but unique is incredibly intuitive and matches with what numpy offers.

Please reopen this and let's have this discussion (bearing in mind that the actual goal would be #24108; this PR is just a stepping stone).
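The difference in missing-value treatment mentioned above can be sketched as follows. This uses only the existing pd.unique and pd.factorize with their default arguments; the sample data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative float data containing two missing values.
values = np.array([1.0, np.nan, 2.0, 1.0, np.nan])

# pd.unique keeps NaN as one of the distinct values.
uniques = pd.unique(values)

# pd.factorize (default settings) drops NaN from the uniques
# and marks missing entries with the sentinel code -1.
codes, factor_uniques = pd.factorize(values)

print(uniques)         # [ 1. nan  2.]
print(codes)           # [ 0 -1  1  0 -1]
print(factor_uniques)  # [1. 2.]
```

Because the sentinel -1 is not a valid position for this purpose, factor_uniques.take(codes) does not reproduce the input when missing values are present, which is one reason the two functions are not interchangeable.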

h-vetinari · 2019-10-08T06:02:20Z

Thanks for reopening! Will try to merge master soon.

h-vetinari

A few comments on the diff

h-vetinari · 2019-10-11T12:10:17Z

doc/source/whatsnew/v1.0.0.rst

@@ -94,6 +94,25 @@ of the Series or columns of a DataFrame will also have string dtype.
 We recommend explicitly using the ``string`` data type when working with strings.
 See :ref:`text.types` for more.

+
+.. _whatsnew_1000.enhancements.unique:


Note, this was chopped off of #24108 and the section is intended to be bigger, compare here

h-vetinari · 2019-10-11T12:12:23Z

pandas/tests/test_algos.py

@@ -355,19 +371,23 @@ def test_factorize_na_sentinel(self, sort, na_sentinel, data, uniques):
        else:
            tm.assert_extension_array_equal(uniques, expected_uniques)

-
 class TestUnique:


Note that no tests are removed here (even though the diff is large). I sometimes joined tests and parametrized them with fixtures. In fact, there should be many more tests now...

h-vetinari · 2019-10-11T12:13:17Z

pandas/tests/test_algos.py

+        assert_series_or_index_or_array_or_categorical_equal(result, expected)
+
+        # TODO: add support for return_inverse to DatetimeArray/DatetimeIndex,
+        # as well as [[Series/Index].unique


I left this here as an indication of where things should be heading.

h-vetinari · 2019-10-12T18:11:05Z

@TomAugspurger @jorisvandenbossche @jreback
This is updated and green. Since the diff in the tests is basically unreadable, I'll recall the following:

The diff in test_algos.py is completely busted, but I painstakingly created some nicely modular commits to step through the changes in a sane fashion.

jreback · 2019-10-12T18:24:30Z

I am highly unlikely to change my -1 on this

i view this as duplicative and confusing api

if you want to document in an example great

h-vetinari · 2019-10-12T18:55:38Z

@jreback: i view this as duplicative and confusing api

.unique is central API, and extremely well-established (on top of being very intuitive; plus: the method exists in numpy with the inverse). You yourself suggested to add a way of doing return_inverse=True for unique.

You had also already agreed to the utility of having an inverse here (at the time you suggested to add the inverse to duplicated, which I did in #21645, only to be redirected to add the inverse to .unique where it fits more properly).

And as I won't tire of repeating: unique != factorize. The methods have different goals and differ in several key points.

I feel the conversation keeps going in circles - maybe this could be a nice example case for writing a fully fledged enhancement proposal?

jreback · 2020-01-26T01:46:32Z

closing. I don't think anyone has the bandwidth to work with you on this.

gfyoung added the Enhancement and Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) labels Dec 6, 2018

h-vetinari added 12 commits February 2, 2019 00:42

Code & whatsnew

158d925

Extend and parametrize unique-test to all numpy dtypes

49e42f1

Also test inverse in test_unique_all_dtypes

9285e8c

Parametrize test_timedelta64_dtype_array_returned

45eff67

Add inverse to test_timedelta64_dtype_array_returned

de22c62

Add inverse to test_nan_in_object_array

2a4f4a7

Parametrize test_categorical

39a2e64

Add inverse to test_categorical

bfc310f

Parametrize test_datetime64tz_aware

af7d8f3

Add inverse to test_datetime64tz_aware

c089f1f

Remove test case that is covered elsewhere

78d4758

Fix DatetimeArray-case in test_datetime64tz_aware; create TODO

006d7ad

h-vetinari force-pushed the pd_unique_inv branch from d19f073 to 006d7ad Compare February 2, 2019 00:52

h-vetinari mentioned this pull request Mar 7, 2019

API/ENH: overhaul/unify/improve .unique #22824

Open

6 tasks

jreback requested changes Mar 10, 2019


jreback closed this May 12, 2019

h-vetinari mentioned this pull request May 13, 2019

API/ENH/DEPR: Series.unique returns Series #24108

Closed

11 tasks

jorisvandenbossche reopened this Oct 7, 2019

h-vetinari added 3 commits October 11, 2019 13:51

blackify conflict files

b310b48

Merge remote-tracking branch 'upstream/master' into pd_unique_inv

df375ad

lint

786159f

h-vetinari commented Oct 11, 2019


h-vetinari added 2 commits October 11, 2019 16:15

Merge remote-tracking branch 'upstream/master' into pd_unique_inv

39f23ff

fix oversight from merge

dfee500

jreback closed this Jan 26, 2020
