BUG: in _nsorted for frame with duplicated values index #13412

Tux1 · 2016-06-09T14:18:36Z

The function below is implemented incorrectly. If the frame's index contains duplicated values, the result has more than n rows and is not properly sorted. As a result, nsmallest and nlargest for DataFrame do not return a correct frame in this case.

def _nsorted(self, columns, n, method, keep):
    if not com.is_list_like(columns):
        columns = [columns]
    columns = list(columns)
    ser = getattr(self[columns[0]], method)(n, keep=keep)
    ascending = dict(nlargest=False, nsmallest=True)[method]
    return self.loc[ser.index].sort_values(columns, ascending=ascending,
                                           kind='mergesort')
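To make the failure mode concrete, here is a minimal sketch of what the `self.loc[ser.index]` step does when labels repeat. It reuses only public pandas calls. The frame is the same one used in the reproduction later in this thread, and the variable names are illustrative rather than taken from the pandas source.

```python
import pandas as pd

# Frame whose index holds duplicated labels (0 and 1 each appear twice).
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [4, 3, 2, 1]}, index=[0, 0, 1, 1])

# Series.nlargest returns exactly one value here; its index holds one label.
ser = df['a'].nlargest(1)
labels = ser.index

# Label-based selection returns every row carrying that label, so a
# duplicated label brings back more rows than were requested.
selected = df.loc[labels]
print(len(ser), len(selected))
```

The label lookup cannot tell the two rows labelled `1` apart, which is why the row count grows beyond n before the final `sort_values` call.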


jorisvandenbossche · 2016-06-09T15:29:58Z

Indeed:

In [71]: df = pd.DataFrame({'a':[1,2,3,4], 'b':[4,3,2,1]}, index=[0,0,1,1])

In [72]: df.nlargest(1, 'a')
Out[72]:
   a  b
1  4  1
1  3  2

In [73]: df.nlargest(2, 'a')
Out[73]:
   a  b
1  4  1
1  4  1
1  3  2
1  3  2

(@Tux1 side note for future reference, it is always nice to provide a small reproducible example when opening an issue)
Interested in doing a PR to fix this?

Tux1 · 2016-06-10T00:26:44Z

Yes, I will fix that soon.
Sorry about the missing example.


Tux1 · 2016-06-12T05:24:01Z

My fix is not very elegant, but I don't see any other way to handle a MultiIndex and an index with duplicated values.
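For readers following along, one way to sidestep duplicated labels is to select rows by position instead of by label. The sketch below is only an illustration of that idea, not the fix that was merged. The helper name `nlargest_by_position` is hypothetical, and it handles a single column only.

```python
import pandas as pd


def nlargest_by_position(frame, n, column):
    """Hypothetical helper: top-n rows of `frame` by `column`, chosen by position."""
    # Drop the original labels so every row gets a unique key 0..len-1.
    positional = frame[column].reset_index(drop=True)
    # The index of the result now holds row positions, ordered largest first.
    positions = positional.nlargest(n).index.to_numpy()
    # Positional selection keeps the original (possibly duplicated) labels.
    return frame.iloc[positions]


df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [4, 3, 2, 1]}, index=[0, 0, 1, 1])
result = nlargest_by_position(df, 2, 'a')
print(result)
```

Because the selection never goes through the labels, the same approach should also work when the frame carries a MultiIndex, at the cost of an extra `reset_index` on the column.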


jetpackdata · 2017-03-13T13:11:24Z

Sum seems to work fine in 0.19.2, but with count the result doesn't seem to make sense: the df gets repeated as many times as n. Is that a bug, or am I doing something wrong?

df.groupby(['a']).agg({'b':'count'}).nlargest(2, 'b')

jreback · 2017-03-13T13:16:45Z

@shankararul see: #15297

(pandas-dev/pandas#13412) using sort_values instead. As a consequence, the normalization hack is no longer required: use raw float values and change the precision when combine'ing.

Tux1 changed the title ~~BUG: in _nsorted for frame~~ BUG: in _nsorted for frame with duplicated values index Jun 9, 2016

jorisvandenbossche added the Bug label Jun 9, 2016

jreback added Difficulty Novice labels Jun 9, 2016

jreback added this to the Next Major Release milestone Jun 9, 2016

Tux1 mentioned this issue Jun 12, 2016

BUG: in _nsorted for frame with duplicated values index #13428

Closed

4 tasks

mroeschke added a commit to mroeschke/pandas that referenced this issue Nov 21, 2016

BUG: _nsorted incorrect with duplicated values in index (pandas-dev#1…

86e186b

…3412)

mroeschke added a commit to mroeschke/pandas that referenced this issue Nov 22, 2016

BUG: _nsorted incorrect with duplicated values in index (pandas-dev#1…

9b1ca18

…3412) Add note to whatsnew

mroeschke mentioned this issue Nov 22, 2016

BUG: _nsorted incorrect with duplicated values in index (#13412) #14707

Closed

4 tasks

mroeschke added a commit to mroeschke/pandas that referenced this issue Nov 23, 2016

BUG: _nsorted incorrect with duplicated values in index (pandas-dev#1…

fdfaa97

…3412) Add note to whatsnew Add nlargest benchmark

jreback modified the milestones: 0.19.2, Next Major Release Nov 25, 2016

jreback closed this as completed in 6e514da Dec 6, 2016

chris-b1 mentioned this issue Dec 9, 2016

BUG: DataFrame.nlargest() returns incorrect result when DataFrame has non-unique index #14846

Closed

jorisvandenbossche pushed a commit that referenced this issue Dec 15, 2016

BUG: _nsorted incorrect with duplicated values in index

11eb8ab

closes #13412 closes #14707 (cherry picked from commit 6e514da)
