Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: in _nsorted for frame with duplicated values index #13412

Closed
Tux1 opened this issue Jun 9, 2016 · 5 comments
Closed

BUG: in _nsorted for frame with duplicated values index #13412

Tux1 opened this issue Jun 9, 2016 · 5 comments
Labels
Milestone

Comments

@Tux1
Copy link
Contributor

Tux1 commented Jun 9, 2016

The function below has been incorrectly implemented. If the frame has an index with duplicated values, you will get a result with more than n rows and not properly sorted. So nsmallest and nlargest for DataFrame doesn't return a correct frame in this particular case.

def _nsorted(self, columns, n, method, keep):
    if not com.is_list_like(columns):
        columns = [columns]
    columns = list(columns)
    ser = getattr(self[columns[0]], method)(n, keep=keep)
    ascending = dict(nlargest=False, nsmallest=True)[method]
    return self.loc[ser.index].sort_values(columns, ascending=ascending,
                                           kind='mergesort')
@Tux1 Tux1 changed the title BUG: in _nsorted for frame BUG: in _nsorted for frame with duplicated values index Jun 9, 2016
@jorisvandenbossche
Copy link
Member

Indeed:

In [71]: df = pd.DataFrame({'a':[1,2,3,4], 'b':[4,3,2,1]}, index=[0,0,1,1])

In [72]: df.nlargest(1, 'a')
Out[72]:
   a  b
1  4  1
1  3  2

In [73]: df.nlargest(2, 'a')
Out[73]:
   a  b
1  4  1
1  4  1
1  3  2
1  3  2

(@Tux1 side note for future reference, it is always nice to provide a small reproducible example when opening an issue)
Interested in doing a PR to fix this?

@jreback jreback added this to the Next Major Release milestone Jun 9, 2016
@Tux1
Copy link
Contributor Author

Tux1 commented Jun 10, 2016

Yes I will fix that soon
Sorry about example

Le 9 juin 2016 à 23:30, Joris Van den Bossche notifications@github.com a écrit :

Indeed:

In [71]: df = pd.DataFrame({'a':[1,2,3,4], 'b':[4,3,2,1]}, index=[0,0,1,1])

In [72]: df.nlargest(1, 'a')
Out[72]:
a b
1 4 1
1 3 2

In [73]: df.nlargest(2, 'a')
Out[73]:
a b
1 4 1
1 4 1
1 3 2
1 3 2
(@Tux1 side note for future reference, it is always nice to provide a small reproducible example when opening an issue)
Interested in doing a PR to fix this?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

@Tux1
Copy link
Contributor Author

Tux1 commented Jun 12, 2016

my fix is not very elegant but I don't see any other solution to deal with MultiIndex and duplicated value index

mroeschke added a commit to mroeschke/pandas that referenced this issue Nov 22, 2016
mroeschke added a commit to mroeschke/pandas that referenced this issue Nov 23, 2016
mroeschke added a commit to mroeschke/pandas that referenced this issue Nov 24, 2016
…3412)

Add note to whatsnew

Add nlargest benchmark

Add tests for Series

organize nsorted methods

pep 8 fixes

passed test and pep8
@jreback jreback modified the milestones: 0.19.2, Next Major Release Nov 25, 2016
mroeschke added a commit to mroeschke/pandas that referenced this issue Nov 25, 2016
…3412)

Add note to whatsnew

Add nlargest benchmark

Add tests for Series

organize nsorted methods

pep 8 fixes

passed test and pep8

add docstrings
mroeschke added a commit to mroeschke/pandas that referenced this issue Dec 4, 2016
…3412)

Add note to whatsnew

Add nlargest benchmark

Add tests for Series

organize nsorted methods

pep 8 fixes

passed test and pep8

add docstrings

add github issue
@jreback jreback closed this as completed in 6e514da Dec 6, 2016
jorisvandenbossche pushed a commit that referenced this issue Dec 15, 2016
@jetpackdata
Copy link

Sum seems to work fine in .19.2 But with count, it doesn't seem to make sense. The df gets repeated as many times as the "n". Is that a bug or am i doing something wrong ?

df.groupby(['a']).agg({'b':'count'}).nlargest(2, 'b')

	b
a	
1	1
2	1
3	1
4	1
1	1
2	1
3	1
4	1

@jreback
Copy link
Contributor

jreback commented Mar 13, 2017

@shankararul see: #15297

jerome-white pushed a commit to jerome-white/pyzrt that referenced this issue May 22, 2017
(pandas-dev/pandas#13412) using sort_values
instead.

As a consequence, the normalization hack is no longer required: use
raw float values and change the precision when combine'ing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants