BUG: df.duplicated treats None as np.nan in object columns #21720

Open
h-vetinari opened this issue Jul 3, 2018 · 7 comments
Labels: Bug, duplicated, Missing-data

Comments

@h-vetinari
Contributor

Found out while writing tests for .duplicated in #21645 (so far, .duplicated was almost exclusively tested implicitly through .drop_duplicates)

At first I thought this was intended behaviour for DataFrame.duplicated(), but Series.duplicated() does not treat None and np.nan as equal. That makes sense to me, since, as objects, None is not np.nan - I have therefore labelled this as a bug.

import numpy as np
import pandas as pd

s = pd.Series([np.nan, 3, 3, None, np.nan], dtype=object)
s
# 0     NaN
# 1       3
# 2       3
# 3    None
# 4     NaN
# dtype: object

s.duplicated()
# 0    False
# 1    False
# 2     True
# 3    False
# 4     True
# dtype: bool

s.to_frame().duplicated()
# 0    False
# 1    False
# 2     True
# 3     True  <- CHANGED
# 4     True
# dtype: bool
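
For reference, the reason an object hashtable can keep them apart: as plain Python objects, None and np.nan are neither identical nor equal, while repeated np.nan entries still match because np.nan is a single shared float object. A minimal illustration (not pandas-specific code):

import numpy as np

None is np.nan              # False - different objects
None == np.nan              # False - not equal either
np.nan is np.nan            # True  - np.nan is one shared float object
len({np.nan: 0, None: 1})   # 2     - a dict (hashtable) keeps two separate keys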
@WillAyd
Member

WillAyd commented Jul 3, 2018

May be related to the discussion in #20442

WillAyd added the Missing-data label on Jul 3, 2018
@h-vetinari
Contributor Author

h-vetinari commented Jul 3, 2018

As far as I can tell, the difference is due to the call to pandas.core.algorithms.factorize (which is necessary for object data in DFs, not least due to that ancient numpy issue numpy/numpy#641).

from pandas.core.algorithms import factorize
s = pd.Series([np.nan, 3, 3, None, np.nan], dtype=object)
factorize(s.values)
# (array([-1,  0,  0, -1, -1], dtype=int64), array([3], dtype=object))
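
A rough sketch (not the actual pandas internals) of why this matters for duplicated: any duplicate detection built on top of the factorized codes cannot tell None and np.nan apart, because both have been mapped to the -1 sentinel:

import numpy as np
import pandas as pd
from pandas.core.algorithms import factorize

s = pd.Series([np.nan, 3, 3, None, np.nan], dtype=object)
codes, _ = factorize(s.values)          # codes: array([-1,  0,  0, -1, -1])

# naive "first occurrence wins" duplicate check on the codes alone
seen = set()
duplicated = []
for code in codes:
    duplicated.append(code in seen)
    seen.add(code)

duplicated
# [False, False, True, True, True]   <- matches the DataFrame result in the original post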

@h-vetinari
Contributor Author

This is documented behaviour for factorize - https://pandas.pydata.org/pandas-docs/stable/generated/pandas.factorize.html

Note: Even if there’s a missing value in values, uniques will not contain an entry for it.

However, Series.duplicated works without such a factorize call: the values are fed directly to the appropriate hashtable, which - for objects - apparently does distinguish np.nan from None.
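
For completeness, that path can also be exercised through the internal helper pd.core.algorithms.duplicated (internal API, so subject to change), which gives the same answer as Series.duplicated:

import numpy as np
import pandas as pd
from pandas.core import algorithms

s = pd.Series([np.nan, 3, 3, None, np.nan], dtype=object)
algorithms.duplicated(s.values, keep="first")
# array([False, False,  True, False,  True])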

@jorisvandenbossche
Member

As also commented on the PR, there is a similar difference between unique and factorize:

In [1]: s = pd.Series([np.nan, 3, 3, None, np.nan], dtype=object)

In [2]: pd.unique(s)
Out[2]: array([nan, 3, None], dtype=object)

In [4]: pd.factorize(s)
Out[4]: (array([-1,  0,  0, -1, -1]), Int64Index([3], dtype='int64'))

Factorize treats them all as identical (since it needs to substitute all missing values with -1), while unique treats them as separate values.

unique still takes a different code path than Series.duplicated from the original post (Series.duplicated is implemented through pd.core.algorithms.duplicated, which uses the pd._libs.hashtable.duplicated_{dtype} methods).
So not exactly the same code path, but the same underlying problem surfaces.

@h-vetinari
Contributor Author

@jorisvandenbossche

The difference between unique and factorize is documented, at least, although I don't know what exactly factorize is used for, and whether those two functions should maybe be married like I'm suggesting in #22986 with ignore_na.

(The discrepancy for unique at least causes no direct bugs like the one here for .duplicated - df.unique does not exist, and pd.unique doesn't handle DataFrames.)

@jorisvandenbossche
Member

Yes, it is true that for unique and factorize this is more or less expected/documented, while the duplicated case is clearly a bug (it's just that it comes from the same underlying difference in the handling of NaN values).

@simonjayhawkins
Member

For the example in the OP, the DataFrame result is now consistent with the Series result.

"fixed" in commit: [235113e] PERF: Improve performance for df.duplicated with one column subset (#45534) cc @phofl

The issue remains for a DataFrame with more than one column, or with a subset of more than one column.

import numpy as np
import pandas as pd

s = pd.Series([np.nan, 3, 3, None, np.nan], dtype=object)

s.duplicated()
# 0    False
# 1    False
# 2     True
# 3    False
# 4     True
# dtype: bool

s.to_frame().duplicated()
# 0    False
# 1    False
# 2     True
# 3    False
# 4     True
# dtype: bool

s.to_frame().assign(dup=lambda x: x[0]).duplicated()
# 0    False
# 1    False
# 2     True
# 3     True
# 4     True
# dtype: bool
