BUG: df.duplicated treats None as np.nan in object columns #21720

Open
h-vetinari opened this issue Jul 3, 2018 · 7 comments
Labels: Bug, duplicated, Missing-data

Comments

@h-vetinari
Contributor

Found out while writing tests for .duplicated in #21645 (so far, .duplicated was almost exclusively tested implicitly through .drop_duplicates)

At first I thought this was intended behaviour for DataFrame.duplicated(), but Series.duplicated() does not treat None and np.nan as equal. That makes sense to me, since, as objects, None is not np.nan - I have therefore labelled this as a bug.

import numpy as np
import pandas as pd

s = pd.Series([np.nan, 3, 3, None, np.nan], dtype=object)
s
# 0     NaN
# 1       3
# 2       3
# 3    None
# 4     NaN
# dtype: object

s.duplicated()
# 0    False
# 1    False
# 2     True
# 3    False
# 4     True
# dtype: bool

s.to_frame().duplicated()
# 0    False
# 1    False
# 2     True
# 3     True  <- CHANGED
# 4     True
# dtype: bool
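
For reference, the reason an object hashtable can keep them apart: as plain Python objects, None and np.nan are neither identical nor equal, while repeated np.nan entries still match because np.nan is a single shared float object. A minimal illustration (not pandas-specific code):

import numpy as np

None is np.nan              # False - different objects
None == np.nan              # False - not equal either
np.nan is np.nan            # True  - np.nan is one shared float object
len({np.nan: 0, None: 1})   # 2     - a dict (hashtable) keeps two separate keys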
@WillAyd
Member

WillAyd commented Jul 3, 2018

May be related to the discussion in #20442

WillAyd added the Missing-data label on Jul 3, 2018
@h-vetinari
Contributor Author

h-vetinari commented Jul 3, 2018

As far as I can tell, the difference is due to the call to pandas.core.algorithms.factorize (which is necessary for object data in DFs, not least due to that ancient numpy issue numpy/numpy#641).

from pandas.core.algorithms import factorize
s = pd.Series([np.nan, 3, 3, None, np.nan], dtype=object)
factorize(s.values)
# (array([-1,  0,  0, -1, -1], dtype=int64), array([3], dtype=object))
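
A rough sketch (not the actual pandas internals) of why this matters for duplicated: any duplicate detection built on top of the factorized codes cannot tell None and np.nan apart, because both have been mapped to the -1 sentinel:

import numpy as np
import pandas as pd
from pandas.core.algorithms import factorize

s = pd.Series([np.nan, 3, 3, None, np.nan], dtype=object)
codes, _ = factorize(s.values)          # codes: array([-1,  0,  0, -1, -1])

# naive "first occurrence wins" duplicate check on the codes alone
seen = set()
duplicated = []
for code in codes:
    duplicated.append(code in seen)
    seen.add(code)

duplicated
# [False, False, True, True, True]   <- matches the DataFrame result in the original post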

@h-vetinari
Contributor Author

This is documented behaviour for factorize - https://pandas.pydata.org/pandas-docs/stable/generated/pandas.factorize.html

Note: Even if there’s a missing value in values, uniques will not contain an entry for it.

However, Series.duplicated works without such a factorize call: the values are fed directly to the appropriate hashtable, which - for objects - apparently does distinguish np.nan from None.
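
For completeness, that path can also be exercised through the internal helper pd.core.algorithms.duplicated (internal API, so subject to change), which gives the same answer as Series.duplicated:

import numpy as np
import pandas as pd
from pandas.core import algorithms

s = pd.Series([np.nan, 3, 3, None, np.nan], dtype=object)
algorithms.duplicated(s.values, keep="first")
# array([False, False,  True, False,  True])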

@jorisvandenbossche
Member

As also commented on the PR, there is a similar difference between unique and factorize:

In [1]: s = pd.Series([np.nan, 3, 3, None, np.nan], dtype=object)

In [2]: pd.unique(s)
Out[2]: array([nan, 3, None], dtype=object)

In [4]: pd.factorize(s)
Out[4]: (array([-1,  0,  0, -1, -1]), Int64Index([3], dtype='int64'))

Factorize treats them all as identical (since it needs to substitute all missing values with -1), while unique treats them as separate values.

unique still takes a different code path than Series.duplicated from the original post (Series.duplicated is implemented through pd.core.algorithms.duplicated, which uses the pd._libs.hashtable.duplicated_{dtype} methods).
So not exactly the same code path, but the same underlying problem surfaces.

@h-vetinari
Contributor Author

@jorisvandenbossche

The difference between unique and factorize is documented, at least, although I don't know what exactly factorize is used for, and whether those two functions should maybe be married like I'm suggesting in #22986 with ignore_na.

(The discrepancy for unique at least causes no direct bugs like the one here for .duplicated - df.unique does not exist, and pd.unique doesn't handle DataFrames.)

@jorisvandenbossche
Member

Yes, it is true that for unique and factorize this is more or less expected/documented, while the duplicated case is clearly a bug (it's just that it comes from the same underlying difference in the handling of NaN values).

@simonjayhawkins
Member

For the example in the OP, the DataFrame result is now consistent with the Series result.

"fixed" in commit: [235113e] PERF: Improve performance for df.duplicated with one column subset (#45534) cc @phofl

The issue remains for a DataFrame with more than one column, or with a subset of more than one column.

import numpy as np
import pandas as pd

s = pd.Series([np.nan, 3, 3, None, np.nan], dtype=object)

s.duplicated()
# 0    False
# 1    False
# 2     True
# 3    False
# 4     True
# dtype: bool

s.to_frame().duplicated()
# 0    False
# 1    False
# 2     True
# 3    False
# 4     True
# dtype: bool

s.to_frame().assign(dup=lambda x: x[0]).duplicated()
# 0    False
# 1    False
# 2     True
# 3     True
# 4     True
# dtype: bool
