df.duplicated and drop_duplicates raise TypeError with unhashable values. #12693

Abrosimov-a-a · 2016-03-22T17:32:44Z

IN:

import pandas as pd
df = pd.DataFrame([[{'a', 'b'}], [{'b','c'}], [{'b', 'a'}]])
df

OUT:

    0
0   {a, b}
1   {c, b}
2   {a, b}

IN:

df.duplicated()

OUT:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-77-7cc63ba1ed41> in <module>()
----> 1 df.duplicated()

venv/lib/python3.5/site-packages/pandas/util/decorators.py in wrapper(*args, **kwargs)
     89                 else:
     90                     kwargs[new_arg_name] = new_arg_value
---> 91             return func(*args, **kwargs)
     92         return wrapper
     93     return _deprecate_kwarg

venv/lib/python3.5/site-packages/pandas/core/frame.py in duplicated(self, subset, keep)
   3100 
   3101         vals = (self[col].values for col in subset)
-> 3102         labels, shape = map(list, zip(*map(f, vals)))
   3103 
   3104         ids = get_group_index(labels, shape, sort=False, xnull=False)

TypeError: type object argument after * must be a sequence, not map

I expect:

0    False
1    False
2     True
dtype: bool

pd.show_versions() output:

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.3.0-1-amd64
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: ru_RU.UTF-8

pandas: 0.18.0
nose: None
pip: 1.5.6
setuptools: 18.8
Cython: None
numpy: 1.10.4
scipy: None
statsmodels: None
xarray: None
IPython: 4.1.2
sphinx: None
patsy: None
dateutil: 2.5.1
pytz: 2016.1
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: None


jreback · 2016-03-22T20:05:49Z

I guess. You are using a list-like value INSIDE a cell of a frame. This is quite inefficient and not generally supported. Pull requests to fix this are accepted in any event.

kokes · 2018-12-13T20:56:08Z

Current pandas gives a slightly different TypeError (TypeError: unhashable type: 'set'), which does get to the point - how would you deduplicate sets or lists? Unlike tuples and primitive types, these are not hashable (sets could be converted to frozensets, which are hashable), so you have to come up with a deduplication strategy.

In any case, since you're dealing with an object dtype, there is no guarantee that the next row won't contain a set or a list, so this deduplication gets only worse from then on. So pandas treats each value as a separate one and processes them as long as they are hashable. Just try a column with three tuples, it will work, then change the last one to be a set and it will fail on that very value.

So, I'm not sure there's a solid implementation that would work here, given the lack of hashability in lists. There could potentially be a fix for sets, which would be converted to frozensets upon hash map insertion, but that does seem hacky and arbitrary.
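To make the hashability point concrete, here is a minimal pure-Python sketch. It needs no pandas, and the helper name `is_hashable` is made up for illustration. It only shows which built-in containers can be hashed and that two frozensets built in a different order still hash and compare equal.

```python
def is_hashable(value):
    """Return True if hash(value) succeeds (hypothetical helper, not a pandas API)."""
    try:
        hash(value)
    except TypeError:
        return False
    return True


values = [("a", "b"), {"a", "b"}, ["a", "b"], frozenset({"a", "b"})]

# Tuples and frozensets are hashable; sets and lists are not.
flags = [is_hashable(v) for v in values]
print(flags)

# Element order does not matter for frozensets: equal contents, equal hash.
same_hash = hash(frozenset({"a", "b"})) == hash(frozenset({"b", "a"}))
same_value = frozenset({"a", "b"}) == frozenset({"b", "a"})
print(same_hash, same_value)
```

Note that a tuple is only hashable if everything inside it is, so a tuple holding a list would fail the same check.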

itamar-precog · 2020-01-16T11:37:32Z

How about ignoring unhashable columns for the purposes of dropping duplicates?
For example, add a kwarg 'unhashable_type' whose default is 'raise' (the current behaviour), but which can be set to 'ignore' (at the risk of dropping rows which aren't entirely duplicated).
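A rough user-side sketch of the 'ignore' behaviour proposed above, assuming it means "deduplicate on the hashable columns only". The function name `drop_duplicates_ignoring_unhashable` is hypothetical; no such kwarg or function exists in pandas. It relies only on the existing `subset` argument of `drop_duplicates`.

```python
import pandas as pd


def _is_hashable(value):
    # hash() raises TypeError for sets, lists, dicts, etc.
    try:
        hash(value)
    except TypeError:
        return False
    return True


def drop_duplicates_ignoring_unhashable(df, keep="first"):
    """Hypothetical helper: drop duplicates using only fully hashable columns."""
    hashable_cols = [
        col for col in df.columns if df[col].map(_is_hashable).all()
    ]
    if not hashable_cols:
        # Nothing to compare on, so return the frame unchanged.
        return df.copy()
    return df.drop_duplicates(subset=hashable_cols, keep=keep)


df = pd.DataFrame({"id": [1, 1, 2], "tags": [{"a"}, {"b"}, {"c"}]})
result = drop_duplicates_ignoring_unhashable(df)
print(result)
```

In this example the second row is dropped even though its `tags` value differs from the first row's, which is exactly the risk mentioned in the proposal.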

simonjayhawkins · 2022-06-10T21:06:34Z

The case in the OP is fixed on main

print(pd.__version__)
df = pd.DataFrame([[{"a", "b"}], [{"b", "c"}], [{"b", "a"}]])
print(df.duplicated())
print(df.drop_duplicates())

1.5.0.dev0+867.gdf8acf4201
0    False
1    False
2     True
dtype: bool
        0
0  {a, b}
1  {b, c}

and for lists too

df = pd.DataFrame([[["a", "b"]], [["b"]], [["a", "b"]]])
print(df.duplicated())
print(df.drop_duplicates())

0    False
1    False
2     True
dtype: bool
        0
0  [a, b]
1     [b]

fixed in commit: [235113e] PERF: Improve performance for df.duplicated with one column subset (#45534)

but it still fails for a multi-column DataFrame

print(pd.__version__)
df = pd.DataFrame([[{"a", "b"}], [{"b", "c"}], [{"b", "a"}]]).T
print(df.duplicated())

TypeError: unhashable type: 'set'
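Until the multi-column case is handled, one possible workaround is to convert the unhashable values to hashable equivalents on a temporary copy and use the resulting mask on the original frame. This is only a sketch: the `freeze` helper and the example column names are made up here, and it handles top-level sets and lists only, not nested containers.

```python
import pandas as pd


def freeze(value):
    """Hypothetical helper: map sets to frozensets and lists to tuples."""
    if isinstance(value, (set, frozenset)):
        return frozenset(value)
    if isinstance(value, list):
        return tuple(value)
    return value


df = pd.DataFrame(
    {
        "x": [1, 2, 1],
        "tags": [{"a", "b"}, {"b", "c"}, {"b", "a"}],
    }
)

# Build a hashable copy column by column; the original frame is untouched.
frozen = df.copy()
for col in frozen.columns:
    frozen[col] = frozen[col].map(freeze)

mask = frozen.duplicated()
deduped = df[~mask]
print(mask)
print(deduped)
```

The third row is flagged as a duplicate of the first because both `x` and the contents of `tags` match, regardless of the order in which the set was written.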

MichaelTiemannOSC · 2022-12-29T04:08:44Z

I have a test case that also throws this error, when trying to use uncertainties in anything other than a Series (or one-column DataFrame):

import pandas as pd
import uncertainties as un
import pint
from pint import Quantity as Q_
import pint_pandas

x = pd.DataFrame({'a': [un.ufloat(1.0, 0.0), un.ufloat(1.0, 0.0)]})

if len(x) == len(x.drop_duplicates())+1:
    print("simple comparison of ufloats, works")
else:
    print("simple comparison of ufloats failed")
    assert False

x = pd.DataFrame({'a': [un.ufloat(1.0, 0.0)*2+1, un.ufloat(1.0, 0.0)*2+1]})

if len(x) == len(x.drop_duplicates())+1:
    print("comparison of Affine Scalar values (simple or with quantity meters) works")
else:
    print("comparison of Affine Scalar values (simple or with quantity meters) failed")
    assert False

x = pd.DataFrame({'a': [Q_(un.ufloat(1.0, 0.0), 'm'), Q_(un.ufloat(1.0, 0.0), 'm')]})

if not x.compare(x.drop_duplicates()).empty:
    print("simple comparison of ufloat meters works")
else:
    print("simple comparison of ufloat meters, failed")

x = pd.DataFrame({'a': [un.ufloat(1.0, 0.0)*2+1, un.ufloat(1.0, 0.0)],
                  'b': [un.ufloat(2.0, 0.0)*2+1, un.ufloat(2.0, 0.0)]})

if not x.compare(x.drop_duplicates()).empty:
    print("comparison of Affine Scalar values (multi-column) works")
else:
    print("comparison of Affine Scalar values (multi-column) failed")

Not only does the third case fail (using a combination of uncertainties and quantities), but the fourth case fails with the aforementioned TypeError:

Traceback (most recent call last):
  File "pandas-dropdups.py", line 33, in <module>
    if not x.compare(x.drop_duplicates()).empty:
  File "python3.9/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "python3.9/site-packages/pandas/core/frame.py", line 6669, in drop_duplicates
    duplicated = self.duplicated(subset, keep=keep)
  File "python3.9/site-packages/pandas/core/frame.py", line 6811, in duplicated
    labels, shape = map(list, zip(*map(f, vals)))
  File "python3.9/site-packages/pandas/core/frame.py", line 6779, in f
    labels, shape = algorithms.factorize(vals, size_hint=len(self))
  File "python3.9/site-packages/pandas/core/algorithms.py", line 818, in factorize
    codes, uniques = factorize_array(
  File "python3.9/site-packages/pandas/core/algorithms.py", line 574, in factorize_array
    uniques, codes = table.factorize(
  File "pandas/_libs/hashtable_class_helper.pxi", line 5943, in pandas._libs.hashtable.PyObjectHashTable.factorize
  File "pandas/_libs/hashtable_class_helper.pxi", line 5857, in pandas._libs.hashtable.PyObjectHashTable._unique
TypeError: unhashable type: 'AffineScalarFunc'

AffineScalarFunc is a synonym for UFloat from the uncertainties package. It results from a ufloat(nominal_value, error_value) having math done to it, making it Affine and no longer simply a ufloat.

jreback added the Bug, Missing-data, and Difficulty Intermediate labels Mar 22, 2016

jreback added this to the Next Major Release milestone Mar 22, 2016

WillAyd mentioned this issue Apr 9, 2019

drop_duplicates throws error when series stores numpy array #25965

Closed

jbrockmendel removed Difficulty Intermediate labels Oct 21, 2019

mroeschke added the duplicated and Nested Data labels and removed the Missing-data label Apr 23, 2021

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Jun 10, 2022

code sample for pandas-dev#12693

33cb4a9

simonjayhawkins changed the title ~~df.duplicated and drop_duplicates raise TypeError with set and list values.~~ df.duplicated and drop_duplicates raise TypeError with unhashable values. Jun 11, 2022

simonjayhawkins mentioned this issue Jun 11, 2022

Unstable hashtable / duplicated algo for object dtype #27035

Open

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
