df.duplicated and drop_duplicates raise TypeError with unhashable values. #12693

Open
Abrosimov-a-a opened this issue Mar 22, 2016 · 5 comments
Labels
Bug; duplicated: duplicated, drop_duplicates; Nested Data: data where the values are collections (lists, sets, dicts, objects, etc.)

Comments

@Abrosimov-a-a

IN:

import pandas as pd
df = pd.DataFrame([[{'a', 'b'}], [{'b','c'}], [{'b', 'a'}]])
df

OUT:

    0
0   {a, b}
1   {c, b}
2   {a, b}

IN:

df.duplicated()

OUT:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-77-7cc63ba1ed41> in <module>()
----> 1 df.duplicated()

venv/lib/python3.5/site-packages/pandas/util/decorators.py in wrapper(*args, **kwargs)
     89                 else:
     90                     kwargs[new_arg_name] = new_arg_value
---> 91             return func(*args, **kwargs)
     92         return wrapper
     93     return _deprecate_kwarg

venv/lib/python3.5/site-packages/pandas/core/frame.py in duplicated(self, subset, keep)
   3100 
   3101         vals = (self[col].values for col in subset)
-> 3102         labels, shape = map(list, zip(*map(f, vals)))
   3103 
   3104         ids = get_group_index(labels, shape, sort=False, xnull=False)

TypeError: type object argument after * must be a sequence, not map

I expect:

0    False
1    False
2     True
dtype: bool

pd.show_versions() output:

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.3.0-1-amd64
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: ru_RU.UTF-8

pandas: 0.18.0
nose: None
pip: 1.5.6
setuptools: 18.8
Cython: None
numpy: 1.10.4
scipy: None
statsmodels: None
xarray: None
IPython: 4.1.2
sphinx: None
patsy: None
dateutil: 2.5.1
pytz: 2016.1
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: None
@jreback
Contributor

jreback commented Mar 22, 2016

I guess. You are using a list-like value INSIDE a cell of a frame. This is quite inefficient and not generally supported. Pull requests to fix this are accepted in any event.

@jreback added the Bug, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), and Difficulty Intermediate labels Mar 22, 2016
@jreback added this to the Next Major Release milestone Mar 22, 2016
@kokes
Contributor

kokes commented Dec 13, 2018

Current pandas gives a slightly different TypeError (TypeError: unhashable type: 'set'), which gets to the point: how would you deduplicate sets or lists? Unlike tuples and primitive types, they are not hashable (sets could be converted to frozensets, which are hashable), so you have to come up with a deduplication strategy.

In any case, since you're dealing with an object dtype, there is no guarantee that the next row won't contain a set or a list, so this deduplication only gets worse from there. Pandas treats each value separately and processes values as long as they are hashable. Try a column with three tuples and it will work; then change the last one to a set and it will fail on that very value.

So I'm not sure there's a solid implementation that would work here, given that lists are not hashable. There could potentially be a fix for sets, which would be converted to frozensets upon hash map insertion, but that seems hacky and arbitrary.
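
The hashability point can be checked in plain Python, without pandas. The snippet below is only an illustrative sketch; `is_hashable` is a helper name made up for this example, not a pandas function.

```python
def is_hashable(value):
    """Report whether hash(value) succeeds."""
    try:
        hash(value)
    except TypeError:
        return False
    return True


# A tuple, a list, a set, and a frozenset holding the same items.
candidates = [("a", "b"), ["a", "b"], {"a", "b"}, frozenset({"a", "b"})]
hashable_flags = [is_hashable(v) for v in candidates]

# A tuple is only hashable if everything inside it is hashable.
nested_ok = is_hashable(("a", ["b"]))

# Equal frozensets share a hash, so a hash-based lookup treats them as one key.
seen = {frozenset({"a", "b"})}
found_again = frozenset({"b", "a"}) in seen
```

Only the tuple and the frozenset are hashable here, and the frozenset lookup succeeds regardless of the order in which the set was written.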

@itamar-precog

How about ignoring unhashable columns for the purposes of dropping duplicates?
For example, add a kwarg 'unhashable_type' whose default is 'raise' (the current behavior), but which can be set to 'ignore' (at the risk of dropping rows which aren't entirely duplicated).
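
Until such a kwarg exists, the 'ignore' behavior can be approximated in user code. This is a minimal sketch, assuming that a column counts as unhashable if any of its values fails `hash()`. The helper names `is_hashable` and `drop_duplicates_ignoring_unhashable` are hypothetical and not part of pandas.

```python
import pandas as pd


def is_hashable(value):
    """Report whether hash(value) succeeds."""
    try:
        hash(value)
    except TypeError:
        return False
    return True


def drop_duplicates_ignoring_unhashable(df, keep="first"):
    """Drop duplicates using only the columns whose values are all hashable."""
    hashable_cols = [c for c in df.columns if df[c].map(is_hashable).all()]
    if not hashable_cols:
        # Nothing to compare on, so return the frame unchanged.
        return df.copy()
    return df.drop_duplicates(subset=hashable_cols, keep=keep)


# The "tags" column holds sets, so only "id" is used for the comparison.
df = pd.DataFrame({"id": [1, 1, 2], "tags": [{"a"}, {"b"}, {"a"}]})
result = drop_duplicates_ignoring_unhashable(df)
```

Row 1 is dropped even though its `tags` value differs from row 0, which is exactly the risk mentioned above.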

@mroeschke added the duplicated (duplicated, drop_duplicates) and Nested Data (data where the values are collections: lists, sets, dicts, objects, etc.) labels and removed the Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) label Apr 23, 2021
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Jun 10, 2022
@simonjayhawkins
Member

The case in the OP is fixed on main

print(pd.__version__)
df = pd.DataFrame([[{"a", "b"}], [{"b", "c"}], [{"b", "a"}]])
print(df.duplicated())
print(df.drop_duplicates())
1.5.0.dev0+867.gdf8acf4201
0    False
1    False
2     True
dtype: bool
        0
0  {a, b}
1  {b, c}

and for lists too

df = pd.DataFrame([[["a", "b"]], [["b"]], [["a", "b"]]])
print(df.duplicated())
print(df.drop_duplicates())
0    False
1    False
2     True
dtype: bool
        0
0  [a, b]
1     [b]

fixed in commit: [235113e] PERF: Improve performance for df.duplicated with one column subset (#45534)

but it still fails for a multi-column DataFrame

print(pd.__version__)
df = pd.DataFrame([[{"a", "b"}], [{"b", "c"}], [{"b", "a"}]]).T
print(df.duplicated())
TypeError: unhashable type: 'set'
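
For the multi-column case, one possible workaround is to compute the duplicate mask on a hashable copy of the frame and apply it to the original. This is a sketch rather than a fix in pandas itself: `freeze` is a hypothetical helper, and the example frame is made up for illustration.

```python
import pandas as pd


def freeze(value):
    """Return a hashable stand-in for sets and lists; pass other values through.

    Caveat: a list and a tuple with the same items map to the same key.
    """
    if isinstance(value, (set, frozenset)):
        return frozenset(freeze(v) for v in value)
    if isinstance(value, (list, tuple)):
        return tuple(freeze(v) for v in value)
    return value


df = pd.DataFrame(
    {"tags": [{"a", "b"}, {"b", "c"}, {"b", "a"}], "group": [1, 1, 1]}
)

# Build a frame of hashable keys column by column, then reuse its mask.
keys = df.apply(lambda col: col.map(freeze))
mask = keys.duplicated()
deduplicated = df[~mask]
```

The original set values stay in `deduplicated`; only the temporary `keys` frame holds frozensets.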

@simonjayhawkins changed the title from "df.duplicated and drop_duplicates raise TypeError with set and list values." to "df.duplicated and drop_duplicates raise TypeError with unhashable values." Jun 11, 2022
@mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@MichaelTiemannOSC
Contributor

I have a test case that also throws this error, when trying to use uncertainties in anything other than a Series (or one-column DataFrame):

import pandas as pd
import uncertainties as un
import pint
from pint import Quantity as Q_
import pint_pandas

x = pd.DataFrame({'a': [un.ufloat(1.0, 0.0), un.ufloat(1.0, 0.0)]})

if len(x) == len(x.drop_duplicates())+1:
    print("simple comparison of ufloats, works")
else:
    print("simple comparison of ufloats failed")
    assert False

x = pd.DataFrame({'a': [un.ufloat(1.0, 0.0)*2+1, un.ufloat(1.0, 0.0)*2+1]})

if len(x) == len(x.drop_duplicates())+1:
    print("comparison of Affine Scalar values (simple or with quantity meters) works")
else:
    print("comparison of Affine Scalar values (simple or with quantity meters) failed")
    assert False

x = pd.DataFrame({'a': [Q_(un.ufloat(1.0, 0.0), 'm'), Q_(un.ufloat(1.0, 0.0), 'm')]})

if not x.compare(x.drop_duplicates()).empty:
    print("simple comparison of ufloat meters works")
else:
    print("simple comparison of ufloat meters, failed")

x = pd.DataFrame({'a': [un.ufloat(1.0, 0.0)*2+1, un.ufloat(1.0, 0.0)],
                  'b': [un.ufloat(2.0, 0.0)*2+1, un.ufloat(2.0, 0.0)]})

if not x.compare(x.drop_duplicates()).empty:
    print("comparison of Affine Scalar values (multi-column) works")
else:
    print("comparison of Affine Scalar values (multi-column) failed")

Not only does the third case fail (using a combination of uncertainties and quantities), but the fourth case fails with the aforementioned TypeError:

Traceback (most recent call last):
  File "pandas-dropdups.py", line 33, in <module>
    if not x.compare(x.drop_duplicates()).empty:
  File "python3.9/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "python3.9/site-packages/pandas/core/frame.py", line 6669, in drop_duplicates
    duplicated = self.duplicated(subset, keep=keep)
  File "python3.9/site-packages/pandas/core/frame.py", line 6811, in duplicated
    labels, shape = map(list, zip(*map(f, vals)))
  File "python3.9/site-packages/pandas/core/frame.py", line 6779, in f
    labels, shape = algorithms.factorize(vals, size_hint=len(self))
  File "python3.9/site-packages/pandas/core/algorithms.py", line 818, in factorize
    codes, uniques = factorize_array(
  File "python3.9/site-packages/pandas/core/algorithms.py", line 574, in factorize_array
    uniques, codes = table.factorize(
  File "pandas/_libs/hashtable_class_helper.pxi", line 5943, in pandas._libs.hashtable.PyObjectHashTable.factorize
  File "pandas/_libs/hashtable_class_helper.pxi", line 5857, in pandas._libs.hashtable.PyObjectHashTable._unique
TypeError: unhashable type: 'AffineScalarFunc'

AffineScalarFunc is a synonym for UFloat from the uncertainties package. It results from a ufloat(nominal_value, error_value) having math done to it, making it Affine and no longer simply a ufloat.
