python3: regression for unique on dtype=object arrays with varying items types (Trac #2188) #641

numpy-gitbot · 2012-10-19T15:10:50Z

Original ticket http://projects.scipy.org/numpy/ticket/2188 on 2012-07-23 by @yarikoptic, assigned to unknown.

tested against current master (present in 1.6.2 as well):

If with python2.x series it works ok, without puking:

$> python2.7 -c 'import numpy as np; print repr(repr(np.unique(np.array([1,2, None, "str"]))))' 
'array([None, 1, 2, str], dtype=object)'

NB I will report a bug on repr here separately if not yet filed

it fails with python3.x altogether:

$> python3.2 -c 'import numpy as np; print(repr(repr(np.unique(np.array([1,2,None, "str"])))))'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.2/dist-packages/numpy/lib/arraysetops.py", line 194, in unique
    ar.sort()
TypeError: unorderable types: int() > NoneType()

whenever IMHO it must operate correctly -- semantic of unique() action should not imply ability to sort the elements

The text was updated successfully, but these errors were encountered:

yarikoptic · 2013-09-17T15:47:28Z

any fresh ideas on this issue?

njsmith · 2013-09-17T16:27:45Z

The only options for implementing unique are:

sorting the array
putting everything in a hash table
do brute-force == comparison on all objects against all objects

Only the sorting and hashing strategies have reasonable speed, and only the sorting and all-against-all strategies have reasonable memory overhead for large arrays. So I guess we could add fallback options to unique where if sorting doesn't work then it tries one of the other strategies? But OTOH it's not nice to have a function that sometimes suddenly takes massively more CPU or memory depending on what input you give it.

I guess I'd be +1 on a patch that adds a strategy={"sort", "hash", "bruteforce"} option to np.unique, so users with weird data can decide on what makes sense for their situation. If you want to write such a thing :-)

yarikoptic · 2013-09-17T16:40:07Z

at first I wondered if it could be sorting + hash table for unsortable items (didn't check if cmp of elements is used when sorting elements of dtype object array) so sorting __cmp__ could position them on 'first-come-first-in-line' order?
but then realized that it doesn't provide a relief in general for uncomparable types, e.g. when it is a mix of int and str... so I wondered if for dtype=object it could be feasible to deduce first participating dtypes and 'unique' (possibly via sort) within each dtype possibly relying on hash tables for dtypes without __cmp__ ?

yarikoptic · 2013-09-17T16:46:28Z

just for those who might need a workaround, here is how I do it via 'hashing' through the built in sets for my case:

$> python3.3 -c 'import numpy as np; print(np.array(list(set([1,2,"str", None])), dtype=object))' 
[None 1 2 'str']

njsmith · 2013-09-17T16:48:20Z

Not sure what you just said :-)

But on further thought sorting is really not reliable for dtype=object
anyway. I've probably written dozens of classes that override eq but
keep the default lt, which means that sort-based unique will just
silently return the wrong answer. This is a pretty nasty bug I guess.

If the objects are hashable then you can just do set(arr) to get the unique
elements, but no guarantee that they're hashable in general. (But at least
everyone that for hashable objects this should work, which is not true
for sorting.) Maybe this would be a better default implementation of
np.unique for object arrays.

On Tue, Sep 17, 2013 at 5:40 PM, Yaroslav Halchenko <
notifications@github.com> wrote:

at first I wondered if it could be sorting + hash table for unsortable
items (didn't check if cmp of elements is used when sorting elements of
dtype object array) so sorting cmp could position them on
'first-come-first-in-line' order?
but then realized that it doesn't provide a relief in general for
uncomparable types, e.g. when it is a mix of int and str... so I wondered
if for dtype=object it could be feasible to deduce first participating
dtypes and 'unique' (possibly via sort) within each dtype possibly relying
on hash tables for dtypes without cmp ?

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/641#issuecomment-24603047
.

yarikoptic · 2013-09-17T17:00:50Z

gy... ok -- cruel description in Python:

def bucketed_unique(a):
    buckets = {}
    for x in a:
        t = type(x)
        if not (t in buckets):
            buckets[t] = bucket = []
        else:
            bucket = buckets[t]
        bucket.append(x)
    out = []
    for bucket in buckets.itervalues():
        # here could be actually set of conditions instead of blind try/except
        try:
            out.append(np.unique(bucket))
        except:
            out.append(np.array(list(set(bucket)), dtype=object))
    return np.hstack(out)

print bucketed_unique([1, 2, 'str', None, np.nan, None, np.inf, int])
[1 2 'str' None <type 'int'> nan inf]

yarikoptic · 2013-09-17T17:01:51Z

sure thing -- no 'bucketing' should be done for non-object ndarrays

njsmith · 2013-09-17T18:51:46Z

That algorithm doesn't use == as its definition of uniqueness. Objects of
different types can be ==. (Easy example: 1, 1.0). Its definition doesn't
correspond to any standard python concept.
On 17 Sep 2013 18:01, "Yaroslav Halchenko" notifications@github.com wrote:

sure thing -- no 'bucketing' should be done for non-object ndarrays

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/641#issuecomment-24604740
.

yarikoptic · 2013-09-17T19:10:13Z

indeed! not sure but may be post-hoc analysis across buckets would make sense... btw atm problem reveals itself also for comparison with complex numbers:

$> python3.3 -c 'import numpy as np; print(np.unique(np.array([1, 1.0, 1+0j], dtype=object)))'  
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/numpy/lib/arraysetops.py", line 194, in unique
    ar.sort()
TypeError: unorderable types: float() > complex()

$> python -c 'import numpy as np; print(np.unique(np.array([1, 1.0, 1+0j], dtype=object)))' 
[1]

yarikoptic · 2013-09-17T19:13:02Z

although on the 2nd thought -- what should the 'unique' value dtype then be among all available choices (int/float/complex)? with non-object array it is clear... with heterogeneous object array, not so -- may be even different dtypes should be maintained as such...

jreback · 2014-02-01T16:22:07Z

Here's the way I solved argsort blowing up on mixed int/str in py3: https://github.com/pydata/pandas/pull/6222/files

order the ints before the strings in object dtypes
use a hashtable to map the locations to get the indexer
reasonably fast I think

uses pandas hashtable implementation but could easily be swapped out / adapted to c-code I think

charris · 2014-02-16T05:13:05Z

Anyone want to take a swing at this? Not sure what to do about record dtypes.

sobayed · 2016-10-24T12:29:11Z

Any updates on this? I encountered this bug when trying to use scikit-learn's LabelEncoder on pandas DataFrame columns with dtype "object" and missing values

- python3 requires decoding the bytes objects to produce strings - np.unique() fails for dtype=object in python3 (see numpy/numpy#641) - flake, pydocstyle

* WIP: add basic GDF1 and GDF2 support only tested on a few GDF test files doesn't break existing EDF tests * Remove \x00 from channel names * Fix data calibration - Unified offset and calibration values for EDF/BDF/GDF modules - few PEP8 fixes - TODO: GDF 2.x triggers are detected as multiples of 256 - TODO: overflow and saturation detection * Proper imports imports in test_gdf.py * Fix to test_edf.py An additional warning has been added in edf.py when Lowpass information is not available in file header. As a result test_stim_channel() would fail. The test should now expect 2 warnings. * Separated EDF and GDF code - EDF and GDF file headers are now read in their own private subfunction - some PEP8 and cosmetic changes - GDF events are still multiples of 256 * Add biosig GDF test file make Travis happy * python3 fixes - python3 requires decoding the bytes objects to produce strings - np.unique() fails for dtype=object in python3 (see numpy/numpy#641) - flake, pydocstyle * Read biosemi GDF stim_channel properly * Speed up reading from file - fromstring -> fromfile * Test routines and added safechecks Also remove test datafiles (see mne-tools/mne-testing-data#23) * Update utils, and missing lowpass info no longer issues warning * Added stim_channel argument check * Address comments * Address comments and update doc * Address comments * Docstyle

ron819 · 2018-11-13T15:32:43Z

This one is really old. Is it still relevant?

yarikoptic · 2018-11-13T21:02:45Z

seems to be the case at least with 1.15.4 in debian:

$> python3 --version
Python 3.6.5

$> PYTHONPATH=.. python3 -c 'import numpy as np; print(np.__version__); print(repr(repr(np.unique(np.array([1,2,None, "str"])))))'                                                                                   
1.15.4
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/numpy/lib/arraysetops.py", line 233, in unique
    ret = _unique1d(ar, return_index, return_inverse, return_counts)
  File "/usr/lib/python3/dist-packages/numpy/lib/arraysetops.py", line 281, in _unique1d
    ar.sort()
TypeError: '<' not supported between instances of 'NoneType' and 'int'

acdr000 · 2019-02-18T08:39:41Z

Definitely still relevant. Just came across this, trying to call np.unique(x, return_inverse=True) on an object array.

Regarding the question of how to make this work, when sorting is undefined: I'd much prefer a slow algorithm over the status quo of raising an error. (In my experience, oftentimes, if you need performant algorithms, you shouldn't be using an object array to begin with.)

eric-wieser · 2019-02-18T08:46:36Z

I think this is a feature request, not a bug. The docs clearly state:

Returns the sorted unique elements of an array.

For the case of an array like [1, None], no such sorted array exists in python 3 since sorting is no longer well-defined.

charris · 2019-02-18T14:29:03Z

It would be nice to have an option to not return a sorted array, it would allow some optimizations.

mcgfeller · 2021-03-24T13:26:19Z

This was the most bothering regression in porting about 150K lines from Py2.7 to 3.8. Navigating around this lead to a lot of ugly workarounds and slower code. I see that this is caused by Guido's decision in Python, but a fast, non-sorted option for unique would be really helpful.

mayer79 · 2024-02-09T21:03:14Z

Same words as @mcgfeller . I would even take a slowish non-sorted version.

numpy-gitbot mentioned this issue Oct 23, 2012

repr for string item within array of dtype object is missing quotes (Trac #1361) #1959

Closed

jreback mentioned this issue Feb 2, 2014

DOC/PY3: np.unique errors on pd.Series with mixed string/float in python3 pandas-dev/pandas#6229

Closed

rmcgibbo mentioned this issue Jul 29, 2014

[WIP] Refactor MSM msmbuilder/msmbuilder#222

Merged

pfctdayelise mentioned this issue Aug 5, 2014

Fix for stats.mode to work on character arrays. Closes #3814 scipy/scipy#3857

Merged

juliantaylor mentioned this issue Oct 2, 2014

np.unique no longer works with fixed float and string arrays in Python 3 #5145

Closed

astrofrog mentioned this issue Oct 2, 2014

More Python 3 fixes glue-viz/glue#457

Merged

Rambatino mentioned this issue Oct 22, 2016

Added ordinal variable type to independent variables Rambatino/CHAID#45

Merged

nbara pushed a commit to nbara/mne-python that referenced this issue Apr 19, 2017

python3 fixes

c655d71

- python3 requires decoding the bytes objects to produce strings - np.unique() fails for dtype=object in python3 (see numpy/numpy#641) - flake, pydocstyle

milags mentioned this issue Jul 6, 2017

Inverse Transform for Label Encoder with mixed strings and numbers returns only strings scikit-learn/scikit-learn#9285

Open

jreback mentioned this issue Jul 15, 2017

TST: Don't assert that a bug exists in numpy pandas-dev/pandas#16940

Merged

h-vetinari mentioned this issue Jun 7, 2018

ENH: adding .unique() to DF (or return_inverse for duplicated) pandas-dev/pandas#21357

Open

h-vetinari mentioned this issue Jul 3, 2018

BUG: df.duplicated treats None as np.nan in object columns pandas-dev/pandas#21720

Open

mattip removed the priority: normal label Oct 21, 2018

WarrenWeckesser mentioned this issue Dec 31, 2019

Getting a Type Error while using np.unique with array of dtype=object #15199

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

python3: regression for unique on dtype=object arrays with varying items types (Trac #2188) #641

python3: regression for unique on dtype=object arrays with varying items types (Trac #2188) #641

numpy-gitbot commented Oct 19, 2012

yarikoptic commented Sep 17, 2013

njsmith commented Sep 17, 2013

yarikoptic commented Sep 17, 2013

yarikoptic commented Sep 17, 2013

njsmith commented Sep 17, 2013

yarikoptic commented Sep 17, 2013 •

edited by eric-wieser

yarikoptic commented Sep 17, 2013

njsmith commented Sep 17, 2013

yarikoptic commented Sep 17, 2013

yarikoptic commented Sep 17, 2013

jreback commented Feb 1, 2014

charris commented Feb 16, 2014

sobayed commented Oct 24, 2016

ron819 commented Nov 13, 2018

yarikoptic commented Nov 13, 2018

acdr000 commented Feb 18, 2019

eric-wieser commented Feb 18, 2019

charris commented Feb 18, 2019

mcgfeller commented Mar 24, 2021 •

edited

mayer79 commented Feb 9, 2024

python3: regression for unique on dtype=object arrays with varying items types (Trac #2188) #641

python3: regression for unique on dtype=object arrays with varying items types (Trac #2188) #641

Comments

numpy-gitbot commented Oct 19, 2012

yarikoptic commented Sep 17, 2013

njsmith commented Sep 17, 2013

yarikoptic commented Sep 17, 2013

yarikoptic commented Sep 17, 2013

njsmith commented Sep 17, 2013

yarikoptic commented Sep 17, 2013 • edited by eric-wieser

yarikoptic commented Sep 17, 2013

njsmith commented Sep 17, 2013

yarikoptic commented Sep 17, 2013

yarikoptic commented Sep 17, 2013

jreback commented Feb 1, 2014

charris commented Feb 16, 2014

sobayed commented Oct 24, 2016

ron819 commented Nov 13, 2018

yarikoptic commented Nov 13, 2018

acdr000 commented Feb 18, 2019

eric-wieser commented Feb 18, 2019

charris commented Feb 18, 2019

mcgfeller commented Mar 24, 2021 • edited

mayer79 commented Feb 9, 2024

yarikoptic commented Sep 17, 2013 •

edited by eric-wieser

mcgfeller commented Mar 24, 2021 •

edited