unique and NaN entries (Trac #1514) #2111

Closed
thouis opened this issue Oct 19, 2012 · 15 comments · Fixed by #18070
@thouis (Contributor) commented Oct 19, 2012

Original ticket http://projects.scipy.org/numpy/ticket/1514 on 2010-06-18 by trac user rspringuel, assigned to unknown.

When unique operates on an array containing multiple NaN entries, its result includes one NaN for each NaN in the original array.

Examples:

    a = random.randint(5, size=100).astype(float)
    a[12] = nan  # add a single nan entry
    unique(a)
    array([ 0., 1., 2., 3., 4., NaN])
    a[20] = nan  # add a second
    unique(a)
    array([ 0., 1., 2., 3., 4., NaN, NaN])
    a[13] = nan  # and a third
    unique(a)
    array([ 0., 1., 2., 3., 4., NaN, NaN, NaN])

This is probably due to the fact that x == y evaluates to False when both x and y are NaN. unique needs to have "or (isnan(x) and isnan(y))" added to the conditional that checks for the presence of a value among the already identified values. I don't know where unique lives in numpy and couldn't find it when I went looking, so I can't make the change myself (or even be sure what the exact syntax of the conditional should be).
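
For reference, a minimal illustration of the comparison behavior at the root of this (standard IEEE 754 semantics, not specific to numpy), together with the extra check suggested above:

    import numpy as np

    x, y = np.nan, np.nan
    x == y                         # False: NaN never compares equal, even to itself
    np.isnan(x) and np.isnan(y)    # True: the proposed supplementary check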

Also, the following function can be used to patch over the behavior.

    import numpy

    def nanunique(x):
        # numpy.unique already sorts and deduplicates everything except NaNs,
        # which compare unequal to each other; collapse those by hand.
        a = numpy.unique(x)
        r = []
        for i in a:
            # skip values already collected; treat any NaN as equal to any other
            if i in r or (numpy.isnan(i) and numpy.any(numpy.isnan(r))):
                continue
            else:
                r.append(i)
        return numpy.array(r)
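
For example, on a small input (output shown for illustration; all NaNs collapse into a single entry):

    >>> a = numpy.array([3.0, numpy.nan, 1.0, numpy.nan, 3.0])
    >>> nanunique(a)
    array([ 1.,  3., nan])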


@charris (Member) commented Feb 19, 2014

Fixed.

charris closed this as completed Feb 19, 2014
@maxalbert (Contributor) commented

I'm still seeing this issue with latest master. Which commit should have fixed it? Unless I'm missing something, I'd suggest re-opening this issue.

charris reopened this Jan 22, 2015
@jaimefrio (Member) commented

This is easy to fix for floats, but I don't see an easy way out for complex or structured dtypes. Will put a quick PR together and we can discuss the options there.

jaimefrio added a commit to jaimefrio/numpy that referenced this issue Jan 23, 2015: "Works for floats, but not for complex or structured dtypes"
@charris (Member) commented Jan 23, 2015

@jaimefrio I have it fixed for unique using

    if issubclass(aux.dtype.type, np.inexact):
        # nans always compare unequal, so encode as integers
        tmp = aux.searchsorted(aux)
    else:
        tmp = aux
    flag = np.concatenate(([True], tmp[1:] != tmp[:-1]))

but it looks like all the other operations also have problems. Maybe we need nan_equal, nan_not_equal ufuncs, or maybe something in nanfunctions.
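
To illustrate why the searchsorted trick works (a sketch; aux is assumed already sorted, which np.unique guarantees at this point): since NaNs sort to the end, every NaN maps to the index of the first NaN, so the encoded integers do compare equal:

    import numpy as np

    aux = np.array([0., 1., 2., np.nan, np.nan])   # already sorted
    aux[3] == aux[4]                 # False: the NaNs compare unequal
    tmp = aux.searchsorted(aux)      # array([0, 1, 2, 3, 3]): both NaNs encode as 3
    flag = np.concatenate(([True], tmp[1:] != tmp[:-1]))
    aux[flag]                        # array([ 0.,  1.,  2., nan]): one NaN survives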

@jaimefrio (Member) commented

Searchsorting aux for itself is a smart trick! Searchsorting all of it is a little wasteful, though: ideally we would want to spot the first entry with a NaN. Perhaps something along the lines of the following, after creating aux and flag as right now:

if not aux[-1] == aux[-1]:          # last entry is NaN, so the array has NaNs
    nanidx = np.argmin(aux == aux)  # index of the first NaN (first False)
    nanaux = aux[nanidx:].searchsorted(aux[nanidx:])
    flag[nanidx+1:] = nanaux[1:] != nanaux[:-1]

or something similar, after correcting all of the off-by-one errors that I have likely introduced there.
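
For what it's worth, a quick check of that sketch on a small sorted input suggests the indexing works out (assuming numpy is imported as np):

    import numpy as np

    aux = np.array([0., 1., np.nan, np.nan, np.nan])
    flag = np.concatenate(([True], aux[1:] != aux[:-1]))
    if not aux[-1] == aux[-1]:
        nanidx = np.argmin(aux == aux)                    # 2: first NaN
        nanaux = aux[nanidx:].searchsorted(aux[nanidx:])  # array([0, 0, 0])
        flag[nanidx+1:] = nanaux[1:] != nanaux[:-1]
    aux[flag]                                             # array([ 0.,  1., nan])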

@jaimefrio (Member) commented

This last approach of mine would work for float and complex types, but fail for structured dtypes with floating point fields. But I still think that the searchsorting trick, even though it would work for all types, is too wasteful. Some timings:

In [10]: a = np.random.randn(1000)

In [11]: %timeit np.unique(a)
10000 loops, best of 3: 69.5 us per loop

In [12]: b = np.sort(a)

In [13]: %timeit b.searchsorted(b)
10000 loops, best of 3: 28.1 us per loop

That's going to be a 40% performance hit, which may be OK for a nanunique function, but probably not for the general case.

@Demetrio92 commented Sep 23, 2019

2019 called, the OP problem is still valid and the code is reproducible.

@jaimefrio why can't we have it as an option that is off by default?

I mean, this behaviour is confusing at best, and performance is not an excuse.

@mattip (Member) commented Sep 23, 2019

@Demetrio92 while I appreciate your attempt to get this issue moving, irony/sarcasm on the internet can be interpreted differently by different people; please keep it kind. For some of us performance is very important, and we don't casually add code that slows things down.

PR #5487 may be a better place to comment or make suggestions on how to move forward.

Edit: fix PR number

@urimerhav commented

This issue has been open for 8 years, but I just want to chime in with a +1 for making the default behavior of numpy.unique correct rather than fast. This broke my code, and I'm sure others have suffered or will suffer from it. We could have an optional "fast=False" argument and document the NaN behavior when it is enabled. I'd be surprised if np.unique is often the performance bottleneck in time-critical applications.

@ufmayer commented Jun 6, 2020

I ran into the same issue today. The core of the np.unique routine computes a mask on an unravelled, sorted array in numpy/lib/arraysetops.py to find where the values change in that sorted array:

    mask = np.empty(aux.shape, dtype=np.bool_)
    mask[:1] = True
    mask[1:] = aux[1:] != aux[:-1]

This could be replaced by something like the following, which is pretty much along the lines of jaimefrio's comment from about 5 years ago, but avoids the argmin call:

    mask = np.empty(aux.shape, dtype=np.bool_)
    mask[:1] = True
    if (aux.shape[0] > 0 and isinstance(aux[-1], (float, np.float16,
                                                  np.float32, np.float64))
            and np.isnan(aux[-1])):
        aux_firstnan = np.searchsorted(aux, np.nan, side='left')
        mask[1:aux_firstnan] = (aux[1:aux_firstnan] != aux[:aux_firstnan-1])
        mask[aux_firstnan] = True
        mask[aux_firstnan+1:] = False
    else:
        mask[1:] = aux[1:] != aux[:-1]
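
On a small input, the patched mask collapses the NaN run into a single representative. A sketch (assuming aux is the sorted, unravelled input and contains at least one NaN, so the float-and-NaN guard above is taken):

    import numpy as np

    a = np.array([1.0, np.nan, 2.0, np.nan, 1.0])
    aux = np.sort(a.ravel())                                  # [ 1.  1.  2. nan nan]
    aux_firstnan = np.searchsorted(aux, np.nan, side='left')  # 3
    mask = np.empty(aux.shape, dtype=np.bool_)
    mask[:1] = True
    mask[1:aux_firstnan] = aux[1:aux_firstnan] != aux[:aux_firstnan-1]
    mask[aux_firstnan] = True        # keep exactly one NaN
    mask[aux_firstnan+1:] = False    # drop the rest of the NaN run
    aux[mask]                        # array([ 1.,  2., nan])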

Running a few %timeit experiments, I observed at most a <10% runtime penalty when the array is large and there are very few NaNs (say 10 NaNs out of 1 million); for such large arrays it actually runs faster if there are lots of NaNs.

On the other hand, if the arrays are small (for example, 10 entries) there is a significant performance hit, because the check for float and NaN is relatively expensive and runtime can increase severalfold. This applies even if there is no NaN, as it's the check itself that's slow.

If the array does have NaNs then it produces a different result, combining the NaNs, which is the point of it all. So for that case it's really a question of getting the desired result (all NaNs combined into a single value group) slightly slower versus getting an undesired result (each NaN in its own value group) slightly faster.

Finally, note that this patch wouldn't fix finding unique values involving compound objects containing NaNs, such as in this example:

    a = np.array([[0, 1], [np.nan, 1], [np.nan, 1]])
    np.unique(a, axis=0)

which would still return

array([[ 0.,  1.],
       [nan,  1.],
       [nan,  1.]])

@dderiso commented Jul 15, 2020

"If the array does have NaNs then it produces a different result, combining the NaNs, which is the point of it all."

+1

A function that returns a list containing repeated elements, e.g. a list with more than one NaN, should not be called "unique". If repeated elements are desired in the NaN case, then that should be a special case that's disabled by default, for example numpy.unique(..., keep_NaN=False).

@Demetrio92 commented
@ufmayer submit a PR!

@dmitra79 commented
+1
I would also support returning NaN only once.

@aerobio commented Nov 23, 2020

+1
As long as NaNs appear at the end of sorted arrays, this is what I'm using for 1D arrays for the time being:

>>> a = np.array([8, 1, np.nan, 3, np.inf, -np.inf, -2, np.nan, 3])
>>> unique = np.unique(a)
>>> unique
array([-inf,  -2.,   1.,   3.,   8.,  inf,  nan,  nan])
>>> truly_unique = unique[:np.argmax(unique)+1]
>>> truly_unique
array([-inf,  -2.,   1.,   3.,   8.,  inf,  nan])
