unique and NaN entries (Trac #1514) #2111

Closed
thouis opened this issue Oct 19, 2012 · 15 comments · Fixed by #18070
@thouis (Contributor) commented Oct 19, 2012

Original ticket http://projects.scipy.org/numpy/ticket/1514 on 2010-06-18 by trac user rspringuel, assigned to unknown.

When unique operates on an array containing multiple NaN entries, its result includes one NaN for each NaN in the original array.

Examples:

    a = random.randint(5, size=100).astype(float)
    a[12] = nan  # add a single nan entry
    unique(a)
    array([ 0., 1., 2., 3., 4., NaN])
    a[20] = nan  # add a second
    unique(a)
    array([ 0., 1., 2., 3., 4., NaN, NaN])
    a[13] = nan  # and a third
    unique(a)
    array([ 0., 1., 2., 3., 4., NaN, NaN, NaN])

This is probably due to the fact that x == y evaluates to False when both x and y are NaN. unique needs to have "or (isnan(x) and isnan(y))" added to the conditional that checks for the presence of a value among the already identified values. I don't know where unique lives in numpy and couldn't find it when I went looking, so I can't make the change myself (or even be sure what the exact syntax of the conditional should be).
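
For reference, a minimal illustration of the comparison behavior at the root of this (standard IEEE 754 semantics, not specific to numpy), together with the extra check suggested above:

    import numpy as np

    x, y = np.nan, np.nan
    x == y                         # False: NaN never compares equal, even to itself
    np.isnan(x) and np.isnan(y)    # True: the proposed supplementary check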

Also, the following function can be used to patch over the behavior.

    import numpy

    def nanunique(x):
        # numpy.unique already sorts and deduplicates everything except NaNs,
        # which compare unequal to each other; collapse those by hand.
        a = numpy.unique(x)
        r = []
        for i in a:
            # skip values already collected; treat any NaN as equal to any other
            if i in r or (numpy.isnan(i) and numpy.any(numpy.isnan(r))):
                continue
            else:
                r.append(i)
        return numpy.array(r)
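
For example, on a small input (output shown for illustration; all NaNs collapse into a single entry):

    >>> a = numpy.array([3.0, numpy.nan, 1.0, numpy.nan, 3.0])
    >>> nanunique(a)
    array([ 1.,  3., nan])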


@charris (Member) commented Feb 19, 2014

Fixed.

charris closed this as completed Feb 19, 2014
@maxalbert (Contributor) commented

I'm still seeing this issue with latest master. Which commit should have fixed it? Unless I'm missing something, I'd suggest re-opening this issue.

charris reopened this Jan 22, 2015
@jaimefrio (Member) commented

This is easy to fix for floats, but I don't see an easy way out for complex or structured dtypes. Will put a quick PR together and we can discuss the options there.

jaimefrio added a commit to jaimefrio/numpy that referenced this issue Jan 23, 2015: "Works for floats, but not for complex or structured dtypes"
@charris (Member) commented Jan 23, 2015

@jaimefrio I have it fixed for unique using

    if issubclass(aux.dtype.type, np.inexact):
        # nans always compare unequal, so encode as integers
        tmp = aux.searchsorted(aux)
    else:
        tmp = aux
    flag = np.concatenate(([True], tmp[1:] != tmp[:-1]))

but it looks like all the other operations also have problems. Maybe we need nan_equal, nan_not_equal ufuncs, or maybe something in nanfunctions.
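
To illustrate why the searchsorted trick works (a sketch; aux is assumed already sorted, which np.unique guarantees at this point): since NaNs sort to the end, every NaN maps to the index of the first NaN, so the encoded integers do compare equal:

    import numpy as np

    aux = np.array([0., 1., 2., np.nan, np.nan])   # already sorted
    aux[3] == aux[4]                 # False: the NaNs compare unequal
    tmp = aux.searchsorted(aux)      # array([0, 1, 2, 3, 3]): both NaNs encode as 3
    flag = np.concatenate(([True], tmp[1:] != tmp[:-1]))
    aux[flag]                        # array([ 0.,  1.,  2., nan]): one NaN survives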

@jaimefrio (Member) commented

Searchsorting aux for itself is a smart trick! Searchsorting all of it is a little wasteful, though: ideally we would want to spot the first entry with a NaN. Perhaps something along the lines of the following, after creating aux and flag as right now:

if not aux[-1] == aux[-1]:          # last entry is NaN, so the array has NaNs
    nanidx = np.argmin(aux == aux)  # index of the first NaN (first False)
    nanaux = aux[nanidx:].searchsorted(aux[nanidx:])
    flag[nanidx+1:] = nanaux[1:] != nanaux[:-1]

or something similar, after correcting all of the off-by-one errors that I have likely introduced there.
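
For what it's worth, a quick check of that sketch on a small sorted input suggests the indexing works out (assuming numpy is imported as np):

    import numpy as np

    aux = np.array([0., 1., np.nan, np.nan, np.nan])
    flag = np.concatenate(([True], aux[1:] != aux[:-1]))
    if not aux[-1] == aux[-1]:
        nanidx = np.argmin(aux == aux)                    # 2: first NaN
        nanaux = aux[nanidx:].searchsorted(aux[nanidx:])  # array([0, 0, 0])
        flag[nanidx+1:] = nanaux[1:] != nanaux[:-1]
    aux[flag]                                             # array([ 0.,  1., nan])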

@jaimefrio (Member) commented

This last approach of mine would work for float and complex types, but fail for structured dtypes with floating point fields. But I still think that the searchsorting trick, even though it would work for all types, is too wasteful. Some timings:

In [10]: a = np.random.randn(1000)

In [11]: %timeit np.unique(a)
10000 loops, best of 3: 69.5 us per loop

In [12]: b = np.sort(a)

In [13]: %timeit b.searchsorted(b)
10000 loops, best of 3: 28.1 us per loop

That's going to be a 40% performance hit, which may be OK for a nanunique function, but probably not for the general case.

@Demetrio92 commented Sep 23, 2019

2019 called, the OP problem is still valid and the code is reproducible.

@jaimefrio why can't we have it as an option that is off by default?

I mean, this behaviour is confusing at best, and performance is not an excuse.

@mattip (Member) commented Sep 23, 2019

@Demetrio92 while I appreciate your attempt to get this issue moving, irony/sarcasm on the internet can be interpreted differently by different people; please keep it kind. For some of us performance is very important, and we don't casually add code that slows things down.

PR #5487 may be a better place to comment or make suggestions on how to move forward.

Edit: fix PR number

@urimerhav commented

This issue has been open for 8 years, but I just want to chime in with a +1 for making the default behavior of numpy.unique correct rather than fast. This broke my code, and I'm sure others have suffered or will suffer from it. We could have an optional "fast=False" argument and document the NaN behavior when it is enabled. I'd be surprised if np.unique is often the performance bottleneck in time-critical applications.

@ufmayer commented Jun 6, 2020

I ran into the same issue today. The core of the np.unique routine computes a mask on an unravelled, sorted array in numpy/lib/arraysetops.py to find where the values change in that sorted array:

    mask = np.empty(aux.shape, dtype=np.bool_)
    mask[:1] = True
    mask[1:] = aux[1:] != aux[:-1]

This could be replaced by something like the following, which is pretty much along the lines of jaimefrio's comment from about 5 years ago, but avoids the argmin call:

    mask = np.empty(aux.shape, dtype=np.bool_)
    mask[:1] = True
    if (aux.shape[0] > 0 and isinstance(aux[-1], (float, np.float16,
                                                  np.float32, np.float64))
            and np.isnan(aux[-1])):
        aux_firstnan = np.searchsorted(aux, np.nan, side='left')
        mask[1:aux_firstnan] = (aux[1:aux_firstnan] != aux[:aux_firstnan-1])
        mask[aux_firstnan] = True
        mask[aux_firstnan+1:] = False
    else:
        mask[1:] = aux[1:] != aux[:-1]
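
On a small input, the patched mask collapses the NaN run into a single representative. A sketch (assuming aux is the sorted, unravelled input and contains at least one NaN, so the float-and-NaN guard above is taken):

    import numpy as np

    a = np.array([1.0, np.nan, 2.0, np.nan, 1.0])
    aux = np.sort(a.ravel())                                  # [ 1.  1.  2. nan nan]
    aux_firstnan = np.searchsorted(aux, np.nan, side='left')  # 3
    mask = np.empty(aux.shape, dtype=np.bool_)
    mask[:1] = True
    mask[1:aux_firstnan] = aux[1:aux_firstnan] != aux[:aux_firstnan-1]
    mask[aux_firstnan] = True        # keep exactly one NaN
    mask[aux_firstnan+1:] = False    # drop the rest of the NaN run
    aux[mask]                        # array([ 1.,  2., nan])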

Running a few %timeit experiments, I observed at most a <10% runtime penalty when the array is large and there are very few NaNs (say 10 NaNs out of 1 million); for such large arrays it actually runs faster if there are lots of NaNs.

On the other hand, if the arrays are small (for example, 10 entries) there is a significant performance hit, because the check for float and NaN is relatively expensive and runtime can increase severalfold. This applies even if there is no NaN, as it's the check itself that's slow.

If the array does have NaNs then it produces a different result, combining the NaNs, which is the point of it all. So for that case it's really a question of getting the desired result (all NaNs combined into a single value group) slightly slower versus getting an undesired result (each NaN in its own value group) slightly faster.

Finally, note that this patch wouldn't fix finding unique values involving compound objects containing NaNs, such as in this example:

    a = np.array([[0, 1], [np.nan, 1], [np.nan, 1]])
    np.unique(a, axis=0)

which would still return

array([[ 0.,  1.],
       [nan,  1.],
       [nan,  1.]])

@dderiso commented Jul 15, 2020

"If the array does have NaNs then it produces a different result, combining the NaNs, which is the point of it all."

+1

A function that returns a list containing repeated elements, e.g. a list with more than one NaN, should not be called "unique". If repeated elements are desired in the NaN case, then that should be a special case that's disabled by default, for example numpy.unique(..., keep_NaN=False).

@Demetrio92 commented
@ufmayer submit a PR!

@dmitra79 commented
+1
I would also support returning NaN only once.

@aerobio commented Nov 23, 2020

+1
As long as NaNs appear at the end of sorted arrays, this is what I'm using for 1D arrays for the time being:

>>> a = np.array([8, 1, np.nan, 3, np.inf, -np.inf, -2, np.nan, 3])
>>> unique = np.unique(a)
>>> unique
array([-inf,  -2.,   1.,   3.,   8.,  inf,  nan,  nan])
>>> truly_unique = unique[:np.argmax(unique)+1]
>>> truly_unique
array([-inf,  -2.,   1.,   3.,   8.,  inf,  nan])
