
Specify NaN behaviour in unique() #249

Closed
honno opened this issue Sep 1, 2021 · 18 comments · Fixed by #310
Labels
API change Changes to existing functions or objects in the API.

@honno
Member

honno commented Sep 1, 2021

Say we have an array with multiple NaN values:

>>> x = xp.full(5, xp.nan)
>>> x
Array([nan, nan, nan, nan, nan])

Should NaNs be treated as unique to one another?

>>> xp.unique(x)
Array([nan, nan, nan, nan, nan])

Or should they be treated as the same?

>>> xp.unique(x)
Array([nan])

I have no dog in this fight, but you might want to refer to the discussion at the bottom of numpy/numpy#19301 and on the NumPy mailing list, which relates to a recent accidental "regression" in how np.unique() deals with NaNs.

In either case, simply specifying the behaviour would prevent users from having to create wrapper methods to standardise it. For example, I created this (admittedly scrappy) utility method for HypothesisWorks/hypothesis#3065 to work in both scenarios, which @asmeurer noted was not ideal and probably a failing of the Array API spec.

def count_unique(array):
    n_unique = 0
    nan_index = xp.isnan(array)
    for isnan, count in zip(*xp.unique(nan_index, return_counts=True)):
        if isnan:
            n_unique += count
            break
    filtered_array = array[~nan_index]
    unique_array = xp.unique(filtered_array)
    n_unique += unique_array.size
    return n_unique

If deterministic NaN behaviour cannot be part of the API, a note should be added to say this behaviour is out of scope.

@pmeier

pmeier commented Sep 1, 2021

IMO this depends on the definition of "unique". If it means "value equality", it should return all occurrences of NaN, since assert float("nan") != float("nan"). If it means "identity", it should return only one NaN, since assert float("nan") is float("nan"). (See #249 (comment))

If I'm not mistaken, in this context "unique" means value equality, since we are dealing with values and not objects (is np.unique even defined for the object dtype?). To satisfy both requirements, we could add an equal_nan flag that, if True, treats NaNs as equal. This is similar to what most testing functions do, e.g. np.testing.assert_allclose.
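As a rough illustration (not a spec proposal), here is one way the proposed equal_nan flag could behave, sketched as a hypothetical NumPy wrapper that computes both semantics itself, so it doesn't depend on what the underlying unique() does with NaNs:

```python
import numpy as np

def unique(x, *, equal_nan=True):
    # hypothetical wrapper sketching the proposed equal_nan semantics
    nan_mask = np.isnan(x)
    non_nan_unique = np.unique(x[~nan_mask])  # NaN-free input, so unambiguous
    n_nans = int(nan_mask.sum())
    if n_nans == 0:
        return non_nan_unique
    # either collapse all NaNs into one entry, or keep each one distinct
    n_out = 1 if equal_nan else n_nans
    return np.concatenate([non_nan_unique, np.full(n_out, np.nan)])
```

With equal_nan=True, unique(np.array([1.0, np.nan, 1.0, np.nan])) yields array([1., nan]); with equal_nan=False it yields array([1., nan, nan]).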

@rgommers
Member

rgommers commented Sep 1, 2021

Are there clear use cases for returning multiple NaNs? As @thomasjpfan said in numpy/numpy#19301, a single NaN is what is expected for ML applications, and in other contexts I'd expect that to be true as well. I can't remember any case where I needed multiple distinct NaNs.

@asmeurer
Member

asmeurer commented Sep 1, 2021

How do pytorch and tensorflow (and any others) handle this? Do they also have discussions about this behavior?

@pmeier

pmeier commented Sep 2, 2021

For PyTorch I'm not aware of a discussion about this. As for the behavior, it seems to use the value equality approach I described in my comment above:

>>> t = torch.tensor([1.0, float("nan"), 1.0, float("nan")])
>>> torch.unique(t)
tensor([nan, nan, 1.])

@Zac-HD

Zac-HD commented Sep 4, 2021

assert float("nan") is float("nan")

>>> float('nan') is float('nan')
False

math.nan is math.nan, sure, but otherwise identity of separately-constructed objects is implementation-defined and probably False. The only reliable exceptions are singletons (None, False, True, etc.) and some implementation-defined details like CPython's interning for small integers and strings.
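For the record, the distinction is easy to demonstrate on CPython (the `is` results below are CPython behaviour, not a language guarantee):

```python
import math

# two separately constructed NaN floats are distinct objects on CPython
a, b = float("nan"), float("nan")
print(a is b)                 # False: two different float objects
print(math.nan is math.nan)   # True: the same pre-constructed object
```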

(personally I favor the "return one nan" option, but we'd better document whatever we decide)

@pmeier

pmeier commented Sep 4, 2021

@Zac-HD my bad, thanks for the heads-up. I've fixed my comment above.

@rgommers
Member

rgommers commented Sep 5, 2021

TensorFlow and CuPy also return multiple NaNs. CuPy will almost certainly change that to match NumPy. For TensorFlow I also could not find an issue. It seems likely that in all libraries the multiple-NaN return is just a natural result of implementing unique via comparisons without any special handling, not because it's useful behaviour.

@leofang
Contributor

leofang commented Sep 5, 2021

It'd be very annoying to implement on GPUs if we were to return a unique nan. In any case unique relies on sort or argsort, and it creates a thread divergence that hurts performance just to check special cases like this. We were (=I was) bothered enough to follow NumPy's way to handle complex nan ordering, and would prefer to not add any more special cases. (You can imagine it'd only hurt performance more by an additional call to isnan just to mark where the nan starts in a sorted array, count the occurrence so as to allocate output memory, etc)

@rgommers
Member

rgommers commented Sep 6, 2021

Hmm, that's a good point. Although it raises the question of what users do then; they usually need to handle the NaNs anyway, by skipping or combining them.

Your argument means adding an equal_nan flag doesn't help either, since you'd still have to implement the special-casing.

@honno
Member Author

honno commented Sep 6, 2021

Could equal_nan be a kwarg which must support None (i.e. whatever xp.unique() would do by default) but can omit either or both of True and False? The spec doesn't allow omission of specific arguments per se, but it does allow omission of boolean indexing (technically a specified argument for get/set item), so maybe it's not too "magical" a compromise.

It could even be that a library which knows how its xp.unique() treats NaNs allows the corresponding equal_nan bool to just work, and raises a helpful error for the opposite value.
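A hypothetical sketch of that compromise, for a library whose unique() always collapses NaNs (names and error message are illustrative, not from the spec):

```python
import numpy as np

def unique(x, *, equal_nan=None):
    # hypothetical: this library always collapses NaNs, so it can honour
    # equal_nan=None/True but must reject equal_nan=False with a clear error
    if equal_nan is False:
        raise ValueError(
            "unique() in this library always treats NaNs as equal; "
            "equal_nan=False is unsupported"
        )
    nan_mask = np.isnan(x)
    out = np.unique(x[~nan_mask])
    if nan_mask.any():
        out = np.concatenate([out, np.full(1, np.nan)])  # single NaN entry
    return out
```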

@jakirkham
Member

As an aside, we discovered this behavior on the call today.

In [1]: import numpy as np

In [2]: {float('nan'), float('nan')}
Out[2]: {nan, nan}

In [3]: {np.nan, np.nan}
Out[3]: {nan}

@jakirkham
Member

From a suggestion in the call, what if a unique implementation filtered out nans first? Something like this?

a_notnan = a[~np.isnan(a)]
r = np.unique(a_notnan)

@pmeier

pmeier commented Sep 16, 2021

@jakirkham

As an aside, we discovered this behavior on the call today.

In [1]: import numpy as np

In [2]: {float('nan'), float('nan')}
Out[2]: {nan, nan}

In [3]: {np.nan, np.nan}
Out[3]: {nan}

That is the same trap I fell into: since np.nan is only constructed once, both elements are the identical object and thus get deduplicated by the set:

In[1]: nan = float("nan")

In[2]: {nan, nan}
Out[2]: {nan}

@jakirkham
Member

Yeah was thinking that might be the case. Thanks for sharing Philip 🙂

@honno
Member Author

honno commented Sep 19, 2021

From a suggestion in the call, what if a unique implementation filtered out nans first? Something like this?

a_notnan = a[~np.isnan(a)]
r = np.unique(a_notnan)

A limitation with this example specifically is that boolean indexing might not be supported by an Array API library (it's optional behaviour in the spec; see #84). You could manually zip an array a with its NaN mask xp.isnan(a) to construct a unique array, but this could be painfully slow and is generally an awkward implementation (see Leo Fang's comment above).
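For illustration, a NumPy-only variant of the filter-first idea that replaces boolean indexing with integer take(). Note this only sidesteps boolean *indexing*: np.nonzero also has a data-dependent output shape, so it raises the same portability concerns for Array API libraries.

```python
import numpy as np

a = np.array([1.0, np.nan, 1.0, np.nan])
# integer indices of the non-NaN entries, then an integer-indexed gather
keep = np.nonzero(~np.isnan(a))[0]
r = np.unique(np.take(a, keep))  # NaN-free input, so no ambiguity
```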

@kgryte kgryte added the API change Changes to existing functions or objects in the API. label Oct 4, 2021
@kgryte kgryte added this to the v2021 milestone Oct 4, 2021
@honno
Member Author

honno commented Oct 11, 2021

In either the proposed unique_all() (#275) or the current unique(..., return_counts=True), I just wanted to clarify what counting behaviour we can expect from both NaNs-unique and NaNs-not-unique scenarios.

I assume that we get the following if NaNs are unique to each other:

>>> x = xp.full(5, xp.nan)
>>> values, _, _, counts = xp.unique_all(x)
>>> values
Array([nan, nan, nan, nan, nan])
>>> counts
Array([1, 1, 1, 1, 1])

And likewise if NaNs are not unique to each other:

>>> x = xp.full(5, xp.nan)
>>> values, _, _, counts = xp.unique_all(x)
>>> values
Array([nan])
>>> counts
Array([5])

As you can see, the first scenario is a bit awkward for finding NaN counts, as you'd still need to iterate over values to make sure you hit every NaN.

Also note that counting behaviour alone does not give us a consistent count of unique values across the two scenarios.
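One way to get a consistent unique count regardless of which behaviour a library implements is to count NaNs separately from the NaN-free values. A NumPy sketch (helper name and keyword are hypothetical):

```python
import numpy as np

def count_unique(x, *, nans_distinct=False):
    # count NaNs ourselves so the result doesn't depend on how the
    # library's unique() happens to treat them
    nan_mask = np.isnan(x)
    n_nans = int(nan_mask.sum())
    n = int(np.unique(x[~nan_mask]).size)
    # either every NaN counts, or all NaNs collapse to one value
    return n + (n_nans if nans_distinct else min(n_nans, 1))
```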

@asmeurer
Member

I think we have to make counts and values consistent with one another, regardless of which NaN approach we take. It would be nonsensical to return something like values = array([nan, nan, nan, nan, nan]) with counts = array([5]); then values and counts wouldn't even correspond to each other.

@kgryte
Contributor

kgryte commented Oct 21, 2021

Comparison among environments:

  • MATLAB: NaNs are distinct.

    > A = [5 5 NaN NaN];
    > C = unique(A)
    C = 1×3
    
     5   NaN   NaN
  • Julia: returns only a single NaN

    julia> unique([1,1,NaN,2,3,NaN,NaN,NaN])
    4-element Array{Float64,1}:
       1.0
       NaN  
       2.0
       3.0

    The unique implementation is based on Set:

    julia> Set([1.0,2.0,NaN,2.0,NaN,3.0,NaN])
    Set([NaN, 2.0, 3.0, 1.0])

    And uses isequal for determining whether values are equal to one another.

    julia> isequal([1., NaN], [1., NaN])
    true
    
    julia> [1., NaN] == [1., NaN]
    false

    As an aside, Julia's unique and Set treat 0.0 and -0.0 as distinct.

    julia> unique([0.0,-0.0])
    2-element Array{Float64,1}:
      0.0
     -0.0
  • Python: depends on use of set.

    In [1]: set([float('nan'), float('nan')])                                                                                                     
    Out[1]: {nan, nan}
    
    In [2]: nan = float('nan');
    
    In [3]: set([nan,nan])                                                                                                                        
    Out[3]: {nan}
  • Torch: NaNs are distinct.

    >>> t = torch.tensor([1.0, float("nan"), 1.0, float("nan")])
    >>> torch.unique(t)
    tensor([nan, nan, 1.])
  • TensorFlow: NaNs are distinct.

    In [1]: tf.unique([1.0,2.0,np.nan,2.0,np.nan,3.0])
    Out[1]: Unique(y=<tf.Tensor: shape=(5,), dtype=float32, numpy=array([ 1.,  2., nan, nan,  3.], dtype=float32)>, idx=<tf.Tensor: shape=(6,), dtype=int32, numpy=array([0, 1, 2, 1, 3, 4], dtype=int32)>)
    
    In [2]: nan = float('nan');
    
    In [3]: tf.unique([1.0,2.0,nan,2.0,nan,3.0])
    Out[3]: Unique(y=<tf.Tensor: shape=(5,), dtype=float32, numpy=array([ 1.,  2., nan, nan,  3.], dtype=float32)>, idx=<tf.Tensor: shape=(6,), dtype=int32, numpy=array([0, 1, 2, 1, 3, 4], dtype=int32)>)
  • NumPy: previously treated NaNs as distinct; as of v1.21, returns only a single NaN.

Downstream Libraries

  • In Matplotlib, no accommodation appears to be made for np.unique returning multiple NaNs.
  • In SciPy, there is one documented instance of using a workaround for multiple NaNs.
  • Scikit-learn has a small wrapper around np.unique to handle multiple NaNs. This wrapper assumes sorted unique values.

Proposal

  • unique should return multiple NaNs (in line with previous NumPy behaviour and other libraries).

  • Specify sort order for floating-point values (see gh-288)

  • To return only a single NaN, users can implement a similar workaround to sklearn:

    values = xp.unique_values(x)
    sorted_values = xp.sort(values)
    if bool(xp.isnan(sorted_values[-1])):
        # find index of the first NaN occurrence...
        idx = ...
        sorted_values = sorted_values[:idx + 1]

    There would be some additional work for getting the indices, inverse_indices, and counts.

    Another workaround for users is to select a sentinel value other than NaN to indicate a "missing" value for the purposes of unique. For example, if values are known to be integers on the interval [1,100], replace NaN values with 0 before calling unique.
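The sentinel workaround from the last bullet, sketched in NumPy (the data and the choice of 0.0 as sentinel are illustrative; it only works when the sentinel cannot collide with real values):

```python
import numpy as np

# values are known to lie in [1, 100], so 0.0 is free to stand in
# for "missing" before calling unique
x = np.array([3.0, np.nan, 3.0, 7.0, np.nan])
filled = np.where(np.isnan(x), 0.0, x)
r = np.unique(filled)  # all NaNs collapse into the single 0.0 sentinel
```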

In short, the issue of multiple NaN values seems, to me at least, to be a user concern and should thus be pushed to userland. While the utility of multiple NaN values in the output of unique is undoubtedly limited, I don't see why the prevention of this should be the responsibility of array libraries, especially when bearing this responsibility entails a performance cost.
