
Specify NaN behaviour in unique() #249

Closed
honno opened this issue Sep 1, 2021 · 18 comments · Fixed by #310
Labels
API change Changes to existing functions or objects in the API.

@honno
Member

honno commented Sep 1, 2021

Say we have an array with multiple NaN values:

>>> x = xp.full(5, xp.nan)
>>> x
Array([nan, nan, nan, nan, nan])

Should NaNs be treated as unique to one another?

>>> xp.unique(x)
Array([nan, nan, nan, nan, nan])

Or should they be treated as the same?

>>> xp.unique(x)
Array([nan])

I have no dog in this fight, but you might want to refer to the discussion at the bottom of numpy/numpy#19301 and on the NumPy mailing list, which relates to a recent accidental "regression" in how np.unique() deals with NaNs.

In either case, simply specifying the behaviour would prevent users from having to create wrapper methods to standardise it. For example, I created this (admittedly scrappy) utility method for HypothesisWorks/hypothesis#3065 to work in both scenarios, which @asmeurer noted was not ideal and probably a failing of the Array API spec.

def count_unique(array):
    n_unique = 0
    nan_index = xp.isnan(array)
    for isnan, count in zip(*xp.unique(nan_index, return_counts=True)):
        if isnan:
            n_unique += count
            break
    filtered_array = array[~nan_index]
    unique_array = xp.unique(filtered_array)
    n_unique += unique_array.size
    return n_unique

If deterministic NaN behaviour cannot be part of the API, a note should be added to say this behaviour is out of scope.

@pmeier

pmeier commented Sep 1, 2021

IMO this depends on the definition of "unique". If it means "value equality", it should return all occurrences of NaN, since assert float("nan") != float("nan"). If it means "identity", it should return only one NaN, since assert float("nan") is float("nan"). (See #249 (comment))

If I'm not mistaken, in this context "unique" means value equality, since we are dealing with values and not objects (is np.unique even defined for the object dtype?). To satisfy both requirements, we could add an equal_nan flag that, if True, treats NaNs as equal. This is similar to what most testing functions do, e.g. np.testing.assert_allclose.
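As a rough illustration (not a spec proposal), here is one way the proposed equal_nan flag could behave, sketched as a hypothetical NumPy wrapper that computes both semantics itself, so it doesn't depend on what the underlying unique() does with NaNs:

```python
import numpy as np

def unique(x, *, equal_nan=True):
    # hypothetical wrapper sketching the proposed equal_nan semantics
    nan_mask = np.isnan(x)
    non_nan_unique = np.unique(x[~nan_mask])  # NaN-free input, so unambiguous
    n_nans = int(nan_mask.sum())
    if n_nans == 0:
        return non_nan_unique
    # either collapse all NaNs into one entry, or keep each one distinct
    n_out = 1 if equal_nan else n_nans
    return np.concatenate([non_nan_unique, np.full(n_out, np.nan)])
```

With equal_nan=True, unique(np.array([1.0, np.nan, 1.0, np.nan])) yields array([1., nan]); with equal_nan=False it yields array([1., nan, nan]).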

@rgommers
Member

rgommers commented Sep 1, 2021

Are there clear use cases for returning multiple NaNs? As @thomasjpfan said in numpy/numpy#19301, a single NaN is what is expected for ML applications, and in other contexts I'd expect that to be true as well. I can't remember any case where I needed multiple distinct NaNs.

@asmeurer
Member

asmeurer commented Sep 1, 2021

How do pytorch and tensorflow (and any others) handle this? Do they also have discussions about this behavior?

@pmeier

pmeier commented Sep 2, 2021

For PyTorch I'm not aware of a discussion about this. As for the behavior, it seems to use the value equality approach I described in my comment above:

>>> t = torch.tensor([1.0, float("nan"), 1.0, float("nan")])
>>> torch.unique(t)
tensor([nan, nan, 1.])

@Zac-HD

Zac-HD commented Sep 4, 2021

assert float("nan") is float("nan")

>>> float('nan') is float('nan')
False

math.nan is math.nan, sure, but otherwise identity of separately-constructed objects is implementation-defined and probably False. The only reliable exceptions are singletons (None, False, True, etc.) and some implementation-defined details like CPython's interning for small integers and strings.
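For the record, the distinction is easy to demonstrate on CPython (the `is` results below are CPython behaviour, not a language guarantee):

```python
import math

# two separately constructed NaN floats are distinct objects on CPython
a, b = float("nan"), float("nan")
print(a is b)                 # False: two different float objects
print(math.nan is math.nan)   # True: the same pre-constructed object
```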

(personally I favor the "return one nan" option, but we'd better document whatever we decide)

@pmeier

pmeier commented Sep 4, 2021

@Zac-HD my bad, thanks for the heads-up. I've fixed my comment above.

@rgommers
Member

rgommers commented Sep 5, 2021

TensorFlow and CuPy also return multiple NaNs. CuPy will almost certainly change that to match NumPy. For TensorFlow I also could not find an issue. It seems likely that in all libraries the multiple-NaN return is just a natural result of implementing unique via comparisons without any special handling, not because it's useful behaviour.

@leofang
Contributor

leofang commented Sep 5, 2021

It'd be very annoying to implement on GPUs if we were to return a unique nan. In any case unique relies on sort or argsort, and it creates a thread divergence that hurts performance just to check special cases like this. We were (=I was) bothered enough to follow NumPy's way to handle complex nan ordering, and would prefer to not add any more special cases. (You can imagine it'd only hurt performance more by an additional call to isnan just to mark where the nan starts in a sorted array, count the occurrence so as to allocate output memory, etc)

@rgommers
Member

rgommers commented Sep 6, 2021

Hmm, that's a good point. Although it raises the question of what users do then; they usually need to handle the NaNs anyway, by skipping or combining them.

Your argument means adding an equal_nan flag doesn't help either, since you'd still have to implement the special-casing.

@honno
Member Author

honno commented Sep 6, 2021

Could equal_nan be a kwarg which must support None (i.e. whatever xp.unique() would do by default) but can omit either or both of True and False? The spec doesn't allow omission of specific arguments per se, but it does allow omission of boolean indexing (technically a specified argument for get/set item), so maybe it's not too "magical" a compromise.

It could even be that a library which knows how its xp.unique() treats NaNs allows the corresponding equal_nan bool to just work, and raises a helpful error for the opposite value.
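A hypothetical sketch of that compromise, for a library whose unique() always collapses NaNs (names and error message are illustrative, not from the spec):

```python
import numpy as np

def unique(x, *, equal_nan=None):
    # hypothetical: this library always collapses NaNs, so it can honour
    # equal_nan=None/True but must reject equal_nan=False with a clear error
    if equal_nan is False:
        raise ValueError(
            "unique() in this library always treats NaNs as equal; "
            "equal_nan=False is unsupported"
        )
    nan_mask = np.isnan(x)
    out = np.unique(x[~nan_mask])
    if nan_mask.any():
        out = np.concatenate([out, np.full(1, np.nan)])  # single NaN entry
    return out
```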

@jakirkham
Member

As an aside, we discovered this behavior on the call today.

In [1]: import numpy as np

In [2]: {float('nan'), float('nan')}
Out[2]: {nan, nan}

In [3]: {np.nan, np.nan}
Out[3]: {nan}

@jakirkham
Member

From a suggestion in the call, what if a unique implementation filtered out nans first? Something like this?

a_notnan = a[~np.isnan(a)]
r = np.unique(a_notnan)

@pmeier

pmeier commented Sep 16, 2021

@jakirkham

As an aside, we discovered this behavior on the call today.

In [1]: import numpy as np

In [2]: {float('nan'), float('nan')}
Out[2]: {nan, nan}

In [3]: {np.nan, np.nan}
Out[3]: {nan}

That is the same trap I fell into: since np.nan is only constructed once, both elements are the identical object and thus get deduplicated by the set:

In[1]: nan = float("nan")

In[2]: {nan, nan}
Out[2]: {nan}

@jakirkham
Member

Yeah was thinking that might be the case. Thanks for sharing Philip 🙂

@honno
Member Author

honno commented Sep 19, 2021

From a suggestion in the call, what if a unique implementation filtered out nans first? Something like this?

a_notnan = a[~np.isnan(a)]
r = np.unique(a_notnan)

A limitation with this example specifically is that boolean indexing might not be supported by an Array API library (it's optional behaviour in the spec; see #84). You could manually zip an array a with its NaN mask xp.isnan(a) to construct a unique array, but this could be painfully slow and is generally an awkward implementation (see Leo Fang's comment above).
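For illustration, a NumPy-only variant of the filter-first idea that replaces boolean indexing with integer take(). Note this only sidesteps boolean *indexing*: np.nonzero also has a data-dependent output shape, so it raises the same portability concerns for Array API libraries.

```python
import numpy as np

a = np.array([1.0, np.nan, 1.0, np.nan])
# integer indices of the non-NaN entries, then an integer-indexed gather
keep = np.nonzero(~np.isnan(a))[0]
r = np.unique(np.take(a, keep))  # NaN-free input, so no ambiguity
```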

@kgryte kgryte added the API change Changes to existing functions or objects in the API. label Oct 4, 2021
@kgryte kgryte added this to the v2021 milestone Oct 4, 2021
@honno
Member Author

honno commented Oct 11, 2021

In either the proposed unique_all() (#275) or the current unique(..., return_counts=True), I just wanted to clarify what counting behaviour we can expect from both NaNs-unique and NaNs-not-unique scenarios.

I assume that we get the following if NaNs are unique to each other:

>>> x = xp.full(5, xp.nan)
>>> values, _, _, counts = xp.unique_all(x)
>>> values
Array([nan, nan, nan, nan, nan])
>>> counts
Array([1, 1, 1, 1, 1])

And likewise if NaNs are not unique to each other:

>>> x = xp.full(5, xp.nan)
>>> values, _, _, counts = xp.unique_all(x)
>>> values
Array([nan])
>>> counts
Array([5])

As you can see, the first scenario is a bit awkward for finding NaN counts, as you'd still need to iterate over values to make sure you hit every NaN.

Also note that counting behaviour alone does not give us a consistent count of unique values across the two scenarios.
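One way to get a consistent unique count regardless of which behaviour a library implements is to count NaNs separately from the NaN-free values. A NumPy sketch (helper name and keyword are hypothetical):

```python
import numpy as np

def count_unique(x, *, nans_distinct=False):
    # count NaNs ourselves so the result doesn't depend on how the
    # library's unique() happens to treat them
    nan_mask = np.isnan(x)
    n_nans = int(nan_mask.sum())
    n = int(np.unique(x[~nan_mask]).size)
    # either every NaN counts, or all NaNs collapse to one value
    return n + (n_nans if nans_distinct else min(n_nans, 1))
```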

@asmeurer
Member

I think we have to make counts and values consistent with one another, regardless of which NaN approach we take. It would be nonsensical to return something like values = array([nan, nan, nan, nan, nan]) with counts = array([5]); then values and counts wouldn't even correspond to each other.

@kgryte
Contributor

kgryte commented Oct 21, 2021

Comparison among environments:

  • MATLAB: NaNs are distinct.

    > A = [5 5 NaN NaN];
    > C = unique(A)
    C = 1×3
    
     5   NaN   NaN
  • Julia: returns only a single NaN

    julia> unique([1,1,NaN,2,3,NaN,NaN,NaN])
    4-element Array{Float64,1}:
       1.0
       NaN  
       2.0
       3.0

    The unique implementation is based on Set:

    julia> Set([1.0,2.0,NaN,2.0,NaN,3.0,NaN])
    Set([NaN, 2.0, 3.0, 1.0])

    And uses isequal for determining whether values are equal to one another.

    julia> isequal([1., NaN], [1., NaN])
    true
    
    julia> [1., NaN] == [1., NaN]
    false

    As an aside, Julia's unique and Set treat 0.0 and -0.0 as distinct.

    julia> unique([0.0,-0.0])
    2-element Array{Float64,1}:
      0.0
     -0.0
  • Python: depends on use of set.

    In [1]: set([float('nan'), float('nan')])                                                                                                     
    Out[1]: {nan, nan}
    
    In [2]: nan = float('nan');
    
    In [3]: set([nan,nan])                                                                                                                        
    Out[3]: {nan}
  • Torch: NaNs are distinct.

    >>> t = torch.tensor([1.0, float("nan"), 1.0, float("nan")])
    >>> torch.unique(t)
    tensor([nan, nan, 1.])
  • TensorFlow: NaNs are distinct.

    In [1]: tf.unique([1.0,2.0,np.nan,2.0,np.nan,3.0])
    Out[1]: Unique(y=<tf.Tensor: shape=(5,), dtype=float32, numpy=array([ 1.,  2., nan, nan,  3.], dtype=float32)>, idx=<tf.Tensor: shape=(6,), dtype=int32, numpy=array([0, 1, 2, 1, 3, 4], dtype=int32)>)
    
    In [2]: nan = float('nan');
    
    In [3]: tf.unique([1.0,2.0,nan,2.0,nan,3.0])
    Out[3]: Unique(y=<tf.Tensor: shape=(5,), dtype=float32, numpy=array([ 1.,  2., nan, nan,  3.], dtype=float32)>, idx=<tf.Tensor: shape=(6,), dtype=int32, numpy=array([0, 1, 2, 1, 3, 4], dtype=int32)>)
  • NumPy: previously treated NaNs as distinct; as of v1.21, returns only a single NaN.

Downstream Libraries

  • In Matplotlib, no accommodation appears to be made for np.unique returning multiple NaNs.
  • In SciPy, there is one documented instance of using a workaround for multiple NaNs.
  • Scikit-learn has a small wrapper around np.unique to handle multiple NaNs. This wrapper assumes sorted unique values.

Proposal

  • unique should return multiple NaNs (in line with previous NumPy behaviour and other libraries).

  • Specify sort order for floating-point values (see gh-288)

  • To return only a single NaN, users can implement a similar workaround to sklearn:

    values = xp.unique_values(x)
    sorted_values = xp.sort(values)
    if bool(xp.isnan(sorted_values[-1])):
        # find index of the first NaN occurrence...
        idx = ...
        sorted_values = sorted_values[:idx + 1]

    There would be some additional work for getting the indices, inverse_indices, and counts.

    Another workaround for users is to select a sentinel value other than NaN to indicate a "missing" value for the purposes of unique. For example, if values are known to be integers on the interval [1,100], replace NaN values with 0 before calling unique.
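The sentinel workaround from the last bullet, sketched in NumPy (the data and the choice of 0.0 as sentinel are illustrative; it only works when the sentinel cannot collide with real values):

```python
import numpy as np

# values are known to lie in [1, 100], so 0.0 is free to stand in
# for "missing" before calling unique
x = np.array([3.0, np.nan, 3.0, 7.0, np.nan])
filled = np.where(np.isnan(x), 0.0, x)
r = np.unique(filled)  # all NaNs collapse into the single 0.0 sentinel
```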

In short, the issue of multiple NaN values seems, to me at least, to be a user concern and should thus be pushed to userland. While the utility of multiple NaN values in the output of unique is undoubtedly limited, I don't see why the prevention of this should be the responsibility of array libraries, especially when bearing this responsibility entails a performance cost.
