BUG: Fix unique handling of nan entries. #18070

ftrojan · 2020-12-25T09:38:00Z

BUG: Unique and nan entries (closes #2111)

When unique operates on an array with multiple NaN entries, its return value includes a NaN for each entry that was NaN in the original array.
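
The cause can be reproduced without calling `np.unique` at all. The snippet below is a minimal sketch of the general sort-and-compare approach, not a copy of NumPy's implementation, and the names `aux`, `mask` and `naive_unique` are chosen only for illustration. Because `nan != nan` evaluates to true, every NaN in the sorted array looks different from its predecessor and is kept.

```python
import numpy as np

# Sorting moves every NaN to the end of the array.
aux = np.sort(np.array([1.0, np.nan, 1.0, np.nan, np.nan]))

# Naive rule: keep an element if it differs from its predecessor.
mask = np.empty(aux.shape, dtype=bool)
mask[:1] = True
mask[1:] = aux[1:] != aux[:-1]

# NaN never compares equal to NaN, so all three NaNs survive.
naive_unique = aux[mask]
nan_count = int(np.isnan(naive_unique).sum())
```

Here `naive_unique` holds one `1.0` followed by three NaNs, which is the behaviour this PR sets out to change.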

This is my first PR in an open-source project, so please excuse any mistakes. In this fix, I used the code suggested by ufmayer. It works correctly. I have added unit tests as well.

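The performance impact is negligible for large arrays, as shown by the benchmark I have created; for 200-element arrays the runtime rises from about 7 μs to between 10 and 13 μs.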
python runtests.py --bench bench_lib.Unique
Before change:

[100.00%] ··· ============ ============= ============ ============= ============= =============
              --                                       percent_nans
              ------------ --------------------------------------------------------------------
               array_size        0           0.1           2.0           50.0          90.0
              ============ ============= ============ ============= ============= =============
                  200        7.47±0.2μs   7.58±0.1μs   7.24±0.08μs    7.27±0.2μs    7.14±0.7μs
                 200000     13.2±0.09ms   13.2±0.4ms    13.2±0.3ms   9.63±0.07ms   4.72±0.08ms
              ============ ============= ============ ============= ============= =============

After change:

[100.00%] ··· ============ ============ ============= ============ ============ =============
              --                                      percent_nans
              ------------ ------------------------------------------------------------------
               array_size       0            0.1          2.0          50.0          90.0
              ============ ============ ============= ============ ============ =============
                  200       9.81±0.1μs    9.85±0.2μs   13.4±0.2μs   13.3±0.2μs     13.2±1μs
                 200000     13.1±0.2ms   13.2±0.06ms   13.3±0.1ms   9.68±0.1ms   4.56±0.04ms
              ============ ============ ============= ============ ============ =============
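
For readers who want to reproduce numbers of this kind, here is a hedged sketch of what an asv-style benchmark with these two parameters could look like. Only the class name `Unique` and the parameter names `array_size` and `percent_nans` come from the output above; the method bodies, the seed and the way NaNs are injected are assumptions, not the code added in this PR.

```python
import numpy as np


class Unique:
    """Sketch of an asv-style benchmark; the bodies are assumptions."""

    # asv runs every combination of these parameter values.
    param_names = ["array_size", "percent_nans"]
    params = [[200, 200000], [0, 0.1, 2.0, 50.0, 90.0]]

    def setup(self, array_size, percent_nans):
        # Fixed seed so that repeated runs time the same data.
        rng = np.random.default_rng(123)
        arr = rng.uniform(size=array_size)
        # Replace roughly `percent_nans` percent of the values with NaN.
        arr[arr < percent_nans / 100.0] = np.nan
        self.arr = arr

    def time_unique(self, array_size, percent_nans):
        # asv times methods whose names start with `time_`.
        np.unique(self.arr)
```

The methods can also be called by hand, for example `b = Unique(); b.setup(200, 50.0); b.time_unique(200, 50.0)`, to check that the setup produces the intended data before timing anything.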

BvB93

Looks good to me.
I'm not 100% sure whether or not this PR would require a release note; does anyone have further thoughts on this subject?

seberg

Yeah, we need to develop a better way to do this type of check with the kind (existence of a NaN). I am not quite sure what to do about complex, as it has an infinite number of NaNs.

seberg · 2020-12-26T19:09:06Z

numpy/lib/arraysetops.py

+    if aux.shape[0] > 0 and aux.dtype.kind in "cfmM" and np.isnan(aux[-1]):
+        # Ensure that `NaT` is used for time-like dtypes
+        nan = np.array(np.nan).astype(aux.dtype)
+        aux_firstnan = np.searchsorted(aux, nan, side='left')
+        mask[1:aux_firstnan] = (aux[1:aux_firstnan] != aux[:aux_firstnan - 1])
+        mask[aux_firstnan] = True
+        mask[aux_firstnan + 1:] = False


You already have a NaN available, it is called aux[-1].
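
A minimal, self-contained sketch of the masking logic with this suggestion applied might look as follows. The function name `unique_collapse_nan` is hypothetical, and this is not NumPy's actual `_unique1d`. It assumes that sorting places NaN (or NaT) values at the end and that `searchsorted` with `side='left'` returns the index of the first of them, as the diff above does. The extra `first_nan > 0` guard is an addition in this sketch for arrays that contain nothing but NaNs.

```python
import numpy as np


def unique_collapse_nan(ar):
    """Hypothetical helper: sorted unique values with NaNs collapsed to one."""
    aux = np.sort(np.asarray(ar).ravel())
    mask = np.empty(aux.shape, dtype=bool)
    mask[:1] = True
    if aux.shape[0] > 0 and aux.dtype.kind in "cfmM" and np.isnan(aux[-1]):
        # aux[-1] is already a NaN (or NaT) of the right dtype.
        first_nan = np.searchsorted(aux, aux[-1], side="left")
        if first_nan > 0:
            # Ordinary neighbour comparison for the non-NaN prefix.
            mask[1:first_nan] = aux[1:first_nan] != aux[:first_nan - 1]
        # Keep the first NaN and drop all the others.
        mask[first_nan] = True
        mask[first_nan + 1:] = False
    else:
        mask[1:] = aux[1:] != aux[:-1]
    return aux[mask]


result = unique_collapse_nan([3.0, np.nan, 1.0, np.nan, 1.0])
```

With this input, `result` contains `1.0`, `3.0` and a single NaN. Complex input is not handled consistently by this sketch, for the reason discussed in the rest of this thread.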

There is a bit of a problem with complex, although arguably that may be a bug in complex searchsorted, I guess:

    arr = np.array([1, 100, complex(np.nan, 0), complex(0, np.nan)])
    arr.sort()
    arr.searchsorted(complex(np.nan, 0)) != arr.searchsorted(complex(0, np.nan))  # should be the same

That is, for complex all NaNs are considered equivalent (no matter whether the NaN is in the real or imaginary part), but we currently do define a sort order between them.

This aux[-1] is really smart. I will look into complex now, but I feel it becomes harder now for a novice like me.

I would also agree to consider all complex NaNs as equivalent, although I can imagine that others might have a different view. If we agree on this, then we need to decide which representative of all the input NaNs to put into the return array. In my proposal, it is the first element v in the sorted array aux for which np.isnan(v) is true. The documentation and the release note are updated accordingly.
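
One way to locate that first NaN without depending on how the different complex NaNs are ordered among themselves is to search the boolean `np.isnan` mask instead of the values. The snippet below is a sketch under the assumption that sorting places every NaN-containing complex value after all the others; the variable names are illustrative only.

```python
import numpy as np

arr = np.array([1, 100, complex(np.nan, 0), complex(0, np.nan)])
aux = np.sort(arr)

# isnan is true if either the real or the imaginary part is NaN.
is_nan = np.isnan(aux)

# With all NaNs at the end, is_nan is a sorted boolean array,
# so searching for True yields the index of the first NaN.
first_nan = int(np.searchsorted(is_nan, True, side="left"))

# The representative NaN that would be kept in the result.
representative = aux[first_nan]
```

Everything before `first_nan` can then be deduplicated by ordinary neighbour comparison, and everything after it is dropped.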

So clearly isnan and searchsorted do not care whether the real or the imaginary part is nan.
@seberg do you know if there are any functions within numpy where this distinction does matter?
If not, I'd say that it would be consistent to treat all complex NaNs as equivalent.

@BvB93 I am not sure if the distinction matters anywhere, probably not (aside from things like .real, since it is a view that would not ensure a NaN real part when the imaginary part is NaN). I would not trust this to be tested for anywhere, but I guess most functionality either correctly inherits behaviour from glibc or is naively correct (that is with the probable exception of warning flags.)

I will not dig it up right now, but can do so some time. There was a discussion semi recently (probably to do with warning flags), and if I remember right, all complex NaNs are considered identical by an IEEE standard. (I don't remember if it said anything about infinities.)

Just to verify: it seems we all agree that handling of nan, as is done in this PR, is desirable?

Yes, I am happy with it, although right now it is not consistent/correct for complex. Maybe we can live with that; I have not checked how complex it would be to fix the complex searchsorted.

ftrojan · 2021-01-14T14:09:01Z

What are the next steps? Can I do anything to push this forward? I do not want this to become a stale PR.

charris · 2021-02-12T16:48:08Z

Thanks @ftrojan .

ftrojan · 2021-02-15T08:42:06Z

my pleasure

filip_trojan added 5 commits December 25, 2020 07:31

benchmark bench_lib.Unique added

9563e40

extended test_unique_1d

56d84b4

modify _unique1d

321f27e

extend test with return_index, return_inverse and return_counts parameters

8f2d3e8

documentation updated

19af656

github-actions bot added the 00 - Bug label Dec 25, 2020

BvB93 reviewed Dec 26, 2020

ftrojan and others added 3 commits December 26, 2020 09:02

Update numpy/lib/arraysetops.py

8e59b8b

Co-authored-by: Bas van Beek <43369155+BvB93@users.noreply.github.com>

full coverage of nan types

00d663e

Co-authored-by: Bas van Beek <43369155+BvB93@users.noreply.github.com>

added tests for the datetime like dtypes

4cb1ac9

ftrojan commented Dec 26, 2020

BvB93 approved these changes Dec 26, 2020

nan as vector of length 1

37a2ba6

seberg reviewed Dec 27, 2020

filip_trojan added 2 commits December 28, 2020 10:14

use aux[-1] as nan, ..versionchanged, release note

8b294c5

for complex arrays all NaN values are considered equivalent

f2a99c6

charris added the component: numpy.lib label Feb 12, 2021

charris changed the title ~~BUG: Unique and nan entries (issue 2111)~~ BUG: Fix unique handling of nan entries. Feb 12, 2021

charris merged commit 7dcd29a into numpy:master Feb 12, 2021

seberg mentioned this pull request Aug 2, 2021

BUG: Do not raise deprecation warning for all nans in unique #19301

Merged

h-vetinari mentioned this pull request Mar 19, 2022

BUG: handle inconsistencies in factorial functions and their extensions scipy/scipy#15600

Closed
