Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix unique handling of nan entries. #18070

Merged
merged 11 commits into from Feb 12, 2021

Conversation

ftrojan
Copy link
Contributor

@ftrojan ftrojan commented Dec 25, 2020

BUG: Unique and nan entries (closes #2111)

When unique operates on an array with multiple NaN entries its return includes a NaN for each entry that was NaN in the original array.

This is my first PR in an open-source project, so please apology any mistakes. In this fix, I used the code suggested by ufmayer. It works correctly. I have added unit tests as well.

The performance impact is negligible as shown by the benchmark I have created.
python runtests.py --bench bench_lib.Unique
Before change:

[100.00%] ··· ============ ============= ============ ============= ============= =============
              --                                       percent_nans
              ------------ --------------------------------------------------------------------
               array_size        0           0.1           2.0           50.0          90.0
              ============ ============= ============ ============= ============= =============
                  200        7.47±0.2μs   7.58±0.1μs   7.24±0.08μs    7.27±0.2μs    7.14±0.7μs
                 200000     13.2±0.09ms   13.2±0.4ms    13.2±0.3ms   9.63±0.07ms   4.72±0.08ms
              ============ ============= ============ ============= ============= =============

After change:

[100.00%] ··· ============ ============ ============= ============ ============ =============
              --                                      percent_nans
              ------------ ------------------------------------------------------------------
               array_size       0            0.1          2.0          50.0          90.0
              ============ ============ ============= ============ ============ =============
                  200       9.81±0.1μs    9.85±0.2μs   13.4±0.2μs   13.3±0.2μs     13.2±1μs
                 200000     13.1±0.2ms   13.2±0.06ms   13.3±0.1ms   9.68±0.1ms   4.56±0.04ms
              ============ ============ ============= ============ ============ =============

numpy/lib/arraysetops.py Outdated Show resolved Hide resolved
numpy/lib/arraysetops.py Outdated Show resolved Hide resolved
ftrojan and others added 3 commits December 26, 2020 09:02
Co-authored-by: Bas van Beek <43369155+BvB93@users.noreply.github.com>
Co-authored-by: Bas van Beek <43369155+BvB93@users.noreply.github.com>
numpy/lib/arraysetops.py Outdated Show resolved Hide resolved
Copy link
Member

@BvB93 BvB93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.
I'm not 100% sure whether or not this PR would require a release note; does anyone have further thoughts on this subject?

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we need to develop a better way to do this type of check with the kind (existance of a NaN). I am not quite sure what to do about complex, as it has an infinite number of NaNs.

Comment on lines 328 to 334
if aux.shape[0] > 0 and aux.dtype.kind in "cfmM" and np.isnan(aux[-1]):
# Ensure that `NaT` is used for time-like dtypes
nan = np.array(np.nan).astype(aux.dtype)
aux_firstnan = np.searchsorted(aux, nan, side='left')
mask[1:aux_firstnan] = (aux[1:aux_firstnan] != aux[:aux_firstnan - 1])
mask[aux_firstnan] = True
mask[aux_firstnan + 1:] = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You already have a NaN available, it is called aux[-1].

There is a bit of a problem with complex, although arguably that may be a bug in complex searchsorted, I guess:

arr = np.array([1, 100, complex(np.nan, 0), complex(0, np.nan)])
arr.sort()
arr.searchsorted(complex(np.nan, 0)) != arr.searchsorted(complex(0, np.nan))  # should be the same

That is, for complex all NaNs are considered equivalent (no matter whether the NaN is in the real or imaginary part), but we currently do define a sort order between them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This aux[-1] is really smart. I will look into complex now, but I feel it becomes harder now for a novice like me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also agree to consider all complex NaNs as equivalent, although I can imagine that others might have a different view. If we agree on this, then we need to decide, which representant from all the input NaNs to put into the return array. In my proposal, it is the first element v with true value of np.isnan(v) found in the sorted array aux. The documentation and the release note is updated accordingly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So clearly isnan and searchsorted do not care whether to real or imaginary part is nan.
@seberg do you know if there are any functions within numpy where this distinction does matter?
If not, I'd say that it would be consistent to treat all complex NaNs as equivalent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BvB93 I am not sure if the distinction matters anywhere, probably not (aside from things like .real, since it is a view that would not ensure a NaN real part when the imaginary part is NaN). I would not trust this to be tested for anywhere, but I guess most functionality either correctly inherits behaviour from glibc or is naively correct (that is with the probable exception of warning flags.)

I will not dig it up right now, but can do so some time. There was a discussion semi recently (probably to do with warning flags), and if I remember right, all complex NaNs are considered identical by an IEEE standard. (I don't remember if it said anything about infinities.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to verify: it seems we all agree that handling of nan, as is done in this PR, is desirable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I am happy with it, right now it is not consistent/correct for complex though. Maybe we can live with that, I have not checked how complex it would be to fix the complex search-sorted.

numpy/lib/arraysetops.py Outdated Show resolved Hide resolved
@ftrojan
Copy link
Contributor Author

ftrojan commented Jan 14, 2021

What are the next steps? Can I do anything to push this forward? I do not want this to become a stale PR.

@charris charris changed the title BUG: Unique and nan entries (issue 2111) BUG: Fix unique handling of nan entries. Feb 12, 2021
@charris charris merged commit 7dcd29a into numpy:master Feb 12, 2021
@charris
Copy link
Member

charris commented Feb 12, 2021

Thanks @ftrojan .

@ftrojan
Copy link
Contributor Author

ftrojan commented Feb 15, 2021

my pleasure

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

unique and NaN entries (Trac #1514)
4 participants