New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fix unique handling of nan entries. #18070
Conversation
Co-authored-by: Bas van Beek <43369155+BvB93@users.noreply.github.com>
Co-authored-by: Bas van Beek <43369155+BvB93@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
I'm not 100% sure whether or not this PR would require a release note; does anyone have further thoughts on this subject?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we need to develop a better way to do this type of check with the kind
(existance of a NaN). I am not quite sure what to do about complex, as it has an infinite number of NaNs.
numpy/lib/arraysetops.py
Outdated
if aux.shape[0] > 0 and aux.dtype.kind in "cfmM" and np.isnan(aux[-1]): | ||
# Ensure that `NaT` is used for time-like dtypes | ||
nan = np.array(np.nan).astype(aux.dtype) | ||
aux_firstnan = np.searchsorted(aux, nan, side='left') | ||
mask[1:aux_firstnan] = (aux[1:aux_firstnan] != aux[:aux_firstnan - 1]) | ||
mask[aux_firstnan] = True | ||
mask[aux_firstnan + 1:] = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You already have a NaN available, it is called aux[-1]
.
There is a bit of a problem with complex, although arguably that may be a bug in complex searchsorted
, I guess:
arr = np.array([1, 100, complex(np.nan, 0), complex(0, np.nan)])
arr.sort()
arr.searchsorted(complex(np.nan, 0)) != arr.searchsorted(complex(0, np.nan)) # should be the same
That is, for complex all NaNs are considered equivalent (no matter whether the NaN is in the real or imaginary part), but we currently do define a sort order between them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This aux[-1]
is really smart. I will look into complex now, but I feel it becomes harder now for a novice like me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also agree to consider all complex NaNs as equivalent, although I can imagine that others might have a different view. If we agree on this, then we need to decide, which representant from all the input NaNs to put into the return array. In my proposal, it is the first element v
with true value of np.isnan(v)
found in the sorted array aux
. The documentation and the release note is updated accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So clearly isnan
and searchsorted
do not care whether to real or imaginary part is nan
.
@seberg do you know if there are any functions within numpy where this distinction does matter?
If not, I'd say that it would be consistent to treat all complex NaNs as equivalent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BvB93 I am not sure if the distinction matters anywhere, probably not (aside from things like .real
, since it is a view that would not ensure a NaN real part when the imaginary part is NaN). I would not trust this to be tested for anywhere, but I guess most functionality either correctly inherits behaviour from glibc or is naively correct (that is with the probable exception of warning flags.)
I will not dig it up right now, but can do so some time. There was a discussion semi recently (probably to do with warning flags), and if I remember right, all complex NaNs are considered identical by an IEEE standard. (I don't remember if it said anything about infinities.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to verify: it seems we all agree that handling of nan
, as is done in this PR, is desirable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I am happy with it, right now it is not consistent/correct for complex though. Maybe we can live with that, I have not checked how complex it would be to fix the complex search-sorted.
What are the next steps? Can I do anything to push this forward? I do not want this to become a stale PR. |
Thanks @ftrojan . |
my pleasure |
BUG: Unique and nan entries (closes #2111)
When unique operates on an array with multiple NaN entries its return includes a NaN for each entry that was NaN in the original array.
This is my first PR in an open-source project, so please apology any mistakes. In this fix, I used the code suggested by ufmayer. It works correctly. I have added unit tests as well.
The performance impact is negligible as shown by the benchmark I have created.
python runtests.py --bench bench_lib.Unique
Before change:
After change: