BUG: incorrect equality of np.int64 and np.uint64 #12525

Closed
h-vetinari opened this issue Dec 10, 2018 · 15 comments
Labels
01 - Enhancement · 23 - Wish List · 60 - Major release · 62 - Python API

Comments

@h-vetinari
Contributor

h-vetinari commented Dec 10, 2018

EDIT: since this issue has become somewhat of a stand-in for uint64-problems, here's a summary pulled up from the discussion further down:

PS. Here's all the duckies in one neat row:


There's a range of values of class np.uint64 that fail a trivial comparison test: v > v - 1

The issue is IMO that it effectively uses float-logic of some kind in the background, which is obviously neither desirable nor necessary.

Reproducing code example:

The following comparisons should all return True:

>>> import numpy as np
>>> for v in np.logspace(50, 63.9, num=20, base=2):
...     v = int(v)
...     c = np.array([v], dtype='uint64').max() > v - 1
...     print(c, v)
...
True 1125899906842624
True 1869506547705232
True 3104232188555715
True 5154439010814696
True 8558715940823230
False 14211365854171536
False 23597338764077220
False 39182327895891032
False 65060506809279480
False 108030067981851424
False 179379106627232000
False 297851001073046464
False 494568294537084288
False 821208581069482368
False 1363580199279397888
False 2264164065900722176
False 3759543384412024832
False 6242554005755503616
True 10365482328612339712
True 17211420807207079936

The way I encountered this is that 'uint64' is the default dtype for values between np.iinfo('int64').max + 1 and np.iinfo('uint64').max (which makes sense), but this leads to the following problem:

>>> v = np.iinfo('int64').max
>>> np.array([v + 1]) > v
array([ True])
>>> np.array([v + 1]).max() > v
False
>>> np.array([v + 1000]).max() == v
True

Numpy/Python version information:

Numpy version: 1.15.4
Python version: 3.6.6 (conda-forge)

@eric-wieser
Member

The problem is that v - 1 is uint64 - int64, which, due to the lack of an int128, we panic and assign type float64 to.

The workaround is to use v - np.uint64(1), which will avoid this casting.

This type of issue has been around for a while. Perhaps we should just emit a warning when this type of cast occurs, and force the user to perform the cast they actually wanted to silence it.
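
A minimal sketch of the promotion Eric describes and of the all-uint64 workaround (the int64/uint64 pair promotes to float64, a rule that still holds today):

>>> import numpy as np
>>> np.result_type(np.uint64, np.int64)     # no common integer type, so float64
dtype('float64')
>>> v = 2**63
>>> np.uint64(v) > np.uint64(v - 1)         # both sides uint64: the comparison stays exact
True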

@h-vetinari
Contributor Author

@eric-wieser
arrays don't panic, see below. ;-)

Also, your suggested workaround only works if both v and 1 are cast (and - to me - that doesn't explain why it suddenly starts working again around 1e19):

>>> for v in np.logspace(50, 63.9, num=20, base=2):
...     v = int(v)
...     cmp_arr = (np.array([v], dtype='uint64') > v - 1)[0]
...     cmp_max = np.array([v], dtype='uint64').max() > v - 1
...     cmp_cast_1 = np.array([v], dtype='uint64').max() > v - np.uint64(1)
...     cmp_cast_2 = np.array([v], dtype='uint64').max() > np.uint64(v - 1)
...     print(cmp_arr, cmp_max, cmp_cast_1, cmp_cast_2, v, np.log10(v))
...
True True True True 1125899906842624 15.05149978319906
True True True True 1869506547705232 15.271726990553235
True True True True 3104232188555715 15.491954197907413
True True True True 5154439010814696 15.712181405261587
True True True True 8558715940823230 15.932408612615763
True False False True 14211365854171536 16.152635819969937
True False False True 23597338764077220 16.372863027324115
True False False True 39182327895891032 16.59309023467829
True False False True 65060506809279480 16.813317442032464
True False False True 108030067981851424 17.03354464938664
True False False True 179379106627232000 17.25377185674082
True False False True 297851001073046464 17.473999064094993
True False False True 494568294537084288 17.694226271449168
True False False True 821208581069482368 17.914453478803345
True False False True 1363580199279397888 18.13468068615752
True False False True 2264164065900722176 18.354907893511694
True False False True 3759543384412024832 18.57513510086587
True False False True 6242554005755503616 18.795362308220046
True True True True 10365482328612339712 19.015589515574224
True True True True 17211420807207079936 19.235816722928398

In all seriousness though, this discrepancy is very hard to debug, and IMO clearly broken:

>>> v = np.iinfo('int64').max
>>> np.array([v + 1]) > v
array([ True])
>>> np.array([v + 1]).max() > v
False
>>>
>>> # even more bizarre with the repr that looks like an int
>>> np.array([v + 1]).max()
9223372036854775808
>>> _ > 9223372036854775807
False

If np.array can do it, why not the dtype directly?

Actually, the consistent solution would be to cast to object instead of float, which numpy does in other places as well:

>>> np.array([np.iinfo('uint64').max + 1])
array([18446744073709551616], dtype=object)

@ahaldane
Member

I think you're hitting a number of different issues. The second one below seems relevant.

As Eric said, one issue is that in Numpy's casting rules (different from C's casting rules), mixed operations like uint > int cast both operands to float64, and that rule is set in stone (we can't change it). The logic is that only float64 can hold values in the range of both int64 and uint64. Another part of the puzzle is that float64 cannot represent large integers exactly, so that in many cases adding or subtracting 1 from the float does nothing; e.g. note that 6242554005755503616. + 1 == 6242554005755503616.. I don't think anything special happens at 1e19; rather, I think the values you chose just semi-randomly happened to round one way or the other. For instance, 2264164065900722048 doesn't follow the pattern. I think that explains your first example.
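
To make that concrete, take the first value from the table above that compares False (a minimal check of the rounding, nothing NumPy-specific):

>>> v = 14211365854171536       # first False row in the table
>>> float(v) == float(v - 1)    # v and v - 1 collapse to the same float64
True
>>> float(2**53 - 1) == float(2**53 - 2)   # below 2**53 every integer is still exact
False

Once both sides round to the same float64, v > v - 1 becomes v > v, hence False.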

The other issue you hit has to do with mixing numpy ints and python ints. In this case I think there is something a bit weird: The array coercion of python-ints behaves differently from the scalar coercion. Here's another example that avoids using max:

>>> v = 9223372036854775807  # python int
>>> a = np.uint64(v+1)
>>> a > v   # both converted to float64
False
>>> np.array(a) > v  # both converted to float64
False
>>> np.array([a]) > v # both converted to uint64
array([ True])

Perhaps relevant: #2524, #944, #1051. Maybe this statement is relevant:

The scalar/array coercion rule is that the type of the scalar will be ignored if the scalar is of the same kind or a lesser kind than the array.
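
For reference, a sketch of that rule under the value-based casting in place when this issue was filed (NumPy 1.x; NEP 50 changed scalar promotion in NumPy 2.0, where the first result becomes int64):

>>> a = np.array([1, 2, 3], dtype=np.int8)
>>> (a + np.int64(1)).dtype                     # same-kind scalar: its width is ignored
dtype('int8')
>>> (a + np.array([1], dtype=np.int64)).dtype   # array operand: normal promotion
dtype('int64')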

@h-vetinari
Contributor Author

Followed some links in #2524 and found more xrefs in a comment:

I see quite a few issues with uint64 and all of them seem to be "related":
#5745, #7126, #4151, #2955, #2524 (since 2012!), #8002 - all open.

and then some other xrefs in #5668: #8809 #8853 #8851 #5746.

Also found more related xrefs after searching issues for uint64: #2533 #2955 #7449 #9982 #10148 #10614 #11020

In #8851 (which is very similar; just addressing equality instead of all comparators), @eric-wieser observes after a nice (read: terrible) example in the OP that:

Frankly, this is PHP-like behaviour, and not something I want to see happening quietly in python

That issue also contains some more useful discussions. #11801 is another almost-duplicate of this issue.


@ahaldane: [...] and that rule is set in stone (we can't change it).

A year ago, you made it sound less set-in-stone ;-)

Personally, if we could do everything over again, I would favor redoing all the casting/coercion so that numpy casting essentially behaves like C casting, and follows the spirit of C-casting for cases that don't exist in C. The numpy casting rules are kind of weird and sometimes confusing (eg the conversion of uint64 + int64 to float64). But we've discussed this a lot on the list, it seems hard to change.

This issue is clearly hitting many people in different situations (the above xrefs total 18 open issues!). Why not fix the fundamentals (surely there are other issues that would benefit from similar treatment), and use that occasion to bump to 2.0?


As an aside, I didn't even know that a NumPy 2.0 milestone exists (it was closed in 2015, even though open issues remain on it). Found an issue discussing this more: #9066

In it, @charris notes that:

At this point, NumPy 2.0 is a creature of mythology.

I'm making a bold prediction that this creature will eventually rear its head before numpy 1.50... ;-)

PS. On third thought, maybe numpy 2.0 needs to be limited to some selected few breaking changes; otherwise, trying to fit in all of the accumulated backpressure probably makes it an almost impossible undertaking.

@h-vetinari
Contributor Author

h-vetinari commented Jan 25, 2019

@ahaldane Any comments to the above?

PS. Here's all the duckies in one neat row:

@ahaldane
Member

Well first thanks for all the time and effort collecting all the issues, I know it takes work. When considering these complex back-compat problems it's really important to be able to reference past discussion, and IMO in a project like numpy this kind of knowledge-organizing work is comparably important to the actual coding.

Now after looking over all the issues, my personal opinion is that 1. I dislike numpy's casting behavior and wish it were C-like, but 2. I also think that at this point we are stuck with the current coercion rules unless we do a numpy 2.0, and I don't think this change alone justifies a numpy 2.0. The problem is numpy is now a foundation of a large computing ecosystem, and a change like this one would very likely break a large fraction of downstream projects in hard-to-debug ways. I would guess more than half. For the same reason that old decisions in the C language have been locked in, we get locked in too.

Plus, it's perfectly possible to write correct algorithms in numpy despite this problem, by keeping track of the type of your variables or casting as needed. So this issue can be seen as more of an annoyance rather than a bug.

But, I think we can add this problem to the list of things that would motivate a numpy 2.0. I just put this issue on the wiki at https://github.com/numpy/numpy/wiki/Backwards-incompatible-ideas-for-a-major-release. Once enough things get added to the list, there might be momentum for 2.0. I suspect that when the "epic dtype plan" gets moving the list will get longer.

@mhvk
Contributor

mhvk commented Jan 27, 2019

@h-vetinari - thanks also from me for the summary; it is very useful to have things in one place! Since you just went over them, were all those for casting, and were there examples for things other than int64 combined with uint64? (If you can, maybe edit your comment above and add a bit of description.)

@ahaldane - I worry that we may be letting parts that are truly difficult to fix get in the way of fixing simpler ones. For instance, in the example at hand, of operations on scalars, it is not obvious to me that much would break if we changed the rule for uint64 with python int (or int64) from casting to float64 to casting both to python int. Indeed, arguably that is more consistent with the general attempt to ensure the result can be properly represented, as python int can have arbitrary length.

The bigger difficulty is what to do with arrays. One option there might be to move towards a solution by changing the casting rules for int (or more specifically, just for uint64 and int64, which I think is the only pair that is really problematic) to try a cast to int64 first, and if that fails for any member of the array, warn and recast to float. I think the check for failure while casting would not cause too much of a slow-down, and this would get us towards a situation where we can (eventually) abandon conversion to float altogether.
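
A rough sketch of what such a value-dependent cast could look like (a hypothetical helper for illustration only, not an existing or proposed NumPy API):

import warnings
import numpy as np

def coerce_uint64_for_mixed_op(arr):
    # Hypothetical helper sketching the idea above: cast uint64 data to int64
    # when every value fits, otherwise warn and fall back to the current
    # float64 behaviour.
    if arr.dtype != np.uint64:
        return arr
    if arr.size == 0 or int(arr.max()) <= np.iinfo(np.int64).max:
        return arr.astype(np.int64)
    warnings.warn("uint64 values exceed int64 range; falling back to float64")
    return arr.astype(np.float64)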

@h-vetinari
Contributor Author

@mhvk
I originally compiled that list 19 days ago, see previous comment. Just added the duplicates and some more loosely related issues on the second pass. AFAIR, all issues I tagged were about uint64 being wrongly cast to float in some operation.

The bigger difficulty is what to do with arrays.

Arrays already work correctly in this case:

>>> v = np.iinfo('int64').max
>>> arr = np.array([v + 1])
>>> arr.dtype
dtype('uint64')
>>>
>>> arr > v  # arrays are working
array([ True])
>>>
>>> arr.max() > v  # scalar case is the problem
False

Furthermore, the scalar case only breaks between np.iinfo('int64').max + 1 and np.iinfo('uint64').max

>>> v = np.iinfo('uint64').max
>>> arr = np.array([v + 1])
>>> arr.dtype
dtype('O')
>>>
>>> arr > v
array([ True])
>>>
>>> arr.max() > v  # correct again
True

@ahaldane
Member

[for] operations on scalars ... change the rule for uint64 with python int (or int64) from casting to float64 to casting both to python int.

[for arrays] changing the casting rules for int (or more specifically, just for uint64 and int64, which I think is the only pair that is really problematic) to try a cast to int64 first, and if that fails for any member of the array, warn and recast to float.

I guess the question is how much these changes would break. I don't have hard evidence, but my impression is that it would break a lot of code and so any change like this is infeasible, even for the python scalars. To get a better idea of whether that's true or not I think someone would have to code it up just to try running some project's test suites.

By the way, C coercion for ints does uint64 + int64 -> uint64, not int64. I have a gist outlining how C-style coercion would work in numpy: https://gist.github.com/ahaldane/0f5ade49730e1a5d16ff6df4303f2e76. Not only would int+uint coercion change relative to current numpy, but int+float coercions would as well.
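
For comparison, a quick illustrative aside (the explicit cast below emulates the C rule by hand):

>>> a = np.array([2**63], dtype=np.uint64)
>>> b = np.array([-1], dtype=np.int64)
>>> (a + b).dtype                 # NumPy's rule: no common integer type, use float64
dtype('float64')
>>> a + b.astype(np.uint64)       # C-style: reinterpret int64 as uint64 and wrap modulo 2**64
array([9223372036854775807], dtype=uint64)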

@mhvk
Contributor

mhvk commented Jan 29, 2019

Agreed that some better data would be needed... But it is also good to simply collect how, in an ideal world, we would like things to work, so we can decide which behaviours are actually fine as is, which are possible to move to, and which are bad but unchangeable.

@h-vetinari
Contributor Author

@ahaldane: By the way, C coercion for ints does uint64 + int64 -> uint64, not int64.

But C does not have the open-ended python ints. It should at least be considered that this casts to object (which array.__init__ already does).

>>> np.array([np.iinfo('uint64').max+1])
array([18446744073709551616], dtype=object)

This would - I believe - solve >90% of the issues I linked above.

I have a gist outlining how C-style coercion would work in numpy: https://gist.github.com/ahaldane/0f5ade49730e1a5d16ff6df4303f2e76.

Thanks for this! You bring up another good point that might be worth tackling at the same time: silent int overflow. I think this is terrible (silently changing results; very hard to debug), and it really should raise, warn, or upcast.
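
For the record, this is the kind of silent wraparound meant (array arithmetic wraps without any warning):

>>> arr = np.array([2**62], dtype=np.int64)
>>> (arr * 2)[0]                  # 2**63 overflows int64 and wraps silently
-9223372036854775808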

@eric-wieser
Member

By the way, C coercion for ints does uint64 + int64 -> uint64, not int64.

Proof of that statement: https://godbolt.org/z/CywY13

@seberg added the 60 - Major release label May 2, 2019
@shoyer added the 62 - Python API label May 10, 2019
@seberg changed the title from "Broken comparators for uint64-dtype" to "BUG: incorrect equality of np.int64 and np.uint64" May 12, 2022
@hacatu

hacatu commented May 18, 2022

I think that for +, -, and *, switching to uint64 (as C does) instead of float64 as the result of combining a signed and unsigned value would not be likely to break existing programs. / is more tricky of course, because the exact result of division is not always an integer.

Furthermore, I think that these operations can and should dispatch on output type when used in compound assignment. That is to say, if we have arr[i] <<= 1, we don't have to do a whole lot of guesswork about what type we should make arr[i] << 1 in this context because it should obviously be the same type as arr[i].

Even though it might be difficult to change this behavior now because programs that depend on the incorrect behavior no doubt exist, it would be very nice to have heterogeneous compound assignments (signed += unsigned and vice versa) work correctly.
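
For context, this is roughly what a mixed in-place operation does today, and how dispatching on the output dtype currently has to be forced by hand (illustrative only):

>>> a = np.zeros(3, dtype=np.uint64)
>>> b = np.ones(3, dtype=np.int64)
>>> # a += b raises a casting error: the float64 result of uint64 + int64 cannot be
>>> # cast back to uint64 under the default 'same_kind' rule for in-place ufuncs.
>>> np.add(a, b, out=a, casting='unsafe')
array([1, 1, 1], dtype=uint64)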

@mattip
Member

mattip commented Sep 18, 2023

Worth revisiting and possibly closing after the change in casting rules: NumPy will now not automatically cast for you.

@seberg
Member

seberg commented Sep 18, 2023

There are other issues around this, but I had fixed the equality on main (actually I think already in 1.25).
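
If that is indeed the fix that landed in 1.25, the equality example from the original report should now come out correct (expected output under NumPy >= 1.25, not re-verified here):

>>> v = np.iinfo('int64').max
>>> np.array([v + 1000]).max() == v   # was True via float64 rounding
False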

@seberg seberg closed this as completed Sep 18, 2023