Fix x86-64 JIT conversion of negative floats to unsigned ints #415
base: v2.1
Conversation
Thanks to Peter Cawley for advice, help on the test case, an initial patch, and suggested assembly.
Forgot to attach the test case:

-- Thanks to Peter Cawley for these tests.
local cast = require"ffi".cast
local tonumber = tonumber
local t = {}
-- (1): Check that you can cast negative floating-point numbers to
-- unsigned integers.
for i = 1, 70 do
t[i] = tonumber(cast("int64_t", cast("uint64_t", -(i+0.2))))
end
for i = 1, 70 do
-- Technically undefined, but seems to match gcc/clang behaviour.
assert(t[i] == -i)
end
-- (2): Check that you can cast floating-point numbers from the upper
-- half of the unsigned integer range to unsigned integers.
for i = 1, 70 do
t[i] = (tonumber(cast("uint64_t", (2^63 + 2^50 * i))) - 2^63) / 2^50
end
for i = 1, 70 do
assert(t[i] == i)
end
-- (3): Same as (1), but for uint32_t.
for i = 1, 70 do
t[i] = tonumber(cast("int32_t", cast("uint32_t", -(i+0.2))))
end
for i = 1, 70 do
assert(t[i] == -i)
end
-- (4): Same as (2), but for uint32_t.
for i = 1, 70 do
t[i] = (tonumber(cast("uint32_t", (2^31 + 2^18 * i))) - 2^31) / 2^18
end
for i = 1, 70 do
assert(t[i] == i)
end
Otherwise, passing a negative seek amount as a normal Lua number will involve a cast from double to uint64. In C it's undefined behavior when a double outside the [0, 2^64) range is cast to uint64. In Lua we try to additionally accommodate the range [-2^63, -1], but there is a bug on x86-64 and there might be a bug on other platforms: LuaJIT/LuaJIT#415. However, it's cheaper to simply target an int64_t when you want to allow negative numbers, as is our case, because you don't have to do assembly heroics for it to do what you want. The `tonumber` call in the return of `linux.lua:lseek` indicates anyway that our range is not the full 64-bit range, so we might as well restrict to long rather than ulong.
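To illustrate the point (with a hypothetical helper name, not the actual ljsyscall code): targeting `int64_t` keeps the conversion well-defined for negative seek amounts, whereas `(uint64_t)d` is undefined for any negative `d`.

```c
#include <stdint.h>

/* Sketch only: `seek_amount` is a hypothetical helper for illustration.
 * (int64_t)d is well-defined in C for any d whose truncated value fits
 * in int64_t, so negative offsets need no "assembly heroics". */
static int64_t seek_amount(double d) {
  return (int64_t)d; /* truncates toward zero */
}
```

For example, `seek_amount(-1.9)` yields -1, with no undefined behavior anywhere on the path.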
Replicating undefined behavior of any particular compiler on a particular platform is an explicit non-goal, especially as this sends a false signal to developers. What you refer to as "common practice" is an unwarranted and unportable assumption. See C99 or C11 6.3.1.4, footnote of (1). The assumption is invalid e.g. on ARM64. Anyway, consistency between the interpreter and compiler is still desirable. So I've made the interpreter return the same results as the JIT compiler backend. Which likely helps to detect the assumption as invalid, right away. -- EDIT: Misread the ARM64 specs. Corrected out-of-range result. But this is tangential to the discussion.
Just tried it, and found that now I fail to see this as an improvement.
The second expression invokes undefined behavior. You're just getting a different flavor than before. Presumably, you want to write portable code. E.g. it looks like your employer is investing in ARM64 servers, if I understood the official blogs right. And you probably want to get identical results from your code across platforms. That's why you should strive to avoid undefined behavior. OK, so let's try your code on ARM64 (on real hardware, not QEMU!). But wait ... there's dual-number mode, so:

Before the change:
After the change:
Surprised? Per the FFI semantics, a Lua number is converted to the FFI type. Also, please note I'm not too concerned about the speed of the interpreter in this particular case (the interpreter is rather slow for FFI operations, anyway). We can use whatever code works best -- as long as it matches the JIT backend behavior. But the JIT compiler turns such a conversion into a single ARM64 instruction or a couple of x64 instructions (lacking a fitting native instruction). It would obviously be helpful for performance to be able to use that single instruction. So, what would you like me to do? Slow down ARM64 (and most other platforms) to make some undefined behavior defined? [I already explained the expectations about the
Yes, undefined cases have undefined results, no matter what "common sense" would prefer; I agree on that. But I'm seeing even less consistency now, especially between the interpreted and compiled code, on the same platform (Linux, amd64):
used to give:
but after the patch:
The JIT compiler is too clever for its own good. It narrows the double (

OK, so this is not workable. Alternative, consistent solutions, anyone? For all architectures, of course. Preferably with attached benchmarks on their performance impact.
Should an issue about this be opened for discussion?
This reverts commit f5d424a.

The patch breaks test 279, i.e.

    assert(tostring(bit.band(1ll, 1, 1ull, -1)) == "1ULL")

The patch was put in to make the JIT and interpreter behaviour consistent [1] for float to unsigned int conversions, but it ended up making things worse. There needs to be a better fix for this.

[1] LuaJIT#415
There should be a way to do this in two stages that is still conforming: convert the negative float to a signed integer (which is defined as long as the truncated value is within the range of the signed integer type), and then cast the signed integer to unsigned.
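A minimal sketch of that two-stage idea (hypothetical helper name, not a LuaJIT function):

```c
#include <stdint.h>

/* Two-stage conversion sketch: for negative inputs, double -> int64_t is
 * defined (as long as the truncated value fits in int64_t), and the
 * subsequent int64_t -> uint64_t cast is defined (it wraps modulo 2^64). */
static uint64_t dbl_to_u64_twostage(double d) {
  if (d < 0.0)
    return (uint64_t)(int64_t)d; /* e.g. -1.0 -> 0xffffffffffffffff */
  return (uint64_t)d;            /* defined for d in [0, 2^64) */
}
```

The extra comparison is the price of staying within defined behavior for the whole [-2^63, 2^64) range.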
FWIW, this is what I'm proposing:

This ends up with an additional branch, but it should have an insignificant performance impact on most out-of-order processors.
As background, there are the C-style conversions float/double -> int32/uint32/int64/uint64, all of which are defined as:
At the moment, FFI-style conversions match the C-style conversions. I've been considering changing the FFI-style conversions to be just two core conversions; one double -> 32-bit (N=32) and one double -> 64-bit (N=64), defined as:
Notably, FFI-style double -> 32-bit would then be a superset of both C-style double -> int32 and C-style double -> uint32, and FFI-style double -> 64-bit would then be a superset of both C-style double -> int64 and C-style double -> uint64. FFI-style float -> T conceptually does a float -> double promotion, followed by double -> T, though in practice there might be a more efficient direct route. There's a soft-float implementation of FFI-style double -> 64-bit with defined behaviour for all finite inputs:

static uint64_t dbl_to_64(const double* x) {
  uint64_t m, s;
  int32_t e;
  memcpy(&m, x, 8);
  e = ((m >> 52) & 0x7ff) - (1023+52);
  if (e <= -53 || e >= 64) return 0;
  s = (uint64_t)(((int64_t)m) >> 63);
  m = (m & U64x(fffff, ffffffff)) | U64x(100000, 00000000);
  return ((e < 0 ? m >> -e : m << e) + s) ^ s;
} /* NB: If caller wants int64_t, cast the result to int64_t. */

Backends could always choose to use the soft-float implementation for double -> 64-bit, though in practice backends with an FPU available will have a more efficient option. FFI-style double -> 32-bit could be implemented as double -> 64-bit (via soft-float if necessary), followed by an integer operation to truncate 64 bits down to 32, though in practice this is rarely the most efficient route. If there's a hardware instruction available for C-style double -> int64 (as is the case for all 64-bit backends), then FFI-style double -> 32-bit has a simple implementation as C-style double -> int64 (e.g. x86

Efficient FFI-style double -> 64-bit is trickier. On ARM64, one starting point is
On x64,

Rough thoughts for other backends:
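For what it's worth, the soft-float routine above can be sanity-checked standalone. This copy (my own adaptation: LuaJIT's `U64x` macro expanded to plain integer literals so it compiles outside the LuaJIT tree) behaves as described for finite inputs:

```c
#include <stdint.h>
#include <string.h>

/* Standalone copy of the dbl_to_64 soft-float routine, with the U64x
 * macro replaced by equivalent literals. Extracts sign, exponent, and
 * mantissa from the IEEE-754 bits and shifts into integer position. */
static uint64_t dbl_to_64(const double *x) {
  uint64_t m, s;
  int32_t e;
  memcpy(&m, x, 8);                              /* raw IEEE-754 bits */
  e = (int32_t)((m >> 52) & 0x7ff) - (1023 + 52); /* unbiased shift count */
  if (e <= -53 || e >= 64) return 0;             /* |x| < 1 or >= 2^64 */
  s = (uint64_t)(((int64_t)m) >> 63);            /* 0 or all-ones (sign) */
  m = (m & 0xfffffffffffffULL) | 0x10000000000000ULL; /* implicit 1 bit */
  return ((e < 0 ? m >> -e : m << e) + s) ^ s;   /* negate if sign set */
}
```

E.g. `dbl_to_64` of -1.0 yields 0xffffffffffffffff (the wrap-around result discussed in this thread), and any input with magnitude below 1.0 truncates to 0.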
As an alternative, we can just raise an error like the following:
Also, it should be easily implemented for the JIT part via additional guard checks (when we fall back to the interpreter, it throws the error). Nevertheless, I have no objections regarding Peter's approach.
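The error-raising alternative can be sketched as follows (hypothetical helper, not LuaJIT code): reject any input outside [0, 2^64) instead of wrapping, and let the caller raise the conversion error.

```c
#include <stdint.h>

/* Sketch of the error-raising alternative: return -1 for out-of-range
 * input (the caller would raise "cannot convert number to uint64_t"),
 * and perform the cast only when it is well-defined. */
static int checked_dbl_to_u64(double d, uint64_t *out) {
  if (!(d >= 0.0 && d < 18446744073709551616.0)) /* 2^64; also rejects NaN */
    return -1;
  *out = (uint64_t)d; /* well-defined: d is in [0, 2^64) */
  return 0;
}
```

In JIT-compiled code the range check would become a guard, as described above, so the fast path stays a single conversion.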
Thanks to Carlo Cabrera.

(cherry picked from commit 3065c91)

There is a document "OS X ABI Mach-O File Format Reference", published by Apple, that contains a description of the Mach-O C structures. A copy of the (now removed) official documentation is in [1]. Yet another source of truth is the XNU headers; see the definitions of the C structures in [2] (`nlist_64`) and [3] (`fat_arch` and `fat_header`).

AVX-512 Foundation (F) expands most 32-bit and 64-bit based AVX instructions with the EVEX coding scheme to support 512-bit registers, operation masks, parameter broadcasting, and embedded rounding and exception control; it is implemented by Knights Landing and Skylake Xeon.

The problem is the following. AVX512F has the `vcvttsd2usi` instruction, which converts `double`/`float` to `uint32_t`/`uint64_t`. Earlier architectures (SSE2, AVX2) are sorely lacking such an instruction, as they only support signed conversions. Unsigned conversions are done with a signed convert and range shifting -- the exact algorithm depends on the compiler. A side effect of these workarounds is that negative `double`/`float` values often inadvertently convert 'as expected', even though this invokes undefined behavior. Whereas `vcvttsd2usi` always returns 0x80000000 or 0x8000000000000000 for out-of-range inputs.

In the context of LuaJIT#415 there was an attempt to fix this partially by hand-coding the `double` -> `uint64_t` case for certain architectures (see commit f3cf0d6 "Give expected results for negative non-base-10 numbers in tonumber()."). But the `double` -> `uint32_t` case remained untouched. It just so happens that the LuaJIT FFI code for Mach-O file generation in `bcsave.lua` relied on undefined behavior for conversions to `uint32_t`. That failed when LuaJIT was compiled for an architecture with AVX512F enabled.

The patch fixes the problem; however, the real issue remains unfixed. The problem is reproduced when the target hardware platform has AVX512F and LuaJIT is compiled with AVX512F instructions enabled.

Support for AVX512F can be checked in `/proc/cpuinfo` on Linux and with `sysctl hw.optional.avx512f` on macOS. To detect the codename, execute `cat /sys/devices/cpu/caps/pmu_name` or `gcc -march=native -Q --help=target | grep march`. Please consult the GCC Online Documentation [5] for the available model architectures.

1. https://github.com/aidansteele/osx-abi-macho-file-format-reference
2. https://github.com/apple-oss-distributions/xnu/blob/xnu-10002.1.13/EXTERNAL_HEADERS/mach-o/nlist.h
3. https://github.com/apple-oss-distributions/xnu/blob/xnu-10002.1.13/EXTERNAL_HEADERS/mach-o/fat.h
4. https://developer.apple.com/documentation/apple-silicon/addressing-architectural-differences-in-your-macos-code
5. https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html

Sergey Bronnikov:
* added the description and a test for the problem

Part of tarantool/tarantool#9595

----------------
luajit -b -o osx -a arm64 empty.lua empty.o
This patch fixes a problem where the JITted code for float-to-unsigned-int conversions was not matching the behavior of the interpreted code.
The basic problem is that the domain of `cvttsd2si` doesn't match the domain of `uint64_t` or `uint32_t`. The previous code assumed a negative result indicated an error and would retry with the input minus 2^64. However, this isn't what we should do for an input of -1, for example; although `uint64 u = -1.0` is undefined, strictly speaking, as it's an out-of-range conversion, common LuaJIT and C practice means we should accept the result to be `0xffffffffffffffff`. The interpreter inherits this behavior from the C compiler.

This patch changes the JIT to match the interpreter. In the future we should consider making the interpreter's result more explicit, e.g. by changing the implementation of `lj_num2u64` to be:

Thanks to Peter Cawley (@corsix) for advice, help on the test case, an initial patch, and suggested assembly, which looks like:

Corresponding changes were made to the conversion from 32-bit floats and the conversion to 32-bit unsigned integers.
See justincormack/ljsyscall#223 for the original discussion.