Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix x86-64 JIT conversion of negative floats to unsigned ints #415

Open
wants to merge 1 commit into
base: v2.1
Choose a base branch
from

Conversation

wingo
Copy link

@wingo wingo commented Apr 30, 2018

This patch fixes a problem where the JITted code for float-to-unsigned-int conversions was not matching the behavior of the interpreted code.

The basic problem is that the domain of cvttsd2si doesn't match the domain of uint64_t or uint32_t.

The previous code would assume a negative result indicated an error and would retry with the input minus 2^64. However this isn't what we should do for an input of -1, for example; although uint64 u = -1.0 is undefined, strictly speaking, as it's an out-of-range conversion, common LuaJIT and C practice means we should accept the result to be 0xffffffffffffffff. The interpreter inherits this behavior from the C compiler.

This patch changes the JIT to match the interpreter. In the future we should consider making the interpreter's result more explicit, e.g. by changing the implementation of lj_num2u64 to be

uint64_t
lj_num2u64 (double n)
{
  if (n < -2^63) then return INT_MIN;
  if (n < 2^63) then return n;
  if (n < 2^64) then return n - 2^64;
  return INT_MIN;
}

Thanks to Peter Cawley (@corsix) for advice, help on the test case, an initial patch, and suggested assembly, which looks like:

2a50ffb3  cvttsd2si rbx, xmm7
2a50ffb8  movaps xmm6, xmm7
2a50ffbb  subsd xmm6, [0x41c52680] ;; 2^64
2a50ffc4  cvttsd2si rdi, xmm6
2a50ffc9  cmp rbx, +0x01
2a50ffcd  cmovo rbx, rdi

Corresponding changes were made to the conversion from 32-bit floats and conversion to 32-bit unsigned integers.

See justincormack/ljsyscall#223 for the original discussion.

Thanks to Peter Cawley for advice, help on the test case, an initial
patch, and suggested assembly.
@wingo
Copy link
Author

wingo commented May 2, 2018

Forgot to attach the test case:

-- Thanks to Peter Cawley for these tests.

local cast = require"ffi".cast
local tonumber = tonumber
local t = {}

-- (1): Check that you can cast negative floating-point numbers to
-- unsigned integers.
for i = 1, 70 do
  t[i] = tonumber(cast("int64_t", cast("uint64_t", -(i+0.2))))
end
for i = 1, 70 do
  -- Technically undefined, but seems to match gcc/clang behaviour.
  assert(t[i] == -i)
end

-- (2) Check that you can cast floating-point numbers from the upper
-- half of the unsigned integer range to unsigned integers.
for i = 1, 70 do
  t[i] = (tonumber(cast("uint64_t", (2^63 + 2^50 * i))) - 2^63) / 2^50
end
for i = 1, 70 do
  assert(t[i] == i)
end

-- (3): Same as (1), but for uint32_t.
for i = 1, 70 do
  t[i] = tonumber(cast("int32_t", cast("uint32_t", -(i+0.2))))
end
for i = 1, 70 do
  assert(t[i] == -i)
end

-- (4): Same as (2), but for uint32_t.
for i = 1, 70 do
  t[i] = (tonumber(cast("uint32_t", (2^31 + 2^18 * i))) - 2^31) / 2^18
end
for i = 1, 70 do
  assert(t[i] == i)
end

wingo added a commit to wingo/ljsyscall that referenced this pull request May 2, 2018
Otherwise, passing a negative seek amount as a normal Lua number will
involve a cast from double to uint64.  In C it's undefined behavior when
a double outside the [0,2^64) range is cast to uint64.  In Lua we try to
additionally accomodate the range [-2^63, -1], but there is a bug on
x64-64 and might be a bug on other platforms:

  LuaJIT/LuaJIT#415

However it's cheaper to simply target an int64_t when you want to allow
negative numbers, as is our case, because you don't have to do assembly
heroics for it to do what you want.  The `tonumber` call in the return
of `linux.lua:lseek` indicates anyway that our range is not the full
64-bit range, so we might as well restrict instead to long rather than
ulong.
wingo added a commit to Igalia/snabb that referenced this pull request May 2, 2018
Otherwise, passing a negative seek amount as a normal Lua number will
involve a cast from double to uint64.  In C it's undefined behavior when
a double outside the [0,2^64) range is cast to uint64.  In Lua we try to
additionally accomodate the range [-2^63, -1], but there is a bug on
x64-64 and might be a bug on other platforms:

  LuaJIT/LuaJIT#415

However it's cheaper to simply target an int64_t when you want to allow
negative numbers, as is our case, because you don't have to do assembly
heroics for it to do what you want.  The `tonumber` call in the return
of `linux.lua:lseek` indicates anyway that our range is not the full
64-bit range, so we might as well restrict instead to long rather than
ulong.
@MikePall
Copy link
Member

MikePall commented May 20, 2018

Replicating undefined behavior of any particular compiler on a particular platform is an explicit non-goal. Especially, as this sends a false signal to developers.

What you refer to "common practice" is an unwarranted and unportable assumption. See C99 or C11 6.3.1.4, footnote of (1).

The assumption is invalid e.g. on ARM64, as this has FCVTZU Xd, Dn, which both C compilers and LuaJIT happily make use of. That particular instruction is specified to return zero for out-of-range inputs. Of course, that's just another outcome of undefined behavior on a particular platform.

Anyway, consistency between the interpreter and compiler is still desirable. So I've made the interpreter return the same results as the JIT compiler backend. Which likely helps to detect the assumption as invalid, right away.

--

EDIT: Misread the ARM64 specs. Corrected out-of-range result. But this is tangential to the discussion.

@MikePall MikePall closed this May 20, 2018
@javierguerragiraldez
Copy link

just tried it and found that now 0ull-1 ~= 0ull+-1

i fail to see this as an improvement.

@MikePall
Copy link
Member

MikePall commented Jun 5, 2018

The second expression invokes undefined behavior. You're just getting a different flavor than before.

Presumably, you want to write portable code. E.g. it looks like your employer is investing into ARM64 servers, if I understood the official blogs right. And you probably want to get identical results from your code across platforms. That's why you should strive to avoid undefined behavior.

OK, so let's try your code on ARM64 (on real hardware, not QEMU!). But wait ... there's dual-number mode, so -1 is really an integer. And the interpreter constant-folds, too. So let's trick it into using a double for the implicit conversion:

Before the change:

x64$ luajit-260b9b48 -e 'local x=0.5; local y=x-1.5; print(0ull+y)'
18446744073709551615ULL

arm64$ luajit-260b9b48 -e 'local x=0.5; local y=x-1.5; print(0ull+y)'
0ULL

After the change:

x64$ luajit-v2.1 -e 'local x=0.5; local y=x-1.5; print(0ull+y)'
9223372036854775808ULL

arm64$ luajit-v2.1 -e 'local x=0.5; local y=x-1.5; print(0ull+y)'
0ULL

Surprised?

Per the FFI semantics, a Lua number is converted to the FFI type, i.e. uint64_t, before a mixed-type arithmetic operation. That's of course an undefined conversion for a negative double, so it's unsurprising to get different results on different platforms.

Also, please note I'm not too concerned about the speed of the interpreter in this particular case (the interpreter is rather slow for FFI operations, anyway). We can use whatever code works best -- as long as it matches the JIT backend behavior.

But the JIT compiler turns such a conversion into a single ARM64 instruction or a couple of x64 instructions (lacking a fitting native instruction). It would obviously be helpful for performance to be able to use that single instruction.

So, what would you like me to do? Slow down ARM64 (and most other platforms) to make some undefined behavior defined?

[I already explained the expectations about the double -> uint64_t conversion behavior in C are unwarranted and based on the observation of some platform-dependent behavior. That C code would give different results on ARM64, too.]

@javierguerragiraldez
Copy link

yes, undefined cases have undefined results, no matter what "common sense" would prefer; i agree on that.

but i'm seeing even less consistency now, especially between the interpreted and compiled code, on the same platform (Linux, amd64):

local t = {}

for i=1,100 do
	t[i] = 0ull + -1
end

for i = 50,70 do
	print(i, bit.tohex(t[i]))
end

used to give:

50      ffffffffffffffff
51      ffffffffffffffff
52      ffffffffffffffff
53      ffffffffffffffff
54      ffffffffffffffff
55      ffffffffffffffff
56      ffffffffffffffff
57      ffffffffffffffff
58      ffffffffffffffff
59      ffffffffffffffff
60      ffffffffffffffff
61      ffffffffffffffff
62      ffffffffffffffff
63      ffffffffffffffff
64      ffffffffffffffff
65      ffffffffffffffff
66      ffffffffffffffff
67      ffffffffffffffff
68      ffffffffffffffff
69      ffffffffffffffff
70      ffffffffffffffff

but after the patch:

50      8000000000000000
51      8000000000000000
52      8000000000000000
53      8000000000000000
54      8000000000000000
55      8000000000000000
56      8000000000000000
57      8000000000000000
58      ffffffffffffffff
59      ffffffffffffffff
60      ffffffffffffffff
61      ffffffffffffffff
62      ffffffffffffffff
63      ffffffffffffffff
64      ffffffffffffffff
65      8000000000000000
66      ffffffffffffffff
67      ffffffffffffffff
68      ffffffffffffffff
69      ffffffffffffffff
70      ffffffffffffffff

@MikePall
Copy link
Member

MikePall commented Jun 24, 2018

The JIT compiler is too clever for its own good. It narrows the double (-1) to an integer, when it's able to guarantee the operation (+) result is identical ... that is, if the operation is defined. Which it is not, of course.

OK, so this is not workable. Alternative, consistent solutions, anyone? For all architectures, of course. Preferably with attached benchmarks on their performance impact.

@MikePall MikePall reopened this Jun 24, 2018
tst2005 pushed a commit to tst2005/ljsyscall that referenced this pull request Aug 8, 2018
Otherwise, passing a negative seek amount as a normal Lua number will
involve a cast from double to uint64.  In C it's undefined behavior when
a double outside the [0,2^64) range is cast to uint64.  In Lua we try to
additionally accomodate the range [-2^63, -1], but there is a bug on
x64-64 and might be a bug on other platforms:

  LuaJIT/LuaJIT#415

However it's cheaper to simply target an int64_t when you want to allow
negative numbers, as is our case, because you don't have to do assembly
heroics for it to do what you want.  The `tonumber` call in the return
of `linux.lua:lseek` indicates anyway that our range is not the full
64-bit range, so we might as well restrict instead to long rather than
ulong.
@nico-abram
Copy link

nico-abram commented Jan 20, 2019

Should an issue about this be opened for discussion?

siddhesh added a commit to moonjit/moonjit that referenced this pull request Mar 14, 2019
This reverts commit f5d424a.

The patch breaks test 279, i.e.

  assert(tostring(bit.band(1ll, 1, 1ull, -1)) == "1ULL")

The patch was put in to make the JIT and interpreter behaviour
consistent[1] for float to unsigned int conversions but it ended up
making things worse.  There needs to be a better fix for this.

[1] LuaJIT#415
@siddhesh
Copy link

There should be a way to do this in two stages that is still conforming: convert from negative float to signed integer (which is defined as long as the truncated number is within the range of the signed integer type) and then cast the signed integer to unsigned.

@siddhesh
Copy link

FWIW, this is what I'm proposing:

moonjit@454bea8

This ends up with an additional branch but it should have an insignificant performance impact for most out of order processors.

siddhesh added a commit to moonjit/moonjit that referenced this pull request Mar 28, 2019
This reverts commit f5d424a.

The patch breaks test 279, i.e.

  assert(tostring(bit.band(1ll, 1, 1ull, -1)) == "1ULL")

The patch was put in to make the JIT and interpreter behaviour
consistent[1] for float to unsigned int conversions but it ended up
making things worse.  There needs to be a better fix for this.

[1] LuaJIT#415
cryptomilk pushed a commit to cryptomilk/LuaJIT that referenced this pull request Oct 21, 2021
This reverts commit f5d424a.

The patch breaks test 279, i.e.

  assert(tostring(bit.band(1ll, 1, 1ull, -1)) == "1ULL")

The patch was put in to make the JIT and interpreter behaviour
consistent[1] for float to unsigned int conversions but it ended up
making things worse.  There needs to be a better fix for this.

[1] LuaJIT#415
cryptomilk pushed a commit to cryptomilk/LuaJIT that referenced this pull request Jan 9, 2022
This reverts commit f5d424a.

The patch breaks test 279, i.e.

  assert(tostring(bit.band(1ll, 1, 1ull, -1)) == "1ULL")

The patch was put in to make the JIT and interpreter behaviour
consistent[1] for float to unsigned int conversions but it ended up
making things worse.  There needs to be a better fix for this.

[1] LuaJIT#415
@corsix
Copy link

corsix commented Jan 3, 2024

Alternative, consistent solutions, anyone?

As background, there are the C-style conversions float/double -> int32/uint32/int64/uint64, all of which are defined as:

  1. FP domain round toward zero (i.e. truncation). [NB: no-op for operands with abs(x) >= 252]
  2. Convert to infinite-precision integer.
  3. Undefined behaviour if the infinite-precision integer is outside the range of the target type (that being one of int32/uint32/int64/uint64).
  4. Return the value in the target type equal to the infinite-precision integer.

At the moment, FFI-style conversions match the C-style conversions. I've been considering changing the FFI-style conversions to be just two core conversions; one double -> 32-bit (N=32) and one double -> 64-bit (N=64), defined as:

  1. FP domain round toward zero (i.e. truncation). [NB: no-op for operands with abs(x) >= 252]
  2. Convert to infinite-precision integer.
  3. Undefined behaviour if the infinite-precision integer is outside the range of -2N-1 through 2N-1.
  4. Return the value in the target type equal to the infinite-precision integer modulo 2N.

Notably, FFI-style double -> 32-bit would then be a superset of both C-style double -> int32 and C-style double -> uint32_t, and FFI-style double -> 64-bit would then be a superset of both C-style double -> int64 and C-style double -> uint64_t. FFI-style float -> T conceptually does a float -> double promotion, followed by double -> T, though in practice there might be a more efficient direct route.

There's a soft-float implementation of FFI-style double -> 64-bit with defined behaviour for all finite inputs:

static uint64_t dbl_to_64(const double* x) {
  uint64_t m, s;
  int32_t e;
  memcpy(&m, x, 8);
  e = ((m >> 52) & 0x7ff) - (1023+52);
  if (e <= -53 || e >= 64) return 0;
  s = (uint64_t)(((int64_t)m) >> 63);
  m = (m & U64x(fffff, ffffffff)) | U64x(100000, 00000000);
  return ((e < 0 ? m >> -e : m << e) + s) ^ s;
} /* NB: If caller wants int64_t, cast the result to int64_t. */

Backends could always choose to use the soft-float implementation for double -> 64-bit, though in practice backends with an FPU available will have a more efficient option available. FFI-style double -> 32-bit could be implemented as double -> 64-bit (via soft-float if necessary), followed by an integer operation to truncate 64 bits down to 32, though in practice this is rarely the most efficient route.

If there's a hardware instruction available for C-style double -> int64 (as is the case for all 64-bit backends), then FFI-style double -> 32-bit has a simple implementation as C-style double -> int64 (e.g. x86 cvttsd2si) followed by integer operation to truncate 64 bits down to 32, with the defined range in this case being -263 through 263-1. If there's no cheap way of doing C-style double -> int64, an alternative option for FFI-style double -> 32-bit is FP domain round toward zero (e.g. SSE4.1 roundsd, ARM64 frintz), followed by bit.tobit-style double -> int32 (i.e. add 252+251 in the FP domain, then bitwise reinterpret as integer), with the defined range in this case being +/-251. On sufficiently modern ARM64, FFI-style double -> 32-bit could also be a single fjcvtzs instruction, with defined range being all finite values. On x86 32-bit, x87 presents a C-style double -> int64 (albeit via memory), which might or might not be preferable to the bit.tobit route.

Efficient FFI-style double -> 64-bit is trickier. On ARM64, one starting point is fcvtzu, which does an FP domain truncation followed by a saturating conversion double -> uint64_t. From this, a four-instruction sequence can be built that covers the domain -264-1 through 264-1:

fcvtzu idst, fsrc
fneg ftmp, fsrc
fcvtzu itmp, ftmp
sub idst, idst, itmp

On x64, cvttsd2si (C-style double -> int64) is the obvious building-block, but it covers 2-63 through 263-1, and yields 2-63 for anything out of range. As in the opening description for this ticket, it can be extended to 2-63 through 264+263-1 by doing a 2nd conversion of x-264 and then merging the results. The assembly from the opening description is fine, though some tweaks are also possible: using addsd with constant -264 rather than subsd with constant 264 (allowing a load of the constant followed by an addsd destroying the constant, rather than movaps of the original value followed by subsd with memory operand), and merging by summing the two results and then adding 263 rather than cmp/cmovo (possibly nicer for old CPUs with expensive cmov). x86 32-bit can take the soft-float route for FFI-style double -> 64-bit, or try the x87 alternative to cvttsd2si(fisttp).

Rough thoughts for other backends:

mips32r1:
  just use soft-float?
mips32r2+:
  to 32-bit: TRUNC.L.D, move low 32 bits to GPR
  to 64-bit: if (x >= 2^63) { x += -2^64 } TRUNC.L.D, move to GPR pair
mips64:
  to 32-bit: TRUNC.L.D, move low 32 bits to GPR
  to 64-bit: if (x >= 2^63) { x += -2^64 } TRUNC.L.D, move to GPR (i.e. existing logic for FP to U64 conversions)

arm32:
  to 32-bit: VCVT_U32_F64, VNEG, VCVT_U32_F64, SUB (or VJCVT on sufficiently modern cores)
  to 64-bit: soft-float

ppc with fctidz:
  to 32-bit: fctidz, move low 32 bits to GPR
  to 64-bit: if (x >= 2^63) { x += -2^64 } fctidz, move to GPR
ppc with fctiwz:
  to 32-bit: if (x >= 2^31) { x += -2^31; fctiwz; integer += 2^31 } else fctiwz
  to 64-bit: soft-float

riscv64:
  same strategy as arm64, just different instruction names

@Buristan
Copy link

As an alternative, we can just raise an error like the following: cannot convert 'number' to 'uint64_t', value range exceeded in cases when the given value is outside the range of the target type. Since nobody should rely on undefined (or implementation-defined) behaviour, this will signal users that there is an error in their code and they should fix it. Now there are already cases when some (arithmetic) operation can raise an error when conversion between types is invalid:

src/luajit -e '
local ffi = require"ffi"
local a = ffi.new("double", 1.1)
local b = ffi.new("void *", 0)
print(a + b)
'
src/luajit: (command line):4: cannot convert 'number' to 'void *'
stack traceback:
        [C]: in function 'new'
        (command line):4: in main chunk
        [C]: at 0x55a98f1df044

Also, it should be easily implemented for the JIT part via additional guard checks (when we fallback to the interpreter, it throws the error).

Nevertheless, I have no objections regarding Peter's approach.

ligurio pushed a commit to tarantool/luajit that referenced this pull request Mar 4, 2024
Thanks to Carlo Cabrera.

(cherry picked from commit 3065c91)

There is a document "OS X ABI Mach-O File Format Reference",
published by Apple company, that contains description of Mach-O
C-structures. Copy of the (now removed) official documentation in [1].
Yet another source of thruth is a XNU headers, see the definition
of C-structures in: [2] (`nlist_64`), [3] (`fat_arch` and `fat_header`).

AVX-512 Foundation (F) – expands most 32-bit and 64-bit based AVX
instructions with the EVEX coding scheme to support 512-bit registers,
operation masks, parameter broadcasting, and embedded rounding and
exception control, implemented by Knights Landing and Skylake Xeon.

The problem is the following. AVX512F has the `vcvttsd2usi` instruction
which converts `double`/`float` to `uint32_t/uint64_t`. Earlier
architectures (SSE2, AVX2) are sorely lacking such an instruction, as
they only support signed conversions. Unsigned conversions are done with
a signed convert and range shifting -- the exact algorithm depends on
the compiler. A side-effect of these workarounds is that negative
`double`/`float` often inadvertently convert 'as expected', even though
this is invoking undefined behavior. Whereas `vcvttsd2usi` always
returns 0x80000000 or 0x8000000000000000 for out-of-range inputs.

In the context of LuaJIT#415 there was an attempt to fix this partially by
hand-coding the `double` -> `uint64_t` case for certain architectures
(see commit f3cf0d6 "Give expected results for negative non-base-10
numbers in tonumber()."). But the `double` -> `uint32_t` case remained
untouched.

It just so happens that LuaJIT FFI code for Mach-O file generation in
`bcsave.lua` relied on undefined behavior for conversions to `uint32_t`.
That failed when LuaJIT was compiled for an architecture with enabled
AVX512F.

The patch fixes the problem, however, the real issue remains unfixed.

The problem is reproduced when target hardware platform has AVX512F and
LuaJIT is compiled with enabled AVX512F instructions. The support of
AVX512F could be checked in `/proc/cpuinfo` on Linux and `sysctl
hw.optional.avx512f` on Mac. To detect codename execute `cat
/sys/devices/cpu/caps/pmu_name` or `gcc -march=native -Q --help=target |
grep march`. Please consult for available model architecture on GCC
Online Documentation [5].

1. https://github.com/aidansteele/osx-abi-macho-file-format-reference
2. https://github.com/apple-oss-distributions/xnu/blob/xnu-10002.1.13/EXTERNAL_HEADERS/mach-o/nlist.h
3. https://github.com/apple-oss-distributions/xnu/blob/xnu-10002.1.13/EXTERNAL_HEADERS/mach-o/fat.h
4. https://developer.apple.com/documentation/apple-silicon/addressing-architectural-differences-in-your-macos-code
5. https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html

Sergey Bronnikov:
* added the description and a test for the problem

Part of tarantool/tarantool#9595

----------------

luajit -b -o osx -a arm64 empty.lua empty.o
ligurio pushed a commit to tarantool/luajit that referenced this pull request Mar 7, 2024
Thanks to Carlo Cabrera.

(cherry picked from commit 3065c91)

There is a document "OS X ABI Mach-O File Format Reference",
published by Apple company, that contains description of Mach-O
C-structures. Copy of the (now removed) official documentation in [1].
Yet another source of thruth is a XNU headers, see the definition
of C-structures in: [2] (`nlist_64`), [3] (`fat_arch` and `fat_header`).

AVX-512 Foundation (F) – expands most 32-bit and 64-bit based AVX
instructions with the EVEX coding scheme to support 512-bit registers,
operation masks, parameter broadcasting, and embedded rounding and
exception control, implemented by Knights Landing and Skylake Xeon.

The problem is the following. AVX512F has the `vcvttsd2usi` instruction
which converts `double`/`float` to `uint32_t/uint64_t`. Earlier
architectures (SSE2, AVX2) are sorely lacking such an instruction, as
they only support signed conversions. Unsigned conversions are done with
a signed convert and range shifting -- the exact algorithm depends on
the compiler. A side-effect of these workarounds is that negative
`double`/`float` often inadvertently convert 'as expected', even though
this is invoking undefined behavior. Whereas `vcvttsd2usi` always
returns 0x80000000 or 0x8000000000000000 for out-of-range inputs.

In the context of LuaJIT#415 there was an attempt to fix this partially by
hand-coding the `double` -> `uint64_t` case for certain architectures
(see commit f3cf0d6 "Give expected results for negative non-base-10
numbers in tonumber()."). But the `double` -> `uint32_t` case remained
untouched.

It just so happens that LuaJIT FFI code for Mach-O file generation in
`bcsave.lua` relied on undefined behavior for conversions to `uint32_t`.
That failed when LuaJIT was compiled for an architecture with enabled
AVX512F.

The patch fixes the problem, however, the real issue remains unfixed.

The problem is reproduced when target hardware platform has AVX512F and
LuaJIT is compiled with enabled AVX512F instructions. The support of
AVX512F could be checked in `/proc/cpuinfo` on Linux and `sysctl
hw.optional.avx512f` on Mac. To detect codename execute `cat
/sys/devices/cpu/caps/pmu_name` or `gcc -march=native -Q --help=target |
grep march`. Please consult for available model architecture on GCC
Online Documentation [5].

1. https://github.com/aidansteele/osx-abi-macho-file-format-reference
2. https://github.com/apple-oss-distributions/xnu/blob/xnu-10002.1.13/EXTERNAL_HEADERS/mach-o/nlist.h
3. https://github.com/apple-oss-distributions/xnu/blob/xnu-10002.1.13/EXTERNAL_HEADERS/mach-o/fat.h
4. https://developer.apple.com/documentation/apple-silicon/addressing-architectural-differences-in-your-macos-code
5. https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html

Sergey Bronnikov:
* added the description and a test for the problem

Part of tarantool/tarantool#9595
ligurio pushed a commit to tarantool/luajit that referenced this pull request Mar 14, 2024
Thanks to Carlo Cabrera.

(cherry picked from commit 3065c91)

There is a document "OS X ABI Mach-O File Format Reference",
published by Apple company, that contains description of Mach-O
C-structures. Copy of the (now removed) official documentation in [1].
Yet another source of thruth is a XNU headers, see the definition
of C-structures in: [2] (`nlist_64`), [3] (`fat_arch` and `fat_header`).

AVX-512 Foundation (F) – expands most 32-bit and 64-bit based AVX
instructions with the EVEX coding scheme to support 512-bit registers,
operation masks, parameter broadcasting, and embedded rounding and
exception control, implemented by Knights Landing and Skylake Xeon.

The problem is the following. AVX512F has the `vcvttsd2usi` instruction
which converts `double`/`float` to `uint32_t/uint64_t`. Earlier
architectures (SSE2, AVX2) are sorely lacking such an instruction, as
they only support signed conversions. Unsigned conversions are done with
a signed convert and range shifting -- the exact algorithm depends on
the compiler. A side-effect of these workarounds is that negative
`double`/`float` often inadvertently convert 'as expected', even though
this is invoking undefined behavior. Whereas `vcvttsd2usi` always
returns 0x80000000 or 0x8000000000000000 for out-of-range inputs.

In the context of LuaJIT#415 there was an attempt to fix this partially by
hand-coding the `double` -> `uint64_t` case for certain architectures
(see commit f3cf0d6 "Give expected results for negative non-base-10
numbers in tonumber()."). But the `double` -> `uint32_t` case remained
untouched.

It just so happens that LuaJIT FFI code for Mach-O file generation in
`bcsave.lua` relied on undefined behavior for conversions to `uint32_t`.
That failed when LuaJIT was compiled for an architecture with enabled
AVX512F.

The patch fixes the problem, however, the real issue remains unfixed.

The problem is reproduced when target hardware platform has AVX512F and
LuaJIT is compiled with enabled AVX512F instructions. The support of
AVX512F could be checked in `/proc/cpuinfo` on Linux and `sysctl
hw.optional.avx512f` on Mac. To detect codename execute `cat
/sys/devices/cpu/caps/pmu_name` or `gcc -march=native -Q --help=target |
grep march`. Please consult for available model architecture on GCC
Online Documentation [5].

1. https://github.com/aidansteele/osx-abi-macho-file-format-reference
2. https://github.com/apple-oss-distributions/xnu/blob/xnu-10002.1.13/EXTERNAL_HEADERS/mach-o/nlist.h
3. https://github.com/apple-oss-distributions/xnu/blob/xnu-10002.1.13/EXTERNAL_HEADERS/mach-o/fat.h
4. https://developer.apple.com/documentation/apple-silicon/addressing-architectural-differences-in-your-macos-code
5. https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html

Sergey Bronnikov:
* added the description and a test for the problem

Part of tarantool/tarantool#9595
ligurio pushed a commit to tarantool/luajit that referenced this pull request Mar 14, 2024
Thanks to Carlo Cabrera.

(cherry picked from commit 3065c91)

There is a document "OS X ABI Mach-O File Format Reference",
published by Apple company, that contains description of Mach-O
C-structures. Copy of the (now removed) official documentation in [1].
Yet another source of thruth is a XNU headers, see the definition
of C-structures in: [2] (`nlist_64`), [3] (`fat_arch` and `fat_header`).

AVX-512 Foundation (F) – expands most 32-bit and 64-bit based AVX
instructions with the EVEX coding scheme to support 512-bit registers,
operation masks, parameter broadcasting, and embedded rounding and
exception control, implemented by Knights Landing and Skylake Xeon.

The problem is the following. AVX512F has the `vcvttsd2usi` instruction
which converts `double`/`float` to `uint32_t/uint64_t`. Earlier
architectures (SSE2, AVX2) are sorely lacking such an instruction, as
they only support signed conversions. Unsigned conversions are done with
a signed convert and range shifting -- the exact algorithm depends on
the compiler. A side-effect of these workarounds is that negative
`double`/`float` often inadvertently convert 'as expected', even though
this is invoking undefined behavior. Whereas `vcvttsd2usi` always
returns 0x80000000 or 0x8000000000000000 for out-of-range inputs.

In the context of LuaJIT#415 there was an attempt to fix this partially by
hand-coding the `double` -> `uint64_t` case for certain architectures
(see commit f3cf0d6 "Give expected results for negative non-base-10
numbers in tonumber()."). But the `double` -> `uint32_t` case remained
untouched.

It just so happens that LuaJIT FFI code for Mach-O file generation in
`bcsave.lua` relied on undefined behavior for conversions to `uint32_t`.
That failed when LuaJIT was compiled for an architecture with enabled
AVX512F.

The patch fixes the problem, however, the real issue remains unfixed.

The problem is reproduced when target hardware platform has AVX512F and
LuaJIT is compiled with enabled AVX512F instructions. The support of
AVX512F could be checked in `/proc/cpuinfo` on Linux and `sysctl
hw.optional.avx512f` on Mac. To detect codename execute `cat
/sys/devices/cpu/caps/pmu_name` or `gcc -march=native -Q --help=target |
grep march`. Please consult for available model architecture on GCC
Online Documentation [5].

1. https://github.com/aidansteele/osx-abi-macho-file-format-reference
2. https://github.com/apple-oss-distributions/xnu/blob/xnu-10002.1.13/EXTERNAL_HEADERS/mach-o/nlist.h
3. https://github.com/apple-oss-distributions/xnu/blob/xnu-10002.1.13/EXTERNAL_HEADERS/mach-o/fat.h
4. https://developer.apple.com/documentation/apple-silicon/addressing-architectural-differences-in-your-macos-code
5. https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html

Sergey Bronnikov:
* added the description and a test for the problem

Part of tarantool/tarantool#9595
ligurio pushed a commit to tarantool/luajit that referenced this pull request Mar 14, 2024
Thanks to Carlo Cabrera.

(cherry picked from commit 3065c91)

There is a document "OS X ABI Mach-O File Format Reference",
published by Apple company, that contains description of Mach-O
C-structures. Copy of the (now removed) official documentation in [1].
Yet another source of thruth is a XNU headers, see the definition
of C-structures in: [2] (`nlist_64`), [3] (`fat_arch` and `fat_header`).

AVX-512 Foundation (F) – expands most 32-bit and 64-bit based AVX
instructions with the EVEX coding scheme to support 512-bit registers,
operation masks, parameter broadcasting, and embedded rounding and
exception control, implemented by Knights Landing and Skylake Xeon.

The problem is the following. AVX512F has the `vcvttsd2usi` instruction
which converts `double`/`float` to `uint32_t/uint64_t`. Earlier
architectures (SSE2, AVX2) are sorely lacking such an instruction, as
they only support signed conversions. Unsigned conversions are done with
a signed convert and range shifting -- the exact algorithm depends on
the compiler. A side-effect of these workarounds is that negative
`double`/`float` often inadvertently convert 'as expected', even though
this is invoking undefined behavior. Whereas `vcvttsd2usi` always
returns 0x80000000 or 0x8000000000000000 for out-of-range inputs.

In the context of LuaJIT#415 there was an attempt to fix this partially by
hand-coding the `double` -> `uint64_t` case for certain architectures
(see commit f3cf0d6 "Give expected results for negative non-base-10
numbers in tonumber()."). But the `double` -> `uint32_t` case remained
untouched.

It just so happens that LuaJIT FFI code for Mach-O file generation in
`bcsave.lua` relied on undefined behavior for conversions to `uint32_t`.
That failed when LuaJIT was compiled for an architecture with enabled
AVX512F.

The patch fixes the problem, however, the real issue remains unfixed.

The problem is reproduced when target hardware platform has AVX512F and
LuaJIT is compiled with enabled AVX512F instructions. The support of
AVX512F could be checked in `/proc/cpuinfo` on Linux and `sysctl
hw.optional.avx512f` on Mac. To detect codename execute `cat
/sys/devices/cpu/caps/pmu_name` or `gcc -march=native -Q --help=target |
grep march`. Please consult for available model architecture on GCC
Online Documentation [5].

1. https://github.com/aidansteele/osx-abi-macho-file-format-reference
2. https://github.com/apple-oss-distributions/xnu/blob/xnu-10002.1.13/EXTERNAL_HEADERS/mach-o/nlist.h
3. https://github.com/apple-oss-distributions/xnu/blob/xnu-10002.1.13/EXTERNAL_HEADERS/mach-o/fat.h
4. https://developer.apple.com/documentation/apple-silicon/addressing-architectural-differences-in-your-macos-code
5. https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html

Sergey Bronnikov:
* added the description and a test for the problem

Part of tarantool/tarantool#9595
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants