
X64 widening elimination can interact badly with CSE #1086

Open
corsix opened this issue Sep 18, 2023 · 3 comments
corsix commented Sep 18, 2023

The following currently fails on X64 (though it works fine with any of -joff / -O-cse / -O-fold):

local ffi = require("ffi")
local bit = require("bit")
local zero = ffi.new("struct { uint8_t u8; }", 0)
local v1, v2
for i = 1, 60 do
  v1 = bit.bnot(zero.u8)
  v2 = bit.bnot(ffi.cast("int64_t", zero.u8))
end
assert(bit.bnot(v1) == 0)
assert(bit.bnot(v2) == 0LL) -- fails

The issue is that FOLD turns ffi.cast("int64_t", zero.u8) into zero.u8, and CSE then considers the two bit.bnot(zero.u8) expressions identical, even though the type of their IR_BNOT differs.


corsix commented Sep 18, 2023

Variant that affects X64 and ARM64 equally:

local ffi = require("ffi")
local bit = require("bit")
local z = 0
local t1 = {}
local t2 = {}
for i = 1, 60 do
  t1[i] = bit.bnot(z)
  t2[i] = bit.bnot(ffi.cast("uint64_t", ffi.cast("uint32_t", bit.tobit(z))))
end
assert(t1[1] == t1[60])
assert(t2[1] == t2[60])

This one relies on the special-case construction of a uint64_t cdata from a uint32_t value (the if (dsize == 8 && ssize < 8 && !(LJ_64 && (sinfo & CTF_UNSIGNED))) path), followed by LJFOLD(FLOAD CNEWI IRFL_CDATA_INT64) handing us back the uint32_t value where a 64-bit value was expected.


corsix commented Sep 19, 2023

So the question I'm dancing around is: should no-op CONV operations be implicit or explicit in the IR? For most 64-bit backends, 32-bit to 64-bit zero extension is such a no-op, though presumably RISCV64 would have 32-bit to 64-bit sign extension be a no-op instead. u32 <-> i32 and u64 <-> i64 are also no-op candidates, and occasionally i64/u64 -> i32/u32 when the high bits are known to be zero (c.f. IRCONV_NONE).

Currently it is a mix of both; CNEWI can have an implicit u32 -> u64, X64 fold rules can also introduce implicit u32 -> u64, XSTORE can have implicit i32 -> u32 (and possibly other things), but most other cases are explicit (though see #1084).

From an IR cleanliness perspective, explicit seems preferable, though a middle-ground is constructing the IR with explicit conversions and then immediately folding them away (with backend-awareness where appropriate, c.f. RISCV64).

From a code generation perspective, explicit can cause higher register usage; if the input and output of the no-op conversion are live at the same time, the register allocator isn't clever enough to share the same register for both (though aggressive fusion can help ameliorate this).

If the answer is explicit, then IR construction and/or FOLD rules need fixing up. If the answer is implicit, then various FOLD rules can be made more aggressive, CSE needs to do a little bit of type checking, and backends need to be audited (e.g. ARM64's asm_fuseopm assumes that the bit-width of the operation equals the bit-width of its operands).

Said CSE amendment could be something like:

@@ -2559,11 +2560,17 @@ TRef LJ_FASTCALL lj_opt_cse(jit_State *J)
     /* Limited search for same operands in per-opcode chain. */
     IRRef ref = J->chain[op];
     IRRef lim = fins->op1;
+#if LJ_64
+    uint32_t wrongt = IRT_IS64;
+    if (irt_is64(fins->t)) wrongt = ~wrongt;
+#endif
     if (fins->op2 > lim) lim = fins->op2;  /* Relies on lit < REF_BIAS. */
-    while (ref > lim) {
-      if (IR(ref)->op12 == op12)
-       return TREF(ref, irt_t(IR(ref)->t));  /* Common subexpression found. */
-      ref = IR(ref)->prev;
+    for (; ref > lim; ref = IR(ref)->prev) {
+      if (IR(ref)->op12 != op12) continue;
+#if LJ_64
+      if (((wrongt >> irt_type(IR(ref)->t)) & 1)) continue;
+#endif
+      return TREF(ref, irt_t(IR(ref)->t));  /* Common subexpression found. */
     }
   }
   /* Otherwise emit IR (inlined for speed). */

@MikePall
Member

I don't want to overhaul major parts of the IR for v2.1. Will apply whatever minimal change is required.

The 32/64 bit unsoundness should be rectified in v3.0. But only after the initial cleanups (archs, bytecode, IR_HREF, ...).

@MikePall MikePall mentioned this issue Sep 21, 2023