
X64 widening elimination can interact badly with CSE #1086

Open
corsix opened this issue Sep 18, 2023 · 3 comments
corsix commented Sep 18, 2023

The following currently fails on X64 (though it works fine with any of -joff / -O-cse / -O-fold):

local ffi = require("ffi")
local bit = require("bit")
local zero = ffi.new("struct { uint8_t u8; }", 0)
local v1, v2
for i = 1, 60 do
  v1 = bit.bnot(zero.u8)
  v2 = bit.bnot(ffi.cast("int64_t", zero.u8))
end
assert(bit.bnot(v1) == 0)
assert(bit.bnot(v2) == 0LL) -- fails

The issue is that FOLD turns ffi.cast("int64_t", zero.u8) into zero.u8, and CSE then considers the two bit.bnot(zero.u8) expressions identical, even though the type of their IR_BNOT differs.


corsix commented Sep 18, 2023

Variant that affects X64 and ARM64 equally:

local ffi = require("ffi")
local bit = require("bit")
local z = 0
local t1 = {}
local t2 = {}
for i = 1, 60 do
  t1[i] = bit.bnot(z)
  t2[i] = bit.bnot(ffi.cast("uint64_t", ffi.cast("uint32_t", bit.tobit(z))))
end
assert(t1[1] == t1[60])
assert(t2[1] == t2[60])

This one relies on the special-case construction of a uint64_t cdata from a uint32_t value (the if (dsize == 8 && ssize < 8 && !(LJ_64 && (sinfo & CTF_UNSIGNED))) path), followed by LJFOLD(FLOAD CNEWI IRFL_CDATA_INT64) handing us back the uint32_t value where a 64-bit value was expected.


corsix commented Sep 19, 2023

So the question I'm dancing around is: should no-op CONV operations be implicit or explicit in the IR? For most 64-bit backends, 32-bit to 64-bit zero extension is such a no-op, though presumably RISCV64 would have 32-bit to 64-bit sign extension be a no-op instead. u32 <-> i32 and u64 <-> i64 are also no-op candidates, and occasionally i64/u64 -> i32/u32 when the high bits are known to be zero (c.f. IRCONV_NONE).

Currently it is a mix of both; CNEWI can have an implicit u32 -> u64, X64 fold rules can also introduce implicit u32 -> u64, XSTORE can have implicit i32 -> u32 (and possibly other things), but most other cases are explicit (though see #1084).

From an IR cleanliness perspective, explicit seems preferable, though a middle-ground is constructing the IR with explicit conversions and then immediately folding them away (with backend-awareness where appropriate, c.f. RISCV64).

From a code generation perspective, explicit can cause higher register usage; if the input and output of the no-op conversion are live at the same time, the register allocator isn't clever enough to share the same register for both (though aggressive fusion can help ameliorate this).

If the answer is explicit, then IR construction and/or FOLD rules need fixing up. If the answer is implicit, then various FOLD rules can be made more aggressive, CSE needs to do a little bit of type checking, and backends need to be audited (e.g. ARM64's asm_fuseopm assumes that the bit-width of the operation equals the bit-width of its operands).

Said CSE amendment could be something like:

@@ -2559,11 +2560,17 @@ TRef LJ_FASTCALL lj_opt_cse(jit_State *J)
     /* Limited search for same operands in per-opcode chain. */
     IRRef ref = J->chain[op];
     IRRef lim = fins->op1;
+#if LJ_64
+    uint32_t wrongt = IRT_IS64;
+    if (irt_is64(fins->t)) wrongt = ~wrongt;
+#endif
     if (fins->op2 > lim) lim = fins->op2;  /* Relies on lit < REF_BIAS. */
-    while (ref > lim) {
-      if (IR(ref)->op12 == op12)
-       return TREF(ref, irt_t(IR(ref)->t));  /* Common subexpression found. */
-      ref = IR(ref)->prev;
+    for (; ref > lim; ref = IR(ref)->prev) {
+      if (IR(ref)->op12 != op12) continue;
+#if LJ_64
+      if (((wrongt >> irt_type(IR(ref)->t)) & 1)) continue;
+#endif
+      return TREF(ref, irt_t(IR(ref)->t));  /* Common subexpression found. */
     }
   }
   /* Otherwise emit IR (inlined for speed). */

@MikePall
Member

I don't want to overhaul major parts of the IR for v2.1. Will apply whatever minimal change is required.

The 32/64 bit unsoundness should be rectified in v3.0. But only after the initial cleanups (archs, bytecode, IR_HREF, ...).

@MikePall MikePall mentioned this issue Sep 21, 2023