X64 widening elimination can interact badly with CSE #1086
Variant that affects X64 and ARM64 equally:

```lua
local ffi = require("ffi")
local bit = require("bit")
local z = 0
local t1 = {}
local t2 = {}
for i = 1, 60 do
  t1[i] = bit.bnot(z)
  t2[i] = bit.bnot(ffi.cast("uint64_t", ffi.cast("uint32_t", bit.tobit(z))))
end
assert(t1[1] == t1[60])
assert(t2[1] == t2[60])
```

This one relies on the special-case construction of a uint64_t cdata from a uint32_t value.
So the question I'm dancing around is: should no-op CONV operations be implicit or explicit in the IR? For most 64-bit backends, 32-bit to 64-bit zero extension is such a no-op, though presumably RISCV64 would have 32-bit to 64-bit sign extension be a no-op instead. u32 <-> i32 and u64 <-> i64 are also no-op candidates, and occasionally i64/u64 -> i32/u32 when the high bits are known to be zero.

Currently it is a mix of both: CNEWI can have an implicit u32 -> u64, X64 fold rules can also introduce implicit u32 -> u64, XSTORE can have implicit i32 -> u32 (and possibly other things), but most other cases are explicit (though see #1084).

From an IR cleanliness perspective, explicit seems preferable, though a middle ground is constructing the IR with explicit conversions and then immediately folding them away (with backend awareness where appropriate, cf. RISCV64). From a code generation perspective, explicit can cause higher register usage: if the input and output of the no-op conversion are live at the same time, the register allocator isn't clever enough to share the same register for both (though aggressive fusion can help ameliorate this).

If the answer is explicit, then IR construction and/or FOLD rules need fixing up. If the answer is implicit, then various FOLD rules can be made more aggressive, CSE needs to do a little bit of type checking, and backends need to be audited (e.g. ARM64's).

Said CSE amendment could be something like:

```diff
@@ -2559,11 +2560,17 @@ TRef LJ_FASTCALL lj_opt_cse(jit_State *J)
     /* Limited search for same operands in per-opcode chain. */
     IRRef ref = J->chain[op];
     IRRef lim = fins->op1;
+#ifdef LJ_64
+    uint32_t wrongt = IRT_IS64;
+    if (irt_is64(fins->t)) wrongt = ~wrongt;
+#endif
     if (fins->op2 > lim) lim = fins->op2;  /* Relies on lit < REF_BIAS. */
-    while (ref > lim) {
-      if (IR(ref)->op12 == op12)
-        return TREF(ref, irt_t(IR(ref)->t));  /* Common subexpression found. */
-      ref = IR(ref)->prev;
+    for (; ref > lim; ref = IR(ref)->prev) {
+      if (IR(ref)->op12 != op12) continue;
+#ifdef LJ_64
+      if (((wrongt >> irt_type(IR(ref)->t)) & 1)) continue;
+#endif
+      return TREF(ref, irt_t(IR(ref)->t));  /* Common subexpression found. */
     }
   }
   /* Otherwise emit IR (inlined for speed). */
```
I don't want to overhaul major parts of the IR for v2.1. Will apply whatever minimal change is required. The 32/64 bit unsoundness should be rectified in v3.0, but only after the initial cleanups (archs, bytecode, IR_HREF, ...).
The following currently fails on X64 (though works fine with any of `-joff` / `-O-cse` / `-O-fold`):

Issue is that FOLD turns `ffi.cast("int64_t", zero.u8)` into `zero.u8`, and then CSE considers the two `bit.bnot(zero.u8)` expressions to be identical, even though the type on their IR_BNOT differs.