Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bad perf with memory refinement query #983

Open
regehr opened this issue Dec 3, 2023 · 3 comments
Open

Bad perf with memory refinement query #983

regehr opened this issue Dec 3, 2023 · 3 comments
Labels
memory Memory Model performance Opportunities for improving performance

Comments

@regehr
Copy link
Contributor

regehr commented Dec 3, 2023

here's an interesting test case that @tanmaytirpankar found:

define void @insert_store_nonconst_index_not_known_valid_by_and(ptr %0, i8 zeroext %1, i32 %2) {
 %4 = load <16 x i8>, ptr %0, align 16
 %5 = and i32 %2, 16
 %6 = insertelement <16 x i8> %4, i8 %1, i32 %5
 store <16 x i8> %6, ptr %0, align 16
 ret void
}

it exposes two issues. first, if the insertelement index is OOB, the result is a vector full of poison. however, on the ARM side, we get something like:

define void @insert_store_nonconst_index_not_known_valid_by_and(ptr nocapture writeonly %0, i8 %1, i32 %2) local_unnamed_addr #0 {
arm_tv_entry:
  %a2_2 = and i32 %2, 16
  %a2_3 = zext nneg i32 %a2_2 to i64
  %3 = getelementptr i8, ptr %0, i64 %a2_3
  store i8 %1, ptr %3, align 1
  ret void
}

of course at this level, an OOB store is UB. so we don't have a refinement, and moreover, I don't know that we can easily fix this in the general case, as we can for other target-side UB like divide by zero and shift-past-bitwidth.

but that's not all! if we force the index to be in bounds like this:

define void @insert_store_nonconst_index_not_known_valid_by_and(ptr %0, i8 zeroext %1, i32 %2) {
 %4 = load <16 x i8>, ptr %0, align 16
 %5 = and i32 %2, 15
 %6 = insertelement <16 x i8> %4, i8 %1, i32 %5
 store <16 x i8> %6, ptr %0, align 16
 ret void
}

then of course we don't have any problems with UB, but we still can't prove refinement. here's a src-tgt pair that doesn't seem too difficult, but it runs for 99 minutes and then I get "ERROR: SMT Error: smt tactic failed to show goal to be sat/unsat memout":

define void @src(ptr %0, i8 zeroext %1, i32 %2) {
 %x4 = load <16 x i8>, ptr %0, align 1
 %z = and i32 %2, 15
 %x6 = insertelement <16 x i8> %x4, i8 %1, i32 %z
 store <16 x i8> %x6, ptr %0, align 1
 ret void
}

define void @tgt(ptr %0, i8 %1, i32 %2) {
arm_tv_entry:
   %z = and i32 %2, 15
   %3 = getelementptr i8, ptr %0, i32 %z
  store i8 %1, ptr %3, align 1
  ret void
}
regehr@ohm:~/alive2-regehr/build$ time ./alive-tv foo2.ll --disable-poison-input --disable-undef-input --smt-to=10000000

----------------------------------------
define void @src(ptr %#0, i8 zeroext %#1, i32 %#2) {
#3:
  %x4 = load <16 x i8>, ptr %#0, align 1
  %z = and i32 %#2, 15
  %x6 = insertelement <16 x i8> %x4, i8 zeroext %#1, i32 %z
  store <16 x i8> %x6, ptr %#0, align 1
  ret void
}
=>
define void @tgt(ptr %#0, i8 %#1, i32 %#2) {
arm_tv_entry:
  %z = and i32 %#2, 15
  %#3 = gep ptr %#0, 1 x i32 %z
  store i8 %#1, ptr %#3, align 1
  ret void
}

ERROR: SMT Error: smt tactic failed to show goal to be sat/unsat memout

real	99m30.728s
user	98m27.443s
sys	0m42.087s
regehr@ohm:~/alive2-regehr/build$ 
@nunoplopes
Copy link
Member

LLVM is buggy. The generated assembly is:

src:
        and     w8, w2, #0x10
        strb    w1, [x0, w8, uxtw]
        ret

It may store a byte OOB where the original code did not.

@regehr
Copy link
Contributor Author

regehr commented Dec 3, 2023

ah.... thanks Nuno, we missed that. they're not going to be happy about this since fixing it will add instructions, unless the fix is to make OOB InsertElement instructions UB.

I'll leave this issue open for the performance problem, but of course close if you want

@nunoplopes
Copy link
Member

we need to improve perf for this:

define void @src(ptr %0, i8 zeroext %1, i32 %2) {
 %4 = load <16 x i8>, ptr %0, align 16
 %5 = and i32 %2, 15
 %6 = insertelement <16 x i8> %4, i8 %1, i32 %5
 store <16 x i8> %6, ptr %0, align 16
 ret void
}

define void @tgt(ptr nocapture writeonly %0, i8 %1, i32 %2) {
  %a2_2 = and i32 %2, 15
  %a2_3 = zext nneg i32 %a2_2 to i64
  %x3 = getelementptr i8, ptr %0, i64 %a2_3
  store i8 %1, ptr %x3, align 1
  ret void
}

Essentially the refinement query uses a forall-quantified idx to go through the memories. It should be unrolled in small (all?) cases.

@nunoplopes nunoplopes changed the title two issues exposed by one arm-tv test case Bad perf with memory refinement query Dec 4, 2023
@nunoplopes nunoplopes added performance Opportunities for improving performance memory Memory Model labels Dec 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
memory Memory Model performance Opportunities for improving performance
Projects
None yet
Development

No branches or pull requests

2 participants