Atomic instructions #556

Open
tathanhdinh opened this issue Oct 26, 2021 · 3 comments

Comments

@tathanhdinh
Contributor

tathanhdinh commented Oct 26, 2021

Hello everyone,

When lifting some atomic instructions, e.g. lock inc byte ptr [rax]:

remill-lift-12 --arch amd64 --name_register_variables --bc-out test.bc --bytes F0FE00

I've got the following result:

; Function Attrs: noinline nounwind
define dso_local %struct.Memory* @sub_0(%struct.State* noalias nonnull align 1 %state, i64 %pc, %struct.Memory* noalias %memory) local_unnamed_addr #0 {
  %RAX_ptr = getelementptr inbounds %struct.State, %struct.State* %state, i64 0, i32 6, i32 1, i32 0, i32 0
  %PC_ptr = getelementptr inbounds %struct.State, %struct.State* %state, i64 0, i32 6, i32 33, i32 0, i32 0
  %1 = add i64 %pc, 3
  %2 = tail call %struct.Memory* @__remill_atomic_begin(%struct.Memory* %memory)
  %RAX_ = load i64, i64* %RAX_ptr, align 8
  %3 = tail call zeroext i8 @__remill_read_memory_8(%struct.Memory* %2, i64 %RAX_) #5
  %4 = add i8 %3, 1
  %5 = tail call %struct.Memory* @__remill_write_memory_8(%struct.Memory* %2, i64 %RAX_, i8 zeroext %4) #5
  %6 = zext i8 %4 to i32
  %7 = tail call i32 @llvm.ctpop.i32(i32 %6) #6
  %8 = trunc i32 %7 to i8
  %9 = and i8 %8, 1
  %10 = xor i8 %9, 1
  %PF_ptr = getelementptr inbounds %struct.State, %struct.State* %state, i64 0, i32 2, i32 3
  store i8 %10, i8* %PF_ptr, align 1
  %11 = xor i8 %4, %3
  %12 = lshr i8 %11, 4
  %13 = and i8 %12, 1
  %AF_ptr = getelementptr inbounds %struct.State, %struct.State* %state, i64 0, i32 2, i32 5
  store i8 %13, i8* %AF_ptr, align 1
  %14 = icmp eq i8 %4, 0
  %15 = zext i1 %14 to i8
  %ZF_ptr = getelementptr inbounds %struct.State, %struct.State* %state, i64 0, i32 2, i32 7
  store i8 %15, i8* %ZF_ptr, align 1
  %16 = lshr i8 %4, 7
  %SF_ptr = getelementptr inbounds %struct.State, %struct.State* %state, i64 0, i32 2, i32 9
  store i8 %16, i8* %SF_ptr, align 1
  %17 = lshr i8 %3, 7
  %18 = xor i8 %16, %17
  %19 = add nuw nsw i8 %18, %16
  %20 = icmp eq i8 %19, 2
  %21 = zext i1 %20 to i8
  %OF_ptr = getelementptr inbounds %struct.State, %struct.State* %state, i64 0, i32 2, i32 13
  store i8 %21, i8* %OF_ptr, align 1
  %22 = tail call %struct.Memory* @__remill_atomic_end(%struct.Memory* %5)
  store i64 %1, i64* %PC_ptr, align 8
  %23 = tail call %struct.Memory* @__remill_missing_block(%struct.State* %state, i64 %1, %struct.Memory* %22)
  ret %struct.Memory* %23
}

; Function Attrs: noduplicate noinline nounwind optnone readnone willreturn
declare dso_local %struct.Memory* @__remill_atomic_begin(%struct.Memory* readnone) #1

; Function Attrs: noduplicate noinline nounwind optnone readnone willreturn
declare dso_local %struct.Memory* @__remill_atomic_end(%struct.Memory* readnone) #1

; Function Attrs: noduplicate noinline nounwind optnone readnone willreturn
declare dso_local zeroext i8 @__remill_read_memory_8(%struct.Memory* readnone, i64) #2

; Function Attrs: noduplicate noinline nounwind optnone readnone willreturn
declare dso_local %struct.Memory* @__remill_write_memory_8(%struct.Memory*, i64, i8 zeroext) #2

where atomic intrinsic functions are declared as:

// Atomic operations. The address/size are hints, but the granularity of the
// access can be bigger. These have implicit StoreLoad semantics.
[[gnu::used, gnu::const]] extern Memory *__remill_atomic_begin(Memory *);
[[gnu::used, gnu::const]] extern Memory *__remill_atomic_end(Memory *);

I'm not quite sure that I understand the lifted code correctly.

First, I think of the atomic intrinsics as a kind of memory barrier (for StoreLoad ordering), but I still don't understand how such barriers make the execution of the instruction atomic.

Second, if they are a kind of critical-section acquire/release operation (to make the memory read/write exclusive), then their positions seem "too strict": e.g. the flag-updating part does not need to belong to the critical section.

Indeed, imagine two processors P0 and P1 executing the function sub_0 simultaneously, each with its own struct State* %state (say %state0 for P0 and %state1 for P1) but sharing the %struct.Memory* %memory. Then the only operations that need to be exclusive (or atomic) are __remill_read_memory_8 and __remill_write_memory_8; there is no need to synchronize the flag updates, since they go to the distinct %state0 and %state1.

Third, the attribute readnone seems too "relaxed" for these functions (couldn't the optimizer move them around?); should they be inaccessiblememonly instead?
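
A hand-written sketch of the difference I have in mind (not actual remill output), just to make the question concrete:

; current: readnone (plus nounwind/willreturn) lets the optimizer delete a
; call whose result is unused, and move calls around freely
declare dso_local %struct.Memory* @__remill_atomic_begin(%struct.Memory*) readnone nounwind willreturn

; alternative: inaccessiblememonly still hides the memory model from LLVM,
; but the call keeps a possible write to state LLVM cannot see, so it can
; no longer be deleted or reordered past other such calls
declare dso_local %struct.Memory* @__remill_atomic_begin(%struct.Memory*) inaccessiblememonly nounwind willreturn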

Thank you in advance for any response.

PS. Another approach I'm thinking of would be to introduce __remill_atomic_read/write_xxx intrinsics for atomic instructions (used instead of __remill_read/write_xxx, so there would be no need for __remill_atomic_begin/end).
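
For instance (hypothetical declarations, mirroring the existing non-atomic intrinsics):

declare dso_local zeroext i8 @__remill_atomic_read_memory_8(%struct.Memory*, i64)
declare dso_local %struct.Memory* @__remill_atomic_write_memory_8(%struct.Memory*, i64, i8 zeroext)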

@pgoodman
Collaborator

The way I generally think about Memory * is the following:

  1. I made it a pointer to a Memory because I was originally imagining using Remill in an emulation setup, where I would have the memory implementation be some structure/class.
  2. What it really is doesn't matter, so long as LLVM can't prove that an input memory argument and the returned memory are the same. It can't do this, because the memory intrinsics are generally not implemented. Thus, what is achieved is SSA over program memory operations (see the sketch after this list).
  3. Nothing in here truly guarantees atomicity, and there's more to atomicity than just memory operations. x86, for example, has precise interrupt delivery. Ignoring CPU errata, this implies failure atomicity of instructions: if something happens (e.g. an interrupt, exception), then the CPU rolls back the state and presents things as if the instruction never executed.
  4. For the lock prefix, at the decoding side, we don't really "know" what particular memory operation needs to be locked. At the time of coding all of this up, I was honestly just lazy and didn't want to make a bunch of LOCK-prefixed variants of the semantics that used different memory intrinsics. So, instead, I surrounded them in a sort of mock critical section.
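
As a sketch of point 2 (hand-written, with hypothetical values %m0 and %addr, not lifted output):

%m1 = call %struct.Memory* @__remill_write_memory_64(%struct.Memory* %m0, i64 %addr, i64 %a)
%m2 = call %struct.Memory* @__remill_write_memory_64(%struct.Memory* %m1, i64 %addr, i64 %b)
; Each write consumes one memory token and produces a new one. Because the
; intrinsics have no bodies, LLVM cannot prove %m1 == %m0 or %m2 == %m1,
; so the calls cannot be folded together: program memory is in SSA form.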

@tathanhdinh
Contributor Author

tathanhdinh commented Oct 27, 2021

Thank you for the feedback @pgoodman. I understand your idea better now.

I agree with 1. and 2.; however, the LLVM optimizer gives somewhat counter-intuitive results.

For example, when lifting this code

mov rax, [rcx]
mov rbx, [rcx]
remill-lift-12 --arch amd64 --name_register_variables --bc-out test.bc --bytes 488b01488b19

we receive the following LLVM code

define dso_local %struct.Memory* @sub_0(%struct.State* noalias nonnull align 1 %state, i64 %pc, %struct.Memory* noalias %memory) local_unnamed_addr #0 {
  %RBX_ptr = getelementptr inbounds %struct.State, %struct.State* %state, i64 0, i32 6, i32 3, i32 0, i32 0
  %RCX_ptr = getelementptr inbounds %struct.State, %struct.State* %state, i64 0, i32 6, i32 5, i32 0, i32 0
  %RAX_ptr = getelementptr inbounds %struct.State, %struct.State* %state, i64 0, i32 6, i32 1, i32 0, i32 0
  %PC_ptr = getelementptr inbounds %struct.State, %struct.State* %state, i64 0, i32 6, i32 33, i32 0, i32 0
  %RCX_ = load i64, i64* %RCX_ptr, align 8
  %1 = tail call i64 @__remill_read_memory_64(%struct.Memory* %memory, i64 %RCX_) #3
  store i64 %1, i64* %RAX_ptr, align 8
  %2 = add i64 %pc, 6
  store i64 %1, i64* %RBX_ptr, align 8
  store i64 %2, i64* %PC_ptr, align 8
  %3 = tail call %struct.Memory* @__remill_missing_block(%struct.State* %state, i64 %2, %struct.Memory* %memory)
  ret %struct.Memory* %3
}

This result is fine: the two reads are merged into a single read. Now let's change the code a little bit:

mov rax, [rcx]
mov [rcx], rax
mov rbx, [rcx]
remill-lift-12 --arch amd64 --name_register_variables --bc-out test.bc --bytes 488b01488901488b19

and this time the optimizer is completely neutralized:

define dso_local %struct.Memory* @sub_0(%struct.State* noalias nonnull align 1 %state, i64 %pc, %struct.Memory* noalias %memory) local_unnamed_addr #0 {
  %RBX_ptr = getelementptr inbounds %struct.State, %struct.State* %state, i64 0, i32 6, i32 3, i32 0, i32 0
  %RCX_ptr = getelementptr inbounds %struct.State, %struct.State* %state, i64 0, i32 6, i32 5, i32 0, i32 0
  %RAX_ptr = getelementptr inbounds %struct.State, %struct.State* %state, i64 0, i32 6, i32 1, i32 0, i32 0
  %PC_ptr = getelementptr inbounds %struct.State, %struct.State* %state, i64 0, i32 6, i32 33, i32 0, i32 0
  %RCX_ = load i64, i64* %RCX_ptr, align 8
  %1 = tail call i64 @__remill_read_memory_64(%struct.Memory* %memory, i64 %RCX_) #3
  store i64 %1, i64* %RAX_ptr, align 8
  %2 = tail call %struct.Memory* @__remill_write_memory_64(%struct.Memory* %memory, i64 %RCX_, i64 %1) #3
  %3 = add i64 %pc, 9
  %4 = tail call i64 @__remill_read_memory_64(%struct.Memory* %2, i64 %RCX_) #3
  store i64 %4, i64* %RBX_ptr, align 8
  store i64 %3, i64* %PC_ptr, align 8
  %5 = tail call %struct.Memory* @__remill_missing_block(%struct.State* %state, i64 %3, %struct.Memory* %2)
  ret %struct.Memory* %5
}

That is because the write operation introduces a new memory; since (as you've said) the optimizer cannot prove that the input and the returned memory are the same, the second read cannot be eliminated.

Likewise, we will always see two writes with

mov [rcx], rbx
mov [rcx], rax
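
In lifted form that stays as two threaded writes (a sketch in the style of the IR above, with hypothetical names):

%m1 = call %struct.Memory* @__remill_write_memory_64(%struct.Memory* %memory, i64 %RCX_, i64 %RBX_)
%m2 = call %struct.Memory* @__remill_write_memory_64(%struct.Memory* %m1, i64 %RCX_, i64 %RAX_)
; %m1 is consumed by the second call, so the first store can never be
; proven dead, even though it is immediately overwritten at the same address.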

Of course, one can say that this depends on the concrete implementation of the memory read/write operations. But there is something asymmetric here: with two reads the optimizer works, but not with two writes (or with a write between two reads).

I wonder whether remill could provide basic properties for the memory read/write operations, something that follows the axioms of the SMT theory of arrays. I don't know how to do that with LLVM attributes alone, but an LLVM pass (like dead store elimination) might help.
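
For reference, the read-over-write axioms that such a pass would effectively implement:

read(write(m, a, v), a) = v
a != b  =>  read(write(m, a, v), b) = read(m, b)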

For 3., I admit that I was not thinking about the atomicity of instruction execution under interrupts or exceptions. On x86 or AMD64, exception and interrupt handling always occurs at an instruction boundary (i.e. an instruction is either executed completely or not executed at all), as stated in Intel's SDM Volume 3, Chapter 6.6:

To allow the restarting of program or task following the handling of an exception or an interrupt, all exceptions (except aborts) are guaranteed to report exceptions on an instruction boundary. All interrupts are guaranteed to be taken on an instruction boundary.

I think this atomicity guarantee may not be in the scope of remill (but rather of an emulator which uses remill, as you've said).

Actually, the main point of my issue is 4. I initially wondered how __remill_atomic_begin/end guarantees atomic memory operations, since a lock concerns only this kind of operation (Intel's SDM Volume 3, Chapter 8.1). The current implementation is stricter, since it also guards the CPU flag updates.
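
In other words, a narrower region like the following would seem to suffice (hand-written sketch, reusing the intrinsics from the lift above):

%m1 = call %struct.Memory* @__remill_atomic_begin(%struct.Memory* %memory)
%v = call zeroext i8 @__remill_read_memory_8(%struct.Memory* %m1, i64 %RAX_)
%r = add i8 %v, 1
%m2 = call %struct.Memory* @__remill_write_memory_8(%struct.Memory* %m1, i64 %RAX_, i8 zeroext %r)
%m3 = call %struct.Memory* @__remill_atomic_end(%struct.Memory* %m2)
; the flag updates computed from %v and %r could then be stored to the
; State structure here, outside the critical section, since State is
; private to the executing thread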

@pgoodman
Collaborator

So merging two reads into one is technically sketchy, but it's hard to avoid, unless we introduce a new parameter to the read intrinsics, which would be doable but slightly annoying. Here is an example demonstrating why the merging is actually a bug.

Usage of the memory pointer establishes a partial order, not a total order. Thus, a read of an earlier memory pointer can technically sink below a write. We have no way of stopping this at the moment.
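
A sketch of that hazard (hypothetical tokens, not an actual lift):

%v = call i64 @__remill_read_memory_64(%struct.Memory* %m0, i64 %addr)
%m1 = call %struct.Memory* @__remill_write_memory_64(%struct.Memory* %m0, i64 %addr, i64 %x)
; %v depends only on %m0, not on %m1: no data dependency orders the read
; before the write, so the optimizer may sink the (readnone) read below
; the write. The memory tokens give a partial order, and these two calls
; are unordered with respect to each other.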

My point with the failure atomicity of instructions was that we're already not faithful to the hardware in this regard, and so if stuff "leaks out" of an "atomic region" (bounded by the two intrinsics), then as long as it isn't a memory access, it's "fine."

The intention of the atomic regions is to signal that "something needs mutual exclusion over memory." That is, think of it really as a critical section for memory operations; the flags computations or other register mutations that happen to be there are just that -- they happen to be there. State structures don't exist in real programs; our intuition for them is that they are truly private to this thread, and so we should think of one state structure as not being observable by another thread, even if the nature of the state structure pointer admits this possibility.
