Should use C11 atomics #17

Open · clausecker opened this issue Mar 12, 2022 · 8 comments

@clausecker

The project currently uses haphazardly hand-crafted atomics built on volatile variables. This is not guaranteed to work correctly and is not portable. Consider migrating to C11 atomics if possible.
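For illustration only, a minimal sketch of what such a migration looks like at the declaration level, using must_suspend as one of the affected variables; the accessor functions here are hypothetical, not the project's code:

```c
#include <stdatomic.h>

/* before: volatile int must_suspend;  -- no atomicity or ordering guarantees */
atomic_int must_suspend;

/* hypothetical accessors, purely for illustration */
void request_suspend(void)
{
    atomic_store(&must_suspend, 1);    /* sequentially consistent store */
}

int suspend_requested(void)
{
    return atomic_load(&must_suspend); /* sequentially consistent load */
}
```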

@trolando
Owner

Hi @clausecker, while migrating to C11 atomics is probably a good idea for portability, I do not yet understand why volatile is a problem. The reason I use volatile here is to force the compiler to emit a read operation instead of optimizing the read away. If I were to migrate to C11 atomics, why couldn't I keep using volatile for this under C11 semantics?

@clausecker
Author

The volatile qualifier implies neither atomicity nor any special memory-ordering constraints. Its purpose is to model memory-mapped peripherals or to serve as a debugging aid, e.g. permitting variables to be changed from a debugger. This means:

  • the C compiler is permitted to split a single write into two or more accesses if needed
  • the compiler is not required to use the special atomic store instructions that some architectures need
  • no memory ordering is implied
  • the compiler can freely move reads and writes of other variables around a volatile access, potentially breaking the intended semantics (see the sketch below)

The volatile qualifier is just not the right qualifier for atomic variables and may blow up in your face on architectures with more lenient memory models or with certain compiler optimisations. Race conditions emerging from this are very difficult to debug.
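A minimal sketch of that last point, using a hypothetical data/ready pair rather than anything from this project: with release/acquire atomics the compiler and the hardware must keep the data write ahead of the flag, whereas volatile gives no such guarantee.

```c
#include <stdatomic.h>

int data;           /* ordinary shared data */
atomic_int ready;   /* flag guarding it */

void producer(void)
{
    data = 42;
    /* release store: the write to data cannot be moved past this point */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

int consumer(void)
{
    /* acquire load: if the flag is seen, the write to data is seen too */
    if (atomic_load_explicit(&ready, memory_order_acquire))
        return data;    /* guaranteed to read 42 */
    return -1;
}
```

Had `ready` been a plain `volatile int`, neither the compiler nor a weakly ordered CPU would be obliged to keep those two accesses in order.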

> The reason I use volatile here is to force the compiler to emit a read operation instead of optimizing the read away.

You are also using volatile to implement atomic variables, e.g. with the barrier_t type and the must_suspend variable. In fact, all uses of volatile I have found in your code try to implement atomic variables.

@trolando
Owner

With the barrier_t type, this could probably be implemented using C11 atomic types, although I'd have to read up on them first. However, must_suspend is intended to be read non-atomically. How do you propose implementing that with C11 atomics while preventing the compiler from optimizing away the reads?

@clausecker
Author

@trolando I don't see how must_suspend is a non-atomic access. You write to the variable from one thread and read it concurrently from another. An atomic variable is required in this case to avoid undefined behaviour.

For example, on an architecture with a weaker memory model, lace_suspend could return prematurely because the current thread might still observe a stale value of zero for workers_running from its last read, even when other threads have already incremented the variable. This is because non-atomic memory accesses convey no consistency guarantees.

So must_suspend and workers_running must be atomic variables, e.g. of type atomic_int instead of volatile int. These are likely the only changes required in the context of the lace_suspend function. In other cases you'll also have to replace GNU-style __sync_... calls with their C11 counterparts.
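A sketch of what that could look like, assuming relaxed ordering is enough for the reads in question; the helper function names are made up for illustration, and since the GNU builtin __sync_fetch_and_add is a full barrier, its closest C11 counterpart is the seq_cst form:

```c
#include <stdatomic.h>

atomic_int must_suspend;     /* was: volatile int must_suspend; */
atomic_int workers_running;  /* was: volatile int workers_running; */

/* hypothetical helper: spin while suspension is requested */
void wait_while_suspended(void)
{
    /* the load cannot be optimized away or hoisted out of the loop,
     * yet relaxed ordering adds no fence on x86-64 */
    while (atomic_load_explicit(&must_suspend, memory_order_relaxed) != 0)
        ;
}

/* hypothetical helper: a GNU builtin such as
 *   __sync_fetch_and_add(&workers_running, 1);
 * (a full barrier) becomes the seq_cst C11 form */
void note_worker_running(void)
{
    atomic_fetch_add_explicit(&workers_running, 1, memory_order_seq_cst);
}
```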

Note also that both C11 threads and pthreads provide barrier types, so I don't really see why you felt the need to implement your own.
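For reference, a minimal sketch of the POSIX barrier API (pthread_barrier_t); the worker function and the thread count here are placeholders:

```c
#include <pthread.h>
#include <stddef.h>

#define NWORKERS 4   /* placeholder thread count */

static pthread_barrier_t barrier;

static void *worker(void *arg)
{
    (void)arg;
    /* ... per-thread setup ... */
    pthread_barrier_wait(&barrier);   /* all NWORKERS threads rendezvous here */
    /* ... parallel phase ... */
    return NULL;
}

int main(void)
{
    pthread_t threads[NWORKERS];
    pthread_barrier_init(&barrier, NULL, NWORKERS);
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(&threads[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}
```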

@trolando
Owner

@clausecker The whole Lace part comes from a different project, https://github.com/trolando/lace, which I worked on 10 years ago. The idea was to provide a fine-grained fork-join-style parallel framework, like Cilk: a fixed number of threads executes many small tasks with minimum overhead. I "felt the need to implement my own barrier type" back then to get the best possible performance on the platforms I was running my software on, i.e., consumer hardware and 64-CPU x86-64 machines. Lace indeed offers some of the best performance available, if not the best, when fine-grained fork-join-style parallelism is what you're looking for.

Since using normal writes/reads on must_suspend worked on the platform I develop for, I haven't been too concerned with technically undefined behavior. I would be fine with using atomics here as long as this has the same result on x86-64 machines, i.e., normal reads, no memory fences.

I don't see barrier types provided by C11 threads; please clarify. I could first try to fix Lace to ditch POSIX and rely only on C11-provided primitives, hopefully without deteriorating performance... If that works for Lace and the benchmarks are stable, I could then just copy the new Lace header/source files to Oink.
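Regarding the x86-64 requirement: a relaxed atomic load is the usual way to get exactly that. With gcc or clang on x86-64, a relaxed load of an aligned int typically compiles to an ordinary mov with no fence, while still forcing the compiler to actually perform the read. A minimal sketch:

```c
#include <stdatomic.h>

atomic_int must_suspend;

int read_must_suspend(void)
{
    /* on x86-64 with gcc/clang this is typically a single
     *   mov eax, DWORD PTR must_suspend[rip]
     * -- a normal read, no fence -- but the compiler must still
     * perform the read and cannot hoist it out of a loop */
    return atomic_load_explicit(&must_suspend, memory_order_relaxed);
}
```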

@trolando
Owner

So far I have partially modified Lace to use atomics instead of volatile variables, but I find that the generated code is less optimized. For example, Lace has a tailsplit union consisting of two ints. The two ints are read atomically in one access by the lace_steal method (called by thieves), but with C11 atomics we now miss a compiler optimization. Previously, the compiler could load the tailsplit into %rax, copy %rax to %rbx, shift one of the registers, and then perform the comparison "tail < split" by comparing %eax and %ebx, which is what I'm after. With atomics, the compiler no longer sees this optimization; instead it stores %rax on the program stack and compares via the stack. As a result, the lace_steal method now has extra generated code to adjust %rsp, which is unnecessary overhead. I have not yet found a way to fix this.

@clausecker
Author

> So far I have partially modified Lace to use atomics instead of volatile variables, but I find that the generated code is less optimized. For example, Lace has a tailsplit union consisting of two ints. The two ints are read atomically in one access by the lace_steal method (called by thieves), but with C11 atomics we now miss a compiler optimization. Previously, the compiler could load the tailsplit into %rax, copy %rax to %rbx, shift one of the registers, and then perform the comparison "tail < split" by comparing %eax and %ebx, which is what I'm after.

Might be possible to use a union here.
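A possible shape for that suggestion; the type layout, field names, and memory order here are assumptions rather than Lace's actual code. The idea is to keep the shared word as a single atomic 64-bit integer, load it once, and reinterpret the loaded value through a plain, non-atomic local union so the tail/split comparison can stay in registers.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* hypothetical layout, for illustration only */
typedef union {
    uint64_t v;
    struct { uint32_t tail; uint32_t split; } ts;
} TailSplit;

typedef struct {
    _Atomic uint64_t tailsplit;   /* the shared word, kept atomic */
} Worker;

bool steal_possible(Worker *w)
{
    TailSplit local;
    /* one atomic load of the whole 64-bit word (the memory order needed
     * depends on the stealing protocol; relaxed shown here) */
    local.v = atomic_load_explicit(&w->tailsplit, memory_order_relaxed);
    /* the comparison runs on a plain local copy, which the compiler
     * is free to keep entirely in registers */
    return local.ts.tail < local.ts.split;
}
```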

@trolando
Owner

It already is a union; it's just that for some reason the compiler now decides to compute the result via the stack instead of keeping it in registers :(
