Should use C11 atomics #17

Open · clausecker opened this issue Mar 12, 2022 · 8 comments

@clausecker

The project currently uses haphazardly hand-crafted atomics built on volatile variables. This is not guaranteed to work correctly and is not portable. Consider migrating to C11 atomics if possible.
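For illustration only, a minimal sketch of what such a migration looks like at the declaration level, using must_suspend as one of the affected variables; the accessor functions here are hypothetical, not the project's code:

```c
#include <stdatomic.h>

/* before: volatile int must_suspend;  -- no atomicity or ordering guarantees */
atomic_int must_suspend;

/* hypothetical accessors, purely for illustration */
void request_suspend(void)
{
    atomic_store(&must_suspend, 1);    /* sequentially consistent store */
}

int suspend_requested(void)
{
    return atomic_load(&must_suspend); /* sequentially consistent load */
}
```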

@trolando
Owner

Hi @clausecker, while migrating to C11 atomics is probably a good idea for portability, I do not yet understand why volatile is a problem. The reason I use volatile here is to force the compiler to emit a read operation instead of optimizing the read away. If I were to migrate to C11 atomics, why couldn't I keep using volatile for this under C11 semantics?

@clausecker
Author

The volatile qualifier implies neither atomicity nor any special memory-ordering constraints. Its purpose is to model memory-mapped peripherals or to serve as a debugging aid, e.g. permitting variables to be changed from a debugger. This means:

  • the C compiler is permitted to split a single write into two or more accesses if needed
  • the compiler is not required to use the special atomic store instructions that some architectures need
  • no memory ordering is implied
  • the compiler can freely move reads and writes of other variables around a volatile access, potentially breaking the intended semantics (see the sketch below)

The volatile qualifier is just not the right qualifier for atomic variables and may blow up in your face on architectures with more lenient memory models or with certain compiler optimisations. Race conditions emerging from this are very difficult to debug.
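A minimal sketch of that last point, using a hypothetical data/ready pair rather than anything from this project: with release/acquire atomics the compiler and the hardware must keep the data write ahead of the flag, whereas volatile gives no such guarantee.

```c
#include <stdatomic.h>

int data;           /* ordinary shared data */
atomic_int ready;   /* flag guarding it */

void producer(void)
{
    data = 42;
    /* release store: the write to data cannot be moved past this point */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

int consumer(void)
{
    /* acquire load: if the flag is seen, the write to data is seen too */
    if (atomic_load_explicit(&ready, memory_order_acquire))
        return data;    /* guaranteed to read 42 */
    return -1;
}
```

Had `ready` been a plain `volatile int`, neither the compiler nor a weakly ordered CPU would be obliged to keep those two accesses in order.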

> The reason I use volatile here is to force the compiler to emit a read operation instead of optimizing the read away.

You are also using volatile to implement atomic variables, e.g. with the barrier_t type and the must_suspend variable. In fact, all uses of volatile I have found in your code try to implement atomic variables.

@trolando
Owner

With the barrier_t type, this could probably be implemented using C11 atomic types, although I'd have to read up on them first. However, must_suspend is intended to be read non-atomically. How do you propose implementing that with C11 atomics while preventing the compiler from optimizing away the reads?

@clausecker
Author

@trolando I don't see how must_suspend is a non-atomic access. You write to the variable from one thread and read it concurrently from another. An atomic variable is required in this case to avoid undefined behaviour.

For example, on an architecture with a weaker memory model, lace_suspend could return prematurely because the current thread might still observe a stale value of zero for workers_running from its last read, even when other threads have already incremented the variable. This is because non-atomic memory accesses convey no consistency guarantees.

So must_suspend and workers_running must be atomic variables, e.g. of type atomic_int instead of volatile int. These are likely the only changes required in the context of the lace_suspend function. In other cases you'll also have to replace GNU-style __sync_... calls with their C11 counterparts.
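A sketch of what that could look like, assuming relaxed ordering is enough for the reads in question; the helper function names are made up for illustration, and since the GNU builtin __sync_fetch_and_add is a full barrier, its closest C11 counterpart is the seq_cst form:

```c
#include <stdatomic.h>

atomic_int must_suspend;     /* was: volatile int must_suspend; */
atomic_int workers_running;  /* was: volatile int workers_running; */

/* hypothetical helper: spin while suspension is requested */
void wait_while_suspended(void)
{
    /* the load cannot be optimized away or hoisted out of the loop,
     * yet relaxed ordering adds no fence on x86-64 */
    while (atomic_load_explicit(&must_suspend, memory_order_relaxed) != 0)
        ;
}

/* hypothetical helper: a GNU builtin such as
 *   __sync_fetch_and_add(&workers_running, 1);
 * (a full barrier) becomes the seq_cst C11 form */
void note_worker_running(void)
{
    atomic_fetch_add_explicit(&workers_running, 1, memory_order_seq_cst);
}
```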

Note also that both C11 threads and pthreads provide barrier types, so I don't really see why you felt the need to implement your own.
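For reference, a minimal sketch of the POSIX barrier API (pthread_barrier_t); the worker function and the thread count here are placeholders:

```c
#include <pthread.h>
#include <stddef.h>

#define NWORKERS 4   /* placeholder thread count */

static pthread_barrier_t barrier;

static void *worker(void *arg)
{
    (void)arg;
    /* ... per-thread setup ... */
    pthread_barrier_wait(&barrier);   /* all NWORKERS threads rendezvous here */
    /* ... parallel phase ... */
    return NULL;
}

int main(void)
{
    pthread_t threads[NWORKERS];
    pthread_barrier_init(&barrier, NULL, NWORKERS);
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(&threads[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}
```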

@trolando
Owner

@clausecker The whole Lace part comes from a different project, https://github.com/trolando/lace, which I worked on 10 years ago. The idea was to provide a fine-grained fork-join-style parallel framework, like Cilk: a fixed number of threads executes many small tasks with minimum overhead. I "felt the need to implement my own barrier type" back then to get the best possible performance on the platforms I was running my software on, i.e., consumer hardware and 64-CPU x86-64 machines. Lace indeed offers some of the best performance available, if not the best, when fine-grained fork-join-style parallelism is what you're looking for.

Since using normal writes/reads on must_suspend worked on the platform I develop for, I haven't been too concerned with technically undefined behavior. I would be fine with using atomics here as long as this has the same result on x86-64 machines, i.e., normal reads, no memory fences.

I don't see barrier types provided by C11 threads; please clarify. I could first try to fix Lace to ditch POSIX and rely only on C11-provided primitives, hopefully without deteriorating performance... If that works for Lace and the benchmarks are stable, I could then just copy the new Lace header/source files to Oink.
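Regarding the x86-64 requirement: a relaxed atomic load is the usual way to get exactly that. With gcc or clang on x86-64, a relaxed load of an aligned int typically compiles to an ordinary mov with no fence, while still forcing the compiler to actually perform the read. A minimal sketch:

```c
#include <stdatomic.h>

atomic_int must_suspend;

int read_must_suspend(void)
{
    /* on x86-64 with gcc/clang this is typically a single
     *   mov eax, DWORD PTR must_suspend[rip]
     * -- a normal read, no fence -- but the compiler must still
     * perform the read and cannot hoist it out of a loop */
    return atomic_load_explicit(&must_suspend, memory_order_relaxed);
}
```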

@trolando
Owner

So far I have partially modified Lace to use atomics instead of volatile variables, but I find that the generated code is less optimized. For example, Lace has a tailsplit union consisting of two ints. The two ints are read atomically in one access by the lace_steal method (called by thieves), but with C11 atomics we now miss a compiler optimization. Previously, the compiler could load the tailsplit into %rax, copy %rax to %rbx, shift one of the registers, and then perform the comparison "tail < split" by comparing %eax and %ebx, which is what I'm after. With atomics, the compiler no longer sees this optimization; instead it stores %rax on the program stack and compares via the stack. As a result, the lace_steal method now has extra generated code to adjust %rsp, which is unnecessary overhead. I have not yet found a way to fix this.

@clausecker
Author

> So far I have partially modified Lace to use atomics instead of volatile variables, but I find that the generated code is less optimized. For example, Lace has a tailsplit union consisting of two ints. The two ints are read atomically in one access by the lace_steal method (called by thieves), but with C11 atomics we now miss a compiler optimization. Previously, the compiler could load the tailsplit into %rax, copy %rax to %rbx, shift one of the registers, and then perform the comparison "tail < split" by comparing %eax and %ebx, which is what I'm after.

Might be possible to use a union here.
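A possible shape for that suggestion; the type layout, field names, and memory order here are assumptions rather than Lace's actual code. The idea is to keep the shared word as a single atomic 64-bit integer, load it once, and reinterpret the loaded value through a plain, non-atomic local union so the tail/split comparison can stay in registers.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* hypothetical layout, for illustration only */
typedef union {
    uint64_t v;
    struct { uint32_t tail; uint32_t split; } ts;
} TailSplit;

typedef struct {
    _Atomic uint64_t tailsplit;   /* the shared word, kept atomic */
} Worker;

bool steal_possible(Worker *w)
{
    TailSplit local;
    /* one atomic load of the whole 64-bit word (the memory order needed
     * depends on the stealing protocol; relaxed shown here) */
    local.v = atomic_load_explicit(&w->tailsplit, memory_order_relaxed);
    /* the comparison runs on a plain local copy, which the compiler
     * is free to keep entirely in registers */
    return local.ts.tail < local.ts.split;
}
```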

@trolando
Owner

It already is a union; it's just that for some reason the compiler now decides to compute the result via the stack instead of keeping it in registers :(
