Issues with multithreaded code and CPU dispatching. #65

alexey-milovidov · 2019-12-25T22:47:50Z

Suppose we are calling base64_encode or base64_decode in a loop (for different inputs) and doing it from multiple threads (for different data).

1. If we pass non-zero flags to these routines, every call writes to a single global variable in the codec_choose_forced function. Repeated writes from many threads lead to "false sharing" and poor scalability.
2. There is no dedicated method to pre-initialize the choice of codec. (Actually, there is: we can simply call one of the encode/decode routines in advance with empty input, but it looks silly.) If we don't do that and run our code under ThreadSanitizer, it will report a data race on the codec function pointers. In practice it is safe, because each one is a single pointer: a single machine word that is (supposedly) placed at an aligned memory location. But we have to annotate it as _Atomic and store/load it with memory_order_relaxed. See the similar issue: "Make dynamic dispatch free of TSan warnings", simdjson/simdjson#256.
3. Suppose we use these routines in a loop for short inputs. They have a branch to check whether the encoders/decoders were initialized. We want to move these branches out of the loop: check the CPU once and call the specialized implementation directly. But the architecture-specific methods are not exported, so we cannot do that. We also have to pay for two non-inlined function calls.
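
To illustrate the second and third issues above, here is a minimal sketch of a dispatch pointer stored as a C11 `_Atomic` and accessed with `memory_order_relaxed`, with the lookup hoisted out of the loop. All names here (`encode_fn`, `encode_plain`, `detect_encoder`, `get_encoder`, `encode_many`) are hypothetical and are not part of the library's API; the "encoder" just copies bytes so that the example stays self-contained.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical encoder signature; not the library's real internal type. */
typedef size_t (*encode_fn)(const char *src, size_t len, char *out);

/* Stand-in for a real codec: copies the input unchanged. */
static size_t encode_plain(const char *src, size_t len, char *out)
{
    for (size_t i = 0; i < len; i++)
        out[i] = src[i];
    return len;
}

/* Stand-in for CPU feature detection; always picks the fallback here. */
static encode_fn detect_encoder(void)
{
    return encode_plain;
}

/* Static storage, so the pointer starts out as NULL. */
static _Atomic(encode_fn) g_encoder;

/* Relaxed ordering is enough under the assumption that the pointer is the
 * only thing being published: the target is code that exists before any
 * thread starts. Threads may race to run detection, but they all store the
 * same value, and ThreadSanitizer sees atomic accesses instead of a race. */
static encode_fn get_encoder(void)
{
    encode_fn fn = atomic_load_explicit(&g_encoder, memory_order_relaxed);
    if (fn == NULL) {
        fn = detect_encoder();
        atomic_store_explicit(&g_encoder, fn, memory_order_relaxed);
    }
    return fn;
}

/* Resolve the codec once, then call it directly for every short input. */
size_t encode_many(const char *const *inputs, const size_t *lens,
                   size_t n, char *out)
{
    encode_fn fn = get_encoder();
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += fn(inputs[i], lens[i], out + total);
    return total;
}
```

With this shape the initialization branch runs once per batch rather than once per input. It only works if the caller can get hold of the resolved function pointer, which is exactly what the library does not currently export.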

All these issues were found while integrating this library into ClickHouse: ClickHouse/ClickHouse#8397


aklomp · 2019-12-29T23:03:20Z

Hi, thanks for raising this issue. Always interesting to get some perspective from a library user. I'll address your points specifically below, but first I want to clarify the design philosophy behind this library. It runs a bit counter to the way you intend to use it.

The concept behind this library is "compile once, run anywhere". It was intended to be compiled for all the SIMD architectures that the compiler will support, and then at runtime, use CPU feature detection to decide which codecs to actually use. This should make the library portable across all similar architectures, and make it distributable as part of a package system or binary distribution. This benefit trickles down to the user if they also distribute their software through a centralized repo.

The optimization focus in this library is not multithreaded, but multicore. It supports OpenMP, which greatly accelerates encoding and decoding of large pieces of data. (But I think you raise a valid point here. The library should assume that it can be called from multiple threads, and not share common state.)

Are these good design principles? Perhaps not, and I'm personally a bit unhappy with the early decision to implement my own CPU dispatcher, but that's the way things currently stand. I do have some future plans to allow the user to do their own CPU dispatching and call architecture-specific codecs directly, but they won't be finished in the short term.

There is indeed a user flags option to force a codec, but this is a bit of a misfeature. The flags are mainly there as a testing tool so that I can run a specific individual codec for testing and benchmarking. The idea was that end users should not be using the flags, they should be using runtime CPU detection. I recognize that it's not the best choice in hindsight, and intend to fix it, but here we are.

As to the specific points:

1. Indeed, if you are passing flags to the codecs, they will be forced to re-evaluate the current codec. I suppose this could be fixed by storing the previous flags value in a static variable and seeing if it changed. The functions always need to check the flags on the off chance that the user changed them, so the branch can't be avoided. Unless the library gets a separate init function, which I tried to avoid because it's not ergonomic. (Maybe there's a way to get the best of both worlds with some sort of optional init function?)

2. If it's silly but it works, it's not silly. The library does not have a separate init method, because it's not needed for the nominal use case. Codecs are initialized on the first run. For the multithreaded case, I think it should be enough to change this line to:

static __thread struct codec codec = { NULL, NULL };

This will make codec thread-local. You can apply this trick for point 1 too if you want to get rid of global state. I'll think about adding it to the library directly, if there's a portable way to do so.

3. Indeed, you'll need to patch the library to make that work. It's not optimized for super short inputs and outputs. You'd probably be better off with a simple table-based approach in those cases anyway. (Btw, I just resolved "Generic encoders: use 12-bit lookup table" #64, which speeds up the encoder for short inputs.)
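
A minimal sketch of the two ideas above combined: remember the previous flags value and keep the codec in thread-local storage, so that no thread writes to shared state. It uses the C11 `_Thread_local` keyword, which is the standard spelling of the GCC-specific `__thread`. The fields of `struct codec`, the helpers `copy_bytes`, `codec_select`, `codec_get` and `encode`, and the counter `tl_selections` are hypothetical stand-ins for illustration, not the library's actual code.

```c
#include <stddef.h>

/* Hypothetical codec table; the library's real struct codec differs. */
struct codec {
    size_t (*enc)(const char *src, size_t len, char *out);
    size_t (*dec)(const char *src, size_t len, char *out);
};

/* Stand-in for a real encoder/decoder: copies the input unchanged. */
static size_t copy_bytes(const char *src, size_t len, char *out)
{
    for (size_t i = 0; i < len; i++)
        out[i] = src[i];
    return len;
}

/* Stand-in for codec selection; a real version would inspect the flags
 * and the CPU features. */
static void codec_select(struct codec *c, int flags)
{
    (void)flags;
    c->enc = copy_bytes;
    c->dec = copy_bytes;
}

/* Every thread gets its own copy of these variables. */
static _Thread_local struct codec tl_codec = { NULL, NULL };
static _Thread_local int tl_flags = 0;
static _Thread_local unsigned tl_selections = 0; /* demo only: counts re-selections */

/* Re-select only on the first call in this thread or when the flags change. */
static const struct codec *codec_get(int flags)
{
    if (tl_codec.enc == NULL || flags != tl_flags) {
        codec_select(&tl_codec, flags);
        tl_flags = flags;
        tl_selections++;
    }
    return &tl_codec;
}

size_t encode(const char *src, size_t len, char *out, int flags)
{
    return codec_get(flags)->enc(src, len, out);
}
```

The branch on every call remains, but in the steady state it only reads thread-local data, so repeated calls with the same flags no longer write to a variable shared between threads. The cost is that each thread performs its own selection once.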

I think this library could be made to fit your use-case, but as you see there are some different tradeoffs in play which would require either some redesign on my end (slow, unfortunately), or patching from your end. I'm open to suggestions on how and where to improve.

alexey-milovidov · 2020-01-18T19:02:55Z

Thank you for the answer! The question is resolved.

sarkanyi · 2021-01-05T13:39:17Z

To be honest, I wouldn't mind if you got rid of the mutex and used thread-local storage for the codec. In an HPC environment you can have 200 cores doing the same thing, which in this case would cause a huge perf hit because of false sharing, and I'm not keen on maintaining a separate fork. This is also why I hope that you'll eventually merge the CMake build. We'll probably use it as a library, and that is one other thing I'm not keen on maintaining :).

aklomp mentioned this issue Jun 6, 2022

Cleanup round (no functional inpact) before releasing v0.4.0 #55

Closed
