Native SynchronizedObject implementation allocates too much #412

Open
haitaka opened this issue Mar 21, 2024 · 2 comments
haitaka commented Mar 21, 2024

The native SynchronizedObject implementation allocates an instance of LockState on every lock() and unlock() call.

I've compared the performance of SynchronizedObject against a simple non-allocating spin-lock in the uncontended case. The benchmarks can be found here. In my case, the reported metrics were as follows (greater values are better):

macosArm64 summary:
Benchmark                                Mode  Cnt         Score        Error    Units
AtomicFUBenchmark.benchmarkUncontended  thrpt    5  25905005.328 ±  96180.326  ops/sec
NoalLockBenchmark.benchmarkUncontended  thrpt    5  65159902.404 ± 126124.691  ops/sec

Compose Multiplatform heavily utilizes atomicfu's SynchronizedObject under the hood, with LockState instances being its most frequent source of heap allocations.
Certain Compose Multiplatform benchmarks show a noticeable improvement in missed-frame ratio (about 1/3 better) when the SynchronizedObject implementation is replaced with a non-allocating spin lock.
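
For reference, a minimal sketch of what such a non-allocating spin-lock could look like (assuming atomicfu's atomic primitives; this is an illustration, not the benchmark's exact code):

```kotlin
import kotlinx.atomicfu.atomic

// Minimal non-allocating spin-lock: the entire lock state is one AtomicInt,
// so lock()/unlock() never allocate on the heap (unlike LockState instances).
class SpinLock {
    private val locked = atomic(0)

    fun lock() {
        // Busy-wait until we win the CAS from 0 (free) to 1 (held).
        while (!locked.compareAndSet(0, 1)) { /* spin */ }
    }

    fun unlock() {
        locked.value = 0
    }
}
```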

@qwwdfsad qwwdfsad self-assigned this Mar 21, 2024
haitaka pushed a commit to haitaka/kotlinx-atomicfu that referenced this issue Apr 8, 2024
…tion (Kotlin#412)

    * Pack "thin" lock state in a single ptr-word
    * Spin when "fat" state is required (it's harder to pack in a single word)
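
A hypothetical illustration of the packing idea from that commit message (the field layout and names below are assumptions for illustration, not the commit's actual encoding): a "thin" reentrant state can fit in a single 64-bit word by combining the owner's thread id with a hold count.

```kotlin
// Hypothetical layout: high bits = owner thread id, low 16 bits = hold count.
private const val COUNT_BITS = 16
private const val COUNT_MASK = (1L shl COUNT_BITS) - 1

fun packThin(ownerTid: Long, holdCount: Int): Long =
    (ownerTid shl COUNT_BITS) or (holdCount.toLong() and COUNT_MASK)

fun ownerOf(state: Long): Long = state ushr COUNT_BITS
fun holdCountOf(state: Long): Int = (state and COUNT_MASK).toInt()
```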
qwwdfsad (Contributor) commented:

Our findings so far (copying my internal response FTR):

====

I'll give a short overview of what is wrong with synchronized objects and why it is hard to match what a plain spinlock can achieve:

  • The lock has to be reentrant (and this is a really tough constraint to dance around; e.g. it immediately rules out simple algorithms like Benaphores) — see the reentrant spin-lock sketch after this list
  • The underlying platform (K/N) has no notion of threads or thread parking; the best we can rely on is .def-based POSIX primitives. E.g. even permit-based parking would have to be implemented from scratch, probably using futexes or something similar
  • To switch between a thin reentrant lock and a POSIX lock without permit-based parking, we have to juggle a primitive (the POSIX thread id) that we cannot box and a potentially relocatable object (the POSIX lock), which is a tough thing to do
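
To make the reentrancy constraint concrete, here is a hedged sketch of a reentrant spin-lock that avoids both allocation and pthread_t: thread identity is a process-local counter handed out lazily via @ThreadLocal. All names here are illustrative assumptions, not a proposed implementation:

```kotlin
import kotlinx.atomicfu.atomic
import kotlin.native.concurrent.ThreadLocal

// Process-local thread ids, assigned lazily; avoids relying on pthread_t.
private val nextThreadId = atomic(1L)

@ThreadLocal
private var cachedThreadId = 0L

private fun currentThreadId(): Long {
    if (cachedThreadId == 0L) cachedThreadId = nextThreadId.getAndIncrement()
    return cachedThreadId
}

class ReentrantSpinLock {
    private val owner = atomic(0L) // 0 = unlocked, otherwise the holder's id
    private var holdCount = 0      // only touched by the owning thread

    fun lock() {
        val me = currentThreadId()
        if (owner.value == me) { holdCount++; return } // reentrant acquire
        while (!owner.compareAndSet(0L, me)) { /* spin */ }
        holdCount = 1
    }

    fun unlock() {
        if (--holdCount == 0) owner.value = 0L
    }
}
```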

Options I can see pursuing:

  • Look into each platform's POSIX implementation. Chances are high that on some platforms, calls like pthread_mutex_lock do not enter the kernel when there is no contention (see the sketch after this list)
  • Do some bit-twiddling under hard assumptions to pack "either a reinterpreted pointer to a POSIX lock, or the id of the thread that holds the lock" into one word. An honorable mention goes to the authors of pthread_t, who decided to define it not only as an arbitrary-size integer without any constraints (e.g. it can be negative), but potentially as a struct as well.
  • Do something in between, e.g. abuse @ThreadLocal as the thread-id storage and see whether it yields any decent perf
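
For the first option, a hedged Kotlin/Native interop sketch of wrapping a POSIX mutex (assuming a POSIX target; allocation and cleanup are simplified):

```kotlin
import kotlinx.cinterop.Arena
import kotlinx.cinterop.alloc
import kotlinx.cinterop.ptr
import platform.posix.*

// Sketch only: wraps pthread_mutex_t directly. On many platforms the
// uncontended pthread_mutex_lock/unlock pair stays in user space
// (e.g. the futex fast path on Linux), which is what this option banks on.
@OptIn(kotlinx.cinterop.ExperimentalForeignApi::class)
class PosixMutex {
    private val arena = Arena()
    private val mutex = arena.alloc<pthread_mutex_t>().also {
        pthread_mutex_init(it.ptr, null)
    }

    fun lock() { pthread_mutex_lock(mutex.ptr) }
    fun unlock() { pthread_mutex_unlock(mutex.ptr) }

    fun destroy() {
        pthread_mutex_destroy(mutex.ptr)
        arena.clear()
    }
}
```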

qwwdfsad (Contributor) commented:

So while we cannot reasonably fix this one yet, the outcomes are:

  • Compose folks are advised to use SynchronizedObject more carefully. For their specific benchmark (access to a copy-on-write thread-local map), @ThreadLocal or a spin-lock is recommended
  • The current locks are unstable and have non-trivial performance characteristics. Even though they are undocumented, they are being used, so we should warn users about that
