Synchronization


This page covers the design and plan for improving display server synchronization.

Synchronization is one of the primary display server tasks, but an area where mistakes and imperfections tend to be either dead obvious or very subtle. The more processing power and the higher the refresh rate you have, the more any issues are smoothed over and go undetected. It is deliciously complicated and systemic (the entire processing chain contributes), and failures mostly show up as energy consumption, latency and animation 'smoothness' rather than as something more direct.

TODO

[ ] Tracing
    [x] Engine/Platform/Conductor Instrumentation
    [ ] Shmif / shmif-debugif / (afsrv_) Integration
    [ ] Conversion to native tracy format
    [ ] Trace buffer multiplexer (with power-data)
[ ] Input
    [x] out-of-queue input event processing (Lua)
    [ ] guard-process mmapped evdev/wscons backend
    [ ] memory-mapped cursor samples
[ ] Conductor (Scheduler)
    [x] Basic structure and platform integration
    [x] Synch-target controls/API
    [ ] Threaded client processing
        [ ] Semaphore to futex conversion
    [ ] VRR Slew-rate controls
    [ ] Strategy Tuning
        [ ] Throughput
        [ ] Latency
        [ ] Accuracy
        [ ] Energy Conservative
        [ ] Adaptive
[ ] Explicit Synchronization
    [x] Static shared-memory
    [x] N-buffer shared-memory
    [ ] Floating region shared-memory
    [ ] Dma-buffer fencing
[ ] Deadline-aware processing
    [ ] tui/terminal
    [ ] Networked client
    [x] Memory-mapped deadline forwarding
    [p] Conductor deadline calculation

Tracing

This is an area where it is absolutely crucial to measure the critical stages of the processing pipeline: from an input event being sampled, through how it is filtered, forwarded and consumed, to the produced buffer, sharing synchronization, composition and scan-out - as well as the distribution of frames and animation interpolation.

The tactic is an engine-defined 'no-op' tracing layer that breaks traces down by subsystem with enter/exit/oneshot style tagging and timestamping. These are collected into a ringbuffer of a pre-allocated size and, when full, flushed onwards. It is controlled and initiated from the script level through the benchmark_ set of calls, and the Lua scripts themselves can add to the trace buffer as well.
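
A minimal sketch of what such a layer might look like - the names and sizes here are illustrative, not the actual engine symbols:

    /* illustrative names and sizes, not the actual engine symbols */
    #include <stdint.h>
    #include <stdatomic.h>
    #include <time.h>

    enum trace_kind { TRACE_ENTER, TRACE_EXIT, TRACE_ONESHOT };

    struct trace_entry {
        uint64_t ts_ns;     /* monotonic timestamp */
        uint8_t  subsystem; /* e.g. video, audio, event, conductor */
        uint8_t  kind;      /* enter / exit / oneshot */
        const char* tag;    /* static string describing the trace site */
    };

    static struct trace_entry trace_buf[4096]; /* pre-allocated ringbuffer */
    static _Atomic size_t trace_pos;
    static _Atomic int trace_enabled; /* toggled from the script-level benchmark_ calls */

    static inline void trace_mark(uint8_t subsys, uint8_t kind, const char* tag)
    {
        if (!atomic_load(&trace_enabled))
            return; /* the 'no-op' path when tracing is disabled */

        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);

        size_t pos = atomic_fetch_add(&trace_pos, 1) % 4096;
        trace_buf[pos] = (struct trace_entry){
            .ts_ns = (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec,
            .subsystem = subsys, .kind = kind, .tag = tag
        };
        /* when the buffer wraps, a real implementation would flush it onwards */
    }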

The same code is linked into shmif and, as seen in the TODO list, will be added to collect client-side traces as well, ideally into a descriptor provided through the debugif so that there is a sandbox-friendly collection path that multiplexes with the engine collection.

Conductor

Akin to a symphony orchestra, all synchronization is merged into a conductor that monitors ongoing internal jobs, external outputs and external inputs. It provides an abstract set of strategies (targets to prioritize) and coordinates the locking/unlocking of synchronization primitives to fit with this. It is also responsible for communicating its upcoming deadlines so that producers can adjust based on when contents will actually be presented rather than on a naive 'first come, first served' basis.
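
A rough sketch of the idea in C; the strategy names and fields are illustrative, not the actual conductor interface:

    /* illustrative sketch, not the actual conductor interface */
    #include <stdint.h>

    enum synch_strategy {
        SYNCH_THROUGHPUT, /* maximize frames delivered */
        SYNCH_LATENCY,    /* minimize input-to-photon time */
        SYNCH_ACCURACY,   /* match presentation time to the content clock */
        SYNCH_POWERSAVE,  /* batch work, accept added latency */
        SYNCH_ADAPTIVE    /* switch based on focus target behaviour */
    };

    struct conductor {
        enum synch_strategy strategy;
        uint64_t next_scanout_ns; /* estimated time of the next display scan-out */
        uint64_t compose_cost_ns; /* measured cost of composition and submission */
    };

    /* the deadline a producer has to meet for its frame to make the next scan-out,
       which is what gets forwarded to clients instead of a 'first served' wakeup */
    static uint64_t conductor_deadline(struct conductor* c)
    {
        return c->next_scanout_ns - c->compose_cost_ns;
    }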

It is assisted by the scripting layer indicating which jobs currently have input focus so that they can be treated differently from 'the herd', avoiding bursts of activity that might stall the focused job and have it miss its deadline. Ideally, future controls will also let it re-assign jobs between GPUs, but that hinges on the state of multi-GPU support advancing.

Variable Rate Display

A long-standing tradition has been to think of output displays as having a fixed refresh rate, with networked displays as one big exception. The taint from a fixed refresh rate model reaches quite far into games, which tend to hard-code deadlines rather than separate a logic clock from the display one. Hacky work-arounds have existed for a long time in the form of accepting tearing, within a certain tolerance, when a deadline is missed.

During the last few years, implementations that have some wiggle room in the active and targeted refresh rate have been shown, as well as displays with high refresh rates. There are always caveats. The primary one is the distinction between low-persistence and high-persistence displays, and maintaining coherent luminance between frames when timing varies. The secondary one is that if your animation system, in both composition and client content generation, is competent enough to 'sample' the respective worlds rather than run off a nominal clock, there are QoS tradeoffs to consider, as not all transforms are 'worthy' of 240Hz silky-smoothness considering their cost.

This task will again fall on the conductor, with the active strategy and focus target as its main inputs. The strength of the feature will vary with all the others: the implementation needs good tracing for verification/validation; clients need deadline estimation and presentation-time signalling; composition needs explicit synchronization; backpressure management needs client estimates for load-shedding and so on.

Especially interesting testing targets here are, again, emulators, as they are bound by the refresh rate of the system being emulated - and many of these have rates that are not evenly divisible by that of the target output display.
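
As a made-up but representative example of the arithmetic involved: an emulated system at 60.1Hz presented on a fixed 60Hz display drifts by roughly 1.7ms per second, so a frame has to be dropped (or the content clock slewed) about every ten seconds:

    #include <stdio.h>

    /* toy numbers: drift between an emulated clock and a fixed-rate display */
    int main(void)
    {
        double emu_hz = 60.1; /* emulated system refresh rate */
        double out_hz = 60.0; /* output display refresh rate */

        /* emulator produces 0.1 extra frames per second worth of content */
        double drift_per_sec_ms = (emu_hz - out_hz) * (1000.0 / out_hz);

        /* after one full display period of drift, a frame must be dropped
           (or duplicated if the emulated rate is the slower one) */
        double sec_per_dropped_frame = (1000.0 / out_hz) / drift_per_sec_ms;

        printf("drift: %.2f ms/s, one frame drop or slew every %.1f s\n",
               drift_per_sec_ms, sec_per_dropped_frame);
        return 0;
    }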

Explicit Client Synchronization

Arcan-SHMIF is 'mostly' asynchronous by default. There are two grand exceptions. One is 'resize' operations, as they fundamentally change the rules and semantics for everything else; the rule of thumb is that renegotiating resources should carry a noticeable client cost. The other is shared memory A/V buffer transfers, as they are used as a pressure valve of sorts.

Clients that communicate exclusively using shared memory have used explicit synchronization as the default, since it is quite easy: arcan_shmif_signal(cont, SHMIF_SIGVID) and the buffer will be released when the other end so decides. If slightly better performance is needed, the client can be more wasteful and resize to a format with another back buffer (double-buffered mode) and - as long as the mandated limit isn't exceeded - multiple more buffers where only the 'latest' will get used.

Clients can, furthermore, supply audio buffers that clock to the video buffer in order to help assure A/V synch.
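
A minimal client illustrating the blocking shared-memory path (assuming a standard shmif client build; error handling omitted):

    #include <arcan_shmif.h>

    int main(int argc, char** argv)
    {
        struct arg_arr* args;
        struct arcan_shmif_cont cont =
            arcan_shmif_open(SEGID_APPLICATION, SHMIF_ACQUIRE_FATALFAIL, &args);

        /* fill the video buffer with a solid colour */
        for (size_t y = 0; y < cont.h; y++)
            for (size_t x = 0; x < cont.w; x++)
                cont.vidp[y * cont.pitch + x] = SHMIF_RGBA(0x00, 0x80, 0x00, 0xff);

        /* explicit synchronization: returns when the server releases the buffer;
           SHMIF_SIGAUD can be OR:ed in to clock audio against the same frame */
        arcan_shmif_signal(&cont, SHMIF_SIGVID);

        arcan_shmif_drop(&cont);
        return 0;
    }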

For GPU workloads, this is not as easy, as the workloads are inherently asynchronous. When a handle is extracted that represents the client output, the server might need to implicitly wait for the GPU to finish. This opens up for priority inversion and DoS (were it not for our watchdog and the feedback path to reject GPU work from untrusted clients).

When the DMA_BUF_IOCTL_EXPORT_SYNC_FILE work is finished in the upstream kernel/graphics stack, we can extract a handle from the client workload that can be multiplexed with the other waiting in the conductor, and hang on to the previous buffer / results until the new one is ready to be consumed.
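
A sketch of what that extraction could look like once the ioctl is generally available (assuming a dma-buf descriptor already imported from the client; not something the engine does today):

    #include <linux/dma-buf.h>
    #include <sys/ioctl.h>

    /* returns a pollable fence fd covering pending GPU writes on the client
       buffer, or -1 if the kernel/driver does not support the ioctl */
    static int export_fence(int dmabuf_fd)
    {
        struct dma_buf_export_sync_file req = {
            .flags = DMA_BUF_SYNC_READ, /* we only want to read the results */
            .fd = -1
        };
        if (-1 == ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &req))
            return -1;
        return req.fd;
    }

    /* the conductor can then multiplex this fd with its other waits (POLLIN
       fires when the fence signals) and keep presenting the previous buffer
       until the new one is ready */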

There are other possibilities, such as bumping the GL version and so on, but mixing these abstractions is the inferior choice compared to simply being able to poll the state of a resource we have imported.

Client Content Synchronization

A lot of clients that have complex animations or data-dependent deadlines (such as video) can leverage a distinction between 'transmission time' and 'presentation time'. This is common in audio/video, but less so in graphics and games - there it has been more important that the steps between frames match some interpolation function. The exception is emulators, which have a rare tactical advantage in being able to speed up and roll back emulation.

With VR the rules have changed somewhat, as reliable, predictable and low-latency frame delivery is especially important, and the better this becomes, the more advanced techniques can be used. The solution herein is simply to let the conductor continuously update a deadline timestamp in the shared memory control section of each client, giving the client high-accuracy feedback about when something will be presented and by when a frame has to be provided for that presentation time to be reliable.

This can then simply be skewed or extended for testing, VRR and networking contexts.
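
A sketch of how a client could pace itself against such a forwarded deadline; the structure and field names are illustrative, not the actual shmif control-page layout:

    #include <stdint.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <time.h>

    /* illustrative: deadline values the server keeps updated in shared memory */
    struct shared_deadline {
        _Atomic uint64_t deadline_ns;     /* latest time a frame can be submitted */
        _Atomic uint64_t presentation_ns; /* when that frame will hit the display */
    };

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    /* decide whether there is still time to render this frame, given a measured
       estimate of how long the client usually takes to produce one */
    static bool should_render(struct shared_deadline* d, uint64_t render_cost_ns)
    {
        return now_ns() + render_cost_ns <= atomic_load(&d->deadline_ns);
    }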

Specific OS Considerations

Some operating system kernels, e.g. OpenBSD, run at a certain configurable tick rate. This means that descriptor polling timeouts and timed sleeps are restricted to a certain clock for updates, which can be as coarse as 10ms - where the alternatives tend to be a mix of more threads and event-triggered releases.

This means that the actual rate needs to be probed or provided so that the scheduling jitter can be considered by the conductor, and not erroneously attributed to the cost of producing a command buffer or producing a frame.
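
A BSD-oriented sketch of probing that rate via the kern.clockrate sysctl, with a crude measurement fallback (the function name is just for illustration):

    #include <sys/types.h>
    #include <sys/time.h>
    #include <sys/sysctl.h>
    #include <time.h>

    /* probe the kernel tick length (microseconds) so the conductor can separate
       scheduling jitter from real frame / command-buffer cost */
    static long probe_tick_us(void)
    {
    #ifdef KERN_CLOCKRATE
        int mib[2] = { CTL_KERN, KERN_CLOCKRATE };
        struct clockinfo ci;
        size_t len = sizeof(ci);
        if (0 == sysctl(mib, 2, &ci, &len, NULL, 0))
            return ci.tick; /* e.g. 10000 us on a HZ=100 kernel */
    #endif
        /* fallback: measure how much a nominally 1 ms sleep overshoots */
        struct timespec req = { .tv_sec = 0, .tv_nsec = 1000000 }, a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        nanosleep(&req, NULL);
        clock_gettime(CLOCK_MONOTONIC, &b);
        return (b.tv_sec - a.tv_sec) * 1000000L + (b.tv_nsec - a.tv_nsec) / 1000L;
    }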