Implement improved VSYNC estimator from Blur Busters' open source #1145

Open
mdrejhon opened this issue Jul 13, 2023 · 4 comments


@mdrejhon

mdrejhon commented Jul 13, 2023

Paging @TomHarte - Time to upgrade/replace your VSYNC estimator:

class VSyncPredictor {

We just released this today under Apache 2.0. A little TestUFO magic sauce, too.
https://github.com/blurbusters/RefreshRateCalculator
Our algorithm is superior, since it filters jitter better + ignores missed vsyncs (to avoid polluting the data).

Or alternatively, ad8e's cross platform VSYNC estimator algorithm.
https://github.com/ad8e/vsync_blurbusters/blob/main/vsync.cpp

@mdrejhon
Author

Here's the README.md:

RefreshRateCalculator CLASS

PURPOSE: Accurate cross-platform display refresh rate estimator / dejittered VSYNC timestamp estimator.

  • Input: Series of frame timestamps during framerate=Hz (Jittery/lossy)
  • Output: Accurate filtered and dejittered floating-point Hz estimate & refresh cycle timestamps.
  • Algorithm: Combination of frame counting, jitter filtering, ignoring missed frames, and averaging.
  1. This is also a way to measure a GPU clock source indirectly, since the GPU generates the refresh rate during fixed Hz.
  2. IMPORTANT VRR NOTE: This algorithm does not generate a GPU clock source when running this on a variable refresh rate display
    (e.g. GSYNC/FreeSync), but can still measure the foreground software application's fixed-framerate operation during
    windowed-VRR-enabled operation, such as desktop compositor (e.g. DWM). This can allow a background application
    to match the frame rate of the desktop compositor or foreground application (e.g. 60fps capped app on VRR display).
    This algorithm currently degrades severely during varying-framerate operation on a VRR display.
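As a rough illustration of the "frame counting, jitter filtering, missed frames, averaging" combination described above, here is a minimal JavaScript sketch. It is not the RefreshRateCalculator code: the function name `estimateRefreshRate` and the median-based approach are assumptions made for this example, and it counts a missed frame as extra refresh cycles rather than discarding the sample.

```javascript
// Hypothetical sketch, NOT the RefreshRateCalculator implementation.
// Estimates Hz from jittery frame timestamps (in milliseconds) by:
//   1. taking the median interval as a rough refresh period,
//   2. counting how many refresh cycles each interval spans
//      (an interval with a missed frame spans 2 or more),
//   3. dividing the counted cycles by the total elapsed time,
//      which averages out per-frame jitter.
function estimateRefreshRate(timestampsMs) {
  if (timestampsMs.length < 3) return null; // not enough data yet

  const intervals = [];
  for (let i = 1; i < timestampsMs.length; i++) {
    intervals.push(timestampsMs[i] - timestampsMs[i - 1]);
  }

  // The median is robust against occasional long intervals from dropped frames.
  const sorted = [...intervals].sort((a, b) => a - b);
  const roughPeriod = sorted[Math.floor(sorted.length / 2)];
  if (!(roughPeriod > 0)) return null;

  // An interval of roughly 2x the period means one VSYNC was missed.
  let cycles = 0;
  for (const dt of intervals) {
    cycles += Math.max(1, Math.round(dt / roughPeriod));
  }

  const elapsedMs = timestampsMs[timestampsMs.length - 1] - timestampsMs[0];
  return (1000 * cycles) / elapsedMs;
}
```

The estimate only becomes meaningful once enough frames have been collected, which matches the README's note that the data becomes accurate after a few seconds.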

LICENSE - Apache-2.0

Copyright 2014-2023 by Jerry Jongerius of DuckWare (https://www.duckware.com) - original code and algorithm
Copyright 2017-2023 by Mark Rejhon of Blur Busters / TestUFO (https://www.testufo.com) - refactoring and improvements

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at:

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

*** First publicly released July 2023 under mutual agreement
*** between Rejhon Technologies Inc. (Blur Busters) and Jongerius LLC (DuckWare)
*** PLEASE DO NOT DELETE THIS COPYRIGHT NOTICE

JAVASCRIPT VSYNC API / REFRESH CYCLE TIME STAMPS

CODE PORTING

  • This algorithm is very portable to most languages, on most platforms, via high level and low level graphics frameworks.
  • A generic VSYNC timestamp is usually available immediately after the exit of almost any frame presentation API during VSYNC ON framerate=Hz
  • APIs for timestamps include RDTSC / QueryPerformanceCounter() / std::chrono::high_resolution_clock::now()
  • APIs for low level frame presentation include DirectX Present(), OpenGL glFinish(), Vulkan vkQueuePresentKHR()
  • APIs for high level frame presentation include XBox/MonoGame Draw(), Unity3D Update(), etc.
  • APIs for zero-graphics timestamps (e.g. independent/separate thread) include Windows D3DKMTWaitForVerticalBlankEvent()
  • While not normally used for beam racing, this algorithm is accurate enough for cross-platform raster estimates in beam racing applications, based on a time offset between refresh cycle timestamps! (~1% error vs vertical resolution is possible on modern AMD/NVIDIA GPUs).
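The last point, estimating the raster from a time offset between refresh cycle timestamps, can be sketched as below. This is only an illustration: `estimateScanline` is a hypothetical helper, `totalScanlines` (the vertical total including blanking lines) is a display-specific value you would have to supply, and the sketch assumes scanline 0 coincides with the VSYNC timestamp.

```javascript
// Hypothetical helper: guess the current scanline from dejittered VSYNC data.
// lastVsyncMs and refreshHz would come from a VSYNC estimator.
// totalScanlines is an assumed display-specific vertical total
// (visible lines plus blanking lines).
function estimateScanline(nowMs, lastVsyncMs, refreshHz, totalScanlines) {
  const periodMs = 1000 / refreshHz;
  // Fraction of the current refresh cycle that has elapsed, wrapped into [0, 1).
  let phase = ((nowMs - lastVsyncMs) / periodMs) % 1;
  if (phase < 0) phase += 1; // nowMs earlier than lastVsyncMs
  return Math.floor(phase * totalScanlines);
}
```

Because the phase wraps, the same call keeps working when the last known VSYNC timestamp is several refresh cycles old, as long as the refresh rate estimate is accurate.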

SIMPLE CODE EXAMPLE

var hertz = new RefreshRateCalculator();

[...]

  // Call this inside your full frame rate VSYNC ON frame presentation or your VSYNC listener.
  // It will automatically filter-out the jitter and dropped frames.
  // For JavaScript, most accurate timestamp occurs if called at very top of your requestAnimationFrame() callback.

hertz.countCycle(performance.now());

[...]

  // This data becomes accurate after a few seconds

var accurateRefreshRate = hertz.getCurrentFrequency();
var accurateRefreshCycleTimestamp = hertz.getFilteredCycleTimestamp();

  // See code for more good helper functions

@TomHarte
Owner

Cool! Especially for potential beam racing, though also for the Qt port — which does a [terrible] attempt at the same estimation in an attempt to back-load work within frames*. There's already a mechanism in place to try to phase-lock machine execution speed to display rate if they're very similar but not exactly matched, and machines can already be run for arbitrary periods, so hopefully most of the other wiring is already in place.

* i.e. it attempts to estimate: (1) standard frame duration; (2) time remaining until next end-of-frame; (3) time it seems to be taking the emulator to run for one frame; and (4) scheduling jitter. It then seeks to sleep the right amount of time to finish generating each frame as close as possible to when it'll be presented, given the potential scheduling jitter from the sleep. Which is a lot of stuff that mostly amounts to: Qt is a terrible target for anything multimedia, but best foot forwards.
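The sleep calculation described in the footnote above can be sketched roughly as follows. This is an illustrative guess at the arithmetic, not the Qt port's actual code; `computeSleepMs` and its parameter names are hypothetical.

```javascript
// Hypothetical sketch of back-loading emulation work within a frame.
// All inputs are estimates in milliseconds:
//   nextFrameEndMs  - predicted time of the next end-of-frame
//   emulationCostMs - how long the emulator seems to take to run one frame
//   jitterMarginMs  - allowance for scheduling jitter when waking from sleep
function computeSleepMs(nowMs, nextFrameEndMs, emulationCostMs, jitterMarginMs) {
  const remainingMs = nextFrameEndMs - nowMs;
  // Never sleep a negative amount: if there is no slack, start work immediately.
  return Math.max(0, remainingMs - emulationCostMs - jitterMarginMs);
}
```

For example, with 16ms until the next end-of-frame, a 4ms emulation cost and a 2ms jitter margin, this would sleep for 10ms and then start generating the frame.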

Though the whole precept of offering only blocking upon posting a new frame as retrace synchronisation seems to be common to many of the platform abstractions, and I'm sure there are others with the same degree of pain attached to trying to split across threads, so maybe it's worth investing properly there.

@mdrejhon
Author

mdrejhon commented Jul 14, 2023

There are several ways to listen to a VSYNC heartbeat:

  • Use full-framerate VSYNC ON. That's laggy, but it does produce a VSYNC clock
  • Execute full-framerate VSYNC ON only at software startup. Then switch to VSYNC OFF and dead-reckon precisely from there. There will be drift, but it might give you a few minutes of low-latency ops if you clock to a microsecond timer
  • Use a background VSYNC daemon approach. Bypass Qt for a VSYNC listener (separate process, separate thread) and send signals to Qt
  • Use simultaneous VSYNC ON (offscreen) and VSYNC OFF (visible) graphics buffers. Send signals between threads.
    Not all implementations of graphics frameworks can do separate sync technologies on separate video buffers, but this actually worked part of the time on some platforms
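To make the dead-reckoning option concrete, here is a small sketch under stated assumptions: `predictNextVsync` and `deadReckoningDriftMs` are hypothetical helper names, and the drift formula simply assumes a constant error in the estimated refresh rate.

```javascript
// Hypothetical dead-reckoning helpers (times in milliseconds).

// Predict the next VSYNC after nowMs from one known-good VSYNC timestamp
// (anchorVsyncMs) and an estimated refresh period.
function predictNextVsync(nowMs, anchorVsyncMs, periodMs) {
  const cyclesElapsed = Math.floor((nowMs - anchorVsyncMs) / periodMs);
  return anchorVsyncMs + (cyclesElapsed + 1) * periodMs;
}

// Estimate how far the prediction has drifted after elapsedMs if the real
// refresh rate is trueHz but the dead-reckoning clock assumes estimatedHz.
function deadReckoningDriftMs(elapsedMs, trueHz, estimatedHz) {
  return elapsedMs * (trueHz / estimatedHz - 1);
}
```

With these assumptions, an estimate of 59.999Hz against a true 60Hz drifts by about 1ms per minute, which illustrates why an accurate refresh rate estimate matters if dead reckoning is to stay usable for a few minutes.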

All of this can be abstracted somewhat -- our open source module doesn't make assumptions about how you will provide a VSYNC heartbeat -- my module simply helps filter/dejitter the VSYNC heartbeat to error margins accurate enough for "lagless vsync" algorithms (like WinUAE), or at least to flywheel a CPU-clocked emulator VSYNC slowly towards a GPU-clocked realworld VSYNC.
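One possible reading of "flywheeling" is to nudge the emulator's frame period toward the measured display period by a small bounded step each frame. The sketch below assumes that interpretation; `flywheelStep` and the `maxStepFraction` parameter are hypothetical names, not part of any module mentioned in this thread.

```javascript
// Hypothetical flywheel step: move the emulator's frame period a little
// closer to the measured real display period, limiting the change per call
// so that emulation speed never changes abruptly.
function flywheelStep(emuPeriodMs, realPeriodMs, maxStepFraction = 0.001) {
  const maxStepMs = emuPeriodMs * maxStepFraction;
  const errorMs = realPeriodMs - emuPeriodMs;
  const stepMs = Math.max(-maxStepMs, Math.min(maxStepMs, errorMs));
  return emuPeriodMs + stepMs;
}
```

Called once per frame with the latest dejittered period, this converges on the display's period while keeping any single speed change below the chosen fraction.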

@mdrejhon
Author

mdrejhon commented Jul 14, 2023

given the potential scheduling jitter from the sleep.

If you want the precision needed for ultra-low-latency beam raced sync -- I prefer CPU busywait loops instead of sleeps. A busywait does burn a CPU core, but it produces the massively improved precision necessary for beam-racing feats. Perhaps make the sleep method configurable (language native sleep, timer sleep, or CPU busyloop sleep). In Tearline Jedi, I used a language sleep until 1-2ms before the target time, then busylooped the final 1-2ms -- and that worked fairly well.
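The sleep-then-busyloop approach can be sketched like this. It is an illustration only, not the Tearline Jedi code: the function names are hypothetical, and it assumes a global `performance.now()` (available in browsers and recent Node.js versions).

```javascript
// Hypothetical hybrid wait: coarse sleep for most of the interval, then a
// CPU busy-wait for the final stretch. Times are in milliseconds.

// Split the remaining time into a coarse sleep and a final spin window.
function planWait(nowMs, targetMs, spinWindowMs = 2) {
  const remainingMs = Math.max(0, targetMs - nowMs);
  const spinMs = Math.min(remainingMs, spinWindowMs);
  return { sleepMs: remainingMs - spinMs, spinMs };
}

// Spin until the clock reaches targetMs. This burns a CPU core while waiting.
function busyWaitUntil(targetMs, clock = () => performance.now()) {
  let now = clock();
  while (now < targetMs) now = clock();
  return now;
}

// Sleep with a timer first, then busy-wait the last spinWindowMs.
// Note: timers can overshoot, so spinWindowMs should be larger than the
// worst timer overshoot you expect on your platform.
async function hybridWaitUntil(targetMs, spinWindowMs = 2, clock = () => performance.now()) {
  const { sleepMs } = planWait(clock(), targetMs, spinWindowMs);
  if (sleepMs > 0) await new Promise((resolve) => setTimeout(resolve, sleepMs));
  return busyWaitUntil(targetMs, clock);
}
```

The injectable `clock` parameter is a design choice for this sketch so the spin loop can be exercised with a fake clock instead of real time.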

One problem is that power management degrades beam racing precision. On VOGONS.org I posted an updated 2023 emulator beamracing HOWTO in response to a question from GloriousCow, the author of MartyPC.

Relevant Updated 2023 Best Practices for emulator "lagless vsync" beam racing ala WinUAE

Best Practices for Emulator Developers (new findings as of 2023)

  1. Only works if you can get (A) intentional tearing, (B) raster poll or estimate the raster as offsets between vsync, and (C) precise timing via microsecond counter. For Windows, you typically need full screen exclusive + VSYNC OFF, in order to get intentional tearing.

  2. For estimating a raster poll as offsets between vsync's, you need to find a reliable way to get fairly accurate vsync timestamps, whether via a CPU thread, a listener, a VSYNC estimator that averages over several refresh cycles (and ignores missed vsyncs), a startup VSYNC ON-listener that goes into deadreckoning mode when switching to VSYNC OFF, etc. Many ways to estimate a raster poll in a cross-platform manner (platform-specific wrappers).

  3. Make it configurable (on/off, frameslice count, raceahead margin, etc), perhaps conservatively autoconfigured by an initial self-benchmark.

  4. Use CPU busywaits, not timers. Timers aren't always precise enough.

  5. Always flush after frame presentation, or make flush configurable. GPUs are pipelined, so you must flush to get deterministic-enough behaviors for beam racing. The return from the flush will be approximately raster-time-aligned to the tearline, and jitter in the return timestamps can be used to guesstimate raster jitter (and maybe automatically warn/disable beam racing if it jitters too much).

  6. Detect your screen refresh rate and compensate. If your VSYNC dejitterer has a refresh rate estimator, use that! (Examples of existing VSYNC-dejittering estimators include https://github.com/blurbusters/RefreshRateCalculator and https://github.com/ad8e/vsync_blurbusters/blob/main/vsync.cpp ) Usually it's best that the display refresh rate is the same as the refresh rate you want to emulate. If there's a refresh rate mismatch, you can fast-beamrace specific refresh cycles (by running your beamraced emulator engine faster, in sync with the faster refresh cycles). WinUAE does this on 120Hz, 180Hz and 240Hz (NTSC) and 100Hz, 150Hz, 200Hz (PAL). At 120Hz, the emulator beamraces every other refresh cycle (so the emulator is idling/paused on the refresh cycles in between).

  7. (OPTIONAL) Detect your current screen rotation and make sure realworld scanout direction is the same as emulator scanout direction. Signal is only top-to-bottom in default rotation. So if you are an arcade emulator, and you want to frameslice beamrace Galaga, you will have to rotate your LCD computer monitor for subrefresh latency. There are display rotation APIs on all platforms that you can call to check. For legacy PC emulator developers, this will be largely unnecessary as it's always top-to-bottom scanout direction until the first version of Windows that supported screen rotation. So simply disable "lagless vsync" whenever you're not in default rotation.

  8. (OPTIONAL) Detect whether VRR (FreeSync/G-SYNC) is enabled, and warn/turn off/compensate beamracing if VRR is enabled. If VRR is enabled, you can disable beamracing, or simply interrupt your VRR refresh cycle via VSYNC OFF. You can still do multiple emulated refresh rates (e.g. 60Hz and 70Hz) flawlessly via VRR and still beamrace-sync VRR refresh cycles (with some caveats). WinUAE is currently the only emulator successfully beamracing GSYNC/FreeSync. How a VRR display refreshes is not initially intuitive to developers who are unfamiliar with VRR. If you choose to beam-race-sync (lagless vsync) your VRR refresh cycles, you need "GSYNC + VSYNC OFF" in NVIDIA Control Panel. What happens is that the first Present() (while the display is idling waiting for new refresh cycles) triggers the display to start the refresh cycle, and subsequent sub-refresh Present() calls will tearline-in new frameslices if the current VRR refresh cycle is scanning-out. The OS raster polls (e.g. D3DKMTGetScanLine) will also still work on VRR refresh cycles. I do, however, suggest not trying to make your "lagless vsync" algorithm VRR compatible initially -- start with the easy stuff (emuHz=realHz) and iterate-in the new capabilities later.

  9. Turn off power management (Performance Mode), or warn user if battery-saving power modes are enabled.

  10. GPU drivers will automatically go into power management if the GPU idles for a millisecond. This adds godawful timing jitter that kills the raster effects. Don't use too-low frameslice counts on high-performance GPUs; you only want hundreds of microseconds (or less) to elapse from a return from Flush() before the next Present(). If you must idle the GPU a lot, then prod/thrash the GPU with changed frames (change a few dummy pixels in an invisible area, e.g. ahead of the beam, about 1-2 milliseconds before your timing-accurate frameslice). So raster effects with only a few splits per screen (e.g. splitscreens) will have worse timing than raster effects that are continuous (e.g. Kefrens), due to the mid-refresh-cycle power management a GPU is doing.

  11. CPU priority, application priority, and thread affinity helps A LOT. "REALTIME" priority is ideal, but be careful not to starve the rest of the system (e.g. mouse driver = sluggish cursor). "HIGH PRIORITY" for both process and for the beamrace present-flush thread, is key.

  12. It's best to overwrite a copy of the previous refresh-cycle emulator framebuffer with new scanlines (and frameslice-beamrace a fragment of that framebuffer), rather than blank/new frame buffer. That way, raster glitches from momentary performance issues (like a virus scanner suddenly running, and beam racing too late) will show as simple intermittent tearing artifacts instead of black flickers.

  13. The first few scanlines of a refresh cycle are pretty tricky because of background processing done by desktop compositor systems like Windows dwm.exe. I can't ever get a tearline to appear in the first ~30 scanlines of my GeForce RTX on Windows 11. Factor this in, possibly by using a larger raceahead margin, or taller frameslices, or variable-height frameslices (with taller ones at the top of the screen).
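As an illustration of item 6 above (beamracing only some refresh cycles when the display runs at a multiple of the emulated rate), here is a minimal sketch. It is not WinUAE's logic; `beamraceDivisor`, `isBeamracedCycle` and the 1% tolerance are assumptions for this example.

```javascript
// Hypothetical helpers for the refresh rate mismatch case.

// Returns N if the display runs at (approximately) N times the emulated
// refresh rate, or null if the rates are not close to an integer multiple.
function beamraceDivisor(realHz, emuHz, tolerance = 0.01) {
  const ratio = realHz / emuHz;
  const n = Math.round(ratio);
  if (n < 1 || Math.abs(ratio - n) / ratio > tolerance) return null;
  return n;
}

// With divisor N, beamrace one refresh cycle out of every N and idle on the
// rest (e.g. every other cycle at 120Hz for a 60Hz emulated machine).
function isBeamracedCycle(cycleIndex, divisor) {
  return cycleIndex % divisor === 0;
}
```

A `null` result is a signal to fall back to ordinary (non-beamraced) presentation or to warn the user, since the two rates cannot be kept in step by skipping whole refresh cycles.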

2 participants