Experimenting with video processing pipelines on the web

This repository contains experimental code to create video processing pipelines using web technologies. It features a semi-generic mechanism to measure the time taken by each processing step.

The code was developed by @dontcallmedom and @tidoust during W3C's Geek Week 2022. It should be viewed as a semi-neophyte attempt to combine recent web technologies to process video, with a view to evaluating how easy or difficult it is to create such processing pipelines. Code here should not be seen as authoritative or even correct. We don't have particular plans to maintain the code either.

See also the Processing video streams slides that present the approach we took and reflect on key outcomes.

Combined Web technologies

The main web technologies combined are:

  • WHATWG Streams, and TransformStream in particular
  • WebCodecs (VideoFrame, VideoEncoder, VideoDecoder)
  • WebGPU and WGSL
  • WebAssembly
  • MediaStreamTrackProcessor and VideoTrackGenerator to bridge with MediaStreamTrack
  • Web Workers, OffscreenCanvas and HTMLVideoElement.requestVideoFrameCallback()

Running the demo

The demo requires support for the list of technologies mentioned above. Currently, this means using Google Chrome with WebGPU enabled.

The demo lets the user:

  • Choose a source of input to create an initial stream of VideoFrame: either an animation created from scratch (using OffscreenCanvas) or a stream generated from a camera.
  • Apply transformations to the stream: replace green with W3C blue using WebAssembly, convert frames to shades of grey using JavaScript, add an H.264 encoding/decoding stage using WebCodecs, and/or introduce slight delays using regular JavaScript.
  • Add an overlay to the bottom right part of the video that encodes the frame's timestamp. The overlay is added using WebGPU and WGSL.
  • Force explicit copies to CPU memory or GPU memory before transformation steps to evaluate the impact of the frame's location on processing times.

Timing statistics are reported in a table at the end of the page, and as objects to the console when the "Stop" button is pressed (this requires opening the dev tools panel). Display times for each frame are also reported when the overlay is present.

Quick code walkthrough

The code uses TransformStream to create processing pipelines. That seemed like the most straightforward mechanism to chain processing steps and benefit from the queueing/backpressure mechanism that comes with streams.
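
For instance, a pipeline is just a chain of pipeThrough() calls. The sketch below uses illustrative names rather than the actual identifiers used in this repository:

```js
// Minimal sketch of a streams-based pipeline (illustrative names).
const toRgbx = new TransformStream({
  async transform(frame, controller) {
    // ... transform the incoming VideoFrame, then enqueue the result
    controller.enqueue(frame);
  }
});
const addOverlay = new TransformStream({ /* identity transform for the sketch */ });

inputStream             // a ReadableStream of VideoFrame
  .pipeThrough(toRgbx)
  .pipeThrough(addOverlay)
  .pipeTo(outputSink);  // a WritableStream, e.g. the writable end of a track generator
```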

The code features the following files:

  • InstrumentedTransformStream.js: A drop-in replacement for TransformStream that records the time it took to transform a chunk (a rough sketch of the idea appears after this list).
  • VideoFrameTimestampDecorator.js: A transformer that adds an overlay to the bottom right corner of a frame, using WebGPU. Use of WebGPU to create an overlay is certainly not mandatory; it was just an excuse for us to use the technology.
  • GreenBackgroundReplacer.js: A transformer that replaces green present in the video frame with W3C blue, using WebAssembly. Use of WebAssembly to run this action is also certainly not mandatory (and probably not a good idea in practice); it was again just an excuse for us to use the technology. The transformer references binary WebAssembly code in GreenBackgroundReplacer.wasm. That binary code is generated from the WebAssembly code in text format in GreenBackgroundReplacer.wat. The code was written in the text format, rather than in one of the many languages that can be compiled to binary WebAssembly, to better understand how WebAssembly works internally. It was compiled with the wat2wasm package, but you may get the same result with the online wat2wasm demo.
  • BlackAndWhiteConverter.js: A transformer that replaces color with shades of grey, using pure JavaScript. This is meant to serve as a reference step to evaluate JavaScript performance for processing pixels.
  • ToRGBXVideoFrameConverter.js: A transformer that converts a video frame, regardless of its pixel format, to a video frame that uses the RGBX format. The transformer uses WebGPU, which is very convenient here because the GPUSampler does all the work! This avoids having to handle different formats in the WebAssembly code. The transformer can also be used to copy frame data to GPU memory.
  • ToCPUMemoryCopier.js: A transformer that copies video frame data to an ArrayBuffer in CPU memory. Together with the previous transformer, it is useful to evaluate the impact of a frame's location in memory on the transformations it goes through.
  • worker-getinputstream.js: A worker that generates a stream of VideoFrame.
  • worker-overlay.js: A worker that leverages VideoFrameTimestampDecorator to add the overlay.
  • worker-transform.js: A worker that can apply transforms to a stream of VideoFrame, including green color replacement, H.264 encoding/decoding, and slight alterations of frame delays.
  • StepTimesDB.js: A generic simple in-memory database to record step processing times of chunks in a stream, and compute stats out of them.
  • main.js: Main thread logic. The code uses requestVideoFrameCallback to inspect rendered frames, copy them to a canvas and decode the color-encoded overlay to retrieve the frame's timestamp (and thus compute the time at which the frame was rendered).
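
As a rough illustration of the timing idea behind InstrumentedTransformStream.js (this is only a sketch; the actual implementation records its results differently):

```js
// Rough sketch: wrap the provided transform() to time each chunk.
class TimedTransformStream extends TransformStream {
  constructor(transformer, name) {
    super({
      ...transformer,
      async transform(chunk, controller) {
        const start = performance.now();
        await transformer.transform(chunk, controller);
        console.log(`${name}: ${performance.now() - start} ms`);
      }
    });
  }
}
```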

Struggles / Learnings

Here are some of the things we struggled with, wondered about or learned while developing the code.

No way to track a frame fed to a <video> element

The frame's timestamp can be used to track a video frame throughout a processing pipeline. In most scenarios though, the final step is to inject the resulting video into a <video> element for playback, and there is no direct way to tell when a specific frame has been rendered by a <video> element. HTMLVideoElement.requestVideoFrameCallback() exposes a number of times that may be used to compute when the underlying frame will be presented to the user, but it does not (yet?) expose the underlying frame's timestamp so applications cannot tell which frame is going to be presented.

The code's workaround is to encode the frame's timestamp in an overlay and to copy frames rendered to the <video> element to a <canvas> element whenever the requestVideoFrameCallback() callback is called to decode the timestamp. That works so-so because it needs to run on the main thread and requestVideoFrameCallback() sometimes misses frames as a result.
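
A condensed sketch of that readback loop (decodeTimestampFromOverlay() and recordDisplayTime() are hypothetical helpers; the actual decoding logic lives in main.js):

```js
// Sketch: main-thread loop that recovers the timestamp of rendered frames.
const video = document.querySelector('video');
const canvas = document.querySelector('canvas');
const ctx = canvas.getContext('2d', { willReadFrequently: true });

function onFrame(now, metadata) {
  // Copy the rendered frame to the canvas and read the overlay pixels
  ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
  const pixels = ctx.getImageData(0, 0, canvas.width, canvas.height).data;
  const timestamp = decodeTimestampFromOverlay(pixels);       // hypothetical
  recordDisplayTime(timestamp, metadata.expectedDisplayTime); // hypothetical
  video.requestVideoFrameCallback(onFrame);
}
video.requestVideoFrameCallback(onFrame);
```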

Being able to track when a frame is actually rendered seems useful for statistics purposes, e.g. to evaluate jitter effects, and probably for synchronization purposes as well if video needs to be synchronized with a separate audio stream and/or other non-video overlays.

An alternative approach would be to render video frames directly to a <canvas> element instead of to a <video> element. This means having to re-implement an entire media player in the generic case, which seems a hard problem.

Hard to mix hybrid stream architectures

The backpressure mechanism in WHATWG Streams takes some getting used to, but appears simple and powerful after a while. It remains difficult to reason about backpressure in video processing pipelines because, by definition, this backpressure mechanism stops whenever something other than WHATWG Streams is used:

  • WebRTC uses MediaStreamTrack by default.
  • The VideoEncoder and VideoDecoder classes in WebCodecs have their own queueing mechanism.
  • VideoTrackGenerator and MediaStreamTrackProcessor create a bridge between WebRTC and WebCodecs, with specific queueing rules.
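
For instance, bridging from and back to MediaStreamTrack typically looks like the sketch below (track and someTransform are assumed to exist; Chrome may still expose the older MediaStreamTrackGenerator name instead of VideoTrackGenerator):

```js
// Sketch: bridging between MediaStreamTrack and WHATWG Streams.
const processor = new MediaStreamTrackProcessor({ track }); // track -> ReadableStream of VideoFrame
const generator = new VideoTrackGenerator();                 // WritableStream of VideoFrame -> track

processor.readable
  .pipeThrough(someTransform)
  .pipeTo(generator.writable);

// The generated track can then be rendered, e.g.:
// videoElement.srcObject = new MediaStream([generator.track]);
```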

There are good reasons that explain the divergence of approaches regarding streams handling across technologies. For example, see Decoupling WebCodecs from Streams. From a developer perspective, this makes mixing technologies harder. It also creates more than one way to build the same pipeline with no obvious right approach to queueing and backpressure.

Hard to mix technologies that require dedicated expertise

More generally speaking, and not surprisingly, it is hard to mix technologies that require different sets of skills. Examples include: pipeline layouts and memory alignment concepts in WebGPU and WGSL, streams and backpressure, video encoding/decoding parameters, WebAssembly memory layout and instructions. It is also hard to understand when copies are made once technologies are combined. In short, combining technologies creates cognitive load, all the more so as these technologies live in their own ecosystems with somewhat disjoint communities.

Missing WebGPU / WebCodecs connector? (resolved)

Importing a VideoFrame into WebGPU as an external texture is relatively straightforward. To create a VideoFrame once GPU processing is over, the code waits for onSubmittedWorkDone and creates a VideoFrame out of the rendered <canvas>. In theory at least, the <canvas> seems unnecessary, but a VideoFrame cannot be created out of a GPUBuffer (at least not without copying the buffer into CPU memory first). Also, this approach seems to create a ~10ms delay on average, and it is not clear whether this is just a temporary implementation hiccup (support for WebGPU and WebCodecs in Chrome is still under development) or just not the right approach to creating a VideoFrame. The shaders that create the overlay were clumsily written and can certainly be drastically optimized, but the delay seems to appear even when the shaders merely sample the texture. Is there a more efficient way to read back from WebGPU and hook into further video processing stages?

There is no need to wait for onSubmittedWorkDone after all. Provided that a GPUTexture is attached to the canvas' context, the VideoFrame constructor will always read the results of the GPU processing from the canvas (waiting for the processing to complete if needed), see related discussion in WebGPU repository. Doing so makes GPU processing drop to ~1ms on average.
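
A sketch of the resulting flow (device, renderPipeline, bindGroup and canvas are assumed to be set up elsewhere):

```js
// Sketch: read GPU results back into a VideoFrame without waiting
// for onSubmittedWorkDone.
const ctx = canvas.getContext('webgpu');
const encoder = device.createCommandEncoder();
const pass = encoder.beginRenderPass({
  colorAttachments: [{
    view: ctx.getCurrentTexture().createView(),
    loadOp: 'clear',
    storeOp: 'store'
  }]
});
pass.setPipeline(renderPipeline);
pass.setBindGroup(0, bindGroup);
pass.draw(6);   // e.g. a full-screen quad made of two triangles
pass.end();
device.queue.submit([encoder.finish()]);

// The constructor reads the canvas once the submitted GPU work has completed.
const outputFrame = new VideoFrame(canvas, { timestamp: inputFrame.timestamp });
inputFrame.close();
```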

VideoFrame, TransformStream and workers

Streams can be transferred through postMessage() to workers. This makes it easy to create processing steps in workers and to transfer streams of VideoFrame back and forth between these workers and the main thread, through TransformStream. The problem is that the actual chunks written to a TransformStream are serialized and not transferred, and serialization does not clearly transfer the ownership of the VideoFrame for now.

In essence, the application needs to close the frame explicitly on the sender's side, otherwise memory gets leaked and stalls promptly occur, especially when there is a hardware video decoder in the processing pipeline. Now, when the TransformStream exists with one leg in each realm, the actual sending/receiving occurs asynchronously, and there is no way to know when the frame has been cloned and can be safely closed on the sender's side.

The workaround is to keep track of the frames to close, and to close them once we know that processing is over. This feels tedious and hacky, and complicates the use of workers and TransformStream. Ideally, frames would be automatically closed when they are enqueued to the TransformStream controller or written to a WritableStream, so that applications would only have to worry about the receiver side. Inclusion of such a mechanism in the Streams spec is being discussed.
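
For illustration, the workaround boils down to something like the following sketch (the actual bookkeeping in this repository differs):

```js
// Sketch: remember frames written to a transferred TransformStream and
// close them once processing is known to be over.
const pendingFrames = new Set();
const writer = transformStream.writable.getWriter();

async function send(frame) {
  pendingFrames.add(frame);
  await writer.write(frame);   // the frame gets serialized, not transferred
}

function onProcessingOver() {
  for (const frame of pendingFrames) {
    frame.close();             // release the sender-side frame
  }
  pendingFrames.clear();
}
```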

The number of workers can of course be reduced, e.g. to run all processing steps in a single worker. That said, if media capture is needed, getUserMedia() only runs on the main thread, so the initial stream of VideoFrame needs to be created on the main thread and transferred to a worker for processing no matter what.
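
A sketch of that capture-and-transfer step (the message shape is illustrative, not necessarily the one used by this repository's workers):

```js
// Sketch: capture on the main thread, process in a worker.
const stream = await navigator.mediaDevices.getUserMedia({ video: true });
const processor = new MediaStreamTrackProcessor({
  track: stream.getVideoTracks()[0]
});
const worker = new Worker('worker-transform.js');
// Transfer the ReadableStream of VideoFrame to the worker for processing.
worker.postMessage({ readable: processor.readable }, [processor.readable]);
```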

WebAssembly and VideoFrame data copy performances

WebCodecs and other APIs that handle decoded image data were carefully written to avoid exposing the actual image data to the JavaScript code in order to minimize the number of copies that may need to be made. In practice, most processing steps involve data that sits on the GPU: drawing to a canvas, capturing a video stream from a camera, encoding/decoding in H.264, processing with WebGPU.

WebAssembly cannot interact with GPU-bound memory though. The only way to process a VideoFrame in WebAssembly is to first copy it over to WebAssembly's linear memory, using VideoFrame.copyTo(). That call triggers a readback from the GPU, meaning that the frame needs to be copied from GPU memory to the ArrayBuffer that backs WebAssembly's linear memory. That copy is costly: a Full HD (1920x1080) SDR video frame in RGBA is about 8MB, and the copy may take 15-20ms. That is a significant cost to pay, especially considering that the time budget per frame is only 40ms for a video at 25 frames per second, and 20ms for a video at 50 frames per second.
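
A sketch of that copy (the alloc and processPixels exports are hypothetical; only allocationSize() and copyTo() come from WebCodecs):

```js
// Sketch: copy a VideoFrame into WebAssembly linear memory for processing.
const size = frame.allocationSize();
const ptr = instance.exports.alloc(size);   // hypothetical allocator export
const dest = new Uint8Array(instance.exports.memory.buffer, ptr, size);
await frame.copyTo(dest);                   // triggers the GPU readback
instance.exports.processPixels(ptr, frame.codedWidth, frame.codedHeight); // hypothetical export
```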

Copy times seem consistent with, albeit a bit higher than, times reported by Paul Adenot in his Memory Access Patterns in WebCodecs talk during W3C's workshop on professional media production on the Web. As Paul points out, "It's always better to keep the VideoFrames on the GPU if possible". Video frame processing with WebAssembly is possible but not the best option if that processing can be envisioned on the GPU.

WebGPU and video pixel formats

The arrangement of bytes in each plane of the VideoFrame essentially does not matter when WebGPU is used: the importExternalTexture() method accepts all common pixel formats and the GPUSampler takes care of converting these formats under the hood to create RGBA colors. As such, the shaders don't need to deal with format conversion; they will read RGBA pixels from the incoming video frame texture.

This actually makes WebGPU quite convenient for converting a VideoFrame that uses one of the YUV variants to an RGBA one, as done in ToRGBXVideoFrameConverter.js.
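
On the JavaScript side, importing the frame is a one-liner; a sketch (device, pipeline and sampler are assumed to be created elsewhere):

```js
// Sketch: expose a VideoFrame to WGSL shaders as an external texture.
const externalTexture = device.importExternalTexture({ source: videoFrame });
const bindGroup = device.createBindGroup({
  layout: pipeline.getBindGroupLayout(0),
  entries: [
    { binding: 0, resource: sampler },
    { binding: 1, resource: externalTexture }
  ]
});
// In WGSL, textureSampleBaseClampToEdge() then returns RGBA values,
// whatever the frame's original pixel format.
```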

One drawback is that a VideoFrame that uses one of the YUV formats will get converted no matter what. Is there an easy way to preserve the initial YUV format using WebGPU?

WebAssembly and video pixel formats

As opposed to WebGPU, WebAssembly sees the pixels of a VideoFrame as a linear sequence of bytes, whose interpretation is up to the WebAssembly code that gets executed. The WebAssembly code thus needs to deal with the different possible variations of YUV and RGB arrangements... or frame conversion needs to take place before the frame reaches the WebAssembly code. Converting YUV-like formats to RGBA is relatively straightforward but that is still extra work.
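
A minimal per-pixel sketch of such a conversion, assuming BT.601 full-range coefficients (the right matrix actually depends on the frame's color space):

```js
// Sketch: convert one YUV pixel to RGB (BT.601 full-range assumption).
function yuvToRgb(y, u, v) {
  const r = y + 1.402 * (v - 128);
  const g = y - 0.344 * (u - 128) - 0.714 * (v - 128);
  const b = y + 1.772 * (u - 128);
  return [r, g, b].map(c => Math.max(0, Math.min(255, Math.round(c))));
}
```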

WebAssembly and bytes arrangement

To speed up WebAssembly processing a bit, it seems useful to read the four RGBA bytes as one 32-bit integer. Beware though: WebAssembly reads memory using little-endian byte order, so when reading the RGBA bytes at once, you actually end up with ABGR ordering in the 32-bit integer.
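
The following JavaScript snippet illustrates the same byte-order effect outside of WebAssembly:

```js
// Reading 4 RGBA bytes as one little-endian 32-bit integer.
const bytes = new Uint8Array([0x11, 0x22, 0x33, 0x44]); // R, G, B, A
const value = new DataView(bytes.buffer).getUint32(0, true /* little-endian */);
// value === 0x44332211: A, B, G, R from most to least significant byte,
// so a mask like 0x00FF0000 selects the blue component, not the green one.
```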

WebAssembly and SIMD

The use of SIMD instructions in WebAssembly could in theory further help divide the processing time by up to 4: SIMD instructions process vectors of 128 bits, meaning 16 color components or 4 pixels at once. This may be worth considering, depending on the processing being applied.

Acknowledgments

Work on this code was prompted and strongly inspired by demos, code and issues created by Bernard Aboba (@aboba) as part of joint Media Working Group and WebRTC Working Group discussions on the media pipeline architecture, see w3c/media-pipeline-arch#1 and underlying code in w3c/webcodecs#583 for additional context. Many thanks for providing the initial spark and starting code that @dontcallmedom and I could build upon!

The WebAssembly code was loosely inspired by the Video manipulation with WebAssembly article by Szabolcs Damján.