
VPF Performance analysis

Roman Arzumanyan edited this page Oct 29, 2021 · 12 revisions

Basic idea

VPF provides Python bindings to HW-accelerated video processing features such as decoding / encoding and some CUDA-accelerated features like color conversion.
How do you know whether your Python video processing program achieves optimal performance? By using performance profiling and measurement tools.
You may think of VPF as just another CUDA and Video Codec SDK C++ library, so all existing CUDA tools apply to VPF as well.

nvidia-smi

A small yet extremely useful CLI tool which shows a wealth of information such as Nvdec / Nvenc / CUDA core load levels, GPU clocks and more. Run this utility in parallel with your program to see which HW components are used.
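
Launching the monitor alongside your workload can be scripted. Below is a minimal sketch; the `dmon -s u` flag (utilization columns) is taken from nvidia-smi's help output, and the `Popen` call is left commented because it requires an NVIDIA driver to actually run:

```python
import subprocess

# Sketch: sample engine utilization while a video pipeline runs.
# `-s u` selects the utilization metric group (sm, mem, enc, dec).
dmon_cmd = ["nvidia-smi", "dmon", "-s", "u"]
print(" ".join(dmon_cmd))

# On a machine with an NVIDIA GPU:
#   monitor = subprocess.Popen(dmon_cmd)
#   ...run your video pipeline here...
#   monitor.terminate()
```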

Let's run it to get information about the HW, CUDA version and more: nvidia-smi CLI utility output

Below is an example of SampleDemuxDecode.py, which utilizes Nvdec for 1080p H.264 video decoding and CUDA cores for NV12 -> YUV420 color conversion:

nvidia-smi dmon CLI utility launched in parallel with SampleDemuxDecode.py

What do these numbers mean?

Neither Nvdec nor the CUDA cores are maxed out; their usage is between 15% and 23%. Hence something is slowing down the program.
The usual reasons are:

  • Network or disk IO speed
  • Memory copies between RAM and vRAM
  • CPU-side code which takes a long time to run

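To spot such bottlenecks in dmon output programmatically, the text can be parsed into per-sample utilization values. The sample output below is illustrative (not captured from the GPU used in this wiki); the column layout follows nvidia-smi dmon's default header:

```python
# Illustrative `nvidia-smi dmon` output; columns follow the tool's header.
SAMPLE = """\
# gpu    pwr  gtemp  mtemp     sm    mem    enc    dec
# Idx      W      C      C      %      %      %      %
    0     38     54      -     23      9      0     16
    0     39     54      -     21      9      0     18
"""

def parse_dmon(text):
    """Return one dict per sample row, keyed by the header columns."""
    header = None
    rows = []
    for line in text.splitlines():
        if line.startswith("# gpu"):
            header = line.lstrip("# ").split()
        elif line.strip() and not line.startswith("#"):
            rows.append(dict(zip(header, line.split())))
    return rows

samples = parse_dmon(SAMPLE)
dec_load = [int(r["dec"]) for r in samples]
sm_load = [int(r["sm"]) for r in samples]
# Neither engine is near 100%, so the bottleneck is elsewhere.
print(max(dec_load), max(sm_load))
```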
Let's check what's happening in our SampleDemuxDecode.py code:

    while True:
        # Demuxer has sync design, it returns packet every time it's called.
        # If demuxer can't return packet it usually means EOF.
        if not nvDmx.DemuxSinglePacket(packet):
            break
        # Decoder is async by design.
        # As it consumes packets from demuxer one at a time it may not return
        # decoded surface every time the decoding function is called.
        surface_nv12 = nvDec.DecodeSurfaceFromPacket(packet)
        if not surface_nv12.Empty():
            surface_yuv420 = nvCvt.Execute(surface_nv12, cc_ctx)
            if surface_yuv420.Empty():
                break
            if not nvDwn.DownloadSingleSurface(surface_yuv420, rawFrame):
                break
            bits = bytearray(rawFrame)
            decFile.write(bits)
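
Before changing the code, one can also measure where wall time goes inside the loop. The timing helper below is a sketch (not part of VPF) that can wrap each stage of the pipeline above:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulate wall-clock time per pipeline stage.
stage_time = defaultdict(float)

@contextmanager
def timed(stage):
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_time[stage] += time.perf_counter() - start

# Usage inside the decode loop (names from the sample above):
#   with timed("demux"):
#       ok = nvDmx.DemuxSinglePacket(packet)
#   with timed("write"):
#       decFile.write(bits)

with timed("demo"):
    time.sleep(0.01)
print(sorted(stage_time.items(), key=lambda kv: -kv[1]))
```

The stage with the largest accumulated time is the first candidate for optimization.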

We can easily see that decoded surfaces are copied from vRAM to RAM and later saved to disk. Let's comment out the portion of the code which dumps frames to disk and look at the nvidia-smi dmon output again:

            #bits = bytearray(rawFrame)
            #decFile.write(bits)

nvidia-smi dmon output after frames are no longer stored on disk

One can easily notice the performance improvement. Now let's remove the copy between RAM and vRAM, which is done here:

            #if not nvDwn.DownloadSingleSurface(surface_yuv420, rawFrame):
            #    break

And repeat analysis one more time:

nvidia-smi dmon output after DtoH memcpy is eliminated

We see a decline in CUDA core load, and Nvdec usage is now stable at 33%. It doesn't go any higher than this simply because our Quadro RTX 3000 GPU has 3 Nvdec units, hence a single video stream can only occupy 1/3 of its decoding capacity, which is ~33%.

Now our modified SampleDemuxDecode.py script is clearly limited by Nvdec performance and can't be further optimized.
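
The ~33% ceiling follows directly from the unit count; a one-liner makes the arithmetic explicit:

```python
# One stream occupies one decoder unit, so the maximum reported
# utilization is 1 / num_nvdec_units of the whole Nvdec engine.
def per_stream_ceiling(num_nvdec_units: int) -> float:
    return 100.0 / num_nvdec_units

print(per_stream_ceiling(3))  # Quadro RTX 3000: 3 units -> ~33%
```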

Nsight Systems

Nsight Systems is an application profiler shipped alongside the CUDA SDK. It supports both GUI and CLI modes. Since the Python interpreter is an application itself, it loads the PyNvCodec module the same way other applications load shared libraries.
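
In CLI mode, the trace can be collected by wrapping the Python interpreter with `nsys profile`. The sketch below only builds and prints the command; the `--trace` and `--output` flags are taken from the nsys CLI help (the report file extension varies between Nsight Systems versions), so check them against your installation:

```python
import shlex

# Build the nsys command line for profiling the sample script.
cmd = [
    "nsys", "profile",
    "--trace=cuda,nvtx",       # collect CUDA and NVTX trace
    "--output=demux_decode",   # report file name (extension added by nsys)
    "python", "SampleDemuxDecode.py",
]
print(shlex.join(cmd))

# On a machine with Nsight Systems installed:
#   import subprocess
#   subprocess.run(cmd, check=True)
```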

VPF supports the Nvidia NVTX library, which allows adding custom markers visible on the application timeline within Nsight Systems. To enable this, opt in to the USE_NVTX option when configuring VPF with CMake.

See below how to collect timeline for SampleDemuxDecode.py:
Make sure you enable the "Collect CUDA trace" and "Collect NVTX trace" options. Setting up Nsight Systems for SampleDemuxDecode.py profiling

Once you're done collecting profiling samples, you can inspect the application timeline:

Application timeline

All VPF Tasks are encapsulated in NVTX markers, so you can inspect each and every task.
Let's take a closer look: Nsight Systems application timeline - closeup

You can see a fragment of the timeline which shows frame demux, decode, NV12 -> YUV420 color conversion and DtoH CUDA memcpy.
Please note that demuxing is done completely on the CPU, yet it's still shown on the application timeline.

Actual decoding latency

There are 2 more NVTX markers in the Task which decodes frames: one marker is issued when a frame is kicked off for decoding:

  try {
    {
      /* Do this in separate scope because we don't want to measure
       * DecodeLockSurface() function run time;
       */
      stringstream ss;
      ss << "Start decode for frame with pts " << timestamp;
      NvtxMark decode_k_off(ss.str().c_str());
    }
    isSurfaceReturned =
        decoder.DecodeLockSurface(pEncFrame, timestamp, dec_ctx);
    pImpl->didDecode = true;
  } catch (exception& e) {
    cerr << e.what() << endl;
    return TASK_EXEC_FAIL;
  }

And the second marker is issued when the frame is ready for display:

    {
      stringstream ss;
      ss << "End decode for frame with pts " << dec_ctx.pts;
      NvtxMark display_ready(ss.str().c_str());
    }

As you can see, information about the frame PTS (presentation timestamp) is baked into the NVTX marker message.
So you can get exact info about when frames are submitted for decoding and when they are ready for display. Please note that NVTX markers are drawn to scale according to their duration, hence these two markers are tiny.

E.g. let's see when the first frame is submitted in our SampleDemuxDecode.py script: First frame submitted to decoder

It happens around 1.21388s after the program started. Now let's find when the first frame was ready for display: First frame submitted is ready for display

The first frame submitted for decoding is ready for display around 1.23212s. So the latency is (1.23212 - 1.21388) s = 0.01824 s. Please note that the input file has B frames, so the decoding time will vary. It also depends on resolution, GOP structure, decoder settings and many other parameters, so please don't rely on SampleDemuxDecode.py as a performance benchmark.
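
This per-frame calculation generalizes: since each marker message carries the PTS, submit and ready timestamps read off the timeline can be matched by PTS. A minimal sketch (the PTS key and timestamps below are placeholders taken from this example):

```python
# Timestamps in seconds since program start, read off the Nsight
# Systems timeline; keyed by frame PTS (placeholder value 0 here).
submit_ts = {0: 1.21388}  # "Start decode for frame with pts ..."
ready_ts = {0: 1.23212}   # "End decode for frame with pts ..."

latency = {pts: ready_ts[pts] - submit_ts[pts] for pts in submit_ts}
print(f"{latency[0]:.5f} s")  # ~0.01824 s for the frame above
```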