I want to test the A100 hardware JPEG decoder and analyze why there are significant throughput differences when different ratios of decoding tasks are allocated to it. I pulled the `nvcr.io/nvidia/pytorch:23.12-py3` Docker image, created a container, and built a DALI pipeline. My naive approach was to add `pdb` to inspect program execution. However, when the program reaches `self._pipe.RunGPU()` in `nvidia/dali/pipeline.py`, I can't step into `RunGPU()`.
What I want to know is how to analyze why throughput differs when different ratios of decoding tasks are assigned to the hardware decoder. For example, according to the blog post "Loading Data Fast with DALI and the New Hardware JPEG Decoder in NVIDIA A100 GPUs", if 75% of the decoding tasks are assigned to the hardware decoder, the throughput can reach about 7000 img/sec.
However, if all tasks are assigned to the hardware decoder, the throughput is only about 5000 img/sec, and if all decoding is assigned to the A100 GPU, the throughput is about 6000 img/sec. In my own test, when I assigned 10% of the decoding tasks to the hardware decoder (`hw_decoder_load=0.1`), the throughput was only about 2000 img/sec. I want to know why, and how to analyze why this is the case.
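A simple way to compare settings is to time a fixed number of batches for each `hw_decoder_load` value and compute images per second. A minimal sketch of such a sweep — the pipeline construction itself is elided, so `build_pipeline` below is a hypothetical helper, while `hw_decoder_load` is the actual DALI decoder argument:

```python
import time

def measure_throughput(run_batch, batch_size, iters=100, warmup=10):
    """Return images/sec for a callable that produces one decoded batch.

    run_batch would be e.g. pipe.run for a built DALI pipeline; warmup
    iterations are excluded so one-time allocation and decoder
    initialization costs don't skew the number.
    """
    for _ in range(warmup):
        run_batch()
    start = time.perf_counter()
    for _ in range(iters):
        run_batch()
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed

# Sweep the fraction of work sent to the hardware decoder and compare:
# for load in (0.0, 0.1, 0.25, 0.5, 0.75, 1.0):
#     pipe = build_pipeline(hw_decoder_load=load)  # hypothetical builder
#     pipe.build()
#     print(load, measure_throughput(pipe.run, batch_size=256))
```

Running the same sweep that the blog describes makes it easier to tell whether the 2000 img/sec number is specific to `hw_decoder_load=0.1` or a property of the whole setup.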
Check for duplicates
I have searched the open bugs/issues and have found no duplicates for this bug report
The best tool I can recommend is the Nsight Systems profiler. You can collect your profile like this:

```shell
# (optional) lower the paranoid level for profiling (this gives us some extra info for the CPU part of the execution)
echo 1 > /proc/sys/kernel/perf_event_paranoid
# collect your profile with nsys from the CUDA toolkit
nsys profile --trace=cuda,opengl,nvtx python your_test_script.py
```

This should give you a profile file that you can load and visualize to see the timeline of your execution. You need to install Nsight Systems (https://developer.nvidia.com/nsight-systems) to open it.
Feel free to send us the profile back and we can have a look and help you figure out what's going on in your case.
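If the raw timeline is hard to read, one option is to wrap each batch in an NVTX range so every iteration shows up as a named span under the `nvtx` trace collected above. A minimal sketch, assuming the PyTorch container so `torch.cuda.nvtx` is available (`pipe` here is a hypothetical DALI pipeline object; the helper degrades to a no-op when torch is absent):

```python
import contextlib

try:
    from torch.cuda import nvtx  # present in the nvcr.io pytorch container
except ImportError:
    nvtx = None

@contextlib.contextmanager
def nvtx_range(name):
    """Annotate a region so it appears as a named range on the nsys timeline."""
    if nvtx is not None:
        nvtx.range_push(name)
    try:
        yield
    finally:
        if nvtx is not None:
            nvtx.range_pop()

# Usage: wrap each benchmark iteration so decode batches are easy to spot.
# for i in range(100):
#     with nvtx_range(f"batch_{i}"):
#         pipe.run()  # hypothetical DALI pipeline
```

With the batches labeled, you can compare how much of each range is spent in hardware-decoder work versus CUDA-decoder kernels for different `hw_decoder_load` values.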