A100 hardware decoder #5362

Open
dengjiahao12 opened this issue Mar 8, 2024 · 1 comment
Labels: question (Further information is requested)

@dengjiahao12

Describe the question.

A100 hardware decoder: I pulled the nvcr.io/nvidia/pytorch:23.12-py3 Docker image and created a container. I built the following pipeline:

images, _ = fn.readers.file(file_root=image_dir, random_shuffle=True)
images = fn.decoders.image(images, device='mixed', output_type=types.RGB, hw_decoder_load=0.75)
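
A self-contained version of what I am running looks roughly like this (a sketch: the data directory, batch size, and thread count below are placeholders, not my exact settings):

from nvidia.dali import pipeline_def, fn, types

image_dir = "/data/images"  # placeholder: a directory of JPEG files

@pipeline_def(batch_size=256, num_threads=8, device_id=0)
def decode_pipeline(hw_load=0.75):
    # the CPU reader feeds encoded JPEGs to the mixed (CPU+GPU) decoder
    jpegs, _ = fn.readers.file(file_root=image_dir, random_shuffle=True)
    # hw_decoder_load controls the fraction of images routed to the A100's
    # dedicated JPEG hardware decoder; the remainder is decoded with CUDA
    images = fn.decoders.image(jpegs, device="mixed",
                               output_type=types.RGB,
                               hw_decoder_load=hw_load)
    return images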

I want to test the A100 hardware decoder and analyze why there are significant throughput differences when different ratios of decoding tasks are allocated to it. My naive approach was to use pdb to step through the program, but when execution reaches self._pipe.RunGPU() in nvidia/dali/pipeline.py, I can't step into RunGPU().

What I want to know is how to analyze why throughput differs when different ratios of decoding tasks are assigned to the hardware decoder. For example, according to the blog Loading Data Fast with DALI and the New Hardware JPEG Decoder in NVIDIA A100 GPUs, if 75% of the decoding tasks are assigned to the hardware decoder, the throughput can reach about 7000 img/sec.

However, if all tasks are assigned to the hardware decoder, the throughput is only about 5000 img/sec, and if all decoding is assigned to the A100 GPU, the throughput is about 6000 img/sec. In my own test, when I assigned 10% of the decoding tasks to the hardware decoder (hw_decoder_load=0.1), the throughput was only 2000 img/sec. I want to understand why this happens and how to analyze it.
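
For context, my throughput numbers come from a plain timing loop over the pipeline above, roughly like the following (the warm-up and iteration counts are illustrative):

import time

batch_size, iterations = 256, 100

for hw_load in (0.0, 0.1, 0.25, 0.5, 0.75, 1.0):
    pipe = decode_pipeline(hw_load)
    pipe.build()
    for _ in range(10):  # warm-up so one-time allocations are not measured
        pipe.run()
    start = time.perf_counter()
    for _ in range(iterations):
        pipe.run()
    elapsed = time.perf_counter() - start
    print(f"hw_decoder_load={hw_load}: "
          f"{batch_size * iterations / elapsed:.0f} img/sec")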

Check for duplicates

  • I have searched the open bugs/issues and have found no duplicates for this bug report
@dengjiahao12 dengjiahao12 added the question Further information is requested label Mar 8, 2024
@jantonguirao jantonguirao assigned jantonguirao and unassigned awolant Mar 8, 2024
@jantonguirao
Contributor

Hi @dengjiahao12. Thank you for your question.

The best tool I can recommend is the Nsight Systems profiler. You can collect a profile like this:

# (optional) lower paranoid level for profiling (this will give us some extra info for the CPU part of the execution)
echo 1 > /proc/sys/kernel/perf_event_paranoid

# collect the profile with nsys (shipped with the CUDA toolkit)
nsys profile --trace=cuda,opengl,nvtx python your_test_script.py

This should produce a profile file that you can load and visualize to see the timeline of your execution. You need to install Nsight Systems (https://developer.nvidia.com/nsight-systems) to open it.
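
If you want the DALI iterations to be easy to spot in the timeline, you can also wrap your Python-side loop in NVTX ranges. Here is a small sketch using the nvtx Python package (an optional extra, not something DALI requires; pipe stands for the built pipeline from your script):

import nvtx  # pip install nvtx

# pipe is assumed to be the built DALI pipeline from the script under test
for i in range(100):
    # each range shows up as a labeled block on the NVTX row of the timeline
    with nvtx.annotate(f"iteration_{i}", color="green"):
        pipe.run()

Since the nsys command above already traces nvtx, these ranges will appear alongside the CUDA activity and make it easier to see where the time goes in each iteration.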

Feel free to send us the profile back and we can have a look and help you figure out what's going on in your case.

Hope that helps.
