A100 hardware decoder #5362

Open
dengjiahao12 opened this issue Mar 8, 2024 · 1 comment
Labels: question (Further information is requested)

@dengjiahao12

Describe the question.

A100 hardware decoder: I pulled the nvcr.io/nvidia/pytorch:23.12-py3 Docker image and created a container. I built the following pipeline:

images, _ = fn.readers.file(file_root=image_dir, random_shuffle=True)
images = fn.decoders.image(images, device='mixed', output_type=types.RGB, hw_decoder_load=0.75)
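
A self-contained version of what I am running looks roughly like this (a sketch: the data directory, batch size, and thread count below are placeholders, not my exact settings):

from nvidia.dali import pipeline_def, fn, types

image_dir = "/data/images"  # placeholder: a directory of JPEG files

@pipeline_def(batch_size=256, num_threads=8, device_id=0)
def decode_pipeline(hw_load=0.75):
    # the CPU reader feeds encoded JPEGs to the mixed (CPU+GPU) decoder
    jpegs, _ = fn.readers.file(file_root=image_dir, random_shuffle=True)
    # hw_decoder_load controls the fraction of images routed to the A100's
    # dedicated JPEG hardware decoder; the remainder is decoded with CUDA
    images = fn.decoders.image(jpegs, device="mixed",
                               output_type=types.RGB,
                               hw_decoder_load=hw_load)
    return images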

I want to test the A100 hardware decoder and analyze why there are significant throughput differences when different ratios of decoding tasks are allocated to it. My naive approach was to use pdb to step through the program, but when execution reaches self._pipe.RunGPU() in nvidia/dali/pipeline.py, I can't step into RunGPU().

What I want to know is how to analyze why throughput differs when different ratios of decoding tasks are assigned to the hardware decoder. For example, according to the blog Loading Data Fast with DALI and the New Hardware JPEG Decoder in NVIDIA A100 GPUs, if 75% of the decoding tasks are assigned to the hardware decoder, the throughput can reach about 7000 img/sec.

However, if all tasks are assigned to the hardware decoder, the throughput is only about 5000 img/sec, and if all decoding is assigned to the A100 GPU, the throughput is about 6000 img/sec. In my own test, when I assigned 10% of the decoding tasks to the hardware decoder (hw_decoder_load=0.1), the throughput was only 2000 img/sec. I want to understand why this happens and how to analyze it.
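
For context, my throughput numbers come from a plain timing loop over the pipeline above, roughly like the following (the warm-up and iteration counts are illustrative):

import time

batch_size, iterations = 256, 100

for hw_load in (0.0, 0.1, 0.25, 0.5, 0.75, 1.0):
    pipe = decode_pipeline(hw_load)
    pipe.build()
    for _ in range(10):  # warm-up so one-time allocations are not measured
        pipe.run()
    start = time.perf_counter()
    for _ in range(iterations):
        pipe.run()
    elapsed = time.perf_counter() - start
    print(f"hw_decoder_load={hw_load}: "
          f"{batch_size * iterations / elapsed:.0f} img/sec")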

Check for duplicates

  • I have searched the open bugs/issues and have found no duplicates for this bug report
@dengjiahao12 dengjiahao12 added the question Further information is requested label Mar 8, 2024
@jantonguirao jantonguirao assigned jantonguirao and unassigned awolant Mar 8, 2024
@jantonguirao
Contributor

Hi @dengjiahao12. Thank you for your question.

The best tool I can recommend is the Nsight Systems profiler. You can collect a profile like this:

# (optional) lower paranoid level for profiling (this will give us some extra info for the CPU part of the execution)
echo 1 > /proc/sys/kernel/perf_event_paranoid

# collect the profile with nsys (shipped with the CUDA toolkit)
nsys profile --trace=cuda,opengl,nvtx python your_test_script.py

This should produce a profile file that you can load and visualize to see the timeline of your execution. You need to install Nsight Systems (https://developer.nvidia.com/nsight-systems) to open it.
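
If you want the DALI iterations to be easy to spot in the timeline, you can also wrap your Python-side loop in NVTX ranges. Here is a small sketch using the nvtx Python package (an optional extra, not something DALI requires; pipe stands for the built pipeline from your script):

import nvtx  # pip install nvtx

# pipe is assumed to be the built DALI pipeline from the script under test
for i in range(100):
    # each range shows up as a labeled block on the NVTX row of the timeline
    with nvtx.annotate(f"iteration_{i}", color="green"):
        pipe.run()

Since the nsys command above already traces nvtx, these ranges will appear alongside the CUDA activity and make it easier to see where the time goes in each iteration.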

Feel free to send us the profile back and we can have a look and help you figure out what's going on in your case.

Hope that helps.
