Why Decoder consumes 24 CPU cores even if using "mixed" device #5274

Open · 1 task done
mengwanguc opened this issue Jan 9, 2024 · 4 comments
Labels: question (Further information is requested)
Describe the question.

Hi,

I'm using DALI to preprocess the ImageNet data. I have all my data cached in memory, and I want to test the performance of GPU preprocessing.

I'm using batch size 256 and num_threads=64 (large enough that the CPU is not the bottleneck).

My pipeline only has a reader and a decoder, and I don't train any model.
I'm using device='cpu' for the reader and device='mixed' for the decoder, as advised by the documentation: https://docs.nvidia.com/deeplearning/dali/user-guide/docs/examples/image_processing/decoder_examples.html

However, I found that DALI is consuming 2400% CPU usage, which means 24 CPU cores.

I'm surprised, because I thought DALI would offload all the decoding to the GPU, so the CPU usage should be low.

I can see the work is indeed offloaded to the GPU, as the GPU utilization is ~54%.

But I don't understand what is consuming so many CPU resources.
I understand that some memory copies take CPU time, e.g. copying data from pageable to pinned memory, but I don't think that would consume 2400% CPU.

This is my pipeline:

import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali import pipeline_def

@pipeline_def
def create_dali_pipeline(data_dir, crop, size, shard_id, num_shards, dali_cpu=False, is_training=True):
    images, labels = fn.readers.file(file_root=data_dir,
                                     shard_id=shard_id,
                                     num_shards=num_shards,
                                     random_shuffle=is_training,
                                     pad_last_batch=True,
                                     name="Reader")
    dali_device = 'cpu' if dali_cpu else 'gpu'
    decoder_device = 'cpu' if dali_cpu else 'mixed'
    # ask nvJPEG to preallocate memory for the biggest sample in ImageNet for CPU and GPU to avoid reallocations in runtime
    device_memory_padding = 211025920 if decoder_device == 'mixed' else 0
    host_memory_padding = 140544512 if decoder_device == 'mixed' else 0
    # ask HW NVJPEG to allocate memory ahead for the biggest image in the data set to avoid reallocations in runtime
    preallocate_width_hint = 5980 if decoder_device == 'mixed' else 0
    preallocate_height_hint = 6430 if decoder_device == 'mixed' else 0
    if is_training:
        images = fn.decoders.image_random_crop(images,
                                               device=decoder_device, output_type=types.RGB,
                                               device_memory_padding=device_memory_padding,
                                               host_memory_padding=host_memory_padding,
                                               preallocate_width_hint=preallocate_width_hint,
                                               preallocate_height_hint=preallocate_height_hint,
                                               random_aspect_ratio=[0.8, 1.25],
                                               random_area=[0.1, 1.0],
                                               num_attempts=100)
    else:
        # validation path: plain decode, no random crop
        images = fn.decoders.image(images,
                                   device=decoder_device, output_type=types.RGB)
    return images, labels
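
For reference, a minimal sketch of how a pipeline like this can be built and run standalone; the data path is a placeholder, and the batch size and thread count are taken from the numbers quoted above:

# hypothetical instantiation; @pipeline_def adds the batch_size/num_threads/device_id kwargs
pipe = create_dali_pipeline(batch_size=256, num_threads=64, device_id=0,
                            data_dir="/path/to/imagenet/train",  # placeholder path
                            crop=224, size=256, shard_id=0, num_shards=1)
pipe.build()
images, labels = pipe.run()  # one batch; images live on the GPU with device='mixed'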

Thanks!

Check for duplicates

  • I have searched the open bugs/issues and have found no duplicates for this bug report
mengwanguc added the question label on Jan 9, 2024
JanuszL (Contributor) commented Jan 9, 2024

Hi @mengwanguc,

Thank you for reaching out. Please read this blog post to learn in detail how image decoding is accelerated.
Long story short: JPEG decoding splits into a highly serial part that runs on the CPU (entropy/Huffman decoding) and a part that can be accelerated on the GPU (on the SMs). Starting with the Ampere architecture, a dedicated HW block offloads the whole decoding process, including the serial part. In DALI you can use both approaches in parallel and use hw_decoder_load to split the work between them.
Based on your description, I guess you are using a GPU that doesn't have the HW decoder, so your CPU and GPU are both involved in the decoding.
There is also hybrid_huffman_threshold, which lets you offload even the serial part to the GPU, but for small images this is inefficient and slows things down.
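
For illustration, a minimal sketch of how these two knobs can be set on the decoder; the concrete values below are placeholders, not tuned recommendations:

import nvidia.dali.fn as fn
import nvidia.dali.types as types

# 'encoded' is the output of fn.readers.file.
# hw_decoder_load is the fraction of the batch routed to the dedicated HW JPEG
# decoder (if present); the rest goes through the hybrid CPU/GPU nvJPEG path.
images = fn.decoders.image(encoded,
                           device="mixed",
                           output_type=types.RGB,
                           hw_decoder_load=0.75,              # placeholder split
                           hybrid_huffman_threshold=1000000)  # images with more pixels than this
                                                              # use GPU-assisted Huffman decoding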

mengwanguc (Author) commented
Thanks @JanuszL! This makes a lot of sense!

mengwanguc (Author) commented
Hi @JanuszL ,

Sorry, I have a follow-up question. I observed that when I use more CPU threads, DALI's GPU memory consumption also increases, even though I use the same batch size.

For example, with batch size 64:

If I use 4 CPU threads, DALI's GPU memory consumption keeps increasing and stops at 2.4 GB.
With 8 CPU threads, it increases to 3.3 GB.
And with 16 CPU threads, it reaches 4.9 GB.

Is it expected? If so, why does this happen since they are using the same batch size?

mengwanguc reopened this on Jan 10, 2024
JanuszL (Contributor) commented Jan 10, 2024

@mengwanguc,

Is it expected? If so, why does this happen since they are using the same batch size?

It comes from how DALI uses the nvJPEG library. We create one decoder instance per CPU thread to improve the performance of the serial CPU part. Each instance uses CPU and GPU memory for decoding, so even with the same batch size, memory consumption grows with the number of threads when you use the hybrid image decoder.
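
As a rough way to observe this effect, one could sweep num_threads and watch GPU memory. This is only a sketch: create_dali_pipeline is the function from the original post, the data path is a placeholder, and pynvml (the nvidia-ml-py package) is used for the memory query:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

for num_threads in (4, 8, 16):
    pipe = create_dali_pipeline(batch_size=64, num_threads=num_threads, device_id=0,
                                data_dir="/path/to/imagenet/train",  # placeholder path
                                crop=224, size=256, shard_id=0, num_shards=1)
    pipe.build()
    for _ in range(100):  # run a while so per-thread nvJPEG allocations settle
        pipe.run()
    used_gib = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 2**30
    print(f"{num_threads} threads -> ~{used_gib:.1f} GiB GPU memory in use")
    del pipe  # note: DALI keeps its memory pools alive per process, so each
              # thread count is best measured in a separate process for clean numbers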
