Why Decoder consumes 24 CPU cores even if using "mixed" device #5274

Open · 1 task done
mengwanguc opened this issue Jan 9, 2024 · 4 comments
Labels: question (Further information is requested)
Describe the question.

Hi,

I'm using DALI to preprocess the ImageNet data. I have all my data cached in memory, and I want to test the performance of GPU preprocessing.

I'm using batch size 256 and num_threads=64 (large enough that the CPU is not the bottleneck).

My pipeline only has a reader and a decoder, and I don't train any model.
I'm using device='cpu' for the reader and device='mixed' for the decoder, as advised by the documentation: https://docs.nvidia.com/deeplearning/dali/user-guide/docs/examples/image_processing/decoder_examples.html

However, I found that DALI is consuming 2400% CPU usage, which means 24 CPU cores.

I'm surprised, because I thought DALI would offload all the decoding to the GPU, so the CPU usage should be low.

I can see the work is indeed offloaded to the GPU, as the GPU utilization is ~54%.

But I don't understand what is consuming so many CPU resources.
I understand that some memory copies take CPU time, e.g. copying data from pageable to pinned memory, but I don't think that would consume 2400% CPU.

This is my pipeline:

import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali import pipeline_def

@pipeline_def
def create_dali_pipeline(data_dir, crop, size, shard_id, num_shards, dali_cpu=False, is_training=True):
    images, labels = fn.readers.file(file_root=data_dir,
                                     shard_id=shard_id,
                                     num_shards=num_shards,
                                     random_shuffle=is_training,
                                     pad_last_batch=True,
                                     name="Reader")
    dali_device = 'cpu' if dali_cpu else 'gpu'
    decoder_device = 'cpu' if dali_cpu else 'mixed'
    # ask nvJPEG to preallocate memory for the biggest sample in ImageNet for CPU and GPU to avoid reallocations in runtime
    device_memory_padding = 211025920 if decoder_device == 'mixed' else 0
    host_memory_padding = 140544512 if decoder_device == 'mixed' else 0
    # ask HW NVJPEG to allocate memory ahead for the biggest image in the data set to avoid reallocations in runtime
    preallocate_width_hint = 5980 if decoder_device == 'mixed' else 0
    preallocate_height_hint = 6430 if decoder_device == 'mixed' else 0
    if is_training:
        images = fn.decoders.image_random_crop(images,
                                               device=decoder_device, output_type=types.RGB,
                                               device_memory_padding=device_memory_padding,
                                               host_memory_padding=host_memory_padding,
                                               preallocate_width_hint=preallocate_width_hint,
                                               preallocate_height_hint=preallocate_height_hint,
                                               random_aspect_ratio=[0.8, 1.25],
                                               random_area=[0.1, 1.0],
                                               num_attempts=100)
    else:
        # validation path: plain decode, no random crop
        images = fn.decoders.image(images,
                                   device=decoder_device, output_type=types.RGB)
    return images, labels
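
For reference, a minimal sketch of how a pipeline like this can be built and run standalone; the data path is a placeholder, and the batch size and thread count are taken from the numbers quoted above:

# hypothetical instantiation; @pipeline_def adds the batch_size/num_threads/device_id kwargs
pipe = create_dali_pipeline(batch_size=256, num_threads=64, device_id=0,
                            data_dir="/path/to/imagenet/train",  # placeholder path
                            crop=224, size=256, shard_id=0, num_shards=1)
pipe.build()
images, labels = pipe.run()  # one batch; images live on the GPU with device='mixed'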

Thanks!

Check for duplicates

  • I have searched the open bugs/issues and have found no duplicates for this bug report
mengwanguc added the question label on Jan 9, 2024
JanuszL (Contributor) commented Jan 9, 2024

Hi @mengwanguc,

Thank you for reaching out. Please read this blog post to learn in detail how image decoding is accelerated.
Long story short: JPEG decoding splits into a highly serial part that runs on the CPU (entropy/Huffman decoding) and a part that can be accelerated on the GPU (on the SMs). Starting with the Ampere architecture, a dedicated HW block offloads the whole decoding process, including the serial part. In DALI you can use both approaches in parallel and use hw_decoder_load to split the work between them.
Based on your description, I guess you are using a GPU that doesn't have the HW decoder, so your CPU and GPU are both involved in the decoding.
There is also hybrid_huffman_threshold, which lets you offload even the serial part to the GPU, but for small images this is inefficient and slows things down.
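
For illustration, a minimal sketch of how these two knobs can be set on the decoder; the concrete values below are placeholders, not tuned recommendations:

import nvidia.dali.fn as fn
import nvidia.dali.types as types

# 'encoded' is the output of fn.readers.file.
# hw_decoder_load is the fraction of the batch routed to the dedicated HW JPEG
# decoder (if present); the rest goes through the hybrid CPU/GPU nvJPEG path.
images = fn.decoders.image(encoded,
                           device="mixed",
                           output_type=types.RGB,
                           hw_decoder_load=0.75,              # placeholder split
                           hybrid_huffman_threshold=1000000)  # images with more pixels than this
                                                              # use GPU-assisted Huffman decoding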

mengwanguc (Author) commented
Thanks @JanuszL! This makes a lot of sense!

mengwanguc (Author) commented
Hi @JanuszL ,

Sorry, I have a follow-up question. I observed that when I use more CPU threads, DALI's GPU memory consumption also increases, even though I use the same batch size.

For example, with batch size 64:

If I use 4 CPU threads, DALI's GPU memory consumption keeps increasing and stops at 2.4 GB.
With 8 CPU threads, it increases to 3.3 GB.
And with 16 CPU threads, it reaches 4.9 GB.

Is it expected? If so, why does this happen since they are using the same batch size?

mengwanguc reopened this on Jan 10, 2024
JanuszL (Contributor) commented Jan 10, 2024

@mengwanguc,

Is it expected? If so, why does this happen since they are using the same batch size?

It comes from how DALI uses the nvJPEG library. We create one decoder instance per CPU thread to improve the performance of the serial CPU part. Each instance uses CPU and GPU memory for decoding, so even with the same batch size, memory consumption grows with the number of threads when you use the hybrid image decoder.
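
As a rough way to observe this effect, one could sweep num_threads and watch GPU memory. This is only a sketch: create_dali_pipeline is the function from the original post, the data path is a placeholder, and pynvml (the nvidia-ml-py package) is used for the memory query:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

for num_threads in (4, 8, 16):
    pipe = create_dali_pipeline(batch_size=64, num_threads=num_threads, device_id=0,
                                data_dir="/path/to/imagenet/train",  # placeholder path
                                crop=224, size=256, shard_id=0, num_shards=1)
    pipe.build()
    for _ in range(100):  # run a while so per-thread nvJPEG allocations settle
        pipe.run()
    used_gib = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 2**30
    print(f"{num_threads} threads -> ~{used_gib:.1f} GiB GPU memory in use")
    del pipe  # note: DALI keeps its memory pools alive per process, so each
              # thread count is best measured in a separate process for clean numbers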
