Describe the question.

Hi,

I'm using DALI to preprocess the ImageNet data. I have all my data cached in memory, and I want to test the performance of GPU preprocessing.

I'm using batch size 256 with 64 threads (I make the thread count large enough that the CPU is not the bottleneck). My pipeline has only a reader and a decoder, and I don't train any model. Per the documentation (https://docs.nvidia.com/deeplearning/dali/user-guide/docs/examples/image_processing/decoder_examples.html), I'm using device='cpu' for the reader and device='mixed' for the decoder.
However, I found that DALI is consuming 2400% CPU usage, i.e. 24 CPU cores.
I'm surprised, because I thought DALI would offload all of the decoding to the GPU, so CPU usage should be low.
I can see the work is indeed offloaded to the GPU, since GPU utilization is ~54%. But I don't understand what is consuming so much CPU.
I understand that some memory copies take CPU time, e.g. copying data from pageable memory to pinned memory, but I don't think that would consume 2400% CPU.
This is my pipeline:
```python
# Imports assumed by this snippet:
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types

@pipeline_def
def create_dali_pipeline(data_dir, crop, size, shard_id, num_shards,
                         dali_cpu=False, is_training=True):
    images, labels = fn.readers.file(file_root=data_dir,
                                     shard_id=shard_id,
                                     num_shards=num_shards,
                                     random_shuffle=is_training,
                                     pad_last_batch=True,
                                     name="Reader")
    dali_device = 'cpu' if dali_cpu else 'gpu'
    decoder_device = 'cpu' if dali_cpu else 'mixed'
    # ask nvJPEG to preallocate memory for the biggest sample in ImageNet
    # for CPU and GPU to avoid reallocations at runtime
    device_memory_padding = 211025920 if decoder_device == 'mixed' else 0
    host_memory_padding = 140544512 if decoder_device == 'mixed' else 0
    # ask HW NVJPEG to allocate memory ahead of time for the biggest image
    # in the data set to avoid reallocations at runtime
    preallocate_width_hint = 5980 if decoder_device == 'mixed' else 0
    preallocate_height_hint = 6430 if decoder_device == 'mixed' else 0
    if is_training:
        images = fn.decoders.image_random_crop(images,
                                               device=decoder_device,
                                               output_type=types.RGB,
                                               device_memory_padding=device_memory_padding,
                                               host_memory_padding=host_memory_padding,
                                               preallocate_width_hint=preallocate_width_hint,
                                               preallocate_height_hint=preallocate_height_hint,
                                               random_aspect_ratio=[0.8, 1.25],
                                               random_area=[0.1, 1.0],
                                               num_attempts=100)
    return images, labels
```
Thanks!
Check for duplicates
I have searched the open bugs/issues and have found no duplicates for this bug report
Thank you for reaching out. Please read this blog post to learn in detail how image decoding is accelerated.
Long story short: JPEG decoding can be split into a highly serial part that is done on the CPU and a part that can be accelerated on the GPU (on the SMs). Starting with the Ampere architecture, a dedicated HW block has been added that offloads the whole decoding process, including the serial part. In DALI you can use both approaches in parallel and use hw_decoder_load to split the work between them.
Based on your case, I guess you are using a GPU that doesn't have the HW decoder, so your CPU and GPU are both involved in the decoding.
There is also hybrid_huffman_threshold, which lets you offload even the serial part to the GPU; however, for small images this is inefficient and slows things down.
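For illustration, both knobs are plain arguments to the mixed decoder. A minimal, non-runnable sketch (it assumes it sits inside a @pipeline_def function where `encoded` comes from a reader; the values are examples, not tuned recommendations):

```python
import nvidia.dali.fn as fn
import nvidia.dali.types as types

# Inside a @pipeline_def pipeline; `encoded` is the output of a reader.
images = fn.decoders.image(
    encoded,
    device="mixed",
    output_type=types.RGB,
    # fraction of the batch routed to the dedicated HW decoder (Ampere+);
    # the rest goes through the hybrid CPU+GPU (nvJPEG) path
    hw_decoder_load=0.65,
    # images with more pixels than this use GPU Huffman decoding,
    # i.e. even the serial part is offloaded for large images
    hybrid_huffman_threshold=1_000_000,
)
```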
Sorry, I have a follow-up question. I observed that when I use more CPU threads, DALI's GPU memory consumption also increases, even though I use the same batch size.
For example, with batch size 64:
- With 4 CPU threads, DALI's GPU memory consumption keeps increasing and stops at 2.4 GB.
- With 8 CPU threads, it increases to 3.3 GB.
- With 16 CPU threads, it increases to 4.9 GB.
Is it expected? If so, why does this happen since they are using the same batch size?
> Is it expected? If so, why does this happen since they are using the same batch size?
It comes from how DALI uses the nvJPEG library. We create one decoder instance per CPU thread to improve the performance of the serial CPU part. Each instance uses CPU and GPU memory for decoding, so even with the same batch size, memory consumption grows with the number of threads when you use the hybrid image decoder.
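The scaling above can be sketched with a toy linear model: a fixed base pool plus a roughly constant per-thread decoder-instance cost. The base and per-thread values below are rough fits to the reported numbers, not DALI internals:

```python
# Hypothetical model: GPU memory = base pool + per-thread decoder cost.
# base_gb and per_thread_gb are illustrative fits to the figures
# reported above (2.4 / 3.3 / 4.9 GB at 4 / 8 / 16 threads).
def estimated_gpu_memory_gb(num_threads, base_gb=1.6, per_thread_gb=0.21):
    """Estimate GPU memory as a fixed base plus a per-thread cost."""
    return base_gb + per_thread_gb * num_threads

for threads in (4, 8, 16):
    print(f"{threads:>2} threads -> ~{estimated_gpu_memory_gb(threads):.1f} GB")
```

The point of the model is only that memory grows linearly with the thread count at fixed batch size, which matches the per-thread decoder instances described above.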