
[Bug]: Jetson Nano Slow Performance With GPU/CUDA #2766

Open
marcjasner opened this issue Apr 12, 2023 · 29 comments

@marcjasner

What Operating System(s) are you seeing this problem on?

Other (please specify in the Steps to Reproduce)

dlib version

19.24

Python version

3.6

Compiler

gcc 7.5

Expected Behavior

I am attempting to create a facial detection/recognition component for a system I'm working on, and I am unable to get dlib/face_recognition to perform at better than 2 FPS under any circumstances.

The system is running on a Jetson Nano 4GB running Ubuntu 18.04 with JetPack 4.6 installed.

I built dlib from scratch (using this helper script: https://github.com/JpnTr/Jetson-Nano-Install-Dlib-Library) and verified that the suggested Jetson-specific patches were applied (as per https://medium.com/@ageitgey/build-a-hardware-based-face-recognition-system-for-150-with-the-nvidia-jetson-nano-and-python-a25cb8c891fd).

The test code I am running (against a single picture at 585x388 resolution with 5 people in it) looks like:

```python
#!/usr/bin/python3.6

import face_recognition
import time

def current_milli_time():
    return round(time.time() * 1000)

for i in range(0, 30):
    t1 = current_milli_time()
    image = face_recognition.load_image_file("humans_1.jpg")
    t2 = current_milli_time()
    face_locations = face_recognition.face_locations(image, model="cnn")
    t3 = current_milli_time()

    print(face_locations)
    print("load: ", t2 - t1)
    print("detect: ", t3 - t2)
    print("Total: ", t3 - t1)
```

With no model specified (so the CPU is being used, I believe), the normal face detection time is about 500ms, give or take. When I specify model="cnn", that number actually INCREASES to over 800ms.

tegrastats verifies that my GPU utilization is 99%.

I've seen this issue reported by other people but I have yet to see a solution. Shouldn't this be a reasonably fast operation (under 100ms) on a GPU? I've seen other (c/c++ based) face detection methods that suggest that detection can take as little as 20-50ms.

Current Behavior

Current behavior is that face detection takes 500ms on the CPU and even longer (800+ms) when using CUDA/GPU.

Steps to Reproduce

Nothing fancy, just run the code I provided.

Anything else?

No response

@marcjasner marcjasner added the bug label Apr 12, 2023
@davisking
Owner

davisking commented Apr 13, 2023

I don't know what's going on in face_recognition so I can't say (it's not a dlib package). Maybe there is a surprise in there.

@marcjasner
Author

Is there any logging I could turn on that would help me determine if the slowdown is from face_recognition or dlib?

What kind of timings SHOULD I be expecting? Is it safe to say 500-800ms is way too slow for GPU?

@davisking
Copy link
Owner

It depends on your GPU, how big the image is, and what settings or mode of operation that code uses. So I can't say. Could be anything.

@marcjasner
Author

Ok, I appreciate the help. I opened a ticket with the face_recognition project once before and never got a reply. I'll give it another try. Thanks.

@marcjasner
Author

After some debugging of the face_recognition code I've traced through it enough to see that the code is performing quickly up until the call to dlib.face_recognition_model_v1(face_recognition_model), where face_recognition_model points to the file dlib_face_recognition_resnet_model_v1.dat

That call takes over 800ms to return on a Jetson Nano. Should I be calling a different function/model?

@arrufat
Contributor

arrufat commented Apr 14, 2023

Are you sure that model is running on CUDA? I am not familiar with the Jetson Nano speeds, but those timings you get look a lot like CPU inference. It should be really fast, since it's a slimmed version of ResNet34 that operates on 150×150 images.

@marcjasner
Author

When I run tegrastats I see GPU usage at 99%, so as far as I can tell it's running on the GPU. I also made sure I compiled dlib with CUDA support.

Also, if I don't specify that model="cnn" then the whole operation on CPU takes 500ms.

@arrufat
Contributor

arrufat commented Apr 14, 2023

Hmm, can you run the inference on the network twice in a row and measure only the second time? Maybe your measurements include the allocation on the GPU.
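The warm-up-then-measure pattern suggested here can be sketched as a small timing harness. This is a generic sketch, not dlib API: `detect` below is a placeholder workload standing in for the real `detector(image)` call.

```python
import time

def time_warm(fn, *args, warmup=1, runs=5):
    """Call fn a few times untimed (warm-up), then return per-run times in ms.

    The warm-up runs absorb one-off costs such as CUDA memory allocation,
    so the returned timings reflect steady-state inference only.
    """
    for _ in range(warmup):
        fn(*args)
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(*args)
        times.append((time.perf_counter() - t0) * 1000.0)
    return times

# Placeholder workload; swap in the real detector call, e.g. detector(image).
def detect():
    sum(i * i for i in range(10000))

timings = time_warm(detect)
print(f"median: {sorted(timings)[len(timings) // 2]:.1f} ms")
```

Reporting the median of several warm runs also smooths out scheduler jitter, which matters on a small board like the Nano.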

@marcjasner
Author

I actually have my test program running the inference 30 times in a loop. The first call takes a LONG time (16-20s), and after that the time is consistently 840ms (give or take a millisecond or two).

@marcjasner
Author

Oh, you mean the GPU utilization? it stays at 99% while the 30 inference loop runs.

@arrufat
Contributor

arrufat commented Apr 14, 2023

Then I don't know what's going on.

For reference, I timed the inference of one image using the dlib C++ example dnn_face_recognition_ex on a 12th Gen Intel® Core™ i7-1260P × 16 CPU, and it takes about 50 ms.

@arrufat
Contributor

arrufat commented Apr 14, 2023

Another thing, are you certain it's that model that's causing the latency? Not the face detector? How big are your images?

@marcjasner
Author

The test image is 585x388. How do I determine if it's the face detector or not? The dlib call that I put timing measurements around was the call to dlib.face_recognition_model_v1(face_recognition_model). Does that do the detection and the recognition or did I misunderstand this and miss another dlib call somewhere? That call was the one that took 840ms.

@marcjasner
Author

marcjasner commented Apr 14, 2023

No, you're right, I'm looking at the wrong function... dlib.cnn_face_detection_model_v1(cnn_face_detection_model) is the function that is taking 840ms. The model it is using is mmod_human_face_detector.dat

@arrufat
Contributor

arrufat commented Apr 14, 2023

Ah, that makes more sense. That image doesn't seem that big, though.
That model is creating an image pyramid which will have roughly 4 times the number of pixels of the original image (see: https://blog.dlib.net/2017/08/vehicle-detection-with-dlib-195_27.html).

Try downscaling the image (at the risk of missing smaller faces: any face smaller than 80×80 pixels in the downscaled image won't be detected).
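That 80×80 floor translates directly into a minimum face size in the original image once you downscale. A back-of-the-envelope helper (my own arithmetic for illustration, not a dlib API):

```python
def min_original_face_px(scale, detector_min=80):
    """Smallest face (in pixels of the ORIGINAL image) still detectable
    after downscaling the image by `scale` (0 < scale <= 1).

    The detector needs faces of at least `detector_min` pixels in the
    image it actually sees, so a face must be detector_min / scale
    pixels wide in the original to survive the resize.
    """
    return detector_min / scale

# Shrinking to 1/3 size means only faces at least ~240 px wide survive:
print(min_original_face_px(1 / 3))  # → 240.0
```

This explains why shrinking a 585×388 group photo to a third of its size can make every face undetectable: in a photo that small, no face is anywhere near 240 pixels wide.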

@marcjasner
Author

That did make a difference, dropping 840ms down to 160ms when I shrunk the file down to 1/3 of its size, but it no longer detected any faces. Also, that still doesn't account for the face recognition that will have to be done. I'm confused as to why face detection/recognition is taking so much longer than something like human pose estimation, which I'm able to do on the Jetson in about 60ms.

I've also been able to do face detection/recognition (using different code, admittedly) on a Raspberry Pi with Intel's Neural Compute Stick 2 in around 50-75ms as well, and the Jetson is many times more powerful than the NCS2. The detection model I'm using there is https://docs.openvino.ai/latest/omz_models_model_face_detection_retail_0004.html.

@arrufat
Contributor

arrufat commented Apr 14, 2023

Yeah, I don't know what's going on. That model should be really fast, even on relatively large images, since it only has 7 convolutional layers... There must be something else wrong.

@arrufat
Contributor

arrufat commented Apr 14, 2023

Can you just run the face detector model using the official dlib examples?

Maybe that other library you're using does something we are not aware of. Let's try to isolate the problem.

@marcjasner
Author

Good idea. I will try both of those and let you know. I appreciate all of the help. Thanks very much!

@marcjasner
Author

marcjasner commented Apr 16, 2023

cnn_face_detector.py times ranged from 840ms to over 4s depending on the size of the image passed in. None of the measurements were under 840ms, and all were utilizing the GPU. I'll try the C++ example next, but this seems excessively long for GPU-based face detection.

Edit: C++ results were similar.

@arrufat
Contributor

arrufat commented Apr 16, 2023

Ok, so I ran the C++ example on an NVIDIA Quadro RTX 5000 and, with the default example, the second inference on each image (to avoid measuring the memory allocation) took about 250 ms, which might seem quite slow at first. This is how I ran it:

```shell
dnn_mmod_face_detection_ex mmod_human_face_detector.dat faces/*jpg
```

However, if we look at the code, we can see that we're upscaling the images until they have about 1800×1800 pixels, which means we are doing inference on images that are about 4000×3000 pixels (they are actually larger because the network will create a tiled pyramid, but let's ignore that).

If I change the code to infer on images that are about 900×900 pixels, the runtime goes down to 70 ms, and the images are about 2000×1500 pixels. For the images in the dlib examples, that is enough to detect all faces.

If I try with 450×450 pixels, then the inference time goes down to 20 ms, but I start to have false negatives (not able to detect the smallest faces). So, for the images in that dataset, the optimal size is somewhere between 450×450 and 900×900.

For reference, here are the modifications I made:

```diff
diff --git a/examples/dnn_mmod_face_detection_ex.cpp b/examples/dnn_mmod_face_detection_ex.cpp
index 3cdf4fcc..92988540 100644
--- a/examples/dnn_mmod_face_detection_ex.cpp
+++ b/examples/dnn_mmod_face_detection_ex.cpp
@@ -88,7 +88,7 @@ int main(int argc, char** argv) try
 
         // Upsampling the image will allow us to detect smaller faces but will cause the
         // program to use more RAM and run longer.
-        while(img.size() < 1800*1800)
+        while(img.size() < 900*900)
             pyramid_up(img);
 
         // Note that you can process a bunch of images in a std::vector at once and it runs
@@ -97,6 +97,10 @@ int main(int argc, char** argv) try
         // the same size.  To avoid this requirement on images being the same size we
         // process them individually in this example.
         auto dets = net(img);
+        const auto t0 = chrono::steady_clock::now();
+        dets = net(img);
+        const auto t1 = chrono::steady_clock::now();
+        cout << "size: " << img.nc() << "×" << img.nr() << ", elapsed: " << chrono::duration_cast<chrono::duration<float, milli>>(t1 - t0).count() << " ms\n";
         win.clear_overlay();
         win.set_image(img);
         for (auto&& d : dets)
```
This means that you need to find a trade-off between speed and accuracy for your use case.
I don't know what the library you're using is doing under the hood, but I would just call the detector myself, and this way you'll be sure of what's actually happening in your program.
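To see why moving the target from 1800×1800 down to 900×900 changes the runtime so much, the example's upsampling loop can be simulated in a few lines. This sketch assumes each `pyramid_up` call roughly doubles both dimensions (quadrupling the pixel count), which is close to dlib's default behaviour:

```python
def upsampled_size(w, h, target_pixels):
    """Simulate the example's loop: while the image has fewer than
    target_pixels pixels, upsample. Assumes each pyramid_up roughly
    doubles both dimensions (4x the pixel count per step).
    """
    steps = 0
    while w * h < target_pixels:
        w, h = 2 * w, 2 * h
        steps += 1
    return steps, (w, h)

# The 585×388 test image from this thread:
print(upsampled_size(585, 388, 1800 * 1800))  # 2 upsamplings -> 2340×1552
print(upsampled_size(585, 388, 900 * 900))    # 1 upsampling  -> 1170×776
```

With the 1800×1800 target the network sees roughly four times the pixels it would with the 900×900 target, before the detector's own pyramid multiplies the area again.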

@marcjasner
Author

I'll try calling dlib directly then and see. Just out of curiosity, what command do you use to rebuild the example code without rebuilding everything?

@arrufat
Contributor

arrufat commented Apr 16, 2023

Assuming you are at the top-level directory of the dlib repository:

```shell
cd examples
cmake -B build -G Ninja
cmake --build build -t dnn_mmod_face_detection_ex
```

If you don't specify -G Ninja, CMake will use your OS's default build system.

@facug91
Contributor

facug91 commented Jul 16, 2023

I tried the C++ example on a Jetson Nano and got similar results.
I first tried running it as is, with images from examples/faces, but the large number of pyramid_up calls made it impossible to run without exhausting memory.
I then modified it to work like the Python example, with just one pyramid_up, and was able to run many of the examples, but not the biggest one.
So I decided to resize each image to the same resolution you were testing (585×388) before the pyramid_up, and got 477 ms on average, similar to your results; it found almost all faces except two really small ones. Without CUDA, I got ~25 seconds, so we can be sure you were using the GPU.
Without any pyramid_up, it ran in ~118 ms, but detected almost no faces.
My conclusion is that it's not a bug; it's how it works on Jetson Nano devices.
Anyway, I think the model could be optimized to be smaller and work just as well, but that would require some training and testing.

@Compaile

Compaile commented Jul 18, 2023

There is also an issue (in CUDA, not dlib): if you just upscale each image rather than resizing to a fixed target size (letterboxing and the like), CUDA memory has to be reallocated for every new size, which makes each pass nearly as slow as the notoriously slow first pass on CUDA.

@justmobilize

justmobilize commented Oct 28, 2023

So I've noticed the slowdown with CUDA and the reallocation when image sizes differ. I'm processing a bunch of images, and it's easy for me to sort them in size order. My question is:

In Python, is there a way to release the memory held by CUDA without exiting the process?

@davisking
Owner

Letting all the objects go out of scope or deleting them will free the memory. But the CUDA runtime itself likes to hold onto memory and as far as I am aware there isn't any way to tell it to not do that.
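The scoping approach can be sketched as below. This is a generic pattern, not dlib API: `make_detector` and `run` are placeholders for the real calls (e.g. constructing `dlib.cnn_face_detection_model_v1` and invoking it), kept local so everything becomes unreferenced when the function returns.

```python
import gc

def detect_batch(paths, make_detector, run):
    """Keep the detector and its results local to this function so that,
    once it returns, nothing references them and Python can free them.
    The CUDA runtime may still cache the memory internally for reuse.
    """
    detector = make_detector()
    return [run(detector, p) for p in paths]

# Placeholder stand-ins so the sketch runs; replace with real dlib calls.
results = detect_batch(
    ["a.jpg", "b.jpg"],
    make_detector=lambda: object(),
    run=lambda det, path: (path, []),
)
gc.collect()  # prompt collection now that the detector is out of scope
print(results)
```

The `gc.collect()` call only hastens what scope exit already allows; as noted above, it cannot force the CUDA runtime itself to hand memory back to the OS.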

@justmobilize

@davisking thanks.

I did realize that letting it go out of scope clears most of it, and that seems to work most of the time. But if it hits an out-of-memory error and throws, it seems to hold the memory forever (I need to exit Python).

Is there a way to calculate what the largest size image it can take without maxing out?

I've noticed that if I shrink large images (4000×3000) down a bit (0.625×) and then upsample by 1 (when calling the detector), it is able to find more faces than keeping the image at its native size and not upsampling.

I don't particularly care about speed, just trying to optimize for the best results.

@davisking
Owner

davisking commented Oct 31, 2023 via email
