Loading the frontal face shape predictor calls 135053 times malloc. #2919

Closed
mcourteaux opened this issue Feb 28, 2024 · 10 comments

@mcourteaux
Contributor

mcourteaux commented Feb 28, 2024

Main idea

I am not familiar with the dlib codebase, but there seems to be mem_manager machinery in quite a few places. Since the whole dlib::deserialize<> traversal performs many small calls to new, it is an ideal fit for a bump allocator (a.k.a. "memory arena").

I get that it's not trivial to integrate that into the STL containers being used. Since C++17, the standard library offers "polymorphic memory resources" in the std::pmr:: namespace, which support bump allocators (for example std::pmr::monotonic_buffer_resource).
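
To illustrate the std::pmr idea, here is a minimal sketch using only standard containers, not dlib. The names counting_resource and upstream_calls_for_small_vectors are hypothetical, and the 1 MiB initial arena size is an arbitrary assumption. It builds many small vectors inside one std::pmr::monotonic_buffer_resource and counts how often the arena has to go back to the real heap.

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>
#include <vector>

// Hypothetical helper: forwards to the default new/delete resource and
// counts how often the arena asks it for a fresh block.
class counting_resource : public std::pmr::memory_resource {
public:
    std::size_t upstream_calls = 0;

private:
    void* do_allocate(std::size_t bytes, std::size_t align) override {
        ++upstream_calls;
        return std::pmr::new_delete_resource()->allocate(bytes, align);
    }
    void do_deallocate(void* p, std::size_t bytes, std::size_t align) override {
        std::pmr::new_delete_resource()->deallocate(p, bytes, align);
    }
    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
        return this == &other;
    }
};

// Builds `count` small vectors inside one arena and returns how many heap
// blocks the arena requested in total.
std::size_t upstream_calls_for_small_vectors(std::size_t count) {
    counting_resource upstream;
    {
        // Assumption: 1 MiB initial block; the arena grows if that is not enough.
        std::pmr::monotonic_buffer_resource arena(1 << 20, &upstream);
        std::pmr::vector<std::pmr::vector<float>> rows(&arena);
        rows.reserve(count);
        for (std::size_t i = 0; i < count; ++i) {
            // The outer vector passes its allocator on to each inner vector,
            // so the 16 floats also come out of the arena.
            rows.emplace_back(16, 1.0f);
        }
        assert(rows.size() == count);
    }  // rows, then the arena, are destroyed here; memory goes back in bulk
    return upstream.upstream_calls;
}
```

With a plain std::vector<std::vector<float>>, the same loop would typically hit the heap once per inner vector. Here the arena hands out memory by bumping a pointer, so only a handful of upstream requests are expected. How this would map onto dlib::matrix and its mem_manager template parameter is not shown and would need checking against the dlib sources.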

However, most allocations happen inside dlib::matrix (I estimate 70% of them).

So, I instrumented operator new() and operator delete() to keep track of these things. The result is that during loading of the frontal face shape predictor here is what happens:

  • 135053 allocations
  • 4 frees
  • total allocation 68MB.
  • average allocation size: 515 bytes per malloc.
  • 70% of those allocations all happen inside dlib::matrix.
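
The kind of instrumentation described above can be sketched roughly as follows. This is not the exact code used for the numbers in this issue; the counter names, alloc_stats, snapshot and measure are hypothetical. It replaces the global operator new and operator delete with versions that count calls and bytes. Over-aligned allocations go through separate overloads that this sketch does not replace.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical counters, updated by the replaced operators below.
static std::atomic<std::size_t> g_allocs{0};
static std::atomic<std::size_t> g_frees{0};
static std::atomic<std::size_t> g_bytes{0};

// Replacement for the global allocation function. The default operator new[]
// forwards to this one, so array allocations are counted as well.
void* operator new(std::size_t size) {
    g_allocs.fetch_add(1, std::memory_order_relaxed);
    g_bytes.fetch_add(size, std::memory_order_relaxed);
    if (void* p = std::malloc(size != 0 ? size : 1)) return p;
    throw std::bad_alloc{};
}

void operator delete(void* p) noexcept {
    if (p != nullptr) g_frees.fetch_add(1, std::memory_order_relaxed);
    std::free(p);
}

// Sized deallocation simply forwards to the unsized version.
void operator delete(void* p, std::size_t) noexcept { ::operator delete(p); }

struct alloc_stats {
    std::size_t allocs;
    std::size_t frees;
    std::size_t bytes;
};

alloc_stats snapshot() {
    return {g_allocs.load(), g_frees.load(), g_bytes.load()};
}

// Runs `work` and returns the allocation activity it caused.
template <class Work>
alloc_stats measure(Work&& work) {
    const alloc_stats before = snapshot();
    work();
    const alloc_stats after = snapshot();
    assert(after.allocs >= before.allocs);
    return {after.allocs - before.allocs,
            after.frees - before.frees,
            after.bytes - before.bytes};
}
```

Wrapping the deserialize call in measure(...) then gives the allocation count, free count, and total bytes for the load. Recording more detail than these three counters (for example, call sites) adds overhead of its own to every allocation.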

Overall, I'd argue that this is bad for performance.


I tested this by replacing the default operator new with a bump allocator (memory arena): the load time went from 1.75s to 1.18s, which is a 1.48x speedup (about 33% less load time).

@davisking
Owner

davisking commented Feb 29, 2024 via email

@mcourteaux
Contributor Author

Sorry, the times I reported actually covered loading both the "frontal_face_detector" and the "shape_predictor_68_face_landmarks" together. Let me break down more clearly what's happening:

  • First, for this answer, I bumped the compile flag from -O2 to -O3, as per your suggestion.
  • frontal face detector:
    • normal new: 13395 mallocs + 12299 frees, taking 266ms.
    • instrumented normal new: 13395 mallocs + 12299 frees, taking 340ms.
    • memory arena new: 0 mallocs + 0 frees, taking 288ms.
    • instrumented memory arena new: 0 mallocs + 0 frees, taking 281ms.
  • 68 face landmarks shape predictor:
    • normal new: 135053 mallocs + 4 frees, taking 876ms.
    • instrumented normal new: 135053 mallocs + 4 frees, taking 1.34s.
    • memory arena new: 0 mallocs + 0 frees, taking 783ms.
    • instrumented memory arena new: 0 mallocs + 0 frees, taking 836ms.

So, my timings were influenced too much by the overhead of recording every allocation and free in detail.
Looking at the non-instrumented timings, skipping all the malloc work speeds up loading the 68 face landmarks shape predictor by about 11%. Doing the same for the frontal face detector slows it down by about 8%, perhaps due to cache misses, as it allocates around 2MB and frees about 1MB again during loading.
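
For reference, the percentages above follow from the non-instrumented timings in the list (876ms vs. 783ms, and 266ms vs. 288ms). A small sketch of the arithmetic, with percent_change being a hypothetical helper name:

```cpp
#include <cassert>

// Relative change of `after_ms` with respect to `before_ms`, in percent.
// Negative values mean the load got faster, positive values mean slower.
double percent_change(double before_ms, double after_ms) {
    assert(before_ms > 0.0);
    return (after_ms - before_ms) / before_ms * 100.0;
}

// Shape predictor:  percent_change(876.0, 783.0) is roughly -10.6 (faster).
// Face detector:    percent_change(266.0, 288.0) is roughly  +8.3 (slower).
```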

I don't know how you manage to load the 68 point model so quickly. I'm using this snippet:

std::string path = "shape_predictor_68_face_landmarks.dat";
try {
  dlib::deserialize(path) >> m_internals->sp_face_landmarks;
  return true;
} catch (const dlib::serialization_error &e) {
  spdlog::error("Could not load {}: {}", path, e.what());
  return false;
}

@davisking
Owner

I was just doing deserialize(argv[1]) >> sp;

Anyway, you shouldn't need to worry about this startup time right? Just don't do it more than once?

@dlib-issue-bot
Collaborator

Warning: this issue has been inactive for 35 days and will be automatically closed on 2024-04-14 if there is no further activity.

If you are waiting for a response but haven't received one it's possible your question is somehow inappropriate. E.g. it is off topic, you didn't follow the issue submission instructions, or your question is easily answerable by reading the FAQ, dlib's official compilation instructions, dlib's API documentation, or a Google search.

@mcourteaux
Contributor Author

I do indeed load it only once, but a wait of about 1100ms is very expensive. My computer can read more than 1GB/s sequentially from SSD. The data being loaded is about 70MB, so reading it should take less than 70ms, not 1142ms. Of course, I'm aware of the base64-decode happening for the FFD. Overall, what I'm trying to say is that this way of making it user-friendly (read: programmer-friendly) is actually making it unsuitable for production code. It's a bad user experience if it takes 1.1s to load 70MB of coefficients.
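
To put a number on the raw I/O baseline, here is a minimal sketch that reads a whole file into memory with a single allocation and times it. The names read_whole_file and raw_read_result are hypothetical, and the result depends heavily on whether the file is already in the OS page cache, so it is only a rough lower bound for comparison with the deserialize time.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

struct raw_read_result {
    std::vector<char> bytes;  // file contents
    double milliseconds;      // wall-clock time for open + read
};

// Reads the file at `path` in one go and reports how long that took.
raw_read_result read_whole_file(const std::string& path) {
    const auto t0 = std::chrono::steady_clock::now();

    // Open at the end so tellg() reports the file size.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) throw std::runtime_error("cannot open " + path);
    const std::streamsize size = in.tellg();
    assert(size >= 0);
    in.seekg(0, std::ios::beg);

    // One allocation for the whole file, then one bulk read.
    std::vector<char> bytes(static_cast<std::size_t>(size));
    if (size > 0 && !in.read(bytes.data(), size)) {
        throw std::runtime_error("short read on " + path);
    }

    const auto t1 = std::chrono::steady_clock::now();
    return {std::move(bytes),
            std::chrono::duration<double, std::milli>(t1 - t0).count()};
}
```

Calling read_whole_file("shape_predictor_68_face_landmarks.dat") and comparing its milliseconds against the deserialize timing shows how much of the load time is parsing and allocation rather than disk reads.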

@arrufat
Contributor

arrufat commented Apr 7, 2024

I was never inconvenienced by the loading time of the shape predictor model. Out of curiosity, I just timed how long it takes on my machine, and it's about 350 ms.

Here's what I did, using the webcam face pose example program.

Add this at the top:

#include <chrono>
using fms = std::chrono::duration<float, std::milli>;

Time the loading:

const auto t0 = std::chrono::steady_clock::now();
deserialize("shape_predictor_68_face_landmarks.dat") >> pose_model;
const auto t1 = std::chrono::steady_clock::now();
cout << "shape predictor loaded in " << chrono::duration_cast<fms>(t1 - t0).count() << " ms\n";

@dlib-issue-bot
Collaborator

Warning: this issue has been inactive for 35 days and will be automatically closed on 2024-05-22 if there is no further activity.

If you are waiting for a response but haven't received one it's possible your question is somehow inappropriate. E.g. it is off topic, you didn't follow the issue submission instructions, or your question is easily answerable by reading the FAQ, dlib's official compilation instructions, dlib's API documentation, or a Google search.

@dlib-issue-bot
Collaborator

Warning: this issue has been inactive for 43 days and will be automatically closed on 2024-05-22 if there is no further activity.

If you are waiting for a response but haven't received one it's possible your question is somehow inappropriate. E.g. it is off topic, you didn't follow the issue submission instructions, or your question is easily answerable by reading the FAQ, dlib's official compilation instructions, dlib's API documentation, or a Google search.

@dlib-issue-bot
Collaborator

Notice: this issue has been closed because it has been inactive for 45 days. You may reopen this issue if it has been closed in error.

@mcourteaux
Contributor Author

Solved by ignoring it long enough. Nice.
