HPA infinite loop reproducer #2533

Draft · wants to merge 1 commit into base: dev
Conversation

pdokoupil

This PR is not meant to be merged

I would like to follow up on what we discussed with @interwq some time ago in #2484.

We spent some time trying to reproduce the issue without having to expose our internal code, and we finally found something.

To capture the issue, we temporarily enabled jemalloc's built-in logging in our system ("log:."), extended the log lines with additional data, and gathered all logs emitted before the process got stuck in the infinite loop. The logs come from our internal reproducer (a simple piece of code that reproduced the infinite loop reliably, but relied on our code).

After that, we wrote a very simplistic simulation that parses the log file and, for each of the N threads that produced the logs, builds an array (an std::vector) of operations, where each operation is one of {malloc, free, mallocx, sdallocx, calloc, realloc}. To deal with differing addresses and pointers, we added a simple mapping from "original" to "real" addresses, but this should not matter much for the infinite loop itself. Once these arrays are prepared, we start worker threads that simply iterate over them (ordered by the time the operations were logged) and execute the operations with the logged parameters. We do not do any timing; we just execute all the operations, and to make the reproducer a bit more likely to get stuck in the infinite loop, we added some random thread sleeps and iterate over each array multiple times.
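To make that replay idea concrete, here is a minimal sketch of the approach, assuming the logs have already been parsed into per-thread operation arrays. The `Op` struct, the slot-based pointer table, and `run_replay` are hypothetical names introduced for illustration; the real harness in basic.cpp additionally remaps addresses across threads, adds random sleeps, and repeats each array several times.

```cpp
#include <cstdlib>
#include <thread>
#include <vector>
#include <jemalloc/jemalloc.h>  // mallocx/sdallocx; assumes jemalloc headers are available

// Hypothetical encoding of one logged allocator call.
enum class OpKind { Malloc, Free, Mallocx, Sdallocx, Calloc, Realloc };
struct Op {
    OpKind kind;
    size_t size;   // requested size (ignored for free)
    size_t slot;   // index standing in for the logged ("original") address
    int flags;     // MALLOCX_* flags for mallocx/sdallocx
};

// Replays one thread's logged operations in the order they were logged,
// mapping "original" addresses to "real" ones through a per-thread slot table
// (a simplification: the real reproducer handles cross-thread pointers too).
static void replay_thread(const std::vector<Op>& ops) {
    std::vector<void*> ptrs(ops.size(), nullptr);
    for (const Op& op : ops) {
        switch (op.kind) {
        case OpKind::Malloc:   ptrs[op.slot] = malloc(op.size); break;
        case OpKind::Calloc:   ptrs[op.slot] = calloc(1, op.size); break;
        case OpKind::Realloc:  ptrs[op.slot] = realloc(ptrs[op.slot], op.size); break;
        case OpKind::Mallocx:  ptrs[op.slot] = mallocx(op.size, op.flags); break;
        case OpKind::Free:     free(ptrs[op.slot]); ptrs[op.slot] = nullptr; break;
        case OpKind::Sdallocx: sdallocx(ptrs[op.slot], op.size, op.flags);
                               ptrs[op.slot] = nullptr; break;
        }
    }
}

// One worker per logged thread; each replays its own operation array.
void run_replay(const std::vector<std::vector<Op>>& per_thread_ops) {
    std::vector<std::thread> workers;
    for (const auto& ops : per_thread_ops)
        workers.emplace_back(replay_thread, std::cref(ops));
    for (auto& t : workers)
        t.join();
}
```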

Initially, this was not sufficient to reproduce the bug (perhaps because of the lack of proper timing; the simulation is quite fragile and really very simplistic/naive), but we worked around this by playing a bit with the HPA-related parameters.

To make sharing the code easier, I just copy-pasted the simulation into one of the .cpp integration tests, so running make tests_integration followed by ./test/integration/cpp/basic should do the trick. Note that the failure rate is below 100%, so you might have to run it a couple of times before it gets stuck. I am also attaching normalized_out.tar.gz, which contains the compressed logs ("normalized" does not mean anything special; I just replaced thread IDs and timestamps so they start at 0). The archive is meant to be extracted into the same folder as basic.cpp.

As I already mentioned, unfortunately I was not able to reproduce the issue with exactly the same parameters as we use in our system; I had to tweak hpa_hugification_threshold_ratio and hpa_dirty_mult a bit, to 0.9 and 0.11 respectively. I tried different parameter combinations and overall found that we usually end up in the infinite loop for pairs whose sum is just slightly greater than 1 (slightly, but strictly greater), e.g. [0.9, 0.11], [0.51, 0.5], or even [0.4, 0.7].
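For anyone trying to replicate this parameter setup, one way (not necessarily how basic.cpp does it) to pin these experimental HPA options is jemalloc's `malloc_conf` application symbol, which jemalloc reads at startup; the same string can also be supplied via the MALLOC_CONF environment variable. The option names below are the ones discussed in this PR; whether they are all recognized depends on the dev revision and on jemalloc being built with HPA support.

```cpp
// Compiled into the reproducer binary; jemalloc reads this string at startup.
// Enables HPA and sets the combination that hit the infinite loop for us:
// hugification threshold ratio 0.9, dirty mult 0.11.
extern "C" const char* malloc_conf =
    "hpa:true,hpa_hugification_threshold_ratio:0.9,hpa_dirty_mult:0.11";
```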

Since the issue is very sensitive to the values of these parameters, we would like to understand them in more detail. As HPA is still an experimental feature, I understand that there is no documentation for it, but would you be able to give us a more detailed description of these parameters? I am currently experimenting with different values; so far, 0.9 and 0.25 seem to be quite stable, but any additional insight would be highly appreciated.

Thanks

CC @ericm1024

interwq (Member) commented Sep 21, 2023

Thanks for sharing the details @pdokoupil . I'll try getting to this in the next few weeks.

ilvokhin added a commit to ilvokhin/jemalloc that referenced this pull request Apr 4, 2024
One of the conditions to start purging is that the `hpa_hugify_blocked_by_ndirty`
function call returns true. This can happen in cases where we have no
dirty memory for this shard at all. In this case the purging loop becomes
an infinite loop.

`hpa_hugify_blocked_by_ndirty` was introduced in 0f6c420, but at that
time the purging loop had a different form and an additional `break` was not
required. The purging loop was rewritten in 6630c59, but the additional
exit condition was not added there at the time.

Repro code was shared by Patrik Dokoupil at [1]; I stripped it down to the
minimum needed to reproduce the issue in jemalloc unit tests.

[1]: jemalloc#2533
ilvokhin added a commit to ilvokhin/jemalloc that referenced this pull request Apr 5, 2024
interwq pushed a commit that referenced this pull request Apr 30, 2024
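To make the failure mode in the commit message above easier to picture, here is an illustrative, self-contained model of the loop shape it describes. The names and trivial function bodies are hypothetical stand-ins, not jemalloc's actual hpa.c code.

```cpp
#include <cstddef>
#include <cstdio>

struct Shard { size_t ndirty; };  // dirty (purgeable) pages owned by the shard

// Stand-in for hpa_hugify_blocked_by_ndirty(): models the problematic case
// where hugification keeps reporting "blocked" even when ndirty is 0.
static bool hugify_blocked_by_ndirty(const Shard&) { return true; }

// Stand-in for one round of purging; returns how many pages were purged.
static size_t purge_some(Shard& s) {
    size_t purged = s.ndirty;
    s.ndirty = 0;
    return purged;
}

// Shape of the purging loop: "keep purging while hugification is blocked".
static void do_deferred_purging(Shard& s) {
    while (hugify_blocked_by_ndirty(s)) {
        size_t purged = purge_some(s);
        if (purged == 0) {
            // The extra exit condition added by the fix: without it, a shard
            // with no dirty memory that still reports "blocked" spins forever.
            break;
        }
    }
}

int main() {
    Shard empty{0};              // no dirty memory at all
    do_deferred_purging(empty);  // terminates only because of the break above
    std::puts("done");
    return 0;
}
```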