This PR is not meant to be merged
I would like to continue with what we discussed with @interwq some time ago in #2484.
We spent some time trying to reproduce the issue without having to expose our internal code and finally found something.
To reproduce the issue, we temporarily enabled jemalloc's built-in logging in our system ("log:."), extended the log lines with additional data, and gathered all the logs produced before the process got stuck in the infinite loop. The logs come from our internal reproducer (a simple piece of code that reproduced the infinite loop reliably, but relied on our code).
After that, we wrote a very simplistic simulation that parses the log file and, for each of the N threads that produced the logs, creates an array (a `std::vector`, actually) of "operations", where each operation is one of {`malloc`, `free`, `mallocx`, `sdallocx`, `calloc`, `realloc`}. To deal with differing addresses or pointers, we added a simple mapping from "original" to "real" addresses, but this should not matter much for the infinite loop itself. Once these arrays are prepared, we start worker threads that simply iterate over them (ordered by the time the operations were logged) and execute the operations with the logged parameters. We do not reproduce any timing; we just execute all the operations. To make the reproducer a bit more likely to get stuck in the infinite loop, we added some random thread sleeps and iterate over each array multiple times.

Initially, this was not sufficient to reproduce the bug (maybe because of the lack of proper timing, or something else; the simulation is quite fragile and really very simplistic/naive), but we worked around this by playing a bit with the HPA-related parameters.
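The replay idea can be sketched roughly as follows. This is not the code from `basic.cpp`: the `Op` record, the `AddressMap` class, and the `replay`/`run_replay` functions are hypothetical names made up for illustration, the log parsing is left out because the log format is not shown here, and only the standard `malloc`/`calloc`/`realloc`/`free` calls are replayed (the jemalloc-specific `mallocx`/`sdallocx` and the random sleeps are omitted).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical operation record; the real log format is not shown in this PR.
enum class OpKind { Malloc, Calloc, Realloc, Free };

struct Op {
    OpKind kind;
    std::uint64_t logged_ptr;  // logged address: result of an allocation, or argument of free
    std::uint64_t logged_old;  // realloc only: logged address of the input pointer
    std::size_t size;          // requested size (unused for free)
};

// Maps "original" (logged) addresses to "real" (live) pointers.
// Shared by all workers, since one thread may free what another allocated.
class AddressMap {
public:
    // Records a live pointer. Frees any pointer previously stored under the
    // same logged address so that repeated passes do not leak.
    void put(std::uint64_t logged, void* real) {
        void* displaced = nullptr;
        {
            std::lock_guard<std::mutex> guard(mu_);
            auto it = map_.find(logged);
            if (it != map_.end()) displaced = it->second;
            map_[logged] = real;
        }
        std::free(displaced);
    }

    // Removes and returns the live pointer, or nullptr if none is known
    // (e.g. the free ran before the matching allocation in another thread).
    void* take(std::uint64_t logged) {
        std::lock_guard<std::mutex> guard(mu_);
        auto it = map_.find(logged);
        if (it == map_.end()) return nullptr;
        void* real = it->second;
        map_.erase(it);
        return real;
    }

    // Frees everything still live and returns how many entries there were.
    std::size_t drain() {
        std::lock_guard<std::mutex> guard(mu_);
        std::size_t count = map_.size();
        for (auto& entry : map_) std::free(entry.second);
        map_.clear();
        return count;
    }

private:
    std::mutex mu_;
    std::unordered_map<std::uint64_t, void*> map_;
};

// Executes one thread's operations in logged order, `passes` times, with no timing.
void replay(const std::vector<Op>& ops, AddressMap& addrs, int passes) {
    for (int pass = 0; pass < passes; ++pass) {
        for (const Op& op : ops) {
            switch (op.kind) {
            case OpKind::Malloc:
                addrs.put(op.logged_ptr, std::malloc(op.size));
                break;
            case OpKind::Calloc:
                addrs.put(op.logged_ptr, std::calloc(1, op.size));
                break;
            case OpKind::Realloc:
                addrs.put(op.logged_ptr, std::realloc(addrs.take(op.logged_old), op.size));
                break;
            case OpKind::Free:
                std::free(addrs.take(op.logged_ptr));
                break;
            }
        }
    }
}

// Starts one worker per logged thread and returns the number of allocations
// that were never freed by the trace.
std::size_t run_replay(const std::vector<std::vector<Op>>& per_thread, int passes) {
    AddressMap addrs;
    std::vector<std::thread> workers;
    for (const auto& ops : per_thread) {
        workers.emplace_back([&ops, &addrs, passes] { replay(ops, addrs, passes); });
    }
    for (auto& worker : workers) worker.join();
    return addrs.drain();
}
```

Calling `run_replay(per_thread_ops, passes)` with one operation vector per logged thread runs the whole trace. Because nothing is timed, a free can run before the matching allocation made by another thread; this sketch then treats it as a no-op.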
To make sharing the code easier, I just copy-pasted the simulation into one of the .cpp integration tests, so running `make tests_integration` followed by `./test/integration/cpp/basic` should do the trick. Note that the failure rate is below 100%, so you might have to run it a couple of times before it gets stuck. I also attach `normalized_out.tar.gz`, which contains the compressed logs ("normalized" does not mean anything special; I just replaced thread IDs and timestamps so that they start at 0). This file is intended to be extracted into the same folder as basic.cpp.

As I already mentioned, unfortunately I was not able to reproduce the issue with exactly the same parameters as we use in our system, and had to tweak `hpa_hugification_threshold_ratio` and `hpa_dirty_mult` a bit, to 0.9 and 0.11 respectively. I tried different parameter combinations and, overall, found that we usually end up in the infinite loop for pairs whose sum is just slightly greater than 1 (slightly, but strictly greater), e.g. [0.9, 0.11], [0.51, 0.5], or even [0.4, 0.7].

Since the issue is very sensitive to the values of these parameters, we would like to understand them in more detail. Since HPA is still an experimental feature, I understand that there is no documentation for it, but would you be able to give us a more detailed description of these parameters? I am currently running some experiments with different values. So far, 0.9 and 0.25 seem to be quite stable, but any additional insight would be highly appreciated.
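For anyone trying other combinations, a small helper along these lines can build the option string. This is only a sketch: `make_hpa_conf` is a hypothetical name, and it assumes that HPA is switched on with `hpa:true` and that both parameters accept plain decimal values in the usual `key:value` comma-separated `malloc_conf` syntax.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: builds a malloc_conf-style option string for one
// (hpa_hugification_threshold_ratio, hpa_dirty_mult) pair.
// Assumption: "hpa:true" enables HPA and both values may be given as decimals.
std::string make_hpa_conf(double hugification_threshold_ratio, double dirty_mult) {
    std::ostringstream out;
    out << "hpa:true"
        << ",hpa_hugification_threshold_ratio:" << hugification_threshold_ratio
        << ",hpa_dirty_mult:" << dirty_mult;
    return out.str();
}
```

The resulting string could then be passed through the `MALLOC_CONF` environment variable when starting the test binary, assuming a build without a symbol prefix.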
Thanks
CC @ericm1024