You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I noticed in my application that -e lock causes noticeable overhead. I believe that's in part because the default threshold for recording the 'contention' is 0.
It also attributed a large amount of time in an area I believe is mostly uncontended, and comparing with e.g. -e wall, the picture is totally different.
I have a guess as to how the overhead leads to inaccurate results - the time taken to record the sample by the lock holder is not included in the acquire time reported by that thread, which is good, but thread 2 sees a longer acquire time because thread 1 is recording the sample while still holding the lock. So lock tracing actually increases contention, which is bad for performance but also messes with the result accuracy because it's not accounted for.
I suggest at least increasing the default threshold, in my test 10us is enough to remove the undue overhead while still catching the most interesting cases.
As far as I can tell, the overhead is from:
a) getting the thread id via gettid syscall
b) getting the stack trace via jvmti
I'm wondering if there's any way to move these outside the lock. I suppose you could do them pre-emptively in ContendedMonitorEnter before the lock is acquired, but that means you don't know if the threshold has been crossed yet. However, when the threshold is 0 that shouldn't be a problem so it might be beneficial. I found some discussion on the old MonitorContendedExit event which I guess would be even better it if was available, I don't know if there's any way for async-profiler to create an equivalent hook something like:
monitorExitHook() {
if (tag.acquireTime > threshold) recordSample() // expensive work done outside lock
}
As I mentioned, I saw this in a real application, but I also tried to create a simple repro:
public class Test {
static int x = 0;
public static void main(String[] args) throws Exception {
for (int t = 0; t < Integer.parseInt(args[0]); t++) {
new Thread(Test::loop).start();
}
Thread.sleep(30_000L);
}
static void loop() {
while (true) {
long start = System.currentTimeMillis();
for (int i = 0; i < 10_000_000; i++) {
if (i % 1_000 == 0) {
synchronized(Test.class) {
if (x++ % 123 == 456) System.out.println("!");
}
}
}
long dur = System.currentTimeMillis() - start;
System.out.println(dur + "ms");
}
}
}
Attaching async-profiler after startup shows the overhead. Using --lock 10000 does not show any overhead.
And you can compare a wall profile (doesn't show significant contention) to the results shown by a lock profile (suggests high contention) to show the inaccuracy I was talking about.
Thanks for reading! Let me know what you think, I'm happy to help with implementation (even just changing the default threshold) if you think any of the ideas are good.
The text was updated successfully, but these errors were encountered:
Interesting point. Lock profiling could work on an interval instead of a threshold. It doesn't really need the rand() style operation, though I think it works out the same mostly.
if (totalAcquireTime + latestAcquireTime) > interval
recordSample() // weight of the sample = interval, like cpu profiling or tlab allocs
totalAcquireTime = (totalAcquireTime + latestAcquireTime) % interval
It would still be separately valuable to reduce overhead and improve accuracy by reducing work done under lock, if at all possible.
I noticed in my application that
-e lock
causes noticeable overhead. I believe that's in part because the default threshold for recording the 'contention' is 0.It also attributed a large amount of time in an area I believe is mostly uncontended, and comparing with e.g.
-e wall
, the picture is totally different.I have a guess as to how the overhead leads to inaccurate results - the time taken to record the sample by the lock holder is not included in the acquire time reported by that thread, which is good, but thread 2 sees a longer acquire time because thread 1 is recording the sample while still holding the lock. So lock tracing actually increases contention, which is bad for performance but also messes with the result accuracy because it's not accounted for.
I suggest at least increasing the default threshold, in my test 10us is enough to remove the undue overhead while still catching the most interesting cases.
As far as I can tell, the overhead is from:
a) getting the thread id via gettid syscall
b) getting the stack trace via jvmti
I'm wondering if there's any way to move these outside the lock. I suppose you could do them pre-emptively in
ContendedMonitorEnter
before the lock is acquired, but that means you don't know if the threshold has been crossed yet. However, when the threshold is 0 that shouldn't be a problem so it might be beneficial. I found some discussion on the old MonitorContendedExit event which I guess would be even better it if was available, I don't know if there's any way for async-profiler to create an equivalent hook something like:As I mentioned, I saw this in a real application, but I also tried to create a simple repro:
Attaching async-profiler after startup shows the overhead. Using
--lock 10000
does not show any overhead.And you can compare a wall profile (doesn't show significant contention) to the results shown by a lock profile (suggests high contention) to show the inaccuracy I was talking about.
Thanks for reading! Let me know what you think, I'm happy to help with implementation (even just changing the default threshold) if you think any of the ideas are good.
The text was updated successfully, but these errors were encountered: