Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lock monitoring overhead/accuracy #805

Open
olivergillespie opened this issue Aug 22, 2023 · 2 comments
Open

Lock monitoring overhead/accuracy #805

olivergillespie opened this issue Aug 22, 2023 · 2 comments

Comments

@olivergillespie
Copy link

I noticed in my application that -e lock causes noticeable overhead. I believe that's in part because the default threshold for recording the 'contention' is 0.

It also attributed a large amount of time in an area I believe is mostly uncontended, and comparing with e.g. -e wall, the picture is totally different.

I have a guess as to how the overhead leads to inaccurate results - the time taken to record the sample by the lock holder is not included in the acquire time reported by that thread, which is good, but thread 2 sees a longer acquire time because thread 1 is recording the sample while still holding the lock. So lock tracing actually increases contention, which is bad for performance but also messes with the result accuracy because it's not accounted for.

I suggest at least increasing the default threshold, in my test 10us is enough to remove the undue overhead while still catching the most interesting cases.

As far as I can tell, the overhead is from:

a) getting the thread id via gettid syscall
b) getting the stack trace via jvmti

I'm wondering if there's any way to move these outside the lock. I suppose you could do them pre-emptively in ContendedMonitorEnter before the lock is acquired, but that means you don't know if the threshold has been crossed yet. However, when the threshold is 0 that shouldn't be a problem so it might be beneficial. I found some discussion on the old MonitorContendedExit event which I guess would be even better it if was available, I don't know if there's any way for async-profiler to create an equivalent hook something like:

monitorExitHook() {
  if (tag.acquireTime > threshold) recordSample() // expensive work done outside lock
}

As I mentioned, I saw this in a real application, but I also tried to create a simple repro:

public class Test {
    static int x = 0;
    public static void main(String[] args) throws Exception {
        for (int t = 0; t < Integer.parseInt(args[0]); t++) {
            new Thread(Test::loop).start();
        }
        Thread.sleep(30_000L);
    }

    static void loop() {
        while (true) {
            long start = System.currentTimeMillis();
            for (int i = 0; i < 10_000_000; i++) {
                if (i % 1_000 == 0) {
                    synchronized(Test.class) {
                        if (x++ % 123 == 456) System.out.println("!");
                    }
                }
            }
            long dur = System.currentTimeMillis() - start;
            System.out.println(dur + "ms");
        }
    }
}

Attaching async-profiler after startup shows the overhead. Using --lock 10000 does not show any overhead.

java17 Test.java 5
13ms
14ms
13ms
14ms
14ms
14ms
Profiling started
14ms
21ms
40ms
44ms
47ms
68ms
81ms
88ms
81ms
86ms

And you can compare a wall profile (doesn't show significant contention) to the results shown by a lock profile (suggests high contention) to show the inaccuracy I was talking about.

Thanks for reading! Let me know what you think, I'm happy to help with implementation (even just changing the default threshold) if you think any of the ideas are good.

@zdyj3170101136
Copy link

i think the default threshold is not easy to decide.
the best is a random lock tracer: #775

@olivergillespie
Copy link
Author

Interesting point. Lock profiling could work on an interval instead of a threshold. It doesn't really need the rand() style operation, though I think it works out the same mostly.

if (totalAcquireTime + latestAcquireTime) > interval
   recordSample() // weight of the sample = interval, like cpu profiling or tlab allocs
   totalAcquireTime = (totalAcquireTime + latestAcquireTime) % interval

It would still be separately valuable to reduce overhead and improve accuracy by reducing work done under lock, if at all possible.

apangin added a commit that referenced this issue Aug 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants