Allow work packets to report their sizes #895
base: master
Conversation
Could you run some performance tests? Perhaps try running DaCapo benchmarks with small heap sizes (or find benchmarks that execute a lot of small packets) and see whether there's any significant difference.
The PR generally looks good to me. Just discuss the PR with Zixian. I removed myself from the reviewers.
```rust
/// This method will be called before the execution of every work packet, so it must be fast.
/// If it is impractical to report the exact size, this method can return an approximate size,
/// or return `None`.
fn debug_get_size(&self) -> Option<usize> {
```
One thing to note: this size may not always correlate with the time length of a packet's execution. This is the initial size of a work packet. Packets like `ProcessEdgesWork` may generate new work, and do it in the same work packet. That is not reflected by this number.
My understanding is that `ProcessEdgesWork` might generate more packets, and store them in a worker's local queue, to be executed without being pushed to the global queue. But they are still distinct packets.
Perhaps @wenyuzhao can clarify.
I think one problem with the current tracepoints (which is not related to Kunshan's PR) is that packets pushed to the local queue, like object scanning in MMTk/OpenJDK, don't go through the normal worker loop, so the work tracepoints might not fire (if you zoom in on a trace obtained from OpenJDK, you can see gaps between `ProcessEdges` that are object scanning packets).
`ScanObjectsWork` generated from `ProcessEdgesWork` is done in the same `ProcessEdgesWork` by default. This is hidden from our profiling code (work packet stats, and bpf tracing points). But new `ProcessEdgesWork` packets from `ScanObjectsWork` are pushed to the bucket, and will be shown correctly.
I am planning to do an independent PR to probe the time of sub-tasks. Therefore, `ProcessEdgesWork` will have two sub-tasks: processing edges and scanning objects. The time of each of them will be presented separately.
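For illustration, here is a rough sketch of what such sub-task probes could look like. Everything in it is hypothetical except the probe names, which match the `process_edges_*` and `scan_objects_*` trace points that appear in the `readelf` output later in this thread:

```rust
use probe::probe; // USDT probe macro from the `probe` crate

// Hypothetical packet body: the edge/object types and the "work" done here
// are placeholders; only the probe placement illustrates the idea.
fn run_process_edges_packet(mut edges: Vec<usize>) {
    probe!(mmtk, process_edges_start, edges.len());
    let objects: Vec<usize> = edges.drain(..).collect(); // "process" each edge
    probe!(mmtk, process_edges_end, objects.len());

    probe!(mmtk, scan_objects_start, objects.len());
    for _obj in &objects {
        // "scan" each object
    }
    probe!(mmtk, scan_objects_end, objects.len());
}

fn main() {
    run_process_edges_packet(vec![1, 2, 3]);
}
```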
This is lusearch running on mole.moma with 2.5x min heap size using the Immix plan, 20 invocations. Build1 is mmtk-core master; build2 is this PR; build3 is a slight modification (this commit) that adds the probe inside `do_work`. No probes are attached in this test.

The plot shows that there is a small but visible slow-down in "time.stw" for build2, but the slowdown is smaller for build3. This indicates that the virtual call may be a problem. To be fair, I added a derived value column, "time.stw / GC", which is time.stw divided by the number of GCs, just in case some runs trigger more GCs than others. Build1 and build3 are similar, but build2 is noticeably greater in this column.
OK. This looks reasonable. Actually, we might as well just move the entire tracepoint inside `do_work`.
One problem of moving the trace points inside `do_work` is that every implementation of `do_work` gets its own copy of the probe, so the same trace point appears many times in the binary. (Note that I made some changes to existing trace points.) For example:

```console
$ readelf -n /home/wks/projects/mmtk-github/openjdk/build/linux-x86_64-normal-server-release/jdk/lib/server/libmmtk_openjdk.so | grep Name | sort | uniq -c
     36 Name: alloc_slow_once_end
     27 Name: alloc_slow_once_start
      1 Name: bucket_opened
      1 Name: collection_initialized
      1 Name: gccontroller_run
      1 Name: gc_end
      1 Name: gc_start
      1 Name: gcworker_run
      1 Name: harness_begin
      1 Name: harness_end
     30 Name: process_edges_end
     30 Name: process_edges_start
     18 Name: scan_objects_edge_enqueuing
     18 Name: scan_objects_end
     18 Name: scan_objects_object_enqueuing
     18 Name: scan_objects_start
    360 Name: work_end
      1 Name: work_poll
    360 Name: work_start
```

And I got an error when I tried to attach to those probes.
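(For reference, attaching to one of these probes with bpftrace looks roughly like the following; the library path is the one from the `readelf` output above, and the script body is only an illustration of reading the probe's three arguments.)

```console
$ sudo bpftrace -e 'usdt:/home/wks/projects/mmtk-github/openjdk/build/linux-x86_64-normal-server-release/jdk/lib/server/libmmtk_openjdk.so:mmtk:work_start {
    printf("%s size=%ld\n", str(arg0, arg1), arg2);
}'
```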
One workaround is wrapping the probes in `#[inline(never)]` functions:

```rust
#[inline(never)]
fn probe_work_start(typename: &str, size: usize) {
    probe!(mmtk, work_start, typename.as_ptr(), typename.len(), size);
}

#[inline(never)]
fn probe_work_end(typename: &str, size: usize) {
    probe!(mmtk, work_end, typename.as_ptr(), typename.len(), size);
}
```

And call those functions instead of inserting `probe!` directly:

```rust
let typename = self.get_type_name();
let size = self.debug_get_size().unwrap_or(0);
probe_work_start(typename, size); // Call function
// Do the actual work
self.do_work(worker, mmtk);
probe_work_end(typename, size); // Call function
```

The obvious benefit is that it reduces the instances of those trace points to 1:

```console
$ readelf -n /home/wks/projects/mmtk-github/openjdk/build/linux-x86_64-normal-server-release/jdk/lib/server/libmmtk_openjdk.so | grep Name | sort | uniq -c
     36 Name: alloc_slow_once_end
     27 Name: alloc_slow_once_start
      1 Name: bucket_opened
      1 Name: collection_initialized
      1 Name: gccontroller_run
      1 Name: gc_end
      1 Name: gc_start
      1 Name: gcworker_run
      1 Name: harness_begin
      1 Name: harness_end
     30 Name: process_edges_end
     30 Name: process_edges_start
     18 Name: scan_objects_edge_enqueuing
     18 Name: scan_objects_end
     18 Name: scan_objects_node_enqueuing
     18 Name: scan_objects_start
      1 Name: work_end
      1 Name: work_poll
      1 Name: work_start
```

And obviously it is no longer zero-cost when the trace points are not attached. It will have the cost of one direct call, which, I think, is reasonable for coarse-grained tasks like a work packet.
Right. Making it out of line sounds fine, as long as we measure the performance.
I repeated the experiment, and added "build4". Build4 moves the probes into the out-of-line wrapper functions described above. The strange thing is, the experiment result contradicts the previous one. Build1-3 didn't change (the binaries were recompiled but the source code remained the same), yet the STW time of build3, which used to be about the same as build1's, is now much higher. Build4 is lower than build3, so I don't know whether it really slowed things down. I'll try running with a smaller heap and more invocations.
Ran it again with 1.9x heap size and 100 invocations (build1 is master, as before). It looks like all of those approaches have observable cost. I think it's better to guard those probes with features.
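As a sketch of what such a feature guard could look like (the feature name `work_packet_size_probe` is an assumption, not something this PR defines):

```rust
use probe::probe; // USDT probe macro from the `probe` crate

// With the feature enabled, the out-of-line wrapper fires the real probe.
#[cfg(feature = "work_packet_size_probe")]
#[inline(never)]
fn probe_work_start(typename: &str, size: usize) {
    probe!(mmtk, work_start, typename.as_ptr(), typename.len(), size);
}

// With the feature disabled, the wrapper is empty, so the compiler can
// remove the call entirely and the default build stays zero-cost.
#[cfg(not(feature = "work_packet_size_probe"))]
#[inline(always)]
fn probe_work_start(_typename: &str, _size: usize) {}
```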
I am resetting this to a draft because experiments show that this has a noticeable performance impact when not enabled.
This PR adds a method in the `GCWork` trait to allow each work packet to report its size for debugging purposes.

The primary intention is to use it with the eBPF-based visualisation tool. The work packet size will be reported at USDT trace points. With proper scripting, this information can be captured and displayed in a graphical timeline such as the Perfetto UI. An example is given below:

*(screenshot of an example timeline in Perfetto UI)*
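To illustrate the shape of the API, here is a minimal, self-contained sketch; the trait and packet type below are toy stand-ins, not mmtk-core's real signatures:

```rust
trait GCWork {
    /// Called before every packet executes, so it must be fast.
    /// Returns `None` if the size is unknown or impractical to compute.
    fn debug_get_size(&self) -> Option<usize> {
        None
    }
    fn do_work(&mut self);
}

/// A packet backed by a buffer can report the buffer length in O(1).
struct ProcessModBufLike {
    modbuf: Vec<u64>, // stand-in for Vec<ObjectReference>
}

impl GCWork for ProcessModBufLike {
    fn debug_get_size(&self) -> Option<usize> {
        Some(self.modbuf.len())
    }

    fn do_work(&mut self) {
        for _obj in self.modbuf.drain(..) {
            // process each remembered object ...
        }
    }
}

fn main() {
    let mut packet = ProcessModBufLike { modbuf: vec![1, 2, 3] };
    assert_eq!(packet.debug_get_size(), Some(3)); // reported before execution
    packet.do_work();
}
```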
There is an alternative approach for getting the packet size via USDT: executing a dedicated USDT trace point during the execution of a work packet to report the size, and using scripts to match this trace point with the surrounding work packet (see the sketch after the lists below).

Compared with that approach, the advantages of this PR are:

- `ProcessEdgesWork` contains a `Vec<Edge>`; `ScanObjects` and `ProcessModBuf` contain a `Vec<ObjectReference>`. It is straightforward to get the size of those packets.

The disadvantages of this PR are:

- `WeakRefProcessing` and `ScanStackRoot` can only report their sizes in the middle or at the end of their executions. Such work packets can only use the alternative method.
- `ProcessEdgesWork` processes edges and then scans objects. To accurately visualise the execution time of such work packets, we should add finer-grained trace points to show the execution time and work item sizes of each sub-task.
- `ScanObjects` has a field `root: bool` to show if this work packet contains roots. If we use the size to represent the number of objects, we have to use an alternative method to show the `root` field.
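As a sketch of the alternative method mentioned above: a packet that only learns its size during execution could fire a dedicated probe mid-run, which a script would then match with the surrounding `work_start`/`work_end` pair. The probe name `work_packet_size` and the packet shape here are assumptions:

```rust
use probe::probe; // USDT probe macro from the `probe` crate

/// Hypothetical packet whose size is unknown until it runs.
struct WeakRefProcessingLike {
    discovered: Vec<u64>,
}

impl WeakRefProcessingLike {
    fn do_work(&mut self) {
        // ... weak references are discovered while the packet runs ...
        let size = self.discovered.len();
        // Report the size from inside the packet's execution window; a
        // script matches this probe with the enclosing work packet.
        probe!(mmtk, work_packet_size, size);
        // ... finish processing ...
    }
}

fn main() {
    let mut packet = WeakRefProcessingLike { discovered: vec![1, 2] };
    packet.do_work();
}
```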