
Allow work packets to report their sizes #895

Draft · wants to merge 3 commits into master
Conversation

@wks (Collaborator) commented Aug 9, 2023

This PR adds a method to the GCWork trait that allows each work packet to report its size for debugging purposes.

The primary intention is to use it with the eBPF-based visualisation tool. The work packet size will be reported at USDT trace points. With proper scripting, this information can be captured and displayed in a graphical timeline such as the Perfetto UI. An example is given below:

[Screenshot: work packet sizes shown on a timeline in the Perfetto UI]

There is an alternative approach for getting the packet size via USDT: execute a dedicated USDT trace point during the execution of a work packet to report the size, and use scripts to match this trace point with the surrounding work packet. Compared with this approach:

The advantages of this PR are:

  • Easy to use. The work packet implementer simply overrides a method, and the size will be displayed uniformly by existing tools.

The disadvantages of this PR are:

  • The size must be available in the content of the work packet.
    • ProcessEdgesWork contains a Vec<Edge>. ScanObjects and ProcessModBuf contain a Vec<ObjectReference>. It is straightforward to get the size of those packets (see the sketch after this list).
    • However, other work packets, such as WeakRefProcessing and ScanStackRoot, only know their sizes in the middle or at the end of their execution. Such work packets can only use the alternative method.
    • Some work packets contain multiple sub-tasks. For example, ProcessEdgesWork processes edges and then scans objects. To accurately visualise the execution time of such work packets, we should add finer-grained trace points to show the execution time and work item sizes of each sub-task.
  • It can only report a numerical size, but not other properties. Some work packets have other interesting properties that should be shown in the visualisation. For example, ScanObjects has a field root: bool to show whether this work packet contains roots. If we use the size to represent the number of objects, we have to use the alternative method to show the root field.
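
A minimal sketch of the first case, written against a simplified trait rather than the real GCWork signature (the struct and field names below are illustrative): a packet that holds its work items in a vector can report its size directly from the vector length.

/// Simplified stand-in for the GCWork trait; the real trait also takes a
/// GCWorker and an MMTK reference.
trait WorkPacket {
    fn do_work(&mut self);

    /// The method this PR adds: packets that know their size up front
    /// override it, others keep the default `None`.
    fn debug_get_size(&self) -> Option<usize> {
        None
    }
}

/// Illustrative packet holding a buffer of work items
/// (standing in for e.g. ProcessModBuf's Vec<ObjectReference>).
struct ModBufPacket {
    modbuf: Vec<usize>,
}

impl WorkPacket for ModBufPacket {
    fn do_work(&mut self) {
        // ... process every entry in self.modbuf ...
    }

    fn debug_get_size(&self) -> Option<usize> {
        Some(self.modbuf.len())
    }
}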

@wks wks requested review from caizixian and qinsoon August 9, 2023 07:33
@wks wks changed the title Feature/packet size Allow work packets to report their sizes Aug 9, 2023
@caizixian (Member) commented:

Could you run some performance tests? Note that the probe! arguments are always evaluated even if the probe is not attached. This PR introduces an extra virtual call to get the packet size in the packet execution loop, so let's confirm its overhead.

Perhaps try running the DaCapo benchmarks with small heap sizes (or find benchmarks that execute a lot of small packets) and see whether there's any significant difference.

@qinsoon qinsoon removed their request for review August 10, 2023 05:54
@qinsoon (Member) commented Aug 10, 2023

The PR generally looks good to me. Just discuss the PR with Zixian. I removed myself as a reviewer.

/// This method will be called before the execution of every work packet, so it must be fast.
/// If it is impractical to report the exact size, this method can return an approximate size, or
/// return `None`.
fn debug_get_size(&self) -> Option<usize> {
A Member commented:

One thing to note: this size may not always correlate with how long a packet takes to execute. It is the initial size of the work packet. Packets like ProcessEdgesWork may generate new work and execute it in the same work packet, and that is not reflected by this number.

A Member replied:

My understanding is that ProcessEdgesWork might generate more packets and store them in a worker's local queue, where they are executed without being pushed to the global queue. But they are still distinct packets.

Perhaps @wenyuzhao can clarify.

A Member replied:

I think one problem with the current trace points (which is not related to Kunshan's PR) is that packets pushed to the local queue, like object scanning in MMTk/OpenJDK, don't go through the normal worker loop, so the work trace points might not fire (if you zoom in on a trace obtained from OpenJDK, you can see gaps between ProcessEdges packets that are object-scanning packets).

A Member replied:

ScanObjectsWork generated from ProcessEdgesWork is done within the same ProcessEdgesWork packet by default. This is hidden from our profiling code (work packet stats, and the bpf trace points). But new ProcessEdgesWork packets generated from ScanObjectsWork are pushed to the bucket and will be shown correctly.

The PR author (Collaborator) replied:

I am planning to do an independent PR to probe the time of sub-tasks. ProcessEdgesWork, for example, has two sub-tasks: processing edges and scanning objects. The time of each will be presented separately.
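
A rough sketch of what such sub-task probes might look like, using the probe! macro from the probe crate that mmtk-core already uses; the probe names mirror the process_edges_*/scan_objects_* points visible in the readelf output further down, but the struct and the placement here are purely illustrative:

use probe::probe;

struct ProcessEdgesSketch {
    edges: Vec<usize>, // stands in for Vec<Edge>
}

impl ProcessEdgesSketch {
    fn run(&mut self) {
        // Sub-task 1: trace the edges held by this packet.
        probe!(mmtk, process_edges_start, self.edges.len());
        // ... trace each edge, remembering newly reached objects ...
        probe!(mmtk, process_edges_end);

        // Sub-task 2: scan the newly reached objects in the same packet.
        probe!(mmtk, scan_objects_start);
        // ... scan the objects enqueued by the previous sub-task ...
        probe!(mmtk, scan_objects_end);
    }
}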

@wks (Collaborator, Author) commented Aug 10, 2023

> Could you run some performance tests? Note that the probe! arguments are always evaluated even if the probe is not attached. This PR introduces an extra virtual call to get the packet size in the packet execution loop, so let's confirm its overhead.
>
> Perhaps try running the DaCapo benchmarks with small heap sizes (or find benchmarks that execute a lot of small packets) and see whether there's any significant difference.

This is lusearch running on mole.moma with 2.5x min heap size using Immix plan, 20 invocations.

Build1 is mmtk-core master; build2 is this PR; build3 is a slight modification (this commit) that adds the probe inside work.do_work_with_stat instead of GCWorker::run. Since work.do_work_with_stat is already one virtual call, calling self.debug_get_size() inside it will be a direct call.
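
A rough sketch of the difference (simplified, not the actual mmtk-core code): in build2 the packet is only available as a trait object in the worker loop, so debug_get_size() goes through the vtable, whereas in build3 the call happens inside the monomorphised do_work_with_stat, where it is statically dispatched.

trait GCWorkSketch {
    fn do_work(&mut self);

    fn debug_get_size(&self) -> Option<usize> {
        None
    }

    // build3: probing inside the default do_work_with_stat method. When it is
    // monomorphised for a concrete packet type, self.debug_get_size() is a
    // direct call.
    fn do_work_with_stat(&mut self) {
        let _size = self.debug_get_size().unwrap_or(0); // direct call; a probe! would use it here
        self.do_work();
    }
}

// build2: probing in the worker loop, where each packet is a Box<dyn ...>,
// so debug_get_size() is a virtual call for every packet.
fn worker_loop_sketch(packets: Vec<Box<dyn GCWorkSketch>>) {
    for mut work in packets {
        let _size = work.debug_get_size().unwrap_or(0); // virtual call
        work.do_work_with_stat();
    }
}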

No probes are attached in this test.

Plot:

[Plot: STW time comparison of build1, build2, and build3]

Plotty link: https://squirrel.anu.edu.au/plotty/wks/noproject/#0|mole-2023-08-10-Thu-065716&benchmark^build^invocation^iteration&GC^time^time.stw&time.stw%20/%20GC|10&iteration^1^4|20&1^invocation|30&1&benchmark&build;build1|40&Histogram%20(with%20CI)^build^benchmark&

The plot shows that there is a small but visible slowdown in "time.stw" for build2. The slowdown is smaller for build3. This indicates that the virtual call may be a problem.

To be fair, I added a derived value column "time.stw / GC", which is time.stw divided by the number of GCs, just in case some runs trigger more GCs than others. Build1 and build3 are similar, but build2 is noticeably higher in this column.

@caizixian (Member) commented:

OK. This looks reasonable. Actually, we might as well just move the entire tracepoint inside do_work_with_stat, which saves the virtual call and makes the code cleaner (conceptually, the stats code and the tracepoint both instrument the work).

@wks (Collaborator, Author) commented Aug 10, 2023

One problem with moving the probe!(mmtk, work) into the function is that a USDT trace point is generated every time the function is specialised. There is a limit on how many trace points we can attach by default.

(Note that I made some changes to existing trace points (for example, work -> work_start and work_end), so you may see some trace points you don't recognise.)

$ readelf -n /home/wks/projects/mmtk-github/openjdk/build/linux-x86_64-normal-server-release/jdk/lib/server/libmmtk_openjdk.so | grep Name | sort | uniq -c
     36     Name: alloc_slow_once_end
     27     Name: alloc_slow_once_start
      1     Name: bucket_opened
      1     Name: collection_initialized
      1     Name: gccontroller_run
      1     Name: gc_end
      1     Name: gc_start
      1     Name: gcworker_run
      1     Name: harness_begin
      1     Name: harness_end
     30     Name: process_edges_end
     30     Name: process_edges_start
     18     Name: scan_objects_edge_enqueuing
     18     Name: scan_objects_end
     18     Name: scan_objects_object_enqueuing
     18     Name: scan_objects_start
    360     Name: work_end
      1     Name: work_poll
    360     Name: work_start

And I got the following error when I tried to attach to work_start and work_end:

ERROR: Can't attach to 786 probes because it exceeds the current limit of 512 probes.
You can increase the limit through the BPFTRACE_MAX_PROBES environment variable, but BE CAREFUL since a high number of probes attached can cause your system to crash.

@wks (Collaborator, Author) commented Aug 10, 2023

One workaround is wrapping the probe! trace points inside global functions and using #[inline(never)] to prevent them from being inlined.

#[inline(never)]
fn probe_work_start(typename: &str, size: usize) {
    probe!(mmtk, work_start, typename.as_ptr(), typename.len(), size);
}

#[inline(never)]
fn probe_work_end(typename: &str, size: usize) {
    probe!(mmtk, work_end, typename.as_ptr(), typename.len(), size);
}

And then call those functions instead of inserting probe!() invocations directly:

        let typename = self.get_type_name();
        let size = self.debug_get_size().unwrap_or(0);
        probe_work_start(typename, size);  // Call function

        // Do the actual work
        self.do_work(worker, mmtk);

        probe_work_end(typename, size);  // Call function

The obvious benefit is that this reduces the number of instances of each trace point to 1.

$ readelf -n /home/wks/projects/mmtk-github/openjdk/build/linux-x86_64-normal-server-release/jdk/lib/server/libmmtk_openjdk.so | grep Name | sort | uniq -c
     36     Name: alloc_slow_once_end
     27     Name: alloc_slow_once_start
      1     Name: bucket_opened
      1     Name: collection_initialized
      1     Name: gccontroller_run
      1     Name: gc_end
      1     Name: gc_start
      1     Name: gcworker_run
      1     Name: harness_begin
      1     Name: harness_end
     30     Name: process_edges_end
     30     Name: process_edges_start
     18     Name: scan_objects_edge_enqueuing
     18     Name: scan_objects_end
     18     Name: scan_objects_node_enqueuing
     18     Name: scan_objects_start
      1     Name: work_end
      1     Name: work_poll
      1     Name: work_start

And obviously it is no longer zero-cost when the trace points are not attached. It will have the cost of one direct call, which, I think, is reasonable for coarse-grained tasks like a work packet. self.get_type_name() and self.debug_get_size().unwrap_or(0) already introduce costs when the trace point is not attached.

@caizixian (Member) commented:

> One problem with moving the probe!(mmtk, work) into the function is that a USDT trace point is generated every time the function is specialised. There is a limit on how many trace points we can attach by default.

Right. Making it out of line sounds fine, as long as we measure the performance.

@wks (Collaborator, Author) commented Aug 11, 2023

I repeated the experiment and added build4. Build4 moves work_start and work_end into do_work_with_stat, but wraps them in direct function calls.

The strange thing is that this result contradicts the previous one. Build1-3 didn't change (the binaries were recompiled, but the source code remained the same). The STW time of build3 used to be about the same as build1, but is now much higher. Build4 is lower than build3, so I don't know whether it really slowed things down.

[Plot: STW time comparison of build1-build4]

I'll try running with a smaller heap and more iterations.

@wks (Collaborator, Author) commented Aug 14, 2023

Ran it again with 1.9x heap size and 100 invocations.

Build1: master
Build2: virtual call to work.debug_get_size() in GCWorker::run
Build3: direct call to self.debug_get_size() in GCWork::do_work_with_stat, using an inlined probe!()
Build4: direct call to self.debug_get_size() in GCWork::do_work_with_stat, using two outlined probe!() calls in two directly-called functions.

[Plot: STW time comparison of build1-build4 at 1.9x heap size]

It looks like all of those approaches have an observable cost. I think it's better to guard those probes with features.
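
A minimal sketch of what feature-guarding could look like, assuming a hypothetical Cargo feature named work_packet_probes (the feature name is illustrative, not part of this PR):

#[cfg(feature = "work_packet_probes")]
use probe::probe;

/// Fire the work_start USDT trace point. Only compiled in when the
/// (hypothetical) work_packet_probes feature is enabled.
#[cfg(feature = "work_packet_probes")]
#[inline(never)]
fn probe_work_start(typename: &str, size: usize) {
    probe!(mmtk, work_start, typename.as_ptr(), typename.len(), size);
}

/// No-op replacement when the feature is disabled, so call sites compile
/// unchanged. Note that the call site still evaluates its arguments
/// (get_type_name(), debug_get_size()), so those would also need guarding
/// if they turn out to be the expensive part.
#[cfg(not(feature = "work_packet_probes"))]
#[inline(always)]
fn probe_work_start(_typename: &str, _size: usize) {}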

@wks (Collaborator, Author) commented Aug 22, 2023

I am resetting this to a draft because experiments show that it has a noticeable performance impact even when not enabled.

@wks wks marked this pull request as draft August 22, 2023 08:55