
Allow work packets to report their sizes #895

Draft · wants to merge 3 commits into master
Conversation

@wks (Collaborator) commented Aug 9, 2023

This PR adds a method to the GCWork trait that allows each work packet to report its size for debugging purposes.

The primary intention is to use it with the eBPF-based visualisation tool. The work packet size will be reported at USDT trace points. With proper scripting, this information can be captured and displayed in a graphical timeline such as the Perfetto UI. An example is given below:

[Screenshot: work packet sizes shown on a timeline in the Perfetto UI]

There is an alternative approach for getting the packet size via USDT: execute a dedicated USDT trace point during the execution of a work packet to report the size, and use scripts to match this trace point with the surrounding work packet. Compared with this approach:

The advantages of this PR are:

  • Easy to use. The work packet implementer simply overrides a method, and the size will be displayed uniformly by existing tools.

The disadvantages of this PR are:

  • The size must be available in the content of the work packet.
    • ProcessEdgesWork contains a Vec<Edge>. ScanObjects and ProcessModBuf contain a Vec<ObjectReference>. It is straightforward to get the size of those packets (see the sketch after this list).
    • However, other work packets, such as WeakRefProcessing and ScanStackRoot, only know their sizes in the middle or at the end of their execution. Such work packets can only use the alternative method.
    • Some work packets contain multiple sub-tasks. For example, ProcessEdgesWork processes edges and then scans objects. To accurately visualise the execution time of such work packets, we should add finer-grained trace points to show the execution time and work item sizes of each sub-task.
  • It can only report a numerical size, but not other properties. Some work packets have other interesting properties that should be shown in the visualisation. For example, ScanObjects has a field root: bool to show whether this work packet contains roots. If we use the size to represent the number of objects, we have to use the alternative method to show the root field.
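
A minimal sketch of the first case, written against a simplified trait rather than the real GCWork signature (the struct and field names below are illustrative): a packet that holds its work items in a vector can report its size directly from the vector length.

/// Simplified stand-in for the GCWork trait; the real trait also takes a
/// GCWorker and an MMTK reference.
trait WorkPacket {
    fn do_work(&mut self);

    /// The method this PR adds: packets that know their size up front
    /// override it, others keep the default `None`.
    fn debug_get_size(&self) -> Option<usize> {
        None
    }
}

/// Illustrative packet holding a buffer of work items
/// (standing in for e.g. ProcessModBuf's Vec<ObjectReference>).
struct ModBufPacket {
    modbuf: Vec<usize>,
}

impl WorkPacket for ModBufPacket {
    fn do_work(&mut self) {
        // ... process every entry in self.modbuf ...
    }

    fn debug_get_size(&self) -> Option<usize> {
        Some(self.modbuf.len())
    }
}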

@wks wks requested review from caizixian and qinsoon August 9, 2023 07:33
@wks wks changed the title Feature/packet size Allow work packets to report their sizes Aug 9, 2023
@caizixian (Member) commented:

Could you run some performance tests? Note that the probe! arguments are always evaluated even if the probe is not attached. This PR introduces an extra virtual call to get the packet size in the packet execution loop, so let's confirm its overhead.

Perhaps try running the DaCapo benchmarks with small heap sizes (or find benchmarks that execute a lot of small packets) and see whether there's any significant difference.

@qinsoon qinsoon removed their request for review August 10, 2023 05:54
@qinsoon (Member) commented Aug 10, 2023

The PR generally looks good to me. Just discuss the PR with Zixian. I removed myself as a reviewer.

/// This method will be called before the execution of every work packet, so it must be fast.
/// If it is impractical to report the exact size, this method can return an approximate size, or
/// return `None`.
fn debug_get_size(&self) -> Option<usize> {
A Member commented:

One thing to note: this size may not always correlate with how long a packet takes to execute. It is the initial size of the work packet. Packets like ProcessEdgesWork may generate new work and execute it in the same work packet, and that is not reflected by this number.

A Member replied:

My understanding is that ProcessEdgesWork might generate more packets and store them in a worker's local queue, where they are executed without being pushed to the global queue. But they are still distinct packets.

Perhaps @wenyuzhao can clarify.

A Member replied:

I think one problem with the current trace points (which is not related to Kunshan's PR) is that packets pushed to the local queue, like object scanning in MMTk/OpenJDK, don't go through the normal worker loop, so the work trace points might not fire (if you zoom in on a trace obtained from OpenJDK, you can see gaps between ProcessEdges packets that are object-scanning packets).

A Member replied:

ScanObjectsWork generated from ProcessEdgesWork is done within the same ProcessEdgesWork packet by default. This is hidden from our profiling code (work packet stats, and the bpf trace points). But new ProcessEdgesWork packets generated from ScanObjectsWork are pushed to the bucket and will be shown correctly.

The PR author (Collaborator) replied:

I am planning to do an independent PR to probe the time of sub-tasks. ProcessEdgesWork, for example, has two sub-tasks: processing edges and scanning objects. The time of each will be presented separately.
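
A rough sketch of what such sub-task probes might look like, using the probe! macro from the probe crate that mmtk-core already uses; the probe names mirror the process_edges_*/scan_objects_* points visible in the readelf output further down, but the struct and the placement here are purely illustrative:

use probe::probe;

struct ProcessEdgesSketch {
    edges: Vec<usize>, // stands in for Vec<Edge>
}

impl ProcessEdgesSketch {
    fn run(&mut self) {
        // Sub-task 1: trace the edges held by this packet.
        probe!(mmtk, process_edges_start, self.edges.len());
        // ... trace each edge, remembering newly reached objects ...
        probe!(mmtk, process_edges_end);

        // Sub-task 2: scan the newly reached objects in the same packet.
        probe!(mmtk, scan_objects_start);
        // ... scan the objects enqueued by the previous sub-task ...
        probe!(mmtk, scan_objects_end);
    }
}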

@wks (Collaborator, Author) commented Aug 10, 2023

> Could you run some performance tests? Note that the probe! arguments are always evaluated even if the probe is not attached. This PR introduces an extra virtual call to get the packet size in the packet execution loop, so let's confirm its overhead.
>
> Perhaps try running the DaCapo benchmarks with small heap sizes (or find benchmarks that execute a lot of small packets) and see whether there's any significant difference.

This is lusearch running on mole.moma with 2.5x min heap size using Immix plan, 20 invocations.

Build1 is mmtk-core master; build2 is this PR; build3 is a slight modification (this commit) that adds the probe inside work.do_work_with_stat instead of GCWorker::run. Since work.do_work_with_stat is already one virtual call, calling self.debug_get_size() inside it will be a direct call.
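
A rough sketch of the difference (simplified, not the actual mmtk-core code): in build2 the packet is only available as a trait object in the worker loop, so debug_get_size() goes through the vtable, whereas in build3 the call happens inside the monomorphised do_work_with_stat, where it is statically dispatched.

trait GCWorkSketch {
    fn do_work(&mut self);

    fn debug_get_size(&self) -> Option<usize> {
        None
    }

    // build3: probing inside the default do_work_with_stat method. When it is
    // monomorphised for a concrete packet type, self.debug_get_size() is a
    // direct call.
    fn do_work_with_stat(&mut self) {
        let _size = self.debug_get_size().unwrap_or(0); // direct call; a probe! would use it here
        self.do_work();
    }
}

// build2: probing in the worker loop, where each packet is a Box<dyn ...>,
// so debug_get_size() is a virtual call for every packet.
fn worker_loop_sketch(packets: Vec<Box<dyn GCWorkSketch>>) {
    for mut work in packets {
        let _size = work.debug_get_size().unwrap_or(0); // virtual call
        work.do_work_with_stat();
    }
}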

No probes are attached in this test.

Plot:

[Plot: STW time comparison of build1, build2, and build3]

Plotty link: https://squirrel.anu.edu.au/plotty/wks/noproject/#0|mole-2023-08-10-Thu-065716&benchmark^build^invocation^iteration&GC^time^time.stw&time.stw%20/%20GC|10&iteration^1^4|20&1^invocation|30&1&benchmark&build;build1|40&Histogram%20(with%20CI)^build^benchmark&

The plot shows that there is a small but visible slowdown in "time.stw" for build2. The slowdown is smaller for build3. This indicates that the virtual call may be a problem.

To be fair, I added a derived value column "time.stw / GC", which is time.stw divided by the number of GCs, just in case some runs trigger more GCs than others. Build1 and build3 are similar, but build2 is noticeably higher in this column.

@caizixian (Member) commented:

OK. This looks reasonable. Actually, we might as well just move the entire tracepoint inside do_work_with_stat, which saves the virtual call and makes the code cleaner (conceptually, the stats code and the tracepoint both instrument the work).

@wks (Collaborator, Author) commented Aug 10, 2023

One problem with moving the probe!(mmtk, work) into the function is that a USDT trace point is generated every time the function is specialised. There is a limit on how many trace points we can attach by default.

(Note that I made some changes to existing trace points (for example, work -> work_start and work_end), so you may see some trace points you don't recognise.)

$ readelf -n /home/wks/projects/mmtk-github/openjdk/build/linux-x86_64-normal-server-release/jdk/lib/server/libmmtk_openjdk.so | grep Name | sort | uniq -c
     36     Name: alloc_slow_once_end
     27     Name: alloc_slow_once_start
      1     Name: bucket_opened
      1     Name: collection_initialized
      1     Name: gccontroller_run
      1     Name: gc_end
      1     Name: gc_start
      1     Name: gcworker_run
      1     Name: harness_begin
      1     Name: harness_end
     30     Name: process_edges_end
     30     Name: process_edges_start
     18     Name: scan_objects_edge_enqueuing
     18     Name: scan_objects_end
     18     Name: scan_objects_object_enqueuing
     18     Name: scan_objects_start
    360     Name: work_end
      1     Name: work_poll
    360     Name: work_start

And I got the following error when I tried to attach to work_start and work_end:

ERROR: Can't attach to 786 probes because it exceeds the current limit of 512 probes.
You can increase the limit through the BPFTRACE_MAX_PROBES environment variable, but BE CAREFUL since a high number of probes attached can cause your system to crash.

@wks (Collaborator, Author) commented Aug 10, 2023

One workaround is wrapping the probe! trace points inside global functions and using #[inline(never)] to prevent them from being inlined.

#[inline(never)]
fn probe_work_start(typename: &str, size: usize) {
    probe!(mmtk, work_start, typename.as_ptr(), typename.len(), size);
}

#[inline(never)]
fn probe_work_end(typename: &str, size: usize) {
    probe!(mmtk, work_end, typename.as_ptr(), typename.len(), size);
}

And then call those functions instead of inserting probe!() invocations directly:

        let typename = self.get_type_name();
        let size = self.debug_get_size().unwrap_or(0);
        probe_work_start(typename, size);  // Call function

        // Do the actual work
        self.do_work(worker, mmtk);

        probe_work_end(typename, size);  // Call function

The obvious benefit is that this reduces the number of instances of each trace point to 1.

$ readelf -n /home/wks/projects/mmtk-github/openjdk/build/linux-x86_64-normal-server-release/jdk/lib/server/libmmtk_openjdk.so | grep Name | sort | uniq -c
     36     Name: alloc_slow_once_end
     27     Name: alloc_slow_once_start
      1     Name: bucket_opened
      1     Name: collection_initialized
      1     Name: gccontroller_run
      1     Name: gc_end
      1     Name: gc_start
      1     Name: gcworker_run
      1     Name: harness_begin
      1     Name: harness_end
     30     Name: process_edges_end
     30     Name: process_edges_start
     18     Name: scan_objects_edge_enqueuing
     18     Name: scan_objects_end
     18     Name: scan_objects_node_enqueuing
     18     Name: scan_objects_start
      1     Name: work_end
      1     Name: work_poll
      1     Name: work_start

And obviously it is no longer zero-cost when the trace points are not attached. It will have the cost of one direct call, which, I think, is reasonable for coarse-grained tasks like a work packet. self.get_type_name() and self.debug_get_size().unwrap_or(0) already introduce costs when the trace point is not attached.

@caizixian (Member) commented:

> One problem with moving the probe!(mmtk, work) into the function is that a USDT trace point is generated every time the function is specialised. There is a limit on how many trace points we can attach by default.

Right. Making it out of line sounds fine, as long as we measure the performance.

@wks (Collaborator, Author) commented Aug 11, 2023

I repeated the experiment and added build4. Build4 moves work_start and work_end into do_work_with_stat, but wraps them in direct function calls.

The strange thing is that this result contradicts the previous one. Build1-3 didn't change (the binaries were recompiled, but the source code remained the same). The STW time of build3 used to be about the same as build1, but is now much higher. Build4 is lower than build3, so I don't know whether it really slowed things down.

[Plot: STW time comparison of build1-build4]

I'll try running with a smaller heap and more iterations.

@wks (Collaborator, Author) commented Aug 14, 2023

Ran it again with 1.9x heap size and 100 invocations.

Build1: master
Build2: virtual call to work.debug_get_size() in GCWorker::run
Build3: direct call to self.debug_get_size() in GCWork::do_work_with_stat, using an inlined probe!()
Build4: direct call to self.debug_get_size() in GCWork::do_work_with_stat, using two outlined probe!() calls in two directly-called functions.

[Plot: STW time comparison of build1-build4 at 1.9x heap size]

It looks like all of those approaches have an observable cost. I think it's better to guard those probes with features.
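
A minimal sketch of what feature-guarding could look like, assuming a hypothetical Cargo feature named work_packet_probes (the feature name is illustrative, not part of this PR):

#[cfg(feature = "work_packet_probes")]
use probe::probe;

/// Fire the work_start USDT trace point. Only compiled in when the
/// (hypothetical) work_packet_probes feature is enabled.
#[cfg(feature = "work_packet_probes")]
#[inline(never)]
fn probe_work_start(typename: &str, size: usize) {
    probe!(mmtk, work_start, typename.as_ptr(), typename.len(), size);
}

/// No-op replacement when the feature is disabled, so call sites compile
/// unchanged. Note that the call site still evaluates its arguments
/// (get_type_name(), debug_get_size()), so those would also need guarding
/// if they turn out to be the expensive part.
#[cfg(not(feature = "work_packet_probes"))]
#[inline(always)]
fn probe_work_start(_typename: &str, _size: usize) {}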

@wks (Collaborator, Author) commented Aug 22, 2023

I am resetting this to a draft because experiments show that it has a noticeable performance impact even when not enabled.

@wks wks marked this pull request as draft August 22, 2023 08:55