
Stats: number of GC stuck at 2047 #1074

Open
wks opened this issue Jan 19, 2024 · 1 comment
Labels: P-normal Priority: Normal.

@wks (Collaborator) commented Jan 19, 2024

When using harness_begin and harness_end to enable statistics collection, the number of GCs gets stuck at 2047 if GC is triggered more than that many times, and the following warning is printed to STDERR:

Warning: number of GC phases exceeds MAX_PHASES
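(For context, a binding brackets the timed benchmark iteration with these two calls. A minimal sketch, assuming mmtk-core's memory_manager API; timed_iteration and run_benchmark are hypothetical names:)

    use mmtk::memory_manager;
    use mmtk::util::VMMutatorThread;
    use mmtk::vm::VMBinding;
    use mmtk::MMTK;

    /// Hypothetical driver showing where harness_begin/harness_end sit.
    fn timed_iteration<VM: VMBinding>(mmtk: &'static MMTK<VM>, tls: VMMutatorThread) {
        // Triggers a full-heap GC, then starts gathering statistics.
        memory_manager::harness_begin(mmtk, tls);
        run_benchmark(); // hypothetical workload
        // Stops gathering statistics and prints the totals.
        memory_manager::harness_end(mmtk);
    }

    fn run_benchmark() { /* workload elided */ }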

The warning message is printed in Stats::start_gc:

    pub fn start_gc(&self) {
        self.gc_count.fetch_add(1, Ordering::SeqCst);
        if !self.get_gathering_stats() {
            return;
        }
        if self.get_phase() < MAX_PHASES - 1 {
            let counters = self.counters.lock().unwrap();
            for counter in &(*counters) {
                counter.lock().unwrap().phase_change(self.get_phase());
            }
            self.shared.increment_phase();
        } else if !self.exceeded_phase_limit.load(Ordering::SeqCst) {
            eprintln!("Warning: number of GC phases exceeds MAX_PHASES");
            self.exceeded_phase_limit.store(true, Ordering::SeqCst);
        }
    }

The number of GCs printed by harness_end is actually self.get_phase() / 2, computed in the following function:

    pub fn print_stats<VM: VMBinding>(&self, mmtk: &'static MMTK<VM>) {
        println!(
            "============================ MMTk Statistics Totals ============================"
        );
        let scheduler_stat = mmtk.scheduler.statistics();
        self.print_column_names(&scheduler_stat);  // This prints "GC  time.other   time.stw  ....."
        print!("{}\t", self.get_phase() / 2); // This is the number under "GC"
        let counter = self.counters.lock().unwrap();
        // ...

Because the phase stops incrementing once it reaches MAX_PHASES - 1, and MAX_PHASES is defined as 1 << 12 (i.e. 4096), the phase is capped at 4095, and the number of GCs printed is 4095 / 2 = 2047 (integer division).

When running lusearch from DaCapo Chopin with the GenCopy or GenImmix plan and the heap size set to 2.5x the minimum heap size, the number of GCs exceeds 2047. This may happen with other benchmarks, too. For example, if it happens when running xalan, the message Warning: number of GC phases exceeds MAX_PHASES printed to STDERR will cause the benchmark validation to fail.

There is a comment near the MAX_PHASES constant:

// TODO: Increasing this number would cause JikesRVM die at boot time. I don't really know why.
// E.g. using 1 << 14 will cause JikesRVM segfault at boot time.
pub const MAX_PHASES: usize = 1 << 12;

In each EventCounter, there is an embedded array of MAX_PHASES elements:

pub struct EventCounter {
    name: String,
    pub implicitly_start: bool,
    merge_phases: bool,
    count: Box<[u64; MAX_PHASES]>, // This one
    current_count: u64,
    running: bool,
    stats: Arc<SharedStats>,
}

It looks like the intent is to record the value for each individual STW and "other" phase. This works fine as long as the benchmark never triggers more than 2047 GCs, but in practice it often does, especially with a small heap and a generational plan. Once the number of phases exceeds the limit, subsequent counter values (such as time.stw and time.other) are silently dropped. This explains some of the anomalies in #1062: in the un-normalized plots, the number of GCs for GenCopy and GenImmix on lusearch is stuck at 2047, which means the time.other and time.stw of all subsequent GCs were ignored. It also explains the otherwise hard-to-explain increases and decreases in mutator time across different machines.

What should we do?

We never look at the STW time and "other" time of individual phases. We use the statistics counters mainly to print the "GC", "time.other" and "time.stw" columns when harness_end is called, and that only requires the accumulated time over all STW phases and all "other" phases. It would be more practical to keep just two numbers per counter: the aggregated value of all odd phases and the aggregated value of all even phases, as sketched below.
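As a concrete sketch of that idea (hypothetical code, not the current mmtk-core API; it mirrors EventCounter's current_count/phase_change structure and assumes phases alternate between "other" and "stw", so only the parity of the phase number matters):

    /// Hypothetical replacement for the per-phase array: keep two running
    /// totals instead of a `Box<[u64; MAX_PHASES]>`.
    pub struct AggregatedCounter {
        name: String,
        /// Value accumulated during the current (not yet ended) phase.
        current_count: u64,
        /// Total over all even phases ("other").
        even_total: u64,
        /// Total over all odd phases ("stw").
        odd_total: u64,
    }

    impl AggregatedCounter {
        pub fn new(name: String) -> Self {
            Self {
                name,
                current_count: 0,
                even_total: 0,
                odd_total: 0,
            }
        }

        pub fn name(&self) -> &str {
            &self.name
        }

        /// Record an event in the current phase.
        pub fn increment(&mut self, value: u64) {
            self.current_count += value;
        }

        /// Called at each phase boundary (cf. EventCounter::phase_change):
        /// fold the value accumulated during `old_phase` into the matching
        /// total. The phase number can grow without bound, so there is no
        /// MAX_PHASES cap and no point at which values start being dropped.
        pub fn phase_change(&mut self, old_phase: usize) {
            if old_phase % 2 == 0 {
                self.even_total += self.current_count;
            } else {
                self.odd_total += self.current_count;
            }
            self.current_count = 0;
        }

        /// The two numbers harness_end actually needs ("time.other", "time.stw").
        pub fn totals(&self) -> (u64, u64) {
            (self.even_total, self.odd_total)
        }
    }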

If we want to inspect each phase (e.g. find the number of work packets executed), we have a better tool now. We can use the eBPF tracing tool for that.

@qinsoon (Member) commented Jan 22, 2024

Related issue: #493

@qinsoon added the P-normal (Priority: Normal) label on Feb 12, 2024