
Stats: number of GC stuck at 2047 #1074

Open
wks opened this issue Jan 19, 2024 · 1 comment
Labels: P-normal Priority: Normal.

@wks (Collaborator) commented Jan 19, 2024

When using harness_begin and harness_end to enable statistics collection, the number of GCs gets stuck at 2047 if GC is triggered more than that many times, and the following warning is printed to STDERR:

Warning: number of GC phases exceeds MAX_PHASES
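(For context, a binding brackets the timed benchmark iteration with these two calls. A minimal sketch, assuming mmtk-core's memory_manager API; timed_iteration and run_benchmark are hypothetical names:)

    use mmtk::memory_manager;
    use mmtk::util::VMMutatorThread;
    use mmtk::vm::VMBinding;
    use mmtk::MMTK;

    /// Hypothetical driver showing where harness_begin/harness_end sit.
    fn timed_iteration<VM: VMBinding>(mmtk: &'static MMTK<VM>, tls: VMMutatorThread) {
        // Triggers a full-heap GC, then starts gathering statistics.
        memory_manager::harness_begin(mmtk, tls);
        run_benchmark(); // hypothetical workload
        // Stops gathering statistics and prints the totals.
        memory_manager::harness_end(mmtk);
    }

    fn run_benchmark() { /* workload elided */ }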

The warning message is printed in Stats::start_gc:

    pub fn start_gc(&self) {
        self.gc_count.fetch_add(1, Ordering::SeqCst);
        if !self.get_gathering_stats() {
            return;
        }
        if self.get_phase() < MAX_PHASES - 1 {
            let counters = self.counters.lock().unwrap();
            for counter in &(*counters) {
                counter.lock().unwrap().phase_change(self.get_phase());
            }
            self.shared.increment_phase();
        } else if !self.exceeded_phase_limit.load(Ordering::SeqCst) {
            eprintln!("Warning: number of GC phases exceeds MAX_PHASES");
            self.exceeded_phase_limit.store(true, Ordering::SeqCst);
        }
    }

The number of GCs printed by harness_end is actually self.get_phase() / 2, computed in the following function:

    pub fn print_stats<VM: VMBinding>(&self, mmtk: &'static MMTK<VM>) {
        println!(
            "============================ MMTk Statistics Totals ============================"
        );
        let scheduler_stat = mmtk.scheduler.statistics();
        self.print_column_names(&scheduler_stat);  // This prints "GC  time.other   time.stw  ....."
        print!("{}\t", self.get_phase() / 2); // This is the number under "GC"
        let counter = self.counters.lock().unwrap();
        // ...

Because the phase stops incrementing once it reaches MAX_PHASES - 1, and MAX_PHASES is defined as 1 << 12 (i.e. 4096), the phase is capped at 4095, and the number of GCs printed is 4095 / 2 = 2047 (integer division).

When running lusearch from DaCapo Chopin with the GenCopy or GenImmix plan and the heap size set to 2.5x the minimum heap size, the number of GCs exceeds 2047. This may happen with other benchmarks, too. For example, if it happens when running xalan, the message Warning: number of GC phases exceeds MAX_PHASES printed to STDERR will cause the benchmark validation to fail.

There is a comment near the MAX_PHASES constant:

// TODO: Increasing this number would cause JikesRVM die at boot time. I don't really know why.
// E.g. using 1 << 14 will cause JikesRVM segfault at boot time.
pub const MAX_PHASES: usize = 1 << 12;

In each EventCounter, there is an embedded array of MAX_PHASES elements:

pub struct EventCounter {
    name: String,
    pub implicitly_start: bool,
    merge_phases: bool,
    count: Box<[u64; MAX_PHASES]>, // This one
    current_count: u64,
    running: bool,
    stats: Arc<SharedStats>,
}

It looks like the intent is to record the value for each individual STW and "other" phase. This works fine as long as the benchmark never triggers more than 2047 GCs, but in practice it often does, especially with a small heap and a generational plan. Once the number of phases exceeds the limit, subsequent counter values (such as time.stw and time.other) are silently dropped. This explains some of the anomalies in #1062: in the un-normalized plots, the number of GCs for GenCopy and GenImmix on lusearch is stuck at 2047, which means the time.other and time.stw of all subsequent GCs were ignored. It also explains the otherwise hard-to-explain increases and decreases in mutator time across different machines.

What should we do?

We never look at the STW time and "other" time of individual phases. We use the statistics counters mainly to print the "GC", "time.other" and "time.stw" columns when harness_end is called, and that only requires the accumulated time over all STW phases and all "other" phases. It would be more practical to keep just two numbers per counter: the aggregated value of all odd phases and the aggregated value of all even phases, as sketched below.
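As a concrete sketch of that idea (hypothetical code, not the current mmtk-core API; it mirrors EventCounter's current_count/phase_change structure and assumes phases alternate between "other" and "stw", so only the parity of the phase number matters):

    /// Hypothetical replacement for the per-phase array: keep two running
    /// totals instead of a `Box<[u64; MAX_PHASES]>`.
    pub struct AggregatedCounter {
        name: String,
        /// Value accumulated during the current (not yet ended) phase.
        current_count: u64,
        /// Total over all even phases ("other").
        even_total: u64,
        /// Total over all odd phases ("stw").
        odd_total: u64,
    }

    impl AggregatedCounter {
        pub fn new(name: String) -> Self {
            Self {
                name,
                current_count: 0,
                even_total: 0,
                odd_total: 0,
            }
        }

        pub fn name(&self) -> &str {
            &self.name
        }

        /// Record an event in the current phase.
        pub fn increment(&mut self, value: u64) {
            self.current_count += value;
        }

        /// Called at each phase boundary (cf. EventCounter::phase_change):
        /// fold the value accumulated during `old_phase` into the matching
        /// total. The phase number can grow without bound, so there is no
        /// MAX_PHASES cap and no point at which values start being dropped.
        pub fn phase_change(&mut self, old_phase: usize) {
            if old_phase % 2 == 0 {
                self.even_total += self.current_count;
            } else {
                self.odd_total += self.current_count;
            }
            self.current_count = 0;
        }

        /// The two numbers harness_end actually needs ("time.other", "time.stw").
        pub fn totals(&self) -> (u64, u64) {
            (self.even_total, self.odd_total)
        }
    }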

If we want to inspect each phase (e.g. find the number of work packets executed), we have a better tool now. We can use the eBPF tracing tool for that.

@qinsoon (Member) commented Jan 22, 2024

Related issue: #493

@qinsoon added the P-normal (Priority: Normal) label on Feb 12, 2024