When using `harness_begin` and `harness_end` to enable statistics collection, the number of GCs will be stuck at 2047 if GC is triggered more than that many times. It will also print an error message to STDERR:
Warning: number of GC phases exceeds MAX_PHASES
The message is printed in `Stats::start_gc`:
```rust
pub fn start_gc(&self) {
    self.gc_count.fetch_add(1, Ordering::SeqCst);
    if !self.get_gathering_stats() {
        return;
    }
    if self.get_phase() < MAX_PHASES - 1 {
        let counters = self.counters.lock().unwrap();
        for counter in &(*counters) {
            counter.lock().unwrap().phase_change(self.get_phase());
        }
        self.shared.increment_phase();
    } else if !self.exceeded_phase_limit.load(Ordering::SeqCst) {
        eprintln!("Warning: number of GC phases exceeds MAX_PHASES");
        self.exceeded_phase_limit.store(true, Ordering::SeqCst);
    }
}
```
And the number of GCs printed in `harness_end` is actually `self.get_phase() / 2`, which comes from the following function:
```rust
pub fn print_stats<VM: VMBinding>(&self, mmtk: &'static MMTK<VM>) {
    println!(
        "============================ MMTk Statistics Totals ============================"
    );
    let scheduler_stat = mmtk.scheduler.statistics();
    self.print_column_names(&scheduler_stat); // This prints "GC time.other time.stw ....."
    print!("{}\t", self.get_phase() / 2); // This is the number under "GC"
    let counter = self.counters.lock().unwrap();
    // ...
```
Because the phase no longer increments once it reaches `MAX_PHASES`, which is defined as `1 << 12`, i.e. 4096, the number of GCs printed is 2047.
When running lusearch from DaCapo Chopin, if the plan is GenCopy or GenImmix and the heap size is set to 2.5x the min heap size, the number of GCs will exceed 2047. This may happen with other benchmarks, too. For example, if this happens when running xalan, the error message `Warning: number of GC phases exceeds MAX_PHASES` printed to STDERR will cause the benchmark validation to fail.
There is a comment near the `MAX_PHASES` constant:

```rust
// TODO: Increasing this number would cause JikesRVM die at boot time. I don't really know why.
// E.g. using 1 << 14 will cause JikesRVM segfault at boot time.
pub const MAX_PHASES: usize = 1 << 12;
```
In each `EventCounter`, there is an embedded array of `MAX_PHASES` elements:

```rust
pub struct EventCounter {
    name: String,
    pub implicitly_start: bool,
    merge_phases: bool,
    count: Box<[u64; MAX_PHASES]>, // This one
    current_count: u64,
    running: bool,
    stats: Arc<SharedStats>,
}
```
It looks like it is intended to record the value for each individual STW and "other" phase. This works fine as long as the benchmark never triggers more than 2047 GCs, but in practice it often does, especially with a smaller heap and a generational plan. After the number of phases exceeds the threshold, subsequent counter values (such as `time.stw` and `time.other`) are ignored. This explains some of the anomalies in #1062. From the un-normalized plots, we can see that the number of GCs for GenCopy and GenImmix for lusearch is stuck at 2047. This means the `time.other` and `time.stw` for subsequent GCs are ignored, which explains the otherwise hard-to-explain increases and decreases in mutator time when run on different machines.
What should we do?
We never look at the STW time and "other" time of individual phases. We use the statistics counters mainly to print "GC", "time.other" and "time.stw" when `harness_end` is called, which only requires the accumulated time over all "stw" phases and over all "other" phases. It is more practical to keep only two numbers per counter: the aggregated value of all odd phases and the aggregated value of all even phases.
If we want to inspect each phase (e.g. find the number of work packets executed), we have a better tool now. We can use the eBPF tracing tool for that.