Using VTune with Sightglass

Sightglass is instrumented to record each benchmark phase (compilation, instantiation, execution) as a VTune task. The instrumentation uses the ittapi crate and is enabled through the measure configuration, e.g., --measure vtune (see vtune.rs).

Run a benchmark in VTune

To run a benchmark using the VTune CLI, run:

$ vtune -collect hotspots \
    target/release/sightglass-cli benchmark \
    --engine engines/wasmtime/libengine.so \
    --measure vtune \
    -- benchmarks/spidermonkey/benchmark.wasm

This will create a new directory (e.g., r000hs) containing the results. Note that the same results can be collected from within the VTune UI. For documentation on the available VTune configuration options, see the VTune User Guide.

Analyze the results

The vtune CLI application can display the results (e.g., vtune -report hotspots r000hs), but the UI has the visualization tools (timelines, tables, filters) one might expect for performance analysis. Import the results into the VTune UI by navigating to "Import Result" and entering the results path (e.g., .../r000hs); alternatively, simply run the same analysis as above in the UI directly.
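For a quick text report without opening the UI, the sketch below picks the most recent hotspots result directory and passes it to the report command. It assumes that VTune numbers results sequentially (r000hs, r001hs, ...) in the current directory, and it names the directory explicitly with -r. The latest_result helper is hypothetical, introduced here for illustration only.

```shell
#!/bin/sh
# Hypothetical helper: print the highest-numbered rNNNhs directory under $1
# (default: current directory), or an empty line if there is none.
latest_result() {
    latest=""
    for d in "${1:-.}"/r[0-9][0-9][0-9]hs; do
        # Globs expand in sorted order, so the last match is the newest.
        [ -d "$d" ] && latest=$d
    done
    printf '%s\n' "$latest"
}

result_dir=$(latest_result .)

# Only invoke vtune when a result exists and the CLI is on PATH.
if [ -n "$result_dir" ] && command -v vtune >/dev/null 2>&1; then
    vtune -report hotspots -r "$result_dir"
fi
```

If results are stored elsewhere, pass that directory to latest_result instead of the current directory.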

Summary

Notice that each phase is displayed on the timeline as a VTune task (see the thin blue bar above sightglass-cli):

Timeline

Work can also be organized by task, but note that (currently) the tasks only contain measurements for the Sightglass thread; work done on any additional threads spawned by the engine during compilation appears under "Outside any task":

Functions

To see flame graphs and explore call stacks, re-collect results with "Collect stacks" enabled:

Flame graph

Helpful hints

  • Try restricting which CPU cores are used with a tool like taskset: e.g., taskset --cpu-list 0
  • Use the Sightglass --pin flag to ensure the benchmark is not context-switched to another core
  • You may want to analyze a single iteration; use --processes 1 --iterations-per-process 1
  • When collecting call stacks, there is a trade-off between accuracy and slowing the workload down; on Linux, -knob enable-stack-collection=true -knob stack-size=2048 -knob stack-type=software_lbr seemed to work well, but this document has more details
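
The hints above can be combined into a single invocation. The following is a sketch under a few assumptions: the paths are those from the example earlier in this document, the --pin, --processes and --iterations-per-process flags are placed before the -- separator, and the knobs are accepted by the hotspots collection on your VTune version. Since taskset and --pin overlap in purpose, either can be dropped. The DRY_RUN variable and the collect function are conventions of this sketch only, not features of Sightglass or VTune; by default the command is printed rather than run.

```shell
#!/bin/sh
# Sketch: one pinned, single-iteration hotspots collection with call stacks.
collect() {
    # Build the command as positional parameters so it can be printed or run.
    set -- taskset --cpu-list 0 \
        vtune -collect hotspots \
        -knob enable-stack-collection=true \
        -knob stack-size=2048 \
        -knob stack-type=software_lbr \
        target/release/sightglass-cli benchmark \
        --engine engines/wasmtime/libengine.so \
        --measure vtune \
        --pin \
        --processes 1 --iterations-per-process 1 \
        -- benchmarks/spidermonkey/benchmark.wasm

    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: show the command without executing it.
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

collect
```

Set DRY_RUN=0 in the environment to actually run the collection once the printed command looks right.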