



2nd CERN Advanced  
Performance Tuning workshop

# Top Down Analysis: Never lost with Xeon® perf. counters

Ahmad Yasin

Intel Core™ Monitoring & Analysis

# Motivation




Screenshot: Intel VTune Amplifier XE 2013, General Exploration viewpoint. Grouping: Package / H/W Context / Function / Call Stack.

| Package / H/W Context / Function / Call Stack | CPU. THR. | INST_R... ANY | CPI Rate | Retired Pipeline Slots (Filled) | Cancelled Pipeline Slots (Filled) | Back-end Bound Pipeline Slots (Unfilled) | Front-end Bound Pipeline Slots (Unfilled) |
|-----------------------------------------------|-----------|---------------|----------|---------------------------------|-----------------------------------|------------------------------------------|-------------------------------------------|
| package_0                                     | 100.0%      | 737,17...     | 0.561    | 0.462                  | 0.055                    | 0.391                         | 0.108                          |
| cpu_0                                         | 88.1%       | 667,93...     | 0.546    | 0.477                  | 0.047                    | 0.394                         | 0.095                          |
| __PGOSF1__ZN3pov31All_CSG_Intersec            | 7.5%        | 61,492,...    | 0.503    | 0.537                  | 0.015                    | 0.433                         | 0.056                          |
| Intersect_Plane                               | 7.2%        | 58,772,...    | 0.505    | 0.497                  | 0.000                    | 0.489                         | 0.024                          |
| pov::Check_And_Enqueue                        | 6.8%        | 48,116,...    | 0.584    | 0.488                  | 0.119                    | 0.318                         | 0.113                          |
| Intersect_Sphere                              | 5.0%        | 57,404,...    | 0.361    | 0.577                  | 0.042                    | 0.355                         | 0.026                          |
| pov::DNoise                                   | 3.6%        | 26,136,...    | 0.576    | 0.459                  | 0.000                    | 0.482                         | 0.062                          |
| VDot                                          | 3.6%        | 21,586,...    | 0.682    | 0.361                  | 0.153                    | 0.469                         | 0.021                          |
| Inside_Object                                 | 3.4%        | 30,742,...    | 0.451    | 0.569                  | 0.000                    | 0.540                         | 0.033                          |

# Preface

- Performance Optimization Is Difficult
  - Complicated micro-architectures
  - Application/workload diversity
  - Unmanageable data
  - Tougher constraints
    - Time, Resources, Priorities
- Top Down Analysis Method
  - Identify the true bottleneck in a structured hierarchical process
  - Analysis is made easier for non-expert users
    - Simplified hierarchy avoids the steep μarch learning curve



# Agenda

- ✓ Motivation
- Top Level Heuristics
- Top Down hierarchy
  - Results
  - Memory breakdown
  - Frontend breakdown
- Example
  - Many use-cases
- Summary



# Performance Analysis

- Process
  - System Level
    - Memory setup
  - Application Level
    - Algorithm
  - Architectural & micro-architectural Levels
    - Vector code, Cache misses
- Assumptions/Caveats
  - CPU Bound (IA)
  - Predefined analysis goal
  - Goal: detect bottleneck
    - Not-a-goal: quantify speedup
  - Forward compatibility



# Intel Core™ μarch



Figure: block diagram of the Intel Core™ microarchitecture, annotated with the front end and the back end of the processor pipeline and with the location of the Top Level counters.

Where To Start In This Complex Microarchitecture?

# Top Level Breakdown - the idea



# The Top Down Hierarchy

CPU Bound ⇒ Analyze (a user-defined criterion for analyzing a hotspot)

- Frontend Bound
  - Frontend Latency: iTLB Miss, iCache Miss, Branch Resteers, DSB switches, LCP
  - Frontend Bandwidth: MITE, DSB
- Bad Speculation
  - Branch Mispredicts
  - Machine Clears
- Retiring
  - BASE
  - Micro Sequencer
- Backend Bound
  - Core Bound: Divider, Ports Utilization
  - Memory Bound: Stores Bound, L1 Bound, L2 Bound, L3 Bound, Ext. Memory Bound

Systematically Find True Bottleneck with Less Guess Work

# Top Level Breakdown

| Cycle           | 1 | 2 | 3 | 4 | 5 |
|-----------------|---|---|---|---|---|
| Back End Stall  | 0 | 0 | 1 | 0 | 0 |
| Alloc Slot 0    | - | v | - | v | v |
| Alloc Slot 1    | - | v | - | v | v |
| Alloc Slot 2    | - | - | - | v | v |
| Alloc Slot 3    | - | - | - | v | - |
| Frontend Bound  | 4 | 2 |   | 0 | 1 |
| Backend Bound   |   |   | 4 | 0 | 0 |
| Retiring        |   | 2 |   | 1 | 2 |
| Bad Speculation |   |   |   | 3 | 1 |



Classify Each Pipeline Slot Into 1 of 4 Categories
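The classification in the table above can be sketched as a small decision rule per allocation slot. This is a minimal illustration, assuming that a filled slot counts as Retiring or Bad Speculation depending on whether its uop eventually retires, and that an empty slot is blamed on the back end only when the back end is stalled. The names `classify_slot` and `account_cycle` are hypothetical and do not come from any tool.

```c
#include <assert.h>
#include <stdbool.h>

#define SLOTS_PER_CYCLE 4 /* allocation slots per cycle, as in the table above */

enum slot_category {
    FRONTEND_BOUND,
    BAD_SPECULATION,
    RETIRING,
    BACKEND_BOUND,
    NUM_CATEGORIES
};

/* Classify one allocation slot in one cycle. */
enum slot_category classify_slot(bool uop_allocated, bool uop_retires,
                                 bool backend_stall)
{
    if (uop_allocated)
        return uop_retires ? RETIRING : BAD_SPECULATION;
    /* Empty slot: blame the back end if it is stalled, else the front end. */
    return backend_stall ? BACKEND_BOUND : FRONTEND_BOUND;
}

/* Add the slots of one cycle to the running per-category totals.
 * allocated: slots that received a uop in this cycle
 * retiring:  how many of those uops eventually retire */
void account_cycle(int allocated, int retiring, bool backend_stall,
                   int totals[NUM_CATEGORIES])
{
    assert(0 <= retiring && retiring <= allocated &&
           allocated <= SLOTS_PER_CYCLE);
    for (int slot = 0; slot < SLOTS_PER_CYCLE; slot++) {
        bool is_allocated = slot < allocated;
        bool retires = slot < retiring;
        totals[classify_slot(is_allocated, retires, backend_stall)]++;
    }
}
```

Feeding the five cycles of the table through `account_cycle` gives 7 Frontend Bound, 4 Backend Bound, 5 Retiring and 4 Bad Speculation slots out of 20.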

# Top Level Equations

- **Front End Bound**
  - The front end is delivering < 4 uops per cycle while the back end of the pipeline is ready to accept uops
    - $\text{IDQ\_UOPS\_NOT\_DELIVERED.CORE} / (4 * \text{Clockticks})$
- **Bad Speculation**
  - Tracks uops that never retire, or allocation slots wasted due to recovery from branch misprediction or machine clears
    - $(\text{UOPS\_ISSUED.ANY} - \text{UOPS\_RETIRED.RETIRE\_SLOTS} + 4 * \text{INT\_MISC.RECOVERY\_CYCLES}) / (4 * \text{Clockticks})$
- **Retiring**
  - Successfully delivered uops that eventually retire
    - $\text{UOPS\_RETIRED.RETIRE\_SLOTS} / (4 * \text{Clockticks})$
- **Back End Bound**
  - No uops are delivered due to lack of required resources at the back end of the pipeline
    - $1 - (\text{FrontEnd Bound} + \text{Bad Speculation} + \text{Retiring})$

Just 5 Events Provide Invaluable Insight
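The four equations translate directly into code. The sketch below assumes the five raw event counts have already been collected for the same interval; the struct and function names are hypothetical, and only the event names in the comments come from the slide.

```c
#include <assert.h>
#include <stddef.h>

#define PIPELINE_WIDTH 4.0 /* the "4" in the equations above */

/* Raw counts of the five events used by the Top Level equations. */
struct top_level_events {
    double clockticks;                  /* Clockticks */
    double idq_uops_not_delivered_core; /* IDQ_UOPS_NOT_DELIVERED.CORE */
    double uops_issued_any;             /* UOPS_ISSUED.ANY */
    double uops_retired_retire_slots;   /* UOPS_RETIRED.RETIRE_SLOTS */
    double int_misc_recovery_cycles;    /* INT_MISC.RECOVERY_CYCLES */
};

/* Fraction of pipeline slots attributed to each top-level category. */
struct top_level_breakdown {
    double frontend_bound;
    double bad_speculation;
    double retiring;
    double backend_bound;
};

struct top_level_breakdown top_level(const struct top_level_events *ev)
{
    assert(ev != NULL && ev->clockticks > 0.0);
    double slots = PIPELINE_WIDTH * ev->clockticks;
    struct top_level_breakdown b;

    b.frontend_bound = ev->idq_uops_not_delivered_core / slots;
    b.bad_speculation = (ev->uops_issued_any - ev->uops_retired_retire_slots
                         + PIPELINE_WIDTH * ev->int_misc_recovery_cycles) / slots;
    b.retiring = ev->uops_retired_retire_slots / slots;
    /* Back End Bound is whatever the other three categories leave over. */
    b.backend_bound = 1.0 - (b.frontend_bound + b.bad_speculation + b.retiring);
    return b;
}
```

Because Back End Bound is computed as the remainder, the four fractions always sum to 1 by construction.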

# Top Level for SPEC CPU2006



Top Down Correctly Characterizes All Workloads

SPEC rate 1-copy, Intel Compiler 13, Ivy Bridge @ 3 GHz



# VTune “new General Exploration” interface

The screenshot shows the VTune interface with several windows open:

- **Front-end Bound**: A table showing metrics like Back-end Bound, Front-End Latency, and Front-End Bandwidth. A tooltip for "Front-End Latency" is displayed, stating: "This metric represents a fraction of slots during which CPU was stalled due to front-end latency issues, such as instruction-cache misses, ITLB misses or fetch stalls after a branch misprediction. In such cases, the front-end delivers no uOps." Formula: $(\text{IDQ\_UOPS\_NOT\_DELIVERED.CYCLES\_0\_UOPS\_DELIV.CORE}) / \text{CPU\_CLK\_UNHALTED.THREAD}$.
- **Function / Call Stack**: A table showing performance metrics for various functions. A red box highlights the "Bad Speculation" column, which includes sub-metrics: Branch Mispredict and Machine Clears. A red arrow points from the "Bad Speculation" column in the main table to this breakdown.
- **Unfilled Pipeline Slots (Stalls)**: A table showing metrics related to pipeline stalls.
- **Bad Speculation**: A detailed breakdown of the Bad Speculation metric, showing Branch Mispredict and Machine Clears.

A large red callout box in the bottom-left corner provides instructions:

Hover to see Metric description + formula of PMU events, or click arrow to expand a column to see a breakdown of issues pertaining to that category

Agenda: Motivation, Top Level Heuristics, Top Down hierarchy (Memory breakdown, Frontend breakdown), Examples, Summary

CPU Bound ⇒ Analyze (a user-defined criterion for analyzing a hotspot)

- Frontend Bound
  - Frontend Latency
  - Frontend Bandwidth
- Bad Speculation
  - Branch Mispredicts
  - Machine Clears
- Retiring
  - BASE
  - Micro Sequencer
- Backend Bound
  - Core Bound
    - Ports Utilization: 2+ ports, 1 port, 0 ports
  - Memory Bound
    - Stores Bound: False Sharing, Split Stores, dTLB Store
    - L1 Bound: dTLB overhead, Store fwd block, 4K aliasing
    - Load Bound
    - L2 Bound
    - L3 Bound: Contested Access, Data Sharing, L3 Latency
    - Ext. Memory Bound: Local MEM, Remote MEM, Remote Cache

# Backend Bound

- First distinction
  - Core- vs Memory-Bound
- Memory Bound
  - Loads limited by which level
    - MEM Latency vs Bandwidth
  - Store Issues
  - Legacy tuning metrics plugged into the hierarchy
    - Data Sharing, Store Forward Blocks, False Sharing, ...
- Core Bound
  - Non-memory core-internal issues
  - Example: Divider, Execution Ports Utilization



# Results: Memory-level drilldown



# Memory & multi-core (1-copy vs 4-copy)



Agenda: Motivation, Top Level Heuristics, Top Down hierarchy (Memory breakdown, Frontend breakdown), Examples, Summary

Figure: the Top Down hierarchy diagram, repeated.

# FrontEnd Bound

- FrontEnd issues
  - Less common in traditional client/HPC workloads, more common in server/enterprise workloads
- Breakdown
  - Rough Frontend Latency vs BW classification
  - Frontend Latency
    - Intervals with uop delivery starvation
    - Buckets: i-Cache Miss, iTLB Miss, Branch Resteers
  - Frontend Bandwidth
    - Intervals in which a non-optimal number of uops is supplied per cycle
    - Breakdown by Fetch source unit (DSB, MITE, LSD)



# Results: Frontend drilldown



# Frontend

- Enterprise: Latency Bound
- "Client": Bandwidth Sensitive



# Hold on... but why does this differ?

- Top Down utilizes designated PMU heuristics
  - IDQ\_UOPS\_NOT\_DELIVERED
  - CYCLE\_ACTIVITY.STALLS\_L2\_MISS
- Naïve methods are often inaccurate
  - Example: $\text{Counted\_Stalls} = \sum_i \text{Fixed\_Penalty}_i * \text{Number}_i$
  - Many Issues
    - Assumes stalls are sequential!
    - Speculations not well handled
    - Fixed penalty for all workloads
    - Restriction to a pre-defined set of miss-events
    - Superscalar oblivious
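To make the naïve formula concrete, here is a minimal sketch of it. The struct, the function name and every penalty value used with it are hypothetical; nothing here is taken from a real tool.

```c
#include <assert.h>
#include <stddef.h>

/* One kind of miss event with an assumed constant cost. */
struct miss_event {
    const char *name;
    double fixed_penalty_cycles; /* assumed fixed cost per occurrence */
    double count;                /* number of occurrences measured */
};

/* Counted_Stalls = sum over i of Fixed_Penalty_i * Number_i */
double naive_counted_stalls(const struct miss_event *events, size_t n)
{
    assert(events != NULL || n == 0);
    double stalls = 0.0;
    for (size_t i = 0; i < n; i++)
        stalls += events[i].fixed_penalty_cycles * events[i].count;
    return stalls;
}
```

With made-up inputs of 100 misses at 200 cycles each and 50 mispredicts at 20 cycles each, the estimate is 21,000 stall cycles. If those misses overlap in time, the real stall time can be far lower, which is the "assumes stalls are sequential" issue listed above.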



# Example 1: Matrix Multiply



# Un-tuned

VTune General Exploration viewpoint. Grouping: Function / Call Stack

| Function / Call Stack      | CPU_CLK_U...THREAD | INST_RETIRE...ANY | CPI Rate | Retiring | Bad Speculation | Back-end Bound | Front-end Bound |
|----------------------------|--------------------|-------------------|----------|----------|-----------------|----------------|-----------------|
| + multiply1                | 488,292,732,438    | 43,100,064,650    | 11.329   | 0.022    | 0.001           | 0.974          | 0.003           |
| + KeWaitForMultipleObjects | 86,000,129         | 14,000,021        | 6.143    | 0.081    | 0.244           | 0.430          | 0.244           |
| + KeSetTimer               | 86,000,129         | 6,000,009         | 14.333   | 0.000    | 0.000           | 0.919          | 0.081           |

VTune General Exploration viewpoint with the Back-end Bound column expanded. Grouping: Function / Call Stack

| Function / Call Stack      | Retiring | Bad Speculation | L1 Bound | L2 Bound | L3 Bound | DRAM Bound | Store Bound | Core Bound | Front-end Bound |
|----------------------------|----------|-----------------|----------|----------|----------|------------|-------------|------------|-----------------|
| + multiply1                | 0.022    | 0.001           | 0.070    | 0.023    | 0.064    | 0.745      | 0.036       | 0.022      | 0.003           |
| + KeWaitForMultipleObjects | 0.081    | 0.244           | 0.000    | 0.326    | 0.000    | 0.000      | 0.000       | 0.000      | 0.244           |

# Loop Interchange

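```
void matrix_multiply(void)
{
    // Multiply the two matrices.
    // The innermost loop walks k, so matrix_b[k][j] is read down a column:
    // consecutive iterations touch a different row of matrix_b each time.
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLUMNS; j++) {
            for (int k = 0; k < COLUMNS; k++) {
                matrix_r[i][j] = matrix_r[i][j] + matrix_a[i][k] * matrix_b[k][j];
            }
        }
    }
}
```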
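The slides do not show the tuned source, so the following is only a sketch of what a loop interchange of this nest commonly looks like: the `j` and `k` loops are swapped so that the innermost loop walks both `r` and `b` along a row. The function name `multiply_ikj`, the flat row-major layout and the square `n` x `n` shape are assumptions made to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout: element (i, j) of an n x n matrix is stored at m[i * n + j].
 * r must be zero-initialized by the caller; the product is accumulated into it. */
void multiply_ikj(size_t n, const double *a, const double *b, double *r)
{
    assert(n == 0 || (a != NULL && b != NULL && r != NULL));
    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            double a_ik = a[i * n + k]; /* invariant in the innermost loop */
            for (size_t j = 0; j < n; j++) {
                /* Both r and b are now accessed with unit stride. */
                r[i * n + j] += a_ik * b[k * n + j];
            }
        }
    }
}
```

The arithmetic is unchanged; only the order of memory accesses differs, which is why the technique targets a DRAM Bound profile such as the one measured for `multiply1`.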



# Loop Interchange

VTune General Exploration viewpoint. Grouping: Function / Call Stack

| Function / Call Stack    | CPU_CLK_THREAD | INST_RETIRE_ANY | CPI Rate | Retiring | Bad Speculation | Back-end Bound | Front-end Bound |
|--------------------------|----------------|-----------------|----------|----------|-----------------|----------------|-----------------|
| + multiply2              | 43,980,065,970 | 51,604,077,406  | 0.852                 | 0.353    | 0.001                            | 0.573          | 0.073           |
| + KeSetTimer             | 24,000,036     |                 | 0                     | 0.000    | 0.000                            | 0.000          | 0.000           |
| + init_arr               | 20,000,030     | 16,000,024      | 1.250                 | 0.000    | 0.000                            | 0.000          | 0.000           |
| + KeSynchronizeExecution | 18,000,027     | 0               | 0.000                 | 0.389    | 0.000                            | 0.000          | 0.000           |
| + ExReleaseRundownProt   | 14,000,021     | 6,000,009       | 2.333                 | 0.000    | 0.000                            | 1.000          | 0.000           |

VTune General Exploration viewpoint with Back-end Bound expanded into its Memory Bound and Core Bound columns. Grouping: Function / Call Stack

| Function / Call Stack    | L1 Bo. | L2 Bou.. | L3 Bo. | DRAM Bou.. | St. Bo. | DIV Active | Cycles of 0 Ports Utilized | Cycles of 1 Port Utilized | Cycles of 2 Ports Utilized | Cycles of 3+ Ports Utilized |
|--------------------------|--------|----------|--------|------------|---------|------------|----------------------------|---------------------------|----------------------------|-----------------------------|
| + multiply2              | 0.060                            | 0.000    | 0.000  | 0.066      | 0.137      | 0.000      | 0.133                      | 0.353                     | 0.324                      | 0.20                        |
| + KeSetTimer             | 0.000                            | 1.000    | 0.000  | 0.000      | 0.000      | 0.000      | 0.000                      | 0.000                     | 0.000                      | 0.00                        |
| + init_arr               | 0.000                            | 0.000    | 0.000  | 0.000      | 0.000      | 0.000      | 0.000                      | 0.000                     | 0.000                      | 0.00                        |
| + KeSynchronizeExecution | 0.000                            | 0.000    | 0.000  | 0.000      | 0.000      | 0.000      | 0.000                      | 1.000                     | 0.000                      | 0.00                        |
| + ExReleaseRundownProt   | 0.000                            | 0.000    | 0.000  | 0.000      | 0.000      | 0.000      | 0.000                      | 0.000                     | 0.000                      | 0.00                        |
| Selected 1 row(s):       | 0.060                            | 0.000    | 0.000  | 0.066      | 0.137      | 0.000      | 0.133                      | 0.353                     | 0.324                      | 0.20                        |

# Vectorization



# Example 2: False Sharing

- Field threading example
  - By UIUC class using VTune
  - Single-threaded compute-bound kernel is parallelized
  - 1<sup>st</sup> attempt shows no speedup due to false sharing
  - Backend.Memory.StoreBound is highlighted
  - 2<sup>nd</sup> attempt works: a 3.8x speedup is achieved and the code is compute-bound again

| Metric           | Single Thread | Multi-thread (False Sharing) | Multi-thread (Fixed) |
|------------------|---------------|------------------------------|----------------------|
| Speedup          | 1.00          | 0.97          | 3.77  |
| IPC              | 0.90          | 0.36          | 0.84  |
| Frontend Bound   | 0.00          | 0.02          | 0.01  |
| Retiring         | 0.31          | 0.11          | 0.30  |
| Bad Speculation  | 0.00          | 0.00          | 0.00  |
| Backend Bound    | 0.69          | 0.87          | 0.69  |
| --- Memory Bound | 0.19          | 0.49          | 0.19  |
| --- L1 Bound     | 0.19          | 0.16          | 0.19  |
| --- L2 Bound     | -             | (0.06)        | -     |
| --- L3/MEM Bound | -             | 0.06          | -     |
| --- Stores Bound | -             | 0.33          | -     |
| --- Core Bound   | 0.33          | 0.36          | 0.36  |
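The slides do not show how the class fixed the false sharing. One common remedy is to pad per-thread data so that no two threads write to the same cache line; the sketch below illustrates only that layout change. The 64-byte line size, the struct names and the helper functions are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64 /* assumed line size in bytes; check the target CPU */

/* Prone to false sharing: counters of different threads sit side by side. */
struct packed_counter {
    long value;
};

/* Padded so that consecutive array elements land in different cache lines,
 * provided the array itself starts on a cache-line boundary. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE_SIZE - sizeof(long)];
};

/* Nonzero if two byte offsets from a line-aligned base share a cache line. */
int same_cache_line(size_t offset_a, size_t offset_b)
{
    return offset_a / CACHE_LINE_SIZE == offset_b / CACHE_LINE_SIZE;
}

/* Sanity check for the padded layout. */
void check_padded_layout(void)
{
    assert(sizeof(struct padded_counter) == CACHE_LINE_SIZE);
}
```

In an array of `packed_counter`, elements 0 and 1 share a line, so stores from two threads keep invalidating each other's copy. In an array of `padded_counter` they do not, at the cost of extra memory per element.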

# Example 3: Software prefetching

## Original Code



## Tuned (1.35x speedup)



Prefetching can help Memory Latency Bound Apps. Use Carefully
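The original and tuned code from these slides are not available, so the following is a generic sketch of software prefetching on an indirect access pattern. It uses the GCC/Clang builtin `__builtin_prefetch`. The function name and `PREFETCH_DISTANCE` are hypothetical, and the distance would need tuning per workload.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_DISTANCE 16 /* hypothetical look-ahead, in elements */

/* Sum values[index[i]] for i in [0, n): a data-dependent access pattern. */
double gather_sum(const double *values, const size_t *index, size_t n)
{
    assert(n == 0 || (values != NULL && index != NULL));
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Request the element needed PREFETCH_DISTANCE iterations from now.
         * Arguments: address, 0 = prefetch for read, 1 = low temporal locality. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&values[index[i + PREFETCH_DISTANCE]], 0, 1);
        sum += values[index[i]];
    }
    return sum;
}
```

The prefetch is only a hint and does not change the result. Too short a distance hides little latency, and too long a distance can evict useful data, which is one reason to use it carefully.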

# Example 4: Microarchitecture comparison

- Haswell (4<sup>th</sup> Core gen) has an improved front end
  - Speculative iTLB and cache accesses with better timing to improve the benefits of prefetching
- Benchmarks that benefit show a clear reduction in Frontend Bound



Using Top Down, forward compatibility is assured on Intel Core™



# Enterprise Challenges

## Software

- LARGE
  - Data and Code size
  - # modules/developers
- Un-optimized code
  - E.g. x87
  - Dead code
  - JITed
- Cloud era: Virtualized, ...

Expected range of pipeline slots in each category, for a hotspot in a *Well-defined* application:

| Category        | Client/Desktop application | Server/Database/Distributed application | High Performance Computing (HPC) application |
|-----------------|----------------------------|-----------------------------------------|----------------------------------------------|
| Retiring        | 20-50%                                                                                                | 10-30%                                    | 30-70%                                       |
| Back-End Bound  | 20-40%                                                                                                | 20-60%                                    | 20-40%                                       |
| Front-End Bound | 5-10%                                                                                                 | 10-25%                                    | 5-10%                                        |
| Bad Speculation | 5-10%                                                                                                 | 5-10%                                     | 1-5%                                         |
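The ranges in this table can be used as a quick sanity check on a measured breakdown. The sketch below encodes the table as a lookup and is only an illustration; the enum and function names are hypothetical.

```c
#include <assert.h>

enum workload_class { CLIENT, SERVER, HPC, NUM_CLASSES };
enum category { RETIRING, BACKEND_BOUND, FRONTEND_BOUND, BAD_SPECULATION,
                NUM_CATEGORIES };

struct range {
    double low, high; /* percent of pipeline slots */
};

/* Expected ranges copied from the table above. */
static const struct range expected[NUM_CATEGORIES][NUM_CLASSES] = {
    /*                     client     server     HPC     */
    [RETIRING]        = {{20, 50}, {10, 30}, {30, 70}},
    [BACKEND_BOUND]   = {{20, 40}, {20, 60}, {20, 40}},
    [FRONTEND_BOUND]  = {{ 5, 10}, {10, 25}, { 5, 10}},
    [BAD_SPECULATION] = {{ 5, 10}, { 5, 10}, { 1,  5}},
};

/* Nonzero if the measured percentage lies within the expected range. */
int within_expected_range(enum category c, enum workload_class w,
                          double percent)
{
    assert(percent >= 0.0 && percent <= 100.0);
    return percent >= expected[c][w].low && percent <= expected[c][w].high;
}
```

For instance, if the un-tuned matrix multiply is treated as an HPC workload (an assumption), its 97.4% Back-end Bound falls well outside the 20-40% range, while the 35.3% Retiring measured after loop interchange falls inside 30-70%.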

## PMU/Tools

- Counter Multiplexing
- Hyper-Threading
- Precise profiling accuracy
  - \* A joint work with CERN openlab
- Long-tail profiles
  - Streams across modules
- Data Profiling
- ...

|         | Classic Error | Precise Error |
|---------|---------------|---------------|
| FullCMS | 41.8%         | 8.4%          |
| xalan   | 38.2%         | 27.1%         |
| povray  | 38.0%         | 14.0%         |
| mcf     | 48.5%         | 25.8%         |
| omnetpp | 55.0%         | 19.4%         |
| average | 44.3%         | 18.9%         |

$$\text{Accuracy Error (x)} = \sum_{i \in BB} \frac{|(BB_x[i] - BB_{REF}[i])|}{BB_{REF}[i]}$$
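Read literally, the formula sums the relative deviation of each entry from its reference. The sketch below implements exactly that reading, assuming `BB` indexes basic blocks and `BB_x[i]` is the count that method `x` reports for block `i`. Any further normalization behind the percentages in the table is not stated on the slide and is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Sum over i of |bb_x[i] - bb_ref[i]| / bb_ref[i]. */
double accuracy_error(const double *bb_x, const double *bb_ref, size_t n)
{
    assert(n == 0 || (bb_x != NULL && bb_ref != NULL));
    double error = 0.0;
    for (size_t i = 0; i < n; i++) {
        assert(bb_ref[i] > 0.0); /* the reference count must be nonzero */
        double diff = bb_x[i] - bb_ref[i];
        if (diff < 0.0)
            diff = -diff;
        error += diff / bb_ref[i];
    }
    return error;
}
```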



# Summary

- Top Down Analysis
  - An effective method to identify the **true** bottleneck
  - Google “Ahmad Yasin Intel” - for the ISCA’13 talk/article links
- Integrated into VTune™, Linux perf toplev wrapper, and other tools
- Forward compatibility on Intel Core™ platforms



Try it out and share your feedback



# Links

- Whitepaper
  - How to Tune Applications Using a Top-down Characterization of Microarchitectural Issues
  - <http://software.intel.com/en-us/articles/how-to-tune-applications-using-a-top-down-characterization-of-microarchitectural-issues>
- Tools
  - VTune Amplifier XE 2013 (Update 8 or later)
  - Basic support in PBA - Performance Bottleneck Analyzer
  - ocperf / toplev - A wrapper on top of the Linux perf utility
- Tutorial on Analysis Methodologies and Tools - ISCA'2013
  - <https://sites.google.com/site/analysismethods/isca2013/program-1>
- Questions or feedback -  [ahmad.yasin@intel.com](mailto:ahmad.yasin@intel.com)





# Example 3: Pinpoint a Subtle Memory Issue



# Memory Bound breakdown\* for SPEC FP, on Ivy Bridge



# Sandy Bridge field example: Pinpoint Memory Issue across-functions in 465.tonto

| Stream# | Block# | Instr # | Function      | RIP         | ASM Line                                 | comment                                               |
|---------|--------|---------|---------------|-------------|------------------------------------------|-------------------------------------------------------|
| 0       | 0      | 0       | 0\$HELL2_MO.. | 0x140193E58 | mov r9,qword ptr [rbp+2e58]              |                                                       |
| 0       | 0      | 1       | 1\$HELL2_MO.. | 0x140193E5F | lea rcx,ptr [rbp+23a0]                   | sparing area for parameters & returned value on stack |
| 0       | 0      | 2       | 2\$HELL2_MO.. | 0x140193E66 | mov r10,qword ptr [rbp+2700]             |                                                       |
|         |        |         |               |             | ...                                      |                                                       |
| 0       | 1      | 7       | cexp          | 0x14009CACD | mov qword ptr [rsp+b0],rcx               |                                                       |
|         |        |         |               |             | ...                                      |                                                       |
| 0       | 4      | 6       | cexp          | 0x14009CDE2 | addpd xmm2,xmm6                          | calculations...                                       |
| 0       | 4      | 7       | cexp          | 0x14009CDE6 | mulpd xmm0,xmm2                          |                                                       |
| 0       | 4      | 8       | cexp          | 0x14009CDEA | movq xmm1,xmm0                           |                                                       |
| 0       | 4      | 9       | cexp          | 0x14009CDEE | pshufd xmm0,xmm0,e                       |                                                       |
| 0       | 4      | 10      | cexp          | 0x14009CDF3 | mov rcx,qword ptr [rsp+b0]               |                                                       |
| 0       | 4      | 11      | cexp          | 0x14009CDFB | movq qword ptr [rcx],xmm0                | store result on stack                                 |
|         |        |         |               |             | ...                                      |                                                       |
| 0       | 4      | 17      | cexp          | 0x14009CE26 | ret                                      |                                                       |
| 0       | 5      | 0       | 0\$HELL2_MO.. | 0x140193EC4 | vmulpd xmm1,xmm15,xmmword ptr [rbp+23a0] | Load using cexp() returned data                       |
| 0       | 5      | 1       | 1\$HELL2_MO.. | 0x140193ECC | vmovddup xmm0,qword ptr [rbx+r12*1]      |                                                       |
| 0       | 5      | 2       | 2\$HELL2_MO.. | 0x140193ED2 | inc r15                                  |                                                       |
| 0       | 5      | 10      | 0\$HELL2_MO.. | 0x140193EFF | jb 1.40E+63                              |                                                       |

Top Down Analysis relies on designated PMU events

