

# Optimizing Loops

David Levinthal  
Senior Software Engineer  
Intel Corp

Sept 2005



Intel, the Intel logo, Itanium and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

\*Other names and brands may be  
claimed as the property of others.



# Agenda

- Optimizing loops: What matters?
- What is a Short tripcount loop?
- SMG2000\*
- Versioning loops

**Comments are Specific to Loop Dominated Codes  
Identification of Occurrences Mostly Relies on  
Visual Inspection of Assembler + Performance Counters**



Intel, the Intel logo, Itanium and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

\* Other names and brands may be claimed as the property of others.

# Application Characterization: 2 Types

- Data Structure Driven
  - Execution is instruction retirement driven
    - Mostly FP op retirement
  - Memory access is Bandwidth dominated
    - Regular sequential array element access
    - very effective HW prefetch (IA-32 and Intel® 64 Architecture)
    - And SW prefetch ( Itanium® 2 Processors, 9.0 compiler and later)
  - HPC: loop dominated apps



Intel, the Intel logo, Itanium and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

\*Other names and brands may be claimed as the property of others.



# Application Characterization: 2 Types

- Data Value Driven
  - Execution is branch dominated
  - Memory access is Latency dominated
    - Data value, pointer chasing for linked lists
    - Prefetching is difficult
  - Database, Search Engines etc
- NOT DISCUSSED HERE



Intel, the Intel logo, Itanium and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

\*Other names and brands may be claimed as the property of others.



# What Matters in loop

1. The Trip Count
2. The Trip Count
3. The TRIP COUNT!
4. Variations in the tripcount
5. And some other things

BUT..what you do about them depends on  
THE TRIP COUNT

And of course there are virtually no tools to assist you in determining  
this..other than printf

(you can use PIN..  
or prof\_gen and then generate an asm listing and get the basic block  
counts..)

NOTE: Disambiguation is an exception to this rule..



# Why is the Trip Count so Overwhelmingly Important

- Prefetch generation is based off the inner loop.
  - The assumption is that the inner loop has a very long tripcount
  - The prefetch distance (in iterations) is then  $500/\text{cycles\_per\_iteration}$
  - So a one line do loop with a trip count of 100 is toast

## There is no other reasonable default for the compiler

- Software Pipelining is effective ONLY if the Prologue and Epilogue are insignificant
  - So a loop with 5 stages had better have a tripcount  $\gg 10$
- Remainder loops from unrolling are really hard to optimize well..so the remainder better be insignificant



# Managing Tripcount

- Get the longest loop you have, moved to be the inner loop
  - Loop interchange
  - Completely unroll inner loops
    - pragma loop count/pragma unroll may help
  - Data organization interchange..make loop index unit stride of array access
    - Leading dimension: first index for fortran, last for c/C++
    - This makes the prefetching work
    - Uses ALL the bandwidth
  - DO NOT DECREASE THIS DIMENSION WITH MPI PARTITIONING!!!!!!



# The Order of Optimization

- Get everything prefetched
  - Everything else is irrelevant if you wait 300 cycles every 16 variables..(or worse)
- Disambiguate the pointers
  - This kills SWP and vectorization
  - This can even be an issue in F77!!!
    - Once the code has been translated to Intermediate Language the identity of the source language (and its rules) are gone
    - Note: two references to the same array with different symbolic indexes are a true ambiguation
- Everything else:
  - Make sure minimum instruction count is being retired in inner loops (this comprises a lot of tricks)
  - Check latency is being hidden
  - Minimize Bandwidth
  - Check all asm versions of loops with significant cpu\_cycles
  - Etc etc..we'll get to this



# The Big Four, The Biggest Bang for the Buck

- CPU\_CYCLES (SAV = 1,000,000)
- BACK\_END\_BUBBLE.ALL (SAV = 1,000,000)
- DEAR\_LATENCY\_GT\_64 (SAV = 1000)
- BUS\_MEMORY.ALL.SELF (SAV = 10,000)
  - Disable Calibration
- Identifies
  - Low hanging fruit with back\_end\_bubble.all
  - Hotspots (cpu\_cycles / back\_end\_bubble.all)
  - Long latency data access/non prefetched data/thread sharing problems
  - Bandwidth limited loops



# Why Those SAV Values?

- Source and asm views in VTune™ Performance Analyzer display sample counts
  - ratios of sample counts should be meaningful!
- Back\_end\_bubble.all/Cpu\_Cycles = fraction of stalls
  - Sav values are equal so they cancel
- Dear\_latency\_gt\_64/cpu\_cycles ~ fraction of stalled cycles due to long latency loads

Dear \* 1000 (SAV\_dear)\*300 (L3 miss latency)\*3(~dear sampling)/cpu\_cycles\*1000,000 (SAV\_cpu) =  
(900,000/1000,000)\*dear/ cpu\_cycles

~ dear/ cpu\_cycles ~ stalled cycle fraction

BW limit ~ 5.5 bus cycles/bus transaction =>

$$5.5 < (\text{cpu\_cycles} * 1000,000 * 0.2\text{Ghz} / 1.6\text{Ghz}) / (\text{Bus\_memory} * 10,000) = 100/8 * (\text{cpu}/\text{bus})$$

44/100 < cpu/bus ~1/2



# Visual Inspection of Assembler

- Single most powerful tool
  - Other than printf(tripcount) of course!!! ☺
- Compiler generates many versions of loops
  - For various address alignments
  - Disambiguation
  - Tripcounts
- Use VTune™ Performance Analyzer to figure out which one is actually executed
  - CPU sensitive assembly editor



Intel, the Intel logo, Itanium and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

\*Other names and brands may be claimed as the property of others.



# Prefetching

- Nothing else matters as much
- Each L3 cache miss results in ~300 cycles
  - 20 cycles over 16 iterations if every element in an array is used
  - 300 cycles/element/iteration in the worst case!
- It is difficult for anything else to cause this large a problem



Intel, the Intel logo, Itanium and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

\*Other names and brands may be claimed as the property of others.



# Identifying a Prefetch Problem

- Collect L3\_reads.data\_reads.miss with and without an opcode tag on Ifetch
  - The difference is the size of your problem
  - $SAV(L3\_reads) = SAV(CPU)/100$
- Collect DEAR\_latency\_gt\_64
  - $SAV(DEAR) = SAV(CPU)/1000$
  - Fraction of stalled cycles due to long latency loads ~  
 $dear\_latency\_gt\_64 * 300 * 3(sampling\ fraction)/cpu\_cycles$   
using samples:  $\sim dear(samp)/cpu(samp)$

DEAR is exact (no skid) it identifies unprefetched load variables  
(use source line # in asm view to assist: multiple source lines  
with one array/line does not help!)

Then use Ifetch intrinsic (`__Ifetch` or `mm_prefetch`)



# Bandwidth limited App Prefetching

## everything $A(i) = B(i) + C(i) * D(i)$



Intel, the Intel logo, Itanium and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

\* Other names and brands may be claimed as the property of others.

# Bandwidth limited App Missing One Prefetch (pragma noprefetch c)



Intel, the Intel logo, Itanium and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

\* Other names and brands may be claimed as the property of others.

# Default Compiler Prefetch Generation

- Compiler generates lfetch instructions driven by loop index of the innermost loop effecting data reference
  - Prefetch distance (iterations)  $\sim 500/\text{cycles\_per\_iteration}$  (asm loop)
  - Is available in compiler opt\_report
- This works for long tripcount loops
  - Tripcount > prefetch distance
- Works for dense array access in multi dimensional loop, with unit stride data access pattern
  - Prefetches beyond inner loop tripcount, automatically prefetches for next iteration of outer loop



# Using Lfetch Intrinsics

- Simple inner loops prefetch some number of iterations ahead
  - ~500/cycles\_per\_iteration
  - in the asm view..      count ;;
- If the tripcount is not  $>>$  prefetch distance..this fails
  - May result in additional, unused bandwidth
- LFETCH is a BW limited instruction!
  - 128 bytes/slot!!! Up to 512 bytes/cycle
- Redundant Lfetches are suppressed by OzQ
  - Dropped/cancelled secondary miss
  - Occupy memory slots



# Mechanics of Lfetch Usage

- Two basic issues
  - Driving prefetches from outer loops
    - Inner loop is too short and can't be unrolled out of existence
    - First inner loop (of many) is BW limited
  - Indirect Addressing aka gather/scatter
    - One level gather usually done by compiler
    - Multilevel gather or compiler failure requires hand written prefetch
    - Tripcount is too short to support 2 level gather
- Prefetching Structures may require multiple lfetch
  - Structure size > 128 bytes
  - Structure poorly aligned to cachelines
    - Structure spans 2 cachelines
    - Prefetch leading and trailing edge of structure



# Mechanics of Lfetch Usage with Outer Loops

- Inner loop has very short tripcount, but a lot of code
  - It cannot be unrolled away
- Resolution:
  - Create a prefetch value from the index of the outer loop
    - Outer\_loop\_index+ 1? 2?
    - You won't need much..there are a lot of instructions
  - Prefetch all the data being used for each inner iteration but using the forward outer\_loop index
    - Gathers may result in a fair bit of code



# Versioning Loops

- Different tripcounts and strides may have different optimal encodings
  - An inner loop with a tripcount that is always either 3 or 4 has 2 distinct unroll options
- Predicate Hoisting
- Address alignments may or may not allow runtime disambiguation
- Address alignments may allow or preclude load pairs

Generation of multiple loop encodings  
Selected by runtime test



# Versioning on Inner Tripcount

```
DO j=1, jmax
  DO k=1, kmax
    IF( phase .EQ. 3) THEN
      DO i=1, imax
        sum = ( density(i,k ,j,1)+density(i,k ,j,2)+density(i,k-1,j,3)+  &
&           density(i,k-1,j,1)+density(i,k-1,j,2)+density(i,k ,j,3) ) ←---unrolled by 3
        csum(i,k,j) = 1./(1.+0.5*sum) ←---unroll allows SWP of divide
      ENDDO
    ELSE
      DO i=1, imax
        sum = 0.
        DO ispe=1,phase
          sum = sum + density(i,k,j,ispe) + density(i-1,k,j,ispe) ←-short inner loops are lousy
        ENDDO
        csum(i,k,j) = 1./(1.+0.5*sum) ←-divide outside inner loop
      ENDDO
    ENDIF
  ENDDO
ENDDO
```

PGO Will Do This Automatically



# Versioning on Stride

- Variable stride creates havoc for the compiler
  - Makes strength reduction go out the window
  - Drives prefetch generation crazy
  - Makes unrolling very messy
  - In principle blocks SIMD generation
- Sources of variable stride
  - Explicit use of variables to increment array indices
  - Dope vectors for F90 assumed shape arrays
- Reality is that these “variable” strides are really always 1
  - IA-32/Intel® 64 Architecture compiler versions on stride to allow SIMD generation



# SMG2000\*

VTune(TM) Performance Analyzer - [Source View - [c:\LLNL\smg\_residual.c]]

File Edit View Activity Configure Window Help

smg2000\_d18

Source

| Address | Line | Source                                                   | CPU CY | BACK END | BUS ME | DEAR L |
|---------|------|----------------------------------------------------------|--------|----------|--------|--------|
| 0...    | 280  | loop_size);                                              |        |          |        |        |
| 0...    | 281  | hypre_BoxLoop3Begin(loop_size,                           | 2      | 2        | 8      |        |
| 0...    | 282  | A_data_box, start, base_stride                           |        |          |        |        |
| 0...    | 283  | x_data_box, start, base_stride                           |        |          |        |        |
| 0...    | 284  | r_data_box, start, base_stride                           |        |          |        |        |
| 0...    | 285  | #define HYPRE_BOX_SMP_PRIVATE loopk,loopi,loopj,Ai,xi,ri |        |          |        |        |
| 0...    | 286  | #include "hypre_box_smp_forloop.h"                       |        |          |        |        |
| 0...    | 287  | hypre_BoxLoop3For(loopi, loopj, loopk, Ai, xi, ri)       | 11,341 | 2,759    | 15,221 |        |
| 0...    | 288  | {                                                        |        |          |        |        |
| 0...    | 289  | rp[ri] -= Ap[Ai] * xp[xi];                               | 14,690 | 9,484    | 14,822 | 5,048  |
| 0...    | 290  | }                                                        |        |          |        |        |
| 0...    | 291  | hypre_BoxLoop3End(Ai, xi, ri);                           | 338    | 33       | 431    |        |
| 0...    | 292  | }                                                        |        |          |        |        |
| 0...    | 293  | }                                                        |        |          |        |        |
| 0...    | 294  | }                                                        |        |          |        |        |
| 0...    | 295  |                                                          |        |          |        |        |
| 0...    | 296  |                                                          |        |          |        |        |
| 0...    | 297  | /*-----                                                  |        |          |        |        |
| 0...    | 298  | * Return                                                 |        |          |        |        |
| 0...    | 299  | *-----                                                   |        |          |        |        |

Function Summary

| Address | Size   | Function               | CPU CYCLE... | BACK-END BUBBLE... | BUS MEMORY-ALL-SE... | DEAR LATENCY GE... |
|---------|--------|------------------------|--------------|--------------------|----------------------|--------------------|
| -----   | -----  | --- Selected Range --- | 26,369       | 12,276             | 30,474               | 5,048              |
| 0x10240 | 0x2A00 | hypre_SMGResidual      | 32,022       | 16,984             | 37,371               | 5,539              |

Sampling Results [spd206-4.jf.intel.com] - Wed Sep 21 13:57:49 2005

Output

or Help, press F1



Intel, the Intel logo, Itanium and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

\*Other names and brands may be claimed as the property of others.

# How do we optimize a 1 Line Loop?

- In this case
- Expand the macros with the preprocessor
- Copy the expanded macro code for this one section
- Replace the expanded macro code for the original block which had the macros
  - Leave all other macros alone!
- Rewrite the code



# SMG2000\*: Macros expanded

```
for (loopk = 0; loopk < hypre_nz; loopk++) //7
{
    for (loopj = 0; loopj < hypre_ny; loopj++) //8
    {
        for (loopi = 0; loopi < hypre_nx; loopi++) //9
        {
            rp[ri] -= Ap[Ai] * xp[xi];
            Ai += hypre_sx1;
            xi += hypre_sx2;
            ri += hypre_sx3;
        } //8
        Ai += hypre_sy1 - hypre_nx*hypre_sx1;
        xi += hypre_sy2 - hypre_nx*hypre_sx2;
        ri += hypre_sy3 - hypre_nx*hypre_sx3;
    } //7
    Ai += hypre_sz1 - hypre_ny*hypre_sy1;
    xi += hypre_sz2 - hypre_ny*hypre_sy2;
    ri += hypre_sz3 - hypre_ny*hypre_sy3;
} //6
```



# SMG2000\* Issues

- Inner loop tripcount is short compared to prefetch distance (hypre\_nx = 120)
  - 1 fma/iteration  $\rightarrow$  2 iterations/cycle  $\rightarrow$  prefetch “distance”  $\sim$  600 iterations  $>>$  120
- Variable inner stride could mean 1 cacheline/iteration
  - All 3 arrays must be prefetched every iteration
- Variable outer stride means noncontiguous memory access
- Assembler analysis reveals 11 cycles/ 2 iterations!
  - Instead of 1 cycle/2 iterations



# SMG2000\* Visual Inspection of Assembler

VTune(TM) Performance Analyzer - [Source View - [C:\...smg2000\struct\_ls\smg\_residual\_d17.c]]

File Edit View Activity Configure Window Help

smg2000\_d18

Tuning Browser

| Address | Line | Source                              | CPU   | CYC   | BACK_EN | BE_EX | DEAR |
|---------|------|-------------------------------------|-------|-------|---------|-------|------|
| 0x112C1 | 289  | (p16) add r45 = r43, r8             | 1,067 |       | 115     | 60    |      |
| 0x112C2 | 289  | (p16) add r47 = r43, r2             |       |       |         |       |      |
| 0x112D0 | 289  | (p16) add r36 = r43, r30            |       |       |         |       |      |
| 0x112D1 | 289  | (p16) add r41 = r43, r28            |       |       |         |       |      |
| 0x112D2 | 289  | (p16) add r49 = r43, r28 ;;         |       |       |         |       |      |
| 0x112E0 | 287  | (p16) add r52 = r43, r27            |       |       |         |       |      |
| 0x112E1 | 287  | (p16) add r53 = r43, r26            |       |       |         |       |      |
| 0x112E2 | 289  | (p16) sxt4 r54 = r47                |       |       |         |       |      |
| 0x112F0 | 287  | (p16) add r50 = r43, r25            |       |       |         |       |      |
| 0x112F1 | 287  | (p16) add r51 = r43, r24            |       |       |         |       |      |
| 0x112F2 | 289  | (p16) sxt4 r55 = r49 ;;             |       |       |         |       |      |
| 0x11300 | 289  | (p16) shladd r47 = r34, 03h, r63    |       |       |         |       |      |
| 0x11301 | 289  | (p16) shladd r62 = r55, 02h, r61    |       |       |         |       |      |
| 0x11302 | 289  | (p16) sxt4 r126 = r44               |       |       |         |       |      |
| 0x11310 | 287  | (p16) add r49 = r43, r16            |       |       |         |       |      |
| 0x11311 | 287  | (p16) add r127 = r43, r14           |       |       |         |       |      |
| 0x11312 | 289  | (p16) sxt4 r125 = r45 ;;            |       |       |         |       |      |
| 0x11320 | 289  | (p16) shladd r44 = r126, 03h, r60   |       |       |         |       |      |
| 0x11321 | 289  | (p16) ldfd f32 = [r62]              |       |       |         |       |      |
| 0x11322 | 289  | (p16) sxt4 r45 = r36                |       |       |         |       |      |
| 0x11330 | 289  | (p16) ldfd f35 = [r67]              |       |       |         |       |      |
| 0x11331 | 289  | (p16) shladd r32 = r125, 03h, r63   |       |       |         |       |      |
| 0x11332 | 289  | (p16) sxt4 r54 = r41 ;;             |       |       |         |       |      |
| 0x11340 | 289  | (p16) ldfd f38 = [r14]              |       |       |         |       |      |
| 0x11341 | 289  | (p17) fnma.d.s0 f41 = f33, f39, f36 | 9,163 | 7,480 | 7,356   | 372   |      |
| 0x11342 | 287  | (p16) sxt4 r126 = r52               |       |       |         |       |      |

Function Summary

Sampling Results [spd206-4.jf.intel.com] - Fri Jul 08 16:...

Output

For Help, press F1



11 cycles per 2 iterations of source loop

Intel, the Intel logo, Itanium and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

\* Other names and brands may be claimed as the property of others.

# Version on Stride: SMG2000\* hypre\_nx ~ 100: Rewriting the Source

```
for (loopk = 0; loopk < hypre_nz; loopk++) //7
{
    for (loopj = 0; loopj < hypre_ny; loopj++) //8
    {
        for (loopi = 0; loopi < hypre_nx; loopi++) //9
        {
            rp[ri] -= Ap[Ai] * xp[xi];
            Ai += hypre_sx1; ← version on stride
            xi += hypre_sx2;
            ri += hypre_sx3;
        } //8
        Ai += hypre_sy1 - hypre_nx*hypre_sx1;
        xi += hypre_sy2 - hypre_nx*hypre_sx2;
        ri += hypre_sy3 - hypre_nx*hypre_sx3;
    } //7
    Ai += hypre_sz1 - hypre_ny*hypre_sy1;
    xi += hypre_sz2 - hypre_ny*hypre_sy2;
    ri += hypre_sz3 - hypre_ny*hypre_sy3;
} //6
```



# Version on Stride: SMG2000\*

## Version loops on the strides = 1!

```
for (loopk = 0; loopk < hypre__nz; loopk++) //7
{
    for (loopj = 0; loopj < hypre__ny; loopj++) //8
    {
// version on stride
        if( (hypre__sx1 == 1)&&(hypre__sx2 ==1) &&(hypre__sx3 ==1))
        {
#pragma loop count(100)
            for (loopi = 0; loopi < hypre__nx-remainder; loopi+=4) //9
            {
                rp[ri] -= Ap[Ai] * xp[xi];
                rp[ri+1] -= Ap[Ai+1] * xp[xi+1];
                rp[ri+2] -= Ap[Ai+2] * xp[xi+2];
                rp[ri+3] -= Ap[Ai+3] * xp[xi+3];
                __lfetch(__lfhint_nt1,&rp[ri+4*hypre__sy3 ]);
                __lfetch(__lfhint_nt1,&Ap[Ai+4*hypre__sy1 ]);
                __lfetch(__lfhint_nt1,&xp[xi+4*hypre__sy2 ]);
                Ai +=4;    ←--unroll by 4 (instead of 8) because I'm lazy
                xi +=4;
                ri +=4;
            }
// remainder loop
        }
    }
}
```



# Prefetching with Short tripcount: SMG2000\*

## Outer loop Driven Prefetch

```
for (loopk = 0; loopk < hpre_nz; loopk++) //7
{
    for (loopj = 0; loopj < hpre_ny; loopj++) //8
    {
        // version on stride
        if( (hpre_sx1 == 1)&& (hpre_sx2 ==1) && (hpre_sx3 ==1))
        {
#pragma loop count(100)    ←force compiler generated prefetch to be “close”
            for (loopi = 0; loopi < hpre_nx-remainder; loopi+=4) //9
            {
                rp[ri] -= Ap[Ai] * xp[xi];
                rp[ri+1] -= Ap[Ai+1] * xp[xi+1];
                rp[ri+2] -= Ap[Ai+2] * xp[xi+2];
                rp[ri+3] -= Ap[Ai+3] * xp[xi+3];
                __lfetch(__lfhint_nt1,&rp[ri+4*hpre_sy3 ]);    ←---prefetch for loopj + 4
                __lfetch(__lfhint_nt1,&Ap[Ai+4*hpre_sy1 ]);
                __lfetch(__lfhint_nt1,&xp[xi+4*hpre_sy2 ]);
                Ai +=4;    ←--unroll by 4 (instead of 8) because I'm lazy
                xi +=4;
                ri +=4;
            }
        }
    }
}

// remainder loop
```



# SMG2000\* rewritten



Loop is Now Bandwidth Limited



Intel, the Intel logo, Itanium and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

\* Other names and brands may be claimed as the property of others.

# SMG2000\* Final Asm

VTune(TM) Performance Analyzer - [Source View - [c:\LLNL\smg\_residual.c]]

File Edit View Activity Configure Window Help

Tuning Brow... X

Projectsmg2000final Activity1 (Sampling)

smg2000\_d18 Sampling Res. Run 1 CPU\_I BACK\_ BUS\_I DEAR

smg2000\_d18 Sampling Res. Run 1 CPU\_I BACK\_ BUS\_I DEAR

| Address | Line | Source                                   | CPU   | CY    | BACK  | BUS_M | DEAR |
|---------|------|------------------------------------------|-------|-------|-------|-------|------|
| 0x1...  | 393  | ...a1+2760: (p16) ldfd f57 = [r17], 020h | 4,823 | 3,790 | 9,591 | 409   |      |
| 0x1...  | 393  | (p16) ldfd f59 = [r16], 020h             |       |       |       |       | 118  |
| 0x1...  | 395  | (p23) fnma.d.s0 f63 = f73, f75, f71      |       |       |       |       |      |
| 0x1...  | 393  | (p16) ldfd f61 = [r73]                   |       |       |       |       | 103  |
| 0x1...  | 394  | (p16) ldfd f47 = [r63]                   |       |       |       |       | 162  |
| 0x1...  | 394  | nop.i 00h ::                             |       |       |       |       |      |
| 0x1...  | 394  | (p21) stfd [r68] = f76                   |       |       |       |       |      |
| 0x1...  | 395  | (p21) ldfd f71 = [r15], 020h             | 2,772 | 2,003 | 5,693 |       |      |
| 0x1...  | 395  | nop.i 00h                                |       |       |       |       | 30   |
| 0x1...  | 395  | (p21) ldfd f73 = [r14], 020h             |       |       |       |       |      |
| 0x1...  | 395  | (p16) ldfd f64 = [r55]                   |       |       |       |       | 127  |
| 0x1...  | 396  | nop.i 00h ::                             |       |       |       |       |      |
| 0x1...  | 394  | (p16) ldfd f31 = [r33]                   |       |       |       |       |      |
| 0x1...  | 394  | (p20) fnma.d.s0 f75 = f53, f55, f51      | 1,603 | 1,087 | 2,641 | 158   |      |
| 0x1...  | 394  | nop.i 00h                                |       |       |       |       |      |
| 0x1...  | 396  | (p26) stfd [r43] = f56                   |       |       |       |       |      |
| 0x1...  | 390  | (p16) lfetcn.nt1 [r71]                   |       |       |       |       |      |
| 0x1...  | 390  | nop.i 00h ::                             |       |       |       |       |      |
| 0x1...  | 393  | (p18) stfd [r75] = f46                   |       |       |       |       |      |
| 0x1...  | 394  | (p18) ldfd f51 = [r11], 020h             | 2,391 | 1,663 | 3,990 |       | 26   |
| 0x1...  | 396  | (p25) fnma.d.s0 f55 = f43, f45, f41      |       |       |       |       |      |
| 0x1...  | 394  | (p18) ldfd f53 = [r10], 020h             |       |       |       |       | 13   |
| 0x1...  | 399  | (p16) lfetcn.nt1 [r9], 020h              |       |       |       |       |      |
| 0x1...  | 390  | (p16) adds r43 = 01h, r44 ::             |       |       |       |       |      |
| 0x1...  | 398  | (p16) lfetcn.nt1 [r8], 020h              |       |       |       |       |      |
| 0x1...  | 395  | (p23) stfd [r62] = f63                   |       |       |       |       |      |
| 0x1...  | 390  | (p16) adds r54 = 020h, r55               |       |       |       |       |      |
| 0x1...  | 396  | (p23) ldfd f41 = [r3], 020h              |       |       |       |       | 17   |
| 0x1...  | 396  | (p23) ldfd f43 = [r2], 020h              |       |       |       |       | 10   |
| 0x1...  | 390  | (p16) adds r32 = 020h, r33 ::            |       |       |       |       |      |
| 0x1...  | 397  | (p16) lfetcn.nt1 [r30], 020h             | 1,796 | 1,158 | 3,336 |       |      |
| 0x1...  | 393  | (p17) fnma.d.s0 f45 = f58, f60, f62      |       |       |       |       |      |
| 0x1...  | 390  | (p16) adds r68 = 060h, r71               |       |       |       |       |      |
| 0x1...  | 390  | (p16) adds r72 = 020h, r73               |       |       |       |       |      |
| 0x1...  | 390  | (p16) adds r62 = 020h, r63               | 2,029 | 1,367 | 3,604 |       |      |

br,ctop;; just off screen

4 iterations in 6 cycles

For Help, press F1



Intel, the Intel logo, Itanium and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

\* Other names and brands may be claimed as the property of others.

# Summary

- **BE EXPLICIT: Don't do Anything Tricky**
  - At Best You'll ONLY Confuse the Compiler
  - More Likely You'll Confuse Yourself AND the Compiler
- **General solutions might be “nice”...**
  - But they are likely to be slow
- **Optimized Solutions take advantage of unique aspects of your problem**
- **Use all the tools in a Coordinated Manner**



# Backup



Intel, the Intel logo, Itanium and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

**\*Other names and brands may be claimed as the property of others.**



# Sawtooth Access Patterns are Tough on HW/SW prefetching

- In solvers, incrementing followed by decrementing sums are common.
- This results in loops going forward followed by loops going backwards.
- Backwards loops frequently have a sawtooth access pattern
  - Short inner loop going forward in address
  - Outer loop making large jumps in negative direction in address



- HW and SW prefetch tend to see inner sequence only and thus “prefetches” data that has already been used



# Identifying and Resolving Sawtooth Access

- Identification
  - Usual stuff: use dear\_latency\_gt\_64 and opcode tagged L3\_read.data\_read.miss
  - Inspection of SOURCE for negative loop stride
    - May require printf

## Resolution:

- Change direction of inner loop
  - These loops tend to be “sum-=” so order is irrelevant
- Lfetch intrinsics and do it by hand



# Disambiguation

- C and C++ standards state that the compiler **MUST** assume that ambiguous pointers overlap
- Fortran standard states that arrays with different names can be assumed by the compiler to **NOT** overlap
  - Unless an equivalence statement is used
- `For(i=0; i<len; i++)a[i] += x*b[i];`
- The compiler must assume that `&a[i] = & b[i+1]`
  - **EITHER:** Preventing any loop unrolling or any parallel instruction execution
  - **OR:** requiring speculative execution of multiple iterations
    - Through an O-O-O engine like x86
    - Speculative instruction generation by the compiler



# Ambiguous Pointers are ALWAYS BAD

- If the code generation for ambiguous pointers were better then that code sequence would be the compilers DEFAULT
- Always requires more compiler gymnastics
  - Speculative instructions
  - Versioning
  - Things it wouldn't do otherwise
- If code with ambiguous pointers runs faster..it is a bug..pure and simple
- Fortran can have ambiguous pointers due to lost (intrinsic) disambiguation (of fortran std) in the Intermediate language
  - Arrays in common block..etc



# Identifying an Ambiguity Performance Problem

- Visual Inspection of Assembler is the ONLY reliable test
- Must use runtime performance data to identify code path that is actually executed
  - Non executed versions are irrelevant
- VTune™ Performance Analyzer asm view as spread sheet, exported to excel, is the best method
  - Sort by source line
  - Find hotspot's address by seeing what addresses have lots of events
  - Resort data by address and scroll down to hotspot
  - Maintain source listing view in VTune Analyzer for comparison
  - Make sure there are not multiple hotspots in asm view for a single block of source..(consistency of total event counts)



# Identifying an Ambiguity Performance Problem

- Visual Inspection of Assembler is the ONLY reliable test
- Look for speculative instructions
  - If there weren't a load/store pointer ambiguity, there wouldn't be any speculation
  - Id.a, Id.s, chk
  - Lack of SWP (br.ctop)??
- “Greedy Scheduling bug”
  - Lots of loads and adds with no interspersed fma's
  - Followed by bundles with 2 fma's and 4 nops
  - Finally stores, adds, lfetches and br.ctop
  - Why weren't you scheduled at the resource ii
    - = number of fma's/2
  - Check the source and calculate the number of required fp ops



# Identifying Disambiguation Performance Issues

```
void DAXPY(int len, double x, double * a, double *b){  
    int i,j;  
    for(i=0; i<len; i++)a[i] += x*b[i];  
    return;}  
(p16)ldfpd    f37,f34=[r36]  
(p16)ldfd.a    f32=[r3],16  
(p17)fma.d    f40=f8,f33,f38  
(p16)lfetch.nt1 [r34] ;;  
(p18)stfd      [r2]=f39,16  
    nop.i 0 ;;  
(p17)chk.a.nc  f33,.b1_78  
    nop.m        0  
    nop.i 0  
(p17)stfd      [r37]=f40  
(p17)ldfd      f39=[r9],16  
    nop.i 0 ;;  
(p16)add       r35=16,r36  
(p17)fma.d    f38=f8,f39,f35  
(p16)add       r32=32,r34  
    nop.m        0  
    nop.i 0  
br.ctop.sptk  .b1_18 ;;
```

Ld.a & chk.a means  
there is an ambiguity

Note: 9.0 compiler  
"versions" around this  
example



# Disambiguation Resolution

- **-fno-alias, -ansi\_alias (Linux) –Oa(windows)**
- **-restrict**
  - + restrict keyword in source

```
void DAXPY(int len, double x, double * restrict a, double *b)
```

```
{  
    int i;  
    for(i=0; i<len; i++)a[i] += x*b[i];  
    return;  
}
```

- **Restrict is NOT defined for C++**
  - What does it even mean to apply restrict to a class?



# Function Calls Exporting Pointers can Cause Ambiguities

2D Arrays declared with `**` notation

```
While(norm > tolerance){  
    MPI_Send(top_row[]....);  
    MPI_Send(bottom_row[]...);  
    MPI_Recv(bottom_of_top_neighbor[]....);  
    MPI_Recv(top_of_bottom_neighbor[]...);  
    For(i=1;i<dim1;i++)  
        For(j=1;j<dim2;j++) new[i][j] = (old[i+1][j] + old[i-1][j] +  
                                         old[i][j+1] + old[i][j-1] + src[i][j] )*0.25;  
    dsum = 0;  
    For(i=1;i<dim1;i++)  
        For(j=1;j<dim2;j++)  
            dsum += (new[i][j] - old[i][j])*(new[i][j] - old[i][j]); ←----- ambig *  
    MPI_Allreduce(&dsum, &norm....); ←----- export &dsum  
    For(i=1;i<dim1;i++)  
        For(j=1;j<dim2;j++) old[i][j] = new[i][j];  
}
```

Export of `&dsum` allows local variable `dsum` to become Ambiguous with `**new` and `**old`



# Function Calls Exporting Pointers can Cause Ambiguities

2D Arrays declared with `**` notation

**dsum = 0;**

**For(i=1;i<dim1;i++)**

**For(j=1;j<dim2;j++)**

**dsum += (new[i][j] - old[i][j])<sup>\*</sup> (new[i][j] - old[i][j]);** ←-----  
ambig \*

**tsum = dsum** ←-----

**MPI\_Allreduce(&tsum, &norm....);** ←----- export &dsum

## Introduce Local Copy to Distinguish Addresses

