

# Exercise 5: HPL Benchmark Analysis

## Performance Evaluation and Optimization

Zyad FRI  
UM6P

January 29, 2026

### 0.1 System Configuration

Table 1: Hardware and Software Specifications

| Component        | Specification                    |
|------------------|----------------------------------|
| Processor        | Intel Core i7-1255U (12th Gen)   |
| Cores            | 10 (2 P-cores + 8 E-cores)       |
| Cache            | L1: 928KB, L2: 6.5MB, L3: 12MB   |
| RAM              | 16 GB DDR4                       |
| HPL Version      | 2.3                              |
| BLAS Library     | OpenBLAS 0.3.26                  |
| Matrix Sizes (N) | 1000, 5000, 10000                |
| Block Sizes (NB) | 1, 2, 4, 8, 16, 32, 64, 128, 256 |

## 1 Experimental Results

### 1.1 Performance Summary

Table 2: Best Performance for Each Matrix Size

| N     | Optimal NB | Time (s) | GFLOPS | Efficiency (%) |
|-------|------------|----------|--------|----------------|
| 1000  | 32         | 0.03     | 24.54  | 40.9           |
| 5000  | 128        | 1.69     | 49.46  | 82.4           |
| 10000 | 256        | 12.62    | 52.84  | 88.1           |

Table 3: Worst Performance (NB=1 for All Sizes)

| N     | Time (s) | GFLOPS | Efficiency (%) |
|-------|----------|--------|----------------|
| 1000  | 0.15     | 4.32   | 7.2            |
| 5000  | 49.26    | 1.69   | 2.8            |
| 10000 | 315.72   | 2.11   | 3.5            |

### 1.2 Performance Visualization



Figure 1: HPL Performance Analysis: Performance vs block size (top left), efficiency trends (top right), execution time scaling (middle left), performance evolution (middle right), and performance heatmap (bottom)

## 2 Questions Analysis

### 2.1 Q1: Execute HPL with Varying Matrix and Block Sizes

**Answer:** Successfully executed 27 experiments ($3 \text{ matrix sizes} \times 9 \text{ block sizes}$). All tests passed validation. The $N = 20000$ case was not run because the experiments were performed locally; access to Toubkal was not yet available.
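
The sweep is the full Cartesian product of the N and NB values listed in Table 1. A minimal Python sketch that enumerates it (the variable names are illustrative and are not taken from the actual run scripts):

```python
from itertools import product

# Values from Table 1.
matrix_sizes = [1000, 5000, 10000]
block_sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256]

# One HPL run per (N, NB) pair.
experiments = list(product(matrix_sizes, block_sizes))

print(len(experiments))   # 27
print(experiments[0])     # (1000, 1)
print(experiments[-1])    # (10000, 256)
```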

### 2.2 Q2: Measure Execution Time and Performance

**Answer:** Complete measurements collected:

- Fastest: 0.03 s ($N=1000$, multiple NBs)
- Slowest: 315.72 s ($N=10000$, NB=1)
- Best performance: 52.84 GFLOPS
- Worst performance: 1.69 GFLOPS

Time scales as  $O(N^3)$  as expected.
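
The cubic scaling can be checked against the best-case times in Table 2. The sketch below assumes the standard HPL operation count, $\tfrac{2}{3}N^3 + \tfrac{3}{2}N^2$; the function name `hpl_flops` is a hypothetical helper introduced only for this illustration.

```python
def hpl_flops(n: int) -> float:
    """Operation count HPL uses for an N x N solve: 2/3 N^3 + 3/2 N^2."""
    return (2.0 / 3.0) * n**3 + 1.5 * n**2

# Best-case times from Table 2 (seconds).
t_5000, t_10000 = 1.69, 12.62

work_ratio = hpl_flops(10000) / hpl_flops(5000)  # close to 2^3 = 8
time_ratio = t_10000 / t_5000                    # measured

# GFLOPS recomputed from N and the measured time.
gflops_10000 = hpl_flops(10000) / t_10000 / 1e9

print(f"work ratio: {work_ratio:.2f}")
print(f"time ratio: {time_ratio:.2f}")
print(f"GFLOPS at N=10000: {gflops_10000:.2f}")
```

Doubling N multiplies the work by roughly 8, while the measured time grows by about 7.5. The time ratio is slightly smaller because the larger problem also runs at a higher GFLOPS rate.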

### 2.3 Q3: Compute Efficiency Relative to Theoretical Peak

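**Answer:** Theoretical peak: 60 GFLOPS, the reference for all efficiency values below (efficiency = measured GFLOPS / 60). The stated factors, $4.5\text{ GHz} \times 2\text{ FMA} \times 4\text{ AVX2 lanes}$, multiply to 36 rather than 60, so the derivation is incomplete as written.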

#### Efficiency Results:

- Best: 88.1% (N=10000, NB=256)
- Worst: 2.8% (N=5000, NB=1)
- Average increases with N: 28.1% → 41.0% → 46.2%
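
The efficiency values follow directly from dividing the measured GFLOPS by the 60 GFLOPS reference peak. A short sketch using the Table 2 results (the helper name `efficiency` is illustrative):

```python
PEAK_GFLOPS = 60.0  # reference peak used throughout this report

def efficiency(gflops: float, peak: float = PEAK_GFLOPS) -> float:
    """Efficiency in percent relative to the theoretical peak."""
    return 100.0 * gflops / peak

# Best results per N from Table 2: N -> measured GFLOPS.
best = {1000: 24.54, 5000: 49.46, 10000: 52.84}

for n, gflops in best.items():
    print(f"N={n:>5}: {efficiency(gflops):.1f}%")
```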

### 2.4 Q4: Analyze Influence of Matrix Size (N)

**Answer:** Matrix size significantly impacts performance.

#### Performance Trends:

- N=1000: 16.89 GFLOPS average
- N=5000: 24.62 GFLOPS (+45.8%)
- N=10000: 27.72 GFLOPS (+12.6%)

#### Why Larger Matrices Perform Better:

1. **Cache Amortization:** More computation per cache miss
2. **Reduced Overhead:** Fixed costs amortized over more work
3. **Better Vectorization:** Longer loops, better SIMD utilization

**Diminishing Returns:** Once the matrix far exceeds the cache, memory bandwidth becomes the bottleneck and further growth in N yields smaller gains.
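
To see where the matrix stops fitting in cache, the footprint of each problem size can be compared with the 12 MB L3 from Table 1. This sketch assumes double-precision storage (8 bytes per element) and decimal megabytes, consistent with the 800 MB figure quoted in Q6:

```python
BYTES_PER_DOUBLE = 8
L3_BYTES = 12 * 10**6  # 12 MB L3 from Table 1, in decimal megabytes

def matrix_megabytes(n: int) -> float:
    """Size of a dense N x N double-precision matrix in MB."""
    return n * n * BYTES_PER_DOUBLE / 1e6

for n in (1000, 5000, 10000):
    size_mb = matrix_megabytes(n)
    fits = "fits in L3" if size_mb * 1e6 <= L3_BYTES else "exceeds L3"
    print(f"N={n:>5}: {size_mb:6.0f} MB ({fits})")
```

Only the N=1000 matrix (8 MB) fits in L3; at N=5000 (200 MB) and N=10000 (800 MB) most of the data has to come from DRAM.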

### 2.5 Q5: Analyze Influence of Block Size (NB)

**Answer:** Block size is critical: the best block size outperforms NB=1 by a factor of 5 to 29, depending on N.

Table 4: Average Performance by Block Size

| NB           | Avg GFLOPS | Efficiency (%) | vs NB=1 |
|--------------|------------|----------------|---------|
| 1            | 2.71       | 4.5            | 1.0×    |
| 8            | 16.60      | 27.7           | 6.1×    |
| 32           | 32.03      | 53.4           | 11.8×   |
| **128**      | 40.71      | 67.8           | 15.0×   |
| 256          | 39.85      | 66.4           | 14.7×   |

#### Optimal Block Sizes:

- N=1000 → NB=32 (fits in L1/L2)
- N=5000 → NB=128 (uses L3 effectively)
- N=10000 → NB=256 (maximum cache reuse)

#### Why Small Blocks Fail (NB ≤ 8):

- Excessive loop overhead
- Poor vectorization (SIMD underutilized)
- Cache thrashing
- Result: 1.69-16.60 GFLOPS (2.8-27.7% efficiency)
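
A rough roofline-style estimate illustrates why small blocks are memory bound. This is a simplified sketch, not a model of HPL itself. It assumes that each rank-NB trailing update `C -= A @ B` streams `C` from DRAM once (no cache reuse across updates, which is only plausible when the matrix is much larger than L3), and it takes 20 GB/s as the bandwidth, the upper end of the measured range quoted in Q6. The function names are hypothetical.

```python
def update_intensity(m: int, n: int, nb: int) -> float:
    """Flops per byte for C -= A @ B with C (m x n), A (m x nb), B (nb x n)."""
    flops = 2.0 * m * n * nb
    # Read and write C once, read A and B once; 8 bytes per double.
    bytes_moved = 8.0 * (2 * m * n + (m + n) * nb)
    return flops / bytes_moved

PEAK_GFLOPS = 60.0     # theoretical peak used in this report
BANDWIDTH_GBS = 20.0   # assumed: upper end of the 15-20 GB/s range in Q6

def roofline_gflops(nb: int, m: int = 10000, n: int = 10000) -> float:
    """Upper bound on GFLOPS: the lower of compute peak and bandwidth limit."""
    return min(PEAK_GFLOPS, BANDWIDTH_GBS * update_intensity(m, n, nb))

for nb in (1, 8, 32, 256):
    print(f"NB={nb:>3}: bound ~ {roofline_gflops(nb):.1f} GFLOPS")
```

Under these assumptions the arithmetic intensity is close to NB/8 flops per byte, so NB=1 is capped near 2.5 GFLOPS, the same order as the 2.11 GFLOPS measured at N=10000. From about NB=32 upward the bound reaches the 60 GFLOPS compute peak.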

### 2.6 Q6: Explain Performance Gaps

**Answer:** Achieved 88.1% efficiency. The 12% gap is due to:

#### 1. Memory Bandwidth Bottleneck

- DDR4: 25 GB/s theoretical, 15-20 GB/s actual
- N=10000 matrix (800 MB) $\gg$ L3 cache (12 MB)
- 5-10% cache miss rate  $\rightarrow$  frequent DRAM access

#### 2. Cache Limitations

- L3 miss penalty: 200+ cycles (vs 4 for L1)
- Only 1.5% of matrix fits in L3

#### 3. Thermal Throttling

- Max turbo: 4.7 GHz (brief)
- Sustained: 3.5-4.0 GHz
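- Lowers the effective peak from 60 to about 50 GFLOPS, which corresponds to scaling the 4.5 GHz reference clock down to 3.75 GHz, the midpoint of the sustained range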

#### 4. Instruction-Level Issues

- Pipeline stalls from data dependencies
- Branch mispredictions
- Limited SIMD registers

#### Efficiency Breakdown (estimated per factor):

- Cache: 90-95%
- Memory bandwidth: 85-90%
- Thermal: 80-90%
- Instruction-level: 85-90%
- Software overhead: 95-98%

An achieved efficiency of 88.1% is excellent for single-core performance.

## 3 Performance Speedup

Table 5: Speedup: Optimal vs Worst Block Size

| N     | Best (NB)          | Worst (NB=1) | Speedup        |
|-------|--------------------|--------------|----------------|
| 1000  | 24.54 GFLOPS (32)  | 4.32 GFLOPS  | 5.68 $\times$  |
| 5000  | 49.46 GFLOPS (128) | 1.69 GFLOPS  | 29.22 $\times$ |
| 10000 | 52.84 GFLOPS (256) | 2.11 GFLOPS  | 25.02 $\times$ |
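
The speedup column is the ratio of best to worst GFLOPS for each N. A minimal sketch recomputing it from the rounded values in Tables 2 and 3:

```python
# (best GFLOPS, worst GFLOPS at NB=1) per N, from Tables 2 and 3.
results = {
    1000: (24.54, 4.32),
    5000: (49.46, 1.69),
    10000: (52.84, 2.11),
}

speedups = {n: best / worst for n, (best, worst) in results.items()}

for n, s in speedups.items():
    print(f"N={n:>5}: speedup = {s:.2f}x")
```

Recomputing from the rounded table entries gives ratios within about 0.05 of the values in Table 5; the small differences at N=5000 and N=10000 presumably come from the table having been built from unrounded measurements.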