

# Energy-Aware Runtime Resource Harmonizer for Co-running Applications

Vanshika Jain<sup>1</sup>, Varun Parashar<sup>1</sup>, Vivek Kumar<sup>1</sup>, Chiranjib Sur<sup>2</sup>

1 IIIT Delhi, New Delhi, India

2 Shell India Markets Pvt. Ltd, India



# Outline

- ▶ Introduction
- ▶ Motivation
- ▶ Existing Approaches
- ▶ Contributions
- ▶ Implementation
- ▶ Results
- ▶ Conclusion

# Resource Utilization in the Exascale Era

**Increasing number of sockets  
and cores per node**

| Rank of Top500<br>(November 2025) | Sockets<br>Per Node | Cores<br>Per Node |
|-----------------------------------|---------------------|-------------------|
| 1                                 | <b>4</b>            | 96                |
| 2                                 | 1                   | 64                |
| 3                                 | 2                   | 104               |
| 4                                 | <b>4</b>            | <b>288</b>        |
| 5                                 | 2                   | 96                |

1. <https://top500.org/lists/top500/>



# Resource Utilization in the Exascale Era

## Increasing number of sockets and cores per node

| Rank of Top500 (November 2025) | Sockets Per Node | Cores Per Node |
|--------------------------------|------------------|----------------|
| 1                              | <b>4</b>         | 96             |
| 2                              | 1                | 64             |
| 3                              | 2                | 104            |
| 4                              | <b>4</b>         | <b>288</b>     |
| 5                              | 2                | 96             |

## Power usage at supercomputers



1. <https://top500.org/lists/top500/>

2. Patki et.al. [ICS2025]

# Resource Utilization in the Exascale Era

## Increasing number of sockets and cores per node

| Rank of Top500 (November 2025) | Sockets Per Node | Cores Per Node |
|--------------------------------|------------------|----------------|
| 1                              | <b>4</b>         | 96             |
| 2                              | 1                | 64             |
| 3                              | 2                | 104            |
| 4                              | <b>4</b>         | <b>288</b>     |
| 5                              | 2                | 96             |

## Power usage at supercomputers



It is critical to improve resource utilization for achieving energy efficiency

1. <https://top500.org/lists/top500/>

2. Patki et.al. [ICS2025]

# Improving Resource Utilization via Co-location



Applications A, B & C to be executed on a quad-socket system

# Improving Resource Utilization via Co-location



# Improving Resource Utilization via Co-location



Application C  
waiting for the CPUs

# Improving Resource Utilization via Co-location



Each application completed their execution one by one

# Improving Resource Utilization via Co-location



Batch Execution



Co-running Execution

Threads of Application  
A, B & C  
running in parallel

# Improving Resource Utilization via Co-location



# Improving Resource Utilization via Co-location



# Improving Resource Utilization via Co-location



# Improving Resource Utilization via Co-location



Type: **Block-Cyclic**

# Improving Resource Utilization via Co-location



Type: **Block-Cyclic**

# Improving Resource Utilization via Co-location



# Improving Resource Utilization via Co-location



# Improving Resource Utilization via Co-location



Type: **Interleaved**

# Improving Resource Utilization via Co-location



A  
B  
C

# Improving Resource Utilization via Co-location



Type: **Block-Interleaved**

# Improving Resource Utilization via Co-location



Choosing optimal thread placement over Batch execution improves EDP by up to 81%

# Achieving Energy Efficiency on Multicores



# Achieving Energy Efficiency on Multicores



- Dynamic Voltage and Frequency Scaling (DVFS)
  - Core-level

# Achieving Energy Efficiency on Multicores



- Dynamic Voltage and Frequency Scaling (DVFS)
  - Core-level
- Uncore Frequency Scaling (UFS)
  - Socket-level

# Achieving Energy Efficiency on Multicores



- Dynamic Voltage and Frequency Scaling (DVFS)
  - Core-level
- Uncore Frequency Scaling (UFS)
  - Socket-level

Heatmap represents the change in EDP with a particular combination of core-uncore frequency relative to default settings

# Achieving Energy Efficiency on Multicores



- Dynamic Voltage and Frequency Scaling (DVFS)
  - Core-level
- Uncore Frequency Scaling (UFS)
  - Socket-level

Heatmap represents the change in EDP with a particular combination of core-uncore frequency relative to default settings

# Achieving Energy Efficiency on Multicores



- Dynamic Voltage and Frequency Scaling (DVFS)
  - Core-level
- Uncore Frequency Scaling (UFS)
  - Socket-level

Heatmap represents the change in EDP with a particular combination of core-uncore frequency relative to default settings

# Achieving Energy Efficiency on Multicores



- Dynamic Voltage and Frequency Scaling (DVFS)
  - Core-level
- Uncore Frequency Scaling (UFS)
  - Socket-level

Heatmap represents the change in EDP with a particular combination of core-uncore frequency relative to default settings

# Achieving Energy Efficiency on Multicores



- Dynamic Concurrency Throttling (DCT)
  - Adjusts the application level parallelism by controlling core allocation

# Achieving Energy Efficiency on Multicores



- Dynamic Concurrency Throttling (DCT)
  - Adjusts the application level parallelism by controlling core allocation
  - Thread packing and unpacking technique provides runtime independence

# Achieving Energy Efficiency on Multicores



- Dynamic Concurrency Throttling (DCT)
  - Adjusts the application level parallelism by controlling core allocation
  - Thread packing and unpacking technique provides runtime independence

**Heatmap represents the change in EDP by changing the core count relative to default with maximum core allocation**

# Achieving Energy Efficiency on Multicores



- Dynamic Concurrency Throttling (DCT)
  - Adjusts the application level parallelism by controlling core allocation
  - Thread packing and unpacking technique provides runtime independence

**Heatmap represents the change in EDP by changing the core count relative to default with maximum core allocation**

# Insights

- Choosing optimal thread placement improves resource utilization for co-running applications
- There is a strong correlation between application behavior and resource requirement

**Cache-Sensitive and Neutral**



**Memory-bound**



# Existing Approaches for Co-Running Applications

| Categories of Resource Management Techniques | DCT only | DVFS only | UFS only | DVFS+ UFS | DCT+ DVFS+ UFS |
|----------------------------------------------|----------|-----------|----------|-----------|----------------|
| Thread Placement for contention reduction    | ✓        | ✗         | ✗        | ✗         | ✗              |
| Runtime Oblivious                            | ✓        | ✓         | ✗        | ✗         | ✗              |
| Model Free                                   | ✓        | ✓         | ✓        | ✓         | ✗              |

# Existing Approaches for Co-Running Applications

| Categories of Resource Management Techniques | DCT only | DVFS only | UFS only | UFS | DCT+ DVFS+ UFS |
|----------------------------------------------|----------|-----------|----------|-----|----------------|
| Thread Placement for contention reduction    | ✓        | ✗         | ✗        | ✗   | ✗              |
| Runtime Oblivious                            | ✓        | ✓         | ✗        | ✗   | ✗              |
| Model Free                                   | ✓        | ✓         | ✓        | ✓   | ✗              |

Our Focus

# Contributions

- ✓ Harmonizer: A library-based resource management framework for co-running applications on multicore multi-socket servers
  - ✓ Model-free and runtime oblivious
- ✓ Dynamically manages thread placement, core frequency, uncore frequency and core allocation
  - ✓ Uses a lightweight daemon for online profiling of hardware PMCs
- ✓ Experimental Evaluations on a quad-socket 72-core Intel Xeon processor
  - ✓ Using several exascale proxy applications (OpenMP, Kokkos and HCLib)
- ✓ Results
  - ✓ Demonstrating substantially energy savings and performance gains

## High-level Architecture of Harmonizer



## High-level Architecture of Harmonizer

A

B

C



Harmonizer  
Daemon  
Process

# High-level Architecture of Harmonizer



# High-level Architecture of Harmonizer



# High-level Architecture of Harmonizer



# High-level Architecture of Harmonizer



# Harmonizer Policy



- Classify each application's Memory Access Pattern (**MAP**)
  - Core-level PMCs
    - Cache misses
    - Cache accesses
  - Uncore PMCs (Socket-level)
    - Integrated Memory Controller (IMC) accesses

# Harmonizer Policy



A

B

C

- Classify each application's Memory Access Pattern (**MAP**)
  - Core-level PMCs
    - Cache misses
    - Cache accesses
  - Uncore PMCs (Socket-level)
    - Integrated Memory Controller (IMC) accesses

# Harmonizer Policy



- Classify each application's Memory Access Pattern (**MAP**)
  - Core-level PMCs
    - Cache misses
    - Cache accesses
  - Uncore PMCs (Socket-level)
    - Integrated Memory Controller (IMC) accesses

# Harmonizer Policy

Identify MAP of each Co-Runners



Set Optimal Thread Placement



Explore UFS



Explore DVFS



Explore DCT



Optimal placement for a particular mix

**Cache Sensitive – Neutral – Cache Sensitive**

# Harmonizer Policy

Identify MAP of each Co-Runners



Set Optimal Thread Placement



Explore UFS



Explore DVFS



Explore DCT



Optimal placement for a particular mix

**Cache Sensitive – Neutral – Cache Sensitive  
(Block-Cyclic to minimize LLC sharing)**

# Harmonizer Policy

Identify MAP of each Co-Runners



Set Optimal Thread Placement



Explore UFS



Explore DVFS



Explore DCT



**UFS is used to explore optimal UF**

# Harmonizer Policy

Identify MAP of each Co-Runners



Set Optimal Thread Placement



Explore UFS



Explore DVFS



Explore DCT



**UFS exploration possible only over two sockets in this mix because UFS can be applied at socket-level**

# Harmonizer Policy

Identify MAP of each Co-Runners



Set Optimal Thread Placement



Explore UFS



Explore DVFS



Explore DCT



Harmonizer rearranges threads over sockets to maximize application isolation while retaining the behaviour of Block-cyclic placement

# Harmonizer Policy

Identify MAP of each Co-Runners



Set Optimal Thread Placement



Explore UFS



Explore DVFS



Explore DCT



**UFS exploration now possible on three sockets instead of two**

# Harmonizer Policy

Identify MAP of each Co-Runners



Set Optimal Thread Placement



Explore UFS



Explore DVFS



Explore DCT



UFS exploration now possible on three sockets instead of two

**Reduced exploration space based on MAP identified over each socket**

# Harmonizer Policy

Identify MAP of each Co-Runners



Set Optimal Thread Placement



Explore UFS



Explore DVFS



Explore DCT



DVFS is used to explore optimal CF for each application

# Harmonizer Policy

Identify MAP of each Co-Runners



Set Optimal Thread Placement



Explore UFS



Explore DVFS



Explore DCT



DVFS is used to explore optimal CF for each application

**Reduced exploration space based on MAP identified for each application**

# Harmonizer Policy

Identify MAP of each Co-Runners



Set Optimal Thread Placement



Explore UFS



Explore DVFS



Explore DCT



**Uniform DVFS settings on each core of sockets hosting a single application**

# Harmonizer Policy

Identify MAP of each Co-Runners



Set Optimal Thread Placement



Explore UFS



Explore DVFS



Explore DCT



**Non-uniform DVFS setting at socket hosting multiple application's threads**

# Harmonizer Policy



# Harmonizer Policy



# Experimental Methodology

## Exascale proxy applications

| Type of Applications | Application                                                 |
|----------------------|-------------------------------------------------------------|
| Cache Sensitive      | SimpleMOC (OpenMP)<br>MinTally (OpenMP)<br>XSbench (OpenMP) |
| Memory Bound         | HPCCG (OpenMP)<br>MiniFE ( <b>Kokkos</b> )                  |
| Neutral              | CoHMM ( <b>HCLib</b> )<br>CoMD (OpenMP)                     |

# Experimental Methodology

## Exascale proxy applications

| Type of Applications | Application                                                  |
|----------------------|--------------------------------------------------------------|
| Cache Sensitive      | SimpleMOC (OpenMP)<br>MinTally (OpenMP)<br>XS Bench (OpenMP) |
| Memory Bound         | HPCCG (OpenMP)<br>MiniFE ( <b>Kokkos</b> )                   |
| Neutral              | CoHMM ( <b>HCLib</b> )<br>CoMD (OpenMP)                      |

| Number of Applications in a Mix | Number of Mixes |
|---------------------------------|-----------------|
| 3                               | 6               |
| 4                               | 3               |

# Experimental Methodology

## Exascale proxy applications

| Type of Applications | Application                                                  |
|----------------------|--------------------------------------------------------------|
| Cache Sensitive      | SimpleMOC (OpenMP)<br>MinTally (OpenMP)<br>XS Bench (OpenMP) |
| Memory Bound         | HPCCG (OpenMP)<br>MiniFE ( <b>Kokkos</b> )                   |
| Neutral              | CoHMM ( <b>HCLib</b> )<br>CoMD (OpenMP)                      |

| Number of Applications in a Mix | Number of Mixes |
|---------------------------------|-----------------|
| 3                               | 6               |
| 4                               | 3               |

## Hardware Platform

- Quad socket Intel Xeon 5318H Cooper Lake
- 18 cores per socket, Total 72 cores (144 CPUs)

# Experimental Methodology

## Exascale proxy applications

| Type of Applications | Application                                                  |
|----------------------|--------------------------------------------------------------|
| Cache Sensitive      | SimpleMOC (OpenMP)<br>MinTally (OpenMP)<br>XS Bench (OpenMP) |
| Memory Bound         | HPCCG (OpenMP)<br>MiniFE ( <b>Kokkos</b> )                   |
| Neutral              | CoHMM ( <b>HCLib</b> )<br>CoMD (OpenMP)                      |

| Number of Applications in a Mix | Number of Mixes |
|---------------------------------|-----------------|
| 3                               | 6               |
| 4                               | 3               |

## State-of-the-Art used for comparison

- Mapper (TACO'22)
- NuPoCo (PACT'18)

## Hardware Platform

- Quad socket Intel Xeon 5318H Cooper Lake
- 18 cores per socket, Total 72 cores (144 CPUs)

# EDP of Harmonizer Relative to Default



# EDP of Harmonizer Relative to Default



| Harmonizer Policy | Mean EDP Improvement (Mix1- Mix3) |
|-------------------|-----------------------------------|
| Thread Placement  | 7.3%                              |
| UFS               | 3%                                |
| DVFS              | N/A                               |
| DCT               | N/A                               |

Improvement in EDP from individual policies

# EDP of Harmonizer Relative to Default



| Harmonizer Policy | Mean EDP Improvement (Mix4- Mix6) |
|-------------------|-----------------------------------|
| Thread Placement  | 26.6%                             |
| UFS               | N/A                               |
| DVFS              | 3%                                |
| DCT               | 13.7%                             |

Improvement in EDP from individual policies

# EDP of Harmonizer Relative to Default



| Harmonizer Policy | Mean EDP Improvement (Mix7- Mix9) |
|-------------------|-----------------------------------|
| Thread Placement  | 14.6%                             |
| UFS               | N/A                               |
| DVFS              | 3.2%                              |
| DCT               | 9.3%                              |

Improvement in EDP from individual policies

# System Throughput Relative to Default



## System Throughput

Geometric mean of speedup  
of each application

## Summary

- Effective system utilization is key to improving energy efficiency in the exascale era
  - Co-running applications can improve system utilization by complementing each other's resource requirements
- Harmonizer dynamically profiles the core and uncore PMCs to characterize the behaviour of co-running applications
  - It applies optimal thread placement for improving the system utilization
  - Dynamically tunes each socket's core and uncore frequencies, and application level core allocation to enhance energy efficiency
- Future Work
  - We plan to extend Harmonizer to handle dynamically varying memory access patterns in applications and scale it to cluster-level environments

# Thank You



Scan to access the Harmonizer artifact