



## Toward Smart HPC via Intelligent Scheduling

Zhiling Lan

[lan@iit.edu](mailto:lan@iit.edu)

Department of Computer Science, Illinois Institute of Technology

*Keynote at the ICPP 2020 SRMDPS Workshop*



## Outline

- What is smart HPC?
- What is intelligent (batch) scheduling?
- Two research projects
  - Power aware scheduling
  - Multi-resource scheduling
- Open issues

ILLINOIS INSTITUTE OF TECHNOLOGY

## HPC Trends: Workload

The diagram shows three curved arrows converging: an orange arrow labeled 'Big Data' from the left, a grey arrow labeled 'Machine Learning' from the top, and a blue arrow labeled 'HPC' from the right. Below the arrows are icons for a smartphone, a server tower, and a lightbulb.

The poster features a sunset background with the text: 'The Convergence of High Performance Computing, Big Data, and Machine Learning A NITRD Workshop'. Below it, the date 'October 29-30, 2018' and location 'Natcher Conference Center National Institutes of Health, Bethesda, MD' are mentioned.

The diagram illustrates the architecture of a Large Scale System. It consists of five labeled blocks: 'Large Scale System' (blue), 'System Software' (yellow), 'Machine Learning Applications' (blue), 'Big Data Applications' (blue), and 'Simulation Applications' (blue).

Z. Lan 3


## HPC Trends: Platform

The diagram shows a cross-section of a chip. On the left, 'Intel Processor Graphics Gen8 Graphics, Compute, & Media' is shown. To its right are two 'CPU core' units and a 'System Agent'. Below them is a 'Shared LLC' (Last Level Cache). The entire structure is labeled 'Intel Heterogeneous Computing Pipeline'.

This slide details Intel's roadmap for datacenter CPUs. It includes the following components and timelines:

- **2019:** CASCADE LAKE (14nm, INTEL DL BOOST (VNNI), NEW MEMORY STORAGE HIERARCHY)
- **2020:** COOPER LAKE (14nm, INTEL DL BOOST (BFLOAT16), VOLUME RAMP 2H20)
- **2021:** SAPPHIRE RAPIDS (NEXT GENERATION TECHNOLOGIES)

The slide also highlights 'THE ONLY DATACENTER CPU OPTIMIZED FOR CONVERGENCE' with features like INTEL ADVANCED VECTOR EXTENSIONS 512, INTEL DEEP LEARNING BOOST (INTEL DL BOOST), and INTEL OPTANE DC PERSISTENT MEMORY.

**Aurora: Bringing It All Together**

This section shows the architecture of the Aurora supercomputer, featuring 2 INTEL XEON SCALABLE PROCESSORS ('Sapphire Rapids'), 6 Xe ARCHITECTURE BASED GPUs ('Ponte Vecchio'), and a unified programming model ('oneAPI'). It highlights 'LEADERSHIP PERFORMANCE For HPC, data analytics, AI', 'UNIFIED MEMORY ARCHITECTURE', 'POWER EFFICIENCY', and 'ALL-TO-ALL CONNECTIVITY WITHIN NODE'.

**AMD Exascale Strategy Hinges on Heterogeneity**

The diagram illustrates AMD's exascale strategy. It shows a network of cabinets connected to a central node, and a detailed view of a single node's internal components: DRAM, GPU, and APU chip. Labels include 'Network', 'Cabinet', 'Computing node', 'TOM switch', 'IO nodes, management nodes, etc.', 'Off-package NVRAM (multiple modules)', 'GPU cores', 'CPU cores', and 'Die-stacked DRAM'.








## Power Awareness

- Actively observes, analyzes, and assesses power behaviors of the system and user jobs
  - To control system wide power consumption
  - To minimize impact on system utilization
  - To be applicable to general HPC applications
- Three components:
  1. Power analysis
  2. Dynamic power learning
  3. Power aware scheduling

S. Wallace, X. Yang, V. Vishwanath, W. Allcock, S. Coghlan, M. Papka, and Z. Lan,  
 "A Data Driven Scheduling Approach for Power Management on HPC Systems", SC'16.




## Workload Power Analysis

- Study power data from the Mira environmental database in 2014
  - Statistics of job power profiles, in kW per rack

| Statistic          | kW/rack |
|--------------------|--------|
| Minimum            | 36.48  |
| Mean               | 65.67  |
| Maximum            | 160.60 |
| Percentile 05      | 56.60  |
| Percentile 25      | 61.20  |
| Percentile 75      | 71.03  |
| Percentile 95      | 74.57  |
| Percentile 99      | 77.99  |
| Standard Deviation | 6.18   |

HPC jobs have distinct power profiles with the difference being as high as 4.4 times!






## Illustrative Example

| Job                                | Job Size (Racks) | Power Profile (kW/rack) |
|------------------------------------|------------------|-------------------------|
| J<sub>0</sub> (head of wait queue) | 3                | 60                      |
| J<sub>1</sub>                      | 1                | 50                      |
| J<sub>2</sub>                      | 5                | 30                      |
| J<sub>3</sub>                      | 4                | 40                      |
- The jobs are allocated onto a 6-rack system with a total power cap of 230 kW.
  - Conventional FCFS always selects $\langle J_0, J_1 \rangle$, leaving 2 racks unused.
  - Our approach selects $\langle J_1, J_2 \rangle$, as this maximizes utilization under the power cap.
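
The arithmetic behind the two selections can be checked directly. This assumes a job's power draw is its size in racks times its per-rack power profile, which matches the table's units but is not stated explicitly on the slide:

$$\begin{aligned} \langle J_0, J_1 \rangle &: \quad 3 + 1 = 4 \text{ racks}, \quad 3 \cdot 60 + 1 \cdot 50 = 230 \text{ kW} \\ \langle J_1, J_2 \rangle &: \quad 1 + 5 = 6 \text{ racks}, \quad 1 \cdot 50 + 5 \cdot 30 = 200 \text{ kW} \end{aligned}$$

Under that assumption, FCFS hits the 230 kW cap with only 4 of the 6 racks busy, while $\langle J_1, J_2 \rangle$ fills all 6 racks and stays 30 kW below the cap.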




## Dynamic Learner

- Takes power data from power monitoring facilities
- At each scheduling instance, the learner has two tasks
  - Estimate peak power requirement of **each job in the queue**
  - Calculate **the available power budget for incoming jobs** by measuring power usage of running jobs




## Estimating Job Power Profile

- Two possibilities
  - No power information for the job
    - If the group's power profile is known, use group power profile
    - Otherwise, assume the maximum
  - Previous power data available
    - Use job's previous profile as its current power profile
- Our learner continuously updates the power profiles of jobs and groups using a **two-sample t-test**
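
The fallback order above, together with the learner's power-budget calculation, can be sketched as follows. This is a minimal illustration, not the SC'16 implementation: the function names, the dictionary layout, and all numeric values are hypothetical, and the t-test-based profile update is deliberately left out because the slides do not describe how it is applied.

```python
def estimate_job_power(job_key, group_key, job_profiles, group_profiles, max_profile):
    """Estimate a job's power profile using the slide's fallback order."""
    if job_key in job_profiles:        # previous power data available for this job
        return job_profiles[job_key]
    if group_key in group_profiles:    # no job data, but the group's profile is known
        return group_profiles[group_key]
    return max_profile                 # nothing known: assume the maximum


def available_power_budget(power_cap, running_job_power):
    """Budget left for incoming jobs: the cap minus measured power of running jobs."""
    return power_cap - sum(running_job_power)


# Made-up history for illustration only (kW per rack).
job_profiles = {"jobA": 62.0}
group_profiles = {"groupX": 58.5}
MAX_PROFILE = 80.0  # hypothetical worst-case profile

print(estimate_job_power("jobA", "groupX", job_profiles, group_profiles, MAX_PROFILE))
print(estimate_job_power("jobB", "groupX", job_profiles, group_profiles, MAX_PROFILE))
print(estimate_job_power("jobC", "groupY", job_profiles, group_profiles, MAX_PROFILE))
print(available_power_budget(230, [180, 30]))
```

Assuming the maximum for unknown jobs is the conservative choice: it may leave some power unused, but it avoids exceeding the cap because of an underestimate.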




## Power Aware Scheduling

- Selects jobs in the queue for execution
- Two major parts:
  - In contrast to one-by-one scheduling approach, we adopt a **window-based approach**
  - A **0-1 knapsack** problem is formulated to describe power aware scheduling

$$\begin{aligned} \max \quad & \sum_{1 \leq i \leq k} x_i \cdot n_i \\ \text{s.t.} \quad & \sum_{1 \leq i \leq k} x_i \cdot n_i \leq N - N_{used} \\ & \sum_{1 \leq i \leq k} x_i \cdot p_i \leq PB - P_{used} \\ & x_i = 0 \text{ or } 1 \end{aligned}$$




## Evaluation

- Trace-based simulation
  - A stream version of CQSim (<https://github.com/SPEAR-IIT/CQSim>)
- Workload traces
  - Mira 2014 traces: workload log & CMCS power data
- Metrics:
  1. Learning accuracy
  2. Capping success rate
  3. Scheduling performance




## Learning Accuracy



94% accurate after just 26 days of execution.










## Existing Solutions

- **Naïve**: run jobs in sequence (Slurm)
- **Constrained**: maximize utilization of one resource (SC'16)
- **Weighted**: maximize the combination of utilizations of multiple resources (ICDCS'12)
- **Bin Packing**: select big jobs iteratively (SIGCOMM'14)




## Illustrative Example


- J1: <6 CPUs, 1 BB>
- J2: <2 CPUs, 1 BB>
- J3: <3 CPUs, 5 BBs>
- J4: <3 CPUs, 3 BBs>

Job Waiting Queue  
J4 J3 J2 J1 → Scheduler → HPC System

| Solution | Methods                            | CPU Util. | BB Util. |
|----------|------------------------------------|-----------|----------|
| 1        | Naive                              | 89%       | 22%      |
| 2        | Constrained, Weighted, Bin Packing | 100%      | 67%      |
| 3        | new solution                       | 89%       | 100%     |




## Pareto Set

| Solution | CPU Util. | BB Util. |
|----------|-----------|----------|
| 1        | 89%       | 22%      |
| 2        | 100%      | 67%      |
| 3        | 89%       | 100%     |




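- **S2 and S3 form the Pareto set**: neither beats the other on both resources, and S1 is dominated by both

- **S2:** higher CPU utilization; lower BB utilization
- **S3:** higher BB utilization; lower CPU utilization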
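
The Pareto set for the four-job example can be reproduced by enumerating every feasible job selection and discarding the dominated ones. This is a sketch under one assumption that is not stated on the slides: the system is taken to have 9 CPUs and 9 BBs, which is the capacity consistent with the reported utilizations (for example, J1 and J2 together use 8 CPUs and 2 BBs, matching 89% and 22%). It also ignores job ordering.

```python
from itertools import combinations

JOBS = {"J1": (6, 1), "J2": (2, 1), "J3": (3, 5), "J4": (3, 3)}  # (CPUs, BBs)
CAPACITY = (9, 9)  # assumed: inferred from the utilization figures, not given


def feasible_selections(jobs, capacity):
    """All non-empty job subsets that fit within both resource capacities."""
    out = []
    names = sorted(jobs)
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            cpu = sum(jobs[j][0] for j in combo)
            bb = sum(jobs[j][1] for j in combo)
            if cpu <= capacity[0] and bb <= capacity[1]:
                out.append((combo, cpu, bb))
    return out


def pareto_set(selections):
    """Keep selections that no other selection dominates on (CPU, BB) usage."""
    def dominated(a, b):
        # b is at least as good as a on both resources and strictly better on one
        return b[1] >= a[1] and b[2] >= a[2] and (b[1] > a[1] or b[2] > a[2])
    return [s for s in selections if not any(dominated(s, t) for t in selections)]


front = pareto_set(feasible_selections(JOBS, CAPACITY))
for combo, cpu, bb in front:
    print(combo, f"CPU {cpu / CAPACITY[0]:.0%}", f"BB {bb / CAPACITY[1]:.0%}")
# ('J1', 'J3') CPU 100% BB 67%
# ('J2', 'J3', 'J4') CPU 89% BB 100%
```

The two surviving selections correspond to solutions 2 and 3 in the table; the naive choice of J1 and J2 is dominated.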



## BBSched for Multi-Resource Scheduling

- Goal:
  - Identify the Pareto set of scheduling solutions
- Key components:
  - **Window-based** scheduling to preserve job ordering
  - Formulate as ***multi-objective optimization (MOO)*** problem
  - Rapidly solve the problem via ***genetic algorithm***





## Solving MOO

- Genetic algorithm
  - Approximate true Pareto set iteratively
  - Require much less time than exhaustive search
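
To make the idea concrete, here is a generic genetic-algorithm sketch for approximating the Pareto set of job selections over two resources. It is not BBSched's actual algorithm: the bit-string encoding, one-point crossover, single-bit mutation, random repair step, and elitist survivor selection are all illustrative assumptions, as are the function names and the 9 CPU / 9 BB capacity used in the usage lines.

```python
import random


def evaluate(bits, jobs):
    """Total (CPU, BB) demand of the jobs whose bit is set."""
    cpu = sum(c for b, (c, _) in zip(bits, jobs) if b)
    bb = sum(s for b, (_, s) in zip(bits, jobs) if b)
    return cpu, bb


def repair(bits, jobs, cap, rng):
    """Drop randomly chosen jobs until the selection fits both capacities."""
    bits = list(bits)
    while True:
        cpu, bb = evaluate(bits, jobs)
        if cpu <= cap[0] and bb <= cap[1]:
            return tuple(bits)
        selected = [i for i, b in enumerate(bits) if b]
        bits[rng.choice(selected)] = 0


def dominates(a, b):
    """True if objective vector a is at least as good as b and not identical."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b


def non_dominated(pop, jobs):
    """Unique individuals whose objectives no other individual dominates."""
    scored = {p: evaluate(p, jobs) for p in set(pop)}
    return [p for p, f in scored.items()
            if not any(dominates(g, f) for g in scored.values())]


def ga_pareto(jobs, cap, generations=500, pop_size=20, seed=0):
    """Approximate the Pareto set of job selections (needs at least 2 jobs)."""
    rng = random.Random(seed)
    k = len(jobs)
    pop = [repair([rng.randint(0, 1) for _ in range(k)], jobs, cap, rng)
           for _ in range(pop_size)]
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            a, b = rng.choice(pop), rng.choice(pop)
            cut = rng.randrange(1, k)              # one-point crossover
            child = list(a[:cut] + b[cut:])
            child[rng.randrange(k)] ^= 1           # single bit-flip mutation
            children.append(repair(child, jobs, cap, rng))
        merged = pop + children
        front = non_dominated(merged, jobs)        # elitism: keep the current front
        rest = [p for p in merged if p not in front]
        rng.shuffle(rest)
        pop = (front + rest)[:pop_size]
    return sorted(non_dominated(pop, jobs))


# Usage on the four-job example: (CPUs, BBs) per job, assumed 9/9 capacity.
jobs = [(6, 1), (2, 1), (3, 5), (3, 3)]
for bits in ga_pareto(jobs, (9, 9), generations=50):
    print(bits, evaluate(bits, jobs))
```

Each generation evaluates only `pop_size` new candidates, so the total work grows with the number of generations times the population size rather than with the $2^k$ subsets that exhaustive search must visit.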



## Scheduling Decision

- Select a solution out of the Pareto set

| Solution | CPU Utilization | BB Utilization  |
|----------|-----------------|-----------------|
| S1       | 80% <b>10%↓</b> | 70% <b>40%↑</b> |
| S2       | 50% <b>40%↓</b> | 90% <b>60%↑</b> |
| S3       | 70% <b>20%↓</b> | 80% <b>50%↑</b> |
| S4       | 90%             | 30%             |

- S4: **Maximum CPU utilization** (the arrows in the table are changes relative to S4)
- S1 and S3: BB gain is more than 2× the CPU loss
- S3: **Maximum BB improvement** among S1 and S3
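
One reading of this decision rule can be sketched as follows. The function name and the `factor` parameter are hypothetical, and the rule itself is inferred from the slide's annotations rather than taken from the BBSched code: start from the solution with maximum CPU utilization, keep only alternatives whose BB gain exceeds twice their CPU loss, and pick the one with the largest BB gain.

```python
def choose_solution(pareto, factor=2.0):
    """Pick one solution from a Pareto set.

    pareto: dict mapping solution name -> (cpu_util, bb_util) in percent.
    """
    base_name = max(pareto, key=lambda s: pareto[s][0])   # maximum CPU utilization
    base_cpu, base_bb = pareto[base_name]
    best_name, best_gain = base_name, 0.0
    for name, (cpu, bb) in pareto.items():
        cpu_loss = base_cpu - cpu
        bb_gain = bb - base_bb
        # Accept the trade only if the BB gain outweighs the CPU loss by `factor`.
        if bb_gain > factor * cpu_loss and bb_gain > best_gain:
            best_name, best_gain = name, bb_gain
    return best_name


pareto = {"S1": (80, 70), "S2": (50, 90), "S3": (70, 80), "S4": (90, 30)}
print(choose_solution(pareto))  # S3
```

With the table's values, S2 is rejected because its 60-point BB gain does not exceed twice its 40-point CPU loss, and S3 wins over S1 on BB gain (50 versus 40).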



## Evaluation

- Trace-based simulation via **CQSim**  
<https://github.com/SPEAR-IIT/CQSim>
- Workload traces (Cori@LBNL, Theta@ALCF)

|                     | Cori                                 | Theta                 |
|---------------------|--------------------------------------|-----------------------|
| Location            | NERSC                                | ALCF                  |
| Scheduler           | Slurm                                | Cobalt                |
| System Types        | Capacity computing                   | Capability computing  |
| Compute Nodes       | 12,076<br>(2,388 Haswell; 9,688 KNL) | 4,392<br>(4,392 KNL)  |
| Aggregated Memory   | 1,304.5TB                            | 913.5TB               |
| Shared Burst Buffer | 1.8PB                                | 1.26PB (projected)    |
| Trace Period        | Apr. 2018 - Jul. 2018                | Jan. 2018 - May. 2018 |
| Number of Jobs      | 2,607,054                            | 70,507                |
| BB Data Source      | Slurm log                            | Darshan log           |
| BB Range            | [1GB, 165TB]                         | [1GB, 285TB]          |




## Workloads

- Insufficient BB requests, as BB is new
- No records on some requests
- S1 → S4: weak to strong resource conflict

| Workload | % of jobs requesting BB | BB Range |
|----------|-------------------------|----------|
| S1       | 50%                     | 5TB+     |
| S2       | 75%                     | 5TB+     |
| S3       | 50%                     | 20TB+    |
| S4       | 75%                     | 20TB+    |




## Comparison

- Methods:
  - **Baseline**: Naïve method, no optimization
  - **Weighted**: 50% node, 50% BB
  - **Weighted\_CPU**: 80% node, 20% BB
  - **Weighted\_BB**: 20% node, 80% BB
  - **Constrained\_CPU**: maximize node utilization
  - **Constrained\_BB**: maximize BB utilization
  - **Bin\_Packing**
  - **BBSched**: window size $w = 20$, generations $G = 500$, population size $P = 20$
- Evaluation Metrics
  - **Resource Utilization**: node, BB
  - **Average Job Wait Time**




## Node Utilization



**BBSched** improves node utilization by up to 20%

**Constrained\_CPU** has good performance on Original, S1 and S2

S1→S4, **node utilization ↓** due to intensive BB requests









## Open Issues

- Emerging latency-sensitive, short-running, malleable, ensemble-based jobs
  - Borrow techniques from cloud scheduling
  - E.g., hierarchical approach (Mesos)
- Hybrid nodes (CPUs & accelerators)
  - Fine-grained job scheduling
- Existing approaches are heuristic- or optimization-based
  - Exploit advanced reinforcement learning (RL) to improve scheduling efficiency
- Finally, data collection and real-time processing for driving intelligent scheduling




## SPEAR Research Group



Welcome to the SPEAR (Systems for Performance, Energy, And Resiliency) team's web page!

The team conducts research spanning various areas of parallel and distributed systems including cluster management, interconnection networking, performance modeling and simulation, power and energy efficiency, and fault tolerance. Our mission is to design scalable methods and software for large-scale HPC, AI, and data analysis. The team has a strong collaboration with the ALCF and MCS divisions at Argonne National Lab.




## Collaborators at Argonne

The grid contains 15 individual portraits of diverse professionals, likely researchers or faculty members, from the Argonne National Laboratory. They are arranged in three horizontal rows. The first row has five people, the second row has five people, and the third row has five people. The individuals are dressed in various professional attire, including blazers, shirts, and glasses.


A simple cartoon illustration of a person with dark hair and a thoughtful expression, looking upwards. A thought bubble above their head contains a large black question mark, symbolizing inquiry or a question.

Contact: Zhiling Lan  
Email: [lan@iit.edu](mailto:lan@iit.edu)  
Website: <http://www.cs.iit.edu/~lan>