

# **Exploiting Asymmetric Systems with Flexible System Software**



**Kallia Chronaki**

Department of Computer Architecture  
Universitat Politècnica de Catalunya

This dissertation is submitted for the degree of

*Doctor of Philosophy*

June 2018

BLANK

## **ACKNOWLEDGEMENTS**

---

This work has been supported by the grant SEV-2011-00067 of the Severo Ochoa Program, awarded by the Spanish Government, by the Mont-Blanc 2 project (FP7-610402), by the Ro-MoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Science and Innovation (contracts TIN2012-34557, CAC2007-00052 and TIN2015-65316-P), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), by the European Union's Horizon 2020 research and innovation programme under grant agreement No 671697 and No. 779877 and by the Spanish Government (SEV2015-0493). The Mont-Blanc project receives funding from the EU's Seventh Framework Programme (FP7/2007-2013) under grant agreement nº 610402 and from the EU's H2020 Framework Programme (H2020/2014-2020) under grant agreement nº 671697.



## ABSTRACT

---

Asymmetric multi-cores (AMCs) are a successful architectural solution for both mobile devices and supercomputers. These architectures combine different types of processing cores designed at different performance and power optimization points, thus exposing a performance-power trade-off. By maintaining two types of cores (fast and slow) AMCs have the potential to provide high performance under the facility power budget. However, there are significant challenges when using AMCs such as scheduling and load balancing.

In this thesis we initially explore the potential of these machines when executing current high-performance applications and we search for the most appropriate execution model. Specifically we evaluate several execution models on an ARM big.LITTLE AMC using the PARSEC benchmark suite that includes representative highly parallel applications. We compare schedulers at the user, OS and runtime system levels, using both static and dynamic options and multiple configurations, and assess the impact of these options on the well-known problem of balancing the load across AMCs. Our results demonstrate that scheduling is more effective when it takes place in the runtime system level as it improves the user-level scheduling by 23%; at the same time the heterogeneous-aware OS scheduling solution improves the user-level scheduling by 10%.

Following this outcome, this thesis focuses on increasing performance of AMC systems by improving scheduling in the runtime system level. Scheduling in the runtime system level is provided by the use of task-based parallel programming models. These programming models offer programmability and flexibility to the programmer as they consist of an interface and a runtime system to manage the underlying resources and threads. In this thesis we improve scheduling with task-based programming models by providing three novel task scheduling approaches for AMCs. These dynamic scheduling policies reduce total execution time either by detecting the longest or the critical path of the dynamic task dependency graph (TDG) of the application. They use dynamic scheduling and information discoverable during execution, fact that makes them implementable and functional without the need of off-line profiling. In our evaluation we compare these scheduling approaches

with an existing state-of the art heterogeneous scheduler and we track their improvement over a FIFO baseline scheduler. We show that the heterogeneous schedulers improve the baseline by up to  $1.45\times$  on a real 8-core asymmetric system and up to  $2.1\times$  on a simulated 32-core asymmetric chip.

Another enhancement we provide in task-based programming models is the adaptability to fine grained parallelism. The increasing number of cores on modern CMPs is pushing research towards the use of fine grained workloads, which is an important challenge for task-based programming models. Our study makes the observation that task creation becomes a bottleneck when executing fine grained workloads with task-based programming models. As the number of cores increases, the time spent generating the tasks of the application is becoming more critical to the entire execution. To overcome this issue, we propose TaskGenX. TaskGenX offers a solution for minimizing task creation overheads and relies both on the runtime system and a dedicated hardware. On the runtime system side, TaskGenX decouples the task creation from the other runtime activities. It then transfers this part of the runtime to a specialized hardware. We draw the requirements for this hardware in order to boost execution of highly parallel applications. From our evaluation using 11 parallel workloads on both symmetric and asymmetric systems, we obtain performance improvements up to  $15\times$ , averaging to  $3.1\times$  over the baseline.

Finally, this thesis presents a showcase for a real-time CPU scheduler with the goal to increase the frames per second (FPS) during the game-play on mobile devices using AMC systems. We design and implement the RTS scheduler in the Android framework. RTS provides an efficient scheduling policy that takes into account the current temperature of the system to perform task migration. RTS solution increases the median FPS of the baseline mechanisms by up to 7.5% and at the same time it maintains temperature stable.

# **Exploiting Asymmetric Systems with Flexible System Software**

by  
Kallia Chronaki

A Dissertation  
Presented to the Department of Computer Architecture  
at  
Universitat Politècnica de Catalunya  
in Candidacy for the Degree of  
Doctor of Philosophy.

Thesis Advisors:

Rosa M. Badia, Prof.  
Universitat Politècnica de Catalunya, Spain

Marc Casas, Prof.  
Universitat Politècnica de Catalunya, Spain

Barcelona, July 2018

BLANK

# TABLE OF CONTENTS

---

|                                                      |             |
|------------------------------------------------------|-------------|
| <b>Table of contents</b>                             | <b>ix</b>   |
| <b>List of figures</b>                               | <b>xiii</b> |
| <b>List of tables</b>                                | <b>xvii</b> |
| <b>Nomenclature</b>                                  | <b>xx</b>   |
| <b>1 Introduction</b>                                | <b>1</b>    |
| 1.1 Thesis Motivation . . . . .                      | 4           |
| 1.2 Thesis Contributions . . . . .                   | 5           |
| 1.3 Thesis Organization . . . . .                    | 8           |
| <b>2 Related Work</b>                                | <b>11</b>   |
| 2.1 Asymmetric Multi-Core Systems . . . . .          | 13          |
| 2.2 Scheduling for Heterogeneous Systems . . . . .   | 13          |
| 2.2.1 Process Scheduling . . . . .                   | 13          |
| 2.2.2 Task Scheduling . . . . .                      | 14          |
| 2.3 Hardware-Software co-design . . . . .            | 17          |
| <b>3 Experimental Environment</b>                    | <b>19</b>   |
| 3.1 The ARM big.LITTLE Architecture . . . . .        | 20          |
| 3.2 The TaskSim Simulator . . . . .                  | 22          |
| 3.3 Task-Based Parallel Programming Models . . . . . | 22          |
| 3.3.1 OmpSs Programming Model . . . . .              | 23          |
| 3.4 Applications . . . . .                           | 25          |

|                                                                            |           |
|----------------------------------------------------------------------------|-----------|
| <b>4 Study of Asymmetric Systems</b>                                       | <b>29</b> |
| 4.1 Introduction . . . . .                                                 | 30        |
| 4.2 Scheduling in Asymmetric Multi-Cores . . . . .                         | 31        |
| 4.2.1 Cluster Switching and In-Kernel Switch . . . . .                     | 31        |
| 4.2.2 Global Task Scheduling . . . . .                                     | 32        |
| 4.2.3 Dynamic Scheduling in the Runtime . . . . .                          | 32        |
| 4.3 Evaluation . . . . .                                                   | 33        |
| 4.3.1 Methodology . . . . .                                                | 33        |
| 4.3.2 Exploiting Parallelism in AMCs . . . . .                             | 37        |
| 4.3.3 Adding Little Cores to an SMC . . . . .                              | 40        |
| 4.3.4 Programming Models for AMCs . . . . .                                | 42        |
| 4.4 Conclusions . . . . .                                                  | 44        |
| <b>5 Task-Based Scheduling Solutions</b>                                   | <b>45</b> |
| 5.1 Introduction . . . . .                                                 | 46        |
| 5.2 Criticality-Aware Task Scheduler (CATS) . . . . .                      | 47        |
| 5.2.1 Task Prioritization . . . . .                                        | 48        |
| 5.2.2 Task Submission . . . . .                                            | 49        |
| 5.2.3 Task-to-Core Assignment . . . . .                                    | 53        |
| 5.3 Critical Path Scheduler (CPATH) . . . . .                              | 53        |
| 5.3.1 Task Prioritization . . . . .                                        | 54        |
| 5.3.2 Task Submission . . . . .                                            | 59        |
| 5.3.3 Task-to-Core Assignment . . . . .                                    | 59        |
| 5.4 Hybrid Criticality Scheduler (HYBRID) . . . . .                        | 59        |
| 5.5 Dynamic Heterogeneous Earliest Finish Time Scheduler (dHEFT) . . . . . | 61        |
| 5.6 Evaluation . . . . .                                                   | 63        |
| 5.6.1 Methodology . . . . .                                                | 63        |
| 5.6.2 Real Environment Evaluation . . . . .                                | 66        |
| 5.6.3 Simulations . . . . .                                                | 72        |
| 5.7 Conclusions . . . . .                                                  | 74        |
| <b>6 Runtime Overheads Migration</b>                                       | <b>77</b> |
| 6.1 Introduction . . . . .                                                 | 78        |
| 6.2 Background and Motivation . . . . .                                    | 80        |
| 6.2.1 Task-based Programming Models . . . . .                              | 80        |
| 6.2.2 Motivation . . . . .                                                 | 83        |
| 6.3 Task Generation Express (TaskGenX) . . . . .                           | 84        |

|                   |                                           |            |
|-------------------|-------------------------------------------|------------|
| 6.3.1             | Implementation . . . . .                  | 85         |
| 6.3.2             | Hardware Requirements . . . . .           | 86         |
| 6.4               | Evaluation . . . . .                      | 88         |
| 6.4.1             | Methodology . . . . .                     | 88         |
| 6.4.2             | Homogeneous Multicore Systems . . . . .   | 89         |
| 6.4.3             | Heterogeneous Multicore Systems . . . . . | 92         |
| 6.4.4             | Comparison to Other Approaches . . . . .  | 94         |
| 6.5               | Conclusions . . . . .                     | 95         |
| <b>7</b>          | <b>Real-Time Scheduling</b>               | <b>97</b>  |
| 7.1               | Background and Motivation . . . . .       | 98         |
| 7.2               | Real-Time Scheduling Solution . . . . .   | 99         |
| 7.3               | Evaluation . . . . .                      | 99         |
| 7.3.1             | Methodology . . . . .                     | 99         |
| 7.3.2             | Results . . . . .                         | 100        |
| 7.4               | Conclusions . . . . .                     | 103        |
| <b>8</b>          | <b>Conclusions</b>                        | <b>105</b> |
| 8.1               | Future Directions . . . . .               | 106        |
| <b>References</b> |                                           | <b>109</b> |



# LIST OF FIGURES

---

|     |                                                                                                                                                                                                                                                                   |    |
|-----|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----|
| 1.1 | Classification of multi core system architectures . . . . .                                                                                                                                                                                                       | 2  |
| 1.2 | Increased idle time when moving from homogeneous multi-core to asymmetric multi-core. The directions of the arrows indicate the data dependencies.                                                                                                                | 4  |
| 1.3 | Changing the scheduling leads to improvement for AMC systems. The directions of the arrows indicate the data dependencies. . . . .                                                                                                                                | 5  |
| 1.4 | Task generation overheads leads to increased idle time. Numbers inside the boxes show the task ID that is either being created or executed . . . . .                                                                                                              | 6  |
| 1.5 | Research areas studied in this thesis . . . . .                                                                                                                                                                                                                   | 8  |
| 3.1 | Samsung Exynos 5422 processor with ARM big.LITTLE architecture. . . . .                                                                                                                                                                                           | 21 |
| 4.1 | Ideal speedup over 1 little core according to Equation 4.7. Numbers at the bottom of x axis show the total number of cores, numbers above it show the number of big cores . . . . .                                                                               | 35 |
| 4.2 | Execution time speedup over 1 little core for systems that consist of 4 cores in total with 0, 2 and 4 big cores. Different schedulers at the application ( <i>static threading</i> ), OS ( <i>GTS</i> ) and runtime ( <i>task-based</i> ) levels are considered. | 37 |
| 4.3 | Average power measurements on a 4-core system with 0, 2, and 4 big cores.                                                                                                                                                                                         | 37 |
| 4.4 | Normalized energy consumption and average EDP on a 4-core system with 0, 2, and 4 big cores. Static threading on 4 little cores is the baseline in both cases. . . . .                                                                                            | 38 |
| 4.5 | Average results when running on 4 to 8 cores with 4 of them big. Speedup is over 1 little core. Static threading on 4 little cores is the baseline of energy consumption and EDP . . . . .                                                                        | 39 |
| 4.6 | Speedup over 1 little core when running on 4 to 8 cores and 4 of them are big                                                                                                                                                                                     | 40 |
| 4.7 | Average power when running on 4 to 8 cores and 4 of them are big . . . . .                                                                                                                                                                                        | 40 |

|      |                                                                                                                                                                                                                                                                                                                                                                     |    |
|------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----|
| 4.8  | Speedup over 1 little core when running on 4 to 8 cores and 4 of them are big. Four different programming models are considered: Static threading using <code>pthreads</code> , parallel loops with static scheduling (loop static), parallel loops with dynamic scheduling (loop dynamic), and a task-based solution with dynamic scheduling (task-based). . . . . | 43 |
| 5.1  | CATS overview. Nodes are marked with the <i>bottom level</i> of each task. Pattern-filled nodes mark the critical tasks. . . . .                                                                                                                                                                                                                                    | 49 |
| 5.2  | Task submission with CATS. Gray nodes indicate finished tasks and pattern-filled nodes indicate critical tasks. . . . .                                                                                                                                                                                                                                             | 51 |
| 5.3  | Priority assignment with CPATH taking into account the task costs. Task costs are assumed known and are shown in the tables. . . . .                                                                                                                                                                                                                                | 54 |
| 5.4  | Priority assignment with HYBRID scheduler. Priority update when the edge between tasks 12 and 13 is created . . . . .                                                                                                                                                                                                                                               | 61 |
| 5.5  | Cholesky factorization task dependency graphs. . . . .                                                                                                                                                                                                                                                                                                              | 63 |
| 5.6  | Task cost distribution for each application. Results are based on 4BIG-core executions. <i>x</i> axis shows the cost of the tasks and <i>y</i> axis shows the number of tasks with the corresponding task cost. . . . .                                                                                                                                             | 66 |
| 5.7  | Schedulers comparison on ODROID-XU3 . . . . .                                                                                                                                                                                                                                                                                                                       | 67 |
| 5.8  | Speedup of CATS, CPATH, HYBRID, dHEFT and BF on 8 cores. . . . .                                                                                                                                                                                                                                                                                                    | 69 |
| 5.9  | Average speedups obtained for each scheduler . . . . .                                                                                                                                                                                                                                                                                                              | 71 |
| 5.10 | Speedups obtained for each scheduler and each application . . . . .                                                                                                                                                                                                                                                                                                 | 72 |
| 5.11 | Improvement of heterogeneous schedulers over BF for simulated 16 and 32 core heterogeneous systems for Cholesky $32 \times 32$ . . . . .                                                                                                                                                                                                                            | 73 |
| 5.12 | Improvement of heterogeneous schedulers over BF for simulated 16 and 32 core heterogeneous systems for QR. . . . .                                                                                                                                                                                                                                                  | 74 |
| 5.13 | Improvement of heterogeneous schedulers over BF for simulated 16 and 32 core heterogeneous systems for heat diffusion. . . . .                                                                                                                                                                                                                                      | 75 |
| 6.1  | Master thread activity for Cholesky as we increase the number of cores. . .                                                                                                                                                                                                                                                                                         | 83 |
| 6.2  | Communication mechanism between master/workers and SRT threads. . . .                                                                                                                                                                                                                                                                                               | 86 |
| 6.3  | SoC architecture including three types of cores: out of order, in-order and RTopt. . . . .                                                                                                                                                                                                                                                                          | 87 |
| 6.4  | Speedup of TaskGenX compared to the speedup of Baseline and Baseline+RTopt for each application for systems with 8 up to 512 cores. The average results of (a) show the average among all workloads shown on (a) and (b) . . . . .                                                                                                                                  | 90 |

|     |                                                                                                                                                                                                                                                                                                                             |     |
|-----|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----|
| 6.5 | Canneal performance as we modify $r$ ; $x$ -axis shows the number of cores. . . . .                                                                                                                                                                                                                                         | 92  |
| 6.6 | Average speedup among all 11 workloads on heterogeneous simulated systems. The numbers at the bottom of $x$ axis show the total number of cores and the numbers above them show the number of big cores. Results are separated depending on the type of core that executes the master thread: a big or little core. . . . . | 93  |
| 6.7 | Average improvement over baseline; $x$ -axis shows the number of cores. . . . .                                                                                                                                                                                                                                             | 94  |
| 7.1 | Median frames per second (FPS) results for each game ( $x$ axis) and scheduling set-up (series). . . . .                                                                                                                                                                                                                    | 100 |
| 7.2 | Prescribed surface temperature (PST) results for each game ( $x$ axis) and scheduling set-up (series). . . . .                                                                                                                                                                                                              | 101 |
| 7.3 | FPS stability observed for each game ( $x$ axis) and scheduling set-up (series). . . . .                                                                                                                                                                                                                                    | 101 |
| 7.4 | Comparison of the FPS-PST trade-offs for different scheduling policies running on top of STC. . . . .                                                                                                                                                                                                                       | 102 |



# LIST OF TABLES

---

|     |                                                                                                                                                                                          |    |
|-----|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----|
| 4.1 | Benchmarks used from the PARSEC benchmark suite and their measured performance ratio between big and little cores . . . . .                                                              | 34 |
| 5.1 | Evaluated benchmarks and relevant characteristics. The inputs of QR and Heat diffusion are arrays of doubles and the inputs of Cholesky and Int. Histogram are arrays of floats. . . . . | 65 |
| 6.1 | Evaluated benchmarks and relevant characteristics . . . . .                                                                                                                              | 88 |



# NOMENCLATURE

---

2D SFLEX Bidirectional Work Stealing and Flexible

2DS STRICT Bidirectional Work Stealing and Strict

AMC Asymmetric Multi-Core System

BAR BSC Application Repository

BF Breadth-First Scheduler

BSC Barcelona Supercomputing Center

CATS Criticality-Aware Task Scheduler

CFS Completely Fair Scheduler

CPATH Critical Path-Aware Scheduler

CPI Cycles Per Instruction

dHEFT Dynamic Heterogeneous Earliest Finish Time Scheduler

EDP Energy Delay Product

FIFO First-In First-Out

FPS Frames Per Second

GTS Global Task Scheduler

HEFT Heterogeneous Earliest Finish Time Scheduler

HPC High Performance Computing

**HYBRID** Hybrid Criticality-Aware Scheduler

**ILP** Instruction Level Parallelism

**ILP** Instruction-Level Parallelism

**ISA** Instruction Set Architecture

**MLP** Memory-Level Parallelism

**OS** Operating System

**plist** Predecessors' List

**PST** Prescribed Surface Temperature

**RTS** Real-Time Scheduler

**SIMD** Single Instruction Multiple Data

**slist** Successors' List

**SMC** Symmetric Multi-Core System

**SoC** System on Chip

**SS FLEX** Simple Work Stealing and Flexible

**SS STRICT** Simple Work Stealing and Strict

**STC** Samsung Temperature Controller

**TaskGenX** Task Generation Express

**TDG** Task Dependency Graph

**tt-is** Task Type - Input Size

# CHAPTER 1

## INTRODUCTION

---

**Kallia: Definition of AMCs:** Asymmetric Multi-core (AMC) systems architecture is an interesting case of heterogeneous multi-core architecture. This multi-core systems architecture features cores with different microarchitectures but with a single Instruction Set Architecture (ISA) and potentially shared memory resources. Figure 1.1 shows a classification of multi-core systems and how AMCs relate to the rest of the multi-cores. Multi-core systems can be either homogeneous or heterogeneous. A homogeneous multi-core consists of multiple identical processing cores that have the same micro-architecture and instruction set architecture. A heterogeneous multi-core can fall into two categories. The first category consists of multi-cores with different micro-architectures and ISA; systems with GPUs or other compute accelerators fall in this category. The second category consists of multi-cores that have different micro-architecture but a single ISA; AMC systems fall in this category.

AMCs have been used in different areas of parallel computing. In recent years they made their appearance in the mobile market and are nowadays the most commonly used architecture for mobile processors. The most widely used AMC architecture for mobile processors is the Arm big.LITTLE architecture [47] which combines low-power simple in-order cores (*little*) with fast out-of-order cores (*big*) to achieve high performance while keeping power dissipation low. Mobile processors are also utilized in HPC platforms aiming to energy savings [62]. Another area where AMCs have been successful is the supercomputing market. The Sunway TaihuLight supercomputer topped the Top500<sup>1</sup> list in 2016 using AMCs. In this setup, big cores, that offer support for speculation to exploit Instruction-Level Parallelism (ILP), run the master tasks such as the OS and runtime system. Little cores are equipped

---

<sup>1</sup>We refer to the list published in November 2017



Figure 1.1 Classification of multi core system architectures

with wide Single Instruction Multiple Data (SIMD) units and lean pipeline structures for energy efficient execution of compute-intensive code.

**Kallia: Challenges Using AMCs:** AMCs are generally easy to program as the programmer does not need to express the functionality of an application for many different ISAs. However, there are still challenges that need to be addressed that are mainly related to the performance of the AMC. Like in other heterogeneous systems, load balancing and scheduling are fundamental challenges that must be addressed to effectively exploit all the resources in AMC platforms [43, 47, 54, 55, 80, 87]. To preserve load balance, the system has to make sure that the cores of the system remain busy and are fairly loaded depending on their type. For example running a barrier-based multi-threaded application where each thread runs on one core on an AMC might result in performance degradation as there is a high risk of starvation of the fast cores [69].

Additionally, choosing which task to execute in each type of core of an AMC is not as straightforward. Due to the different benefits offered by each core type, such decisions require runtime information, like task dependencies or whether a task is memory or computationally bound. An interesting approach against the above challenges is to take advantage of the heterogeneity of an AMC by providing novel asymmetry-aware schedulers. The state-of-the-art asymmetry-aware scheduling approaches perform thread scheduling within the Linux kernel. The most commonly used example in this category is the Global Task Scheduler (GTS) [32] that is implemented on top of the Linux Completely Fair Scheduler (CFS) and is aware of the asymmetry of the system. However, such approaches lack flexibility due to the high thread-migration overheads [69].

**Kallia: Solutions – mention to Task-based models and process scheduling** The above challenges can be tackled by using task-based parallel programming models. These parallel

---

```

1 #pragma omp task in(a) out(b)
2 void task1(int *a, int *b) {
3     *b = *a/2 + 3 * (*a);
4     return;
5 }
6 #pragma omp task in(c) inout(d)
7 void task2(int *c, int *d) {
8     *d = 3 * (*c) + (*d) / 2;
9     return;
10}
11 int main() {
12     int x = 10, y = 5;
13     for(int i = 0; i < 3; i++) {
14         task1(&x, &y);
15         task2(&y, &x);
16     }
17     return 0;
18 }
```

---

Listing 1.1 Example code using the OmpSs task-based programming model.

programming models have been widely used during the last decade in the development of parallel applications. They form an appealing solution as they significantly ease the parallel programming by providing a higher level of abstraction to the programmer through a directive-based or language-based interface. With these models challenges such as load balancing, scheduling, or thread migration costs are partially solved by using them out-of-the box. They allow the programmer to split the application in multiple work units called *tasks* (a function can be a task) that can potentially execute simultaneously<sup>2</sup>. As an example, Listing 1.1 shows a simple usage of a directive-based task-based programming model (the OmpSs programming model [9]). In this code, the *pragma omp task* directives define the functions that act as tasks, namely task1 and task2. The input and output dependencies between these tasks are derived at execution time from the *in*, *out* and *inout* directionality clauses. When these task functions are being called by the main function, the runtime system creates tasks that can run in parallel if there are no task dependencies between them. The runtime system of the task-based programming model is responsible of maintaining software threads and distribute the tasks to the appropriate thread for execution. Typically the number of software threads that the runtime system maintains is equal to the number of available cores in the system and one thread is bound to each core. We further explain how

---

<sup>2</sup>They can execute simultaneously given there are no task dependencies.



Figure 1.2 Increased idle time when moving from homogeneous multi-core to asymmetric multi-core. The directions of the arrows indicate the data dependencies.

the task-based runtime system functions in Section 3.3.

## 1.1 THESIS MOTIVATION

As was illustrated by the example of Listing 1.1 task-based programming models offer a very convenient abstraction layer to the programmer for parallelizing applications. However, these parallel programming languages exhibit some challenges when moving to asymmetric systems as their runtime system is platform agnostic. The current scheduling implementation in task-based programming models assumes that all tasks can be evenly distributed among the cores of the system and treats all tasks and cores as equal. In some cases, a bad scheduling decision on an asymmetric system can lead to significant performance degradation.

Figure 1.2 shows the execution representation of a simple example using a task-based programming model. This example consists of three types of tasks that have different execution times. In this example the tasks of type taskA have dependencies between them as the arrows indicate which means that the taskA tasks cannot execute simultaneously. The leftmost representation shows how a task-based programming model would execute this set of tasks on a homogeneous multi-core (where all cores are the same) while the representation on the right shows how the same example would execute on an asymmetric multi-core where the cores number three and four are faster than the cores number one and two. As we can see the idle time of the cores three and four is increased when using this scheduling but this is something that can be addressed if we provide a better scheduling policy in our task-based runtime system. Figure 1.3 shows one solution to improve this scheduling while respecting the inter-task dependencies.

Another challenge introduced by the task-based programming models is the significant



Figure 1.3 Changing the scheduling leads to improvement for AMC systems. The directions of the arrows indicate the data dependencies.

task generation overhead. The high task generation overheads occur on both homogeneous and asymmetric multi-core systems. Spending a lot of time on task generation can result in decreased performance as the scheduler is unable to detect the tasks and send them to the appropriate core. The leftmost representation of Figure 1.4 shows how the task generation affects the scheduling and increases the idle time. The black rectangles indicate the time spent to create one task and the number inside these boxes show the task that was just created. Looking at the yellow rectangles we can see that the execution of each task is delayed due to the delay in task generation. The rightmost part of Figure 1.4 shows how by reducing the task generation time scheduling of the tasks is improved and the CPU idle time is significantly reduced. This leads to inefficient schedules with increased idle time. Accelerating the task generation overheads increases performance of task-based programming models.

## 1.2 THESIS CONTRIBUTIONS

The performance of task-based programming models can be boosted by providing sophisticated scheduling policies that take into account the platform's asymmetry.

The contributions made in this thesis rely on the efficient utilization of asymmetric multi-core systems. Figure 1.5 shows the research areas around performance of AMC systems. The areas shown in yellow color indicate the fields of our study. This thesis starts with a wide experimental evaluation of highly parallel applications on AMC systems. We then approach the two important challenges of using task-based programming models on AMC platforms. These challenges include first, the scheduling problem, and second the high task generation overheads introduced while using these parallel programming tools. Apart



Figure 1.4 Task generation overheads leads to increased idle time. Numbers inside the boxes show the task ID that is either being created or executed

from scheduling within task-based programming models, this thesis includes the high-level description of a thread scheduler for asymmetric systems targeting this time mobile devices. We tackle the above challenges and provide the following contributions:

1. A thorough study [28] of the potential of the AMC systems when running out-of-the-box HPC applications. We compare scheduling on different levels of the software stack and we conclude that using a task-based programming model is indeed the most efficient solution as it allows the runtime system to maintain load balance even if the system is asymmetric. This study serves as a verification that for HPC, scheduling through task-based programming models is indeed more efficient compared to app-based scheduling, as shown on Figure 1.5.
2. The design, implementation and evaluation of three novel scheduling policies that are aware of the system's asymmetry [29, 31]. These schedulers have different criteria for rating the importance of the executing tasks of an application. According to their distinct criteria, each scheduler identifies the *critical tasks* of the application and executes them on the fast cores of the system. They then leave the *non-critical tasks* to be executed by the slower cores of the system. The research areas that were studied during at this part of the thesis are the scheduling and dependence analysis of the runtime system as shown on Figure 1.5.
3. The TaskGenX proposal [30], a hardware-software co-design scheme to reduce task generation overheads of task-based programming models. We implement the TaskGenX runtime system, that decouples the task generation from the other runtime activities and sends it for execution on the runtime optimized accelerator. Furthermore, we draw the requirements of the hardware accelerator in terms of performance with the

hope to influence hardware designers for the implementation of such a component. The research area that is related to this contribution is the resource allocation from Figure 1.5 of the runtime systems.

4. A high-level description of a thread scheduler for AMC systems that targets mobile devices. Our approach takes scheduling actions depending on the current temperature of the device and manages to increase the frames per second for three game applications while keeping the temperature stable. This study is briefly described due to NDA agreement as it was part of my internship in Samsung Electronics Research Institute, UK. This part of the thesis is within the Android framework as shown in Figure 1.5.

The publications that support this thesis are listed below:

1. **Kallia Chronaki**, Miquel Moreto, Marc Casas, Alejandro Rico, Rosa M. Badia, Eduard Ayguade, Mateo Valero: "On the Maturity of Parallel Applications for Asymmetric Multi-Core Processors", *under review, JPDC*.
2. [28] **Kallia Chronaki**, Miquel Moreto, Marc Casas, Alejandro Rico, Rosa M. Badia, Eduard Ayguade, Mateo Valero: "POSTER: Exploiting Asymmetric Multi-Core Processors with Flexible System Software". Poster presentation in International Conference on Parallel Architecture and Compilation Techniques (PACT) 2016, Haifa, Israel.
3. [31] **Kallia Chronaki**, Alejandro Rico, Rosa M. Badia, Eduard Ayguade, Jesus Labarta, Mateo Valero: "Criticality-Aware Dynamic Task Scheduling for Heterogeneous Architectures". Paper presentation in International Conference of Supercomputing (ICS) 2015, Newport Beach, California.
4. [29] **Kallia Chronaki**, Alejandro Rico, Marc Casas, Miquel Moreto, Rosa M. Badia, Eduard Ayguade, Jesus Labarta, Mateo Valero: "Task Scheduling Techniques for Asymmetric Multi-Core Systems". Published in IEEE Transactions on Parallel and Distributed Systems (TPDS) 2017 volume 28 number 7.
5. [30] **Kallia Chronaki**, Marc Casas, Miquel Moreto, Jaume Bosch, Rosa M. Badia "TaskGenX: A Hardware-Software Proposal for Accelerating Task Parallelism". Paper presentation in International Supercomputing Conference (ISC) 2018, Frankfurt, Germany.

The contributions of this thesis have also been used as part of other relevant publication:



Figure 1.5 Research areas studied in this thesis

- [23] Emilio Castillo, Miquel Moreto, Marc Casas, Lluc Alvarez, Enrique Vallejo, **Kallia Chronaki**, Rosa M. Badia, Jose Luis Bosque, Ramon Beivide, Eduard Ayguade, Jesus Labarta, Mateo Valero: "CATA: Criticality Aware Task Acceleration for Multicore Processors". Paper presentation in IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2016.

### 1.3 THESIS ORGANIZATION

The rest of this document is organized as follows: Chapter 2 presents the related work of this thesis. Chapter 3 presents the background of this thesis. We first report the characteristics of AMC architectures, we then explain how task-based programming models are organized and how their runtime system operates, we explain a few things about the TaskSim simulator [83], that is our tool for evaluating the impact of our implementations on larger systems and finally we provide the list of applications used in our evaluation together with a short description for each one.

Chapter 4 provides our thorough and detailed study of highly parallel applications on asymmetric systems. This study is a proof that scheduling on the runtime system is the most efficient way of utilizing an AMC system. Chapter 5 describes and evaluates our three novel scheduling approaches (CATS, CPATH and HYBRID). We implement these approaches within the OmpSs programming model and evaluate them using real scientific applications. To see their impact on larger systems we use TaskSim simulator.

Chapter 6 presents our software-hardware co-design proposal, TaskGenX. We show the implementation done and we evaluate our proposal using TaskSim simulator and real ap-

plications. Chapter 7 includes the thread-based scheduling within the Android framework together with its evaluation on a mobile device using three games with intensive graphics. Finally, Chapter 8 concludes this thesis.



# CHAPTER 2

## RELATED WORK

---

This chapter presents the related work that has been taking place on the research topics of this thesis. Our focus is the efficient utilization of asymmetric multi-core systems. There has been plenty of research on asymmetric multi-core systems. Some of the studies focus on the system design [10, 59, 70], while others explore the challenges that appear in efficiently utilizing such a heterogeneous system [54, 55, 60].

Our main contributions can be split into two categories: i) scheduling ii) software-hardware co-design We separate the research studies according to the category they fall to and explain how they relate to our contributions.

Section 2.1 shows the work that has been done on AMCs and is mostly generic and related to either the system design of these systems or exploring the challenges that appear in efficiently utilizing such heterogeneous systems.

Section 2.2 shows the work that has been done and relates to scheduling for heterogeneous systems. This includes process scheduling as well as task scheduling. The process scheduling works relate more to the work presented in Chapter 7, while the works of task scheduling relate to the work presented in Chapter 5. We extensively study the research of task-based scheduling and we categorize the scheduling heuristics found in the literature. Furthermore, we present some studies that focus on heterogeneous systems with compute accelerators and finally works on schedulers that assume some sort of task criticality in their decisions.

Section 2.3 presents the work related to hardware-software co-design. This is also related to asymmetric multi-core systems and scheduling but in focuses in another type of contribution. Works that fall in this category differ from scheduling algorithms as they include a specific-purpose hardware design as well as a runtime system that utilizes this

hardware efficiently. In our case the proposed hardware is an asymmetric system with a compute accelerator.

## 2.1 ASYMMETRIC MULTI-CORE SYSTEMS

There have been plenty of studies on asymmetric multi-core systems. Some studies focus on the system design, while others explore the challenges that appear in efficiently utilizing such a heterogeneous system [69]. Kumar et al [59] present the idea of an AMC system and proposed a feedback-based way to dynamically migrate processes among the different cores. Their approach led to energy savings but at the cost of performance degradation. To determine the core that most effectively executed a workload, Kumar et al [60] proposed the use of sampling. This method minimizes the execution time of each single thread and increases performance. Other studies focused on the pipeline design of such AMCs and the area that should be devoted to each component in the system [10, 70]. Other studies on AMCs proposed hardware support to detect critical sections of a parallel application and schedule the most critical ones on the big cores [54, 55, 87]. These approaches are orthogonal to the ones evaluated in this thesis and could benefit from them to further improve the final performance of the system.

## 2.2 SCHEDULING FOR HETEROGENEOUS SYSTEMS

### 2.2.1 PROCESS SCHEDULING

Process scheduling on AMCs is one of the most challenging topics in this area of study. Bias scheduling [57] is an OS scheduler that characterizes the running threads according to their memory or execution intensity. It then schedules the computation intensive threads to the big cores of the system while the memory intensive threads to the little cores of the system. The experimental evaluation is done on Intel Xeon processors and the heterogeneous system is emulated by changing the configuration of three out of the four cores of the processor.

Cong et al propose the Energy-Efficient [33] OS scheduler based on energy estimation. The evaluation is performed on the Intel QuickIA [27] platform that integrates an Intel Xeon with an Atom processor. Code instrumentation is used to schedule different phases of the application on each processor. However, the separate core and memory subsystems in these architectures incur power and performance overheads for application migration, which makes dynamic mapping ineffective for fine-grained migration

Van Craeynest et al. [94] propose the fairness-aware OS scheduler that focuses on AMC architectures. They focus on heterogeneous single-ISA architectures and they use some of the applications evaluated in this thesis but their research lacks the real heterogeneous system; instead they simulate a heterogeneous processor with big and little cores to estimate

performance and fairness (not energy).

The performance impact estimation (PIE) scheduler [93] is based on the impact of memory-level parallelism (MLP) and instruction-level parallelism (ILP) on the overall cycles per instruction (CPI) and focuses on improving performance. The scheduler predicts the impact of each different core-type of the system on the MLP, ILP and it assumes hardware support for CPI. The evaluation of this paper is performed through simulation by mimicking a heterogeneous system like ARM big.LITTLE. Further there are no energy or power measurements.

Rodrigues et al [84] propose a thread scheduling technique that estimates power and performance when deciding to assign a thread to a specific core of the heterogeneous system. However their evaluation is based on simulated results and not on a real platform.

Further to these existing studies there are the state-of-the-art mainstream OS schedulers that are currently used for ARM big.LITTLE systems. These include *cluster switching* (CS) [32], *in-kernel switch* (IKS) [68] and *global task scheduling* (GTS) [32]. The CS approach allows only one of the core clusters to be active at any given moment and the OS scheduler acts as if it schedules a symmetric multi-core system. In the IKS each little core is paired with one big core; depending on the CPU utilization the system chooses which core type will represent each pair of cores. In contrast to the above, GTS allows all cores to be active any given moment and depending on the CPU utilization schedules threads to the appropriate cores. These approaches are described in more detail in Chapter 4 where GTS is evaluated.

Finally, Energy-Aware Scheduling (EAS) is an on-going effort in the Linux community to introduce the energy factor in the OS scheduler [3, 50]. It is based on performance and power profiling to set performance and power capacities and let the Linux completely fair scheduler assign slots to processes considering the different core capacities. EAS is not yet part of the Linux kernel and, therefore, GTS is the most sophisticated state of the art scheduling method in production on current big.LITTLE processors.

### 2.2.2 TASK SCHEDULING

The search for efficient task scheduling on multi-core systems has been intensively studied. Most scheduling heuristics in the literature target homogeneous multiprocessors, nevertheless there exist an important number of studies in heterogeneous multiprocessors. In this section we give an overview of different categories of schedulers for heterogeneous systems, we explain some details about schedulers targeting specific systems using compute accelerators and explain details of previous works on criticality-aware schedulers.

### *Categories of Heterogeneous Schedulers*

There are previous works on schedulers for heterogeneous systems that form four different types of schedulers: listing, clustering, guided-random, and duplication-based schedulers.

Listing schedulers [1, 35, 49, 65, 91] have two scheduling stages. In the first stage, each task is given a priority based on the policy defined in each algorithm. In the second stage, tasks are assigned to processors depending on their priorities. Most criticality-aware schedulers fall in this category, and we discuss them in Section 2.2.2. The schedulers proposed in this thesis are also listing schedulers.

Clustering schedulers [49, 51, 98, 100] first separate tasks into clusters, where each cluster is to be executed on the same processor. During the clustering stage, the algorithm assumes an unlimited number of available processors in the system. If the number of clusters exceeds the number of available cores, the *merging* stage joins multiple clusters so that they match the number of available processors. An example is the Levelized Min Time [51] clustering scheduler. This heuristic clusters tasks that can execute in parallel according to their *level* (i.e. sibling nodes in a graph have the same level), and assigns priorities to the tasks in a cluster according to its execution time, (i.e. tasks with the highest execution time have the highest priority). The task-to processor assignment is done in decreasing order of priority.

Guided-random schedulers [64, 76, 101] randomize their schedules by applying policies influenced by other sciences. Genetic algorithms [101] group tasks into generations and schedule them according to a randomized genetic technique. Chemical reaction algorithms [64] mimic molecular interactions to map tasks to processors. Some of these guided-random approaches are designed for heterogeneous systems [64, 101]. The scheduler by Page et al. [76] enables dynamic scheduling of multiple-sized tasks for heterogeneous systems. However, it does not support dependencies between tasks.

Duplication-based schedulers [2, 11, 103] aim to eliminate communication costs between processors by scheduling tasks and their successors on the same processor. If a task has many successors, it is duplicated and executed in multiple cores prior to its successors so all successor tasks get the data from their predecessors with the lowest communication cost. This scheduling potentially introduces redundant duplications of tasks which may lead to bad schedules. The Heterogeneous Economical Duplication scheduler [2] performs task duplication in an economical manner as it removes the redundant duplicates if they do not affect performance.

These previous works schedule tasks statically and assume the prior knowledge of the task execution times on the different processor types in the heterogeneous system. In addition, none of these works, take into account the criticality of tasks with respect to task

dependencies.

### *Schedulers for Compute Accelerators*

The schedulers in the previous subsection target the scheduling of generic TDGs on generic heterogeneous architectures. In this subsection we cover schedulers that target specific systems with compute accelerators. These works are more focused on the scheduling of tasks on the target platform based on the abstractions provided by the corresponding mixture of programming models for the general-purpose processors and the compute accelerators in the system.

Most heterogeneous systems with compute accelerators nowadays combine general-purpose CPUs and GPU compute accelerators. There is a set of programming models providing abstractions to ease the development of applications on these platforms. OmpSs [9, 41] offers this abstraction by allowing multiple implementations of a given task to be executed on different processing units [79]. The scheduler then assigns the execution of a task to the best resource according to its earliest finish time. Another case is StarPU [7], a library that offers runtime heterogeneity support and provides priority schedulers for task-to-processor allocation. AHP [78] is another framework that generates software pipelines for heterogeneous systems and schedules tasks to their earliest executor, based on profiling information gathered prior to runtime.

None of these works, however, take into account the criticality of tasks regarding task dependencies, but they rather focus on the earliest execution time of individual tasks on the processor types in the specific system configuration.

### *Criticality-Aware Schedulers*

Several previous works propose scheduling heuristics that focus on the critical path of a TDG to reduce total execution time [35, 49, 65, 71, 91]. To identify the tasks on the critical path, most of these works use the concept of *upward rank* and *downward rank*. The upward rank of a task is the maximum sum of computation and communication cost of the tasks in the dependency chains from that task to an exit node in the graph. The downward rank of a task is the maximum sum of computation and communication cost of the tasks in the dependency chain from an entry node to that task. Each task has an upward rank and downward rank for each processor type in the heterogeneous system, as the computation and communication costs differ across core types.

The Heterogeneous Earliest Finish Time (HEFT) algorithm [91] maintains a list of tasks sorted in decreasing order of their upward rank. At each schedule step, HEFT assigns the task with the highest upward rank to the processor that finishes the execution of the task

at the earliest possible time. Another work is the Longest Dynamic Critical Path (LDCP) algorithm [35]. LDCP also statically schedules first the task with the highest upward rank on every schedule step. The difference between LDCP and HEFT is that LDCP updates the computation and communication costs on multiple processors of the scheduled task by the costs discovered in the processor to which it was assigned.

The Critical-Path-on-a-Processor (CPOP) algorithm [91] also maintains a list of tasks sorted in decreasing order as in HEFT, but in this case it is ordered according to the addition of their *upward rank* and *downward rank*. The tasks with the highest *upward rank + downward rank* belong to the critical path. On each step, these tasks are statically assigned to the processor that minimizes the critical-path execution time.

The main weaknesses of these works are that (a) they assume prior knowledge of the computation and communication costs of each individual task on each processor type, (b) they operate statically on the whole TDG, so they do not apply to dynamically scheduled applications where only a part of the TDG is available at any given time, and (c) most of them use synthetic TDGs that are not necessarily representative of the dependencies in real workloads.

## 2.3 HARDWARE-SOFTWARE CO-DESIGN

In this thesis, apart from proposing scheduling algorithms for task-based programming models, we also provide a hardware-software proposal for accelerating the most intensive parts of the runtime system. This section provides the related work on task-based programming models as well as on hardware-software co-design proposals like TaskGenX.

Task-based programming models are widely spread as they facilitate the parallel execution on homogeneous or heterogeneous multi-core environments. State of the art task-based runtime systems include the OpenMP [18], OmpSs [41], StarPU [7], Cilk++ [44, 45] and Swan [95]. All these models support tasks and maintain a TDG specifying the inter-task dependencies. This means that the runtime system is responsible for the task creation, the dependence analysis as well as the scheduling of the tasks. However, none of these runtime systems offers automatic offloading of task creation.

The fact that task-based programming models are so widely spread makes TaskGenX-like approaches very important and also gives importance to studies that focus on adding hardware support to boost performance of task-based runtime systems. Even if their work focuses more on the hardware part of the design, their contributions are very relative to our study as we can distinguish which parts of the hardware is more beneficial to be accelerated.

Carbon [61] accelerates the scheduling of tasks by implementing hardware ready queues.

Carbon maintains one hardware queue per core and accelerates all possible scheduling overheads by using these queues. Nexus# [34] is also a distributed hardware accelerator capable of executing the *in*, *out*, *inout*, *taskwait* and *taskwait on* pragmas, namely the task dependencies. TDM [24] mitigates runtime system overheads by introducing a distributed hardware unit, denoted Dependence Management Unit (DMU), and minimal ISA extensions that allow the runtime system to offload costly dependence tracking operations. Unlike Carbon, TDM and Nexus, TaskGenX accelerates only task creation. Moreover, ADM [85] is another distributed approach that proposes hardware support for the inter-thread communication to avoid going through the memory hierarchy. This aims to provide a more flexible design as the scheduling policy can be freely implemented in software. These designs require the implementation of a hardware component for each core of an SoC. Our proposal assumes a centralized hardware unit that is capable of operating without the need to change the SoC.

Task Superscalar [42] and Picos++ [88] use a single hardware component to accelerate parts of the runtime system. In the case of Task superscalar, all the parts of the runtime system are transferred to the accelerator. Picos++ [88] is a hardware-software co-design that supports nested tasks. This design enables the acceleration of the inter-task dependencies on a special hardware. Swarm [53] performs speculative task execution. Instead of accelerating parts of the runtime system, Swarm uses hardware support to accelerate speculation. This is different than our design that decouples only task creation.

Our work diverges to prior studies for two main reasons that rely on their implementation and their practicability. First, their implementation typically requires changes in hardware of the SoC. This means that they need an expensive design where each core of the chip has an extra component. TaskGenX offers a much cheaper solution by requiring only a single specialized core that, according to our experiments, can manage the task creation for 512-core SoCs. Secondly, none of the previous studies is aiming at accelerating exclusively task creation overheads. According to our study task creation becomes the main bottleneck as we increase the number of cores and our proposal is the first that takes this into account resulting in a minimalist approach of runtime overheads acceleration.

# **CHAPTER 3**

## **EXPERIMENTAL ENVIRONMENT**

---

To familiarize the reader, this chapter describes the background of this thesis. This study includes software enhancements targeting asymmetric multi-core systems. Thus, we first give some background information about the asymmetric multi-core architecture used in this work, which is the ARM big.LITTLE architecture [47]. The second part of this chapter, provides information about the parallel programming models, that is the currently used method for parallel programming in HPC applications. Finally we provide a high-level description of the applications used in this work.

### 3.1 THE ARM BIG.LITTLE ARCHITECTURE

The ARM big.LITTLE [32, 47] is a state-of-the-art AMC architecture that has been successfully deployed in the mobile market. The observation that mobile devices typically combine phases with low and high computational demands motivated this original design. ARM big.LITTLE combines simple in-order cores with aggressive out-of-order cores in the same System-on-Chip (SoC) to provide high performance and low power. *Big* and *little* cores support the same architecture so they can run the same binaries and therefore easily combined within the same system. Current cores implementing the ARMv7-A and ARMv8-A ISA support big.LITTLE configurations.

The little cores in a big.LITTLE system are designed targeting energy efficiency. Current implementations have relatively short pipelines with up to dual-issue in-order execution. L1 caches are split for instructions and data and can be dimensioned according to the target domain from 8 to 64 KB in size [58]. The big cores are designed for high performance. Current designs have deeper pipelines with up to seven-issue out-of-order execution, increased number of functional units and improved floating-point capabilities. L1 data cache is up to 64 KB and L1 instruction cache is up to 64 KB[19, 36, 48]. Little and big cores are typically integrated in a hierarchical manner. A set of cores form a cluster that may include a cache that is shared among cores in the cluster [48]. Then, multiple clusters can be interconnected through an on-chip network and share a last-level cache and connection to main memory and peripherals.

In this thesis, we use of one of the commercially available development boards featuring a big.LITTLE architecture: the Hardkernel Odroid-XU3 development board. As shown in Figure 3.1, the Odroid-XU3 includes an 8-core Samsung Exynos 5422 chip with four ARM Cortex-A15 cores and four Cortex-A7 cores. The four Cortex-A15 share a 2 MB 16-way 64-byte-cache-line L2 cache, while the Cortex-A7 cores share a 512 KB L2 cache. A single memory controller provides access to 2 GB of LPDDR3 RAM with dual 32-bit channels at 1866 MT/s. The reason we use this platform instead of the more up-to-date Juno platform [6] is that even if the latter features the more advanced Cortex A53 and Cortex A57 cores, it is limited to six cores instead of the 8 cores in Odroid-XU3.

The Cortex-A7 cores in this SoC support dual-issue of instructions and their pipeline length is between 8 and 10 stages. The L1 instruction cache is 32KB two-way set associative, with virtually indexed and physically tagged cache-lines that can hold up to 8 instructions. The core supports instruction prefetch by predicting the outcome of branches; the prefetch unit can fetch up to a maximum of four instructions per cycle. The L1 data cache is four-way set associative with physically-indexed and physically-tagged cache lines



Figure 3.1 Samsung Exynos 5422 processor with ARM big.LITTLE architecture.

and uses a pseudo-random replacement policy [5]. Dynamic Voltage and Frequency Scaling (DVFS) techniques adjust the frequency of the little cores from 200MHz up to 1.4GHz.

The Cortex-A15 cores in this SoC support triple-issue of instructions and their pipeline length is between 15 and 24 stages [92]. The L1 instruction and data caches of the Cortex-A15 are both 32 KB and 2-way set-associative with 64 byte cache lines. The processor supports speculative instruction execution by maintaining a 2-level global history-based dynamic predictor with a branch target buffer [4]. The instruction decode unit performs register renaming to remove the Write-After-Write and the Write-After-Read hazards, and promote instruction reordering [4]. The instruction dispatch unit analyzes instruction dependences before issuing them for execution. The integer execute unit includes 2 Arithmetic Logical Units with support for operand forwarding. DVFS techniques vary the frequency of the big cores from 200 MHz up to 2 GHz. For the rest of the thesis, we refer to Cortex-A15 cores as *big* and to Cortex-A7 cores as *little*.

All the real machine experiments in this thesis are performed on the Hardkernel Odroid XU3 that features the ARM big.LITTLE architecture. To avoid machine overheating, we make use of the cpufreq driver to set big cores at 1.6GHz and little cores at 800MHz.

## 3.2 THE TASKSIM SIMULATOR

To evaluate our contributions on larger systems we make use of the TaskSim simulator [46, 83]. TaskSim is a trace driven simulator, that supports the specification of homogeneous or heterogeneous systems with many cores. The tracing overhead of the simulator is less than 10% and the simulation is accurate as long as there is no contention in the shared memory resources on a real system [46]. By default, TaskSim allows the specification of the amount of cores and supports up to two core types in the case of heterogeneous asymmetric systems. This is done by specifying the number of cores of each type and their difference in performance between the different types (performance ratio) in the TaskSim configuration file.

Our evaluation consists of experiments on both symmetric and asymmetric platforms with the number of cores varying from 8 to 512. In the case of asymmetric systems, we simulate the behaviour of an ARM big.LITTLE architecture [52].

To simulate our approaches using TaskSim we first run each application/input in the TaskSim trace generation mode. This mode enables the online tracking of task duration and synchronization overheads and stores them in a trace file. To perform the simulation, TaskSim uses the information stored in the trace file and executes the application by providing this information to the runtime system. For our experiments we generate three trace files for each application/input combination on a Genuine Intel 16-core machine running at 2.60GHz.

## 3.3 TASK-BASED PARALLEL PROGRAMMING MODELS

Parallel programming models [9, 15, 17, 81], are widely used to facilitate the programming of parallel codes for multi-core systems. These programming models offer an abstraction layer to the programmer so that multi-threaded programming of an application becomes easier. They support code annotations that the programmer can add to the application’s sequential code and transform it into parallel. These annotations include the specification of parallel loops, atomic operations, critical regions or task clauses.

Our main focus in this thesis is the task annotation with dependency tracking which OpenMP [18] supports since its 4.0 release [74]. A task is a piece of code<sup>1</sup> in the application that can execute simultaneously to other tasks and cooperatively produce results. In a parallel application there can be many tasks that perform the same computations on different data or tasks that perform different computations on the same data. By using task

---

<sup>1</sup>A piece of code can be a function or a code block.

annotations, the programmer decomposes the application into tasks and specifies the input and output data dependencies between them. Parallel programming models typically consist of two parts; a compiler and a runtime system. The compiler is responsible to parse the code annotations and translate them to code by adding calls to the programming model's runtime system. The runtime system consists of software threads and is responsible for the efficient execution of the tasks with respect to the data dependencies as well as the availability of resources. This thesis mainly focuses on enhancements of the runtime system in parallel programming models.

The runtime system creates and manages the software threads for the execution of the tasks. Typically one software thread is being bound to each core. One of the threads is the *master thread*, and the rest are the *worker threads*. The master thread starts executing the compiler generated application's code sequentially and creates the tasks as it encounters them. Following task creation, is the analysis of the dependencies of the created task and the insertion of it in the Task Dependency Graph (TDG).

A TDG is a distributed graph structure that connects each task of the application with the rest of the tasks according to their existing input and output data dependencies. A dependency tracking mechanism is responsible for the maintenance of the input and output data of a task. Within this mechanism, the runtime system tracks the memory addresses that tasks write or read and according to this, it manages the tasks that are ready for execution<sup>2</sup>. Tasks that their input data have been produced are marked as *ready* and can start execution, while tasks that are still waiting for input are postponed until their inputs are produced by other executing tasks.

To enable the efficient parallel execution of the tasks, the scheduler of the runtime system keeps track of the available tasks to be executed and manages the task-to-thread allocation. To do this, the scheduler maintains a *ready queue* and when all of a tasks dependencies are satisfied (i.e., the task becomes *ready*) it is inserted in the *ready queue*. All threads have access to this queue which is a first-in-first-out data structure; whenever a thread becomes idle, it pops the next task from the queue and executes it. In this thesis we make use of OmpSs [9], a mainstream task-based programming model and the main influence of the updated OpenMP 4.0 [18].

### 3.3.1 OMPSS PROGRAMMING MODEL

The OmpSs programming model is a task-based programming model that offers a high level abstraction to the implementation of parallel applications for various homogeneous and heterogeneous architectures [9, 41]. As a task-based programming model, OmpSs enables the

---

<sup>2</sup>When the input data of a task is produced, the task can start execution.

annotation of function declarations with the task directive, which declares a task. Every invocation of a such function creates a task that is executed concurrently with other tasks or parallel loops. OmpSs also supports task dependencies and it uses the StarSs [40] dependency tracking mechanisms. OmpSs is built with the support of the Mercurium compiler, responsible for the translation of the OmpSs annotation clauses to source code, and the Nanos++ runtime system [13], responsible for the internal creation and execution of the tasks.

As a task-based parallel programming model, OmpSs enables the annotation of function declarations with the task directive. If a function is declared as a task, then every invocation of this function creates a task that is executed concurrently with other tasks or parallel loops. The accessible data to each task are the arguments of the function. OmpSs uses the StarSs [40] dependency tracking mechanisms and each task may be annotated with the *in*, *out*, *inout* clauses. These clauses allow the specification of scalars, arrays and pointers as input, output or input and output data of a task. The implementation of a barrier is supported under the *taskwait* clause, and it can also be used with the addition of the *on* clause, to declare a barrier for the group of tasks that produce a specific piece of data. These original OmpSs features can now be found in OpenMP 4.0 [18].

Nanos++ is an environment designed to serve as the runtime platform of OmpSs. It provides device support for heterogeneity and includes different plug-ins for implementations of scheduling policies, throttling policies, thread barriers, dependency tracking mechanisms, work-sharing and instrumentation. This design allows to maintain the runtime features by adding or removing plug-ins. Thus, the implementation of a new scheduler, or the support of a new architecture becomes simple.

The implementations of the different scheduling policies in Nanos++ perform various actions on the states of the tasks. A task is *created* if a call to this task is discovered but it is waiting until all its inputs are produced by other previous tasks. When all the input dependencies are satisfied, the task becomes *ready*. The ready tasks of the application at a given point in time are inserted in the *ready queues* as stated by the scheduling policy. Ready queues can be thread-private or shared among multiple threads. When a thread becomes idle, the scheduling policy picks a task from the ready queues for that thread to execute.

The default scheduling policy of Nanos++ employs a *breadth-first* policy (BF) [39]. The BF scheduler implements a single first-in-first-out ready queue shared among all threads. When a task is ready, it is inserted in the tail of the ready queue and when a core becomes available, it retrieves a task from the head of the queue. Tasks are ordered according to their ready time: the earliest ready task resides at the head of the queue. Since the ready queue is shared, there is no need for work stealing and the load is balanced automatically. BF does

not differentiate among core types and assigns tasks in a first-come-first-served basis.

The Nanos++ internal data structures support task prioritization. The task priority is an integer field inside the task descriptor that rates the importance of the task. If the scheduling policy supports priorities, the ready queues are implemented as *priority queues*. In a priority queue, tasks are sorted in a decreasing order of their priority. The insertion in a priority queue is always ordered and the removal of a task is always from the head of the queue, i.e., the task with the highest priority. The priority of a task can be either set in user code, by using the *priority* clause, which accepts an integer priority value or expression, or dynamically by the scheduling policy, as is described in the next section.

## 3.4 APPLICATIONS

In the evaluation of our contributions we use 13 scientific applications. With the prevalence of many-core processors and the increasing relevance of application domains that do not belong to the traditional HPC field, comes the need for programs representative of current and future parallel workloads. The PARSEC benchmark suite [16, 102] features state-of-the-art, computationally intensive algorithms and very diverse workloads from different areas of computing. In our experiments, we make use of the original PARSEC codes together with a task-based implementation of nine benchmarks of the suite [26]. Additionally, we evaluate some representative benchmarks from the BSC Application Repository (BAR) [12]. These applications are implemented using the OmpSs programming model.

Below we provide a brief description of each application and its task-based parallelization scheme. We provide more evaluation-specific characterization of the applications in the next chapters.

**Blackscholes** is an Intel RMS benchmark that calculates the prices of a portfolio analytically with the Black-Scholes partial differential equation. The OmpSs version of this benchmark is embarrassingly parallel as it equally divides the work into a certain number of independent blocks, greater than the number of available cores. The independent blocks can be processed simultaneously by the available threads to apply the Black-Scholes equation.

**Bodytrack** is an application that tracks a marker-less human body using multiple cameras through an image sequence. The OmpSs version implements a two-stage parallel pipeline for the image processing. In the first stage the implementation allows the concurrent processing of all the frames of the image sequence. The second stage is the application of a particle filter to the images in order to mark the human body. The two stages are synchronized through the OmpSs dataflow annotations. By allowing the parallel processing of frames, OmpSs achieves the overlapping of the I/O and serial code with the tasks.

**Canneal** optimizes the routing cost of chip design by using a simulated cache-aware annealing. To converge to an optimal solution, it continuously swaps elements that need to be placed in a large space. In the parallel version of canneal, the elements are stored in a list that is accessed among all threads. The OmpSs version spawns several independent tasks to perform the swaps on random elements of the list. Atomic operations are used to prevent the same elements to be modified simultaneously by more than one threads.

**Cholesky factorization** is a dense matrix operation that is used for solving linear equations in linear least square systems. It takes as input one symmetric and positive square matrix  $A$  and calculates a triangular matrix  $L$ , where  $LL^T = A$ . The OmpSs implementation of Cholesky blocks the input matrix into square blocks of floats and each task is responsible for performing the factorization on one block.

**Dedup** is a data compression algorithm that combines global and local compression. The algorithm first splits the input data into non-equal data chunks; each chunk then passes through 5 different pipeline stages. The OmpSs version groups and merges the pipeline stages into two and each stage is taskified and run in a pipelined manner. This method helps in overlapping the communication with the computation.

**Facesim** takes as input a 3D model of a human face and a time sequence of muscle activation and computes a visually realistic animation of the modeled face. The OmpSs implementation uses data parallelism with the task clause. The tasks are dynamically scheduled to the available cores and the synchronization is achieved through barriers.

**Ferret** is a content-based similarity search application of feature-rich data (audio, images, video, etc.). It takes as input a number of image queries and a database where the similarity search is performed. The OmpSs implementation uses pipelined parallelism with 5 pipeline stages implemented as tasks and dependency tracking between them. Each image query passes is used as input to this pipeline structure.

**Fluidanimate** is a benchmark that uses the Extended Smoothed Particle Hydrodynamics method to simulate an incompressible fluid for interactive animations. The OmpSs implementation first partitions the fluid surface into grids. It then uses this data-partition within parallel do-all loops that implement the appropriate hydrodynamic methods.

**Heat diffusion** uses the Gauss-Seidel method to compute the heat distribution on a matrix from  $x$  heat sources. Heat diffusion implements an iterative solver of the equation that invokes the Gauss-Seidel method until the desired convergence is reached. The code splits the matrix into blocks and creates one task per block for the Gauss-Seidel computation.

**Integral histogram** is a method to compute a cumulative histogram for each pixel of an image. With this algorithm, the histogram of a Cartesian data space is computed in constant time. The OmpSs implementation performs a blocked cross-weave scan for each block of

the image. It consists of a horizontal and a vertical scan that transmit histograms to the blocks that reside on the right or below the current block. The horizontal scan processes one image block at a time and transmits the histograms to the block on the right. The vertical scan processes one block and transmits the histograms to the block below the current block. Due to these transmissions, the application introduces many task dependencies.

**QR Factorization** is a linear algebra algorithm that is used to solve the linear least squares problem [21]. QR decomposes a matrix  $A$  of size  $m \times m$  into the product of two matrices,  $Q$  and  $R$ , such that  $A = QR$ . In this equation,  $Q$  is an orthogonal matrix and  $R$  is an upper triangular matrix. We evaluate the performance of a blocked, communication avoiding QR implementation in OmpSs. Each OmpSs task processes one block of the input matrix. The first set of tasks factorizes the blocks of the main diagonal of the matrix, and produces a Householder reflector for each block. This reflector is the input of the second set of tasks that apply it to the blocks that reside on the right of the diagonal block. What is next, is the factorization of the combination of a diagonal block with a block beneath the diagonal and finally the resulting Householder reflector is applied on the right of the combined blocks.

**Streamcluster** is a benchmark for solving the online clustering problem. The input of this benchmark is a stream of points. Streamcluster partitions them and their centers in a certain number of clusters. The OmpSs version creates independent tasks that operate on the input stream of data points. The data is split among tasks and each task is responsible for the process of a specific part of the input.

**Swaptions** is an Intel RMS workload that uses the Heath-Jarrow-Morton framework to price a portfolio of swaptions. The portfolio is stored in an array; the OmpSs version generates independent tasks for the portfolio pricing and each task is responsible for one part of the array.



# **CHAPTER 4**

## **STUDY OF ASYMMETRIC SYSTEMS**

---

This chapter presents an initial thorough study of scheduling on different levels of the software stack on an asymmetric multi-core (AMC) system. We evaluate how ready are HPC applications to move to an asymmetric environment taking into account performance and energy consumption. We compare scheduling approaches on the application, OS and run-time system levels.

## 4.1 INTRODUCTION

Energy efficiency has become the main challenge for future processor designs, motivating prolific research to face the *power wall*. Using heterogeneous processing elements is one of the approaches to increase energy efficiency [66, 99]. Asymmetric multi-core (AMC) systems is an interesting case of heterogeneous systems to utilize for energy efficiency. These systems maintain different types of cores that support the same instruction-set architecture. The different core types are designed to target different (performance or power) optimization points [10, 60, 97].

AMCs have been mainly deployed for the mobile market. Mobile processors are also utilized in HPC platforms aiming to energy savings [62]. Asymmetric mobile SoCs combine low-power simple cores (*little*) with fast out-of-order cores (*big*) to achieve high performance while keeping power dissipation low. Another area where AMCs have been successful is the supercomputing market. The Sunway TaihuLight supercomputer topped the Top500 list in 2016 using AMCs. In this setup, big cores, that offer support for speculation to exploit Instruction-Level Parallelism (ILP), run the master tasks such as the OS and runtime system. Little cores are equipped with wide Single Instruction Multiple Data (SIMD) units and lean pipeline structures for energy efficient execution of compute-intensive code.

Like in other heterogeneous systems, load balancing and scheduling are fundamental challenges that must be addressed to effectively exploit all the resources in AMC platforms [43, 47, 54, 55, 80, 87]. Mobile applications rely on multi-programmed workloads to balance the load in the system, while supercomputer applications rely on hand-tuned code to extract maximum performance. However, these approaches are not always suitable for general-purpose parallel applications.

In this chapter, we evaluate several execution models on an AMC using the PARSEC benchmark suite [102]. This suite includes parallel applications from multiple domains such as finance, computer vision, physics, image processing and video encoding. We quantify the performance loss of executing the applications *as-is* on all cores in the system. These applications were developed on homogeneous platforms and are bound to suffer from load imbalance on parallel regions that statically distribute the work evenly across cores without considering their performance disparity.

To overcome this matter, we consider two possible solutions at the OS and runtime levels to exploit AMCs effectively. The first solution delegates scheduling to the OS. We evaluate the built-in heterogeneity-aware OS scheduler currently used in existing mobile platforms that automatically assigns threads to different core types based on CPU utilization.

The second solution is to transfer the responsibility to the runtime system so it dynam-

ically schedules work to different core types based on work progress and core availability. We evaluate the impact of using an inherently load-balanced execution model such that of task-based programming models. Recent examples [8, 14, 38, 41, 73, 82, 89, 95, 96] include clauses to specify inter-task dependencies and remove most barriers which are the major source of load imbalance on AMCs. Another approach of scheduling in the runtime system is to change the existing statically-scheduled work-sharing constructs for the applications implemented in OpenMP to use dynamic scheduling.

This chapter provides a comprehensive evaluation of representative parallel applications on a real AMC platform: the Odroid-XU3 development board with ARM big.LITTLE architecture. We analyze the effectiveness of the aforementioned scheduling solutions in terms of performance, power and energy. We show why parallel applications are not ready to run on AMCs and how OS and runtime schedulers can overcome these issues depending on the application characteristics. Further we point out in which aspects the built-in OS scheduler falls short to effectively utilize the AMC. Finally, we show how the runtime system approach overcomes these issues, and improves the OS and static threading approaches by 13% and 23% respectively.

The rest of this chapter is organized as follows: Section 4.2 provides information on scheduling at the OS and runtime system levels. Section 4.3 shows the performance and energy results and associated insights. Finally, Section 4.4 concludes this work.

## 4.2 SCHEDULING IN ASYMMETRIC MULTI-CORES

Scheduling a set of processes on an AMC system is more challenging than the traditional process scheduling on SMCs. An efficient OS scheduler has to take into account the different characteristics of the cores and act accordingly [22]. There have been three mainstream OS schedulers for ARM big.LITTLE systems: *cluster switching*, *in-kernel switch* and *global task scheduling*, described in the next sections. In the case of parallel applications, *dynamic scheduling at the runtime system level* can be exploited to balance the workload among the different cores and is described in section 4.2.3.

### 4.2.1 CLUSTER SWITCHING AND IN-KERNEL SWITCH

In the Cluster Switching (CS) approach [32], only one of the clusters is active at any given time: either the cluster with little cores or the cluster with big cores executes. Thus, the OS scheduler operates on a *de-facto* symmetric multi-core with only four cores, namely the cores of the current active cluster. The policy to change the operating cluster is based on CPU utilization. When idle, background processes are executed on the little cores. When

CPU utilization surpasses a threshold, all processes (foreground and background) are migrated to the big cluster. When running on the big cluster, if CPU utilization decreases below a given lower threshold, the entire workload is moved to the little cluster.

In the In-Kernel Switch (IKS) approach [68], each little core is paired with a big core and it is seen as a single core. On idle, background processes are run on little cores. When the CPU utilization on a given little core surpasses a threshold, the execution on that core is migrated to the big core. When the CPU utilization decreases on that big core below a given threshold, the execution migrates to the associated little core. Thus, at the same time, little and big cores may co-execute, but only one of each pair is active at a given point in time, effectively exploiting just half of the cores concurrently. For both CS and IKS, an enhanced `cpufreq` driver manages the switching within each core pair.

#### 4.2.2 GLOBAL TASK SCHEDULING

The Global Task Scheduling (GTS) [32] allows running applications on all cores in the asymmetric multi-core. In GTS, all cores are available and visible to the OS scheduler, and this scheduler is aware of the characteristics of the core types. Each process is assigned to a core type depending on its CPU utilization: high CPU utilization processes are scheduled to big cores and low CPU utilization processes to little cores. GTS also migrates processes between big and little cores when their CPU utilization changes. As a result, cores are active depending on the characteristics of the workload.

The key benefit of GTS is that it can use all the cores simultaneously, providing higher peak performance and more flexibility to manage the workload. In GTS tasks are directly migrated to cores without needing the intervention of the `cpufreq` daemon, reducing response time and minimizing the overhead of context switches. As a consequence, Samsung reported 20% improvement in performance over CS for mobile benchmarks [32]. Also, GTS supports clusters with different number of cores (e.g. with 2 big cores and 4 little cores), while IKS requires to have the same number of cores per cluster.

#### 4.2.3 DYNAMIC SCHEDULING IN THE RUNTIME

Current programming models for shared memory systems such as OpenMP rely on a runtime system to manage the execution of the parallel application. In this work, we make use of two types of programming models: loop- and task-based. Loop-based scheduling distributes the iterations of a loop among the threads available in the system, following a traditional *fork-join* model. OpenMP supports loop-based scheduling through its *parallel*

for directives. This clause implies a barrier synchronization at the end of the loop<sup>1</sup>, and supports either static or dynamic loop scheduling.

With static loop scheduling, the iterations of a loop are divided to as many chunks as the number of cores. Then, every core executes the assigned chunk, leading to a low-overhead static scheduling. In addition, OpenMP supports dynamic loop scheduling. It generates more chunks than cores, and assigns them to the available cores at runtime. This is more suitable to asymmetric multi-core systems where the cores are not similar and a static iteration assignment would cause load imbalance.

Recent advances in programming models recover the use of task-based programming models to simplify parallel programming of multi-cores [14, 41, 73, 95, 104]. In these models the programmer splits the code in sequential pieces of work (tasks) and specifies the data dependencies among them. With this information the runtime system schedules tasks and manages synchronization. These models ease programmability [14, 41, 73, 95, 96, 104], and also increase performance by avoiding global synchronization points.

To evaluate this approach we make use of OpenMP tasking support [73]. OpenMP allows expressing tasks and data dependences between them using equivalent code annotations. It conceives the parallel execution as a *task dependence graph* (TDG), where nodes are sequential pieces of code (tasks) and the edges are control or data dependences between them. The runtime system builds this TDG at execution time and dynamically schedules tasks to the available cores. Tasks become ready as soon as their input dependencies are satisfied. The scheduling of the ready tasks is done in a first-come-first-served manner, using a FIFO scheduler. Even though this scheduler is not aware of the task computational requirements or the core type and its performance and power characteristics, it can balance the load as long as there are ready tasks available thanks to the lack of global synchronization.

## 4.3 EVALUATION

### 4.3.1 METHODOLOGY

All the experiments in this chapter are performed on the Hardkernel Odroid XU3 described in Section 3.1. As mentioned earlier, to avoid machine overheating, we make use of the cpufreq driver to set big cores at 1.6GHz and little cores at 800MHz. Table 4.1 shows the applications used in this chapter as well as their corresponding characteristics. These are the input data sets used, the parallelization technique and their *performance ratio* between big and little cores. The performance ratio shows how many times faster a big core is compared

---

<sup>1</sup>unless specified otherwise with the nowait clause

Table 4.1 Benchmarks used from the PARSEC benchmark suite and their measured performance ratio between big and little cores

| Benchmark     | Input Data Set                                             | Parallelization | Perf ratio |
|---------------|------------------------------------------------------------|-----------------|------------|
| Blackscholes  | 10,000,000 options                                         | data-parallel   | 2.18       |
| Bodytrack     | 4 cameras, 261 frames, 4,000 particles, 5 annealing layers | pipeline        | 4.16       |
| Canneal       | 2.5 million elements, 6,000 steps                          | unstructured    | 1.73       |
| Dedup         | 351 MB data                                                | pipeline        | 2.67       |
| Facesim       | 100 frames, 372,126 tetrahedra                             | data-parallel   | 3.40       |
| Ferret        | 3,500 queries, 59,695 images database, find top 50 images  | pipeline        | 3.59       |
| Fluidanimate  | 500 frames, 500,000 particles                              | data-parallel   | 3.32       |
| Streamcluster | 200K points per block, 5 block                             | data-parallel   | 3.48       |
| Swaptions     | 128 swaptions, 1 million simulations                       | data-parallel   | 2.78       |

to a little core. The performance ratio differs among applications because each application performs different computations that exhibit different benefits from the micro-architecture of the big or little cores. To compute the performance ratio we obtain the execution time of each workload when it executes sequentially on one big core and on one little core using the task-based model. We then use Equation 4.1 to compute the performance ratio of each workload.

$$\text{Perf.ratio}(\text{method}) = \frac{\text{Exec.time}(0, 1, \text{task-based})}{\text{Exec.time}(1, 0, \text{task-based})} \quad (4.1)$$

To estimate the impact of the different kinds of cores, we evaluate seven AMC system configurations with different numbers of *little* (L) and *big* (B) cores, denoted L+B. For each configuration and benchmark, we report the average performance of five executions in the application parallel region. Then, we report the application speedup over its execution time on one little core. Equation 4.2 shows the formula to compute this speedup. In this Equation, L is the number of little cores, B is the number of big cores and method is the scheduling method used.  $\text{Exec.time}(L, B, \text{method})$  is the execution time of the application when it runs on a system with L little cores, B big cores using a specific scheduling method.

$$\text{Speedup}(L, B, \text{method}) = \frac{\text{Exec.time}(1, 0, \text{method})}{\text{Exec.time}(L, B, \text{method})} \quad (4.2)$$

In this platform, there are four separated current sensors to measure, in real time, the power consumption of the A15 cluster, the A7 cluster, the GPU and DRAM. To gather power and energy measurements, a background daemon reads the machine power sensors periodically during the application execution with negligible overhead. Sensors are read at their refresh rate, every 270ms, and the values of A7 and A15 clusters' sensors are collected. With the help of timestamps, we correlate the power measurements with the application parallel region in a *post-mortem* process<sup>2</sup>. The reported power consumption is the aver-

---

<sup>2</sup>The parallel region duration is several orders of magnitude longer than the reading frequency of power



Figure 4.1 Ideal speedup over 1 little core according to Equation 4.7. Numbers at the bottom of x axis show the total number of cores, numbers above it show the number of big cores

age power tracked during five executions of each configuration, considering the application parallel region only. We then report average power in Watts along the execution.

To compute the energy consumption, we use the reported power consumption and we multiply it with the average execution time of the executions performed for the power measurements. Equation 4.3 shows how energy is computed in our evaluation.

$$Energy(L, B, method) = Avg.Power(L, B, method) \times Avg.exec.time(L, B, method) \quad (4.3)$$

The reported results show the normalized energy with respect to the energy consumed when using four little cores and the static-threading method, as shown in Equation 4.4.

$$NormalizedEnergy(L, B, method) = \frac{Energy(L, B, method)}{Energy(4, 0, static - threading)} \quad (4.4)$$

The Energy Delay Product (EDP) shows the overall efficiency of the different scheduling methods. This metric is computed using the Equation 4.5 and is the product of the energy and execution time.

$$EDP(L, B, method) = Energy(L, B, method) \times Avg.exec.time(L, B, method) \quad (4.5)$$

Our reported results show the normalized EDP with respect to the EDP when using four little cores and the static-threading method as shown in Equation 4.6.

$$NormalizedEDP(L, B, method) = \frac{EDP(L, B, method)}{EDP(4, 0, static - threading)} \quad (4.6)$$

The original codes make use of the `pthreads` parallelization model for all the selected

benchmarks. The taskified applications follow the same parallelization strategy implemented with OmpSs (similar to OpenMP 4.0) task annotations. The task-based implementation is done following two basic ideas: i) remove barriers where possible, by adding explicit data-dependencies; and ii) remove application-specific load balancing mechanisms, such as application-specific pools of threads implemented in `pthreads` and delegate this responsibility to the runtime.

The performance ratio is an important characteristic of each workload and affects its performance on an asymmetric system. Table 4.1 includes the observed performance ratio for each application. `Bodytrack` is the application that benefits the most from running on the big core with a performance ratio of  $4.16\times$ . The out-of-order execution of the big core together with an increased number of in-flight instructions significantly improves the performance of this application. In contrast, `canneal` is the benchmark with the lowest performance ratio,  $1.73\times$ , as this is a memory-intensive benchmark that does not benefit as much from the extra computation power of the big core. In general, performance ratios are above  $2.5\times$  for seven out of nine benchmarks, reaching  $3.03\times$  on average.

Taking into account these performance ratios, we can estimate the ideal speedup of the platform for each workload assuming a perfect parallelization strategy. Equation 4.7 shows the equation for the ideal speedup over 1 little core computation according to the number of big (B) and little (L) cores.

$$\text{Ideal speedup}(\text{workload}, B, L) = B \times \text{Perf\_ratio}(\text{workload}) + L \quad (4.7)$$

Figure 4.1 shows the ideal speedup of the system for each application for the varying numbers of cores. This speedup assumes that the applications are fully parallel with no barriers or other synchronization points. Thus, these speedups are an upper bound of the achievable application performance.

We measure execution time, power, energy and EDP of nine applications from the PARSEC benchmark suite [16]. We compare these metrics for three different scheduling approaches:

- *Static threading*: scheduling decisions are made at the application level. The OS is not allowed to migrate threads between the clusters of big and little cores.
- *GTS*<sup>3</sup>: dynamic coarse-grained OS scheduling using the GTS scheduler integrated in the Linux kernel [32, 52] using the default PARSEC benchmarks.

---

<sup>3</sup>We choose to evaluate GTS instead of CS and IKS because it is the most advanced scheduling approach supported in the Linux kernel.



Figure 4.2 Execution time speedup over 1 little core for systems that consist of 4 cores in total with 0, 2 and 4 big cores. Different schedulers at the application (*static threading*), OS (*GTS*) and runtime (*task-based*) levels are considered.



Figure 4.3 Average power measurements on a 4-core system with 0, 2, and 4 big cores.

- *Task-based*: dynamic fine-grained scheduling at the runtime level with the task-based implementations of the benchmarks provided in PARSECSs [26].

#### 4.3.2 EXPLOITING PARALLELISM IN AMCs

This section examines the opportunities and challenges that current AMCs offer to emerging parallel applications. With this objective, we first evaluate a system with a constant number of four cores, changing the level of asymmetry to evaluate the characteristics of each configuration. In these experiments, all applications run with the original parallelization strategy that relies on the user to balance the application (*Static threading*). We also evaluate the OS-based dynamic scheduling (*GTS*) and the task-based runtime dynamic scheduling (*Task-based*) for the same applications. The system configurations evaluated in this section are: i) Four little cores (0+4); ii) Two big and two little cores (2+2); and iii) Four big cores (4+0).

For these configurations, Figure 4.2 shows the speedup of the PARSEC benchmarks with respect to running on a single little core. Figure 4.3 reports the average power dissipated on the evaluated platform. Finally, Figure 4.4 shows the total energy consumed per application



Figure 4.4 Normalized energy consumption and average EDP on a 4-core system with 0, 2, and 4 big cores. Static threading on 4 little cores is the baseline in both cases.

for the same configurations. Energy results are normalized to the energy measured with four little cores (higher values imply higher energy consumptions). Average EDP results are also included in this figure.

Focusing on the average performance results, we notice that all approaches perform similarly for the homogeneous configurations. Specifically, applications obtain the best performance on the configuration 4+0, with an average speedup of 9.5× over one little core. When using four little cores, an average speedup of 3.8× is reached for all approaches. This shows that all the approaches are effective for this core count. In the configuration 2+2, *Static threading* slightly improves performance (5.0× speedup), while *GTS* and *Task-based* reach significantly higher speedups: 5.9× and 6.8×, respectively.

Contrarily, in terms of power and energy, the most efficient configuration is running with four little cores, as the performance ratio between the different cores is inversely proportional to the power ratio [47]. On average, all the approaches reach a power dissipation of 0.75W for the 0+4 configuration, while *Task-based* reaches 3.5W for the 4+0 configuration which is the one with the highest average power dissipation. In configuration 2+2, average energy values for *Static threading* and *Task-based* are nearly the same, as the increase in power from 1.6W to 2.1W is compensated by a significant improvement in performance of 30%.

Finally, in terms of EDP using the four big cores is the optimal, as the performance improvements compensate the increase in total energy. In configuration 2+2, *Task-based* achieves the same EDP results as in 0+4, but with 81% better performance. For the asymmetric configuration, *Task-based* achieves the best performance-energy combination since its dynamic scheduling is effectively utilizing the little cores.

Next, we focus on the obtained results per benchmark. For applications with an extensive use of barriers (blackscholes, facesim, fluidanimate, streamcluster and swaptions) or with a memory intensive pattern (canneal), the extra computational power offered by the big cores in configuration 2+2 is not exploited. As a result with *Static threading* performance



Figure 4.5 Average results when running on 4 to 8 cores with 4 of them big. Speedup is over 1 little core. Static threading on 4 little cores is the baseline of energy consumption and EDP

is only slightly improved by 1% on average when moving from 0+4 to the 2+2 configuration. This slight improvement comes at the cost of much more power and energy consumption (79% and 77% respectively). These results are explained three-fold: i) load is distributed homogeneously among threads in some applications; ii) extensive usage of barriers force big cores to wait until little cores reach the barrier; and iii) high miss rates in the last-level cache cause frequent pipeline stalls and prevent to fully exploit the computational power of big cores. To alleviate these problems, the programmer should develop more advanced parallelization strategies that could benefit from AMCs, as performed in the remaining applications, or rely on dynamic scheduling at OS or runtime levels.

The three remaining applications are parallelized using a pipeline model (bodytrack, dedup, and ferret) with queues for the data-exchange between pipeline stages and application-specific load balancing mechanisms designed by the programmer. As a result, *Static scheduling* with these applications benefits from the extra computational power of the big cores in the configuration 2+2. These mechanisms are not needed in the *Task-based* code; in this approach the code of the application is simplified and the runtime automatically allows the overlapping of the different pipeline stages. Thus, on the asymmetric configuration, *Task-based* further improves the obtained performance, reaching a 13% average improvement over *GTS*. Clearly, these applications benefit in performance by the increased number of big cores, while power and energy are increasing since the big cores are effectively utilized.

Generally, relying on the programmer to statically schedule asymmetric configurations does not report good results, as it is very hard to predict the system's behaviour at application-level. Only applications that implement advanced features with user-level schedulers and load balancing techniques, can benefit from asymmetry, at the cost of programmability effort. Relying on the OS scheduler is a suitable alternative without code modifications, but relying on the runtime to dynamically schedule tasks on the asymmetric processor achieves much better performance, power and energy results.



Figure 4.6 Speedup over 1 little core when running on 4 to 8 cores and 4 of them are big



Figure 4.7 Average power when running on 4 to 8 cores and 4 of them are big

#### 4.3.3 ADDING LITTLE CORES TO AN SMC

In the following experiments, we explore if an application running on a symmetric multicore (SMC) with big cores can benefit from adding small cores that help in its execution. Having more computational resources increases the ideal speedup a parallel application can reach, but it also introduces challenges at application, runtime and OS level. Thus, we examine how many small cores have to be added to the system to compensate the cons of having to deal with AMCs.

To evaluate this scenario, we explore configurations 4+0, 4+1, 4+2, 4+3 and 4+4. In these experiments, the number of big cores remains constant (four), while the number of little cores increases from 0 to 4. First we focus on the average results of speedup, power, energy and EDP, shown in Figure 4.5.

The speedup chart of Figure 4.5 shows that *Static threading* does not benefit from adding little cores to the system. In fact, this approach brings an average 6% slowdown when adding four little cores for execution (4+4). This is a result of the static thread scheduling; because the same amount of work is assigned to each core, when the big cores finish the execution of their part, they become idle and under-utilized. GTS achieves a limited speedup of 8% with the addition of four little cores to the 4+0 configuration. The addition of a single little

core brings a 22% slowdown (from 4+0 to 4+1) and requires three additional little cores to reach the performance of the symmetric configuration (4+3). Finally, the *Task-based* approach always benefits from the extra computational power as the runtime automatically deals with load imbalance. Performance improvements keep growing with the additional little cores, reaching an average improvement of 15% over the symmetric configuration when 4 extra cores are added.

The power chart of Figure 4.5 shows oppositional benefits among the three approaches. We can see that *Static threading* and *GTS* benefit from asymmetry, effectively reducing average power consumption. *Static threading* reduces power consumption when moving from the 4+0 to the 4+4 system by 23% while *GTS* does so by 6.2%. On the other hand, the *task-based* approach keeps the big cores busy for most of the time so it maintains the average power nearly constant.

The reduction in power, results to reduced average energy in the case of *Static threading* in configuration 4+4, as shown on the energy chart of Figure 4.5. As discussed in Section 4.3.2, little cores are more energy efficient than big cores, at the cost of reduced performance. In all the approaches, at least two extra little cores are needed to reduce energy. In configuration 4+4, energy is reduced by 14% for *Static threading*, 15% for *GTS*, and 16% for *Task-based*. Consequently, we can state that asymmetry reduces overall energy consumption.

To see the impact on both performance and energy efficiency we plot the average EDP on the rightmost chart of Figure 4.5. In this chart the lower values are the better. The *task-based* approach is the one that has the best performance-energy combination for the asymmetric configurations since it maintains the lowest EDP for all cases. *Static threading* manages to reduce the average EDP by 6% while *GTS* and *task based* approaches do so by 24% and 36% respectively.

Figure 4.6 shows a more detailed exploration of the performance results. As Table 4.1 shows, the applications with barrier synchronization are blackscholes, facesim, fluidanimate, streamcluster and swaptions. For these applications the most efficient system configuration with the *Static threading* approach is the 4+0. Little cores increase execution time due to load imbalance effects. Since the big cores reach barriers earlier, power is reduced for these applications, as shown in Figure 4.7. Energy reduction is less significant with a few extra little cores as the performance degradation is higher, but as the number of little cores increases, energy is reduced.

Applications with more advanced load balancing techniques like pipelined parallelism (bodytrack, dedup and ferret), benefit of the asymmetric hardware and balance the load among all the cores. As a result, performance improves as we increase the number of little cores. In the case of bodytrack, *GTS* reduces performance by 15% when adding four little

cores. We attribute this to the cost of the thread migration from one core to the other in contrast to the *Static threading* approach that does not add such overheads. In the case of dedup, results show more variability. This benchmark is very I/O intensive and, depending on the type of core that executes these I/O operations, performance drastically changes. In order to deal with this problem, a smarter dynamic scheduling mechanism would be required. Finally, canneal does not scale according to its ideal speedup reported on Figure 4.1 as it has a memory intensive pattern that limits performance.

Figure 4.7 shows the average power. The barrier-synchronized applications (blacksc-holes, facesim, fluidanimate, streamcluster and swaptions) reduce power because of their imbalance; since big cores have long idle times with the *Static threading* approach, they do not spend the same power as *GTS* and *Task-based*. For pipeline-parallel applications, both bodytrack and ferret maintain nearly the same power levels among the configurations for each scheduling approach. Dedup is an exception, as the results highly depend on the core that executes the aforementioned I/O operations. Yet, the effect of the lower power for *Static threading* is observed in all the benchmarks and is because the big cores are under-utilized.

This section proves that adding little cores to an SMC with big cores presents significant challenges for the application, OS and runtime developers. Little cores increase load imbalance and can degrade performance as a result. Relying on the programmer to deal with this asymmetry is complex, but a dynamic OS scheduler such as *GTS* helps in mitigating these problems, providing an average performance increase of 10%. However, the optimal performance results are obtained with the *Task-based* approach, as they improve static threading by 23% on average. In terms of power and energy, the AMC provides significant benefits, although the SMC with little cores remains the most energy-efficient configuration. The answer to the question of which system configuration provides the best power-performance balance, can be found on the average EDP chart of Figures 4.4 and 4.5, and is the use of the entire 8-core system with the *Task based* approach.

#### 4.3.4 PROGRAMMING MODELS FOR AMCS

As we saw in the previous section, current implementations of parallel applications are not ready to fully take advantage of an AMC system. Applications that are statically threaded using the low-level `pthreads` library usually suffer from load imbalance since their implementations assume that the work has to be equally distributed among the available cores. Implementing advanced load balancing schemes, such as work pools, in `pthreads` requires a significant development effort.

As an alternative, many parallel applications are implemented using loop-based scheduling with the OpenMP `parallel for` directives. In this case, the runtime library is in charge of



Figure 4.8 Speedup over 1 little core when running on 4 to 8 cores and 4 of them are big. Four different programming models are considered: Static threading using `pthreads`, parallel loops with static scheduling (loop static), parallel loops with dynamic scheduling (loop dynamic), and a task-based solution with dynamic scheduling (task-based).

scheduling work to the available threads in the system, either statically or dynamically, as described in Section 4.2.3.

We compare these solutions to the task-based approach evaluated in the previous sections. Figure 4.8 shows the results obtained from running blackscholes, bodytrack, fluidanimate and swaptions on all the scheduling models: static threading, static loop scheduling, dynamic loop scheduling and task-based scheduling. We chose these applications as they are the only ones implemented using the OpenMP loop directives.

Looking at the average results in Figure 4.8, we can observe that the task-based solution achieves the best results when the system is asymmetric. Task-based improves the static threading by up to 59% on 5 cores, while dynamic loop scheduling improves by up to 54%. The OpenMP version with static scheduling reaches an average 26% improvement over the static-threading approach with `pthreads`.

Taking a closer look to the results we observe that for bodytrack, an application with sophisticated parallelization techniques, static-threading achieves better results than loop-static. This is because the static-threading implementation contains specific parallelization techniques that cannot be completely expressed using the loop-static method. The loop-dynamic method improves performance for bodytrack by up to 4% due to the runtime decisions of the iteration execution, but the optimal solution is offered by the task-based approach that achieves up to 16% improvement over static-threading, due to the flexibility in expressing irregular parallelization strategies.

Blackscholes, fluidanimate and swaptions, consist of independent tasks and are a good

fit for loop parallelism. The first observation is that both applications benefit from the loop-static approach on an SMC with 4 big cores. Moreover, the task-based approach is still the optimal for blackscholes and fluidanimate, reaching up to 83% improvement over static threading for 5 cores, while for swaptions loop-dynamic is the most appropriate, improving the baseline by up to 2.6 $\times$ . The difference in the benefits of these applications relies on the task granularity; blackscholes consists of 6400 tasks that are about a hundred times smaller than each one of the 128 tasks of swaptions. This shows that loop-dynamic is more efficient on coarse-grained applications. Finally, fluidanimate, that is also a fine-grained application that consists of 128 500 tasks, also benefits from the task-based approach. For this benchmark, static and dynamic loop scheduling achieve similar performance; this is due to the limited parallelism per parallel region, as the loop-based implementation consists of multiple barriers between small parallel regions, fact that diminishes the effect of dynamic vs static scheduling.

## 4.4 CONCLUSIONS

In this extensive evaluation of highly parallel applications on an ARM big.LITTLE AMC system we showed that current implementations of parallel applications using pthreads are not ready to fully utilize an AMC. Implementing highly sophisticated parallelization strategies such as parallel pipelines (ferret) to exploit AMCs at the application level requires a significant programming effort and is not applicable to all workloads. The built-in GTS heterogeneity-aware OS scheduler only partially mitigates the slowdown of static threading when using both big and little cores. Both dynamically-scheduled loop- and task-based versions achieve higher performance with increased utilization which results in increased power. This leads to similar energy consumption as static threading and GTS, which ends up with better results in EDP.

Overall, GTS and static threading are not suitable solutions to run intensive multi-threaded applications on AMCs. Dynamic scheduling is essential to distribute the load across different core types. A loop-based implementation with dynamic scheduling is appropriate when the parallel work granularity is large and the potential imbalance at the tail of the loop is insignificant compared to the overall parallel region duration. A task-based implementation with inter-task dependencies allows removing barriers, which is the preferred solution, especially when the granularity of parallel regions is small.

# CHAPTER 5

## TASK-BASED SCHEDULING SOLUTIONS

---

In the previous chapter we showed that the most efficient way to utilize an AMC with HPC applications is by performing scheduling in the runtime system level, i.e. using a parallel programming model. In this chapter our goal is to enhance the current parallel programming models and make them more efficient when it comes to AMCs.

As performance and energy efficiency have become the main challenges for next-generation high-performance computing, asymmetric multi-core architectures can provide solutions to tackle these issues. Parallel programming models need to be able to suit the needs of such systems and keep on increasing the application's portability and efficiency. This chapter presents three task scheduling approaches that target asymmetric systems. These dynamic scheduling policies reduce total execution time either by detecting the longest or the critical path of the dynamic task dependency graph of the application, or by finding the earliest executor of a task. They use dynamic scheduling and information discoverable during execution, fact that makes them implementable and functional without the need of off-line profiling. In our evaluation we compare these scheduling approaches against an existing state-of the art heterogeneous scheduler and we track their improvement over a FIFO baseline scheduler. We show that the heterogeneous schedulers improve the baseline by up to  $1.45\times$  in a real 8-core asymmetric system and up to  $2.1\times$  in a simulated 32-core asymmetric chip.

## 5.1 INTRODUCTION

To effectively utilize AMC systems taking into account their heterogeneity, load balancing and scheduling become two of the main challenges [63]. An approach towards these challenges is the use of task-based programming models [7, 18, 40, 41]. The modern task-based programming models schedule tasks dynamically according to the availability of resources. They also allow the specification of dependencies between tasks, enabling the runtime system to automatically perform scheduling and synchronization decisions.

Even though task-based programming models is a powerful mechanism, the efficient mapping of ready tasks to different types of cores on an asymmetric system remains a challenge when considering the reduction of the total execution time. Task-based parallel applications expose different characteristics that can affect the total application duration such as complex task dependency graphs (TDGs) with long critical paths or different levels of task cost variability. In such cases it is very common that the tasks in the critical path determine the total application duration. These characteristics influence researchers to develop smart scheduling techniques within a task-based programming model and accelerate the overall application. The criticality-aware schedulers detect the critical tasks of an application and increase performance by running critical tasks on fast cores. Some previous works [35, 49, 65, 91] tackled this issue using static scheduling over the whole TDG to statically map tasks to processors on a heterogeneous system. However, they required the knowledge of profiling information and most of them were evaluated on synthetic randomly-generated TDGs.

This chapter presents three novel dynamic task schedulers that detect the critical path of the in-flight dynamic snapshot of the TDG. Furthermore, this chapter shows the potential of the proposed dynamic scheduling techniques compared to a state-of-the-art dynamic heterogeneous scheduler [91]. Specifically we compare our approaches against the a dynamic implementation of the heterogeneous earliest finish time scheduler (HEFT) [91]. We implement these scheduling policies in the OmpSs [9, 41] programming model that supports dynamic scheduling and dependency tracking as described in Section 3.3.

Compared to previous works, all the scheduling policies described and evaluated in this chapter are based on information discoverable at runtime, are implementable and work on a real asymmetric multi-core platform with real applications and therefore, using real TDGs. The contributions of this chapter are the following:

- The Criticality-Aware Task Scheduler (CATS) that dynamically assigns critical tasks to fast cores in an AMC system. Tasks are defined to be critical if they are part of the *longest path* in the in-flight dynamic state of the dependency graph. The flexibility

and work stealing policy of this scheduler are configurable. Flexibility increases the number of tasks considered critical. Work stealing may be uni- or bi-directional: only fast cores can steal from slow cores, or slow cores can also steal from fast cores.

- The Critical Path scheduler (CPATH) that dynamically assigns the tasks that belong to the *critical path* of the TDG to the fast cores of the system. To do so, CPATH tracks the execution time of the tasks, assigns cost-based priorities and, according to these priorities it detects the critical tasks.
- The Hybrid Criticality scheduler (HYBRID) that incorporates the features of CATS and CPATH by assigning to the fast cores tasks that belong *either to the critical path or to the longest path* of the TDG, depending on the runtime circumstances. HYBRID uses mixed priorities that are cost-based or level-based. This technique also keeps track of the task costs but if this information is not available it uses the mechanisms of CATS that dynamically detects the longest dependency chain of the in-flight dynamic state of the TDG
- An evaluation of CATS, CPATH and HYBRID schedulers compared to the state of the art heterogeneous scheduler HEFT [91], all of them implemented in the OmpSs programming model. Moreover we evaluate these approaches next to the default FIFO scheduler that serves as our baseline. The results show that all heterogeneous schedulers improve overall performance reaching up to 45% improvement. Furthermore, we describe their features such as the high per-task overheads of CPATH, the inability of dHEFT to improve performance when the task number increases as well as the benefit of HYBRID scheduler compared to CATS when task cost variability increases.

## 5.2 CRITICALITY-AWARE TASK SCHEDULER (CATS)

The Criticality-Aware Task Scheduling generally applies to task-based programming models supporting task dependencies, but for simplicity we explain it in the context of the OmpSs programming model. CATS uses bottom-level longest-path priorities and consists of three steps:

**Task prioritization:** when a task is created and added to the TDG, it is assigned a priority and the priority of the rest of tasks in the graph is updated accordingly.

**Task submission:** when a task becomes *ready*, i.e., all its predecessors finished their execution, it is submitted to a *ready queue*. At this point, the algorithm decides whether the task is considered *critical* or *non critical*. The task is then inserted in the corresponding ready queue:

tasks in the *critical ready queue* will be executed by fast cores, and tasks in the *non-critical ready queue* will be executed by slow cores.

**Task-to-core assignment:** when a core becomes idle, it retrieves a task from one of the two ready queues to execute. The different policies and scenarios of this step are detailed in Section 5.2.3.

These steps are performed dynamically and potentially in parallel in different cores. Thus, while some tasks are being prioritized, previously created tasks may be submitted, and others assigned to available cores or executed.

To give an overview of the scheduling process, Figure 5.1 shows a scheme of the operation of CATS. In the TDG on the left, each node represents a task and each edge of the graph represents a dependency between two tasks. The number inside each node is the *bottom level* of the task: the length of the longest path in the dependency chains from this node to a leaf node. The priority of a task is given by its bottom level. The pattern-filled nodes indicate tasks that are considered critical. The number outside each node is the task id and is used in the text to refer to each task. Critical tasks are inserted in the critical queue, and non-critical tasks to the non-critical queue. The insertion is ordered with the highest priorities at the head of the queue and the lowest priorities at the tail. Slow cores retrieve tasks from the head of the non-critical queue and fast cores from the critical queue. The following sections describe these scheduling steps in detail.

### 5.2.1 TASK PRIORITIZATION

Each task in the TDG has a list to include its predecessors (*plist*). Every time an edge is added into the TDG on the creation of a new task, the corresponding predecessor of the dependency is added in the *plist* of its successor. For example, in Figure 5.1, when the dependency between tasks 2 and 5 occurs, the task number 2 is inserted into the *plist* of the task number 5. Thus, the *plist* of task number 5 becomes {2}. Accordingly, the *plist* of task number 10 will be {2, 9} when the edge 9→10 is inserted to the TDG.

The priority given to a task is the *bottom level* of the task. The *bottom level* is computed by traversing the TDG upwards starting from the successor that the currently created edge is pointing to. The priority of this successor is 0 because it is a leaf node of the graph, as it is the last created task. Then, using *plist* for each task, the algorithm navigates to the upper levels of the TDG and updates the priority on each visited node. This way not all the graph is updated, but only the tasks that are predecessors in the paths to the new edge. The algorithm also stops going up through a path, when it finds a priority larger than the one it would be updated to.

Listing 5.1 shows the algorithm for task prioritization. The complexity of this is  $O(n^2)$ ,



Figure 5.1 CATS overview. Nodes are marked with the *bottom level* of each task. Pattern-filled nodes mark the critical tasks.

$n$  being the number of tasks. This function is called on the creation of a new edge with the successor as argument. The algorithm traverses the *plist* of the successor task (line 5) and if the priority of the current predecessor is lower than the bottom level of the successor plus one, it updates the current predecessor's priority to that value (lines 7-8). If the updated predecessor task is ready (i.e., it sits in one of the ready queues), the scheduler reorders the ready queue so it remains ordered considering the updated priority (lines 9-10). Then, the same actions are performed recursively for each predecessor of the *plist* to update all the possible upward paths from the successor.

The terminate conditions for the TDG navigation are two: (a) if the *plist* of the current task (*currPred*) is empty, so either we reach an entry node or the predecessors of the task have finished execution; or (b) if the priority of the current task (*currPred*) remains unchanged, which means that the successor task (*succ*) does not belong to the longest path because its predecessor already has a higher priority.

### 5.2.2 TASK SUBMISSION

The purpose of this step is to divide the tasks into two groups: *critical* and *non-critical*. Critical tasks are tasks that belong to the longest path of the dynamic TDG, namely the path with the maximum number of tasks (or nodes). Thus, the longest path starts from the task with the

---

```

1 void prioritize_task(task *succ) {
2     int blev = succ->priority;
3     list plist = plistOf(succ);
4     task *currPred;
5     while( not isEmpty( plist ) ) {
6         currPred = plist.next();
7         if(priorityOf(currPred) < blev+1) {
8             currPred->priority = blev+1;
9             if(isReady(currPred))
10                 readyQueueOf(currPred)->reorder();
11             prioritize_task(currPred);
12         }
13     }
14 }
```

---

Listing 5.1 Pseudo-code task prioritization with CATS.

maximum bottom level. At runtime, the longest path changes as tasks complete execution and new tasks are created. CATS manages to detect these changes and dynamically decide if the submitted task belongs to the longest path of the TDG.

When a task's dependencies are satisfied, the task becomes ready for execution and is to be inserted in the *ready queues*. Ready queues are priority queues that keep tasks in a decreasing order of task priorities, i.e., the task with the maximum priority resides on the head of the queue. Critical tasks are inserted in the critical queue and non-critical tasks in the non-critical queue. The pattern-filled nodes in Figure 5.1 represent the critical tasks in that graph.

To determine the criticality of a task, CATS keeps track of the last discovered critical task. Then, for each task that becomes ready, CATS checks the following conditions: (a) if the priority of the current ready task is higher or equal to the priority of the last discovered critical task and, (b) if the current ready task is the highest-priority immediate successor of the last discovered critical task.

In the first case, the algorithm detects new longest paths that may have been created by the application throughout the execution of a prior longest path. In this case, the scheduler can either be *strict* or *flexible*:

- *Strict*: marks as critical tasks with priority higher than the priority of the last critical task.
- *Flexible*: marks as critical tasks with priority higher or equal to the priority of the last critical task.



Figure 5.2 Task submission with CATS. Gray nodes indicate finished tasks and pattern-filled nodes indicate critical tasks.

As a result, the flexible scheduler ends up with more critical tasks than the strict. The flexibility of the scheduler can be set by the programmer through an environment variable. The task that satisfies the second condition is a task with a lower priority than the maximum but the task belongs to the longest path because it is the highest priority immediate successor of the last detected critical task.

Listing 5.2 shows a simplified version of the task submission code, that is of complexity  $O(n)$  ( $n$  is the number of tasks). The variable `maxPriority` (line 1) is used to store the priority of the last critical task, and `maxPriorityTask` (line 2) is used to store the last critical task. Initially, `maxPriority` is set to 1 and `maxPriorityTask` is set to `NULL`. This avoids the scheduling of independent tasks (i.e., tasks with zero priority) to fast processors at the start of the execution. On the first ready task, if its priority is higher or equal than 1 (line 5), it is considered to be the first task of the longest path. Therefore, it is inserted in the critical queue and the variables `maxPriority` and `maxPriorityTask` are updated accordingly (lines 9-11) to determine correctly the criticality of the next submitted task.

If the priority of the submitted task is equal to `maxPriority - 1`, we check if it also belongs to the successors of the task with the maximum priority (lines 6-7) and therefore to the longest path. If these two conditions are met, the task is determined to be critical, it is inserted in the critical queue and, as before, the variables `maxPriority` and `maxPriorityTask`

---

```

1 int maxPriority = 1;
2 task *maxPriorityTask = NULL;
3
4 void submit_task(task *t) {
5     if( t->priority >= maxPriority or
6         ( t->priority == maxPriority -1 and
7             t ∈ succListOf(maxPriorityTask) ) )
8     { //the task is critical
9         critical_queue.push(t);
10        maxPriority = priorityOf(t);
11        maxPriorityTask = t;
12        return;
13    }
14    //the task is non-critical
15    non_critical_queue.push(t);
16 }
```

---

Listing 5.2 Pseudo-code for task submission with CATS.

are updated (lines 9-11). In the rest of the cases the task is not considered critical and it is inserted in the non-critical queue.

Figure 5.2 shows an example of a TDG during task submission. The gray nodes in the graph are tasks that have finished execution and the pattern-filled nodes are critical tasks. The numbers inside the nodes indicate their priority and the numbers outside the nodes show the task id, which is assigned in task creation order. The variable `maxPriority` corresponds to the priority of the last critical task and the `maxTaskSucc` is the list of the successors of the last critical task, filled with the task ids of the successors. Initially, `maxPriority` is set to 1 and `maxTaskSucc` is empty. When task 2 is about to be submitted, it is inserted in the critical queue because its priority is higher than the maximum, which at the beginning is 1. Then, the value of `maxPriority` is set to 6 (priority of task 2), and the `maxTaskSucc` list is updated with the successors of task 2. At the point where all the gray tasks have finished execution, the values of `maxPriority` and `maxTaskSucc` are updated as shown in Figure 5.2. For every newly-ready task, the conditions listed above are evaluated. When task 7 is submitted, it is not considered as critical because it does not belong to the `maxTaskSucc` list and its priority is not equal to `maxPriority-1`. Contrarily, task 8 satisfies both conditions and so the task is inserted in the critical queue.

### 5.2.3 TASK-TO-CORE ASSIGNMENT

Task-to-core assignment takes place dynamically and in parallel to the previous steps and its time complexity is  $O(n)$ ,  $n$  being the number of tasks. When a core becomes idle, it checks the corresponding ready queue (depending on the core type) to get a task to execute. Fast cores retrieve critical tasks from the critical queue, while slow cores retrieve non-critical tasks from the non-critical queue. Each ready queue is shared among the cores of the same type so there is no need for work stealing among cores of the same type.

If tasks in an application are imbalanced, i.e., the majority are non-critical and only a few tasks are critical, or vice versa, one of the types of processors would be overloaded and the other would starve for work. This can happen in applications with wide graphs and a large amount of tasks, where the ratio between critical tasks and the total amount of tasks may be small. To leverage the resources, the work-stealing mechanism for CATS lets fast cores steal work from slow cores whenever the critical queue becomes empty. Also, CATS can be configured to perform *bidirectional work stealing* so slow processors can also steal tasks from the critical queue if the non-critical queue is empty. We evaluate these different options and show the results in Section 5.6.2.

## 5.3 CRITICAL PATH SCHEDULER (CPATH)

The Critical Path scheduler (CPATH) dynamically detects the critical path of the TDG. Like CATS, CPATH separates tasks into two groups: critical and non-critical tasks. The detected critical tasks are executed by the fast cores in the system and non-critical tasks are executed by slow cores. The difference with CATS is the algorithm for critical path detection. CPATH takes into account the task execution time, about which CATS is unaware. To do so, CPATH implements a more complex and accurate critical path detection algorithm that takes into account task execution time.

CPATH scheduler consists of three steps:

**Task prioritization:** this step takes place when a task is finishing its execution. This is different than CATS since at the end of a task execution the algorithm may record the task execution time (task cost). According to the discovered task cost CPATH assigns priorities to tasks by traversing the TDG from top to bottom, introducing the cost of  $O(2n^2)$ , where  $n$  is the number of tasks.

**Task submission:** when a task becomes *ready*, it is submitted to a *ready queue*. At this point, CPATH decides whether or not the task is *critical* and inserts it in the corresponding ready queue. This step has only slight implementation differences with CATS and complexity of  $O(n)$ .



Figure 5.3 Priority assignment with CPATH taking into account the task costs. Task costs are assumed known and are shown in the tables.

**Task-to-core assignment:** this step is identical to CATS.

### 5.3.1 TASK PRIORITIZATION

Each task of the TDG keeps a list with its successors (*slist*). This list is being built when an edge (dependency between two tasks) is added in the TDG. So when a task dependency occurs, the corresponding successor task is added in the *slist* of its predecessor. For example, on Figure 5.3, when the dependency between tasks 2 and 4 occurs, the *slist* of task number 2 becomes {4}. This goes on for all the added edges of the TDG, therefore when the edge 2→5 is inserted in the TDG, the task number 5 is inserted in the *slist* of task number 2; so the *slist* of task number 2 becomes {4, 5}.

The goal of this step is to assign priorities based on the *bottom cost* of the tasks of the TDG. We define the *bottom cost* of a node on a directed acyclic graph as the maximum estimated time in the dependency chains from this node to a leaf node. So the main difference between the *bottom level* and the *bottom cost* is the consideration of the estimated time.

Figure 5.3 is used to describe the priority assignment with CPATH. The specific TDG contains tasks of two different types and two different input sizes. Node color shows the different task types and the outline of the circle (dashed or solid) shows the different input sizes. The upper table in Figure 5.3 indicates the execution time of the tasks according

---

```

1 void taskExit (task* finished) {
2     if( stateOf(finished) == init ) {
3         finished->state = in_progress;
4         return ;
5     }
6     if( stateOf(finished) == in_progress ) {
7         timesSet[finished] = finished->execTime;
8         finished->state = tracked;
9     }
10    task* succ;
11    for( succ in finished->successors )
12        if( numPredecessorsOf(succ) == 1 ) {
13            lock();
14            if( succ &lt; entryNodes )
15                entryNodes->push(succ);
16            unlock();
17        }
18    list<task*>* updatedList = new list<task >();
19    for( node in entryNodes )
20        updatePriorities(node, updatedList);
21    for( node in updatedList )
22        node->unsetUpdated();
23 }
```

---

Listing 5.3 Pseudo-code for taskExit, the function called by the cores used as reference for tracking the task costs

to their type and input size. The algorithm assumes that task instances of the same type with the same input size have the same (or very similar) execution time. To track this information, CPATH discovers the cost of every possible task type-input size duple (tt-is duple) that appears on the TDG. The numbers inside the nodes show the bottom cost-based priorities that CPATH assigns and the numbers outside the nodes show their task ID.

The task prioritization step takes place every time a task finishes execution. CPATH uses a vector to store task costs and keeps one entry per tt-is. Because CPATH needs to discover the unbiased critical path of the TDG, it uses one of the core types as reference to track the task costs. In our experiments we chose to use as reference the fast cores since this way the learning phase (that is, the phase where CPATH discovers the task costs) becomes shorter. To avoid wrong task cost prediction of future tasks, CPATH ignores the first execution of each tt-is because usually it takes more time.

Listings 5.3 and 5.4 show how the critical path scheduler performs task prioritization.

Whenever a task finishes execution on one of the cores used as reference (here: fast cores) the runtime makes a call to the `taskExit` routine shown in Listing 5.3. At this point, the runtime is aware of the execution time of the finished task. This function has the responsibility to update the known task costs and also perform the prioritization of the tasks on the TDG. The prioritization is done by the `updatePriorities` function of Listing 5.4. This function is responsible for TDG traversal.

The `taskExit` function in Listing 5.3 takes as an argument the task that has just finished. In order to keep track of whether the execution time of the tt-is has been discovered we implement a small finite state machine within this stage. Every tt-is has three possible states. The initial state is the `init` state; this means that the specific tt-is has not yet been executed so its execution time is totally unknown. When a tt-is is executed for the first time its state changes from `init` to `in_progress`. This means that a task of this tt-is has been executed once, but CPATH ignores this cost because the first instance may not be representative due to cold start effects and one sample may not be enough history for prediction. While the tt-is of a node is in `init` or `in_progress` state its execution time is considered to be 1. After the second execution of a tt-is the state of it becomes `tracked` meaning that the execution time has been tracked and can be used for the computation of the priorities.

After the first checks of the tt-is state (lines 2-9 of Listing 5.3) the algorithm traverses the `slist` of the finished task and searches for the successors that become ready by the end of the execution of this task. This is identified by the fact that the ready-to-be successors have one unique (remaining) predecessor (e.g. the just finished task). These successors are inserted in the `entryNodes` list (lines 11-16 of Listing 5.3). For each one of the entry nodes the `updatePriorities` function is called (line 19 of Listing 5.3); this performs a top to bottom traversal of the TDG and updates the priorities.

Due to the properties of the top-to-bottom TDG traversal, the algorithm has to make sure that every node is prioritized only once per `updatePriorities` call. This is controlled by checking the `updated` flag of each node of the TDG. To visualize this situation let us assume that task number 2 of the TDG on Figure 5.3 finishes. Then the `entryNodes` list contains three tasks that will start the update:  $\{4, 5, 6\}$ . The update that starts from task number 4 marks tasks 4, 7, 11 and 13 as updated. Then, during the update of task number 5, the algorithm knows that task 13 has already been prioritized during the same update so there is no need to apply the algorithm at this node again. This example does not show too much optimization because in this case the update of only one node is saved, but in real applications this node could have numerous successors for whom the priority update would be a large overhead.

The raising of the `updated` flag is something temporal and is only used for helping the prioritization of a single update. There are cases when CPATH needs to raise a permanent

---

```

1 int updatePriorities (task* currT , list* updated) {
2     if( currT == NULL ) return 0;
3     if( isVisited(currT) )
4         return priorityOf(currT);
5     successors = currT->successors ;
6     int maxSucc = -1;
7     bool succVisited = true;
8
9     for(succ in successors) {
10        int succPriority ;
11        //Avoid double update
12        if( (not isUpdated(succ)) or (not isVisited(succ)) ) {
13            succPriority = updatePriorities(succ , updated);
14            succ->setUpdated();
15            updated->push(succ );
16        }
17        else
18            succPriority = priorityOf(succ );
19        if(succPriority > maxSucc)
20            maxSucc = succPriority ;
21        succVisited = succVisited and isVisited(succ );
22    }
23    if( timeIsTracked(currT) ) {
24        currT->priority = (maxSucc + timesSet[currT ]);
25        if(succVisited and groupOf(currT ) < twDetected)
26            currT->setVisited();
27    }
28    else
29        currT->priority = maxSucc + 1;
30
31    return priorityOf(currT );
32 }
```

---

Listing 5.4 Pseudo-code for task prioritization with CPATH

flag in order to mark that the priority of the task will not change again in the future, e.g. it is the final priority. This happens when the execution times of all the tt-is that appear on the TDG have been discovered, for the tasks that their priorities are up to date. To mark these tasks CPATH uses the `visited` flag. If a task is `visited`, there is no need to get prioritized again. To clarify this, let us assume that in Figure 5.3 the task costs of the tt-is TaskA-Input2 and TaskB-Input2 are known. During the next prioritization, tasks 11 (TaskB-Input2), 12 (TaskA-Input2) and 13 (TaskA-Input2) in the TDG will be set as `visited`, because their priorities consist of the sum of known task execution times and they do not have any successors (with unknown execution times). So, an additional priority update in cases like this is redundant.

Listing 5.4 shows what happens during the the update of one entry node. The arguments of this function are `currT`, that is the entry node being updated, and `updated`, that is the list with the updated nodes. This list is being filled throughout the priority update in order to unset the `updated` flag later. The lines 2-4 of Listing 5.4 perform the checks that would cause the traversal to finish. If the node is not `visited`, then the algorithm traverses its successors. Note that, at this point, there is no check for `updated` flag, since tasks in the `entryNodes` are unlikely to be updated. Updated nodes can only be discovered through recursive calls and this check is performed later. If a successor is `updated` or `visited`, the priority update is skipped for the reasons explained above. Otherwise, the `updatePriorities` is called recursively for the current successor. This happens until we detect a node that is `updated`, `visited` or is a leaf node (node with no successors) of the TDG. When the algorithm reaches a node ready for update it calculates its priority by summing the highest priority of its successors to the execution time, if known, of the current node (lines 24, 29). Finally, the `visited` flag of the task is being updated.

There are three conditions that mark a task as `visited`: (a) if its execution time is known (line 23), (b) if **all** of its successors are `visited` (line 25) or (c) if we have encountered a `taskwait` (barrier) after the creation of this task (line 25). The last condition confirms that it is safe to mark this task as `visited` as there will be no future successors of this task on the current TDG. To track this information we use an atomic variable, `twDetected`, which is increased every time a `taskwait` is encountered. At creation time, each task is assigned a group ID which is the value of the `twDetected` at that moment. If the group ID of a task is less than the current `twDetected` value then this means that a `taskwait` has occurred after the creation of this task.

### 5.3.2 TASK SUBMISSION

The task submission is implemented using the same `critical` and `non-critical` ready queues as in CATS. Listing 5.2 can be used to describe the task submission of CPATH. The only modification needed is in the condition of the lines 6 and 7 of Listing 5.2. In addition to the `maxPriority`, CPATH keeps track of the `maxExecTime` which is the cost of the last discovered critical task. CPATH extends the condition of the critical task consideration by checking whether the priority of the current task is equal to `maxPriority - 1` or if it is equal to `maxPriority - maxExecTime`. Moreover, the value of `maxExecTime` is updated accordingly to the `maxPriority`.

### 5.3.3 TASK-TO-CORE ASSIGNMENT

Task-to-core assignment in CPATH is identical to CATS as described in Section 5.2.3. It takes place dynamically and in parallel to the previous steps. Depending on the core type, whenever a core becomes idle it retrieves a task from its corresponding ready queue; fast cores are responsible for the execution of the tasks in the critical queue and slow cores for the tasks in the non-critical queue. Each ready queue is shared among the cores of the same type so there is no need for work stealing among cores of the same type. Finally, as with CATS, the work-stealing mechanism prevents load imbalance by allowing big cores to steal work from the little cores.

## 5.4 HYBRID CRITICALITY SCHEDULER (HYBRID)

The Hybrid Criticality Scheduler (HYBRID) is a combination of the CATS and CPATH scheduling policies. HYBRID keeps the simplicity of the implementation of CATS and introduces the task execution time only if available. This results in an efficient low-overhead scheduler that computes the critical path of a TDG more faithfully than CATS and with lower overheads than CPATH. This section describes HYBRID through its relation to CATS and CPATH described in Sections 5.2 and 5.3. We focus our description on the task prioritization, since task submission and task-to-core assignment for HYBRID are identical to CPATH.

As shown, CPATH computes priorities on task completion. The algorithm for priority computation is an expensive operation and is in the critical path of the execution: on task completion the core becomes available but the start of the next task is delayed by priority computation. Also, when multiple cores are completing tasks, there will be contention on accessing the TDG for priority computation. On the other hand, CATS computes priorities

during task creation. The computation of priorities during task creation is more efficient because, unless there is nested parallelism, one core creates all tasks and therefore there is no contention due to multi-threading on priority computation. The downside is that there is potentially less information available on tt-is pair execution time on task creation, as some task type may have not been executed yet at the time all tasks are created.

HYBRID tracks task execution time on task completion and stores this information in a vector. This means that it also implements the `taskExit` function of CPATH that is called on task completion but, in the case of HYBRID, `taskExit` is only responsible of recording the execution time of the exiting task. This functionality is represented in lines 2-9 of Listing 5.3 and, after this code, the function returns. The priority assignment, taking place on task creation, remains similar to CATS<sup>1</sup> with the only difference that task cost is used for priority computation only if known and, otherwise, the cost is assigned to 1 and priority is increased according to CATS (lines 7 and 8 of Listing 5.1).

When comparing CPATH and HYBRID schedulers their logical operation is similar. However the difference in their implementation may result in different task priorities potentially leading to different schedules. For applications with small TDGs, HYBRID may not be able to compute an accurate critical path because task creation does not overlap with a sufficient amount of task exits. Therefore, task execution information will not be available during priority computation and HYBRID will prioritize based on bottom-level priorities (like CATS). If the application has a large TDG and task creation overlaps with a sufficient amount of task exits, HYBRID will use bottom-cost priorities.

Figure 5.4 shows an example of task prioritization with HYBRID. The tables show the state (or exec. time) of the tt-is pairs that appear on the TDG. Gray or white nodes indicate different task types (A or B respectively) and solid or dashed node outlines indicate task input size (1 or 2 respectively). The numbers inside the nodes show task priorities and the numbers outside the nodes show the task id.

On the leftmost TDG, the algorithm has no information about any of the tt-is costs. As the leftmost table shows, for all the possible tt-is the state is `init` meaning no task has been executed yet. Since the tasks of the TDG have been created, they have been prioritized using the CATS priority assignment method and the bottom level based priorities. On the rightmost TDG, tasks 1, 2, 3, 4, 5, 6 and 8 have been executed and a new task has appeared on the TDG: task number 13. When the edge 12→13 is created, tasks begin to be prioritized. Initially, the priority of the new task 13 is the cost of this task's tt-is, i.e., type A and input 2 (TaskA-Input2). Since there are no successors of this task, this becomes its initial priority. Then, the `plist` of task 13 is traversed and the priority of task 12 changes to

---

<sup>1</sup>All of the HYBRID scheduling steps have the same time complexity as CATS



Figure 5.4 Priority assignment with HYBRID scheduler. Priority update when the edge between tasks 12 and 13 is created

$\text{priorityOf}(13)+\text{costOf}(\text{TaskA-Input2})$  since task 12 is corresponding to the TaskA-Input2 tt-is. Moving to the upper levels, task 10 is of tt-is TaskA-Input1 that is on the `init` state, thus unknown cost. This translates to the use of bottom level based prioritization so the priority of task 10 becomes  $\text{priorityOf}(12)+1$ . Finally, task 9 is prioritized using the cost of the TaskA-Input2 tt-is and the TDG navigation stops since there are no other predecessors.

## 5.5 DYNAMIC HETEROGENEOUS EARLIEST FINISH TIME SCHEDULER (DHEFT)

The Heterogeneous Earliest Finish Time (HEFT) algorithm [91] is a well-known state-of-the-art static scheduling algorithm for asymmetric systems. HEFT consists of two compile-time phases that use profiling information: the *task prioritizing phase* and the *processor selection phase*. In the first phase, the algorithm assigns priorities to the tasks based on their *upward rank*, that is, the length of the critical path from a given task to the exit task including task computation and communication costs [91]. When task prioritizing is done, the tasks are sorted according to their priorities. In the *processor selection phase* the algorithm searches for each task the appropriate processor to execute it. By keeping communication and compu-

tation costs, HEFT assigns each task to the processor that will finish its execution at the earliest possible time. Topcuoglu et al. [91] present their results based on evaluation on synthetic TDGs and assume known task execution and communication times at compile time. The scheduling is static, so all the decisions are taken before execution.

In this paper, since the evaluation consists of running real applications with unknown task costs, the best way to compare HEFT to our proposal is by using a dynamic version of HEFT algorithm (dHEFT). The dHEFT is implemented in the OmpSs programming model and is based on the implementation used in the evaluation of CATS [31]. This version assumes two different types of cores (fast and slow) and keeps records of the task costs in each core. DHEFT discovers the task costs at runtime, computes the mean cost of each task for each core type and then finds the core that will finish the task at the earliest possible time.

To find the earliest possible executor, dHEFT maintains one list per core (wlist) including the ready tasks waiting to be executed by that core. When a task becomes ready, dHEFT first inserts it in the ordered ready queue; then the task with the highest upward rank is selected and dHEFT checks if there are execution time records for this task. If the number of records is sufficient (we require a minimum of three records) then the estimated cost of the task is considered stable. Using that estimated execution time, the task is scheduled to the earliest executor by consulting the wlist of all cores. If the number of records is not sufficient for one of the core types, then the task is scheduled to the earliest executor of this core type to get another record of that task-type and core-type execution time. In all cases, dHEFT updates the history of records on every task execution to adapt for phase changes in the application.<sup>2</sup>

The initial dHEFT version presented in previous work [31] lacks the *task prioritizing phase* of the original HEFT algorithm. This paper, uses an improved version of dHEFT that adds this functionality by prioritizing tasks according to their *upward rank*. The implementation of this is similar to the CPATH prioritization step. When the prioritized tasks become ready, they are inserted in a sorted ready queue in decreasing order of their priorities. The algorithm then accesses the tasks in the order of their priorities to find the earliest executor for each of them.

---

<sup>2</sup>The time complexity of the task submission step is  $O(nN)$  and the task-to-core assignment is  $O(n)$ , where  $n$  is the number of tasks and  $N$  is the number of cores.



Figure 5.5 Cholesky factorization task dependency graphs.

## 5.6 EVALUATION

### 5.6.1 METHODOLOGY

In the first part of our real environment evaluation, we measure the execution time of four criticality-sensitive applications using CATS ad the default OmpSs scheduler. We evaluate four different CATS configurations based on the options explained in Section 5.2 and provide a detailed comparison between them. The options consist of whether the task submission policy is strict or flexible, and the type of the work stealing mechanism. This results in the following configurations:

- Flexible with simple work stealing (SS FLEX)
- Flexible with bidirectional work stealing (2DS FLEX)
- Strict with simple work stealing (SS STRICT)
- Strict with bidirectional work stealing (2DS STRICT)

As mentioned in Section 3.3.1, the default OmpSs scheduler employs a *breadth-first* policy (BF) [39]. The BF scheduler implements a single first-in-first-out ready queue shared among all threads. When a task is ready, it is inserted in the tail of the ready queue and when a core becomes available, it retrieves a task from the head of the queue. Tasks are ordered according to their ready time: the earliest ready task resides at the head of the queue. Since the ready queue is shared, there is no need for work stealing and the load is balanced automatically. BF does not differentiate among core types and assigns tasks in a first-come-first-served basis.

After the analysis of the CATS configurations we measure the execution time of five applications using CATS, CPATH, HYBRID, dHEFT and the default OmpSs BF scheduler that is explained in Section 3.3.1. The execution time corresponds to the average of 10 executions of the application on each machine set-up.

Our test bed comprises a real big.LITTLE processor and a simulated heterogeneous system.

To perform our experiments, we use the Hardkernel **Odroid-XU3** development board that has an 8-core Samsung Exynos 5422 with an ARM big.LITTLE architecture. This platform is described in detail in section 3.1. In our experiments, we evaluate a set of possible combinations of fast and slow cores varying the total number of cores from two to eight.

To evaluate heterogeneous scheduling on larger multi-core systems we use the heterogeneous multi-core TaskSim simulator [83] described in section 3.2. TaskSim allows the specification of a heterogeneous system with two different types of cores: fast and slow. We can configure the amount of cores of each type and the difference in performance between the different types (performance ratio) in the TaskSim configuration file. In our experiments, we evaluate the effectiveness of the schedulers on 8 distinct heterogeneous machine configurations. These comprise systems with 16 or 32 total number of cores, and the number of fast cores ranging from 1 to 16. We set the performance ratio between fast and slow cores to  $4.5\times$  because this is the average performance ratio observed on the real machine for the benchmarks of this evaluation.

For both real and simulated platforms, each set-up has a given number of *total* and *big* cores. For all the scheduling approaches we present their speedup over the execution on one *little* core, shown in Equation 5.1.

$$\text{Speedup}(\text{total}, \text{big}) = \frac{\text{Exec. time}(1, 0)}{\text{Exec. time}(\text{total}, \text{big})} \quad (5.1)$$

### *Applications*

We use five scientific applications implemented in the OmpSs programming model: Cholesky factorization, QR factorization, Heat diffusion, Integral Histogram and Bodytrack, also described in Section 6.2.1. These benchmarks are accessible in the BSC Application Repository [12] and in the PARSECSs library [26].

Table 5.1 shows the different configurations and characteristics of the applications. For cholesky, QR, heat diffusion and integral histogram the input is a square matrix divided into blocks. The *Problem size* column of Table 5.1 shows the dimension of the input matrix and the block size. For example cholesky 8K 1024 takes as input a matrix of  $8192\times 8192$  and it is then divided to  $8\times 8$  blocks of 1024 elements each. Each task created in the par-

Table 5.1 Evaluated benchmarks and relevant characteristics. The inputs of QR and Heat diffusion are arrays of doubles and the inputs of Cholesky and Int. Histogram are arrays of floats.

| Application    | Problem size | #Tasks  | #Task types | Avg task exec. time ( $\mu s$ ) | Per task overheads ( $\mu s$ ) |          |          | Measured perf. ratio |
|----------------|--------------|---------|-------------|---------------------------------|--------------------------------|----------|----------|----------------------|
|                |              |         |             |                                 | CATS                           | CPATH    | HYBRID   |                      |
| Cholesky       | 8K 1024      | 120     | 4           | 10 314 660                      | 81.19                          | 115.29   | 112.41   | 3.48                 |
|                | 8K 512       | 816     |             | 1 551 322                       | 104.76                         | 238.02   | 194.28   |                      |
|                | 16K 512      | 5984    |             | 1 551 322                       | 104.76                         | 238.02   | 194.28   |                      |
| QR             | 8K 512       | 1 496   | 4           | 11 651 079                      | 1 419.33                       | 2 580.41 | 1 451.74 | 6.86                 |
| Heat diffusion | 8K 512       | 5 124   | 3           | 93 198                          | 145.17                         | 748.84   | 170.00   | 3.68                 |
| Int. Histogram | 8K 512       | 2 048   | 2           | 514 096                         | 217.45                         | 62.07    | 263.62   | 2.23                 |
| Bodytrack      | native       | 408 525 | 6           | 41 869                          | 93.90                          | 120.93   | 120.93   | 4.14                 |

allelized versions operates on one block. The bigger the block size, the less the number of blocks created, which leads to less tasks. From the applications of Table 5.1, QR and Heat diffusion operate on doubles while Cholesky and Integral Histogram operate on floats. The performance ratio between big and little cores depends on the application. For example, the difference between the issue rate and throughput of double-precision floating point units of both types of cores is larger than the difference for single-precision floating point instructions. Therefore, applications with heavy double-precision operation (e.g. QR) get a larger benefit from running on the big cores, than single-precision dominated applications (e.g integral histogram), as shown in Table 5.1.

The average per task overhead for each scheduler is negligible compared to the average task execution time shown in Table 5.1. Specifically, CATS has the lowest per task overheads. Next is HYBRID and the least efficient is CPATH. This is because of the complexity of the CPATH algorithm that takes place whenever the TDG needs to be updated. On the other hand, CATS and HYBRID have negligible overheads caused by the task prioritization. For dHEFT, the search of the appropriate worker for a task becomes an obstacle in performance. Table 5.1 lacks the per task overheads of dHEFT because they appear to be too high due to the fact that the most intensive computations of dHEFT take place during the cores' idle time. Thus, the natural idle time of cores is also encountered as scheduling overhead and could not be separated, so it is unfair to present such results for comparison. Normally these obstacles in heterogeneous schedulers are paid off by the more effective task execution.

Figure 5.5 shows the TDGs for input sizes of (a)  $8 \times 8$  and (b)  $16 \times 16$  blocks. Critical tasks are denoted as red nodes. The TDG becomes wider as the number of blocks increases. This reduces the percentage of critical tasks that is 17.5% in the case of the  $8 \times 8$  input and 5.51% in the case of the  $16 \times 16$  input. The  $8 \times 8$  blocks case shows a narrower TDG that makes the application more criticality sensitive than the  $16 \times 16$  blocks case that exposes



Figure 5.6 Task cost distribution for each application. Results are based on 4BIG-core executions. *x* axis shows the cost of the tasks and *y* axis shows the number of tasks with the corresponding task cost.

more parallelism. We evaluate both configurations to show the impact of scheduling on different criticality sensitiveness of the application configuration.

To more precisely characterise the benchmarks, we plot the task cost variability for each benchmark on Figure 5.6. We normalize the task execution time of the applications with respect to the smaller task observed among all applications so that we can easily compare the task sizes of different applications as shown on the distributions. For each of these plots, the *x* axis shows the normalized task cost (i.e. task execution time) and the *y* axis the number of tasks that correspond to this task cost (e.g. how many tasks have this cost). This is used in the next section to classify how heterogeneous each application is and explain the behavior of the heterogeneous schedulers that take into account the execution time.

### 5.6.2 REAL ENVIRONMENT EVALUATION

This subsection includes all the experiments and results obtained from the real asymmetric system ODROID-XU3. It includes one section that explores the different CATS configurations and another section that compares the best CATS configuration among the rest of the schedulers.



Figure 5.7 Schedulers comparison on ODROID-XU3

### Evaluation of CATS Configurations

Figure 5.7b shows the improvement of the CATS configurations over BF, and the speedup obtained with CATS and BF, for Cholesky on an  $8 \times 8$  blocked matrix. CATS consistently achieves better performance than BF and the improvement over BF increases as the number of cores is increased. Specifically, the improvement is observed to be up to 30% for systems with seven and eight cores.

Figure 5.7c shows the performance on a  $16 \times 16$  block input matrix, where the improvement of CATS is smaller and ranges from 2 to 9% and all schedulers perform fairly well. In this case the opportunities for enhancement are limited, since, according to Figure 5.8, BF

performance approaches the ideal speedup. The lower improvement in this case comes from the fact that the application is less sensitive to the critical path. The task graph is wider (as shown in Figure 5.5) and, accordingly, the percentage of critical tasks is lower. Specifically the percentage of critical tasks for this input is limited to 5.5% compared to the percentage observed with Cholesky  $8 \times 8$  that is 17.5%. However, CATS still outperforms BF by 7% when using the eight cores in the system.

Figure 5.7d shows the improvement of CATS over BF and their speedup for QR factorization. QR consists of double precision operations which cause the big cores to be  $4.26 \times$  faster than the little cores. Thus, the ideal speedup of the system for QR is  $20.1 \times$ . CATS achieves a  $15 \times$  speedup by shortening the execution of the dynamic longest path in the TDG.

Figure 5.7e shows the improvement of different CATS configurations over BF, and the speedup obtained with CATS and BF, for heat diffusion. CATS consistently improves the scheduling of the asymmetric system from 15% to 22%, since the main criterion of scheduling is the TDG structure. CATS achieves a  $13 \times$  speedup when using all eight cores, thus getting very close to the ideal  $15.3 \times$  shown in Figure 5.8.

Figure 5.7f shows the improvement of CATS over BF, and the speedup obtained with CATS and BF, for integral histogram. The impact of CATS is again positive for all configurations, since the improvement is at least 5% with the peak being 15% for 8 cores. Experiments on larger systems show that the specific application's scalability saturates beyond 16 cores. This is because the intensive dependencies between tasks reduce the available parallelism that is mainly present among the diagonal-placed blocks due to the cross-weave scanning process. Moreover, the performance ratio of this single-precision application is  $1.7 \times$ , so the ideal speedup is limited to  $10.8 \times$ .

An interesting observation is that for single-precision benchmarks the improvement of CATS is proportional to the percentage of critical tasks. Integral histogram achieves greater improvements in comparison to Cholesky  $16 \times 16$  because it schedules immediately more tasks to the fast cores of the system. This is due to the fact that this benchmark has a higher percentage of critical tasks (7.32%) compared to Cholesky  $16 \times 16$  that has 5.5%. On the other hand, QR and heat diffusion show the opposite effect: QR has the largest performance ratio and a larger percentage of critical tasks than heat diffusion. However, heat diffusion shows higher overall improvement over the different configurations. We attribute this to the larger sensitivity of heat diffusion to the critical path, which allows CATS to achieve a large improvement even for a configuration of one big and one little core.

In all cases, the SS FLEX configuration of CATS achieves the best performance, since it produces a decent amount of critical tasks for the big cores, fact that shortens the longest



Figure 5.8 Speedup of CATS, CPATH, HYBRID, dHEFT and BF on 8 cores.

path of the TDG. A smaller amount of critical tasks is produced by the SS STRICT policy, which causes a slight imbalance that is fixed through the work stealing mechanism but with lower effectiveness. The 2DS FLEX configuration, produces the same amount of critical tasks as the SS FLEX, but the bi-directional work stealing allows little cores to steal critical tasks, which lengthens the critical path execution and directly increases overall execution time.

#### *Evaluation of CATS CPATH and HYBRID*

As it is shown in Section 5.6.2, the most efficient CATS configuration is when using the simple work stealing and flexible policy (SS FLEX). The same applies for CPATH and HYBRID schedulers, thus the results presented in this section refer to this configuration of the schedulers. Figure 5.8 shows the speedup of CATS, CPATH, HYBRID, dHEFT and BF when running the applications on all eight cores of the Odroid-XU3. Cholesky and Integral Histogram operate on single-precision data, while QR and Heat Diffusion operate on double-precision. Double-precision applications get larger speedups over one little core because they benefit from a larger performance ratio when running on a big core. In the case of Bodytrack, the out-of-order processing power of the big cores helps on the efficient execution and creates a high performance ratio between big and little cores. For most of the cases, CATS scales better than the rest of the schedulers. The shortening of the critical path by running all critical tasks on big cores effectively reduces total execution time when running on all cores. CPATH scheduler does not achieve as high speedup as the other heterogeneous scheduling approaches but it still outperforms the baseline (BF) approach.

Figure 5.9 shows the average speedup obtained for each scheduler and machine set-up. Overall, the heterogeneous schedulers outperform the platform-unaware BF scheduler. Specifically, CATS and HYBRID achieve a higher speedup by detecting critical tasks. We

observe that their performance is approximately the same and this is due to the fact that HYBRID exploits the same CATS criticality in case the execution time of the task is not yet resolved. CPATH is less effective due to the additional overheads of the top-to-bottom TDG traversal. Since the evaluated dHEFT version is improved from previous studies [31], it shows better performance, although it still does not reach the efficiency of CATS and HYBRID because of its task criticality agnosticism.

Moving in more detail, Figure 5.10 shows the speedup obtained for each application, scheduler and machine set-up. We classify the benchmarks according to their task cost variability to easier explain the results.

Heat diffusion is the kernel with the lowest task variability (e.g. the most homogeneous benchmark) as shown in Figure 5.6d. CATS, HYBRID and dHEFT increase the performance of heat by 10% on 8 cores and obtain similar results for the other numbers of cores by rearranging the tasks according to the type of the resources. Due to its high per-task overheads shown on Table 5.1 and the homogeneity of the benchmark, CPATH scheduler cannot outperform BF scheduler. Moreover, for this benchmark, CPATH detects only 23% of the tasks to be critical while CATS and HYBRID detect approximately 54%, when running on 8 cores. This happens because with CPATH, it is more likely to have zero-priority tasks during the task submission step, due to the post-exit task priority assignment that the algorithm introduces. These tasks are considered non-critical, which limits the utilization of the big cores with CPATH.

Cholesky  $16 \times 16$  has also low task cost variability. The improvements of CATS, dHEFT and HYBRID over BF are limited to around 7% when running on 8 cores. These schedulers perform almost the same for the rest numbers of cores and CPATH performs almost the same as BF. The increased overheads of CPATH do not pay off with better schedules since, for the same reason as in the case of Heat diffusion, only 10% of the tasks are marked as critical on 8 cores (while 21% CATS and 16% HYBRID).

Bodytrack shows low task cost variability, since 99% of its tasks have similar execution times. In this case, contrarily to the previous benchmarks CPATH manages to achieve similar speedups to CATS and HYBRID and outperform BF by up to 15%. This is due to the very high number of tasks of bodytrack; CPATH overcomes its overheads by using the detected task execution times for a higher number of tasks. In other words, the learning phase of CPATH becomes a smaller proportion of the total execution of the benchmark. Since bodytrack has so many tasks, the per-task overhead of CPATH is around 120us while for CATS it is 93us. On the other hand, dHEFT shows poor performance because of the overheads of analyzing a TDG with a high number of tasks to compute the earliest finish time schedule.



Figure 5.9 Average speedups obtained for each scheduler

Integral histogram is characterized by medium task cost variability and high amount of tasks. This benchmark is dependency intensive with limited parallelism, which makes scheduling decisions very important. CATS and HYBRID schedulers achieve the best results since they focus more on the TDG structure and dependencies, improving BF by 30% and 27% respectively. CPATH and dHEFT are slightly less efficient and improve BF by 19 and 21% respectively.

For Cholesky  $8 \times 8$ , the heterogeneous schedulers CATS, HYBRID and dHEFT constantly improve the performance of BF and reach up to 45% improvement on 8 cores. It is observed here that dHEFT indeed performs better when the number of tasks is limited as this workload has 120 tasks in total. The additional overheads of CPATH do not compensate with increased performance in this case because there are not enough tasks to apply the better scheduling.

QR factorization is the highest task cost variability benchmark as shown in Figure 5.6c. This is the reason why HYBRID gradually outperforms CATS as we increase the number of cores. With a small additional overhead, as Table 5.1 shows, HYBRID manages to detect critical tasks that reside on the critical path and boost their execution reaching 17% improvement over the baseline. For this benchmark, CPATH also reaches a 13% improvement over BF since task cost matters in this case. However, CPATH speedup is still limited compared to HYBRID because of the higher scheduling overheads which in this case is  $1.8 \times$  higher than CATS overheads. dHEFT also improves BF by finding the earliest executor of each task, but the improvement is limited to 11% which is lower than the other approaches.

This section showed a straight comparison between different heterogeneous schedulers. It is important to note that schedulers like CPATH and HYBRID, that detect the time-based critical path, are the best choices when the application has a large amount of tasks. This



Figure 5.10 Speedups obtained for each scheduler and each application

is because the additional overheads of these schedulers for critical path computation take place only when there are new tasks on the TDG or when there is a task exit of an untracked tt-is. When the TDG has been completely created, and as soon as the cost of every tt-is of the application has been tracked, the schedules of these approaches are purely beneficial. On the other hand, schedulers like dHEFT perform the same steps for every single task that becomes ready, affecting the entire execution since the exit of a task triggers the execution of its successors that become ready. Thus, as the number of tasks is increased, the additional scheduling overheads are increased when using dHEFT-like approaches. CATS scheduler is an efficient scheduling solution for any number of tasks and task cost distributions. The additional CATS overheads take place only during task creation and are smaller than CPATH overheads with the drawback of not considering the task execution time. If we have to choose the best and most generic heterogeneous scheduling approach among the presented schedulers the HYBRID scheduler is the best choice, since it computes an accurate critical path only when it comes at a low cost.

### 5.6.3 SIMULATIONS

To estimate the impact of the heterogeneity-aware schedulers on larger systems, we run three benchmarks using the TaskSim simulator [83]. The results contain a fixed scheduling overhead for all configurations, regardless of the dynamic overheads during execution (e.g., work stealing). We simulate Cholesky, QR and Heat diffusion. These applications feature



Figure 5.11 Improvement of heterogeneous schedulers over BF for simulated 16 and 32 core heterogeneous systems for Cholesky  $32 \times 32$ .

different levels of task cost variability and have a proper amount of tasks so that the error introduced by the static overhead assumption remains negligible (e.g., bodytrack that creates 408 525 tasks should not be compared to a 5 000 task benchmark and static overhead). For Cholesky, we use an input matrix of  $16384 \times 16384$  floats creating  $512 \times 512$  blocks, which results in a  $32 \times 32$  blocked matrix. This is because the other Cholesky configurations do not scale to 32 cores due to the limited task number. However, the task cost variability is similar to the  $16 \times 16$  input since the task size is not modified. Integral Histogram is excluded from the simulated evaluation because it does not scale beyond 16 cores.

Figures 5.11, 5.12 and 5.13 show the improvement of dHEFT, CATS, HYBRID and CPATH over BF in systems with 16 and 32 cores for Cholesky, QR and heat respectively. In these experiments, the performance ratio between fast and slow cores is set to 4.5, which is the average performance ratio among the benchmarks. The heterogeneous schedulers utilize fast cores more effectively than BF, which results in larger improvements with higher number of fast cores.

Figure 5.11 shows the improvement of the schedulers over the baseline for Cholesky. The improvement for 16 cores is comparatively small. This is due to the increased problem size used in this experiment. This benchmark creates a small amount of critical tasks in the  $32 \times 32$  input, which makes the workload less sensitive to critical tasks and limits the improvement of CATS and HYBRID to a maximum of 17%, while CPATH and dHEFT outperform BF by up to 10%.

Figure 5.12 shows that the best option for QR, the application with the highest task cost



Figure 5.12 Improvement of heterogeneous schedulers over BF for simulated 16 and 32 core heterogeneous systems for QR.

variability, on systems with 16 or 32 cores is the HYBRID scheduler, as was also shown in the real platform evaluation, bringing improvements of 30 and 56%. CATS also performs well but CPATH falls short in detecting an appropriate amount of critical tasks which makes the little cores overloaded and the big cores waste their resources in work stealing.

For heat diffusion, Figure 5.13 shows that CATS achieves the best results outperforming BF by a factor of  $2\times$ . Moreover, HYBRID achieves similar results as it performs similar schedules as CATS. However, CPATH fails to achieve optimal results because it overloads the big cores during the learning phase while the little cores remain under-utilized.

## 5.7 CONCLUSIONS

We introduced the first critical-path-aware dynamic scheduler for heterogeneous systems as well as the first hybrid criticality-aware scheduler. Like CATS and contrary to previous works on criticality-aware scheduling that use synthetic TDGs and require prior knowledge of profiling information, our proposals work on real platforms with real applications and do not require off-line profiling.

We implemented and evaluated our scheduling proposals in the runtime system of the OmpSs programming model. We showed that even if the accuracy of CPATH is higher in terms of task criticality identification, it does not always increase performance. Factors like the number of tasks and task cost variability play an important role on choosing the most appropriate scheduling policy and improve the performance of task-based applications. The



Figure 5.13 Improvement of heterogeneous schedulers over BF for simulated 16 and 32 core heterogeneous systems for heat diffusion.

implementations shown in this paper will be included in the next stable release of the OmpSs programming model. Furthermore, the described policies are expected to be applicable to other task-based programming models with support for task dependencies.

In conclusion, this chapter showed the potential of different heterogeneous schedulers to speed up dependency-intensive applications and take advantage of the asymmetric compute resources.



# CHAPTER 6

## RUNTIME OVERHEADS MIGRATION

---

As chip multi-processors (CMPs) are becoming more and more complex, software solutions such as parallel programming models are attracting a lot of attention. Task-based parallel programming models offer an appealing approach to utilize complex CMPs. However, the increasing number of cores on modern CMPs is pushing research towards the use of fine grained parallelism. Task-based programming models need to be able to handle such workloads and offer performance and scalability. Using specialized hardware for boosting performance of task-based programming models is a common practice in the research community.

This chapter makes the observation that task creation becomes a bottleneck when we execute fine grained parallel applications with task-based programming models. As the number of cores increases the time spent generating the tasks of the application is becoming more critical to the entire execution. To overcome this issue, we propose TaskGenX. TaskGenX offers a solution for minimizing task creation overheads and relies both on the runtime system and a dedicated hardware. On the runtime system side, TaskGenX decouples the task creation from the other runtime activities. It then transfers this part of the runtime to a specialized hardware. We draw the requirements for this hardware in order to boost execution of highly parallel applications. From our evaluation using 11 parallel workloads on both symmetric and asymmetric systems, we obtain performance improvements up to  $15\times$ , averaging to  $3.1\times$  over the baseline.

## 6.1 INTRODUCTION

Since the end of Dennard scaling [37] and the subsequent stagnation of CPU clock frequencies, computer architects and programmers rely on multi-core designs to achieve the desired performance levels. While multi-core architectures constitute a solution to the CPU clock stagnation problem, they bring important challenges both from the hardware and software perspectives. On the hardware side, multi-core architectures require sophisticated mechanisms in terms of coherence protocols, consistency models or deep memory hierarchies. Such requirements complicate the hardware design process. On the software side, multi-core designs significantly complicate the programming burden compared to their single-core predecessors. The different CPUs are exposed to the programmer, who has to make sure to use all of them efficiently, as well as using the memory hierarchy properly by exploiting both temporal and spatial locality. This increasing programming complexity, also known as the Programmability Wall [25], has motivated the advent of sophisticated programming paradigms and runtime system software to support them.

Task-based parallelism [9, 15, 17, 81] has been proposed as a solution to the Programmability Wall and, indeed, the most relevant shared memory programming standards, like OpenMP, support tasking constructs [74]. The task based model requires the programmer to split the code into several sequential pieces, called tasks, as well as explicitly specifying their input and output dependencies. The task-based execution model (or runtime system) consists of a master thread and several worker threads. The master thread goes over the code of the application and creates tasks once it encounters source code annotations identifying them. The runtime system manages the pool of all created tasks and schedules them across the threads once their input dependencies are satisfied. To carry out the task management process, the parallel runtime system creates and maintains a Task Dependency Graph (TDG). In this graph nodes represent tasks and edges are dependencies between them. Once a new task is created, a new node is added to the TDG. The connectivity of this new node is defined by the data dependencies of the task it represents, which are explicitly specified in the application’s source code. When the execution of a task finalizes, its corresponding node is removed from the TDG, as well as its data dependencies.

This task-based runtime system constitutes of a software layer that enables parallel programmers to decouple the parallel code from the underlying parallel architecture where it is supposed to run on. As long as the application can be decomposed into tasks, the task-based execution model is able to properly manage it across homogeneous many-core architectures or heterogeneous designs with different core types. A common practice in the high performance domain is to map a single thread per core, which enables the tasks running on that

thread to fully use the core capacity. Finally, another important asset of task-based parallelism is the possibility of automatically managing executions on accelerators with different address spaces. Since the input and output dependencies of tasks are specified, the runtime system can automatically offload a task and its dependencies to an accelerator device (e.g., GPU) without the need for specific programmer intervention [20]. Additional optimizations in terms of software pre-fetching [77] or more efficient coherence protocols [67] can also be enabled by the task-based paradigm.

Despite their advantages, task-based programming models also induce computational costs. For example, the process of task creation requires the traversal of several indexed tables to update the status of the parallel run by adding the new dependencies the recently created tasks bring, which produces a certain overhead. Such overhead constitutes a significant burden, especially on architectures with several 10's or 100's of cores where tasks need to be created at a very fast rate to feed all of them. This paper proposes the Task Generation Express (TaskGenX) approach. Our proposal suggests that the software and hardware are designed to eliminate the most important bottlenecks of task-based parallelism without hurting their multiple advantages. This paper focuses on the software part of this proposal and draws the requirements of the hardware design to achieve significant results. In particular, this paper makes the following contributions beyond the state-of-the-art:

- A new parallel task-based runtime system that decouples the most costly routines from the other runtime activities and thus enables them to be off-loaded to specific-purpose helper cores.
- A detailed study of the requirements of a specific-purpose helper core able to accelerate the most time consuming runtime system activities.
- A complete evaluation via trace-driven simulation considering 11 parallel OpenMP codes and 25 different system configurations, including homogeneous and heterogeneous systems. Our evaluation demonstrates how TaskGenX achieves average speedups of  $3.1\times$  when compared against currently use state-of-the-art approaches.

The rest of this chapter is organized as follows: Section 6.2 describes the task-based execution model and its main bottlenecks. Section 6.3 describes the new task-based runtime system this paper proposes as well as the specialized hardware that accelerates the most time-consuming runtime routines. Section 6.4 describes the evaluation of TaskGenX via trace-driven simulation and Section 6.5 concludes this work.

```

1   ...
2   //task_clause
3   memalloc(&task , args , size );
4   createTask(deps , task , parent , taskData );
5   ...

```

Listing 6.1 Compiler generated pseudo-code equivalence for task annotation.

```

1 void createTask(DepList dList , Task t ,
2                 Task parent , Data args ) {
3     initAndSetupTask(task1 , parent , args );
4     insertToTDG( dList , task1 );
5 }

```

Listing 6.2 Pseudo-code for task creation.

## 6.2 BACKGROUND AND MOTIVATION

### 6.2.1 TASK-BASED PROGRAMMING MODELS

Task-based parallel programming models [9, 15, 17, 81], are widely used to facilitate the programming of parallel codes for multi-core systems. These programming models offer annotations that the programmer can add to the application’s sequential code. One type of these annotations is the task annotations with dependency tracking which OpenMP [18] supports since its 4.0 release [74]. By adding these annotations, the programmer decomposes the application into *tasks* and specifies the input and output data dependencies between them. A compiler is responsible to translate the annotations into code by adding calls to the programming model’s runtime system. The runtime system consists of software threads and is responsible for the efficient execution of the tasks with respect to the data dependencies as well as the availability of resources.

When the compiler encounters a task annotation in the code, it transforms it to the pseudo-code shown in Listing 6.1. Memalloc is performing the memory allocation for the task and its arguments. Next is a runtime call, which is the `createTask`, responsible for the linking of the task with the runtime system. At this point a task is considered *created* and below are the three possible states of a task inside the runtime system:

- *Created*: A task is initialized with the appropriate data and function pointers and it is inserted in the Task Dependency Graph (TDG). The insertion of a task in the TDG implies that the data dependencies of the tasks have been identified and the appropriate data structures have been created and initialized.

---

```

1 void insertToTDG(DepList dList, Task t) {
2     if( dList.isEmpty() ) {
3         readyQ->push(t);
4         return;
5     }
6     Dependency entry;
7     for( d in dList ) {
8         entry = depMap[d.address()];
9         if( entry==NULL) depMap.add(entry, t);
10        if(d.accessType() == "write")
11            entry.addLastWriter(t);
12        if(d.accessType() == "read") {
13            entry.addReader(t);
14            entry.lastWriter()->addSuccessor(t);
15        }
16    }
17 }
```

---

Listing 6.3 Pseudo-code for TDG insertion

- *Ready*: When all the data dependencies of a created task have been satisfied, the task is ready and it is inserted in the *ready queue* where it waits for execution.
- *Finished*: When a task has finished execution and has not been deleted yet.

The runtime system creates and manages the software threads for the execution of the tasks. Typically one software thread is being bound to each core. One of the threads is the *master thread*, and the rest are the *worker threads*. The master thread starts executing the code of Listing 6.1 sequentially. The allocation of the task takes place first. What follows is the task creation, that includes the analysis of the dependencies of the created task and the connection to the rest of the existing dependencies. Then, if there are no task dependencies, which means that the task is *ready*, the task is also inserted in the ready queue and waits for execution. Listing 6.2 shows the pseudo-code for the task creation step within the runtime. The `createTask` function is first initializing the task by copying the corresponding data to the allocated memory as well as connecting the task to its parent task (`initAndSetupTask`). After this step, the task is ready to be inserted in the TDG. The TDG is a distributed and dynamic graph structure that the runtime uses to keep the information about the current tasks of the application. The insertion of a task in the TDG is done by the `insertToTDG` function. This function takes as arguments a list with all the memory addresses that are to be written or read by the task (`dList`), and the task itself. Listing 6.3 shows the pseudo-

---

```

1 void task_finish(Task *t) {
2     depMap.removeReaderWriter(t);
3     if(t->successors.empty()) delete t;
4     else {
5         for(succ in t->successors) {
6             succ.decreasePredecessors();
7             if(succ.numPredecessors == 0)
8                 readyQ->push(succ);
9         }
10    }
11 }
```

---

Listing 6.4 Pseudo-code for task\_finish runtime activity.

code for the TDG insertion. If for a task the dList is empty (line 2), this means that there are no memory addresses that need to be tracked during the execution; thus, the task is marked as *ready* by pushing it to the *ready queue* (line 3). Each entry of dList contains the actual memory address as well as the access type (read, write or read-write). The runtime keeps a distributed unified dependency tracking structure, the depMap where it stores all the tracked memory addresses together with their writer and reader tasks. For each item in the dList the runtime checks if there is an existing representation inside the depMap (line 8). If the memory address of an entry of the dList is not represented in the depMap, it is being added as shown in line 9. If the address of a dList item exists in the depMap, this means that a prior task has already referred to this memory location, exhibiting a data dependency. According to the access type of d, the readers and the writers of the specific address are updated in the depMap (lines 10-15).

To reduce the lookup into the depMap calls, every time the contents of a memory address are modified, the tasks keep track of their *successors* as well as the number of *predecessors*. The *successors* of a task are all the tasks with inputs depending on the output of the current task. The *predecessors* of a task are the tasks whose output is used as input for the current task. When a read access is identified, the task that is being created is added to the list of successors of the last writer task, as shown on line 20 of Listing 6.2.

As tasks are executed, the dependencies between them and their successors are satisfied. So the successor tasks that are waiting for input, eventually become *ready* and are inserted to the ready queue. When a task goes to the *finished* state, the runtime has to perform some actions in order to prepare the successor tasks for execution. These actions are described in Listing 6.4. The runtime first updates the depMap to remove the possible references of the task as reader or writer (line 2). Then, if the task does not have any successors, it can safely



Figure 6.1 Master thread activity for Cholesky as we increase the number of cores.

be deleted (line 3). If the task has successors, the runtime traverses the successor list and for each successor task it decreases its predecessor counter (lines 5-6). If for a successor task its predecessor counter reaches zero, then this task becomes *ready* and it is inserted in the *ready queue* (lines 7-8).

The runtime activity takes place at the task state changes. One state change corresponds to the task creation, so a task from being just allocated it becomes *created*. At this point the runtime prepares all the appropriate task and dependency tracking data structures as well as inserts the task into the TDG. The second change occurs when a task from being *created* it becomes *ready*; this implies that the input dependencies of this task are satisfied so the runtime schedules and inserts the task into the ready queue. The third change occurs when a running task finishes execution. In this case, following our task states, the task from being *ready* it becomes *finished*; this is followed by the runtime updating the dependency tracking data structures and scheduling possible successor tasks that become ready. For the rest of the paper we will refer to the first state change runtime activity as the task creation overheads (*Create*). For the runtime activity that takes place for the following two state changes (and includes scheduling and dependence analysis) we will use the term runtime overheads (*Runtime*).

### 6.2.2 MOTIVATION

Figure 6.1 shows the runtime activity of the master thread during the execution of the Cholesky<sup>1</sup> benchmark on 8, 16, 32, 64 and 128 cores<sup>2</sup>. The execution time represented

<sup>1</sup>Details about the benchmarks used are in Section 3.4

<sup>2</sup>The experimental set-up is explained in Section 3.1

here is the wall clock time during the parallel region of the benchmark. Each one of the series represents a different runtime overhead from the ones described above. The percentage of time spent on task creation is increasing as we increase the number of cores. This is because the creation overhead is invariant of core count: the more we reduce the application’s execution time by adding resources the more important this step becomes in terms of execution time. In contrast, the task execution time percentage is decreased as we increase the number of cores because the computational activity is being shared among more resources. One way to reduce the task creation overhead is by introducing nested parallelism. In this programming technique, every worker thread is able to generate tasks thus the task creation is spread among cores and its overhead is reduced. However, not all applications can be implemented with this parallelization technique and there are very few applications using this scheme. *Runtime* decreases as we increase the number of cores because this activity is also shared among the resources. This is because this part of the runtime takes place once the tasks finish execution and new tasks are being scheduled. So the more the resources, the less the runtime activity per thread, therefore less activity for the master thread.

Our motivation for this work is the bottleneck introduced by task creation as shown in Figure 6.1. Our runtime proposal decouples this piece of the runtime and accelerates it on a specialized hardware resulting in higher performance.

### 6.3 TASK GENERATION EXPRESS (TASKGENX)

In this paper we propose a semi-centralized runtime system that dynamically separates the most computationally intensive parts of the runtime system and accelerates them on specialized hardware. To develop the TaskGenX we use the OpenMP programming model [18], [74]. The base of our implementation is the Nanos++ runtime system responsible for the parallel execution and it is used in this paper as a replacement of the entire OpenMP’s default runtime.

Nanos++ [13] is a distributed runtime system that uses dynamic scheduling. As most task-based programming models, Nanos++ consists of the master and the worker threads. The master thread is launching the parallel region and creates the tasks that have been defined by the programmer<sup>3</sup>. The scheduler of Nanos++ consists of a *ready queue* (*TaskQ*) that is shared for reading and writing among threads and is used to keep the tasks that are ready for execution. All threads have access to the *TaskQ* and once they become available they try to pop a task from the *TaskQ*. When a thread finishes a task, it performs all the

---

<sup>3</sup>Nanos++ also supports nested parallelism so any of the worker threads can potentially create tasks. However the majority of the existing parallel applications are not implemented using nested parallelism.

essential steps described in Section 6.2.1 to keep the data dependency structures consistent. Moreover, it pushes the tasks that become ready to the *TaskQ*.

### 6.3.1 IMPLEMENTATION

TaskGenX relieves the master and worker threads from the intensive work of task creation by offloading it on the specialized hardware. Our runtime, apart from the master and the worker threads, introduces the Special Runtime Thread (SRT). When the runtime system starts, it creates the SRT and binds it to the task creation accelerator, keeping its thread identifier in order to manage the usage of it. During runtime, the master and worker threads look for ready tasks in the task ready queue and execute them along with the runtime. Instead of querying the ready queue for tasks, the SRT looks for runtime activity requests in the Runtime Requests Queue (*RRQ*) and if there are requests, it executes them.

Figure 6.2 shows the communication infrastructure between threads within TaskGenX. Our system maintains two queues; the Ready Task Queue (*TaskQ*) and the Runtime Requests Queue (*RRQ*). The *TaskQ* is used to keep the tasks that are ready for execution. The *RRQ* is used to keep the pending runtime activity requests. The master and the worker threads can push and pop tasks to and from the *TaskQ* and they can also add runtime activity to the *RRQ*. The special runtime thread (SRT) pops runtime requests from the *RRQ* and executes them on the accelerator.

When the master thread encounters a task clause in the application’s code, after allocating the memory needed, it calls the `createTask` as shown in Listing 6.2 and described in Section 6.2.1. TaskGenX decouples the execution of `createTask` from the master thread. To do so, TaskGenX implements a wrapper function that is invoked instead of `createTask`. In this function, the runtime system checks if the SRT is enabled; if not then the default behaviour takes place, that is, to perform the creation of the task. If the SRT is enabled, a *Create* request is generated and inserted in the *RRQ*. The *Create* runtime request includes the appropriate info to execute the code described in Listing 6.2. That is, the dependence analysis data, the address of the allocated task, its parent and its arguments.

While the master and worker threads are executing tasks, the SRT is looking for *Create* requests in the *RRQ* to execute. Listing 6.5 shows the code that the SRT is executing until the end of the parallel execution. The special runtime thread continuously checks whether there are requests in the *RRQ* (line 3). If there is a pending creation request, the SRT calls the `executeRequest` (line 4), which extracts the appropriate task creation data from the creation request and performs the task creation by calling the `createTask` described in Listing 6.2. When the parallel region is over, the runtime system informs the SRT to stop execution. This is when the SRT exits and the execution finishes (line 5).

---

```

1 void SRTloop() {
2     while( true ) {
3         while(RRQ is not empty)
4             executeRequest( RRQ.pop() );
5         if( runtime.SRTstop() ) break;
6     }
7     return;
8 }
```

---

Listing 6.5 Pseudo-code for the SRT loop.



Figure 6.2 Communication mechanism between master/workers and SRT threads.

### 6.3.2 HARDWARE REQUIREMENTS

As described in the previous section, TaskGenX assumes the existence of specialized hardware that accelerates the task creation step. The goal of this paper is not to propose a detailed micro-architecture of the specialized hardware; instead we sketch the high-level hardware requirements for the TaskGenX set-up, in the hope to be an insightful and useful influence for hardware designers. The SRT is bound to the task creation accelerator and executes the requests in the RRQ. Previous studies have proposed custom accelerators for the runtime activity [34, 42, 53, 61, 86, 88]. These proposals significantly accelerate (up to three orders of magnitude) different bottlenecks of the runtime system<sup>4</sup>. These special purpose designs can only execute runtime system activity.

As an alternative, in our envisioned architecture we propose to have a general purpose core that has been optimized to run the runtime system activity more efficiently. The runtime optimized (RTopt) core can be combined with both homogeneous or heterogeneous systems and accelerate the runtime activity. Figure 6.3 shows the envisioned architecture when RTopt is combined with an asymmetric heterogeneous system. This architecture has three core types that consist of simple in-order cores, fast out-of-order cores and an RTopt core for the SRT. RTopt can optimize its architecture, having a different cache hierarchy,

---

<sup>4</sup>Chapter 2 further describes these proposals.



Figure 6.3 SoC architecture including three types of cores: out of order, in-order and RTopt.

pipeline configuration and specialized hardware structures to hold and process the SRT. As a result, the RTopt executes the SRT much faster than the other cores. The RTopt can also execute tasks, but will achieve limited performance compared to the other cores as its hardware structures have been optimized for a specific software (the SRT). To evaluate our approach we study the requirements of the RTopt in order to provide enough performance for TaskGenX. Based on the analysis by Etsion et al. [42], there is a certain *task decode rate* that leads to optimal utilization of the multi-core system. This rule can be applied in the case of TaskGenX for the *task creation rate*, i.e., the frequency of task generation of the runtime system. If the *task creation rate* is higher than the *task execution rate*, then for a highly parallel application the resources will always have tasks to execute and they will not remain idle. To achieve a high *task creation rate*, we can accelerate the task creation cost. Equation 6.1 shows the maximum optimal task creation cost,  $C_{opt}(x)$  in order to keep  $x$  cores busy, without starving due to task creation.

$$C_{opt}(x) = \text{avg. task duration}/x \quad (6.1)$$

If  $C_{gp}$  is the cost of task creation when it is performed on a general purpose core, then the RTopt has to achieve a speedup of  $r = C_{gp}/C_{opt}(x)$  to achieve full utilization of the system. Section 6.4.1 performs an analysis based on these requirements for the evaluated applications. As we will see in Section 6.4.1, a modest and implementable value of  $r = 16\times$  is enough to significantly accelerate execution on a 512-core system.

Finally, if TaskGenX executes on a regular processor without the RTopt core, the SRT is bound to a regular core without any further modification. In this scenario, applications will not significantly benefit from having a separate SRT.

Table 6.1 Evaluated benchmarks and relevant characteristics

| Application            | Problem size      | #Tasks    | Avg task CPU cycles (thousands) | Per task overheads (CPU cycles) |       |              | Measured perf. ratio | $r$    |
|------------------------|-------------------|-----------|---------------------------------|---------------------------------|-------|--------------|----------------------|--------|
|                        |                   |           |                                 | Create                          | All   | Deps + Sched |                      |        |
| Cholesky factorization | 32K 256           | 357 762   | 753                             | 15221                           | 73286 | 58065        | 3.5                  | 10.34  |
|                        | 32K 128           | 2829058   | 110                             | 17992                           | 58820 | 40828        |                      | 83.74  |
| QR factorization       | 16K 512           | 11 442    | 518 570                         | 17595                           | 63008 | 45413        | 6.8                  | 0.01   |
|                        | 16K 128           | 707 265   | 3 558                           | 21642                           | 60777 | 39135        |                      | 3.11   |
| Blackscholes           | native            | 488 202   | 348                             | 29141                           | 85438 | 56297        | 2.3                  | 42.87  |
| Bodytrack              | native            | 329 123   | 383                             | 9 505                           | 18979 | 9474         | 4.2                  | 12.70  |
| Canneal                | native            | 3 072 002 | 67                              | 25781                           | 50094 | 24313        | 2.0                  | 197.01 |
| Dedup                  | native            | 20 248    | 1 532                           | 1294                            | 9647  | 8353         | 2.7                  | 0.43   |
| Ferret                 | native $\times 2$ | 84 002    | 29 088                          | 38913                           | 98457 | 59544        | 3.6                  | 0.68   |
| Fluidanimate           | native            | 128 502   | 16 734                          | 30210                           | 94079 | 64079        | 3.3                  | 0.91   |
| Streamcluster          | native            | 3 184 654 | 161                             | 6892                            | 13693 | 6801         | 3.5                  | 21.91  |

## 6.4 EVALUATION

### 6.4.1 METHODOLOGY

#### *Applications*

Table 6.1 shows the evaluated applications, the input sizes used, and their characteristics. All applications are implemented using the OpenMP programming model [74]. We obtain Cholesky and QR from the BAR repository [12] and we use the implementations of the rest of the benchmarks from the PARSECSs suite [26]. More information about these applications can be found in [26] and [31]. As the number of cores in SoCs is increasing, so does the need of available task parallelism [85]. We choose the input sizes of the applications so that they create enough fine-grained tasks to feed up to 512 cores. The number of tasks per application and input as well as the average per-task CPU cycles can be found on Table 6.1.

#### *Simulation*

To evaluate TaskGenX we make use of the trace-driven TaskSim simulator [46, 83] which is described in section 3.2.

We evaluate the effectiveness of TaskGenX on both symmetric and asymmetric platforms with the number of cores varying from 8 to 512. In the case of asymmetric systems, we simulate the behavior of an ARM big.LITTLE architecture [52]. To set the correct performance ratio between big and little cores, we measure the sequential execution time of each application on a real ARM big.LITTLE platform when running on a little and on a big core. We use the Hardkernel Odroid XU3 board that includes a Samsung Exynos 5422 chip with ARM big.LITTLE. The big cores run at 1.6GHz and the little cores at 800MHz. Table 6.1 shows the measured performance ratio for each case. The average performance

ratio among our 11 workloads is 3.8. Thus in the specification of the asymmetric systems we use as performance ratio the value 4.

For the needs of this hardware-software co-design, we modify TaskSim so that it features one extra hardware accelerator (per multi-core) responsible for the fast task creation (the RTopt). Apart from the task duration time, our modified simulator tracks the duration of the runtime overheads. These overheads include: (a) task creation, (b) dependencies resolution, and (c) scheduling. The RTopt core is optimized to execute task creation faster than the general purpose cores; to determine how much faster a task creation job is executed we use the analysis performed in Section 6.3.2.

Using Equation 6.1, we compute the  $C_{opt}(x)$  for each application according to their average task CPU cycles from Table 6.1 for  $x = 512$  cores.  $C_{gp}$  is the cost of task creation when it is performed on a general purpose core, namely the *Create* column shown on Table 6.1. To have optimal results for each application on systems up to 512 cores,  $C_{gp}$  needs to be reduced to  $C_{opt}(512)$ . Thus the specialized hardware accelerator needs to perform task creation with a ratio  $r = C_{gp}/C_{opt}(512) \times$  faster than a general purpose core.

We compute  $r$  for each application shown on Table 6.1. We observe that for the applications with a large number of per-task CPU cycles and relatively small *Create* cycles (QR512, Dedup, Ferret, Fluidanimate),  $r$  is very close to zero, meaning that the task creation cost ( $C_{gp}$ ) is already small enough for optimal task creation without the need of a faster hardware accelerator. For the rest of the applications, more powerful hardware is needed. For these applications  $r$  ranges from  $3\times$  to  $197\times$ . Comparing  $r$  to the measured performance ratio of each application we can see that in most cases accelerating the task creation on a big core would not be sufficient for achieving higher task creation rate. In our experimental evaluation we accelerate task creation in the RTopt and we use the ratio of  $16\times$  which is a relatively small value within this range that we consider realistic to implement in hardware. The results obtained show the average results among three different traces for each application-input.

#### 6.4.2 HOMOGENEOUS MULTICORE SYSTEMS

Figures 6.4a and 6.4b show the speedup over one core of three different scenarios:

- *Baseline*: the Nanos++ runtime system, which is the default runtime without using any external hardware support
- *Baseline+RTopt*: the Nanos++ runtime system that uses the external hardware as if it is a general purpose core



Figure 6.4 Speedup of TaskGenX compared to the speedup of Baseline and Baseline+RTopt for each application for systems with 8 up to 512 cores. The average results of (a) show the average among all workloads shown on (a) and (b)

- *TaskGenX*: our proposed runtime system that takes advantage of the optimized hardware

We evaluate these approaches with the TaskSim simulator for systems of 8 up to 512 cores. In the case of Baseline+RTopt the specialized hardware acts as a slow general purpose core that is additional to the number of cores shown on the x axis. If this core executes a task creation job, it executes it  $16\times$  faster, but as it is specialized for this, we assume that when a task is executed on this core it is executed  $4\times$  slower than in a general purpose core. The runtime system in this case does not include our modifications that automatically decouple the task creation step for each task. The comparison against the Baseline+RTopt is used only to show that the baseline runtime is not capable of effectively utilizing the accelerator. In most of the cases having this additional hardware without the appropriate runtime support results in slowdown as the tasks are being executed slower on the special hardware.

Focusing on the average results first, we can observe that TaskGenX constantly improves the baseline and the improvement is increasing as the number of cores is increased, reaching up to  $3.1\times$  improved performance on 512 cores. This is because as we increase the number of cores, the task creation overhead becomes more critical part of the execution time and affects performance even more. So, this becomes the main bottleneck due to which the performance of many applications saturates. TaskGenX overcomes it by automatically detecting and moving task creation on the specialized hardware.

Looking in more detail, we can see that for all applications the baseline has a saturation point in speedup. For example Cholesky256 saturates on 64 cores, while QR512 on 256 cores. In most cases this saturation in performance comes due to the sequential task creation that is taking place for an important percentage of the execution time (as shown in Figure 6.1). TaskGenX solves this as it efficiently decouples the task creation code and accelerates it leading to higher speedups.

TaskGenX is effective as it either improves performance or it performs as fast as the baseline (there are no slowdowns). The applications that do not benefit (QR512, Ferret, Fluidanimate) are the ones with the highest average per task CPU cycles as shown on Table 6.1. Dedup also does not benefit as the per task creation cycles are very low compared to its average task size. Even if these applications consist of many tasks, the task creation overhead is considered negligible compared to the task cost, so accelerating it does not help much.

This can be verified by the results shown for QR128 workload. In this case, we use the same input size as QR512 (which is 16K) but we modify the block size, which results in more and smaller tasks. This not only increases the speedup of the baseline, but also shows even higher speedup when running with TaskGenX reaching very close to the ideal speedup and improving the baseline by  $2.3\times$ . Modifying the block size for Cholesky, shows the same effect in terms of TaskGenX over baseline improvement. However, for this application, using the bigger block size of 256 is more efficient as a whole. Nevertheless, TaskGenX improves the cases that performance saturates and reaches up to  $8.5\times$  improvement for the 256 block-size, and up to  $16\times$  for the 128 block-size.

Blackscholes and Canneal, are applications with very high task creation overheads compared to the task size as shown on Table 6.1. This makes them very sensitive to performance degradation due to task creation. As a result their performance saturates even with limited core counts of 8 or 16 cores. These are the ideal cases for using TaskGenX as such bottlenecks are eliminated and performance is improved by  $15.9\times$  and  $13.9\times$  respectively. However, for Canneal for which the task creation lasts a bit less than half of the task execution time, accelerating it by 16 times is not enough and soon performance saturates at 64



Figure 6.5 Canneal performance as we modify  $r$ ;  $x$ -axis shows the number of cores.

cores. In this case, a more powerful hardware would improve things even more. Figure 6.5 shows how the performance of Canneal is affected when modifying the task creation performance ratio,  $r$  between the specialized hardware and general purpose. Using hardware that performs task creation close to  $256\times$  faster than the general purpose core leads to higher improvements.

Streamcluster has also relatively high task creation overhead compared to the average task cost so improvements are increased as the number of cores is increasing. TaskGenX reaches up to  $7.6\times$  improvement in this case.

The performance of Bodytrack saturates on 64 cores for the baseline. However, it does not approach the ideal speedup as its pipelined parallelization technique introduces significant task dependencies that limit parallelism. TaskGenX still improves the baseline by up to 37%. This improvement is low compared to other benchmarks, firstly because of the nature of the application and secondly because Bodytrack introduces nested parallelism. With nested parallelism task creation is being spread among cores so it is not becoming a sequential overhead as happens in most of the cases. Thus, in this case task creation is not as critical to achieve better results.

#### 6.4.3 HETEROGENEOUS MULTICORE SYSTEMS

At this stage of the evaluation our system supports two types of general purpose processors, simulating an asymmetric multi-core processor. The asymmetric system is influenced by the ARM big.LITTLE architecture [52] that consists of big and little cores. In our simulations, we consider that the big cores are four times faster than the little cores of the system. This is based on the average measured performance ratio, shown on Table 6.1, among the 11 workloads used in this evaluation.



Figure 6.6 Average speedup among all 11 workloads on heterogeneous simulated systems. The numbers at the bottom of x axis show the total number of cores and the numbers above them show the number of big cores. Results are separated depending on the type of core that executes the master thread: a big or little core.

In this set-up there are two different ways of executing a task-based application. The first way is to start the application's execution on a big core of the system and the second way is to start the execution on a little core of the system. If we use a big core to load the application, then this implies that the master thread of the runtime system (the thread that performs the task creation when running with the baseline) runs on a fast core, thus tasks are created faster than when using a slow core as a starting point. We evaluate both approaches and compare the results of the baseline runtime and TaskGenX.

Figure 6.6 plots the average speedup over one little core obtained among all 11 workloads for the Baseline, Baseline+RTopt and TaskGenX. The chart shows two categories of results on the x axis, separating the cases of the master thread's execution. The numbers at the bottom of x axis show the total number of cores and the numbers above show the number of big cores.

The results show that moving the master thread from a big to a little core degrades performance of the baseline. This is because the task creation becomes even slower so the rest of the cores spend more idle time waiting for the tasks to become ready. TaskGenX improves performance in both cases. Specifically when master runs on big, the average improvement of TaskGenX reaches 86%. When the master thread runs on a little core, TaskGenX improves performance by up to 3.7×. This is mainly due to the slowdown caused by the migration of master thread on a little core. Using TaskGenX on asymmetric systems achieves approximately similar performance regardless of the type of core that the master thread is running. This makes our proposal more portable for asymmetric systems as the programmer does not have to be concerned about the type of core that the master thread



Figure 6.7 Average improvement over baseline; *x*-axis shows the number of cores.

migrates.

#### 6.4.4 COMPARISON TO OTHER APPROACHES

As we saw earlier, TaskGenX improves the baseline scheduler by up to  $6.3\times$  for 512 cores. In this section we compare TaskGenX with other approaches. To do so, we consider the proposals of Carbon [61], Task Superscalar [42], Picos++ [88] and Nexus# [34]. We group these proposals based on the part of the runtime activity they are offloading from the CPU. Carbon and Task Superscalar are runtime-driven meaning that they both accelerate all the runtime and scheduling parts. The task creation, dependence analysis as well as the scheduling, namely the ready queue manipulation, are transferred to the RTopt with these approaches. These overheads are represented on Table 6.1 under ALL. For the evaluation of these approaches one RTopt is used optimized to accelerate all the runtime activities. The second group of related designs that we compare against is the dependencies-driven, which includes approaches like Picos++ and Nexus#. These approaches aim to accelerate only the dependence analysis part of the runtime as well as the scheduling that occurs when a dependency is satisfied. The RTopt in this case is optimized to accelerate these activities. For example, when a task finishes execution, and it has produced input for another task, the dependency tracking mechanism is updating the appropriate counters of the reader task and if the task becomes ready, the task is inserted in the ready queue. The insertion into the ready queue is the scheduling that occurs with the dependence analysis. These overheads are represented on Table 6.1 under *Deps+Sched*.

Figure 6.7 shows the average improvement in performance for each core count over the performance of the baseline scheduler on the same core count. *Runtime* represents the runtime driven approaches and the *Deps* represents the dependencies driven approaches as

described above. X-axis shows the number of general purpose cores; for every core count one additional RTopt core is used.

Accelerating the scheduling with *Runtime*-driven is as efficient as TaskGenX for a limited number of cores, up to 32. This is because they both accelerate task creation which is an important bottleneck. *Deps*-driven approaches on the other hand are not as efficient since in this case the task creation step takes place on the master thread.

Increasing the number of cores, we observe that the improvement of the *Runtime*-driven over the baseline is reduced and stabilized close to  $3.2\times$  while TaskGenX continues to speedup the execution. Transferring all parts of the runtime to RTopt with the *Runtime*-driven approaches, leads to the serialization of the runtime. Therefore, all scheduling operations (such as enqueue, dequeue of tasks, dependence analysis etc) that typically occur in parallel during runtime are executed sequentially on the RTopt. Even if RTopt executes these operations faster than a general purpose core, serializing them potentially creates a bottleneck as we increase the number of cores. TaskGenX does not transfer other runtime activities than the task creation, so it allows scheduling and dependence analysis operations to be performed in a distributed manner.

*Deps* driven approaches go through the same issue of the serialization of the dependency tracking and the scheduling that occurs at the dependence analysis stage. The reason for the limited performance of *Deps* compared to *Runtime* is that *Deps* does not accelerate any part of the task creation. Improvement over the baseline is still significant as performance with *Deps* is improved by up to  $1.5\times$ .

TaskGenX is the most efficient software-hardware co-design approach when it comes to highly parallel applications. On average, it improves the baseline by up to  $3.1\times$  for homogeneous systems and up to  $3.7\times$  for heterogeneous systems. Compared to other state of the art approaches, TaskGenX is more effective on a large number of cores showing higher performance by 54% over *Runtime* driven approaches and by 70% over *Deps* driven approaches.

## 6.5 CONCLUSIONS

This chapter presented TaskGenX, the first software-hardware co-design that decouples task creation and accelerates it on a runtime optimized hardware. In contrast to previous studies, our study makes the observation that task creation is a significant bottleneck in parallel runtimes. Based on this we implemented TaskGenX on top of the OpenMP programming model. On the hardware side, we set the requirements for the RTopt in order to achieve optimal results and proposed an asymmetric architecture that combines it with general purpose

cores.

Based on this analysis we evaluated the performance of 11 real workloads using our approach with TaskSim simulator. Accelerating task creation, TaskGenX achieves up to  $15.8\times$  improvement (Cholesky128) over the baseline for homogeneous systems and up to  $16\times$  (Blackscholes) on asymmetric systems when the application is launched on a little core. Using TaskGenX on asymmetric systems offers a portable solution, as the task creation is not affected by the type of core that the master thread is bound to.

We further showed that for some cases like Canneal where task creation needs to be accelerated as much as  $197\times$  in order to steadily provide enough created tasks for execution. However, even by using a realistic and implementable hardware approach that offers  $16\times$  speedup of task creation, achieves satisfactory results as it improves the baseline up to  $14\times$ .

Comparing TaskGenX against other approaches such as Carbon, Nexus, Picos++ or TaskSuperscalar that manage to transfer different parts of the runtime to the RTopt proves that TaskGenX is the most minimalistic and effective approach. Even if TaskGenX transfers the least possible runtime activity to the RTopt hardware it achieves better results. This implies that TaskGenX requires a less complicated hardware accelerator, as it is specialized for only a small part of the runtime, unlike the other approaches that need specialization for task creation, dependency tracking and scheduling.

We expect that combining TaskGenX with an asymmetry-aware task scheduler will achieve even better results, as asymmetry introduces load imbalance.

# CHAPTER 7

## REAL-TIME SCHEDULING

---

This chapter contains a high-level description of the work performed in Samsung Electronics Research Institute UK (SRUK) where I practiced my internship. This part of work slightly differs of the rest of this thesis as in this chapter we are focused on thread-level scheduling for asymmetric multi-core (AMC) systems. This means that we are no longer using parallel high-performance applications, rather than mobile phone applications that are implemented in parallel and execute on multiple user-level threads. Our work focuses on scheduling efficiently those threads among the cores of the system in real-time. For simplicity, we refer to threads as tasks.

This work presents a showcase for a real-time CPU scheduler with the goal to increase the frames per second (FPS) during the gameplay on mobile devices with the ARM big.LITTLE architecture. We design and implement the scheduler in the Android framework and provide an efficient scheduling policy that takes into account the current temperature and performs task migration. Our solution increases the median FPS of the currently used mechanisms by up to 7.5% and at the same time it maintains temperature stable.

## 7.1 BACKGROUND AND MOTIVATION

It has become common for current mobile devices to use asymmetric mobile processors. The state-of-the-art leading asymmetric processor architecture is the ARM big.LITTLE architecture [47] that combines fast out-of-order big cores with simple in-order little cores in order to achieve high performance in a low power budget. Even though AMC systems are being used in the mobile market since the early 2013 there is still a lot of work to be done in order to efficiently utilize them.

As we saw in Chapter 4 the current thread scheduling techniques take place in the Linux kernel. The state of the art Linux scheduler “Completely Fair Scheduler” (CFS) [75] manages to be fair by allowing tasks to execute for the same amount of time on each processor. CFS maintains load balance among the cores of the system by providing equal portions of processor time to the running tasks. Taking into account the asymmetry of current mobile processors and the different types of running applications we can see that CFS contradicts to the needs of an asymmetric system. This is because a fair distribution of the tasks on an AMC system leads to load imbalance and wrong scheduling decisions. Surprisingly enough, CFS is the main scheduler currently used in the mobile market. Heterogeneous Multi-Processing (HMP) with Global Task Scheduling (GTS) (described in Section 4.2.2) is an enhancement in CFS towards the efficient scheduling of the AMC system.

Even though HMP solutions take into account the system’s asymmetry they fall short on achieving efficient schedules. The main reason that these attempts fail is that in the kernel level there is not enough information for the scheduler in order to take efficient scheduling decisions. Information such as data dependencies, task criticality or whether a task is memory or computational bound is not available at kernel level.

Game performance is one of the most challenging areas for mobile devices. Games are the most demanding mobile applications as they execute very computationally intensive tasks in order to produce the graphics that have to be drawn on the screen. State-of-the-art mobile games are typically multi-threaded using existing frameworks and run on both CPU and GPU of the system. Even though the graphics quality and on-screen representation are mainly taking place in the GPU, there are very important CPU tasks that drive the entire game playing procedure. The CPU tasks are responsible for performing the logic of the game as well as sending the appropriate data (that usually are outputs of the game functions) to the GPU. One example of these tasks is the Render task that is responsible for writing the appropriate data in a buffer which is then to be read by the GPU kernels that draw the frames on the screen of the device. Such tasks have to be executed at a specific rate because if they do not finish on time the frame drawing is delayed and then the frame is missed, resulting

to screen freezing or FPS drop. Our solution performs CPU scheduling to increase FPS in games while keeping temperature stable.

## 7.2 REAL-TIME SCHEDULING SOLUTION

<sup>1</sup>As mentioned above, as well as in Chapter 4, scheduling in the kernel level is not flexible and usually results to bad scheduling decisions. An important drawback of the kernel level scheduling is that information provided at this level is limited. For example there is no way to know what each task is used for and what the current runtime circumstances are. To avoid these obstacles, we move upwards in the software hierarchy and create the appropriate environment to reuse existing Samsung software within the Android framework in order to apply scheduling solutions. The Samsung's Temperature Control (STC) is a system service of the Android framework used to maintain device temperature, that starts execution whenever it detects that a game application is running. We implement the Real-Time Scheduler (RTS) within STC.

Since temperature plays a very important role in the performance of the mobile device, RTS uses this information in order to make scheduling decisions for the tasks of the executing game. The scheduling decisions of RTS depend mostly on temperature; RTS decides the appropriate task migration cluster and it then performs the scheduling. During the game-play, when RTS detects that the temperature increases, it gradually starts migrating tasks of the game to the little-core cluster of the AMC. In the opposite scenario, when temperature is decreasing, RTS manages to migrate tasks that currently run on the little-core cluster, back to the big-core cluster. RTS scheduling mechanisms are able to maintain load balance among the cores of the AMC system resulting to higher FPS.

## 7.3 EVALUATION

### 7.3.1 METHODOLOGY

We evaluate RTS by using it while running commodity mobile games on a state-of-the-art Samsung mobile device featuring a Samsung Exynos octa-core chip-set with Arm big.LITTLE architecture [32, 47]. We run three mobile games: *i*) Lineage2 [72], *ii*) King of Glory [90] and *iii*) Dynasty Warriors [56]. The selected games are battle play games and they all fulfill the criteria that facilitate the evaluation: (*a*): They all have high FPS up to 60 and demanding graphics, (*b*): they have auto-play mode and (*c*): their automatic game-play occurs without

---

<sup>1</sup>This section intentionally lacks detailed description due to NDA agreement with Samsung



Figure 7.1 Median frames per second (FPS) results for each game (x axis) and scheduling set-up (series).

intervals for a long time. Our experiments compare our approach (STC + RTS) against two additional scheduling scenarios:

- *GTS*: The default Linux CFS scheduler with the GTS support for asymmetric systems
- *STC*: The CFS scheduler with the STC framework enabled during the game-play
- *STC + RTS*: The RTS running on top of the STC framework

Our experiments include 10 runs for each game-scheduler pair and each run lasts for 800 seconds. The reported results include the averages among the 10 runs of the median Frames per Second (FPS), the average Prescribed Surface Temperature (PST) and the FPS stability. The FPS stability characterizes how stable is the FPS during the gameplay. We compute the stability by computing the percentage of FPS readings that are at maximum 20% above or below the median FPS.

### 7.3.2 RESULTS

#### *Impact of RTS*

Figure 7.1 shows the median FPS obtained from three games when running on the eight cores of the system using the three different scenarios described above. STC+RTS approach improves Samsung's STC by 6% for Lineage2, by 3% for King of Glory and by 7.5% for Dynasty Warriors. Figure 7.2 shows the average Prescribed Surface Temperature (PST) observed during the experiments for each game-scheduler. STC and STC+RTS approaches show similar PST as they both use the same mechanism for the temperature maintenance. It is important to note that via task scheduling, STC+RTS achieves higher FPS than STC while keeping the temperature at the desired levels, avoiding potential lags due to overheating.



Figure 7.2 Prescribed surface temperature (PST) results for each game (x axis) and scheduling set-up (series).



Figure 7.3 FPS stability observed for each game (x axis) and scheduling set-up (series).

For Dynasty Warriors the GTS approach, that lacks the temperature control, outperforms the approaches that control temperature as shown in Figure 7.1. This is because in the case of GTS, the default Linux governor and scheduler constantly push the system to achieve higher FPS which potentially leads to device overheating. This is verified by the results of Figure 7.2: the GTS approach maintains the temperature in reasonable levels for Lineage2 and King of Glory, but in the case of Dynasty Warriors the temperature is increased.

Another important characteristic of the gameplay experience is the FPS stability. Figure 7.3 shows that FPS stability is not affected by the task scheduling performed with RTS even if tasks migrate from one CPU cluster to another.

#### *Comparison to Other Approaches*

In this part of the evaluation we compare RTS against two different scheduling approaches that do not involve any sophisticated task migration. We implement these schedulers on top



Figure 7.4 Comparison of the FPS-PST trade-offs for different scheduling policies running on top of STC.

of STC and we use them to prove that RTS scheduling is effective due to its correct runtime decisions. The schedulers that we compare in this section are all running together with STC and are the following:

- *LittleCls*: in this approach all the tasks of the game are constantly maintained in the little-core cluster
- *RTS*: the presented approach, which migrates a number of tasks to the appropriate core cluster according to current temperature
- *BigCls*: in this approach all the tasks of the game are constantly maintained in the big-core cluster

Figure 7.4 shows the FPS and temperature results for the aforementioned scheduling approaches when running Lineage2. The experimental set-up is the same as described in section 7.3.1. Maintaining the tasks on the little cluster with LittleCls, significantly limits FPS as the little cores are not as powerful to efficiently process demanding game tasks. Due to the lack of the big-core cluster use, device temperature is also maintained to very low levels at 48 Celsius degrees. On the other hand, using only the big-core cluster leads to higher FPS but also increased temperature, even if STC is still operating. RTS is able to maintain temperature at reasonable levels while achieving the highest FPS due to its efficient scheduling. It is interesting to note here that running tasks exclusively on the big cluster is not as efficient in terms of FPS. The increased temperature can affect the governor decisions so the processing power of big cores is reduced leading to FPS decrease as shown in the case of BigCls.

## 7.4 CONCLUSIONS

In this chapter we showed that there is significant potential in increasing game performance on mobile devices through task scheduling. Choosing the right policy is essential as maintaining the tasks to only one core cluster showed its inefficiency. To maintain load balance on asymmetric systems it is important to decouple the scheduling from the operating system as the Android framework offers more programming flexibility as well as the appropriate runtime information that can be used for the scheduling decisions. Due to the lack of runtime information, GTS either keeps FPS high at the cost of increased temperature or maintains the temperature low at the cost of lower FPS. STC+RTS provide a good trade-off between high FPS and reasonable temperature, improving the STC solution.



# CHAPTER 8

## CONCLUSIONS

---

This PhD thesis incorporated flexible software techniques on the runtime system level in order to effectively utilize asymmetric systems. The main contributions of the thesis rely on the efficient exploitation of future asymmetric multi-core systems in terms of performance as well as on conceiving future asymmetric architectures that fit the needs of high performance computing.

This thesis showed that current highly parallel applications are not ready to fully utilize an AMC system. Parallelizing on application level requires significant programming effort and results are not always optimal. Using task-based programming models offers a flexible solution for programmability as well as scheduling and performance. Using task-based programming models on AMC systems increases the need of research for enhancing the runtime system of the task-based programming models to achieve even higher performance. Scheduling and runtime overheads' acceleration are useful research lines towards this direction.

Following these research lines, in this thesis we introduced three novel schedulers for asymmetric systems. We implemented these schedulers within the OmpSs task-based programming model and used them on a real asymmetric multi-core system. These scheduling approaches do not consist of theoretical results but are implementable and work on real platforms with real applications, contrary to previous approaches that use synthetic TDGs and profiling. CATS offers a consistent performance improvement of around 10% to 20% reaching up to 30%. By tracking task execution time, CPATH offers a higher accuracy in the identification of critical tasks but this does not imply that it always increases performance. The number of tasks, task cost variability as well as the TDG structure are vital characteristics of an application that affect performance, especially on AMC systems.

Furthermore, an important outcome of this thesis is the proof that task creation is a significant bottleneck in parallel runtime systems. To overcome this significant bottleneck of task-based programming models we proposed TaskGenX, a HW-SW proposal for accelerating task creation that achieves up to  $15\times$  increased performance. TaskGenX is excels compared to existing approaches, for two main reasons; first is because it achieves higher performance as the number of cores is increased. TaskGenX outperforms approaches that accelerate all the runtime activities by 54% and approaches that accelerate scheduling and dependence analysis by 70%. The second reason that TaskGenX solution is optimal compared to other approaches is that it is the most minimalistic solution. It achieves such results by only requiring the acceleration of task creation, using a simple single-core hardware proposal. Throughout TaskGenX study we also made observations that contribute and give guidelines for the design of the future multi-core asymmetric systems for high performance computing.

Finally we showed that scheduling is important not only for asymmetric systems that run highly parallel applications, but it is also important and can increase performance when used on mobile devices for running multi-threaded applications such as games. An as simple scheduling approach as RTS of this thesis can achieve up to 7.5% increase in FPS while maintaining stable temperature and high FPS stability.

## 8.1 FUTURE DIRECTIONS

Our studies have opened paths for future research on both software and hardware areas. On the software side, the proposed scheduling policies can be enhanced on many different levels. First, there is the opportunity of providing a single smart scheduler that dynamically adapts the most appropriate scheduling policy depending to the application's characteristics and availability of resources. This smart scheduler can include not only the policies proposed in this thesis, but also policies implemented in other studies. The choice of the appropriate scheduling policy can be based on profiling or training (through machine learning) data. Moreover, the scheduling proposals that track task execution time, can be enhanced so that they track the task execution time on all the core types to cover the case when a core type is not always faster. The use of off-line profiling data would also be beneficial to these schedulers, to alleviate the overhead of task cost tracking at runtime.

It is expected in the near future to have asymmetric multi-core systems that incorporate more than two core types. For these systems our scheduling policies can be extended in order to benefit from the increased asymmetry. This can be done by applying multiple levels of criticality to the tasks, and assign each task to the corresponding core type depending on

its performance.

Another enhancement and research path that can be taken from this thesis is the incorporation of asymmetry-aware schedulers with TaskGenX. It is expected that using an asymmetry-aware scheduler on top of TaskGenX will provide further performance improvements. In this case, dependency synchronized applications with TDGs that show long critical paths will benefit, contrary to barrier-based applications that do not have a long critical path. TaskGenX is beneficial for such cases as in these schedulers the task creation overheads are even higher due to the additional time spent on task prioritization.

Furthermore, it sounds reasonable to conduct research on the actual hardware design to accelerate task generation costs. Based on our hardware requirements study we can analyze and design all the task generation steps and provide a hardware implementation. This hardware can be either single core, to support a single level of parallelism (which is the case in most applications) or multi-core in order to support nested parallelism. In this case, the number of cores of this accelerator would indicate the nesting level that TaskGenX could support.



# REFERENCES

---

- [1] Thomas L. Adam, K. M. Chandy, and J. R. Dickson. A Comparison of List Schedules for Parallel Processing Systems. *Commun. ACM*, 17(12), 1974.
- [2] A. Agarwal and P. Kumar. Economical Duplication Based Task Scheduling for Heterogeneous and Homogeneous Computing Systems. In *IACC*, 2009.
- [3] Mike Anderson. Scheduler Options in big.LITTLE Android Platforms. [http://events.linuxfoundation.org/sites/events/files/slides/GTS\\_Angerson.pdf](http://events.linuxfoundation.org/sites/events/files/slides/GTS_Angerson.pdf), 2015.
- [4] ARM. Cortex-A15 technical reference manual, revision: r2p0, 2011. URL [http://infocenter.arm.com/help/topic/com.arm.doc.ddi0438c/DDI0438C\\_cortex\\_a15\\_r2p0\\_trm.pdf](http://infocenter.arm.com/help/topic/com.arm.doc.ddi0438c/DDI0438C_cortex_a15_r2p0_trm.pdf).
- [5] ARM. Cortex-A7 MPCore, revision: r0p3, 2011. URL [http://infocenter.arm.com/help/topic/com.arm.doc.ddi0464d/DDI0464D\\_cortex\\_a7\\_mpcore\\_r0p3\\_trm.pdf](http://infocenter.arm.com/help/topic/com.arm.doc.ddi0464d/DDI0464D_cortex_a7_mpcore_r0p3_trm.pdf).
- [6] ARM. Juno ARM Development Platform. Last accessed Oct 12, 2016.
- [7] Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-André Wacrenier. StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures. *Concurr. Comput. : Pract. Exper.*, 23(2), 2011.
- [8] Eduard Ayguadé, Nawal Copty, Alejandro Duran, Jay Hoe flinger, Yuan Lin, Federico Massaioli, Xavier Teruel, Priya Unnikrishnan, and Guansong Zhang. The design of OpenMP tasks. *IEEE TPDS*, 20(3):404–418, 2009.
- [9] Eduard Ayguadé, RosaM. Badia, Pieter Bellens, Daniel Cabrera, Alejandro Duran, Roger Ferrer, Marc González, Francisco Igual, Daniel Jiménez-González, Jesús Labarta, Luis Martinell, Xavier Martorell, Rafael Mayo, JosepM. Pérez, Judit Planas, and EnriqueS. Quintana-Ortí. Extending OpenMP to Survive the Heterogeneous Multi-Core Era. *International Journal of Parallel Programming*, 38(5-6), 2010. ISSN 0885-7458.
- [10] Saisanthosh Balakrishnan, Ravi Rajwar, Michael Upton, and Konrad K. Lai. The impact of performance asymmetry in emerging multicore architectures. In *ISCA*, pages 506–517, 2005.
- [11] S. Bansal, P. Kumar, and K. Singh. An Improved Duplication Strategy for Scheduling Precedence Constrained Graphs in Multiprocessor Systems. *IEEE TPDS*, 14(6), 2003.

- [12] Barcelona Supercomputing Center. BSC Application Repository, . URL <https://pm.bsc.es/projects/bar>.
- [13] Barcelona Supercomputing Center. Nanos++, . URL <https://pm.bsc.es/nanox>.
- [14] Michael Bauer, Sean Treichler, Elliott Slaughter, and Alex Aiken. Legion: Expressing locality and independence with logical regions. In *SC*, 2012.
- [15] Michael Bauer, Sean Treichler, Elliott Slaughter, and Alex Aiken. Legion: Expressing locality and independence with logical regions. In *SC*, pages 66:1–66:11, 2012.
- [16] Christian Bienia. *Benchmarking Modern Multiprocessors*. PhD thesis, Princeton University, 2011.
- [17] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In *PPoPP*, pages 207–216, 1995.
- [18] OpenMP Architecture Review Board. OpenMP Specification. 4.5, 2015.
- [19] Jag Bolaria. Cortex-A57 extends ARM’s reach, 2012. URL [www.mpronline.com](http://www.mpronline.com).
- [20] Javier Bueno, Judit Planas, Alejandro Duran, Rosa M. Badia, Xavier Martorell, Eduard Ayguadé, and Jesús Labarta. Productive programming of GPU clusters with OmpSs. In *IPDPS*, pages 557–568, 2012.
- [21] Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra. Parallel tiled QR factorization for multicore architectures. Technical report, 2007.
- [22] T. Cao, W. Huang, Y. He, and M. Kondo. Cooling-aware job scheduling and node allocation for overprovisioned hpc systems. In *IPDPS’17*, pages 728–737, 2017.
- [23] E. Castillo, M. Moreto, M. Casas, L. Alvarez, E. Vallejo, K. Chronaki, R. Badia, J. L. Bosque, R. Beivide, E. Ayguade, J. Labarta, and M. Valero. Cata: Criticality aware task acceleration for multicore processors. In *2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)*, pages 413–422, May 2016. doi: 10.1109/IPDPS.2016.49.
- [24] E. Castillo, L. Alvarez, M. Moreto, M. Casas, E. Vallejo, J. L. Bosque, R. Beivide, and M. Valero. Architectural support for task dependence management with flexible software scheduling. In *2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)*, pages 283–295, Feb 2018. doi: 10.1109/HPCA.2018.00033.
- [25] Barbara Chapman. *The Multicore Programming Challenge*, pages 3–3. Springer Berlin Heidelberg, Berlin, Heidelberg, 2007. ISBN 978-3-540-76837-1. doi: 10.1007/978-3-540-76837-1\_3. URL [http://dx.doi.org/10.1007/978-3-540-76837-1\\_3](http://dx.doi.org/10.1007/978-3-540-76837-1_3).
- [26] Dimitrios Chasapis, Marc Casas, Miquel Moreto, Raul Vidal, Eduard Ayguade, Jesús Labarta, and Mateo Valero. PARSECSS: Evaluating the Impact of Task Parallelism in the PARSEC Benchmark Suite. *TACO*, 2015.

- [27] N. Chitlur, G. Srinivasa, S. Hahn, P.K. Gupta, D. Reddy, D. Koufaty, P. Brett, A. Prabhakaran, Li Zhao, N. Ijih, S. Subhaschandra, S. Grover, Xiaowei Jiang, and R. Iyer. Quickia: Exploring heterogeneous architectures on real prototypes. In *HPCA*, pages 1–8, 2012.
- [28] K. Chronaki, M. Moretó, M. Casas, A. Rico, R. M. Badia, E. Ayguadé, J. Labarta, and M. Valero. Poster: Exploiting asymmetric multi-core processors with flexible system software. In *PACT*, pages 415–417, Sept 2016.
- [29] K. Chronaki, A. Rico, M. Casas, M. Moretó, R. M. Badia, E. Ayguadé, J. Labarta, and M. Valero. Task scheduling techniques for asymmetric multi-core systems. volume 28, pages 2074–2087, 2017.
- [30] K. Chronaki, M. Casas, M. Moretó, J. Bosch, and R. M. Badia. Taskgenx: A hardware-software proposal for accelerating task parallelism. In *International Supercomputing Conference (ISC)*, 2018.
- [31] Kallia Chronaki, Alejandro Rico, Rosa M. Badia, Eduard Ayguadé, Jesús Labarta, and Mateo Valero. Criticality-aware dynamic task scheduling for heterogeneous architectures. In *ICS*, pages 329–338, 2015.
- [32] Hongsuk Chung, Munsik Kang, and Hyun-Duk Cho. Heterogeneous Multi-Processing Solution of Exynos 5 Octa with ARM big.LITTLE Technology. Technical report, Samsung Electronics Co., Ltd., 2013.
- [33] Jason Cong and Bo Yuan. Energy-efficient scheduling on heterogeneous multi-core architectures. In *ISLPED*, pages 345–350, 2012.
- [34] T. Dallou, N. Engelhardt, A. Elhossini, and B. Juurlink. Nexus#: A distributed hardware task manager for task-based programming models. In *IPDPS*, pages 1129–1138, 2015.
- [35] M.I. Daoud and N. Kharma. Efficient Compile-Time Task Scheduling for Heterogeneous Distributed Computing Systems. In *ICPADS*, 2006.
- [36] Mike Demler. Cortex-A72 takes big step forward, 2015. URL [www.mpronline.com](http://www.mpronline.com).
- [37] R.H. Dennard, F.H. Gaenslen, V.L. Rideout, E. Bassous, and A.R. LeBlanc. Design of Ion-implanted MOSFET's with Very Small Physical Dimensions. In *Solid-State Circuits, IEEE Journal of*, volume 9, 1974. doi: 10.1109/JSSC.1974.1050511.
- [38] Kiril Dichev, Herbert Jordan, Konstantinos Tovletoglou, Thomas Heller, Dimitrios Nikolopoulos, Georgios Karakontantis, and Charles Gillan. Dependency-aware rollback and checkpoint-restart for distributed task-based runtimes. 2017.
- [39] Alejandro Duran, Julita Corbalán, and Eduard Ayguadé. Evaluation of OpenMP Task Scheduling Strategies. *IWOMP'08*, 2008.
- [40] Alejandro Duran, Josep M. Perez, Eduard Ayguadé, Rosa M. Badia, and Jesus Labarta. Extending the OpenMP Tasking Model to Allow Dependent Tasks. *IWOMP'08*, 2008.

- [41] Alejandro Duran, Eduard Ayguadé, Rosa M Badia, Jesús Labarta, Luis Martinell, Xavier Martorell, and Judit Planas. Ompss: a Proposal for Programming Heterogeneous Multi-Core Architectures. *Parallel Processing Letters*, 21, 2011.
- [42] Y. Etsion, F. Cabarcas, A. Rico, A. Ramirez, R. M. Badia, E. Ayguade, J. Labarta, and M. Valero. Task superscalar: An out-of-order task pipeline. In *MICRO*, pages 89–100, 2010.
- [43] Alexandra Fedorova, Juan Carlos Saez, Daniel Sheleпов, and Manuel Prieto. Maximizing Power Efficiency with Asymmetric Multicore Systems. *Communications of the ACM*, 52(12), 2009.
- [44] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the cilk-5 multithreaded language. *SIGPLAN Not.*, 33(5):212–223, May 1998. ISSN 0362-1340. doi: 10.1145/277652.277725. URL <http://doi.acm.org/10.1145/277652.277725>.
- [45] Matteo Frigo, Pablo Halpern, Charles E. Leiserson, and Stephen Lewin-Berlin. Reducers and other cilk++ hyperobjects. In *Proceedings of the Twenty-first Annual Symposium on Parallelism in Algorithms and Architectures*, SPAA ’09, pages 79–90, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-606-9. doi: 10.1145/1583991.1584017. URL <http://doi.acm.org/10.1145/1583991.1584017>.
- [46] T. Grass, C. Allande, A. Armejach, A. Rico, E. Ayguadé, J. Labarta, M. Valero, M. Casas, and M. Moreto. Musa: A multi-level simulation approach for next-generation hpc machines. In *SC16*, pages 526–537, Nov 2016. doi: 10.1109/SC.2016.44.
- [47] Peter Greenhalgh. big.LITTLE Processing with ARM Cortex-A15 & Cortex-A7. *ARM White Paper*, 2011.
- [48] Linley Gwennap. Cortex-A75 Has DynamIQ Debut, Microprocessor Report, 2017.
- [49] M. Hakem and F. Butelle. Dynamic Critical Path Scheduling Parallel Programs onto Multiprocessors. In *IPDPS*, 2005.
- [50] Ian Rickards and Amit Kucherla. Energy Aware Scheduling (EAS) progress update. <https://www.linaro.org/blog/core-dump-energy-aware-scheduling-eas-progress-update>, 2015.
- [51] Michael A. Iverson, Füsün Özgüner, and Gregory J. Follen. Parallelizing Existing Applications in a Distributed Heterogeneous Environment. In *HCW*, 1995.
- [52] Brian Jeff. big.LITTLE Technology Moves Towards Fully Heterogeneous Global Task Scheduling. *ARM White Paper*, 2013.
- [53] M. C. Jeffrey, S. Subramanian, C. Yan, J. Emer, and D. Sanchez. A scalable architecture for ordered parallelism. In *MICRO*, pages 228–241, 2015.
- [54] José A. Joao, M. Aater Suleman, Onur Mutlu, and Yale N. Patt. Bottleneck identification and scheduling in multithreaded applications. In *ASPLOS*, 2012.

- [55] José A. Joao, M. Aater Suleman, Onur Mutlu, and Yale N. Patt. Utility-based acceleration of multithreaded applications on asymmetric CMPs. In *ISCA*, pages 154–165, 2013.
- [56] LTD KOEI TECMO GAMES CO. Dynasty Warriors. URL <http://www.koeitecmoamerica.com>.
- [57] David Koufaty, Dheeraj Reddy, and Scott Hahn. Bias scheduling in heterogeneous multi-core architectures. In *EuroSys*, pages 125–138, 2010.
- [58] Kevin Krewell. Cortex-A53 is ARM’s next little thing, 2012. URL [www.mpronline.com](http://www.mpronline.com).
- [59] Rakesh Kumar, Keith I. Farkas, Norman P. Jouppi, Parthasarathy Ranganathan, and Dean M. Tullsen. Single-isa heterogeneous multi-core architectures: The potential for processor power reduction. In *MICRO*, pages 81–92, 2003.
- [60] Rakesh Kumar, Dean M. Tullsen, Parthasarathy Ranganathan, Norman P. Jouppi, and Keith I. Farkas. Single-isa heterogeneous multi-core architectures for multithreaded workload performance. In *ISCA*, pages 64–75, 2004.
- [61] Sanjeev Kumar, Christopher J. Hughes, and Anthony Nguyen. Carbon: Architectural support for fine-grained parallelism on chip multiprocessors. In *ISCA*, pages 162–173, 2007.
- [62] M. A. Laurenzano, A. Tiwari, A. Cauble-Chantrenne, A. Jundt, W. A. Ward, R. Campbell, and L. Carrington. Characterization and bottleneck analysis of a 64-bit armv8 platform. In *ISPASS’16*, pages 36–45, 2016.
- [63] K. Li, X. Tang, and K. Li. Energy-efficient stochastic task scheduling on heterogeneous computing systems. *IEEE Transactions on Parallel and Distributed Systems*, 25(11):2867–2876, 2014. ISSN 1045-9219. doi: 10.1109/TPDS.2013.270.
- [64] Kenli Li, Zhimin Zhang, Yuming Xu, Bo Gao, and Ligang He. Chemical Reaction Optimization for Heterogeneous Computing Environments. In *ISPA*, 2012.
- [65] Chun-Hsien Liu, Chia-Feng Li, Kuan-Chou Lai, and Chao-Chin Wu. A dynamic Critical Path Duplication Task Scheduling Algorithm for Distributed Heterogeneous Computing Systems. In *ICPADS*, 2006.
- [66] A. Lukefahr, S. Padmanabha, R. Das, F. M. Sleiman, R. Dreslinski, T. F. Wenisch, and S. Mahlke. Composite cores: Pushing heterogeneity into a core. In *MICRO*, pages 317–328, 2012.
- [67] M Manivannan and P Stenström. Runtime-guided cache coherence optimizations in multi-core architectures. In *IPDPS*, 2014.
- [68] Mathieu Poirier. In Kernel Switcher: A solution to support ARM’s new big.LITTLE technology. Embedded Linux Conference 2013, 2013.
- [69] Sparsh Mittal. A survey of techniques for architecting and managing asymmetric multicore processors. 48, 02 2016.

- [70] Tomer Y. Morad, Uri C. Weiser, Avinoam Kolodny, Mateo Valero, and Eduard Ayguade. Performance, power efficiency and scalability of asymmetric cluster chip multiprocessors. *IEEE Comput. Archit. Lett.*, 5(1):4–17, 2006. ISSN 1556-6056.
- [71] Ioannis A. Moschakis and Helen D. Karatza. A meta-heuristic optimization approach to the scheduling of bag-of-tasks applications on heterogeneous clouds with multi-level arrivals and critical jobs. *Simulation Modelling Practice and Theory*, 57:1–25, 2015. doi: 10.1016/j.simpat.2015.04.009.
- [72] NCSOFT Corp. Netmarble Corp. and Netmarble Neo Inc. Lineage2: Revolution. URL <http://l2.netmarble.com>.
- [73] OpenMP Architecture Review Board. OpenMP architecture review board: Application program interface, 2013.
- [74] OpenMP Architecture Review Board. OpenMP architecture review board: Application programming interface 4.5. 2015.
- [75] Chandandeep Singh Pabla. Completely fair scheduler. *Linux J.*, 2009(184), August 2009. ISSN 1075-3583. URL <http://dl.acm.org/citation.cfm?id=1594371.1594375>.
- [76] A.J. Page and T.J. Naughton. Dynamic Task Scheduling using Genetic Algorithms for Heterogeneous Distributed Computing. In *Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International*, 2005.
- [77] Vassilis Papaefstathiou, Manolis G.H. Katevenis, Dimitrios S. Nikolopoulos, and Dionisios Pnevmatikatos. Prefetching and cache management using task lifetimes. In *Proceedings of the 27th International ACM Conference on International Conference on Supercomputing*, ICS ’13, pages 325–334, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2130-3. doi: 10.1145/2464996.2465443. URL <http://doi.acm.org/10.1145/2464996.2465443>.
- [78] Jacques A. Pienaar, Srimat Chakradhar, and Anand Raghunathan. Automatic Generation of Software Pipelines for Heterogeneous Parallel Systems. In *SC*, 2012.
- [79] J. Planas, R.M. Badia, E. Ayguade, and J. Labarta. Self-Adaptive OmpSs Tasks in Heterogeneous Environments. pages 138–149, 2013.
- [80] Nikola Rajovic, Paul M Carpenter, Isaac Gelado, Nikola Puzovic, Alex Ramirez, and Mateo Valero. Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC? In *SC*, 2013.
- [81] James Reinders. *Intel threading building blocks - outfitting C++ for multi-core processor parallelism*. O’Reilly, 2007.
- [82] Bin Ren, Sriram Krishnamoorthy, Kunal Agrawal, and Milind Kulkarni. Exploiting vector and multicore parallelism for recursive, data- and task-parallel programs. In *PPoPP ’17*, pages 117–130, New York, NY, USA, 2017.
- [83] Alejandro Rico, Felipe Cabarcas, Carlos Villavieja, Milan Pavlovic, Augusto Vega, Yoav Etsion, Alex Ramirez, and Mateo Valero. On the Simulation of Large-Scale Architectures Using Multiple Application Abstraction Levels. *ACM TACO*, 8(4), 2012.

- [84] Rance Rodrigues, Arunachalam Annamalai, Israel Koren, and Sandip Kundu. Scalable thread scheduling in asymmetric multicores for power efficiency. In *SBAC-PAD*, pages 59–66, 2012.
- [85] Daniel Sanchez, Richard M. Yoo, and Christos Kozyrakis. Flexible architectural support for fine-grain scheduling. In *Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems, ASPLOS XV*, pages 311–322, 2010. ISBN 978-1-60558-839-1.
- [86] M. Själander, A. Terechko, and M. Duranton. A look-ahead task management unit for embedded multi-core architectures. In *2008 11th EUROMICRO Conference on Digital System Design Architectures, Methods and Tools*, pages 149–157, 2008.
- [87] M. Aater Suleman, Onur Mutlu, Moinuddin K. Qureshi, and Yale N. Patt. Accelerating critical section execution with asymmetric multi-core architectures. In *ASPLOS*, pages 253–264, 2009.
- [88] X. Tan, J. Bosch, M. Vidal, C. Álvarez, D. Jiménez-González, E. Ayguadé, and M. Valero. General purpose task-dependence management hardware for task-based dataflow programming models. In *IPDPS*, pages 244–253, 2017.
- [89] X. Tang, A. Pattnaik, H. Jiang, O. Kayiran, A. Jog, S. Pai, M. Ibrahim, M. T. Kandemir, and C. R. Das. Controlled kernel launch for dynamic parallelism in gpus. In *HPCA’17*, pages 649–660, 2017.
- [90] Tencent Games. King of Glory (Arena of Valor). URL <https://www.arenaofvalor.com>.
- [91] H. Topcuoglu, S. Hariri, and Min-You Wu. Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing. *IEEE TPDS*, 13(3), 2002.
- [92] Jim Turley. Cortex-A15 eagle flies the coop, 2011. URL [www.mpronline.com](http://www.mpronline.com).
- [93] Kenzo Van Craeynest, Aamer Jaleel, Lieven Eeckhout, Paolo Narvaez, and Joel Emer. Scheduling heterogeneous multi-cores through performance impact estimation (pie). In *ISCA*, pages 213–224, 2012.
- [94] Kenzo Van Craeynest, Shoaib Akram, Wim Heirman, Aamer Jaleel, and Lieven Eeckhout. Fairness-aware Scheduling on single-ISA Heterogeneous Multi-cores. In *PACT*, 2013.
- [95] Hans Vandierendonck, George Tzenakis, and Dimitrios S Nikolopoulos. A unified scheduler for recursive and task dataflow parallelism. In *PACT*, 2011.
- [96] Hans Vandierendonck, Kallia Chronaki, and Dimitrios S. Nikolopoulos. Deterministic scale-free pipeline parallelism with hyperqueues. In *SC*, 2013.
- [97] H. Wong, A. Bracy, E. Schuchman, T. M. Aamodt, J. D. Collins, P. H. Wang, G. Chinya, A. K. Groen, H. Jiang, and H. Wang. Pangaea: A tightly-coupled ia32 heterogeneous chip multiprocessor. In *PACT*, 2008.

- [98] M.-Y. Wu and D.D. Gajski. Hypertool: a Programming Aid for Message-Passing Systems. *IEEE TPDS*, 1(3), 1990.
- [99] Yun Wu, Charles Gillan, Umar Minhas, Sakil Barbhuiya, Aleksandar Novakovic, Konstantinos Tovletoglou, Georgios Tzenakis, Hans Vandierendonck, Georgios Karakonstantis, and Dimitrios Nikolopoulos. Heterogeneous servers based on programmable cores and dataflow engines. In *EnESCE*, 2017.
- [100] Tao Yang and Apostolos Gerasoulis. DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors. *IEEE TPDS*, 5(9), 1994.
- [101] Han Yu. A Hybrid GA-based Scheduling Algorithm for Heterogeneous Computing Environments. In *SCIS*, 2007.
- [102] Xusheng Zhan, Yungang Bao, Christian Bienia, and Kai Li. Parsec3.0: A multi-core benchmark suite with network stacks and splash-2x. *SIGARCH Comput. Archit. News*, 44(5):1–16, 2017. ISSN 0163-5964.
- [103] Ziliang Zong, A. Manzanares, Xiaojun Ruan, and Xiao Qin. EAD and PECD: Two Energy-Aware Duplication Scheduling Algorithms for Parallel Tasks on Homogeneous Clusters. *IEEE Trans. Comput.*, 60(3), 2011.
- [104] Stéphane Zuckerman, Joshua Suetterlein, Rob Knauerhase, and Guang R. Gao. Using a “Codelet” program execution model for exascale machines: Position paper. In *EXADAPT*, 2011.