



SCHOOL OF COMPUTATION,  
INFORMATION AND TECHNOLOGY —  
INFORMATICS

TECHNISCHE UNIVERSITÄT MÜNCHEN

Bachelor's Thesis in Informatics

**Exploring GPU Programming Models for  
Autonomous Driving: From Coroutine  
Integration to Persistent Thread  
Optimization**

Jaden Rotter





SCHOOL OF COMPUTATION,  
INFORMATION AND TECHNOLOGY —  
INFORMATICS

TECHNISCHE UNIVERSITÄT MÜNCHEN

Bachelor's Thesis in Informatics

**Exploring GPU Programming Models for  
Autonomous Driving: From Coroutine  
Integration to Persistent Thread  
Optimization**

**GPU Coroutines in Autonomes Fahren**

|                  |                 |
|------------------|-----------------|
| Author:          | Jaden Rotter    |
| Examiner:        | Supervisor      |
| Supervisor:      | Jianfeng Gu     |
| Submission Date: | Submission date |



I confirm that this bachelor's thesis is my own work and I have documented all sources and material used.

Munich, Submission date

Jaden Rotter

## **Acknowledgments**

# **Abstract**

# Contents

|                                                                 |           |
|-----------------------------------------------------------------|-----------|
| <b>Acknowledgments</b>                                          | <b>iv</b> |
| <b>Abstract</b>                                                 | <b>v</b>  |
| <b>1 Introduction</b>                                           | <b>1</b>  |
| 1.1 Motivation . . . . .                                        | 1         |
| 1.1.1 Evolution of Autonomous Driving Architectures . . . . .   | 1         |
| 1.1.2 Limitations of Current GPU Execution Models . . . . .     | 2         |
| 1.2 Problem Statement . . . . .                                 | 2         |
| 1.3 Objectives and Contributions . . . . .                      | 3         |
| 1.3.1 Original Objective . . . . .                              | 3         |
| 1.3.2 Challenges and Scope Adjustment . . . . .                 | 3         |
| 1.3.3 Contributions . . . . .                                   | 4         |
| <b>2 Background</b>                                             | <b>5</b>  |
| 2.1 Real Time Systems . . . . .                                 | 5         |
| 2.2 Integration of GPUs in Autonomous Driving Systems . . . . . | 5         |
| 2.3 GPU versus CPU Architecture . . . . .                       | 6         |
| 2.3.1 GPU Architecture . . . . .                                | 9         |
| 2.4 GPU Programming using the CUDA API . . . . .                | 12        |
| 2.4.1 Kernel Launches . . . . .                                 | 12        |
| 2.4.2 Host to GPU Memory Transfers and Bandwidth Considerations | 13        |
| 2.4.3 Memory Coalescing . . . . .                               | 13        |
| 2.4.4 Example Kernel Launch . . . . .                           | 14        |
| 2.4.5 CUDA Streams . . . . .                                    | 15        |
| 2.5 GPU Programming Models for Real Time Systems . . . . .      | 16        |
| 2.5.1 Coroutines . . . . .                                      | 16        |
| 2.5.2 GPU Coroutines . . . . .                                  | 18        |
| 2.5.3 Persistent Kernels . . . . .                              | 19        |
| <b>3 Related Work</b>                                           | <b>20</b> |
| 3.1 Compiler Driven Frameworks . . . . .                        | 20        |
| 3.2 Runtime Scheduling Frameworks . . . . .                     | 21        |

|                                                                      |           |
|----------------------------------------------------------------------|-----------|
| <b>4 System Design and Implementation</b>                            | <b>22</b> |
| 4.1 Platform Integration: GPU Scheduling in Apollo . . . . .         | 22        |
| 4.1.1 System Design . . . . .                                        | 23        |
| 4.2 Task Management System . . . . .                                 | 23        |
| 4.2.1 Task Queue Design . . . . .                                    | 24        |
| 4.2.2 Extensibility for Coroutines and Priority Scheduling . . . . . | 24        |
| 4.3 Function Context . . . . .                                       | 25        |
| 4.3.1 GPU Function Pointers . . . . .                                | 25        |
| 4.3.2 Function Parameters . . . . .                                  | 25        |
| 4.4 Memory Management System . . . . .                               | 26        |
| 4.4.1 Memory Buffer Design . . . . .                                 | 26        |
| 4.4.2 CUDA Stream Optimization . . . . .                             | 27        |
| 4.5 GPU Block Synchronization . . . . .                              | 28        |
| 4.6 Architecture . . . . .                                           | 30        |
| 4.7 Further Implementation Considerations . . . . .                  | 31        |
| 4.7.1 Serialization of Data for Memory Copies . . . . .              | 31        |
| 4.7.2 Variable Launch Configurations . . . . .                       | 31        |
| <b>5 Experiments and Evaluation</b>                                  | <b>32</b> |
| 5.1 Experimental Setup . . . . .                                     | 32        |
| 5.1.1 Matrix Multiplications for Performance Testing . . . . .       | 33        |
| 5.1.2 Testing Environment . . . . .                                  | 33        |
| 5.1.3 Small Kernel and Small Memory Transfers . . . . .              | 34        |
| 5.1.4 Small Kernel and Larger Memory Transfers . . . . .             | 35        |
| 5.1.5 Multiple Kernels . . . . .                                     | 35        |
| 5.2 Profiling and Analysis with nsys . . . . .                       | 35        |
| 5.3 Coroutine Implementation . . . . .                               | 36        |
| 5.3.1 Yielding and Changing Tasks . . . . .                          | 37        |
| 5.3.2 Continuation and Resumption . . . . .                          | 37        |
| 5.4 Limitations and Future Work . . . . .                            | 38        |
| <b>Abbreviations</b>                                                 | <b>39</b> |
| <b>List of Figures</b>                                               | <b>40</b> |
| <b>List of Tables</b>                                                | <b>41</b> |
| <b>Bibliography</b>                                                  | <b>42</b> |

# 1 Introduction

## 1.1 Motivation

Autonomous driving systems place stringent demands on computational performance, predictability and safety. These demands arise from the need to process vast amounts of sensor data and run complex perception and decision making algorithms in real time, while ensuring timely and deterministic responses to a dynamic environment. To meet the computational requirements of such systems, GPUs have become essential due to their performance on machine learning workloads. However, the current GPU programming and execution model is poorly suited to real time constraints. The autonomous driving platform, Apollo, which serves as the primary motivation for this thesis, illustrates this challenge. While Apollo integrates GPU accelerated workloads, it does not natively provide real time GPU support. Although this thesis does not implement the proposed approach directly within Apollo, the challenges of meeting real time requirements in autonomous systems, particularly in GPU workloads, serve as the central motivation for this research.

### 1.1.1 Evolution of Autonomous Driving Architectures

Historically, early autonomous driving systems addressed the real time requirements using a distributed architecture. In this approach, major functional modules, such as perception, localization, planning, and control, were mapped to separate compute units, which together formed a processing pipeline [1]. Each module could thus operate with predictable timing characteristics, avoiding contention with other modules. In this manner, the distributed architecture allowed for fine tuning the timing between modules to achieve low latency responses from the hardware. This modular architecture ensured responsiveness and real time guarantees at the expense of high hardware cost and increased system complexity.

The rise of increasingly powerful GPUs has enabled a shift toward centralized computing in order to simplify the hardware and complexity while reducing costs. In this centralized architecture, all core driving modules share a common compute node consisting of a heterogeneous CPU-GPU system. The compute node integrates a CPU for task scheduling and system control, which assigns compute heavy workloads

to the GPU. Moving to a singular compute node allows savings in cost, design, and intermodule latency.

### 1.1.2 Limitations of Current GPU Execution Models

Despite the benefits afforded by a centralized architecture and the advances in hardware, the system risks GPU oversubscription. As the GPU is responsible for all processing tasks, too many simultaneously scheduled tasks can lead to delays in execution time. Typical real time solutions based on a CPU architecture, ensure system safety under contention by preempting non critical tasks. Tasks requiring high responsiveness can then be directly executed after preemption of the resident processes. Although there already exists native support for this preemption on CPUs within both the kernel and user space, GPUs currently lack such capabilities.

Modern GPUs do not natively support real time programming models in the same way CPUs do. GPU execution is managed by a hardware scheduler that handles thread blocks, warps and memory accesses according to fixed, internal policies. This internal GPU hardware scheduler is optimized for throughput rather than deterministic execution, required by real time systems. In effect, programmers have limited control over the exact scheduling of threads or tasks, when using standard scheduling techniques. Critically, tasks with strict timing requirements may suffer delays if long running, lower priority kernels are already resident on the device. The inability to natively preempt GPU kernels or to enforce strict task priorities makes using standard GPU kernel scheduling methods unsuitable for safety critical, real time workloads. Particularly in the context of autonomous driving, the repercussions of latencies pose a safety hazard.

## 1.2 Problem Statement

Rather than relying on native kernel launches, this thesis investigates the use of persistent threads to enable low latency execution and develops the foundational framework needed to integrate coroutines into the system for further real time determinism. Persistent threads are specialized kernels that are launched at the start of the application and remain active throughout the lifetime of the system. They function as a user level scheduler, continuously polling for work, executing tasks, and executing scheduling decisions.

Building on this framework, the central research question addressed in this thesis is:

**How can a persistent GPU thread model be designed to enable the implementation of GPU coroutines for predictable, low latency scheduling on real time systems?**

As part of this research, the following aspects of the persistent threaded implementation will be considered and evaluated:

- **GPU Task Management:** Mechanism for submitting tasks to the GPU and retrieving results asynchronously.
- **GPU Device Memory Management:** Efficient allocation, deallocation, and usage of GPU memory to ensure low latency access and avoid contention between tasks.
- **GPU Stream Management:** Organizing and scheduling multiple GPU streams to enable concurrent task execution with memory transfers, while maintaining predictable execution orders.
- **Framework for Coroutines:** Designing a foundation for GPU coroutines that allows cooperative multitasking and the implementation of custom scheduling strategies on top of persistent threads.

## 1.3 Objectives and Contributions

### 1.3.1 Original Objective

The original objective of this thesis was to integrate an existing coroutine based GPU scheduling framework into an autonomous driving system. This integration aimed to evaluate the feasibility of fine grained GPU scheduling within a complex, real time environment. Furthermore, by measuring scheduling latencies in the system, it would be possible to derive and refine strict timing guarantees.

### 1.3.2 Challenges and Scope Adjustment

Direct implementation of the coroutine framework proved infeasible within the available time frame. The system was originally designed for graphics rendering, relied on a complex compiler based architecture with little documentation, and required a separate build process. Combined with limited prior experience in GPU programming and compiler theory, these factors made integration within the timeline challenging.

To simplify the problem, the focus shifted from coroutines to the underlying execution model, persistent threads. No suitable open source implementation was found that could be directly integrated into an autonomous driving system. Instead, a minimal, open source, custom persistent thread scheduler, LightKernel was found and extended to be suitable for real time scheduling. The implementation originally only measuring scheduling overheads from kernel launches, but provided a foundation to implement

fully functional persistent threads. This new scheduler serves as a proof of concept foundation which implements long running GPU kernels to receive and execute tasks efficiently. Furthermore it provides the framework on which GPU routines can be implemented to enable suspension and prioritization between tasks, allowing exploration of how real time behaviors can be approximated within the constraints of the CUDA execution model.

### 1.3.3 Contributions

This thesis designs and implements a number of persistent thread components to enable the efficient execution of GPU code.

1. **Task Management System:** An execution management system was developed to allow tasks to be queued and executed independently of resident executing kernels. This system captures the full execution context of functions, enabling persistent threads to retrieve, schedule, and execute tasks without requiring new kernel launches.
2. **Memory Management System:** To eliminate memory allocations throughout the runtime of the persistent kernel, a memory management system was designed. Tasks inserted into the task management utilize the memory management systems for their associated memory allocations. In this system, memory is mapped into a running epoch buffer of preallocated memory assigned before the launch of the persistent kernel. This system further provides a logical partition of input and output buffers in order to reduce data interdependencies.
3. **Concurrent Memory and Execution models:** To support concurrent memory transfers between the host and device, as well as simultaneous kernel execution, the system leverages multiple independent CUDA streams to resolve data dependencies. This system enables higher overlap between memory and compute operations and better utilization of GPU resources.
4. **GPU Task Coordination and Synchronization:** The persistent kernel is spread across multiple thread blocks to execute distinct tasks concurrently. This system is synchronized within the device code to ensure that no two blocks execute the same task simultaneously. By enforcing exclusive task execution, the mechanism prevents race conditions and ensures correctness across all kernels. Furthermore, this system allows the host to efficiently schedule and queue tasks to any available thread block, maintaining high utilization of the GPU.

## 2 Background

This background chapter begins by examining the role of GPUs in real time systems, focusing on the architectural and programming constraints that influence scheduling and performance. The section then develops into an analysis of real time programming models for GPUs with consideration of persistent threads and coroutines as implemented into the autonomous driving system Apollo. This section provides the necessary technical background to understand how GPU design influences system behavior and scheduling under real time constraints.

### 2.1 Real Time Systems

Real time systems are designed with strict timing constraints to ensure predictable and safe behaviour. [2] These constraints are expressed in terms of the systems ability to meet task deadlines, classified as either soft or hard deadlines. Hard deadlines are critical to the safety of the system and missing these deadlines results in potential system failure or unsafe conditions. For example, missing the deadline on tasks such as collision avoidance or brake activation, severely impact the systems safety. Soft deadlines, in contrast, are less critical and can be missed without posing a risk to system integrity. To ensure the safety of the system, tasks are prioritized both by their deadline urgency and by the criticality of their impact. Prioritization in real time systems often requires preempting soft deadline or non urgent tasks in order to prioritize hard deadline, critical tasks.

### 2.2 Integration of GPUs in Autonomous Driving Systems

Early autonomous systems relied solely on CPU based compute engines for system control and task execution. For example, the Stanley autonomous car, which won the 2005 DARPA Grand Challenge, used 6 Pentium processors among which tasks were divided. The CPU centric design, combined with resource partitioning, enabled the system to satisfy real time requirements using established approaches.

As deep learning techniques advanced, CNNs began to outperform tradition methods in perception tasks such as object detection, lane recognition and environment

understanding. This shift necessitated the integration of GPUs to handle the increased computational demands. Early autonomous driving systems reliant on only CPUs struggled to continue performantly processing data with CNNs in real time. One of the earliest projects using GPUs for autonomous driving, was NVIDIA in 2015, which used a GPU to train a CNN that could steer a car end to end from raw camera input. Today, nearly all modern autonomous platforms, rely on GPUs for sensor processing, perception and decision making tasks.

The rapid adoption of GPU into these systems depends heavily on their massive performance gain on highly parallel workloads, such as neural network inferencing and training. Deep neural networks, inspired by the structure of the human brain, learn patterns through layers of weighted nodes. These nodes perform large numbers of matrix operations, tasks that are highly parallelizable and therefore perfectly suited for GPU architectures. As a result, GPUs can exploit the inherent task parallelism to significantly outperform CPUs in both training and inference.

### 2.3 GPU versus CPU Architecture

GPUs deliver a significant increase in throughput over CPUs, by simplifying the thread context in order to afford greater parallelism. Originally developed to accelerate graphics rendering, a task heavy in parallizable computations, GPU architectures were designed to support increasing numbers of threads. Because these tasks required very little control overhead, GPU threads were intentionally kept simple avoiding the complex control logic found in CPU thread management.

Although the CPU offers some parallelism, it is primarily optimized for sequential tasks, relying heavily on the performance of individual threads. To achieve greater performance on these tasks, CPUs dedicate a substantial "portion of their transistors to non computational tasks", specifically control and memory management logic. Features such as prefetching, branch prediction, speculative execution and out of order execution enable CPUs to efficiently handle irregular, sequeuntial control flows. However, this complextiy comes at the cost of reducing teh fraction of hardware dedicated purely to computation. In contrast, GPUs minimize control overhead, instead allocating resources to maximize arithmetic throughput across many lightweight threads. Consequently, CPUs operate at higher clock rates, while GPUs achieve higher throughput by increasing arithmetic intensity [3].

Consider the following graphic Figure 2.1, which highlights the difference in thread complexity.



Figure 2.1: CPU vs GPU Thread Architecture

Although core components are named differently, both CPU and GPU threads work similarly with an instruction decoder, registers and an arithmetic unit. The differences arise when trying to maximize a single control flow. In contrast to the CPU, the GPU threads execute instructions in order, rely on manual prefetching, and use a simpler, conservative branch predictor, which limits their ability to optimize single threaded performance. Instead of optimizing single threaded performance, GPUs achieve high throughput by executing thousands of lightweight threads in parallel, amortizing latency across them. This design favors workloads with high data parallelism, enabling the GPU to hide memory and execution latencies through massive concurrency rather than complex control logic.

The additional complex logic that CPUs use to improve single threaded applications outperforms the single threaded GPU applications, as seen by the following comparison of single threaded matrix multiplications in Figure 2.2 and Figure 2.3. These figures show the comparison of executing a matrix multiplication using only one GPU thread versus one CPU thread on a matrix multiplication task.



Figure 2.2: Single threaded Matrix Multiplication Execution between CPUs and GPUs averaged over 10 executions

| Matrix Size ( $n \times n$ ) | CPU Time (ms) | GPU Time (ms) | Speedup (GPU/CPU) |
|------------------------------|---------------|---------------|-------------------|
| $32 \times 32$               | 0.068705      | 0.615584      | 11.970438         |
| $64 \times 64$               | 0.285104      | 4.249685      | 15.554119         |
| $128 \times 128$             | 2.751817      | 28.923530     | 11.419879         |
| $256 \times 256$             | 20.706290     | 287.933700    | 13.906730         |
| $512 \times 512$             | 198.716200    | 2446.466000   | 12.323010         |
| $1024 \times 1024$           | 3356.762000   | 46097.410000  | 13.745850         |

Figure 2.3: Data Matrix from Figure 2.2

As seen in Figure 2.2 and Figure 2.3, GPUs struggle in applications that fail to utilize the architectural parallelism. For each of the matrices tested, the CPU is on average around 13 times faster than the GPU. Consequently, the GPU should only be used in place of the CPU, when the application is parallelizable and lacks complex control flow logic, otherwise the application will significantly underperform.

For the purpose of this thesis, any GPU resident scheduler that employs complex control logic will introduce significant latency overhead. Scheduling decisions, task management, and resource assignment are inherently difficult to parallelize, making them inefficient on GPU hardware. As a result, the scheduling logic is delegated to the CPU, where complex control mechanisms can be executed efficiently. Consequently, the GPU’s role is reduced to lightweight task dispatch and execution, while the CPU handles task queuing and resource allocation.

### 2.3.1 GPU Architecture

The NVIDIA Tesla V100 GPU, based on the Volta architecture, was selected for this project due to its availability and suitability for high performance computing workloads. In particular, this GPU is used for all related tests including Figure 2.2, Figure 2.3, and the results in Chapter 5. The following analysis of the V100’s hardware architecture, will provide both the reasoning and implementation fundamentals for designing an efficient GPU scheduling strategy.

#### Thread Hierarchy and Execution Model

From the programmer’s perspective when assigning tasks, the GPU appears as an array of independent, highly parallelized processors, called SMs. Each SM receives work in the form of CTAs, blocks of threads executing the same instruction code, which define the organization and grouping of threads for execution. The executing CTA is subdivided into warps, the smallest execution unit on the GPU, each consisting of 32 GPU threads executing instructions in lockstep. The lockstep execution ensures all threads within a warp execute the same instruction simultaneously. By enforcing lockstep execution, the GPU can schedule and dispatch threads to compute units with a simplified design allowing further parallelism.

While lockstep execution enables efficient SIMD style throughput, it also introduces a potential performance hazard. If threads within a warp follow different control flow paths, these threads diverge, and the GPU is forced to execute these different threads sequentially with respect to one another. During this serialization, inactive threads in the warp are masked out, preventing them from writing results back to memory. The result is a reduction in effective parallelism, leading to degraded overall performance.

#### SM Architecture

The Tesla V100 GPU SM architecture contains 4 processing partitions, each with its own complete execution pipeline. These partitions share an L1 instruction cache as well as a combined L1 data and shared memory cache, enabling threads within different

warps of the same CTA to efficiently access shared instructions and data. Within each processing partition there is an L0 instruction cache, a warp scheduler, a dispatch unit, and multiple execution units. An in depth view of each processing partition's architecture is provided in Figure 2.4, taken from the NVIDIA Volta Whitepages.



Figure 2.4: SM Processing Partition Architecture taken from the Volta Whitepages

Every clock cycle, the warp scheduler selects a warp of 32 threads to issue to the dispatch unit, which dispatches decoded instructions to the appropriate functional units. If there are not enough execution units of the required type for a given instruction, the instruction is queued. Depending on the current queue and delays, such as global memory accesses or dependencies, the warp scheduler will interweave different instructions from other ready warps, ensuring that the execution units remain busy. This thread interweaving allows GPUs to hide latencies and resource contention through thread oversubscription.

In GPU scheduling, oversubscription can improve hardware utilization, though it introduces overhead. By scheduling more warps than can actively execute, the GPU can potentially interleave tasks and keeps resources busy during stalls. Whether this is beneficial depends on the workload goals: oversubscription increases throughput, but potentially adds latency for individual tasks. Real time workloads may benefit from oversubscription, but the level should be tuned to the system's performance goals to

ensure minimum timing guarantees.

### SM Scaling and Thread Concurrency

Compared to CPU hardware threads, GPUs scale far more aggressively, supporting a much larger number of threads. On the Tesla V100, there are 80 individual SMs, each capable of supporting up to 64 resident warps. With 32 threads per warp, this yields  $64 \times 32 = 2048$  threads per SM. Across all 80 SMs, the theoretical maximum concurrency is  $80 \times 2048 = 163,840$  resident threads. For comparison, a typical Intel Tiger Lake i5-1135G7 CPU has 4 cores with 2 hardware threads each, for a maximum of 8 concurrent hardware threads. High end server CPUs, such as the Intel Xeon Gold 6148, only support 20 hardware threads. Although the GPU oversubscribes the number of warps and threads to hide latencies, the total number of dispatch and scheduling units allow for a maximum of  $4 * 32 * 80 = 10840$  instructions issued per cycle when the hardware is fully utilized.

In practice, this theoretical maximum is rarely achieved due to resource bottlenecks. All threads within an SM share the same on-chip L1 data and shared memory cache, as well as the 256 KiB on-chip register file ( $16,384 \times 32$  bit registers per processing partition, with four partitions per SM). SM threads all share the same on-chip L1 data and shared memory cache and the 256KiB on-chip register file ( $16,384 \times 32$  bit registers per processing partition, with four partitions per SM). Depending on the kernel's resource usage, high register pressure will cause the kernel launch to fail. Furthermore, if multiple thread blocks with high shared memory demands get scheduled to the same SM, they will saturate the L1 data and shared memory cache and force frequent evictions and write backs to the L2 cache or global memory. These hardware limits must be carefully considered when mapping tasks to SMs.

Native GPU code handles these constraints automatically without any needed input from the programmer. When launching kernels, the host, driver and device each analyze the programming configurations to see if the hardware can support the task. Should the hardware fail, the launch silently fails with a `cudaError_t` return type.

However, building a GPU scheduler on top of this system introduces additional challenges. A custom scheduler must account for the same hardware constraints, thread limits, register pressure, shared memory, and occupancy, when mapping tasks to the GPU. At the same time, the scheduler operates within the bounds of the proprietary hardware scheduler, which already enforces its own resource allocation and dispatch policies. As a result, attempts to impose new scheduling behavior may conflict with or be restricted by the GPU's built in mechanisms, limiting the degree of control available to the programmer.

## Scheduling and Task Mapping

Within the CUDA programming model, the specific mapping of tasks to SMs, TPCs, and GPCs is determined natively by the proprietary hardware scheduler, the GigaThread Engine. While the exact documentation is not public, this module maps CTAs to the individual SMs based a multitude of factors: hardware resources, parallelism, priorities, and dependencies. Similarly, the global memory and L2 cache utilization are determined by the hardware and transparent to the programmer. After the CTA gets mapped to the specific SM, the device code then executes till completion without interruption.

The CTAs are entirely managed by the SM on which it is currently executing. As shown previously in Figure 2.4, the SM has its own execution pipelines, register files, shared memory and scheduling units on which the CTA executes. For SMs to communicate with one another, they must use either the global on chip device HBM2 or the global L2 cache which is shared and coherent across all SMs. Although the global memory allows individual SMs to communicate with each other, accesses require hundreds of cycles, which introduce further latencies when compared to local SM L1 memory caches. Ideally, the SMs execute independently of one another and accumulate answers in global memory, skipping the high memory latency accesses of coordinating synchronous work.

## 2.4 GPU Programming using the CUDA API

NVIDIA GPUs are designed as computational accelerators for a host system, which manages and schedules tasks using the CUDA programming model. In this model, the GPU acts as an independent processor with its own memory and execution pipelines. For tasks to be scheduled on the GPU, the host process must first launch it through the CUDA API, an extension of C++ that enables CPU to communicate with and control the GPU.

Using CUDA, the host invokes device functions, called kernels, while specifying execution parameters such as the number of threads and memory configuration. These kernel launches are asynchronous, meaning that once the host issues the call, the GPU executes it independently, allowing the CPU to continue other work or synchronize with the GPU later as needed.

### 2.4.1 Kernel Launches

The task of launching and running device code begins from a kernel launch, which passes the function, its parameters, pointers, and the grid and block dimensions to

the GPU. The launch configuration specifies the number of CTAs and their logical dimensional organization. Every block is then mapped to a single SM and is constrained by that SM’s hardware resources, including the maximum number of threads, registers, resident warps, and available shared memory. If no available SM can meet these requirements, the kernel launch will fail. Conversely, if a CTA does not fully saturate the hardware resources of an SM, additional further blocks may be scheduled concurrently on the same SM.

#### 2.4.2 Host to GPU Memory Transfers and Bandwidth Considerations

Both the GPU and CPU maintain distinct memory regions to match their respective architectural and performance requirements. CPU memory is optimized for low latency access to efficiently handle serial and branch intensive workloads. GPU memory, on the other hand, is designed for high throughput, allowing thousands of threads to execute in parallel while tolerating higher latency per access.

Since the memory regions are distinct, the CPU must explicitly allocate and transfer data to the GPU using the CUDA API. Transfers require the CPU memory to be pinned in RAM, as CUDA cannot directly access the disk. While the CUDA runtime can automatically pin memory, automatically, host arrays allocated directly in pinned memory using `cudaMallocHost()` or `cudaHostAlloc()` eliminates this extra step and enables faster transfers. However, allocating excessive pinned memory reduces the RAM available to other CPU processes, potentially causing paging and degrading system performance. Careful management of pinned memory is therefore essential for maximizing data transfer efficiency without negatively impacting the host system.

In CUDA, understanding how memory is transferred and managed across the host and the device is crucial for optimizing performance. The device memory consists of a global 16 GB block on the Tesla V100 architecture as well as on chip processor memory. To improve memory performance, the programmer can leverage three different models of memory management `__constant__`, `__device__`, and `__shared__` memory. Both constant and device memory are allocated to the global memory, with constant memory only being writeable by the CPU and allowing faster access times due to the reduced coherency required. The shared memory is shared among all warps and threads of a given SM in the L1 data and shared memory cache.

#### 2.4.3 Memory Coalescing

For the GPU to coalesce memory operations and enable SIMD-style execution across multiple data points, each thread maintains registers that store its execution context, including its position within the kernel and executing CTA. In PTX, these special

registers provide unique identifiers such as threadIdx and blockIdx, which indicate a thread's coordinates within its block and a block's coordinates within the grid. These identifiers are essential for structuring parallel computations and optimizing memory access patterns. For example, when threads within a warp access consecutive memory locations, the hardware can coalesce those accesses into a single memory transaction, significantly improving throughput.

#### 2.4.4 Example Kernel Launch

Consider the following example program, which allocates device memory and launches a kernel consisting of one block and 32 threads.

```
1 __global__ void increment(float *x) {
2     x[threadIdx.x] += 1.0f;
3 }
4
5 int main() {
6     const int N = 1024;
7     float h_x[N];
8     for (int i = 0; i < N; ++i)
9         h_x[i] = i * 1.0f;
10
11    float *d_x;
12    cudaMalloc((void**)&d_x, N * sizeof(float));
13    cudaMemcpy(d_x, h_x, N * sizeof(float), cudaMemcpyHostToDevice);
14
15    increment<<<1, 32>>>(d_x);
16
17    cudaMemcpy(h_x, d_x, N * sizeof(float), cudaMemcpyDeviceToHost);
18    cudaFree(d_x);
19    return 0;
20 }
```

Listing 2.1: Simple CUDA Kernel

The code block above depicts the launching and execution of a simple GPU kernel as well as the memory allocation scheme used for executing kernel code. The kernel itself is the execution of the GPU device program denoted by the `__global__` function, while the `<<<_, _>>>` syntax enables the programmer to specifically partition their execution tasks across waiting threads. The values in `<1,32>>` specify the grid and block dimensions, determining how many blocks are launched and how many threads execute per block. These dimensions can be given as integers or as a `dim3` struct for 1D, 2D, or 3D layouts. In particular, this code allocates a singular block with an array

of 32 threads, which completely saturates a singular warp. These dimensional vectors allow the different threads to maintain lockstep execution, while processing different sections of the same array. This is done by using the dimensional properties assigned to the individual threads by the runtime system.

The main function, executed by the CPU or host, initializes the parameters, executes the kernel and then copies the memory back. The host array, `h_x` is allocated to the CPU stack, which is explicitly copied to the GPU. Trying to pass the array by value, something common in C++ code, seems at first the most simple; however, poses two separate issues. Firstly, when passing arrays as parameters, they decay to pointers, which CUDA forbids, as the pointer address passed to the device does not have any meaning on the GPU. Secondly, if the array were wrapped in a struct and passed to the function to circumvent the first issue, the array would be allocated to every single thread independently. In the example above, the array would be allocated 32 times, each independent from one another, taking up further memory bandwidth and both on chip and global device memory. In this case, each individual thread, would get the array passed by value, leading to a total  $32 \text{ threads} * 1024 \text{ floats} * 4 \text{ bytes per float}$  or 128KiB. Instead, the memory is allocated in the device memory, transferred once and each thread receives only the device pointer `d_x`, which can be used to copy the results back to the host.

#### 2.4.5 CUDA Streams

CUDA API calls are queued to the GPU using cuda streams, which enforce the execution order of tasks. A stream, represented by the type `cudaStream_t` acts as a handle to a specific command queue, similar to how a Linux file pointer in Linux refers to a particular file descriptor. Within a stream, operations such as kernel launches, memory copies, and memory set operations are enqueued and executed strictly in the order they are issued, ensuring deterministic behaviour. Commands submitted to the same stream are executed sequentially in the order they were issued, ensuring deterministic behavior within that stream. When using multiple streams, the GPU can execute operations concurrently, allowing kernels and memory operations to overlap. By carefully managing streams, developers can expose task parallelism, reduce idle time, and more effectively use GPU resource.

The Tesla V100 GPU has two separate hardware copy engines for copying data from the host to the device and back. The copy engines support the transfer in both directions, with one engine specifically being allocated for the unidirectional D2H transfer and the other for H2D. Using only one stream for multiple kernels fails to maximize the device memory bandwidth. For example, consider the launch of two independent kernels, kernel A and kernel B, each on the same CUDA stream. Both

A and B, allocate and copy memory onto the device, schedule their kernels and then copy the results back. Regardless of the ordering of the API calls, the two kernels cannot execute concurrently or share the hardware copy engines, since their execution is serialized by being placed in CUDA stream. To maximize hardware utilization and ensure correctness, each kernel and its associated memory transfers should be placed in the same dedicated stream, allowing the CUDA runtime to safely overlap kernel execution, memory copies, and bidirectional transfers when possible.

## 2.5 GPU Programming Models for Real Time Systems

In order to meet urgent deadlines, systems need to prioritize and ensure the timely execution of critical tasks. Prioritizing execution requires that high priority tasks are scheduled first when resources are available, and that their deadlines are still met even when resources are fully occupied. Achieving this responsiveness, requires preemption or context switching between resident tasks and scheduled critical tasks. In Apollo, this responsiveness is implemented through coroutines, which cooperatively yield execution to enable timely task switching. Coroutines are particularly well suited to GPU scheduling, since GPUs lack integrated hardware preemption and therefore depend on cooperative mechanisms to maintain responsiveness.

### 2.5.1 Coroutines

Coroutines are a form of asynchronous programming that enables cooperative multitasking between functions. Unlike traditional thread or process switches, which can occur at any time and requires a larger context, coroutine switches happen only at programmer defined suspension points. This makes them particularly useful for enabling runtime kernel task switching on the GPU. Here, execution of a kernel can be suspended to allow another kernel to run, and then later resumed without blocking other work.

A coroutine suspends execution by capturing its current context, known as the continuation, which contains the execution state needed for later resumption [4]. Once suspended, another task can execute without overwriting or disrupting the saved kernel state. When the interim task completes, the coroutine can continue exactly where it left off by restoring its continuation. This ability to strategically pause and resume execution makes coroutines well suited for real time workloads, where rapid switching between concurrent tasks can help meet hard deadlines without delay.

## CPU Coroutines in Apollo

Coroutines on the CPU are implemented using the native x86 calling conventions and the program stack to dynamically save and restore continuations. Under x86 convention, when a function is invoked, the CPU pushes the return address, the location after the next instruction after the call, onto the stack before jumping to the target function. The called function accesses its local variables through the stack and registers, which are divided into two categories, volatile, caller saved and non volatile, callee saved. Volatile registers may be freely modified by the callee, while callee saved registers must be preserved and restored before the function returns.

When execution reaches a return instruction, the CPU pops the return address off the stack and continues execution from that point. To implement coroutine behaviour, however, the function must be able to yield and later resume. This requires saving the callee saved registers, since their preservation is guaranteed, as well as any additional state such as local variables or registers, that is necessary for continued execution. These elements can be stored on the stack, enabling a coroutine to pause and later continue seamlessly.

The Apollo project demonstrates this mechanism with CPU coroutines implemented directly on top of the x86 calling convention, as illustrated in the following example.

```
1 ctx_swap:  
2     pushq %rdi  
3     pushq %r12  
4     pushq %r13  
5     pushq %r14  
6     pushq %r15  
7     pushq %rbx  
8     pushq %rbp  
9     movq %rsp, (%rdi)  
10  
11    movq (%rsi), %rsp  
12    popq %rbp  
13    popq %rbx  
14    popq %r15  
15    popq %r14  
16    popq %r13  
17    popq %r12  
18    popq %rdi  
19    ret
```

Listing 2.2: CPU Coroutine

The CPU coroutine implementation uses the context switching function, `ctx_swap`,

which captures the current continuation and restores the state of another coroutine. This function accepts two parameters, `%rdi` and `%rsi`, which point to the memory locations used to store and retrieve coroutine continuations. Execution happens in two stages. In the first stage, the continuation is stored and its address is saved to the register `%rdi`. The second stage then moves the stack frame to the location of its continuation, using the register `%rsi`, before restoring the new coroutine context.

Each stage loads or saves their registers in a specific defined order. In particular, to ensure the memory is reread into the correct registers, the order of pushing elements onto the stack is the reverse order of popping elements. As the stack works on a last in first out principle, this ensures that all the registers have the correct values upon resumption. In comparison to traditional context switches involving processes or threads, coroutine context switching via `ctx_swap` operates with significantly lower overhead, resulting in faster execution.

### 2.5.2 GPU Coroutines

Unlike CPU coroutines, which can implement context switching by saving and restoring stack frames, GPUs cannot use the same mechanism, as GPU threads do not maintain conventional stack frames. Instead, their execution state is managed differently, requiring specialized approaches to preserve and resume coroutine execution.

Attempting to implement CPU style coroutines using the `ctx_swap` function does not work because GPUs handle function calls differently. CPUs stores a call stack of function stack frames, which can be easily accessed and manipulated by pushing and popping values or addresses. GPUs, in contrast, avoids traditional stack frames by aggressively inlining function calls, reducing overhead since function code is readily available and no stack frame is needed.

Implementing and manipulating deep call stacks on GPUs is largely handled by the compiler and hardware, leaving very little control to the programmer, which makes such approaches impractical. For example, CUDA originally did not support recursive functions, which were only introduced later via the `-rdc=true` flag. Even then, recursion comes with strict limitations on stack size and adds significant performance overhead. Attempting to rely on deep call stacks can quickly lead to excessive instruction and register usage, illustrating why GPU architectures and compilers discourage CPU style stack manipulation.

#### Persistent Threads to support Coroutines

Given that call stacks are not effectively or efficiently supported in CUDA, the GPU needs to support a user level runtime scheduling mechanism to manage the contexts

dynamically. Due to the nature of kernels executing until completion, the GPU coroutine scheduler needs to be based on a persistent kernel, which can schedule coroutines throughout the lifetime of the application. The coroutines running on the persistent threads will then define suspension points to manually give control back to the scheduler in order to switch tasks. Because storing every register for all threads across coroutines is too expensive and inaccessible, coroutine contexts must be explicitly synthesized and saved in global memory, freeing the limited SM resources for new tasks.

### 2.5.3 Persistent Kernels

Persistent GPU threads give the programmer enhanced control over hardware scheduling by running on top of the hardware scheduler, enabling manual execution decisions that are otherwise unavailable. In addition to this increased flexibility, persistent kernels reduce runtime scheduling overhead. As the hardware resources are preallocated, thread and block configurations are loaded ahead of execution, the runtime avoids repeatedly configuring these settings, allowing kernels to run more efficiently. In systems with recurring or periodic tasks, such as autonomous driving, this overhead becomes particularly costly. With the implementation of persistent threads, only input and output buffers need to be updated for each new task, minimizing execution overhead and allowing kernels to operate with only the essential arguments required for computation.

## 3 Related Work

The increasing use of GPUs in real time systems has led to the development of custom programming models designed to reduce kernel launch overhead and improve predictable GPU scheduling. Such models are particularly useful in systems where minimizing execution latency, improving resource utilization, and maintaining timing determinism is important. These approaches can be broadly categorized into compiler driven frameworks, runtime scheduling frameworks, and manual implementations, the latter of which is the focus of this work.

### 3.1 Compiler Driven Frameworks

Compiler driven frameworks optimize GPU workloads through automated code generation. The user writes codes in a high level DSL that abstracts low level device details. At runtime, the framework parses the DSL into an AST, which is then converted into an IR. From this intermediate representation, the compiler applies transformations and optimizations before compiling the code JIT into GPU executable device code.

Several projects demonstrate how these frameworks are applied in practice. *Mirage*, for example, improves large language inference by fusing kernels into a single megakernel. This achieves the same effect as persistent threads in removing extra kernel task scheduling overhead. Similarly, *Halide* separates algorithm specification from execution schedule, allowing compiler autoschedulers to optimize performance with minimal manual tuning. Finally, *Luisa* targets graphics and simulation workloads, offering acceleration structures, ray traversal APIs and shader abstraction to provide high level rendering code while retaining low level performance.

*Luisa*, in particular, offers a model that can be adapted for real time applications. Built on top of *Luisa*'s execution model, *LuisaCompute-Coroutines* extends the framework to support coroutines running on persistent threads. Coroutines are expressed simply within the DSL as explicit suspension points, which allow the device to preserve context and yield, enabling fine grained scheduling. This approach is especially efficient and well suited for real time systems, as it leverages asynchronous coroutine execution to improve responsiveness and resource utilization.

### **3.2 Runtime Scheduling Frameworks**

Beyond compiler based transformation and code generation, runtime based frameworks offer an alternative approach by enabling real time GPU scheduling. For example, *RT-GPU*, a runtime system, provides deadline aware scheduling of GPU workloads by partitioning GPU resources. Using a reservation based model, RT-GPU enables fine grained control over scheduling to ensure the task deadlines are met. Additionally, *ROSGM* is a GPU management framework designed specifically for ROS 2 robotics systems. ROSGM interposes a layer to intercept the CUDA API calls and attach metadata to each GPU task, allowing custom deadlines and priorities to manage how GPU tasks are queued and issued to the device.

Overall, these frameworks demonstrate different approaches to achieving predictable GPU execution. Each of these frameworks relies on a form of high level user abstraction to manage scheduling and low level execution. While compiler and runtime systems simplify development through abstraction and automation, they do not allow for highly application specific fine tuning. Manual implementations, in contrast, enable this level of control and form the basis for the fine grained GPU scheduling explored in this thesis.

# 4 System Design and Implementation

This thesis initially aimed to bring GPU coroutine support to the open source autonomous driving platform Apollo, by implementing the LuisaCompute-Coroutines framework discussed in Chapter 3. The goal was to extend Apollo’s existing CPU coroutine infrastructure to GPUs, providing fine-grained, predictable scheduling for real-time tasks. However, due to integration challenges and the constraints of the project timeline, directly implementing LuisaCompute-Coroutines proved infeasible. Consequently, the scope shifted toward a manual implementation: a persistent thread scheduler for GPUs that lays the groundwork for coroutine support while still enabling fine-grained control over GPU execution. This section presents the design and implementation of that scheduler and the foundation it provides for future coroutine integration.

## 4.1 Platform Integration: GPU Scheduling in Apollo

This thesis initially explored integrating LuisaCompute-Coroutines into Apollo, aiming to provide GPU coroutines to complement the existing CyberRT CPU coroutines. Similar to the CPU coroutines, these GPU coroutines were intended to deliver predictable execution latencies, with the added latency improvements that GPU persistent threads offer. However, due to several integration barriers, including foreign build systems, sparse documentation, time constraints, and a limited familiarity with both projects, GPUs, and compiler theory, it became clear that integrating this system into Apollo would not be feasible within the scope of this thesis.

With the LuisaCompute’s coroutines approach impractical, the focus shifted to implementing manual GPU scheduling functionality. Finding no suitable open source implementation for coroutines, this thesis instead turned to persistent threads. Most existing persistent thread implementations are highly application specific and not readily available, which meant that adapting any implementation required substantial work. The only suitable implementation found was LightKer, a research project designed to measure the speedup of using persistent threads compared to sequential kernel launches. To implement LightKer into a real time system, the project required a complete restructuring of the codebase. Although full integration into Apollo and

support for coroutines remains unfinished, this work extends the real time capabilities of LightKer and provides a foundation for future GPU coroutine integration.

The LightKer implementation was primarily designed to measure the performance difference between sequential kernel executions and a persistent kernel implementation. It constructed simple, trivial kernels and compared the overhead of explicit host launched kernels versus implicit execution within the persistent kernel. However, the framework does not support runtime features necessary for a fully functional system. Nonetheless, it provides host device synchronization through a mailbox system and establishes a fundamental foundation for launching persistent kernels, which proved useful in enabling real time support for the system.

#### **4.1.1 System Design**

At a high level, there are four important components: the task queue, memory buffers, stream design, and gpu block synchronization. The task queue is managed, controlled, and allocated by the host and provides the arguments and tasks for the GPU to execute. The implementation of the task queue gives the programmer fine-tuned implementation opportunities to manually design and schedule workloads. The memory buffers provide an epoch based staging area for input arguments and output results as well as persistent memory for the further extension of coroutines.

### **4.2 Task Management System**

For persistent threads to execute kernels at runtime, a staging mechanism is required to enqueue tasks and maintain the task context necessary for execution. As discussed in Chapter 2, any task management system that relies on device side scheduling logic is inherently inefficient due to characteristics of GPU architecture. Therefore, the task management system in this design adopts a host driven scheduling model, where tasks are prepared and dispatched from the CPU, while persistent threads on the GPU handle execution. This approach enables better control over task ordering, reduces idle time, and improves overall predictability, essential for real time workloads. Furthermore, this system should be compatible with coroutine based execution and priority scheduling in the future.

#### 4.2.1 Task Queue Design



Figure 4.1: GPU Task Queue for Task Management

The architecture implemented into the persistent thread scheduler uses a loop through buffer, as a FIFO queue, in a producer-consumer architecture. As shown in Figure 4.1, each GPU task is represented as a clipboard, which contains the function context. The context contains both the information needed to execute the device function, as well as metadata for the control of the task within the queue. The host acts as a producer, enqueueing tasks into the task queue at the front, while the device consumes tasks from the back.

To improve the complex scheduling efficiency, all task enqueueing and dequeuing logic rests on the host side of the system. Enqueueing a task involves copying its context into the task queue and initializing the associated control variables of the task within the function context. Before execution, the device simply checks the control variables to ensure that only valid tasks are executed. Upon tasks being executed by a CTA, the device block will advance the back pointer to the next available task within the queue.

To ensure that tasks are not overwritten, the host tracks the total number of tasks currently in the system in relation to the total length of the task queue. Every time a new task is added to the queue, the tracker is incremented, unless the queue is full, which results in a failure response, similar to the native CUDA system. For the host to decrement the length tracker, the device must execute the task, notify the host, upon which the host can copy the results back. After the results are copied back, a new task can then be enqueued to the same position in the queue.

#### 4.2.2 Extensibility for Coroutines and Priority Scheduling

The function contexts stored in the task queue buffer were specifically designed with the future implementations of coroutines and priority scheduling in mind. By introducing additional control variables into the function context, the host system can assign tasks different priorities. Furthermore, the queue already provides space to store the continuation of the coroutine, allowing a task to yield and record the current instruction

for later resumption.

Currently the system executes tasks in a loop using a synchronized busy waiting scheme. However with slight modifications, it can be extended to support priorities. If every task additionally contained a priority or deadline in its function context, the persistent threads polling active tasks could select and execute the most critical tasks first. In this case, the host would need to actively track executing tasks to ensure they are not accidentally overwritten.

In conjunction with this system, the function context can also store the coroutine continuation. While precomputed values could be easily stored in the function context, capturing the resumption address is more challenging. On the GPU it would be necessary to define specific set points within the program and save a variable in the function context that indicates the next instruction to be executed within the kernel.

## 4.3 Function Context

Each task entry defines its execution context through an explicit function identifier and its associated function parameters. This entry acts exactly like a coroutine continuation, which stores the necessary state to resume execution within the function. The function id is used in conjunction with a lookup table to execute the GPU device code. The memory location for the parameters is allocated by the host as part of the enqueueing and memory management systems.

### 4.3.1 GPU Function Pointers

To execute new functions from the persistent threads, the task queue needs to be able to reference the specific function. Generally referencing functions on a CPU requires only the function pointer to execute the code defined at that memory location. When GPU functions are compiled, the device code lives in the GPU address space and is not accessible from the CPU. The CPU only has access to functions denoted by the `_global_` keyword, which allows the execution of GPU kernels, not enqueueing of GPU functions. In order to be able to access and run the functions specified by the CPU, the task queue supports a lookup table to map integers to specific functions. The lookup table allows the host to `memcpy` in function ids to the task queue when enqueueing new tasks.

### 4.3.2 Function Parameters

When the CPU assigns tasks to the GPU, it passes either allocated GPU memory pointers or explicit parameters. These explicit parameters then get propagated to all the

individual threads executing the kernel code, resulting in greater api memory overhead. When enqueueing new tasks to the task queue, the memory has to be transferred at runtime before the device function calls.

The GPU task in the queue originally had a pointer to the allocated memory and upon receiving compute resources would schedule the task with the memory to the individual persistent thread. Unfortunately, this method is dependent on the specific task and parameters and consumes variable memory requiring further pointers to GPU memory. In order to consolidate the memory pointers, the task queue was simplified to contain only allocated memory pointers in order to automatically load kernel memory.

In this method, enqueueing the GPU tasks forces the programmer to streamify the data and automatically load the memory into preallocated memory partitions. The task queue then only consists of the actual memory partition pointers, both start and end. Executing a task then requires the interpretation of the memory and then the loading of it into the device function. Should the input memory be oversubscribed, the task then has a preallocated buffer to store any updated context for the continuation of the coroutine.

## 4.4 Memory Management System

The memory management system supports the task queue, while eliminating unnecessary memory allocation and deallocation overheads. By managing memory centrally, the task queue's function contexts can remain simple and flexible. This memory system is managed for the lifetime of the persistent kernel, removing the need for costly dynamic memory operations at runtime. While maintaining a memory buffer is simple, designing a runtime strategy to partition memory among tasks and reclaim it efficiently presents a more complex challenge. To visualize this, Figure 4.2 shows the memory buffer logically partitioned into epochs, with each epoch containing input argument and output result buffers for the corresponding tasks.

### 4.4.1 Memory Buffer Design



Figure 4.2: Continuous Memory Buffer Logically Partitioned

The memory management system consists of a single, logically partitioned memory buffer, designed to reduce the complexity of allocation algorithms. Each epoch within this buffer contains both an input argument and output result section. The entire overhead of managing the position of task memory within the buffers is entirely managed by the host and shared with the device through the function context within the task queue.

As tasks are allocated within the epochs, eventually one of the corresponding memory buffer will overflow. When the host detects that allocating memory for a task either results in the input or output buffer overflowing, the host begins allocation in the next epoch. After fully allocating the memory for an individual buffer, that buffer remains untouched by the scheduler until the scheduler loops around the entire buffer queue and reaches the same epoch again. An epoch is free to schedule again when all results of tasks allocated within that epoch have been collected and marked as free.

The host manages these epochs, by tracking the number of tasks within each epoch and the current offset within the current epoch. To enqueue tasks input arguments are immediately loaded at the current offset within the current epoch. Tasks are only considered as free after the memory has been copied back to the host and the host marks the epoch memory as freed. When the buffer queue loops and returns, if the specific buffer parameters and memory size has been correctly set, potentially through profiling, the buffer will be free and can be reused.

#### 4.4.2 CUDA Stream Optimization

As discussed in Chapter 2, modern GPUs feature two independent memory transfer engines to support bidirectional data movement between host and device. One engine is dedicated to H2D transfers, while the other handles D2H transfers. These engines can operate concurrently, enabling overlap between data movement and computation if used correctly.

To fully exploit this hardware capability, input and output transfers must be carefully organized. If tasks are enqueued into a single CUDA stream, transfers and kernel executions are serialized, causing the GPU to wait unnecessarily and leaving one of the transfer engines underutilized. To avoid this, the implementation employs multiple CUDA streams, separating concerns between input staging, output collection, and kernel execution.

Specifically, input arguments are transferred from the host to the device using a dedicated H2D stream, while task results are copied back to the host through a separate D2H stream. This separation prevents intra-stream dependencies between input and output operations, ensuring that transfers in opposite directions do not block one another. This design enables continuous execution of persistent threads: while one

batch of tasks is being executed on the GPU, the next batch can be staged in device memory, and previously completed results can be copied back to the host.

By combining the persistent task queue with a dual-stream transfer strategy, the scheduler achieves efficient utilization of both memory transfer engines and GPU compute resources. The result is a pipeline where host-to-device transfers, device computation, and device-to-host transfers proceed in parallel, minimizing idle time and maximizing throughput.

## 4.5 GPU Block Synchronization

In order to utilize hardware efficiently, the persistent kernel launches multiple independent GPU blocks across the available SMs. Each block concurrently dequeues and executes task from the shared task queue. However, when the kernel consists of multiple blocks, concurrent access to the shared queue introduces the risk of interference between blocks. Without coordination, multiple CTAs may race for the same task, resulting in lost work, duplicated execution, or even corrupted input or output buffers.

The most critical source of contention is the global device task queue tail pointer `d_tail`, which identifies the next task to execute. Since all blocks of the persistent kernel on the device update this shared variable, race conditions may cause two blocks to claim the same task, while others may skip tasks entirely. To guarantee correctness, the dequeue operation must therefore be synchronized.

This synchronization is achieved using atomic device instructions, which enforce mutual exclusion when updating shared memory locations. The device function `dequeue` in the following block demonstrates the mechanism:

```

1  __device__ int dequeue(volatile mailbox_elem_t * from_device){
2
3      int old_d_tail = d_tail;
4      unsigned int next = (old_d_tail + 1) % WORK_QUEUE_LENGTH;
5      int terminate = 0;
6
7      if(threadIdx.x == 0 && threadIdx.y == 0) {
8
9          int prev_state = atomicCAS(&d_task_queue[old_d_tail].executing, 2, 1);
10
11         if (prev_state != 2){
12             terminate = 1;
13         }
14         else {
15             int updated_idx = atomicCAS(&d_tail, old_d_tail, next);
16         }
}

```

```

17
18 }
19 __syncthreads();
20 if(terminate) {
21     return terminate;
22 }

23 __syncthreads();
24 bool execution = execute(d_task_queue[old_d_tail]);
25
26 d_task_queue[old_d_tail].executing = 0;
27
28 DeviceWriteMyMailboxFrom(THREAD_FINISHED);
29 return 1;
30 }
31

```

Listing 4.1: Synchronized Block Execution of Tasks

The device function `dequeue` is responsible for ensuring the correct execution of tasks by individual CTAs within the shared task queue. Its primary role is to guarantee that each task is executed exactly once, and that no two thread blocks attempt to process the same task concurrently.

At the core of this mechanism lies the global queue pointer `d_tail`, which identifies the next task to be executed. Since `d_tail` is shared across all CTAs, it is constantly updated as blocks dequeue and complete tasks. To prevent inconsistencies caused by concurrent updates, the current value of `d_tail` is first copied into a block local variable `old_d_tail`. This snapshot ensures that the block works with a stable reference to the task index, even if other blocks advance the global pointer in parallel.

Once the local index has been secured, a designated thread within the block attempts to claim ownership of the task using an atomic compare and swap `atomicCAS`. If the operation succeeds, the block has exclusive rights to execute the task, and the global pointer `d_tail` is atomically advanced to the next task index. If the claim fails, it means another block has already taken the task, and the current block terminates early.

By following this procedure, the `dequeue` function ensures that:

- Each task is mapped to a single thread block only.
- No task is executed more than once.
- Updates to the shared queue pointer remain consistent across all CTAs.

Through the combined use of atomic operations and local snapshots of global state, the system maintains correctness even under highly concurrent execution across multiple streaming multiprocessors.

## 4.6 Architecture

The individual contributions explicitly discussed are joined together in the following design architecture below. The architectural aspects discuss previously are the main aspects of the work of this thesis, with additional tools and synchronization methods such as the mailboxes being used from the LightKernel project.



Figure 4.3: Persistent Thread Architecture implemented into LightKer

The Figure 4.3 depicts the individual components in a logical program architectural overview. The left side of the graphic shows the host side code and methods, which allow the enqueueing of tasks and launching of the persistent thread kernel. On the right hand side of the graphic, the actual device side code allocation and kernel execution is depicted which allows the processing of memory and write back of results to the memory buffers. The mailboxes are constantly enqueueing and dequeuing new tasks throughout the execution of the kernels.

## 4.7 Further Implementation Considerations

### 4.7.1 Serialization of Data for Memory Copies

Task memory written or read from memory first needs to be either serialized or deserialized, which requires manual GPU kernel wrappers. When a task is selected for execution, its parameters are deserialized from the input buffer, converted into an internal representation, and used to invoke the corresponding device function. After execution, the results of the computation are serialized back into the output buffer so they can be transferred to the host once the task is complete.

This serialization/deserialization process acts as a bridge between the host managed task queue and the device executed kernels. By keeping inputs and outputs in a raw, buffer-based format, the system avoids the overhead of allocating separate memory for each task and instead reuses the preallocated memory partitions. At the same time, serialization ensures that heterogeneous tasks with different argument types can be uniformly stored and scheduled through the same queue mechanism.

In practice, each task type requires a lightweight wrapper kernel responsible for unpacking its arguments, invoking the correct device function, and packing the results back into the buffer. While this introduces some additional development effort, it makes the overall system highly extensible: new task types can be supported simply by defining the corresponding wrapper without modifying the scheduler or memory manager.

### 4.7.2 Variable Launch Configurations

One of the weaknesses of persistent threads is the inability to change the launch configurations. Standard GPU kernel launches specify the thread configuration and grid layout of GPU threads executing the code and their physical placement in the architecture. The GPU automatically decides the execution placement of the kernels from the Gigathread Engine during the launch of GPU code from the host. As the persistent threads are already launched at program start, the configuration remains the same throughout the lifetime of the persistent thread. Therefore these persistent threads can not support variable launch configurations at runtime without terminating the kernel and restarting a new kernel with different launch configurations. However, multiple different persistent kernels can be started with various kernel launch configurations, each with different task queues, or through code refactoring the kernels can be adapted to the existing GPU thread block organization.

# 5 Experiments and Evaluation

This chapter evaluates the proposed GPU persistent thread model with respect to the objectives of this work:

1. Reducing scheduling latency by minimizing kernel overhead.
2. Establishing a baseline platform for future research in real time GPU scheduling, serving as a foundation for a coroutine based approach.
3. Enhancing understanding of real time GPU scheduling, CUDA architecture, and GPU execution, programming, and scheduling models.

The implementation introduces a GPU persistent thread model designed as both a building block for future scheduling research and as a means to improve the efficiency of executing multiple tasks on the GPU by reducing kernel launch overhead. This persistent GPU model allows for application specific fine tuning with particular focus on hardware resource management. In this model, the persistent kernel executes throughout the lifetime of the application and dynamically processes incoming tasks streamed to the GPU, avoiding repeated kernel launches.

The evaluation focuses on measuring the model's effectiveness in meeting the stated objectives. In particular, it examines scheduling latency, kernel overhead, and suitability as a baseline for real time programming models such as coroutines. The results highlight both the benefits of the approach and its limitations, especially regarding task variability and scaling across the GPU.

## 5.1 Experimental Setup

In order to evaluate the effectiveness of persistent threads in reducing the scheduling latency, the solution was tested on a matrix multiplication benchmark against a baseline implementation. To test the persistent threaded implementation, a simple matrix multiplication task was repeatedly scheduled on the persistent thread implementation and compared with a simple kernel invocation code. The persistent kernel was then tested with a number of varying parameters, including memory overhead, GPU work, and number of persistent kernel blocks.

### 5.1.1 Matrix Multiplications for Performance Testing

The current execution model requires manual task wrappers for scheduling and executing GPU tasks, making it more complex to provide a varying range of different tasks. The manual task wrappers, designed to reduce the overhead of allocating GPU memory, are required to serialize and deserialize the allocated data during execution. To keep testing manageable, this thesis focuses on matrix multiplications as a representative example to demonstrate the solution's capabilities.

In particular, the selected matrix multiplications were of a predetermined size of 16x16, due to the constraints of the GPU persistent kernel. Normally, the programmer decides at kernel launch the configurations of threads within the kernel; however, this feature is not available for persistent threads. Persistent threads configurations are determined at launch and can not be reallocated throughout the duration of the task. For application based workloads, these values would need to be determined through profiling the workloads and manually implementing the values. In this case, matrix multiplications are significantly faster when the task has at least as many threads as the solution matrix values, which led this implementation to using a 16x16 persistent kernel launch configuration.

### 5.1.2 Testing Environment

Throughout the matrix multiplication performance testing, as the benchmark and the persistent kernel implementation both vary in application goals, a simulated work environment was designed. For the tests, the persistent kernel task schedules a number of tasks to the GPU and then immediately starts waiting to receive the results. As the tasks are being scheduled, the GPU already begins to execute enqueued tasks. After finishing individual tasks, the host is signaled with the message that the tasks are finished. In comparison, the matrix multiplication benchmark just executes tasks serially in an alloc, memcpy, kernel launch, memcpy loop.

In a real time persistent kernel implementation, the CPU needs to independently run tasks within the system and cannot synchronously wait on results to be streamed back. Rather, the CPU needs the results to be waiting when the it goes to check if the results have returned or not. Unfortunately, this model, when compared to any synchronous execution, will have waiting overhead. This is implemented within the test kernel as a busy waiting scheme; however, this increases the GPU traffic and slightly weakens the GPU implementation as it contests the same streams for data transfers.

### 5.1.3 Small Kernel and Small Memory Transfers



Figure 5.1: Small Kernel and Memory Transfer Tasks

Figure 5.1 shows the first experiment that was run comparing the task scheduling from the GPU persistent kernel in comparison to the baseline model. Here each implementation executes 32 tasks 16x16 matrix multiplication tasks scheduled to the GPU. The matrixes for testing are generated at runtime and compared to the correct CPU implementation using the same functions.

On average over 10,000 runs, the persistent kernel achieved an execution time of 189.165ms, while the baseline model required 229.604ms, demonstrating a clear performance benefit for this small kernel, small memory transfer case. However, the measurements exhibited high variability, with execution times differing by as much as  $\pm 50\text{ms}$  between runs, even averaged over 10000 different executions. Despite this noise, the persistent kernel consistently outperformed the baseline in the majority of trials.

### 5.1.4 Small Kernel and Larger Memory Transfers



Figure 5.2: Small Kernel and Larger Memory Transfer Tasks

In this secondary evaluation, the GPU executes the same task as before but now transfers an additional 1 MiB of memory to the task and returns it. Interestingly, the persistent thread implementation runs slightly faster than in the previous test, while the baseline control becomes slower. This unexpected result prompted further testing with a larger input size of 10 MiB, which completed in 623.915ms.

The observed increase in speed is likely due to the underlying memory management strategy. To conserve valuable GPU resources, the memory buffers are logically partitioned using offsets. When tasks use very small memory frames, their data ends up placed very close together in memory. During execution, tasks that read from memory locations adjacent to others simultaneously writing to nearby regions can cause contention and reduce performance. By increasing the memory size transferred per task, these memory regions are spaced farther apart, reducing contention and allowing the kernel to run more efficiently.

This separation applies to both input and output buffers. When transferring larger amounts of data (e.g., 1 MiB or more), the serialized GPU memory transfers no longer interfere with kernel execution, leading to the improved performance observed.

### 5.1.5 Multiple Kernels

## 5.2 Profiling and Analysis with nsys

To better understand the execution characteristics of the implementation, nsys profiling was performed for both the baseline (non-persistent) and persistent-threaded

approaches. In the baseline version, the profiler output is easy to interpret: each kernel launch is explicitly visible on the timeline, interleaved with host-device memory transfers. This makes it straightforward to determine where computation occurs, identify launch overheads, and correlate them with memory transfer costs.

In contrast, profiling the persistent-threaded implementation presents significant challenges. Since the persistent kernel is launched only once and remains active for the duration of execution, there are no distinct kernel launch entries on the timeline for each individual task. Instead, the majority of the visualized activity in nsys consists of frequent host-to-device and device-to-host memory transfers corresponding to the queuing and completion of tasks. The actual execution of the tasks within the persistent kernel is effectively "hidden" from the profiler's high-level view, as it takes place inside the single long-running kernel.

This difference means that the persistent-threaded execution does not lend itself well to the same form of visual, task-by-task inspection available in the baseline case. While memory transfer patterns can still be analyzed, the lack of discrete kernel events makes it difficult to directly measure per-task execution time or to distinguish between overlapping computation and data movement. To obtain deeper insights, low-level instrumentation or custom device-side logging would be required, as standard profiling tools are optimized for discrete kernel launches rather than continuous, event-driven GPU execution.

Consider the evaluation of the current execution.



Figure 5.3: Simple nsys timelines for the baseline implementations.

### 5.3 Coroutine Implementation

While the primary objective of this work was the implementation of GPU persistent threads, it became evident that the same architectural foundations could be extended to support coroutine functionality with relatively modest additional effort. The existing design already incorporates a task queue, which the GPU processes in sequence, and a dedicated memory address space for function contexts, making it possible to store and retrieve the continuation state of executing tasks. These features naturally lend themselves to coroutine-style execution, where tasks can yield control and later resume from the same execution point.

### 5.3.1 Yielding and Changing Tasks

In essence, a coroutine yields its execution by voluntarily releasing its currently allocated compute resources, allowing other tasks to make progress. Within the present implementation, this can be modeled directly using the existing task queue mechanism. The task queue is implemented as a circular buffer, processing tasks in order from the head to the tail. When a task yields, it is effectively repositioned within this queue, relinquishing its slot so that another, potentially higher-priority, task can execute in its place. The most urgent pending task can then take over the vacated slot, ensuring minimal idle time for the GPU.

A more advanced extension would involve replacing the simple circular buffer with a *priority heap*. In such a structure, tasks are automatically sorted and selected for execution based on their priority levels, allowing the scheduler to make more informed decisions about task ordering. While the current system executes tasks to completion once they are selected, this model could still allow tasks to voluntarily halt and later resume execution, improving responsiveness for workloads with mixed task lengths.

### 5.3.2 Continuation and Resumption

Implementing coroutine behavior also requires explicit mechanisms for saving and restoring task state. When a task yields, the GPU must store sufficient continuation data so that it can later resume execution from precisely the point where it was interrupted. The current memory structure already provides a convenient, per-task memory region where such continuation data can be stored. This includes both the computational state and the control information necessary for resuming execution at the correct instruction address within the device function.

To enable this, each task must be preallocated with enough device memory during scheduling to hold both its input parameters and any continuation data that may be required. This is achieved by copying the task’s input stream into the designated memory region and advancing the memory offset to reserve additional space for storing the continuation state. When a task yields, the current execution context—such as register values, loop counters, and program counters—is written into this reserved space. Upon resumption, the scheduler retrieves this data and restores the task’s execution context, allowing it to continue seamlessly from its previous stopping point.

By combining these mechanisms with the persistent thread framework, the GPU could effectively execute multiple long-lived, cooperative tasks, enabling finer-grained scheduling and more efficient utilization of GPU resources, especially in irregular or latency-sensitive workloads.

## 5.4 Limitations and Future Work

The current system does not support priority based scheduling decisions or utilizing coroutines for software preemption like execution. Future work implementing the strategies outline by this thesis towards utilizing the persistent kernel implementation as a baseline for these tasks would enable further predictable real time scheduling capabilities. In this case, the memory management system might need to progress from an epoch based scheduler to a more complex management system. Low priority tasks would interrupt the freeing of epoch memory. Furthermore, the current manual streaming of input and output parameters restricts task variability and automation. Developing a generalized parameter serialization and deserialization layer that supports arbitrary kernel arguments would increase flexibility and reduce overhead in task preparation.

Further improvements could also include:

- Implementing task batching, where each worker processes multiple tasks before polling again, to reduce synchronization overhead.
- Optimizing the polling mechanism with adaptive backoff strategies to minimize resource consumption.
- Leveraging asynchronous memory copies with double buffering and pinned memory to better overlap data transfers and computation.
- Extending the evaluation to more complex kernels and real-world workloads beyond matrix multiplication to better characterize the benefits and limitations of persistent threads.

Overall, while the current implementation demonstrates the feasibility of persistent GPU threads, there remains substantial opportunity for optimization and scaling to fully leverage the advantages of this approach.

# Abbreviations

**GPU** Graphics Processing Unit

**CPU** central processing unit

**DSL** domain specific language

**GPC** Graphics Processsing Cluster

**TPC** Texture Processing Cluster

**SM** Streaming Multiprocessor

**CNN** Convolutional Neural Network

**CTA** Cooperative Thread Array

**HBM2** High Bandwidth Memory

**D2H** Device to Host

**H2D** Host to Device

**JIT** just in time

**IR** intermediate representation

**AST** abstract syntax tree

# List of Figures

|     |                                                                                                             |    |
|-----|-------------------------------------------------------------------------------------------------------------|----|
| 2.1 | CPU vs GPU Thread Architecture . . . . .                                                                    | 7  |
| 2.2 | Single threaded Matrix Multiplication Execution between CPUs and GPUs averaged over 10 executions . . . . . | 8  |
| 2.3 | Data Matrix from Figure 2.2 . . . . .                                                                       | 8  |
| 2.4 | SM Processing Partition Architecture taken from the Volta Whitepages .                                      | 10 |
| 4.1 | GPU Task Queue for Task Management . . . . .                                                                | 24 |
| 4.2 | Continuous Memory Buffer Logically Partitioned . . . . .                                                    | 26 |
| 4.3 | Persistent Thread Architecture implemented into LightKer . . . . .                                          | 30 |
| 5.1 | Small Kernel and Memory Transfer Tasks . . . . .                                                            | 34 |
| 5.2 | Small Kernel and Larger Memory Transfer Tasks . . . . .                                                     | 35 |
| 5.3 | Simple nsys timelines for the baseline implementations. . . . .                                             | 36 |

## **List of Tables**

# Bibliography

- [1] K. Jo, J. Kim, D. Kim, C. Jang, and M. Sunwoo, "Development of autonomous car—part i: Distributed system architecture and development process," *IEEE Transactions on Industrial Electronics*, vol. 61, no. 12, pp. 7131–7140, 2014. doi: 10.1109/TIE.2014.2321342.
- [2] J. Sun, K. Duan, X. Li, N. Guan, Z. Guo, Q. Deng, and G. Tan, "Real-time scheduling of autonomous driving system with guaranteed timing correctness," in *2023 IEEE 29th Real-Time and Embedded Technology and Applications Symposium (RTAS)*, 2023, pp. 185–197. doi: 10.1109/RTAS58335.2023.00022.
- [3] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn, and T. J. Purcell, "A survey of general-purpose computation on graphics hardware," in *Comput. Graph. Forum*, vol. 26, no. 1, pp. 80–113, Mar. 2007.
- [4] S. Zheng, Z. Zhou, X. Chen, D. Yan, C. Zhang, Y. Geng, Y. Gu, and K. Xu, "Luisarender: A high-performance rendering framework with layered and unified interfaces on stream architectures," *ACM Trans. Graph.*, vol. 41, no. 6, Nov. 2022, ISSN: 0730-0301. doi: 10.1145/3550454.3555463. [Online]. Available: <https://doi.org/10.1145/3550454.3555463>.