

# 1 Modular GPU Programming with Typed Perspectives

## 2 ANONYMOUS AUTHOR(S)

3 To achieve peak performance on modern GPUs, one must balance two frames of mind: issuing instructions to  
4 individual threads to control their behavior, while simultaneously tracking the convergence of many threads  
5 acting in concert to perform *collective operations* like Tensor Core instructions. The tension between these two  
6 mindsets makes modular programming error prone. Functions that encapsulate collective operations, despite  
7 being called per-thread, must be executed cooperatively by groups of threads.

8 In this work, we introduce Prism, a new GPU language that restores modularity while still giving programmers  
9 the low-level control over collective operations necessary for high performance. Our core idea is *typed perspectives*,  
10 which materialize, at the type level, the granularity at which the programmer is controlling the behavior  
11 of threads. We describe the design of Prism, implement a compiler for it, and lay its theoretical foundations  
12 in a core calculus called Bundl. We implement state-of-the-art GPU kernels in Prism and find that it offers  
13 programmers the safety guarantees needed to confidently write modular code without sacrificing performance.

## 15 1 Introduction

16 CUDA [39] is a low-level, imperative programming language for NVIDIA GPUs. These GPUs are  
17 organized into a hierarchy of compute resources. *Threads* are the basic unit of sequential execution,  
18 *blocks* are groups of threads that can cooperate through shared scratchpad memory, and the *grid* is  
19 the full collection of blocks launched for a computation. A GPU program executes in parallel across  
20 this hierarchy, but is written *from the perspective of a single thread*.

21 While operations are specified per-thread, some are only valid when executed *collectively* by a  
22 group of threads. The `__syncthreads()` intrinsic, for instance, synchronizes all threads within a  
23 block, and it will cause a deadlock if executed by only some of those threads. There are many such  
24 collective intrinsics, including Tensor Core instructions [41–43], warp shuffle operations [22], and  
25 other kinds of synchronization primitives [21]. These operations require programmers to carefully  
26 marshal compute resources to coordinate which threads execute which lines of code. As a result,  
27 such collective operations break the illusion of threads executing independently.

28 The conceptual clash between regular statements (executed by a single thread) and collective  
29 operations (executed by a group of threads) impacts not only hardware intrinsics, but also *user-defined*  
30 collective operations, like functions. While functions are invoked individually by threads, they may  
31 encapsulate collective behavior, creating a contradiction between the per-thread syntax of their  
32 invocation and the cooperative semantics of their execution.

33 This contradiction puts modularity at odds with correctness and is apparent even in widely used  
34 CUDA libraries that package common functionality via function interfaces [13, 38, 51]. Consider the  
35 following snippet of documentation taken directly from CUB [38], a library of parallel primitives.  
36 It describes the `BlockReduce` function [11], in which all threads in a block collaboratively apply a  
37 reduction operation, such as a maximum or prefix sum, over an array:

38 Computes a block-wide reduction for `thread0` using the specified binary reduction functor.

- 40 • The return value is undefined in threads other than `thread0`.
- 41 • A subsequent `__syncthreads()` threadblock barrier should be invoked after calling this method if the collective's  
42 temporary storage (e.g., `temp_storage`) is to be reused or repurposed.

43 The snippet above attempts to convey several assumptions implicit in `BlockReduce`'s implemen-  
44 tation. First, the function is only well defined when invoked by all threads in a block. Second, because  
45 it accesses memory shared among threads, any subsequent reuse of that storage requires synchro-  
46 nization to ensure all threads have completed their accesses. In effect, CUB attempts to retrofit CUDA  
47 with information about the compute and memory requirements of collective operations. By prefixing  
48 functions with identifiers such as `Block`, CUB creates ad-hoc “namespaces” for different functions

50 that assume similar invariants. However, without a type system to statically enforce these invariants,  
 51 correct usage of this function—and collective operations in general—depends on carefully reading  
 52 and interpreting the documentation.

53 In this work, we ask: can we provide low-level control over collective operations while statically  
 54 guaranteeing that they execute with the necessary compute resources? By reifying configurations  
 55 of compute resources as type-level *perspectives*—so named because they describe the view of GPU  
 56 resources against which each statement is written—we find that we can. Our key insight is that GPU  
 57 programmers naturally map computations onto different compute resources, and, unlike CUDA,  
 58 which obscures this mapping, we can track it with a type system. This tracking enforces that collective  
 59 operations are executed with the necessary resources, while still allowing users to access low-level  
 60 intrinsics. Further, it allows users to define and compose their own collective operations by specifying  
 61 the perspective with which their code must be run.

62 Our approach departs from previous work, which attempts to resolve the mismatch between  
 63 per-thread syntax and collective execution by restricting access to low-level operations. Tile-based  
 64 languages like Triton [54], Helion [44], and Tilos [26], limit the user’s perspective to the block  
 65 level, which prevents them from writing the highest-performance kernels. A variety of functional  
 66 languages, [7, 27, 32] provide compile-time guarantees through their type systems but do not expose  
 67 low-level control over hardware. Some other efforts, like Descend [35], aim to provide a low-level,  
 68 memory-safe GPU systems programming language, but lack support for modern GPU features (like  
 69 Tensor Cores or asynchrony) entirely. As a result, despite its flaws, CUDA remains the de-facto  
 70 standard for writing high-performance kernels on modern GPUs [23].

71 We introduce Prism, a new low-level GPU programming language that enforces that operations are  
 72 only executed with the correct view of hardware resources, enabling fearless composition. Inspired by  
 73 dependency calculi [1], Prism statically tracks the configurations of compute and memory resources  
 74 with type-level perspectives. We also develop Bundl, a core calculus underpinning Prism, provide  
 75 formal rules for manipulating the perspectives of both code and data, and prove *type-and-perspective*  
 76 *safety*. This ensures that Bundl is sound, and that code always has the right perspective to execute  
 77 operations at run time. A parallel goal, alongside safe composition, is performance. We incorporate  
 78 modern GPU features such as Tensor Cores and asynchronous data movement into Prism, and  
 79 demonstrate that Prism can achieve the same performance as hand-written, highly optimized code  
 80 on an H100 and a 4070 Ti Super. Our contributions are:

- 81     (1) Prism, a low-level GPU language that tracks the logical grouping of compute and memory  
       82       resources with type-level perspectives, empowering users to write modular code (Section 3);  
- 83     (2) Bundl, a formal model of Prism that tracks perspectives in its type system and operational  
       84       semantics, along with a soundness theorem guaranteeing that programs execute operations only  
       85       when they have been statically proven to possess the necessary perspective (Section 4); and
- 86     (3) An implementation of Prism (Section 5) that demonstrates that it can support modern GPU  
       87       features like Tensor Cores and asynchronous data movement, achieving performance com-  
       88       parable to hand-optimized CUDA implementations (Section 6).

90 Section 7 discusses related work, and Section 8 discusses the limitations of our approach and  
 91 outlines future work.

## 93 2 Background & Motivation

95 Before diving into the design of Prism, we begin with an overview of the GPU’s compute and memory  
 96 hierarchies and outline the challenges posed by reasoning about them collectively. We also discuss  
 97 how Prism can help solve these problems.

```

99  1 int tid = threadIdx.x;
100 2 if (tid >= 0 && tid < 32){
101 3   float A[4], B[2];
102 4   float C[4] = { 0 };
103 5   // Populate A with unique values.
104 6   for (int i = 0; i < 4; i++)
105 7     A[i] = tid * 4 + i;
106 8   // Populate B with unique values.
107 9   for (int i = 0; i < 2; i++)
108 10    B[i] = tid * 4 + i;
109 11   // Issue a warp-level Tensor Core
110 12   // operation: D = A * B + C
111 13   // (eliding some typecasts).
112 14   asm("mma.sync.aligned.m16n8k8..."
```

```

1  tid : int @ thread[1] = id();
2  with group(thread[32]):
3    A : float[4] @ thread[1]
4    B : float[2] @ thread[1]
5    C : float[4] @ thread[1]
6
7  # Populate A with unique values
8  for i in range(0, 4, 1):
9    A[i] = tid * 4 + 1
10   C[i] = 0
11  # Populate B with unique values
12  for i in range(0, 2, 1):
13    B[i] = tid * 4 + 1
14
15  # Issue Tensor Core op.
16  intrinsic.mma(
17    A[0], A[1], A[2], A[3],
18    B[0], B[1],
19    C[0], C[1], C[2], C[3],
20    out=[A[0], A[1], A[2], A[3]])
```

Fig. 1. Warp-level Tensor Core instruction in CUDA.

Fig. 2. Warp-level Tensor Core instruction in Prism.



Fig. 3. The `@` syntax represents the frequency at which a variable varies across threads T1–T8. Each `@` denotes a coloring of the variable; threads that “view” the variable with the same color must observe the same value.

## 2.1 Compute Hierarchy

In CUDA, programmers launch computations that run on thousands of threads. These threads are organized into a *compute hierarchy* that defines how work is distributed and scheduled on the GPU. At the top of this hierarchy is the *grid*, representing all threads launched as part of a single computation. The grid is divided into *blocks*, each containing a user-specified number of threads. *Threads* are the finest unit of execution and the machine’s basic unit of sequential control.

Users describe a computation by writing a single program which is replicated across all threads. This program controls the behavior of individual threads by reading built-in identifiers like `threadIdx.x`—to determine a thread’s position within a block—and `blockIdx.x`—to determine the block’s position within the grid—at runtime. By making control flow decisions based on these two variables, users can assign each thread to its share of the full computation.

A natural and tempting way to interpret these built-in variables is to think of them as the indices of implicit “parallel-for loops” surrounding the program, where each iteration executes simultaneously. While this view is sufficient to understand CUDA’s programming model when threads do independent work, it quickly breaks down in the presence of collective operations.

Unlike most instructions, which are executed by a single thread, collective operations must be executed collaboratively by a group of threads. For example, on line 14 of Figure 1, we perform a Tensor Core operation [43], which is only meaningful when invoked by a collection of 32 threads—a *warp*—acting together. For the call, each thread sets up its portion of the operands, and the operation is performed *once* for the entire warp, with the results scattered across participating threads. In this example, the first 32 threads in a block are tasked with this operation; programmers mentally group these 32 threads into a collective unit, and then reflect that grouping in the program via the `if`-statement on line 2. If the condition on line 2 were instead `tid >= 0 && tid < 30`, making fewer than 32 threads reach line 14, the result would be undefined. To make matters worse, the restriction is not only on the number of threads executing the operation, but also on their alignment. For threads to form a warp, the starting ID of the group must be aligned to a multiple of 32. So, if the condition on line 2 were instead `tid >= 1 && tid < 33`, the result would still be invalid, even though 32 threads would execute it.

Collective operations make reasoning about CUDA programs challenging because they force programmers to track the *convergence behavior* of threads. Specifically, programmers must reason both about how many threads reach each in the program and how those threads are arranged, accounting for alignment. This difficulty is further amplified by two factors. First, programs may have multiple points of convergence, requiring programmers to mentally track the relative ID of a thread within

148 a logical group as that group evolves over the course of a program’s execution. Second, threads may  
 149 be participating in multiple levels of convergence within the same program. In our example, we  
 150 only considered the warp-level Tensor Core operation, but there are other collective operations that  
 151 require convergence at different granularities like *warpgroup*-level Tensor Core operations—which  
 152 must be issued collectively by four warps—or the block-level `__syncthreads()` primitive—which  
 153 must be executed by all threads within a block.

154 Reasoning about collective operations is already error prone within the context of a single function,  
 155 but becomes even more difficult when reasoning across functions. A callee may assume a certain  
 156 configuration of threads and may structure its computation around that assumption. However, such  
 157 assumptions are not visible in the callee’s function interface, which only exposes the input and  
 158 output types. To invoke the function correctly, users must read its documentation—or worse, its  
 159 implementation—to understand its assumptions, breaking modular reasoning.

160 In Prism, we materialize the programmer’s mental grouping of the compute hierarchy explicitly  
 161 in the program’s source. Consider the example in Figure 2, which shows an equivalent rewrite of the  
 162 CUDA program in Figure 1 using Prism. The call to the `mma` is valid only because it is executed within a  
 163 **group** of 32 threads, made explicit on line 2. This intrinsic exposes its invariant via its function signature,  
 164 which we will discuss in detail in Section 3.6. Since Prism’s **group** construct enforces both the size  
 165 and alignment of participating threads, the validity of the `mma` operation is guaranteed by construction.

166 However, it is not sufficient to only consider the compute hierarchy when reasoning about convergence;  
 167 we must consider the memory hierarchy as well. Whenever threads see different data,  
 168 they can use it to induce divergent behavior. We’ve already seen an example of this: the `tid` variable  
 169 in Figure 1. In general, the possibility of divergence lurks anywhere threads branch on data.

## 170 2.2 Data and the Memory Hierarchy

171 Data on the GPU is organized into a *memory hierarchy*, mirroring the compute hierarchy. All threads  
 172 in a grid can read and write from *global memory*, where operands reside at the start of a computation  
 173 and where results are eventually written. Each block has access to a limited amount of *shared memory*,  
 174 a programmer-managed scratchpad typically used to stage repeatedly-used data. Finally, each thread  
 175 maintains its own state in *registers* and *local memory*. These are private to each thread, while shared  
 176 and global memory are accessible by multiple threads.

177 *Thread-Local Data.* Registers and local memory are owned by individual threads, so variables with  
 178 the same name can have different values on different threads at run time. In general, CUDA offers  
 179 no support for dealing with the divergence that arises as a consequence.

180 A key challenge for Prism is managing this divergence, ensuring that all threads organized by  
 181 other language constructs like **group** remain logically unified, even in the face of data-dependent  
 182 control flow. To do this, Prism tracks the *frequency* with which values vary in space. The `@` syntax  
 183 attached to each declaration denotes this frequency. For example, `thread[1]` variables can vary for  
 184 every thread, while `block[1]` variables can vary for every block, but not for threads within that block.  
 185 Intuitively, the `@` construct “colors” a variable across threads, and any threads that share a color are  
 186 required to agree on that variable’s value. Figure 3 illustrates valid and invalid colorings of a variable.

187 Using this information, Prism enforces rules for reading from and writing to variables. So, for  
 188 example, if we introduce a condition based on `tid` within the **group**’s scope in Figure 2, Prism will  
 189 reject that program; code that diverges at low frequency cannot read from variables that change at  
 190 higher frequency. We explain the rules of Prism programs in detail in Section 3.

191 *Thread-Shared Data.* Shared and global memory, on the other hand, is not replicated per-thread;  
 192 instead, all threads in a block have the same view into shared memory, while all threads on the grid  
 193 have the same view into global memory.

CUDA does not explicitly model shared and global memory spaces, nor does it track whether allocations in a given memory space exceed device limits. Prism, on the other hand, distinguishes these spaces and restricts shared memory to static allocations, throwing an error if an allocation exceeds device limits. We will discuss these aspects in detail in Section 3.4. For now, however, we focus on how views of memory converge and diverge in tandem with the compute hierarchy.

As an example, when launching a kernel on the GPU, a global memory pointer initially belongs to the grid because every thread sees the same pointer value at the start of the computation. To write to that memory, each thread computes an offset from the original base address. In doing so, the pointer effectively diverges across threads so each can write independently. After these writes, the view on the memory must reconverge, ensuring that all threads have completed their updates before it can return to its original logical owner.

Let us reconsider the Tensor Core example from Figure 1, this time initializing the operands of the Tensor Core operation from pointers into global memory. In Figure 4, **A** and **B** are now populated from **a\_mem** and **b\_mem**. After the Tensor Core operation completes, the result is written back to **c\_mem**, also in global memory. This variable **c\_mem** is logically accessed from two different levels of the compute hierarchy. At the beginning, each thread in a group of 32 sees the same value for **c\_mem**. Then, they each locate an offset within **c\_mem** that they write to (lines 20-23). To restore **c\_mem** back to its 32 thread-level ownership, all 32 threads must synchronize (line 25) to ensure all per-thread writes have completed. This is similar to the requirement we saw documented in CUB in Section 1.

Similarly to the compute hierarchy, CUDA programmers are responsible for tracking the logical owner of a view of memory as it evolves over the course of a program, and for ensuring that appropriate synchronization occurs.

Prism, by contrast, makes the evolution of memory’s view explicit in the syntax and automatically inserts the synchronization required to restore that view to its original owner. In Figure 5, we show how memory is lowered through the compute hierarchy for the same Tensor Core operation from Figure 4. The lowering of **c\_mem** is required in this program as Prism only permits writes from the view of a single thread. To lower **c\_mem**, we use Prism’s **partition** operation on line 28. The operation takes a pointer to partition, **c\_mem**, and lowers it to a single thread, assigning it a new name, **c\_thrd**; Prism will not allow use of the old variable **c\_mem** within the **partition**’s scope. The **partition** also takes an indexing function, and each time **c\_thrd** is accessed, this function is implicitly applied. Once the **partition**’s scope ends, Prism inserts a synchronization barrier before the next use of the original variable, so that all per-thread writes have completed. In this way, at the end of the **partition**, the original variable represents a convergent view of the data once again.

```

1 int x = threadIdx.x;
2 if (x >= 0 && x < 32){
3     float A[4], B[2];
4     float C[4] = { 0 };
5
6     // k is a_mem's stride.
7     A[0] = a_mem[(x/4)*k+(x%4)];
8     A[1] = a_mem[(x/4)*k+(x%4)+4];
9     A[2] = a_mem[(x/4+8)*k+(x%4)];
10    A[3] = a_mem[(x/4+8)*k+(x%4)+4];
11
12    // n is b_mem's stride.
13    B[0] = b_mem[(x%4)*n+(x/4)];
14    B[1] = b_mem[(x%4+4)*n+(x/4)];
15
16    asm("mma.sync...");
17
18    // Write back into c_mem.
19    // n is c_mem's stride.
20    c_mem[(x/4)*n+2*(x%4)] = C[0];
21    c_mem[(x/4)*n+2*(x%4)+1] = C[1];
22    c_mem[(x/4+8)*n+2*(x%4)] = C[2];
23    c_mem[(x/4+8)*n+2*(x%4)+1] = C[3];
24
25    __syncwarp();
26 }
```

Fig. 4. Invoking a warp-level Tensor Core instruction in CUDA after loading data from memory.

```

1 @prism("device")
2 @requires(thread[32])
3 def simple_mma(
4     a: ptr(const(float)) @ thread[32],
5     b: ptr(const(float)) @ thread[32],
6     c: ptr(float) @ thread[32]):
7     x : int @ thread[] = id()
8     with group[thread[32]]:
9         A : float[4] @ thread[1]
10        B : float[2] @ thread[1]
11        C : float[4] @ thread[1]
12        # Reads do not need to be lowered.
13        A[0] = a_mem[(x/4)*k+(x%4)];
14        A[1] = a_mem[(x/4)*k+(x%4)+4];
15        A[2] = a_mem[(x/4+8)*k+(x%4)];
16        A[3] = a_mem[(x/4+8)*k+(x%4)+4];
17        B[0] = b_mem[(x%4)*n+(x/4)];
18        B[1] = b_mem[(x%4+4)*n+(x/4)];
19        # Skipping initialize C to 0...
20        intrinsic.mma(
21            A[0], A[1], A[2], A[3],
22            B[0], B[1],
23            C[0], C[1], C[2], C[3]),
24            out=[C[0], C[1], C[2], C[3]])
25        # Must be at thread[1] to write.
26        idx = \
27            lambda ro, co: (x/4+ro)*n+2*(x%4)+co
28        with
29            partition(c_mem, p=thread[32], f=idx)
30            as c_thrd:
31                c_thrd[0, 0] = C[0]
32                c_thrd[0, 1] = C[1]
33                c_thrd[8, 0] = C[2]
34                c_thrd[8, 1] = C[3]
35        # --- Sync point ---
```

Fig. 5. Invoking a warp-level Tensor Core instruction in Prism after loading data from memory.

```

246 1 @prism("global")
247 2 # Top-level perspective bounds and shared memory usage.
248 3 #requires(grid[1], block[1], thread[32], smem=1280)
249 4 def mmaTF32NaiveKernel(A: ptr(const(float)) @ grid[1],
250 5                         B: ptr(const(float)) @ grid[1],
251 6                         C: ptr(float) @ grid[1],
252 7                         M: int @ grid[0],
253 8                         N: int @ grid[0],
254 9                         K: int @ grid[0]):
255 10
256 11     # Starts out with grid[1] perspective.
257 12     with group(grid[1]):
258 13         # @ grid[1] inferred from current perspective
259 14         # Each block computes an 16 x 8 tile
260 15         num_blocks_n : const(int) = (N + 8 - 1) / 8
261 16
262 17         # id() function returns the block id
263 18         # inferred from @block[1].
264 19         blk_row : const(int) @ block[1] = (id() / num_blocks_n) * 16
265 20         blk_col : const(int) @ block[1] = (id() % num_blocks_n) * 8
266 21
267 22         # Give each block an offset into C
268 23         offset = lambda x: blk_row * N + blk_col + x
269 24         with partition(C, p=block[1], f=offset) as C_blk:
270 25             with group(block[1]):
271 26                 # SHMEM declarations are only allowed
272 27                 # with a block[1] perspective.
273 28                 A_smem : shared(float[16 * 8]) @ block[1]
274 29                 B_smem : shared(float[8 * 8]) @ block[1]
275 30                 C_smem : shared(float[16 * 8]) @ block[1]
276 31                 # Now, id() returns the thread id
277 32                 idx = lambda x: x * 4
278 33                 # To write to C_smem, drop to thread[1] perspective
279 34                 with partition(C_smem, p=thread[1], f=idx) as C_th:
280 35                     for i in range(0, 4, 1):
281 36                         with group(thread[1]):
282 37                             C_th[i] = 0
283
284 38
285 39     # --- Sync point --- (backedge from for loop)
286 40     for j in range(0, 4, 1):
287 41         global_row : int @ thread[1] = blk_row + row
288 42         global_col : int @ thread[1] = i * 8 + col
289 43         with partition(A_smem, p=thread[1], f=..) as A_th:
290 44             with group(thread[1]):
291 45                 A_thrd[0] = A[global_row * K + global_col]
292
293 46
294 47     # --- Sync point --- (backedge from for loop)
295 48     for j in range(0, 2, 1):
296 49         # Similar to write into A_smem ...
297 50         with partition(B_smem, p=thread[1], f=..) as B_th:
298 51             with group(thread[1]):
299 52                 B_thrd[0] = B[global_row_b * N + global_col_b]
300
301 53
302 54     # Give each warp an offset into C_smem.
303 55     # --- Sync point --- (backedge from for loop)
304 56     with claim(C_smem, p=thread[32]) as Cs_warp:
305 57         match split(thread): # Masks off other threads
306 58             case 32:
307 59                 # Call function that performs a
308 60                 # Tensor Core instruction.
309 61                 simple_mma(A_smem, B_smem, Cs_warp)
310 62
311 63     # --- Sync point ---
312
313 64
314 65
315 66     for j in range(0, 4, 1):
316 67         flat_idx_c : int @ thread[1] = id() * 4 + j
317 68         row_c : int @ thread[1] = flat_idx_c / MMA_K
318 69         col_c : int @ thread[1] = flat_idx_c % MMA_K
320 70         idx = lambda x: row_c * N + col_c + x
321 71         with partition(C_blk, p=thread[1], f=idx) as C_th:
322 72             with group(thread[1]):
323 73                 C_th[0] = C_smem[row_c * MMA_N + col_c]
324
325 74 return

```

Fig. 6. Naive tensor float 32 matrix multiplication in Prism (full program can be found in Appendix B.1).

### 3 The Prism Language

Prism is an imperative, low-level language designed at a level of abstraction comparable to that of CUDA. Unlike CUDA, however, Prism’s syntax materializes the mapping of computations onto the compute and memory hierarchy explicitly in the program source. Using this information, Prism enforces that programs only execute collective operations with sufficient resources.

To guide our discussion about Prism’s design, we use the program in Figure 6 as a running example. It computes a matrix multiplication between two float arrays, **A** and **B**, to produce an output matrix **C**. In this program, each block computes an independent  $16 \times 8$  tile of the output. To do so, blocks first locate the tile index assigned to them (line 19–20). Next, threads in each block load corresponding rows and columns from **A** and **B** (line 41–53) into shared memory. Finally, the program invokes a warp-level, Tensor Core instruction to compute the output (line 62), requiring threads in a warp to converge. We encapsulate this Tensor Core instruction in a function, demonstrating how function composition works in Prism. This function is the same as the one in Figure 5.

### 3.1 Levels

Prism models the machine's compute hierarchy through *levels*. There are three levels in Prism—**grid**, **block** and **thread**—which are organized as expected: a **grid** consists of multiple **blocks**, each of which consist of multiple **threads**. Levels are ordered, with **thread < block < grid**.

There are two key differences between Prism’s levels and those in CUDA. First, Prism does not model CUDA’s three-dimensional grid or block structure. Second, there are two commonly-used “levels”, warp and warpgroup, absent from our hierarchy. On the hardware, the units of each level are arranged in a linear order, and the three-dimensional structures of CUDA are simply interpretations of this ordering, not distinct hardware resources. Similarly, warps and warpgroups are organizational constructs defined in terms of existing levels. Namely, a warp is a group of 32 threads whose first

295 thread ID is aligned to 32. A warpgroup, which was introduced with the release of the Hopper  
 296 architecture [8], consists of 4 consecutive warps.

297 Rather than baking these interpretations into Prism by adding new dimensions and levels, we let  
 298 users express multi-dimensional structures and define groupings of custom sizes.

### 299 3.2 Perspectives

300 Perspectives are the central concept of Prism, representing the view of the hierarchy from which  
 301 the programmer is defining the machine's behavior. Perspectives allow Prism to determine which  
 302 compute resources the programmer is controlling, whether they are available in the program's  
 303 context, and if those resources are sufficient for a given operation.

304 A perspective is a level—**grid**, **block**, or **thread**—paired with a static constant **n**, specifying the  
 305 number of units at that level. For example, **thread[2]** denotes a perspective of two threads, **block[4]**  
 306 denotes a perspective of four blocks, and so on. Perspectives also carry *alignment information*: a  
 307 perspective of size **n** is aligned to **n**. In this way, a warp is simply a desugaring of **thread[32]**, and a  
 308 warpgroup is a desugaring of **thread[128]**.

309 Finally, perspectives are partially ordered. We say that **level<sub>2</sub>[n<sub>2</sub>]** is *broader* than **level<sub>1</sub>[n<sub>1</sub>]**, or  
 310 that **level<sub>1</sub>[n<sub>1</sub>]** is *narrower* than **level<sub>2</sub>[n<sub>2</sub>]**, if either:

- 311     (1) **level<sub>1</sub>** < **level<sub>2</sub>**; or,  
 312     (2) **level<sub>1</sub>** = **level<sub>2</sub>** and **n<sub>1</sub>** divides **n<sub>2</sub>**.

### 314 3.3 Perspectives on Code

315 Code is associated with a set of perspectives, called a *perspective bound*, which corresponds to the  
 316 compute resources whose behavior it defines. At every point in the program, the perspective bound  
 317 for that point indicates which layer of the hierarchy is being programmed, and how that layer can be  
 318 destructed into narrower perspectives. For example, a line of code with a **{block[1], thread[4]}**  
 319 perspective bound tells Prism that the current line of code is being programmed at the **block[1]**  
 320 perspective and that the **block[1]** perspective can be destructed into a some number of **thread[4]**  
 321 perspectives. As a shorthand, when we refer to a code's perspective, we mean the broadest perspective  
 322 available in its perspective bound (in this example, **block[1]**).

323 Functions begin with a top-level perspective bound. In Figure 6, the perspective bound is defined  
 324 on line 2, using the notation **@requires(grid[n<sub>1</sub>], block[n<sub>2</sub>], thread[n<sub>3</sub>])**. Programmers can  
 325 then shape the program's current perspective using two constructs: **group** and **split**.

326 **Group**. The **group** construct lets programmers shift from a broader perspective to some number  
 327 of narrower perspectives contained in it. Operationally, the **group** construct does this by replicating  
 328 code written from the narrower perspective across the broader one. In effect, **group** forks many  
 329 copies of a narrower perspective.

330 Let's consider this in context of our example. In Figure 6, execution  
 331 begins at **grid[1]** on line 12. At that point, a programmer controls  
 332 the whole grid's behavior. To produce different output tiles, the  
 333 programmer shifts their perspective to **block[1]** on line 25. The  
 334 code within the **group(block[1])** defines the behavior of a single  
 335 block, and is replicated across all blocks in the grid.

336 Not all uses of **group** are valid: the examples in Figure 7 have no  
 337 meaning on the hardware. On line 4, the program tries to broaden its perspective to **block[1]** from  
 338 **thread[2]**, which is illegal. On the other hand, in the second example, the program tries to narrow its  
 339 perspective from **block[6]** to **block[5]**. While 5 is indeed less than 6, Prism cannot evenly replicate  
 340 **block[5]** inside a **block[6]** perspective, and so rejects this program. Recall, from Section 3.2, that  
 341 whether one perspective is narrower than another is dependent on divisibility, not just size.

```

1 # Example 1
2 with group(thread[2]):
3     # Illegal because block > thread.
4     with group(block[1]):
5         pass
6     # Example 2
7     with group(block[6]):
8         # Illegal because 6 % 5 != 0.
9         with group([block[5]]):
10            pass

```

Fig. 7. Illegal uses of **group**.

344 To eliminate such cases, Prism only allows an invocation of `group(level[n])` if the current  
 345 perspective bound contains a perspective broader than `level[n]`. Once `group(level[n])` is invoked,  
 346 it modifies the current perspective bound in two ways. First, it removes all perspectives broader than  
 347 `level[n]` from it. Second, it sets the broadest perspective within the `group` to be `level[n]`.

348 **Split.** Unlike `group`, which is used for replication into equally-sized,  
 349 narrower perspectives, `split` is used for sharding the current perspective  
 350 unequally. For example, Figure 8 shows a `split` from `thread[4]` into  
 351 one branch with `thread[2]` perspective and two with `thread[1]`. The  
 352 three arms of the `split` execute independently in parallel<sup>1</sup>, as a form of  
 353 unordered composition. When `split(level)` is invoked, the perspectives  
 354 of its branches diverge. At the end of the `split`, they reconverge,  
 355 and continue execution with the original perspective.

356 Use of `split` is necessary to write warp-specialized [4] code, a pro-  
 357 gramming pattern used in high-performance kernels. Another important  
 358 use of `split`—masking off threads—can be found in our running example.  
 359 Line 58 in Figure 6 shows a `split` that requests the first warp in the block,  
 360 narrowing from `block[1]` into a single `thread[32]`. This warp is later  
 361 used to execute a Tensor Core operation.

362 Because `split` corresponds to unordered composition, it must provide  
 363 each of its branches their requested perspectives simultaneously. Prism  
 364 thus checks that the sum of the perspectives requested by all branches  
 365 of the `split` can be satisfied. For example, the program in Figure 9 does  
 366 not type check. Finally, because perspectives enforce alignment, every  
 367 branch of the `split` must also be aligned; not all `splits` whose sizes are  
 368 at most the available units are valid. Figure 10 shows an example violating  
 369 this constraint: the second branch of the split is not aligned to 2.

370 Once `split` is invoked, for each branch that requests `n` units, all perspectives broader than  
 371 `level[n]` are removed from its perspective bound, and the available units for `level` are set to `n`.

### 373 3.4 Perspectives on Data

374 Section 3.2 described how programmers can control different layers of the hierarchy by changing  
 375 their perspectives on code through `group` and `split`. To ensure these operations remain meaningful,  
 376 Prism must ensure that threads inside a perspective remain logically grouped, even when they  
 377 encounter `for`, `while`, and `if` statements. As we saw in Section 2.2, making this guarantee requires  
 378 Prism to track how data varies across threads.

379 **3.4.1 Thread-Local Data.** In Prism, each local variable has a perspective, which indicates the fre-  
 380 quency at which its values change in space. This frequency remains constant for the duration of a  
 381 program, and it tells Prism that a `level[n]` variable is always indistinguishable to threads within  
 382 that perspective. For example, `blk_row @ block[1]` on line 19 in Figure 6 has the same value across  
 383 all threads in a block.

384 Programmers specify the perspective that each variable lives at in its declaration. A variable `v` of  
 385 type `int` is declared at `thread[1]` perspective using the syntax `v : int @ thread[1]`.<sup>2</sup> To enforce  
 386 this frequency invariant, Prism restricts reads from and writes to variables based on their perspectives.  
 387 The rule can be summarized as follows: Prism allows programs to “read up” from broader perspectives  
 388 and “write down” to narrower ones.

```
1 with group(thread[4]):
2   match split(thread):
3     case 2:
4       ...
5     case 1:
6       ...
7     case 1:
8       ...
```

Fig. 8. Example `split`.

```
1 with group(thread[4]):
2   match split(thread):
3     case 4:
4       ...
5     case 1:
6       ...
```

Fig. 9. Illegal split exceeds the available perspectives.

```
1 with group(thread[3]):
2   match split(thread):
3     case 1:
4       ...
5     case 2:
6       ...
```

Fig. 10. Illegal split violates alignment.

<sup>1</sup>Despite the use of the `match` syntax, *all* branches of the `split` execute.

<sup>2</sup>If not explicitly annotated, Prism infers a variable’s perspective to be the perspective of the code where it was declared.

393    **Read Up.** Variables can only be read if their perspective is at  
 394    least as broad as the current code perspective. Figure 11 gives an  
 395    example of an illegal read that would violate this constraint. While  
 396    `__syncthreads()` should always be safe in `block[1]`, branching on  
 397    the variable `flag`—which may take different values across threads  
 398    in a block—can cause only some of those threads to reach the  
 399    `__syncthreads()`, violating its collective invariant.

```
1 flag : bool @ thread[1] = ...
2 with group(block[1]):
3   if (flag)
4     __syncthreads();
```

Fig. 11. Illegal read of `thread[1]` variable.

400    **Write Down.** Writes are dually constrained. Only values that  
 401    live at broader perspectives can be written into variables that live  
 402    at narrower ones. For example, a `block[1]` variable can be used to  
 403    write into a `thread[1]` variable, but not vice versa. An example of an  
 404    illegal write is shown in Figure 12, where writing from a `thread[1]`  
 405    variable into a `block[1]` one would cause a deadlock.

```
1 x : bool @ thread[1] = ...
2 y : bool @ block[1] = ...
3 with group(block[1]):
4   y = x
5   if(y):
6     __syncthreads();
```

Fig. 12. Illegal write of `thread[1]` variable into a `block[1]` variable.

406    Together, the “read up” and “write down” rules ensure that information only flows from broader  
 407    perspectives to narrower ones. In Figure 6, we can see the “read up” rule in action on lines 42 and 43.  
 408    Meanwhile, lines 19 and 20 are instances of the “write down” rule.

410    **3.4.2 Thread-Shared Data.** Unlike thread-local data, which is literally replicated across threads and  
 411    is backed by distinct physical storage, thread-shared data consists of pointers into shared and global  
 412    memory that are visible to some collection of threads. As a result, the perspective such data inhabits  
 413    can evolve as the program executes.

414    In Prism, there are three mechanisms for obtaining thread-shared data. The first is to directly  
 415    allocate data residing in shared memory.<sup>3</sup> The second and third are to obtain offsets into existing  
 416    data by using the `partition` or `claim` operations. These constructs mirror `group` and `split`, and  
 417    are used to temporarily narrow the perspective associated with a pointer.

418    Prism simplifies shared memory allocations by requiring them to have a static size. The `@requires`  
 419    annotation specifies the amount of shared memory a function expects to have. Prism uses this  
 420    information to ensure that a function’s allocations do not exceed its declared limit, and checks  
 421    whether there is enough shared memory available at its call sites. The function in Figure 6 declares  
 422    the amount of shared memory it will require on 2. We use standard techniques [34] to statically  
 423    bound memory usage with Prism’s type system.

424    Additionally, since shared memory is only visible to threads within the same block, such allocations  
 425    are only permitted at `block[1]`. An example of such an allocation is shown on lines 28-30 of Figure 6.  
 426    During compilation, Prism will automatically handle the necessary pointer arithmetic to assign each  
 427    allocation an appropriate offset within the shared memory space.

428    **Partition.** The `partition` operation plays the same role for memory that `group` plays for code.  
 429    It is used to refine the perspective of a pointer by computing offsets for each narrowed perspective.  
 430    Concretely, the `partition` operation takes a pointer variable `x`, an indexing function `f`, and produces  
 431    a new variable `y` at a narrower perspective `level[n]` to be used inside the `partition`. After the  
 432    partition ends, the threads participating in the `partition` must synchronize to restore the original  
 433    pointer `x` to its perspective.<sup>4</sup>

434    Inside the scope of the partition, the original variable `x` cannot be used, and the pointer can only  
 435    be accessed using `y`. Each use of `y` applies the indexing function `f` to compute the true offset of the

437    <sup>3</sup>Prism assumes that all global memory allocations have been made before launching the CUDA kernel, as is typically the case  
 438    with CUDA programs.

439    <sup>4</sup>Strictly speaking, this synchronization is only required at the next point the memory is to be *re-used*. Prism has a mechanism  
 440    for optimizing the placement of these synchronization points, which is described in Section 5.

access. For example, the use of **partition** on line 34 in Figure 6 takes a pointer **C\_smem** at **block[1]**, gives it a new name **C\_th**, and distributes it across **thread[1]** perspectives by transforming each occurrence of **C\_th[i]** in the body into **C\_smem[i\*4]**.

At this point, it is necessary to consider how different indexing functions affect the possibility of data races. If the indexing function is injective, each narrower perspective receives a distinct offset into the underlying array, and the resulting partition is free of data races. When the indexing function is not injective, multiple threads may race on the same location, introducing a potential data race.

Preventing data races is not one of Prism's goals. In Prism, data races *can* occur within a **partition**; however, since the data is eventually synchronized before it is reused, the last writer wins. Prior work, in particular Descend [35], describes a type system for data-race free GPU programs, and we believe a similar approach can be combined with Prism's. Prism instead focuses on the interaction between the compute and memory hierarchies and on reasoning about them simultaneously to ensure that operations are executed only with sufficient compute resources, a guarantee that Descend cannot provide.

Out-of-bounds accesses are undefined behavior.

**Claim.** The **claim** operation lets programmers narrow a pointer's perspective by giving it to only one collection of threads with that narrower perspective. For example, line 57 of Figure 6 shows an example of a **claim**, where the pointer **C\_smem @ block[1]** is only available to a single warp in the block and narrowed to **Cs\_warp @ thread[32]** to call a Tensor Core operation.

A **claim** takes the original pointer **x**, the target perspective to narrow it to, and a new name **y** to assign to the narrowed memory. In all **splits** within the **claim**, the new pointer **y** is accessible only within one branch; sibling branches are not permitted to read or write from this memory. As with **partition**, the original variable **x** is not accessible within the **claim**.

### 3.5 The **id()** Function

As opposed to exposing users to special hardware variables like **blockIdx.x**, and **threadIdx.x**, Prism provides an **id()** function instead. The **id** function returns the relative index of a narrow perspective within a broader perspective. That is, the interpretation of the **id** function depends on both the perspective of the variable it is being written to and the perspective of the code invoking it. In Figure 6, we use a call to **id** with **grid[1]** perspective on lines 19 and 20 to locate the tile that each block is in charge of computing. On line 32, however, a write of **id** into a **thread[1]** variable at **block[1]** perspective will return the relative ID of the thread within the block, not in the grid.

### 3.6 Collective Operations

There are two types of collective operations in Prism.

**Function Calls.** Each function carries a *perspective signature*, which consists of its top-level perspective bound, shared memory usage, and its arguments' perspectives. An example of such a signature can be seen in lines 3-9 in Figure 6. At call sites, Prism ensures that the callee's perspective signature can be satisfied by the caller. Since the perspective bound describes a minimum requirement, functions can be called with a broader code perspective than necessary, accounting for alignment. When checking function arguments at call sites, Prism distinguishes between primitive data types and pointers. For primitive data types, like **int** or **bool**, arguments can have broader perspectives than specified in the function signature. On the other hand, pointers that live at broader perspectives can only be passed if the pointer is marked as **const**, indicating that it will only be used for reading. Otherwise, we require that the pointer's perspective exactly match that of the function's perspective signature.

*Intrinsics.* The compiler provides a pre-defined set of collective intrinsics –like the Tensor Core instruction discussed in Section 2.1—each of which declares its perspective signature. Programmers may also add to this set using **unsafe**. From within **unsafe** code, programmers can inline assembly instructions and wrap them in a Prism function, specifying its perspective signature. Prism checks these call sites like those of any other function.

### 3.7 Asynchrony

Asynchronous data movement works similarly to other memory operations. Users can mark storage as asynchronous with the **async** construct. As with **partition** and **claim**, **async** hides the old variable and exposes a fresh one that is only accessible within the **async** statement. Inside, Prism ensures the new variable is only used by **async** data-movement intrinsics.

Prism includes two such intrinsics: bulk [16] and non-bulk [19] asynchronous data-movement instructions. We show an example of the former in Figure 13. As with the data operations in Section 3.4, Prism inserts the necessary synchronization before the program’s next use of the original variable, ensuring that the asynchronous transfer has completed.

```
1 with async(old_name) as new_name:
2   match split(thread):
3     case 1:
4       cp_async(smem, new_name, 16);
```

Fig. 13. Example of an **cp.async** instruction in Prism.

## 4 Formalization in Bundl

Having introduced the full Prism language, we now describe Bundl, a core calculus that formalizes its most fundamental aspects by statically tracking perspectives on code and data. We use Bundl to argue that well-typed Prism programs are not only type-safe, but will also never execute operations for which they lack the correct perspective.

In this section, we describe Bundl’s type system and operational semantics—in particular how it manages compute and data perspectives—and build up to a formal proof of type-and-perspective safety. We instrument Bundl’s operational semantics with runtime perspective enforcement: its rules will get stuck if they encounter an operation for which they have the wrong perspective. This runtime enforcement means that our safety theorem guarantees that dynamically-realized perspectives match the ones inferred by the type system.

### 4.1 Bundl Type System

The core idea in Bundl is to track, at the type level, the program’s perspective on code and data. To achieve this, we borrow techniques from the literature on dependency tracking [1]. In particular, the code perspective is tracked on the typing judgment, which has the form  $\Gamma \vdash^\pi e : \tau$  for expressions and  $\Gamma \vdash^\pi s$  for statements. The  $\pi$  over the  $\vdash$  is the code perspective on  $e$  and  $s$ , and is comprised of a level and a size, the same structure as a perspective in Prism.

The typing context also tracks the perspective at which each variable lives; data can only be read from or written to a variable when its perspective is compatible with that of the code interacting with it. This requirement is made manifest in the **T-VAR** rule, found in Figure 14. Observe that the  $\pi$  in the variable rule must match exactly between data and code; principles like “read up” and “write down” are instead encoded directly in the rules for reading and writing, like **T-ARR-ACCESS**, which views the array being read with broader perspective than the current code perspective.

Figure 14 also shows other key rules, which fall into two main categories: those for managing perspectives on code and those for perspectives on data.

**4.1.1 Managing Perspectives on Code.** Bundl’s **group** statement directly corresponds to Prism’s, and is checked by the **T-GROUP** rule. Given some statement  $s$  that checks with perspective  $\pi$ , the statement **group**  $q\ s$  will check with  $q \cdot \pi$ , enforcing Prism’s divisibility requirement.

|                                                                                                                                                                                                                                                                                                                                                              |                                                                                                                                                                                   |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| $\boxed{\Gamma \vdash^\pi e : \tau}$<br>$\frac{x :^\pi \tau \in \Gamma}{\Gamma \vdash^\pi x : \tau}$ T-VAR                                                                                                                                                                                                                                                   | <i>(Expression typing)</i>                                                                                                                                                        |
| $\boxed{\Gamma \vdash^\pi s}$<br>$\frac{\Gamma \vdash^{(h,n_1)} s_1 \quad \Gamma \vdash^{(h,n_2)} s_2 \quad n_1, n_2 \text{ align to } n}{\Gamma \vdash^{(h,n)} \mathbf{split}(n_1, n_2) \{s_1\} \{s_2\}}$ T-SPLIT<br>where $n_1, n_2 \text{ align to } n ::= (n_1 + n_2 \leq n) \text{ and } (n_1   n) \text{ and } (n_2   n) \text{ and } (n_2   n_1 + n)$ | <i>(Statement typing)</i>                                                                                                                                                         |
| $\frac{\Gamma \vdash^\pi s}{\Gamma \vdash^{q \cdot \pi} \mathbf{group} q \ s}$ T-GROUP                                                                                                                                                                                                                                                                       | $\frac{\Gamma, y : \downarrow^\pi \tau[]^l \vdash^\pi s \quad l \neq \mathbf{Local}}{\Gamma, x :^\pi \tau[]^l \vdash^\pi \mathbf{lower} x \text{ into } y \text{ in } s}$ T-LOWER |
| $\frac{\Gamma, y :^{(h,n/c)} \tau[]^l \vdash^{(h,n)} s \quad c   n \quad l \neq \mathbf{Local}}{\Gamma, x :^{(h,n)} \tau[]^l \vdash^{(h,n)} \mathbf{partition} x \text{ into } y \text{ by } c \text{ in } s}$ T-PARTITION                                                                                                                                   | $\frac{\Gamma \vdash^\pi s}{\Gamma \vdash^\pi \mathbf{destruct} \text{ in } s}$ T-DESTRUCT                                                                                        |
| $\frac{\Gamma, y :^{(h,n')} \tau[]^l \vdash^{(h,n')} s \quad n' \leq n \quad l \neq \mathbf{Local}}{\Gamma, x :^{(h,n)} \tau[]^l \vdash^{(h,n)} \mathbf{claim} x \text{ into } y \text{ at } n' \text{ in } s}$ T-CLAIM                                                                                                                                      |                                                                                                                                                                                   |

Fig. 14. Core typing rules of Bundl. The typing rules presented here are a simplified selection of the full rules, which can be found in Appendix A.2.

The **split** statement, meanwhile, is checked by the **T-SPLIT** rule, and functions like a binary version of the n-ary **split** construct in Prism. It enforces the same divisibility requirements to ensure that the perspectives on code and data remain properly aligned, and then checks the two sub-statements  $s_1$  and  $s_2$  with the divided, narrower perspectives.

In Bundl, to better model the details of how perspectives shift down the GPU hierarchy, we introduce a third construct called **destruct** that makes explicit exactly where such shifts occur. In the corresponding rule, **T-DESTRUCT**, the  $\downarrow$  operation on  $\pi$ s “destructs” the perspective into many narrower perspectives at a lower level. This operation is defined as  $\downarrow(\mathbf{Grid}, 1) = (\mathbf{Block}, B)$  and  $\downarrow(\mathbf{Block}, 1) = (\mathbf{Thread}, T)$ , where  $B$  and  $T$  are parameters to a particular instantiation of Bundl to describe the number of blocks per grid and threads per block.<sup>5</sup>

**4.1.2 Managing Perspectives on Data.** The mechanism for managing data perspective mirrors that of code perspective, with each operation for data corresponding to an operation for code.

The **partition** operation is analogous to **grouping** a code perspective. The typing rule for this operation, **T-PARTITION**, requires that the data perspective on  $x$ , the variable to be partitioned, is the same as the current perspective on code. After partitioning, a fresh variable  $y$  is introduced with a new perspective  $\pi/c$ . Within the **partition**, we disallow references to  $x$  and continue checking the body with the original  $\pi$ ; the **partition** has no effect on the code perspective.

Unlike **partition**, which divides up a piece of data equally among narrower perspectives, the **claim** operation views the claimed data with exactly one narrower perspective. Accordingly, Bundl needs to ensure that only one branch of a **split** operation, with the appropriate  $\pi$ , can refer to the claimed variable. To ensure that this is the case, the **T-CLAIM** rule links the data perspective of the variable to the compute perspective of the code claiming it by changing both at the same time. This represents a minor difference from Prism, which uses additional static analysis to ensure that a claimed variable is only accessed in a single **split** branch.

<sup>5</sup>Bundl is abstracted over these  $B$  and  $T$  values, so instead of tracking perspective bounds the way that Prism does, it only tracks the top-level perspective described in Section 3.2.

|     |                                                                                                                                                                                                                                                                                                                                                                                                                   |                           |
|-----|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------|
| 589 | $L, S, \Sigma, P \rightsquigarrow L', S', \Sigma', P'$                                                                                                                                                                                                                                                                                                                                                            | <i>(Machine judgment)</i> |
| 590 |                                                                                                                                                                                                                                                                                                                                                                                                                   |                           |
| 591 | $\frac{L(t), S(b), \Sigma, t, b, 0 \vdash^{(\text{Grid}, 1)} s \rightsquigarrow s' + \eta', \sigma', \Sigma'}{L, S, \Sigma, P \rightsquigarrow L[t \mapsto \eta'], S[b \mapsto \sigma'], \Sigma', P[(t, b) \mapsto s']} \text{S-PROGRAM}$                                                                                                                                                                         | S-PROGRAM                 |
| 592 |                                                                                                                                                                                                                                                                                                                                                                                                                   |                           |
| 593 |                                                                                                                                                                                                                                                                                                                                                                                                                   |                           |
| 594 | $\eta, \sigma, \Sigma, t, b, p \vdash^\pi s \rightsquigarrow s' + \eta', \sigma', \Sigma'$                                                                                                                                                                                                                                                                                                                        | <i>(Thread judgment)</i>  |
| 595 |                                                                                                                                                                                                                                                                                                                                                                                                                   |                           |
| 596 |                                                                                                                                                                                                                                                                                                                                                                                                                   |                           |
| 597 | $\frac{p < n_1 \quad n_1, n_2 \text{ align to } n \quad \eta, \sigma, \Sigma, t, b, p \vdash^{(h, n_1)} s_1 \rightsquigarrow s'_1 + \eta', \sigma', \Sigma'}{\eta, \sigma, \Sigma, t, b, p \vdash^{(h, n)} \text{split}(n_1, n_2)\{s_1\}\{s_2\} \rightsquigarrow \text{split}(n_1, n_2)\{s'_1\}\{s_2\} + \eta', \sigma', \Sigma'} \text{S-SPLIT-LEFT}$                                                            | S-SPLIT-LEFT              |
| 598 |                                                                                                                                                                                                                                                                                                                                                                                                                   |                           |
| 599 |                                                                                                                                                                                                                                                                                                                                                                                                                   |                           |
| 600 | $\frac{p \geq n_1 \quad p < n_1 + n_2 \quad n_1, n_2 \text{ align to } n \quad \eta, \sigma, \Sigma, t, b, p - n_1 \vdash^{(h, n_2)} s_2 \rightsquigarrow s'_2 + \eta', \sigma', \Sigma'}{\eta, \sigma, \Sigma, t, b, p \vdash^{(h, n)} \text{split}(n_1, n_2)\{s_1\}\{s_2\} \rightsquigarrow \text{split}(n_1, n_2)\{s_1\}\{s'_2\} + \eta', \sigma', \Sigma'} \text{S-SPLIT-RIGHT}$                              | S-SPLIT-RIGHT             |
| 601 |                                                                                                                                                                                                                                                                                                                                                                                                                   |                           |
| 602 |                                                                                                                                                                                                                                                                                                                                                                                                                   |                           |
| 603 | $\frac{\eta, \sigma, \Sigma, t, b, p \text{ mod } n \vdash^{(h, n)} s \rightsquigarrow s' + \eta', \sigma', \Sigma'}{\eta, \sigma, \Sigma, t, b, p \vdash^{(h, q, n)} \text{group } q \ s \rightsquigarrow \text{group } q \ s'; + \eta', \sigma', \Sigma'} \text{S-GROUP}$                                                                                                                                       | S-GROUP                   |
| 604 |                                                                                                                                                                                                                                                                                                                                                                                                                   |                           |
| 605 |                                                                                                                                                                                                                                                                                                                                                                                                                   |                           |
| 606 | $\frac{\eta, \sigma, \Sigma, t, b, t \text{ mod } T \vdash^{(\text{Thread}, T)} s \rightsquigarrow s' + \eta', \sigma', \Sigma'}{\eta, \sigma, \Sigma, t, b, 0 \vdash^{(\text{Block}, 1)} \text{destruct in } s \rightsquigarrow \text{destruct in } s' + \eta', \sigma', \Sigma'} \text{S-DESTRUCT-BLOCK}$                                                                                                       | S-DESTRUCT-BLOCK          |
| 607 |                                                                                                                                                                                                                                                                                                                                                                                                                   |                           |
| 608 |                                                                                                                                                                                                                                                                                                                                                                                                                   |                           |
| 609 |                                                                                                                                                                                                                                                                                                                                                                                                                   |                           |
| 610 | $\frac{\eta, \sigma, \Sigma, t, b, p \vdash^{(\text{Block}, 1)} x := \text{alloc Shared } \tau \ n \ \text{in } s \rightsquigarrow s + \eta, \sigma[x \mapsto^\pi \langle x, n \rangle], \Sigma}{\eta, \sigma, \Sigma, t, b, p \vdash^{(\text{Block}, 1)} x := \text{alloc Shared } \tau \ n \ \text{in } s \rightsquigarrow s + \eta, \sigma[x \mapsto^\pi \langle x, n \rangle], \Sigma} \text{S-ALLOC-SHARED}$ | S-ALLOC-SHARED            |
| 611 |                                                                                                                                                                                                                                                                                                                                                                                                                   |                           |

Fig. 15. Core semantic rules of Bundl. As with the typing rules, we present only a simplified selection of the full rules, which can be found in Appendix A.3.

Lastly, the **T-LOWER** rule mirrors the **T-DESTRUCT** rule; it uses the  $\downarrow$  operator to move a variable from one level of the hierarchy to another, distributing it equally among all the narrower perspectives at that level in the same manner as **T-PARTITION**.

Note that these rules only apply to thread-shared memory, i.e., arrays that do not live in **Local**. The provenance of an array type is denoted by the superscript  $l$  above it.

## 4.2 Bundl Semantics

Having explained the key rules of the type system, we can move on to discuss Bundl’s operational semantics. To reflect the fact that a GPU program executes in parallel across numerous threads, we model the semantics of Bundl in the style of Turon et al. [57], using a two-level small step judgment. We present the key rules of this semantics in Figure 15.

The top level (i.e., machine-level) judgment has just one rule: **S-PROGRAM**. This rule acts as a “frame” for the lower level (i.e., thread-level) judgment, and steps a collection of thread-ID-indexed local memories ( $L$ ), a collection of block-ID-indexed shared memories ( $S$ ), a global memory ( $\Sigma$ ), and a thread pool ( $P$ ) to an updated collection of memories and updated thread pool. The thread pool maps thread and block IDs to code, intuitively representing the program being executed by each thread at the current moment. The **S-PROGRAM** rule non-deterministically chooses a thread ID and block ID and steps it according to the thread-level judgment. This allows the semantics to model the full range of non-deterministic behavior arising from the GPU’s thread scheduler.<sup>6</sup>

<sup>6</sup>In reality, the GPU’s warp scheduler issues instructions to threads in a warp in lockstep, but modeling every thread as completely independent is both simpler and a conservative overestimate of the nondeterministic behavior of the GPU.

638 The thread-level judgment has the shape  $\eta, \sigma, \Sigma, t, b, p \vdash^\pi s \rightsquigarrow s' \dashv \eta', \sigma', \Sigma'$ , where

- 639 •  $\eta$  denotes thread-local memory,
- 640 •  $\sigma$  denotes shared memory,
- 641 •  $\Sigma$  denotes global memory,
- 642 •  $t$  denotes the thread's ID,
- 643 •  $b$  denotes the ID of the block in which the thread lives, and
- 644 •  $p$  denotes the relative position of the thread within  $\pi$  (the perspective ID).

645 Critically, notice that a  $\pi$  also appears on the thread-level judgment just as it does on the typing  
 646 judgment. This is because the thread semantics *dynamically tracks and enforces perspectives*. The same  
 647 way evaluation of a program “gets stuck” if a value does not have the right type, the semantics of Bundl  
 648 also get stuck if code attempts to access data or invoke commands with the wrong perspective. As an  
 649 example, observe the **S-ALLOC-SHARED** rule in Figure 15, which requires a (**Block**,1) perspective and  
 650 will fail to step if encountered with a different one. This runtime perspective is present in Bundl, but  
 651 is erased by Prism during compilation; in Section 4.3 we use it to prove that well-typed programs will  
 652 always execute with the same perspective that the type system viewed them with.  
 653

654 The semantic rules for perspectives involve manipulating  $p$  to track which threads take which code  
 655 paths when perspectives are **split** or **grouped**. Notice that in **S-PROGRAM**, the thread-level judgment  
 656 always begins with perspective (**Grid**,1): all the perspective management rules are congruences,  
 657 narrowing the perspective of further evaluation as determined by the particular rule used.  
 658

659 These rules take great care to ensure that  $p$  always describes the relative position of a compute  
 660 resource within its perspective; the payoff is that Bundl’s semantics can later use this  $p$  value to model  
 661 the way that Prism automatically adjusts indices into data when **partitioning** a data perspective.  
 662 Beyond these key rules for managing perspective on code, we have modeled all other core features of  
 663 Prism, such as asynchronous operations and thread synchronization, in Bundl. To handle such features  
 664 we equip the operational semantics with additional structure, including sets of semaphores [25] for  
 665 synchronization and a stack of effect handlers for modeling deferred asynchronous computations  
 666 inspired by Ahman and Pretnar [3]. We have elided these details here for simplicity, but interested  
 667 readers can find them in their full complexity in Appendix A.3.  
 668

### 669 4.3 Soundness Theorem

670 Together, the type system and operational semantics allow us to prove the following syntactic  
 671 soundness theorem, which says that Bundl programs are type safe and do not get stuck trying to  
 672 execute operations for which they lack the required perspective:  
 673

674 **THEOREM 4.1.** *(Type-and-Perspective Safety). For any program s such that  $\Gamma \vdash^\pi s$ , either:*

- 675 (1) *s is **skip**, or*
- 676 (2) *for any well-typed environments  $\eta, \sigma$ , and  $\Sigma$ , there is an  $s', \eta', \sigma',$  and  $\Sigma'$  such that*  

$$\eta, \sigma, \Sigma, t, b, p \vdash^\pi s \rightsquigarrow^* s' \dashv \eta', \sigma', \Sigma' \text{ and } \Gamma' \vdash^\pi s', \text{ where } \Gamma' \text{ is an extension of } \Gamma, \text{ and } \eta', \sigma', \text{ and } \Sigma' \text{ are}$$

$$\text{well-typed with respect to } \Gamma'.$$

677 **PROOF.** Via the usual progress and preservation lemmas, available in Appendix A.4. □

678 It is worth noting that this soundness theorem guarantees a syntactic safety property, not a liveness  
 679 property: it does not guarantee that all threads sharing perspective  $\pi$  that *can* reach a program point  
 680 typed with  $\pi$  *will* eventually do so. Indeed, in the presence of nontermination, liveness does not  
 681 hold—some of the threads could **split** off and loop forever. While we believe the liveness version of  
 682 this theorem holds for a terminating fragment of Bundl, it is not provable with syntactic methods; the  
 683 proof would require semantic techniques that are notoriously challenging and would be a research  
 684 contribution [5, 28, 57] in and of itself. We plan to tackle this proof for Bundl in future work.  
 685

## 5 Implementation

Prism is implemented as an embedded language in Python. Once a program type checks, Prism lowers it to a CUDA file. All perspective information is erased during this step, and the generated CUDA contains no run-time checks. The file can then be compiled by `nvcc` [40], NVIDIA’s closed-source compiler, to produce an executable. Because Prism operates at roughly the same level of abstraction as CUDA, there is a one-to-one mapping between most language constructs and their CUDA counterparts. A notable change is the addition of three parameters to each device function: the thread’s relative ID, the block’s ID, and an offset for shared memory allocations.

*Inserting Synchronization.* Most of Prism’s implementation is straightforward, but inserting synchronization points is more involved. As described in Section 3.4, once data has been **partitioned**, Prism is responsible for synchronizing the data after the **partition** ends.

To determine where this synchronization must occur, Prism constructs a *data-control-flow graph* from the program. Nodes correspond to **partitions**, and edges capture program-order precedence: a parent **partition** must complete before its child begins. The graph can have backedges introduced through loops. In this graph, each **partition** is categorized as a read or a write **partition** by checking whether the **partitioned** variable ever appears as an lvalue. Synchronization points are inserted according to the following scheme:

- (1) If the *parent partition* is a write, a synchronization point is inserted before the current one to ensure that it observes the most recent data; or
  - (2) If the *current partition* is a write, a synchronization point is inserted before it to ensure that all preceding reads have completed.

Figure 16 shows an example graph with synchronization points derived from these two conditions. The inferred synchronization points in Figure 6 have also been marked on lines 40, 48, 56, and 63.

Using this information, Prism emits *wait* operations before **partitions** begin and *arrive* operations after they end, using CUDA’s general *split-barrier* [20] primitive to implement them. For special cases, such as synchronizing an entire block or warp, Prism instead uses primitives like `__syncthreads()` or `__sync`.

Synchronization for asynchronous data movement is handled in exactly the same way and uses the same underlying graph. CUDA allows asynchronous loads to be associated with a split barrier, so Prism binds each asynchronous transfer to the appropriate wait–arrive pair inferred from the graph. Certain features, such as commit group-style synchronization [17, 18], require additional reasoning, and Prism performs further static analysis to insert the necessary synchronization.

It is worth noting that naively inserting synchronization immediately after each **partition** would be correct but prohibitively slow. To avoid this, Prism applies two optimizations: a wait-motion pass pushes waits downward toward the first use of the **partitioned** variable, and an arrive-motion pass pulls arrives upward toward its last use.

## 6 Evaluation

Having explained the design of Prism, we now evaluate it in the context of three main questions:

- RQ1** Can Prism express a variety of composable CUDA programs?
  - RQ2** Can Prism express programs that use advanced GPU features?
  - RQ3** Can Prism match the performance of existing, speed-of-light CUDA code?



Fig. 16. An example data-control-flow graph with synchronization points.

To perform this evaluation, we use two GPUs. The first is the NVIDIA H100 SXM5, a server-grade chip that supports Tensor Core operations and a dedicated hardware copy engine, the Tensor Memory Accelerator (TMA) [16]. Notably, the H100 introduces a new logical level called the warpgroup, and we show that our programming model can accommodate it. Moreover, because the H100 has historically served as the primary GPU for large-scale AI training, many CUDA kernels are already highly optimized for this hardware and achieve near speed-of-light performance, providing a rigorous baseline for comparison. To ensure our results generalize beyond the H100, we also test programs on a second GPU, the NVIDIA 4070 SuperTi—a consumer-grade chip.

As mentioned in Section 5, when Prism typechecks a program, it produces a CUDA file. We compile the CUDA file with `nvcc` version 12.3 [40] with flags `-O3 -use_fast_math`. We initialize all inputs using a random number generator [31] and report the average runtime sampled over 10 iterations, following a warm-up phase of 5 iterations. All aggregate results reported in this section are geometric means. All programs in the evaluation have been reproduced in Appendix B.

## 6.1 RQ1: Can Prism Express a Variety of CUDA Programs?

We evaluate Prism’s expressivity by writing programs that have fundamentally different patterns of convergence, along with a library modeled after CUB [38] that contains composable pieces.

### 6.1.1 Programs with Different Convergence Behavior.

**Matrix Multiplication** (4070Ti). We chose matrix multiplication as our first benchmark for two main reasons. First, there exist several implementations that achieve near-peak performance, providing a strong baseline. Second, it allows for a range of increasingly sophisticated implementations that exercise different parts of the language, making it ideal for evaluating expressiveness.

We adapt the codebase from Boehm [6] to implement `float` matrix multiplication, commonly referred to as `sgemm`. Concretely, we compute  $C \leftarrow \alpha AB + \beta C$  where  $A$ ,  $B$  and  $C$  are matrices and  $\alpha$  and  $\beta$  are scalars. As this is a `float` benchmark, it does not need advanced GPU features like asynchrony or Tensor Cores to achieve speed-of-light performance.

We implement five variants of the `sgemm` benchmark: (1) a naive version that follows the traditional single-program multiple-data pattern, written from the perspective of a thread; (2) a version that exploits memory coalescing [10], still expressed at the thread level; (3) a version that builds on (2) by introducing 2D tiling and staging data in shared memory, which requires shifting first to the perspective of a block and then to the perspective of a thread, while also requiring block-level synchronization; (4) a version that applies 2D tiling with vectorized loads, again written from the perspective of a block; and (5) a version that combines the optimizations from (2) through (4) while adding an additional level of tiling from the perspective of a warp.

We can express all five variants cleanly, and the resulting programs are almost the same as their CUDA counterparts, which distinguish perspectives through disciplined style. Prism, on the other hand, enforces this discipline at compile time. We evaluate the performance of these variants in Section 6.3.

**Single-Pass Parallel Prefix Scan with Decoupled Look-Back** (4070Ti). We also implement scan, a widely used parallel primitive, in Prism. We focus on the prefix-sum scan, which computes, for each position in an array, the sum of all elements up to that position. Prefix sum sits in a different corner of the GPU design space from matrix multiplication: it is memory intensive, requires careful attention to the convergence behavior of threads, and traditionally requires multiple passes over data.

We implement the single-pass parallel prefix scan with decoupled look-back, introduced by Merrill and Garland [36], an elegant algorithm that does not require multiple passes over the input data, and involves several distinct points of convergence. Within each block, work is decomposed into

785 fine-grained thread-level and warp-level scans. After producing the local result, blocks publish their  
 786 partial prefix to global memory. Finally, each block waits until enough information from earlier  
 787 blocks is available, at which point it accumulates the value and completes its section of the scan.

788 We implement this full strategy in Prism. Our implementation uses **unsafe** to implement a global-  
 789 memory spinlock that lets a block check when the previous block's data is ready.

### 791 6.1.2 Case Study: Can Prism help build composable functions?

793 While answering the other RQs, we found ourselves developing a small library of functions—similar  
 794 in spirit to CUB—that we would frequently call. In this section, we qualitatively study how Prism can  
 795 help programmers design libraries that they can compose with confidence.

796 As mentioned in Section 1, CUB occupies  
 797 a unique design space in the GPU  
 798 library ecosystem. Unlike many other li-  
 799 braries such as cuBLAS [12], cuDNN [14],  
 800 and cuSPARSE [15], which provide host-  
 801 side functions, CUB provides a device-side  
 802 library organized into different levels. It  
 803 makes these levels apparent by prefixing  
 804 each of its functions with **Device**, **Block**,  
 805 and **Thread**. The prefix sum already used variants of these functions, translated into Prism.

```

1 template<typename T, int BlockDimX,
2         int ItemsPerThread, BlockLoadAlgorithm Algorithm,
3         int BlockDimY = 1, int BlockDimZ = 1>
4 class BlockLoad:
5
6 // --- Calling the load function by instantiating the class ---
7
8 using BlockLoad = cub::BlockLoad<int, 128, 4, BLOCK_LOAD_DIRECT>;
9 // Allocate shared memory for BlockLoad
10 __shared__ typename BlockLoad::TempStorage temp_storage;
11 int thread_data[4]; // Thread local data
12 BlockLoad(temp_storage).Load(d_data, thread_data);

```

Fig. 17. Using **BlockLoad** in CUB.

806 Let's turn our attention to a particular CUB function—**BlockLoad**—and examine how it is equiva-  
 807 lently expressed in Prism; we will see how Prism's type system reifies CUB's implicit assumptions.  
 808 In CUB, the load function is implemented as a class, as shown in Figure 17.

809 CUB exposes a leaky abstraction, where information about the number of threads, block sizes, and  
 810 other details seeps through:

- 812 (1) The CUB documentation needs to specify the number of threads that the function can assume  
     813 to be available, because within the function, each thread must locate itself in the computation  
     814 and use its **threadIdx.x** accordingly.
- 815 (2) The “item per thread” design can serve two purposes. The first is performance: if loops have  
     816 constant bounds, they can be unrolled. The second is correctness: the function relies on the  
     817 assumption that all threads call the function with an equal number of values to load.
- 818 (3) The CUB documentation specifies that **thread\_data** can be data local to each thread.
- 819 (4) Finally, it says that if shared memory is being overwritten, a **\_\_syncthreads()** call must be  
     820 made to ensure that all reads have completed.

821 In Prism, however, we are able to describe these require-  
 822 ments through the interface shown in Figure 18, reducing  
 823 the need to communicate numerous implementation de-  
 824 tails through documentation:

```

1 @prism("device")
2 @requires(block[1], thread[32])
3 def block_load(input : ptr(const(int)) @ block[1],
4               output : ptr(int) @ thread[1],
5               items_per_thread : int @ block[1]):

```

Fig. 18. **block\_load** signature in Prism.

- 826 (1) We do not need to pass in the number of threads at all. Whenever Prism calls a function, it  
     827 tracks the relative thread ID, so each function can be written locally as if it were running alone,  
     828 rather than having to determine where the thread resides on the grid.
- 829 (2) We do not need to make **item\_per\_thread** a template argument for *correctness*. Its frequency  
     830 is set at the function signature, so Prism will never allow a function to be called with a value  
     831 living at a narrower perspective.

- 834 (3) In our interface, **thread\_data** is explicitly set to a thread-local value. Since it is not marked  
 835 as **const**, Prism conservatively assumes it may be written to, and enforces at compile time  
 836 that only **thread[1]** values are passed in.  
 837 (4) Finally, using our synchronization pass outlined in Section 5, a **\_\_syncthreads()** call will  
 838 be inserted automatically if **input** is going to be used for writing.

839 Moreover, our version of CUB’s function is built as a  
 840 composition of two other functions, whose interfaces are  
 841 shown in Figures 19 and 20. The **block\_load** function  
 842 is implemented by a call to **warp\_load**, which in turn  
 843 calls **thread\_load**. Had we mistakenly attempted to call  
 844 **warp\_load** from inside **thread\_load**, however, Prism  
 845 would reject this, rather than silently failing at runtime.  
 846

```
1 @prism("device")
2 @requires(thread[32])
3 # block_load calls into warp_load
4 def warp_load(input : ptr(const(int)) @ thread[32],
5               output : ptr(int) @ thread[1],
6               items_per_thread : int @ thread[32]):
```

Fig. 19. **warp\_load** signature in Prism.

```
1 @prism("device")
2 @requires(thread[1])
3 # warp_load calls into thread_load
4 def thread_load(input : ptr(const(int)) @ thread[1],
5                  output : ptr(int) @ thread[1],
6                  items_per_thread : int @ thread[1]):
```

Fig. 20. **thread\_load** signature in Prism.

## 847 6.2 RQ2: Can Prism Express 848 Programs that Use Advanced GPU Features?

849 To answer this question, we write a matrix multiplication for the **bf16** datatype on the H100, also  
 850 known as **hgemm**. This benchmark is an acid test of our language, as **hgemm** pushes several language  
 851 features to the extreme. To write an **hgemm** that can hit peak throughput on an H100, we need to write  
 852 a warp-specialized kernel that uses the TMA—an asynchronous hardware copy engine that can move  
 853 tiles of data at a time—and the warpgroup-level Tensor Core instructions, or **wmma** [9]—new to the  
 854 Hopper architecture. A high-performance kernel for this matrix-multiplication overlaps computation  
 855 with data movement by pipelining loads.

856 The implementation in Prism looks different from CUDA code, particularly in how pipelining is  
 857 expressed. Since Prism uses named variables introduced by **partitions** or **claims** to determine the  
 858 synchronization each region requires, when pipelining, we cannot dynamically change the pipeline  
 859 slot simply by maintaining an index that wraps around based on the pipeline’s length. Instead, each  
 860 pipeline slots must be given separate names so that Prism can track them independently and overlap  
 861 compute with data-movement. This leads to pipeline slots that must be individually named and forces  
 862 the load logic to be effectively “unrolled”. This, in turn, forces all pipelines in Prism to be statically  
 863 sized. In practice, these pipelines are statically sized anyway to ensure they fit in shared memory.

864 Notably, for this benchmark, in addition to **wmma** and TMA, the program needs to dynamically real-  
 865 locate registers between producer and consumer warpgroups, an instruction only available on Hopper.  
 866 This redistribution is a warpgroup-level collective operation, and Prism can check it like any other.  
 867 Moreover, getting **wmma** to work did not require introducing a new perspective into the language;  
 868 **thread[128]** was sufficient. We did, however, need to add a dedicated TMA-style asynchronous data-  
 869 movement construct, since Prism must eventually insert the appropriate synchronization for these  
 870 transfers.

## 871 6.3 RQ3: Can Prism Match the Performance 872 of Existing, Speed-of-Light CUDA Code?

873 In Section 6.1 and Section 6.2, we examined programs  
 874 that expressed the same computation in multiple  
 875 ways, relied on multiple points of convergence, and  
 876 used advanced GPU features. We now discuss their  
 877 performance.



Fig. 21. Performance of **sgemm** on square matrices as matrix dimension  $M=N=K$  increases.

The performance of the matrix multiplication benchmarks is shown in Figure 21, where we demonstrate that Prism is competitive with cuBLAS [12].

For the prefix scan, we compare our performance to CUB in Figure 22, which shows that Prism is within 7% of CUB’s achieved bandwidth for arrays that do not fit in the L2 cache.

Finally, we evaluate our H100 implementation and show (in Figure 23) that it is competitive with cuBLAS on square sizes, coming within 15%. This performance difference arises due to one of `nvcc`’s optimization passes failing to trigger: the dynamic register allocation instruction is merely a hint—not a directive—to `nvcc`, and its optimization pass is sensitive to the way that pipeline structure is expressed. We emphasize that this is an exacting benchmark, and coming close to cuBLAS’s performance demonstrates Prism’s ability to control low-level features.

## 7 Related Work

Prism builds on a rich tradition of systems languages for GPU programming and theoretical foundations of parallel programming. Prism differs from these systems in a crucial way: it treats collective operations



Fig. 22. Performance for prefix sum as input array size increases.



Fig. 23. Performance of `hgemm` on square matrices as matrix dimension  $M=N=K$  increases.

**Imperative Languages for GPUs.** CUDA [39], ROCm [2], and OpenCL [53] are imperative programming languages that expose low-level access to GPU hardware. These languages offer no support for managing collective operations, statically or otherwise; instead, users must orchestrate computation on the machine by describing how individual threads execute code. The subtle failures permitted by this model are the motivation for Prism.

Descend [35], a new Rust-inspired [50] language, uses a type system to track aspects of the compute and memory hierarchy and is thus the closest to our work. However, Descend’s main concern is preventing data races, as opposed to ensuring that collective operations execute in valid contexts. As a result, Descend lacks support for many collective operations like Tensor Cores; in fact, because Descend allows threads to read their own IDs and thus induce data-dependent control flow, adding such support is not possible with its current design. It is also worth noting that while Descend does formalize a type system, it does not attempt to prove any properties about it.

**Functional Languages for GPUs.** Futhark [33], Accelerate [7] and Vertigo [27] are functional array languages with compilers targeting CUDA. These languages expose a high level interface, abstracting away details of the hardware entirely in exchange for stronger safety guarantees. Prism, however, exposes low-level details of the hardware and thus does not trade performance for safety.

**Tile-Based Kernel DSLs.** More recently, tile-based GPU languages—Triton [55], Pallas [52], Tilus [26], and Helion [44]—offer a middle ground between high-level abstraction and low-level control. However, these languages sidestep the question of composability entirely because they restrict programmers to a single layer of the hierarchy: a block (Triton, Tilus, Helion), or a warpgroup

(Pallas). Gluon [56], meanwhile, is a new low-level, tile-based language, but it does not check and cannot enforce that collective operations are executed at the correct layer of the hierarchy.

**Task-Based and Scheduling Languages.** Languages like Cypress [58], Halide [45], Fireiron [29] and RISE/ELEVATE [30] expose a scheduling language that lets users modify an existing reference program through external commands. Because these languages operate by transforming a fixed source of truth, they expose a fundamentally different programming model than Prism.

**Libraries Built on CUDA.** Many GPU libraries, such as CUTLASS [51], CUB [38], and ThunderKittens [47], expose device functions that operate at various levels of the compute hierarchy. These functions are typically organized via C++ namespaces to reflect their intended usage (e.g., warp-level, block-level). However, this organization is purely conventional: it encodes hierarchy through naming rather than enforcing it statically. As a result, correct use requires careful discipline from both the library implementer (to uphold naming and usage invariants) and the user (to correctly interpret them). Any mismatch or subtle misunderstanding between the intended and actual use of these functions goes unchecked by the compiler.

**Theoretical Foundations.** The design of Bundl is heavily inspired by existing work on dependency tracking [1]. Dependency tracking calculi allow type systems to track how data and code depend on each other, and have commonly been used to implement secure information flow analyses [24]. In Bundl, data that lives at a narrow perspective is unable to flow into data living at a broader perspective, and we use dependency tracking to capture this restriction in Bundl’s type system.

Prior theoretical work has tackled reasoning about thread divergence. In particular, Muller and Hoffmann [37] build a sophisticated quantitative program logic for this use case. Singhania [46], meanwhile, uses static analysis techniques to predict thread divergence to unlock optimizations.

## 8 Conclusion, Limitations, and Future Work

We have presented Prism, a new, low-level GPU language that statically guarantees safe usage of compute resources by construction, without sacrificing low-level control. Prism introduces a new mental model for writing GPU code, which we are excited to make more expressive and ergonomic.

Particularly, we plan to improve the experience of writing pipelines in Prism. As discussed in Section 6.2, Prism currently requires pipeline slots to be explicitly named; we can make this more ergonomic this by adding language support for generating pipeline-style code.

Prism can be extended to more architectures; in particular, it can accommodate newer GPUs—such as Blackwell—that support coarser-grained Tensor Core operations. More broadly, we hope to generalize Prism’s model to other hierarchical compute environments, including distributed systems.

We believe that Prism is capable of guaranteeing data-race freedom, but additional work is required to support this both formally and in practice. In particular, we think that if Prism restricted users to **partitions** with injective indexing functions, data-race freedom would follow naturally. We are also interested in exploring how the design principles of Descend [35], which build on Rust’s ownership model [50], could be applied to Prism.

On the theoretical side, we plan to explore a terminating fragment of Bundl and prove the liveness property discussed in Section 4.3: that all threads sharing perspective  $\pi$  eventually reach the parts of a program viewed at that  $\pi$ . This amounts to showing that threads sharing the same perspective execute the same code and observe the same data, which we hope to prove using logical relations, following Turon et al. [57] and Spies et al. [48, 49].

We believe that Prism is a promising low-level substrate that enables confident composition and can serve as a foundation for building higher-level libraries and abstractions.

## 981 References

- [1] Martín Abadi, Anindya Banerjee, Nevin Heintze, and Jon G. Riecke. 1999. A core calculus of dependency. In *Proceedings of the 26th ACM SIGPLAN-SIGACT symposium on Principles of programming languages* (San Antonio Texas USA, 1999-01). ACM, 147–160. doi:[10.1145/292540.292555](https://doi.org/10.1145/292540.292555)
- [2] Advanced Micro Devices, Inc. 2024. AMD ROCm™ Software. <https://www.amd.com/en/products/software/rocm.html>. Accessed: 2025-11-10.
- [3] Danel Ahman and Matija Pretnar. 2021. Asynchronous effects. *Proceedings of the ACM on Programming Languages* 5, POPL (Jan. 2021), 1–28. doi:[10.1145/3434305](https://doi.org/10.1145/3434305)
- [4] Michael Bauer, Sean Treichler, and Alex Aiken. 2014. Singe: leveraging warp specialization for high performance on GPUs. *SIGPLAN Not.* 49, 8 (Feb. 2014), 119–130. doi:[10.1145/2692916.2555258](https://doi.org/10.1145/2692916.2555258)
- [5] Lars Birkedal, Filip Sieczkowski, and Jacob Thamsborg. 2012. A Concurrent Logical Relation. In *Computer Science Logic (CSL'12) - 26th International Workshop/21st Annual Conference of the EACSL (2012)*. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 107–121. doi:[10.4230/LIPIcs.CSL.2012.107](https://doi.org/10.4230/LIPIcs.CSL.2012.107)
- [6] Simon Boehm. 2022. SGEMM\_CUDA: Fast CUDA matrix multiplication from scratch. [https://github.com/siboehm/SGEMM\\_CUDA](https://github.com/siboehm/SGEMM_CUDA). GitHub repository. Accessed: 2025-11-10.
- [7] Manuel M.T. Chakravarty, Gabriele Keller, Sean Lee, Trevor L. McDonell, and Vinod Grover. 2011. Accelerating Haskell array codes with multicore GPUs. In *Proceedings of the Sixth Workshop on Declarative Aspects of Multicore Programming* (Austin, Texas, USA) (DAMP '11). Association for Computing Machinery, New York, NY, USA, 3–14. doi:[10.1145/1926354.1926358](https://doi.org/10.1145/1926354.1926358)
- [8] NVIDIA Corporation. 2022. NVIDIA H100 GPU Whitepaper. (2022). <https://resources.nvidia.com/en-us-hopper-architecture/nvidia-h100-tensor-c> Accessed: 2025-11-12.
- [9] NVIDIA Corporation. 2025. Asynchronous Warpgroup Level Matrix Multiply-Accumulate Instructions. <https://docs.nvidia.com/cuda/parallel-thread-execution/#asynchronous-warpgroup-level-matrix-instructions> Accessed: 2025-11-13.
- [10] NVIDIA Corporation. 2025. Coalesced Access to Global Memory. <https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#coalesced-access-to-global-memory> Accessed: 2025-11-13.
- [11] NVIDIA Corporation. 2025. cub::BlockReduce. [https://nvidia.github.io/cccl/cub/api/classcub\\_1\\_1BlockReduce.html](https://nvidia.github.io/cccl/cub/api/classcub_1_1BlockReduce.html) Accessed: 2025-11-09.
- [12] NVIDIA Corporation. 2025. cuBLAS. <https://docs.nvidia.com/cuda/cublas/index.html> Accessed: 2025-11-09.
- [13] NVIDIA Corporation. 2025. cuBLASDx. <https://docs.nvidia.com/cuda/cublasdx/#nvidia-cublasdx> Accessed: 2025-11-11.
- [14] NVIDIA Corporation. 2025. cuDNN. <https://docs.nvidia.com/deeplearning/cudnn/latest/> Accessed: 2025-11-09.
- [15] NVIDIA Corporation. 2025. cuSPARSE. <https://docs.nvidia.com/cuda/cusparse/index.html> Accessed: 2025-11-09.
- [16] NVIDIA Corporation. 2025. Data Movement and Conversion Instructions: Bulk Copy. <https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-bulk-copy> Accessed: 2025-11-12.
- [17] NVIDIA Corporation. 2025. Data Movement and Conversion Instructions: cp.async.bulk.commit\_group. <https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-commit-group> Accessed: 2025-11-13.
- [18] NVIDIA Corporation. 2025. Data Movement and Conversion Instructions: cp.async.commit\_group. <https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-commit-group> Accessed: 2025-11-13.
- [19] NVIDIA Corporation. 2025. Data Movement and Conversion Instructions: Non-bulk Copy. <https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-non-bulk-copy> Accessed: 2025-11-12.
- [20] NVIDIA Corporation. 2025. Parallel Synchronization and Communication Instructions: mbarrier. <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-mbarrier> Accessed: 2025-11-13.
- [21] NVIDIA Corporation. 2025. Synchronization Functions. <https://docs.nvidia.com/cuda/cuda-c-programming-guide/#synchronization-functions> Accessed: 2025-11-11.
- [22] NVIDIA Corporation. 2025. Warp Shuffle Functions. <https://docs.nvidia.com/cuda/cuda-c-programming-guide/#warp-shuffle-functions> Accessed: 2025-11-10.
- [23] DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J. L. Cai, Jian Liang, Jianzhong Guo, Jiaqi Ni, Jiaoshi Li, Jiawei Wang, Jin Chen, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, Junxiao Song, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Lei Xu, Leyi Xia, Liang Zhao, Litong Wang, Liyue Zhang, Meng Li, Miaojun Wang, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingming Li, Ning Tian, Panpan Huang, Peiyi Wang, Peng Zhang, Qiancheng Wang, Qihao Zhu, Qinyu Chen, Qiushi Du, R. J. Chen, R. L. Jin, Ruiqi Ge,

- Ruisong Zhang, Ruizhe Pan, Runji Wang, Runxin Xu, Ruoyu Zhang, Ruyi Chen, S. S. Li, Shanghao Lu, Shangyan Zhou, Shanhua Chen, Shaoqing Wu, Shengfeng Ye, Shengfeng Ye, Shirong Ma, Shiyu Wang, Shuang Zhou, Shuiping Yu, Shunfeng Zhou, Shuting Pan, T. Wang, Tao Yun, Tian Pei, Tianyu Sun, W. L. Xiao, Wangding Zeng, Wanjia Zhao, Wei An, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, X. Q. Li, Xiangyu Jin, Xianzu Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaojin Shen, Xiaokang Chen, Xiaokang Zhang, Xiaosha Chen, Xiaotao Nie, Xiaowen Sun, Xiaoxiang Wang, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xingkai Yu, Xinnan Song, Xinxia Shan, Xinyi Zhou, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, Y. K. Li, Y. Q. Wang, Y. X. Wei, Y. X. Zhu, Yang Zhang, Yanhong Xu, Yanhong Xu, Yanping Huang, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Li, Yaohui Wang, Yi Yu, Yi Zheng, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Ying Tang, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yu Wu, Yuan Ou, Yuchen Zhu, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yukun Zha, Yunfan Xiong, Yunxian Ma, Yuting Yan, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Z. F. Wu, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhen Huang, Zhen Zhang, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhibin Gou, Zhicheng Ma, Zhigang Yan, Zhihong Shao, Zhipeng Xu, Zhiyu Wu, Zhongyu Zhang, Zhioshu Li, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Ziyi Gao, and Zizheng Pan. 2025. DeepSeek-V3 Technical Report. doi:[10.48550/arXiv.2412.19437](https://doi.org/10.48550/arXiv.2412.19437) arXiv:2412.19437.
- [24] Dorothy E. Denning and Peter J. Denning. 1977. Certification of programs for secure information flow. *Commun. ACM* 20, 7 (July 1977), 504–513. doi:[10.1145/359636.359712](https://doi.org/10.1145/359636.359712)
- [25] Edsger Wybe Dijkstra. 1963. Over de sequentialiteit van procesbeschrijvingen. (1963). <https://training-ir7.tdl.org/handle/123456789/594>
- [26] Yaoyao Ding, Bohan Hou, Xiao Zhang, Allan Lin, Tianqi Chen, Cody Yu Hao, Yida Wang, and Gennady Pekhimenko. 2025. Tilus: A Tile-Level GPGPU Programming Language for Low-Precision Computation. arXiv:2504.12984 [cs.LG] <https://arxiv.org/abs/2504.12984>
- [27] C. Elliott. 2005. Vertigo – GPU Compiler & Embedded Language for Graphics Processors. <http://conal.net/Vertigo/>. Accessed: 2025-11-10.
- [28] Azadeh Farzan, Zachary Kincaid, and Andreas Podelski. 2016. Proving Liveness of Parameterized Programs. In *Proceedings of the 31st Annual ACM/IEEE Symposium on Logic in Computer Science*. ACM, New York NY USA, 185–196. doi:[10.1145/2933575.2935310](https://doi.org/10.1145/2933575.2935310)
- [29] Bastian Hagedorn, Archibald Samuel Elliott, Henrik Barthels, Rastislav Bodík, and Vinod Grover. 2020. Fireiron: A Data-Movement-Aware Scheduling Language for GPUs. In *Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques* (Virtual Event, GA, USA) (PACT ’20). Association for Computing Machinery, New York, NY, USA, 71–82. doi:[10.1145/3410463.3414632](https://doi.org/10.1145/3410463.3414632)
- [30] Bastian Hagedorn, Johannes Lenfers, Thomas Kundfinedhler, Xueying Qin, Sergei Gorlatch, and Michel Steuwer. 2020. Achieving high-performance the functional way: a functional pearl on expressing high-performance optimizations as rewrite strategies. *Proc. ACM Program. Lang.* 4, ICFP, Article 92 (Aug. 2020), 29 pages. doi:[10.1145/3408974](https://doi.org/10.1145/3408974)
- [31] Horace He. 2024. Strangely, Matrix Multiplications on GPUs Run Faster When Given “Predictable” Data! <https://www.thonking.ai/p/strangely-matrix-multiplications> Accessed: 2025-11-13.
- [32] Troels Henriksen, Niels G. W. Serup, Martin Elsman, Fritz Henglein, and Cosmin E. Oancea. 2017. Futhark: purely functional GPU-programming with nested parallelism and in-place array updates. In *Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation*. ACM, Barcelona Spain, 556–571. doi:[10.1145/3062341.3062354](https://doi.org/10.1145/3062341.3062354)
- [33] Troels Henriksen, Niels G. W. Serup, Martin Elsman, Fritz Henglein, and Cosmin E. Oancea. 2017. Futhark: purely functional GPU-programming with nested parallelism and in-place array updates. *SIGPLAN Not.* 52, 6 (June 2017), 556–571. doi:[10.1145/3140587.3062354](https://doi.org/10.1145/3140587.3062354)
- [34] Jan Hoffmann. 2011. *Types with potential: polynomial resource bounds via automatic amortized analysis*. Ph. D. Dissertation. lmu.
- [35] Bastian Köpcke, Sergei Gorlatch, and Michel Steuwer. 2024. Descend: A Safe GPU Systems Programming Language. 8 (2024), 841–864. Issue PLDI. doi:[10.1145/3656411](https://doi.org/10.1145/3656411)
- [36] Duane Merrill and Michael Garland. 2016. *Single-pass Parallel Prefix Scan with Decoupled Look-back*. Technical Report NVR-2016-002. NVIDIA Corporation. Accessed: 2025-11-10.
- [37] Stefan K. Müller and Jan Hoffmann. 2021. Modeling and analyzing evaluation cost of CUDA kernels. 5 (2021), 1–31. Issue POPL. doi:[10.1145/3434306](https://doi.org/10.1145/3434306)
- [38] NVIDIA Corporation. 2025. *CUB: CUDA Unbound*. <https://docs.nvidia.com/cuda/cub/index.html> CUDA Toolkit Documentation.
- [39] NVIDIA Corporation. 2025. *CUDA C++ Programming Guide*. NVIDIA. <https://docs.nvidia.com/cuda/cuda-c-programming-guide/> Accessed: 2025-11-06.
- [40] NVIDIA Corporation. 2025. CUDA Toolkit 12.3 Downloads (Archive). <https://developer.nvidia.com/cuda-12-3-0-download-archive>. Accessed: 2025-11-11.

- 1079 [41] NVIDIA Corporation. 2025. Parallel Thread Execution (PTX) ISA – Asynchronous Warpgroup Level Matrix Multiply-  
1080 Accumulate Instructions. <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html?highlight=tcgen05%2520cp#asynchronous-warpgroup-level-matrix-instructions>. Accessed: 2025-11-10.
- 1081 [42] NVIDIA Corporation. 2025. Parallel Thread Execution (PTX) ISA – TensorCore 5th Generation Instructions (tc-  
1082 gen05). <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html?highlight=tcgen05%2520cp#tensorcore-5th-generation-instructions>. Accessed: 2025-11-10.
- 1083 [43] NVIDIA Corporation. 2025. Parallel Thread Execution (PTX) ISA – Warp Level Matrix Instructions. <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html?highlight=tcgen05%2520cp#warp-level-matrix-instructions>. Accessed: 2025-11-10.
- 1084 [44] PyTorch Team at Meta. 2025. Helion: A High-Level DSL for Performant and Portable ML Kernels. <https://pytorch.org/blog/helion/>. Accessed: 2025-11-10.
- 1085 [45] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. *SIGPLAN Not.* 48, 6 (June 2013), 519–530. doi:[10.1145/2499370.2462176](https://doi.org/10.1145/2499370.2462176)
- 1086 [46] Nimit Singhania. 2018. *Static Analysis for GPU Program Performance*. PhD dissertation. University of Pennsylvania, Philadelphia, PA, USA. Supervisors: Rajeev Alur and Joseph Devietti.
- 1087 [47] Benjamin Frederick Spector, Simran Arora, Aaryan Singhal, Arjun Parthasarathy, Daniel Y Fu, and Christopher Re. 2025. ThunderKittens: Simple, Fast, and \\$\text{\\textit{Adorable}}\\$ Kernels. In *The Thirteenth International Conference on Learning Representations*. <https://openreview.net/forum?id=0fJfVOSUra>
- 1088 [48] Simon Spies, Lennard Gähler, Daniel Gratzer, Joseph Tassarotti, Robbert Krebbers, Derek Dreyer, and Lars Birkedal. 2021. Transfinite Iris: resolving an existential dilemma of step-indexed separation logic. In *Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation*. ACM, Virtual Canada, 80–95. doi:[10.1145/3453483.3454031](https://doi.org/10.1145/3453483.3454031)
- 1089 [49] Simon Spies, Neel Krishnaswami, and Derek Dreyer. 2021. Transfinite step-indexing for termination. *Proceedings of the ACM on Programming Languages* 5, POPL (Jan. 2021), 1–29. doi:[10.1145/3434294](https://doi.org/10.1145/3434294)
- 1090 [50] Rust Team. 2025. Rust Programming Language. <https://rust-lang.org>
- 1091 [51] Vijay Thakkar, Pradeep Ramani, Cris Cecka, Aniket Shivam, Honghao Lu, Ethan Yan, Jack Kosaian, Mark Hoemmen, Haicheng Wu, Andrew Kerr, Matt Nicely, Duane Merrill, Dustyn Blasig, Aditya Atluri, Fengqi Qiao, Piotr Majcher, Paul Springer, Markus Hohnerbach, Jin Wang, and Manish Gupta. 2023. CUTLASS. <https://github.com/NVIDIA/cutlass>
- 1092 [52] The JAX Authors. 2024. Pallas:Mosaic GPU – A JAX Kernel Language for GPUs. <https://docs.jax.dev/en/latest/pallas/gpu/index.html>. Accessed: 2025-11-10.
- 1093 [53] The Khronos Group Inc. 2025. OpenCL™ – The Open Standard for Parallel Programming of Heterogeneous Systems. <https://www.khronos.org/opencl/>. Accessed: 2025-11-10.
- 1094 [54] Philippe Tillet. 2025. Introducing Triton: Open-source GPU programming for neural networks. <https://openai.com/index/triton/>
- 1095 [55] Philippe Tillet, H. T. Kung, and David Cox. 2019. Triton: an intermediate language and compiler for tiled neural network computations. In *Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages* (Phoenix, AZ, USA) (MAPL 2019). Association for Computing Machinery, New York, NY, USA, 10–19. doi:[10.1145/3315508.3329973](https://doi.org/10.1145/3315508.3329973)
- 1096 [56] triton-lang. 2025. “01-intro.py” – Introduction to Gluon tutorial in Triton. <https://github.com/triton-lang/triton/blob/main/python/tutorials/gluon/01-intro.py>. Accessed: 2025-11-10.
- 1097 [57] Aaron J. Turon, Jacob Thamsborg, Amal Ahmed, Lars Birkedal, and Derek Dreyer. 2013. Logical relations for fine-grained concurrency. In *Proceedings of the 40th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages* (Rome Italy, 2013-01-23). ACM, 343–356. doi:[10.1145/2429069.2429111](https://doi.org/10.1145/2429069.2429111)
- 1098 [58] Rohan Yadav, Michael Garland, Alex Aiken, and Michael Bauer. 2025. Task-Based Tensor Computations on Modern GPUs. *Proc. ACM Program. Lang.* 9, PLDI, Article 163 (June 2025), 25 pages. doi:[10.1145/3729262](https://doi.org/10.1145/3729262)
- 1099  
1100  
1101  
1102  
1103  
1104  
1105  
1106  
1107  
1108  
1109  
1110  
1111  
1112  
1113  
1114  
1115  
1116  
1117  
1118  
1119  
1120  
1121  
1122  
1123  
1124  
1125  
1126  
1127