

# 1 Modular GPU Programming with Typed Perspectives

2 ANONYMOUS AUTHOR(S)

3 To program the modern GPU, one must balance two frames of mind simultaneously: on the one hand issuing  
4 instructions to individual threads to control their behavior, while on the other tracking the convergence of many  
5 threads acting in concert to issue *collective operations*—necessary for performance-critical hardware features  
6 like Tensor Cores. The tension between these two mindsets makes modular programming error-prone; GPU  
7 languages like CUDA offer no support for enforcing invariants about collective behavior, and such invariants  
8 are particularly fragile at function and module boundaries. A solution is required that enforces these invariants  
9 statically while still granting programmers access to the low-level control structures they need to write perform-  
10 ant code. In this work, we provide this solution in the form of Parallax, a new GPU programming language that  
11 guarantees correct composition by construction using *typed perspectives*, which materialize the programmer’s  
12 assumptions about the granularity at which they are controlling the behavior of threads. We describe the design  
13 of Parallax, implement a compiler for it, and lay its theoretical foundations in a core calculus called Prism. We  
14 evaluate Parallax against a diverse suite of benchmarks and find that it offers programmers the safety guarantees  
15 they need to confidently write modular code without sacrificing performance or low-level control.

## 17 1 Introduction

18 CUDA [23] is a low-level, imperative programming language for NVIDIA GPUs. These GPUs are  
19 organized into a hierarchy of compute resources. *Threads* are the basic unit of sequential execution;  
20 *blocks* are groups of threads that can cooperate through shared scratchpad memory; and the *grid* is  
21 the full collection of blocks launched for a GPU kernel. A CUDA program executes in parallel across  
22 this hierarchy, but is written from the *perspective of a single thread*.

23 While operations are specified per-thread, some operations are only semantically valid when exe-  
24 cuted *collectively* by a group of threads. The `__syncthreads()` primitive, for instance, synchronizes  
25 all threads within a block, and will cause a deadlock if executed by only some of the threads within that  
26 block. There are many such collective operations, including tensor core instructions [25–27], warp  
27 shuffle operations, and other kinds of synchronization primitives. These operations, therefore, require  
28 programmers to carefully marshal compute resources to coordinate which threads execute which lines  
29 of code. As a result, such collective operations *break the illusion* of threads executing independently.

30 The conceptual clash between regular statements (executed by a single thread), and collective  
31 operations (executed by a groups of threads) hinders compositional reasoning. In CUDA, function  
32 calls, like all other statements, are per-thread operations. However, a function may contain code that  
33 executes a collective operation, creating a gap between the per-thread syntax for its invocation and  
34 the cooperative semantics of its execution.

35 This gap creates tension between abstraction and correctness, a tension apparent even in widely  
36 used CUDA libraries that package common functionality via function interfaces. Consider the  
37 following snippet of documentation taken directly from CUB [22], a library of parallel primitives. The  
38 documentation describes the `BlockReduce` function [7], in which all threads in a block collaboratively  
39 apply a reduction operation, such as a maximum or prefix sum, over an array:

40     Computes a block-wide reduction for `thread0` using the specified binary reduction functor. The first `num_valid` threads  
41 each contribute one input element.

- 43     • The return value is undefined in threads other than `thread0`.
- 44     • A subsequent `__syncthreads()` threadblock barrier should be invoked after calling this method if the collective’s  
45 temporary storage (e.g., `temp_storage`) is to be reused or repurposed.

46 Through informal documentation, CUB attempts to convey several assumptions implicit in  
47 `BlockReduce`’s implementation. First, the function is only well-defined when invoked by all threads

50 in a block. Second, because it accesses memory shared among threads, any subsequent reuse of that  
 51 storage requires synchronization to ensure all threads have completed their accesses.

52 In effect, CUB attempts to retrofit CUDA with information about the compute and memory re-  
 53 quirements of collective operations. In fact, by prefixing functions with identifiers such as `Block`,  
 54 CUB creates ad-hoc “namespaces” for different functions that assume similar invariants. However,  
 55 without a type system to statically enforce these requirements in CUDA, correct usage of this function  
 56 depends on the user carefully reading and interpreting the documentation.

57 In this work, we ask the following question: can we provide low-level access to collective operations  
 58 while statically guaranteeing that they execute with the necessary configuration of compute re-  
 59 sources? By reifying these configurations as type-level *perspectives*—so named because they describe  
 60 the view of GPU resources against which each statement is written—we find that we can. Our key  
 61 insight is that GPU programmers naturally map computations onto different compute resources, and  
 62 unlike CUDA, which obscures this mapping, we can expose it in a type system that tracks perspectives.  
 63 This tracking enforces that collective operations are executed with the necessary resources, while  
 64 still allowing users to access low-level interfaces. Further, it allows users to define and compose their  
 65 own collective operations by specifying the perspective with which their code must be run.

66 Our approach is a departure from previous work, which attempts to resolve the mismatch between  
 67 per-thread syntax and collective execution by restricting access to low-level operations. Triton  
 68 [37], a tile-based, multi-dimensional array language, limits the user’s perspective to the block level,  
 69 which prevents programmers from writing the highest-performance kernels. A variety of functional  
 70 languages, such as Futhark [18], provide compile-time guarantees through their type systems but  
 71 are machine-agnostic: they cannot expose low-level control over hardware. Some other efforts,  
 72 like Descend [20], aim to provide a low-level, safe GPU systems programming language, but lack  
 73 support for collective operations like tensor cores. As a result, CUDA remains the de-facto standard  
 74 for writing high-performance kernels on modern GPUs.

75 We introduce Parallax, a new low-level GPU programming language that guarantees correct usage  
 76 of hardware resources by construction. Inspired by dependency calculi [1], Parallax statically tracks  
 77 the configurations of compute and memory resources with type-level perspectives. We also develop  
 78 Prism, a core calculus underpinning Parallax, provide formal rules for manipulating the perspectives  
 79 of both code and data, and prove *type-and-perspective safety*. This ensures that Prism is sound, and  
 80 that code always has the right perspective to execute operations at run time.

81 A parallel goal, alongside to correctness, is performance. We incorporate modern GPU features  
 82 such as tensor cores and asynchronous data movement into Parallax, and demonstrate that Parallax  
 83 can achieve the same performance as hand-written, highly optimized code on an H100 and a 4070  
 84 Ti Super. Our contributions are:

- 85
- 86 (1) Parallax, a low-level GPU programming language with type-level perspectives that track the  
   87 logical grouping of compute and memory resources (Section 3);
- 88 (2) Prism, a formal model of Parallax that tracks perspectives in both its type system and op-  
   89 erational semantics, along with a soundness theorem guaranteeing that programs execute  
   90 operations only when they have been statically proven to possess the necessary perspective.
- 91 (3) An implementation of Parallax (Section 5) that demonstrates that it can support modern  
   92 GPU features like tensor cores and asynchronous data movement, achieving performance  
   93 comparable to hand-optimized CUDA implementations (Section 6).

94  
 95 Section 7 discusses related work, and Section 8 discusses the limitations of our approach and  
 96 outlines future work.  
 97

```

99  1 int tid = threadIdx.x;
100 2 if (tid >= 0 && tid < 31){
101 3   float A[4];
102 4   float B[2];
103 5   float C[4] = { 0 };
104 6   # Populate A with unique values
105 7   for (int i = 0; i < 4; i++)
106 8     A[i] = tid * 4 + i;
107 9   # Populate B with unique values
10810  for (int i = 0; i < 2; i++)
10911    B[i] = tid * 4 + i;
11012   # Issue a warp-level tensor-core
11113   # operation: D = A * B + C
11214
11315   ← asm("mma.sync.aligned.m16n8k8.."
11416     "{%0, %1, %2, %3},    /*D*/"
11517     "{%4, %5, %6, %7},    /*A*/"
11618     "{%8, %9},           /*B*/"
11719     "{%10, %11, %12, %13}; /*C*/"
11820   : "r"(C[0]), "r"(C[1]), ...
11921   : "r"(A[0]), "r"(A[1]), ...
12022   : "r"(B[0]), "r"(B[1]),
12123   : "r"(C[0]), "r"(C[1]), ...);

```

Fig. 1. Invoking a warp-level tensor-core instruction in CUDA. We elide some typecasts necessary to invoke the assembly operation.

```

1  tid : int @thread[1] = id();
2  with group(thread[32]):
3    A : float[4] @thread[1]
4    B : float[2] @thread[1]
5    C : float[4] @thread[1]
6
7    # Populate A with unique values
8    for i in range(0, 4, 1):
9      A[i] = tid * 4 + i
10     C[i] = 0
11   # Populate B with unique values
12   for i in range(0, 2, 1):
13     B[i] = tid * 4 + i
14
15   # Issue tensor-core op.
16   intrinsic.mma(
17     A[0], A[1], A[2], A[3],
18     B[0], B[1],
19     C[0], C[1], C[2], C[3],
20     out=[A[0], A[1], A[2],
21          ← A[3]])
22

```

Fig. 2. Invoking a warp-level tensor-core instruction in Parallax.



Fig. 3. The `@` syntax represents the frequency at which a variable varies across threads T1–T32. Each `@` denotes a coloring of the variable; threads that “view” the variable with the same color must observe the same value.

## 2 Background & Motivation

Before diving into the design of Parallax, we begin with an overview of the GPU’s compute and memory hierarchies and outline the challenges posed by reasoning about them collectively. We also discuss how Parallax can prevent users from using these hierarchies incorrectly in ways that CUDA would otherwise permit.

### 2.1 Compute Hierarchy

In CUDA, programmers launch computations that run on thousands of threads. These threads are organized into a *compute hierarchy* that defines how work is distributed and scheduled on the GPU. At the top of this hierarchy is the *grid*, representing all threads launched as part of a single computation. The grid is divided into *blocks*, each containing a user-specified number of threads. A *thread*, in turn, is the finest unit of execution, and it is the machine’s basic unit of sequential control.

Users define the behavior of their computation by writing a single program which is replicated across all threads. This program controls the behavior of individual threads by reading built-in identifiers like `threadIdx.x`—to determine a thread’s position within a block—and `blockIdx.x`—to determine the block’s position within the grid—at runtime. By making control flow decisions based on these two variables, users can assign each thread to its share of the full computation.

A natural and tempting way to interpret these built-in variables is to think of them as the indices of implicit “parallel-for loops” surrounding the program, where each iteration is executing simultaneously. While this view is sufficient to understand CUDA’s programming model when threads do independent work, it quickly breaks down in the presence of collective operations.

Unlike most instructions, which are executed by a single thread, collective operations must be executed collaboratively by a group of threads. For example, on line 14 of Figure 1 we perform a “warp-level” tensor core operation [27], which is only *meaningful* when invoked by a collection of 32 threads—a *warp*—acting together. For the call, each thread sets up its portion of the operands, and the operation is performed *once* for the entire warp, with the results scattered across participating threads.

148 The first 32 threads in a block are tasked with this operation; programmers mentally group these 32  
 149 threads into a collective unit, and then reflect that grouping in the program via the `if`-statement on  
 150 line 2. If the condition on line 2 were instead `tid >= 0 && tid < 30`, making fewer than 32 threads  
 151 reach line 14, the instruction would be undefined. To make matters worse, the restriction is not  
 152 only on the number of threads executing the operation, but also on their alignment. In this case,  
 153 the starting ID of the group of 32 participating threads must be aligned to a multiple of 32. So, if the  
 154 condition on line 2 were instead `tid >= 1 && tid < 32`, the instruction would still be invalid, even  
 155 though 32 threads would execute it.

156 Collective operations make reasoning about CUDA programs challenging because they force pro-  
 157 grammers to track the *convergence behavior* of threads. Specifically, programmers must reason both  
 158 about how many threads reach a particular point in the program and how those threads are arranged,  
 159 accounting for alignment. This difficulty is further amplified due to two reasons. First, programs  
 160 may have multiple *points* of convergence, requiring programmers to mentally track the relative ID  
 161 of a thread within a logical group as that group evolves over the course of a program’s execution.  
 162 Second, threads may be participating in multiple *levels* of convergence within the same program. In  
 163 our example, we only considered the warp-level tensor core operation, but there are other collective  
 164 operations that require convergence at different granularities like the *warp-group*-level tensor core  
 165 operations, which must be issued collectively by four warps, or the block-level `__syncthreads()`  
 166 synchronization primitive, which must be executed by all threads within a block.

167 Reasoning about collective operations is already error-prone within the context of a single func-  
 168 tion, but becomes even more difficult when reasoning interprocedurally *across* functions. A callee  
 169 may assume a certain number of threads or blocks and may structure its computation, including  
 170 convergence behavior, around that assumption. However, such assumptions are not visible in the  
 171 callee’s function interface, which only exposes the input and output types. To invoke that function  
 172 correctly, users must read documentation, or worse, read the callee’s implementation to understand  
 173 its assumptions, breaking modular reasoning.

174 In Parallax, we materialize the programmer’s mental grouping of the compute hierarchy explicitly  
 175 in the program’s source. Consider the example in Figure 2 which shows an equivalent rewrite of  
 176 the CUDA program in Figure 1 using Parallax. In Figure 2, the call to the `mma` intrinsic is valid only  
 177 because it is executed within a `group` of 32 threads, as made explicit on line 2. Since Parallax’s `group`  
 178 construct already enforces both the size and alignment of participating threads, the validity of the  
 179 `mma` operation is guaranteed by construction.

## 180 2.2 Data and the Memory Hierarchy

181 Data on the GPU is organized into a *memory hierarchy*, mirroring  
 182 the compute hierarchy. All threads in a grid can read and write from  
 183 *global memory*, where operands typically reside at the start of a  
 184 computation and where results are eventually written. Each block  
 185 has access to a limited amount of *shared memory*, a programmer-  
 186 managed scratchpad typically used to stage repeatedly-used data.  
 187 Finally, each thread maintains its own state in *registers* and *local*  
 188 *memory*, which is used for storing the thread’s stack. Crucially,  
 189 registers and local memory are per-thread data, while shared and  
 190 global memory are accessible by multiple threads.

191 *Thread-Local Data.* Because registers and local memory are  
 192 owned by individual threads, a variable with the same name in  
 193 a program can have different values on different threads at run time.

```

1 int x = threadIdx.x;
2 if (x >= 0 && x < 31){
3     float A[4];
4     float B[2];
5     float C[4] = { 0 };
6
7     // k is A's stride
8     A[0] = a_mem[(x/4)*k+(x%4)];
9     A[1] = a_mem[(x/4)*k+(x%4)+1];
10    A[2] = a_mem[(x/4+8)*k+(x%4)];
11    A[3] = a_mem[(x/4+8)*k+(x%4)+1];
12
13    // n is B's stride
14    B[0] = b_mem[(x%4)*n+(x/4)];
15    B[1] = b_mem[(x%4+4)*n+(x/4)];
16
17    asm("mma_sync .");
18
19    // Write back into c_mem
20    // n is C 's stride
21    c_mem[(x/4)*n+2*(x%4)] = C[0];
22    c_mem[(x/4)*n+2*(x%4)+1] = C[1];
23    c_mem[(x/4+8)*n+2*(x%4)] = C[2];
24
25    ↪ c_mem[(x/4+8)*n+2*(x%4)+1]=C[3];
  
```

Fig. 4. Invoking a warp-level tensor core instruction in CUDA after loading data from memory.

197 In CUDA, threads can use this to induce divergent behavior. We've  
 198 already seen an example of this: the `tid` variable in Figure 1. Branching  
 199 on this variable is an instance of data-dependent control flow; in the general case, the possibility  
 200 of divergence lurks anywhere that execution branches on data .

201 A key challenge for Parallax is ensuring that all threads organized by other language constructs like  
 202 `group` remain logically unified, even in the face of data-dependent control flow, which is necessary to  
 203 compute anything interesting. To do this, Parallax must track the frequency with which values vary  
 204 in space. The `@` syntax attached to each variable declaration denotes the rate at which a value changes  
 205 across compute resources on the machine. For example, a `thread[1]` value can vary for every thread,  
 206 while a `block[1]` can vary for every block, but not for threads within that block. Intuitively, the `@`  
 207 construct “colors” a variable across threads, and any threads that share a color are required to agree  
 208 on the variable's value. Figure 3 illustrates valid and invalid colorings of a variable.

209 Using this information, Parallax enforces rules for reading from and writing to data. So, for example,  
 210 had we attempted to introduce a condition based on `tid` within the `group`'s scope in Figure 2, Parallax  
 211 would reject that program at compile time; code that diverges with low frequency cannot read from  
 212 variables that change at a higher frequency. We will explain the syntax and rules of legal Parallax  
 213 programs in greater detail in Section 3.

214  
 215 *Thread-Shared Data.* Shared and global memory, on the other hand, is not replicated by every  
 216 thread; instead, all threads in a block see the same shared memory, while all threads on the grid see  
 217 the same global memory.

218 CUDA does not explicitly model shared and global memory spaces , nor does it track whether  
 219 allocations in a given memory space exceed device limits. Parallax, on the other hand, distinguishes  
 220 these spaces and restricts shared memory to static allocations, throwing an error if an allocation  
 221 exceeds device limits. We will discuss these aspects in detail in ???. For now, however, we focus on  
 222 how memory converges and diverges in tandem with the compute hierarchy.

223 Because threads share their view of global and shared mem-  
 224 ory, these spaces are subject to convergent and divergent behav-  
 225 ior the same way that threads are. For example, when launching  
 226 a kernel on the GPU, a global memory pointer may initially  
 227 belong to the grid because every thread sees the same pointer  
 228 value at the start of the computation. To write to that mem-  
 229 ory, each thread computes an offset from the original base ad-  
 230 dress. In doing so, the pointer effectively *diverges* across threads  
 231 within the larger memory region so each thread can write in-  
 232 dependently. After these writes, the memory must *reconverge*,  
 233 ensuring that all threads have completed their updates before  
 234 the memory can safely be returned to its original logical owner.

235 Let us reconsider the tensor core example in Figure 1, this  
 236 time initializing the operands of the tensor core operation from  
 237 pointers into memory. In Figure 4, **A** and **B** are now populated  
 238 from data in global memory, **a\_mem** and **b\_mem**. After the tensor  
 239 core operation completes, the result is written back to **c\_mem**,  
 240 also in global memory. This variable **c\_mem** is logically accessed  
 241 from two different levels of the compute hierarchy. When enter-  
 242 ing this code, each thread in a group of 32 threads sees the same  
 243 value for **c\_mem**. Then, each thread within these 32 locates an  
 244 offset within **c\_mem** that it can write to (lines 21-24). To restore

```

1  @cuda("device")
2  @requires(thread[32])
3  def simple_mma(
4      a: ptr(const float) @ thread[32],
5      b: ptr(const float) @ thread[32],
6      c: ptr(float) @ thread[32]):
7      x : int @ thread[1] = id()
8      with group(thread[32]):
9          A : float[4] @ thread[1]
10         B : float[2] @ thread[1]
11         C : float[4] @ thread[1]
12
13         # Reads do not need to be lowered.
14         A[0] = a_mem[(x/4)*k+(x%4)];
15         A[1] = a_mem[(x/4)*k+(x%4)+1];
16         A[2] = a_mem[(x/4+8)*k+(x%4)];
17         A[3] = a_mem[(x/4+8)*k+(x%4)+1];
18         B[0] = b_mem[(x%4)*n+(x/4)];
19         B[1] = b_mem[(x%4+4)*n+(x/4)];
20
21         # Skipping initialize C to 0...
22         intrinsic.mma(
23             A[0], A[1], A[2], A[3],
24             B[0], B[1],
25             C[0], C[1], C[2], C[3],
26             out=[C[0], C[1], C[2], C[3]])
27
28         # Must be at thread[1] to write.
29         idx : int @ thread[1] =
30             lambda ro, co: (x/4+ro)*n+2*(x%4)+co
31             with partition(c_mem, p=thread[32], f=idx) as
32                 c_thrd:
33                     c_thrd[0, 0] = C[0]
34                     c_thrd[0, 1] = C[1]
35                     c_thrd[0, 2] = C[2]
36                     c_thrd[0, 3] = C[3]
37
38         # <-- Sync point inferred by name

```

Fig. 5. Invoking a warp-level tensor core instruction in Parallax after loading data from memory.

```

246 1 @ccuda("global")
247 2 # Top-level perspective bounds and shared memory usage.
248 3 #requires(grid[1], block[1], thread[32], smem=1280)
249 4 def mmaTF32NaiveKernel(A: ptr(const(float)) @ grid[1],
250 5 B: ptr(const(float)) @ grid[1],
251 6 C: ptr(float) @ grid[1],
252 7 M: int @ grid[1],
253 8 N: int @ grid[1],
254 9 K: int @ grid[1]):
```

**Fig. 6.** Tensor Float 32 matmul

```

for i in range(0, K_tiles, 1):
    # SYNC POINT (back edge from for loop)
    for j in range(0, 4, 1):
        global_row : int @ thread[1] = warp_row + row
        global_col : int @ thread[1] = i * 8 + col
        with partition(A_smem, p-thread[1], f=..) as A_th:
            with group(thread[1]):
                A_thd[0] = A[global_row * K + global_col]

# SYNC POINT (back edge from for loop)
for j in range(0, 2, 1):
    # Similar to write into A_smem ...
    with partition(B_smem, p-thread[1], f=..) as B_th:
        with group(thread[1]):
            B_th[0] = B[global_row_b * N + global_col_b]

# Give each warp an offset into C_smem
# SYNC POINT (back edge from for loop)
with claim(C_smem, p-thread[32]) as Cs_warp:
    match split(thread):
        case 32:
            # Call tensor core function inside
            # a function interface.
            simple_mma(A_smem, B_smem, Cs_warp)

# Write back final result to C
for j in range(0, 4, 1):
    flat_idx_c : int @ thread[1] = id() * 4 + j
    row_c : int @ thread[1] = flat_idx_c / MMA_K
    col_c : int @ thread[1] = flat_idx_c % MMA_K
    idx_d = lambda x: row_c * N + col_c + x
    with partition(C_blk, p-thread[1], f=idx) as C_th:
        with group(thread[1]):
            C_th[0] = C_smem[row_c * MMA_N + col_c]

```

**return**

74 **return**

**c\_mem** back to its 32 thread-level ownership, all 32 threads must first synchronize to ensure all per-thread writes have completed. This is similar to the requirement we saw documented in CUB in Section 1.

272 Similarly to the compute hierarchy (Section 2.1), CUDA programmers are responsible for tracking  
273 the logical owner of a piece of memory as it evolves over the course of a program, and for ensuring  
274 that appropriate synchronization occurs whenever that ownership changes.

Parallax, by contrast, makes the evolution of a memory object’s logical ownership explicit in the type system and *automatically* inserts the synchronization required to restore memory to its original ownership. In Figure 5, we show how memory is lowered through the compute hierarchy for the same tensor core operation introduced in Figure 4. The lowering of `c_mem` is required in this program as Parallax only permits writes when data is owned by a single thread. To lower `c_mem`, we use Parallax’s `partition` operation on line 31. The operation takes the name of the memory to partition, `c_mem`, and lowers it to a single thread, and assigns it a new name, `c_thrd`; Parallax will not allow references to the old name `c_mem` within the partition’s scope. The `partition` function also takes an indexing function, and each time `c_thrd` is accessed, this indexing function is implicitly applied. Once the `partition`’s scope ends, Parallax will insert a synchronization before the first use of the original name, so that all per-thread writes have completed.

We would like to emphasize that preventing data races is not one of Parallax’s goals. In Parallax, data races *can* occur; however, since the data is eventually synchronized before it is reused, the last-writer wins. Out-of-bounds accesses are considered undefined behavior. We believe that prior work, such as Descend [20], lays the foundation for reasoning about data-race free programs, and that a Descend-like memory model can be adapted to Parallax. Parallax is instead focused on the *interaction* between the compute and memory hierarchies, and on reasoning about them simultaneously to ensure that an operation is executed only when sufficient compute resources are available, a guarantee that prior work like Descend cannot provide.

### 295 3 The Parallax Language

296 Parallax is an imperative, low-level programming language designed at a level of abstraction comparable  
 297 to CUDA. Unlike CUDA, Parallax’s syntax materializes the mapping of a computation onto the  
 298 compute and memory hierarchy explicitly in the program source. Using this information, Parallax  
 299 enforces that programs only execute collective operations with sufficient resources.

300 To guide our discussion about Parallax’s design, we use the program in Figure 6 as a running  
 301 example. It computes a matrix multiplication between two float arrays, **A** and **B**, to produce an output  
 302 matrix **C**. In this program, each block computes a  $16 \times 8$  independent tile of the output. To do so, the it  
 303 first locates the tile index assigned to it (line 19-20). Next, threads in each block load corresponding  
 304 rows and columns from **A** and **B** (line 41-53) into shared memory. Finally, the program invokes a  
 305 warp-level, tensor-core instruction to compute the output on (line 62), requiring threads in a warp to  
 306 converge at the point. We encapsulate this tensor-core instruction in a function, demonstrating how  
 307 function composition works in Parallax. The function is the same as Figure 5.  
 308

#### 309 3.1 Levels

310 Parallax models the machine’s compute hierarchy through *levels*. There are three levels in Parallax—  
 311 **grid**, **block** and **thread**—which are organized as expected: a **grid** consists of multiple units of the  
 312 **block** level, which in turn consist of multiple units of the **thread** level. Levels have an order defined  
 313 over them, with **thread** < **block** < **grid**.

314 There are two key differences between Parallax’s levels and those in CUDA. First, Parallax does  
 315 not model CUDA’s three-dimensional grid or block structure. Second, there are two commonly-used  
 316 “levels”, *warp* and *warpgroup*, noticeably absent from our hierarchy. On the hardware, the units of each  
 317 level are arranged in a single linear order, and the three-dimensional structures of CUDA are simply  
 318 interpretations of this ordering, not distinct hardware resources. Similarly, warps and warpgroups are  
 319 organizational constructs defined in terms of existing levels. Namely, a warp is a group of 32 threads  
 320 whose first thread ID is aligned to 32. A warpgroup, which only became a meaningful construct with  
 321 the release of the Hopper architecture , consists of 128 threads aligned to 128.  
 322

323 Rather than baking these interpretations into Parallax by adding new dimensions and levels, we  
 324 let users express multi-dimensional structures and define groupings of custom sizes within Parallax.  
 325

#### 326 3.2 Perspectives

327 Perspectives are the central concept of Parallax. A perspective represents the layer of the hierarchy  
 328 from which the programmer is defining the machine’s behavior. They are what allow Parallax to  
 329 determine which compute resources the programmer is controlling, whether they are available in  
 330 the program’s context, and, once provided, if those resources are sufficient for a given operation.  
 331

332 A *perspective* is a level—**grid**, **block**, or **thread**—paired with a static constant **n**, specifying the  
 333 number of units at that level. A perspective is written as **level** [**n**] . For example, **thread**[2] denotes  
 334 a perspective of two threads, **block**[4] denotes a perspective of four blocks, and so on. Perspectives  
 335 also carry *alignment information*: a perspective of size **n** is aligned to **n**. In this way, a warp is simply  
 336 a desugaring of **thread**[32], and a warpgroup is a desugaring of **thread**[128].

337 Finally, perspectives have a partial order defined on them. We say that **level**<sub>2</sub>[**n**<sub>2</sub>] is *broader* than  
 338 **level**<sub>1</sub>[**n**<sub>1</sub>], or that **level**<sub>1</sub>[**n**<sub>1</sub>] is *narrower* than **level**<sub>2</sub>[**n**<sub>2</sub>], if either:

- 339     (1) **level**<sub>1</sub> < **level**<sub>2</sub>; or,
- 340     (2) **level**<sub>1</sub> = **level**<sub>2</sub> and **n**<sub>2</sub> % **n**<sub>1</sub> = 0.

341 The condition **n**<sub>2</sub> % **n**<sub>1</sub> = 0 enforces alignment, and we will see why it is required in Section 3.4.  
 342

### 344 3.3 Perspectives on Code

345 Code is associated with a set of perspectives, called a *perspective bound*. Perspective bounds directly  
 346 correspond to the compute resources a program expects to exercise control over the course of its  
 347 execution. At every point in the program, the perspective bound for that point indicates which  
 348 layer of the hierarchy is being programmed, and how that layer can be destructed into narrower  
 349 perspectives. For example, a line of code with a `{block[1], thread[4]}` perspective bound tells  
 350 Parallax that the current line of code is being programmed at the `block[1]` perspective and that  
 351 the `block[1]` perspective can be destructed into a some number of `thread[4]` perspectives. As a  
 352 shorthand, when we refer to a code's perspective, we mean the broadest perspective available in its  
 353 perspective bounds.

354 Each function starts out with a top-level perspective bound defined using the notation `@requires(grid[n1], b`  
 355 In Figure 6, the perspective bound is defined on line 2. Starting from this top-level declaration, pro-  
 356 grammers can shape the program's current perspective using two constructs `group` and `split`.

357  
 358 **Group.** The `group` construct lets programmers shift from a broader perspective to some number  
 359 of narrower perspectives contained in it. Operationally, the `group` construct does this by replicating  
 360 code written from the narrower perspective across the broader one. That is, `group` forks many copies  
 361 of a narrower perspective.

362 Let's consider this in context of our example. In Figure 6, execution begins at `grid[1]` on line 12.  
 363 At that point, a programmer logically controls the whole grid's behavior. To produce different output  
 364 tiles, the programmer needs to be able to control the behavior of individual blocks, and they shift  
 365 their perspective to `block[1]` on `todo`. Inside the `group(block[1])`'s scope, the behavior of the  
 366 machine can be defined with respect to a single block, and code is replicated across all blocks in the  
 367 grid.

368 As the reader may have already guessed, not all uses of `group`  
 369 are valid. The examples in Figure 7 have no interpretation on the  
 370 hardware. On line 4, the program tries to broaden its perspective to  
 371 `block[1]` from a `thread[1]` perspective, which is clearly illegal.  
 372 On the other hand, in the second example, the program tries to  
 373 narrow its perspective to a `block[5]` from a `block[6]` perspective.  
 374 While  $5 < 6$ , Parallax cannot evenly replicate `block[5]` perspective  
 375 inside a `block[6]` perspective, and so rejects this program. Recall,  
 376 from Section 3.2, that the condition on whether a perspective is  
 377 narrower than another is actually a condition on divisibility, not just size.

378 To eliminate such malformed cases, Parallax only allows an invocation of `group(level[n])` if the  
 379 current perspective bounds contains a perspective broader than `level[n]`. Once `group(level[n])`  
 380 is invoked, it modifies the current perspective bounds in two key ways. First, it removes all perspective  
 381 broader than `level[n]` from it. Second, it sets the broadest perspective within that `group`'s scope to  
 382 be `level[n]`.

383  
 384 **Split.** Unlike `group`, which is used for replication into multiple narrower perspectives (all of equal  
 385 size), `split` is used for sharding the current perspective unequally. For example, Figure 8 shows  
 386 a `split` from `thread[4]` perspective to one branch with `thread[2]` perspective and two with  
 387 `thread[1]` perspective. The three arms of the `split` execute independently in parallel<sup>1</sup>, as a form  
 388 of unordered composition. When `split(level)` is invoked, the perspectives of its branches diverge.  
 389 At the end of the `split`, they reconverge, and continue execution with the original perspective.

```

1 # Example 1
2 with group(thread[2]):
3   # Illegal because block > thread
4   with group(block[1]):
5     pass
6
# Example 2
7 with group(block[6]):
8   # Illegal because 6 % 5 != 0
9   with group([block[5]]):
10    pass

```

Fig. 7. Illegal uses of `group`.

390  
 391 <sup>1</sup>Despite the use of the pattern-matching `case` syntax, *all* branches of the `split` execute.

393    Use of **split** is necessary to write warp-specialized code, a  
 394    programming pattern used in high-performance kernels. Another  
 395    important use of **split**—masking off threads—can be found in our  
 396    running example. Line 63 in Figure 6 shows a **split** that requests  
 397    the first warp in the block, narrowing from **block[1]** into a single  
 398    **thread[32]**. Later in Figure 8, this warp is used to execute a tensor-  
 399    core operation.

400    Because **split** corresponds to unordered composition, it must provide each of its branches  
 401    their requested perspectives simultaneously. Parallax thus checks that the cumulative sum of the  
 402    perspectives requested by each branch of the **split** can be satisfied. For example, the program  
 403    in Figure 9 does not type check. Finally, because perspectives enforce alignment, every branch of  
 404    the **split** must also be aligned; not all **splits** whose sizes sum to less than or equal to the available  
 405    units are valid. Figure 10 shows an example of this constraint being violated.

406    Once **split** is invoked, for each branch that requests **n** units,  
 407    all perspectives broader than **level[n]** are removed from its per-  
 408    spective bound, and the available units for **level** are set to exactly  
 409    **n**.

### 411    3.4 Perspectives on Data

412    Section 3.2 described how a programmer can control different layers  
 413    of the hierarchy by changing the perspective of code through  
 414    **group** and **split**. To ensure these operations remain meaningful,  
 415    Parallax must ensure that compute units inside a perspective do  
 416    not diverge. Parallax must thus reason about how control flow in-  
 417    troduced through **for**, **while** and **if** statements affects the control  
 418    flow of the threads that comprise a perspective. To guarantee data-  
 419    dependent control flow does not cause divergence, Parallax tracks  
 420    how data may vary across the threads of a perspective. Parallax  
 421    distinguishes between two kinds of data: thread-local data (or, regis-  
 422    ter and stack data), and thread-shared data (or, shared memory and  
 423    global memory).

424    **3.4.1 Thread-Local Data.** Thread-local data is data stored in registers and stack-variables that is  
 425    private to each thread. In the program source in CUDA, a single thread-local variable may contain  
 426    different values at runtime simultaneously across different threads. Indeed, to compute anything  
 427    useful, threads must operate on different input values, and so reasoning about this divergent state is  
 428    essential.

429    Instead of imposing severe restrictions on data-dependent control flow, Parallax solves this problem  
 430    by assigning variables perspectives. Intuitively, the perspective of a variable tells Parallax the  
 431    frequency at which its values change in space. For example, a **grid[1]** variable is the same for all  
 432    threads in a grid, meanwhile a **block[2]** variable may change every two blocks. Crucially, this  
 433    frequency tells us that a **block[2]** variable is *indistinguishable* by threads within 2 blocks where  
 434    the starting block ID is aligned to 2. Figure 3 shows different frequencies as different colorings of a  
 435    variable.

436    Programmers specify the perspective that each variables live at in its declaration. A variable **v** of  
 437    type **int** is declared at **thread[1]** perspective using the syntax **v : int @ thread[1]**.<sup>2</sup> Variables  
 438    are only available for reading and writing from certain perspective. The rule can be summarized

```
1 with group(thread[4]):  
2   with split(thread):  
3     case 2:  
4       ...  
5     case 1:  
6       ...  
7     case 1:  
8       ...
```

Fig. 8. Simple example of **split**.

```
1 # Example 1  
2 with group(thread[4]):  
3   match split(thread):  
4     case 4:  
5       ...  
6     # Illegal! 4 + 1 > 4  
7     case 1:  
8       ...
```

Fig. 9. Illegal split exceeds the available perspectives.

```
1 # Example 2  
2 with group(thread[3]):  
3   match split(thread):  
4     case 2:  
5       ...  
6     case 1:  
7       ...  
8     # Illegal! 2 + 1 <= 3,  
9     # but the second branch  
10    # is not aligned by 2.
```

Fig. 10. Illegal split violates alignment.

<sup>2</sup>If not explicitly annotated, Parallax infers a variable's perspective to be the perspective of the code where it was declared.

442 as follows: Parallax allows programs to “write down” to narrower perspectives and “read up” from  
 443 broader ones.

444     **Read Up.** Variables can only be read if their perspective is at least  
 445 as broad as the current code perspective. Figure 11 gives an example  
 446 of an illegal read that would violate this constraint. Because the  
 447 variable **flag** varies across threads, branching on it causes a subset  
 448 of threads in the block to reach the `__syncthreads()`, causing a  
 449 deadlock, despite the fact that `__syncthreads()` should always be  
 450 safe in `block[1]`.  
 451

```
1 flag : bool @ thread[1] = ...
2 with group(block[1]):
3   if (flag)
4     __syncthreads();
```

Fig. 11. Illegal read of  
**thread[1]** variable.

452     **Write Down.** Writes are dually constrained. Only values that live at broader perspectives can be  
 453 written into variables that live at narrower perspectives. For example, a `block[1]` variable can be  
 454 used to write into a `thread[1]` variable, but values from a `thread[1]` variable cannot be written  
 455 into a `block[1]` variable. The latter constraint is demonstrated in Figure 12, where writing from a  
 456 `thread[1]` variable into a `block[1]` one again would cause a deadlock.  
 457

458     Together, the “read up” and “write down” rules ensure that in-  
 459 formation only flows from broader perspectives to narrower ones.  
 460 In Figure 6, we can see rule “read up” in action on lines 45 and 46. Meanwhile, “write down” is being used to set up variables on line  
 461 18-22.

```
1 x : bool @ thread[1] = ...
2 y : bool @ block[1] = ...
3 with group(block[1]):
4   y = x
5   if(y):
6     __syncthreads();
```

Fig. 12. Illegal write of  
**thread[1]** variable into a  
**block[1]** variable.

462     3.4.2 *Thread-Shared Data.* Unlike thread-local data, which is liter-  
 463 ally replicated across threads and is backed by distinct physical  
 464 storage, thread-shared data consists of pointers into shared and  
 465 global memory that are visible to some collection of threads. As a result, the perspective such data  
 466 inhabits can evolve as the program executes.  
 467

468     In Parallax, there are three mechanisms for obtaining thread-shared data. The first method is to  
 469 directly allocate data residing in shared memory (Parallax assumes that all global memory allocations  
 470 have been made before launching the CUDA kernel, as is typically the case with CUDA programs).  
 471 The second and third are to obtain offsets into existing data by using the `partition` or `claim`  
 472 operations. These constructs mirror `group` and `split`, and are used to temporarily narrow the  
 473 perspective associated with a pointer.

474     Parallax simplifies shared memory allocation by requiring all shared memory allocations have a  
 475 static size. Since shared memory is only visible to all threads in a block, a shared memory allocation  
 476 is only valid with if code has a `block[1]` perspective. An example of such an allocation is shown on  
 477 lines 28-30 of Figure 6. During compilation, Parallax will automatically handle the necessary pointer  
 478 arithmetic to assign each allocation an appropriate offset within the shared memory space. The  
 479 `@requires` annotation specifies the amount of shared memory a function expects to have available,  
 480 and Parallax uses this information to ensure that a function’s allocations do not exceed its declared  
 481 limit, and checks whether there is enough shared memory available at its call sites. Figure 6 declares  
 482 the amount of shared memory it will require on 2. We use standard techniques to statically bound  
 483 memory usage with Parallax’s type system.

484     Now, given an allocation into global or shared memory, Parallax introduces two language constructs  
 485 `partition` or `claim` that help programmers narrow the perspective at which a pointer resides.

486     **Partition.** The `partition` operation plays the same role for memory that `group` plays for code. It  
 487 is used to narrow the perspective of a pointer by creating many copies of it at a narrower perspective,  
 488 and adding to each a distinct offset from the original pointer.  
 489

For example, a use of `partition` could take an array `x : ptr(int) @ thread[4]` and compute offsets `x + i : ptr(int) @ thread[1]`. While the original address was identical across groups of 4 threads, the new offset pointers have diverged to point to different addresses on each thread. These new pointers can then be used within the scope of the `partition`. After the partition ends, the threads participating in the `partition` must synchronize to restore the original pointer `x` to its perspective.

Concretely, the `partition` operation takes a pointer variable `x`, an indexing function `f`, and produces a new variable `y` at a narrower perspective `level[n]` to be used in an inner scope.

Inside the scope of the partition, original variable `x` cannot be used, and the memory can only be accessed using `y`. Each use of `y` applies the indexing function `f` to compute the true offset of the access. In Figure 6, for example, we partition the `C` buffer to `block[1]` perspective on line 25 first, and then narrow it further to `thread[1]` perspective on line 78 so that we can actually write to it. When the scope of the `partition` statement ends, Parallax automatically inserts the necessary synchronization to restore the memory to its original perspective. Strictly speaking, this synchronization is only required at the next point the memory is to be *re-used*. Parallax has a mechanism for optimizing the placement of these synchronization points, which is described in Section 5.

At this point, it is necessary to consider how different indexing functions affect the possibility of data races. If the indexing function is injective, each narrower perspective receives a distinct offset into the underlying array, the resulting partition is free of data races. When the indexing function is not injective, multiple threads may write to the same location, introducing a potential data race. Out-of-bounds accesses are undefined behavior.

As we discussed in Section 2.2, achieving data-race freedom is *not* one of Parallax's goals, and our guarantees continue to hold even when indexing functions are not injective. Because Parallax automatically inserts the necessary synchronization before the original memory name is restored to its higher perspective, Parallax ensures that all writes have completed before accessing the original name. The subsequent uses of that pointer will observe the value written by the last writer.

**Claim.** The `claim` operation lets the programmer narrow a pointer's perspective without adding an offset, by giving it to only one collection of threads with that narrower perspective. For example, line 57 of Figure 6 shows an example of `claim`, where the pointer `C_smem @ block[1]` is *claimed* by a single warp in the block and narrowed to `Cs_warp @ thread[32]` to call a Tensor Core operation.

Concretely, the `claim` construct takes the original pointer variable `x`, the target perspective to narrow to, and a new name `y` to assign to the narrowed memory. As with `partition`, the original pointer `x` is not accessible within the scope of the `claim`. Moreover, if the scope of the `claim` includes a `splits`, the new pointer `y` is accessible only within *one* of its branches; sibling branches are not permitted to read or write from this memory.

### 3.5 The `id()` Function

As opposed to exposing users to special hardware `blockIdx.x`, `threadIdx.x` variables directly, Parallax provides the `id()` function instead. The `id` function returns the relative index from a given perspective. The interpretation of the `id` function depends on both the perspective of the variable it is being written to, and the perspective that the `id` function is being invoked in. In our example in Figure 6, we use a call to `id` in `grid[1]` scope at line 18 and 19 to locate the tile that each block is ultimately charge of computing. In Figure 2, however, a call to `id` in `thread[32]` scope at line 7 will not return the global thread ID of the threads launched in the kernel, but rather the *relative* thread ID within the 32 threads, ranging from 0 to 31.

### 540 3.6 Collective Operations

541 With perspectives on code and data in place, we can now turn our attention to Parallax's reasoning  
 542 about collective operations.

543 There are two ways to access collective operations in Parallax. The first is to use intrinsics provided  
 544 by the compiler. These intrinsics have built-in perspective classes attached to them, and Parallax  
 545 enforces that the operation is only ever executed if code is being executed from the correct perspective.  
 546 In Figure 2, the tensor-core call at line 22 is calling into intrinsics provided by our compiler. The  
 547 second, more flexible approach allows users to extend these operations using the `unsafe` feature.  
 548 Using unsafe code, users can inline a collective operation—often a single assembly instruction—into  
 549 a Parallax function and ensure that the function's top-level requirements, specified with `@require`,  
 550 faithfully describe the perspectives required to execute that instruction. Then, Parallax will check  
 551 the call to the function as any other function call, and will check whether it is being called with a  
 552 compatible perspective. The rules for checking function calls are described in Section 3.8.  
 553

### 554 3.7 Asynchrony

555 Asynchronous data-movement works similarly to other data operations.

556 Users can mark storage as asynchronous with the `async`  
 557 construct. As with `partition` and `claim`, invoking `async` re-  
 558 moves the old name and exposes a fresh name that is only  
 559 accessible within the scope of the `async` statement. Inside this  
 560 scope, Parallax restricts the new name to be used solely with  
 561 the asynchronous data-movement intrinsics that the language  
 562 provides. These intrinsics operate on a per-thread level, so Parallax will require users to program  
 563 from the `thread[1]` perspective.  
 564

565 Parallax models two primary forms of asynchronous data movement: `async mempy` memory  
 566 copies and tensor-memory accelerator instructions. We show an example of the former in Figure 13.  
 567 As with the data operations in ??, Parallax is responsible for inserting the necessary synchronization  
 568 before the program's first use of the original name, ensuring that the asynchronous transfer has  
 569 completed.

### 570 3.8 Function Composition

571 To check whether a function call is valid, Parallax needs to ensure that the call site can provide the  
 572 set of perspectives that a function will attempt to use over its execution, and that the arguments  
 573 passed to the function also respect the perspectives declared for its input and output types. Both  
 574 requirements are represented explicitly via the `@require` construct.

575 As mentioned in Section 3.2, each function carries a perspective class. We call this top-level  
 576 perspective class the function's *perspective signature*. The signature specifies the top-level perspective  
 577 class the function *assumes* will be provided to it. This signature represents the *minimum amount* of  
 578 compute and memory resources the function must be called with. In the example above in Figure 6,  
 579 line 2 defines the perspective signature. Up to this point, we have described programs written *inside*  
 580 functions under the assumption that the resource signature can be satisfied. In Section 3.2, for  
 581 example, we used this signature to set up the top-level perspective class for the function.  
 582

583 **Compute perspectives.** At a function's call site, then, we need to ensure that the function's  
 584 perspective signature can be satisfied. Snap compares the current compute perspective with the  
 585 one required by the function, and verifies that only functions whose perspective signatures can be  
 586 satisfied are called.  
 587

```
1 with async(old_name) as new_name:
2   match thread:
3     case 1:
4       cp_async(smem, new_name, 16);
```

Fig. 13. Example of an `cp.async` instruction in Parallax.

**Per-argument checks.** In Parallax, functions have pass-by-value semantics. For arguments, we distinguish between primitive data types and pointers to memory. For primitive data types, like `int`, `bool`, and stack allocations, users can pass either arguments that live at or above the data perspective specified by the function’s signature. For pointers, we allow passing pointers that live at higher data perspective if the pointer is marked as `const`, indicating that it will only be used for reading (line 4 and 5 in Figure 2). Otherwise, we require that pointers match the exact perspective of the function signature.

## 4 Formalization in Prism

Having introduced the full Parallax language, we now describe Prism, a core calculus that formalizes its most fundamental aspects by statically tracking perspectives on code and data. We use Prism to argue that well-typed Parallax programs are not only type-safe, but will also never get stuck trying to execute operations for which they lack the correct perspective.

In this section, we describe Prism’s type system and operational semantics—in particular how it manages compute and data perspectives—and build up to a formal proof of type-and-perspective safety. We instrument Prism’s operational semantics with runtime perspective enforcement: its rules will get stuck if they encounter an operation for which they have the wrong perspective. This runtime enforcement means that our safety theorem can guarantee that dynamically-realized perspectives match the ones inferred by the type system.

### 4.1 Prism Type System

The core idea in Prism is to track, at the type-level, the program’s perspective on code and data. To achieve this, we borrow techniques from the literature on *dependency tracking* [1]; in particular, the code perspective is tracked on the typing judgment, which has the form  $\Gamma \vdash^\pi e : \tau$  for expressions and  $\Gamma \vdash^\pi s$  for statements. The  $\pi$  over the  $\vdash$  is the code perspective on  $e$  and  $s$ , and is comprised of a level and a size, the same structure as a perspective in Parallax.

The typing context also tracks the perspective at which each variable lives; data can only be read or written from a variable when its perspective is compatible with that of the code interacting with it. This requirement is made manifest in the `T-VAR` rule, which can be found in Figure 14. Observe that the  $\pi$  in the variable rule must match exactly between data and code; principles like “read up” and “write down” are instead encoded directly in the rules for reading and writing, like `T-ARR-ACCESS`, which views the array being read with broader perspective than the current code perspective.

Figure 14 also shows other key rules, and these rules fall into two main categories: rules for managing the code perspective and rules for managing data perspective.

**4.1.1 Managing Perspectives on Code.** In Parallax, there are two operations which manage code perspectives: `group` and `split`. In Prism, to better model the details of how perspectives shift throughout programs, we introduce a third construct, `destruct`.

Prism’s `group` statement directly corresponds to Parallax’s, and is checked by the `T-GROUP` rule. Given some perspective that can be divided into  $q$  equally sized narrower  $\pi$ s, the statement `group q s` will check with perspective  $q \cdot \pi$  provided that  $s$  itself checks with perspective  $\pi$ . The starting perspective is of the form  $q \cdot \pi$ , and encodes Parallax’s alignment requirement.

The `split` statement, meanwhile, is checked by the `T-SPLIT` rule, and functions like a binary version of the n-ary `split` construct in Parallax. It enforces the same divisibility requirements to ensure that the perspectives on code and data remain properly aligned, and then checks the two sub-statements  $s_1$  and  $s_2$  with the divided, narrower perspectives.

The final rule for managing code perspective is `T-DESTRUCT`, which shifts perspective down the GPU hierarchy and makes explicit in Prism programs exactly where such shifts occur. The

|                                                                                                                                                                                                                                                                                                                           |                                                                                                                                                                               |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| $\boxed{\Gamma \vdash^\pi e : \tau}$<br>$\frac{x :^\pi \tau \in \Gamma}{\Gamma \vdash^\pi x : \tau}$ T-VAR                                                                                                                                                                                                                | <i>(Expression typing)</i>                                                                                                                                                    |
| $\frac{\Gamma \vdash^{\pi'} e_1 : \tau[] \quad \Gamma \vdash^\pi e_2 : \mathbf{int} \quad \pi \leq \pi'}{\Gamma \vdash^\pi e_1[e_2] : \tau}$ T-ARR-ACCESS                                                                                                                                                                 |                                                                                                                                                                               |
| $\boxed{\Gamma \vdash^\pi s}$                                                                                                                                                                                                                                                                                             | <i>(Statement typing)</i>                                                                                                                                                     |
| $\frac{\Gamma \vdash^{(h,n_1)} s_1 \quad \Gamma \vdash^{(h,n_2)} s_2 \quad n_1, n_2 \text{ align to } n}{\Gamma \vdash^{(h,n)} \mathbf{split}(n_1, n_2)\{s_1\}\{s_2\}}$ T-SPLIT<br>where $n_1, n_2 \text{ align to } n ::= (n_1 + n_2 \leq n) \text{ and } (n_1   n) \text{ and } (n_2   n) \text{ and } (n_2   n_1 + n)$ |                                                                                                                                                                               |
| $\frac{\Gamma \vdash^\pi s}{\Gamma \vdash^{q,\pi} \mathbf{group} q s}$ T-GROUP                                                                                                                                                                                                                                            | $\frac{\Gamma, y : \downarrow^\pi \tau[] \vdash^\pi s \quad l \neq \mathbf{Local}}{\Gamma, x :^\pi \tau[] \vdash^\pi \mathbf{lower} x \text{ into } y \text{ in } s}$ T-LOWER |
| $\frac{\Gamma, y :^{(h,n/c)} \tau[] \vdash^\pi s \quad c   n \quad l \neq \mathbf{Local}}{\Gamma, x :^{(h,n)} \tau[] \vdash^\pi \mathbf{partition} x \text{ into } y \text{ by } c \text{ in } s}$ T-PARTITION                                                                                                            | $\frac{\Gamma \vdash^\pi s}{\Gamma \vdash^\pi \mathbf{destruct} \text{ in } s}$ T-DESTRUCT                                                                                    |
| $\frac{\Gamma, y :^{(h,n')} \tau[] \vdash^\pi s \quad n' \leq n \quad l \neq \mathbf{Local}}{\Gamma, x :^{(h,n)} \tau[] \vdash^\pi \mathbf{claim} x \text{ into } y \text{ at } n' \text{ in } s}$ T-CLAIM                                                                                                                |                                                                                                                                                                               |

Fig. 14. Core typing rules of Prism. The typing rules presented here are a simplified selection of the full rules, which can be found in Appendix A.2.

$\downarrow$  operation on  $\pi$ s “destructs” the perspective into many narrower perspectives at a lower level;  $\downarrow(\mathbf{Grid}, 1) = (\mathbf{Block}, B)$  and  $\downarrow(\mathbf{Block}, 1) = (\mathbf{Thread}, T)$ , where  $B$  and  $T$  are parameters to a particular instantiation of Prism to describe the number of blocks per grid and threads per block.<sup>3</sup> Because  $\downarrow$  is only defined on  $(\mathbf{Grid}, 1)$  and  $(\mathbf{Block}, 1)$ , the rule enforces that one can only **destruct** their code perspective with exactly one grid or block.

4.1.2 *Managing Perspectives on Data.* The mechanism for managing data perspective mirrors that of code perspective, with each operation for data corresponding to an operation for code.

The **partition** operation is analogous to **group**-ing a code perspective. The typing rule for this operation, T-PARTITION, requires that the data perspective on the variable to be partitioned,  $x$ , is the same as the current perspective on code. After partitioning, a fresh variable  $y$  is introduced with a new perspective  $\pi/c$ , which we use as a shorthand to denote a division of a perspective  $\pi$  into  $c$  equally-sized narrower perspectives. Within the body of the **partition**, we disallow any references to the original variable  $x$  and continue checking with the original code perspective  $\pi$ ; the **partition** has no effect on the current code perspective.

Unlike **partition**, which divides up a piece of data equally among narrower perspectives, the **claim** operation views the claimed data with exactly one narrower perspective. Accordingly, Prism needs to ensure that only one branch of a **split** operation, with the appropriate  $\pi$ , can refer to the claimed variable. To ensure that this is the case, the T-CLAIM rule links the data perspective of the variable to the compute perspective of the code claiming it by changing *both* at the same time. This

<sup>3</sup>Prism is abstracted over these  $B$  and  $T$  values, so instead of tracking perspective bounds the way that Parallax does, it only tracks the top-level perspective described in Section 3.2.

$$\begin{array}{l}
687 \quad \frac{L(t), S(b), \Sigma, t, b, 0 \vdash^{(\text{Grid}, 1)} s \rightsquigarrow s' + \eta', \sigma', \Sigma' \quad P(t, b) = s}{L, S, \Sigma, P \rightsquigarrow L[t \mapsto \eta'], S[b \mapsto \sigma'], \Sigma', P[(t, b) \mapsto s']} \text{ S-PROGRAM} \\
688 \\
689 \\
690 \\
691 \quad \frac{p < n_1 \quad n_1, n_2 \text{ align to } n \quad \eta, \sigma, \Sigma, t, b, p \vdash^{(h, n_1)} s_1 \rightsquigarrow s'_1 + \eta', \sigma', \Sigma'}{\eta, \sigma, \Sigma, t, b, p \vdash^{(h, n)} \mathbf{split}(n_1, n_2)\{s_1\}\{s_2\} \rightsquigarrow \mathbf{split}(n_1, n_2)\{s'_1\}\{s_2\} + \eta', \sigma', \Sigma'} \text{ S-SPLIT-LEFT} \\
692 \\
693 \\
694 \quad \frac{p \geq n_1 \quad p < n_1 + n_2 \quad n_1, n_2 \text{ align to } n \quad \eta, \sigma, \Sigma, t, b, p - n_1, \vdash^{(h, n_2)} s_2 \rightsquigarrow s'_2 + \eta', \sigma', \Sigma'}{\eta, \sigma, \Sigma, t, b, p \vdash^{(h, n)} \mathbf{split}(n_1, n_2)\{s_1\}\{s_2\} \rightsquigarrow \mathbf{split}(n_1, n_2)\{s_1\}\{s'_2\} + \eta', \sigma', \Sigma'} \text{ S-SPLIT-RIGHT} \\
695 \\
696 \\
697 \\
698 \quad \frac{\eta, \sigma, \Sigma, t, b, p \mathbf{mod} n \vdash^{(h, n)} s \rightsquigarrow s' + \eta', \sigma', \Sigma'}{\eta, \sigma, \Sigma, t, b, p \vdash^{(h, q \cdot n)} \mathbf{group} \ q \ s \rightsquigarrow \mathbf{group} \ q \ s'; + \eta', \sigma', \Sigma'} \text{ S-GROUP} \\
699 \\
700 \\
701 \quad \frac{\eta, \sigma, \Sigma, t, b, t \mathbf{mod} T \vdash^{(\text{Thread}, T)} s \rightsquigarrow s' + \eta', \sigma', \Sigma'}{\eta, \sigma, \Sigma, t, b, 0 \vdash^{(\text{Block}, 1)} \mathbf{destruct \ in} \ s \rightsquigarrow \mathbf{destruct \ in} \ s' + \eta', \sigma', \Sigma'} \text{ S-DESTRUCT-BLOCK} \\
702 \\
703 \\
704 \\
705 \quad \frac{\eta, \sigma, \Sigma, t, b, p \vdash^{(\text{Block}, 1)} x := \mathbf{alloc \ Shared} \ \tau \ n \ \mathbf{in} \ s \rightsquigarrow s + \eta, \sigma[x \mapsto \pi(x, n)], \Sigma}{\eta, \sigma, \Sigma, t, b, p \vdash^{(\text{Block}, 1)} x := \mathbf{alloc \ Shared} \ \tau \ n \ \mathbf{in} \ s \rightsquigarrow s + \eta', \sigma', \Sigma} \text{ S-ALLOC-SHARED} \\
706
\end{array}$$

Fig. 15. Core semantic rules of Prism. As with the typing rules, we present only a simplified selection of the full rules, which can be found in Appendix A.3.

represents a minor difference from Parallax, which uses additional static analysis to ensure that a claimed variable is only accessed in a single **split** branch.

Lastly, the **T-LOWER** rule mirrors the **T-DESTRUCT** rule; it uses the  $\downarrow$  operator to move a variable from one level of the hierarchy to another, distributing it equally among all the narrower perspectives at that level in the same manner as **T-PARTITION**.

## 4.2 Prism Semantics

Having explained the key rules of the type system, we can move on to discuss Prism’s operational semantics. To reflect the fact that GPU programs execute in parallel across numerous threads, we model the semantics of Prism in the style of Turon et al. [40], using a two-level small step judgment. We present the key rules of this semantics in Figure 15.

The top level (i.e., device-level) judgment has just one rule: **S-PROGRAM**. This rule acts as a “frame” for the lower level (i.e., thread-level) judgment, and steps a collection of thread-ID-indexed local memories ( $L$ ), a collection of block-ID-indexed shared memories ( $S$ ), a global memory ( $\Sigma$ ), and a *thread pool* to an updated collection of memories and thread pool. The thread pool maps thread and block IDs to code, intuitively representing the program being executed by each thread at the current moment. The **S-PROGRAM** non-deterministically chooses a thread ID and block ID and steps it according to the thread-level judgment. This allows the semantics to model the full range of non-deterministic behavior arising from the GPU’s thread scheduler.<sup>4</sup>

The thread-level judgment has the shape  $\eta, \sigma, \Sigma, t, b, p \vdash^\pi s \rightsquigarrow s' + \eta', \sigma', \Sigma'$ , where

- $\eta$  denotes thread-local memory,
- $\sigma$  denotes shared memory,

<sup>4</sup>In reality, the GPU’s thread scheduler guarantees that the threads of a warp execute in lockstep, but modeling every thread as completely independent is both simpler and a conservative overestimate of the nondeterministic behavior of the GPU.

- 736     •  $\Sigma$  denotes global memory,  
 737     •  $t$  denotes the thread's ID,  
 738     •  $b$  denotes the ID of the block in which the thread lives, and  
 739     •  $p$  denotes the relative position of the thread within  $\pi$  (the perspective ID).

740 Critically, notice that a  $\pi$  also appears on the thread-level judgment just as it does on the typing  
 741 judgment. This is because the thread semantics *dynamically tracks and enforces* perspectives. The  
 742 same way evaluation of a program “gets stuck” if a value does not have the right type, the semantics of  
 743 Prism also get stuck if code attempts to access data or invoke commands with the wrong perspective.  
 744 As an example, observe the **S-ALLOC-SHARED** rule in Figure 15, which requires a (**Block**,1) perspective  
 745 and will fail to step if encountered with a different one. This runtime perspective is present in Prism,  
 746 but is erased by Parallax during compilation; we will later use it to prove that well-typed programs  
 747 will always execute with the same perspective that the type system viewed them with.  
 748

749 The semantic rules for perspectives involve manipulating  $p$  to track which threads take which code  
 750 paths when perspectives are **split** or **grouped**. Notice that in **S-PROGRAM**, the thread-level judgment  
 751 always begins with perspective (**Grid**,1): all the perspective management rules are congruences,  
 752 narrowing the perspective of further evaluation as determined by the particular rule used.

753 These rules take great care to ensure that  $p$  always describes the relative position of a compute  
 754 resource within its perspective; the payoff is that Prism’s semantics can later use this  $p$  value to model  
 755 the way that Parallax automatically adjusts indices into data when partitioning a data perspective.  
 756 Beyond these key rules for managing perspective on code, we have modeled many of the other  
 757 features of Parallax, such as asynchronous operations and thread synchronization, in Prism. To  
 758 handle features like these we equip the operational semantics with additional structure, including  
 759 sets of semaphores [12] for synchronization and a stack of effect handlers for modeling deferred  
 760 asynchronous computations inspired by Ahman and Pretnar [3]. We have elided these details here  
 761 for simplicity, but interested readers can see further details in Appendix A.3.

### 762 4.3 Soundness Theorem

763 Together, the type system and operational semantics allow us to prove the following syntactic  
 764 soundness theorem, which says that Prism programs are type safe and do not get stuck trying to  
 765 execute operations for which they lack the required perspective:

766 THEOREM 4.1. *(Type-and-Perspective Safety)*. For any program  $s$  such that  $\Gamma \vdash^\pi s$ , either:

- 767     (1)  $s$  is **skip**, or
- 768     (2) for any well-typed environments  $\eta$ ,  $\sigma$ , and  $\Sigma$ , there is an  $s'$ ,  $\eta'$ ,  $\sigma'$ , and  $\Sigma'$  such that  
 769        $\eta, \sigma, \Sigma, t, b, p \vdash^\pi s \rightsquigarrow^* s' \dashv \eta', \sigma', \Sigma'$  such that  $\Gamma' \vdash^\pi s'$ , where  $\Gamma'$  is an extension of  $\Gamma$ , and  $\eta'$ ,  $\sigma'$ , and  
 770        $\Sigma'$  are well-typed with respect to  $\Gamma'$ .

771 PROOF. Via the usual progress and preservation lemmas, available in Appendix A.4. □

772 It is worth noting that this soundness theorem guarantees a syntactic safety property, not a liveness  
 773 property: it does not guarantee that all threads sharing perspective  $\pi$  that *can* reach a program point  
 774 typed with  $\pi$  *will* eventually do so. Indeed, in the presence of nontermination, liveness does not  
 775 hold—some of the threads could **split** off and loop forever. While we believe the liveness version of  
 776 this theorem holds for a terminating fragment of Prism, it is not provable with syntactic methods; the  
 777 proof would require semantic techniques that are notoriously challenging and would be a research  
 778 contribution [4, 15, 40] in and of itself. We hope to tackle this proof for Prism in future work.

## 785 5 Implementation

786 Parallax is implemented as an embedded language in Python. Once a program type checks, Parallax  
 787 lowers it to a CUDA file. All perspective information is erased during this step, and the generated  
 788 CUDA contains no run-time checks. The file can then be compiled by `nvcc`, NVIDIA's closed-source  
 789 compiler, to produce an executable. Because Parallax operates at roughly the same abstraction  
 790 level as CUDA, there is a one-to-one mapping between most language constructs and their CUDA  
 791 counterparts. A notable change is the addition of three parameters to each device function: the  
 792 thread's relative ID, the block's ID, and an offset into shared memory that it can use for allocations.  
 793

794 *Inserting Synchronization.* Most of Parallax's implementation is straightforward,  
 795 but inserting synchronization points is more involved. As described in ??, once  
 796 data has been partitioned, Parallax is responsible for inserting the appropriate  
 797 synchronization points when the partition's scope ends.

798 To determine where these points need to be inserted, Parallax constructs a *data-control-flow graph*  
 799 from the program. Each node corresponds to a partition, and each  
 800 edge captures program-order precedence: the parent partition must complete before  
 801 the child begins. The graph can have backedges introduced through loops. In this  
 802 graph, each partition is categorized as a *read* or a *write* partition. Synchronization  
 803 points are inserted according to the following scheme:

- 804 (1) If the parent partition is a write partition, a synchronization point is inserted  
     805 to ensure that the current partition observes the most recent data; or
- 806 (2) If the current partition is a write partition, a synchronization point is inserted  
     807 to ensure that all preceding reads on the same memory have completed.

808 Figure 16 shows an example graph with the synchronization points derived from  
 809 these two conditions. The inferred synchronization points in Figure 6 have also  
 810 been marked on lines **TODO**.

811 Using this information, Parallax emits *wait* operations before a partition begins  
 812 (to wait for its parent) and *arrive* operations when a partition ends (to signal completion). CUDA exposes a general *split-barrier* primitive that we use to implement  
 813 these waits and arrives. For special cases, such as synchronizing an entire block or  
 814 warp, Parallax instead uses primitives like `__syncthreads()` or `__syncwarp()`.  
 815

816 Synchronization for asynchronous data movement is handled in exactly the  
 817 same way and uses the same underlying graph. CUDA allows asynchronous load  
 818 operations to be associated with a split barrier, so Parallax binds each asynchronous  
 819 transfer to the appropriate wait–arrive pair inferred from the graph. Certain features,  
 820 such as commit-group-style synchronization, require additional reasoning, and  
 821 Parallax performs further static analysis to insert the necessary synchronization  
 822 around these operations.

823 It is worth noting that naively inserting synchronization immediately after each partition would be  
 824 correct but prohibitively slow. To avoid this, Parallax applies two optimization passes. A wait-motion  
 825 pass pushes waits downward toward the first use of the partitioned name, and an arrive-motion pass  
 826 pulls arrives upward toward its last use, reducing unnecessary stalls while preserving correctness.  
 827

## 828 6 Evaluation

829 Having explained the design of Parallax, we now evaluate it in the context of four main questions:

830 **RQ1** Can Parallax express a diversity of composable CUDA programs?

831 **RQ2** Can Parallax express programs that use advanced GPU features?



Fig. 16. An example data-control-flow graph lifted by Parallax with its inferred synchronization points.

834 **RQ3** Can Parallax match the performance of existing, speed-of-light CUDA code?

835 To perform this evaluation, we use two GPUs. The first is the NVIDIA H100 SXM5, a server-grade  
 836 chip that supports tensor-core operations and a dedicated hardware copy engine, the Tensor Memory  
 837 Accelerator. Notably, the H100 introduces a new logical level called the *warpgroup* (collection of  
 838 4-aligned warps), and we show that our programming model can accommodate this new level.  
 839 Moreover, because the H100 has historically served as the primary GPU for large-scale AI training,  
 840 many CUDA kernels are already highly optimized for this hardware and achieve near-speed-of-  
 841 light performance, providing a rigorous baseline for comparison. To ensure our results generalize  
 842 beyond the H100, we additionally test programs on a second GPU, the NVIDIA 4070 SuperTi GPU, a  
 843 consumer-grade chip.

844 Like mentioned in Section 5, when name typechecks a program, it produces a CUDA file. We  
 845 compile the CUDA file with `nvcc` version 12.3 [24] with flags `-O3 -use_fast_math`. We initialize all  
 846 inputs using a random number generator. We report the average runtime sampled over 10 iterations,  
 847 following a warm-up phase of 5 iterations. All aggregate results reported in this section are geometric  
 848 means. All programs in the evaluation have been reproduced in Section B.

850 **6.1 RQ1: Can Parallax express a diversity of CUDA programs?**

851 We evaluate Parallax’s expressivity by writing programs that have fundamentally different patterns  
 852 of convergence, and writing a library modeled after CUB that contains composable pieces.

853 **6.1.1 Programs with Different Convergence Behavior.**

854 **Matrix Multiplication.** We chose matrix multiplication as our first benchmark for two main  
 855 reasons. First, there exist several implementations that achieve near-peak performance, providing a  
 856 strong baseline. Second, matrix multiplication allows for a range of increasingly sophisticated imple-  
 857 ments that stress different parts of the language, making it ideal for evaluating expressiveness.

858 To undertake this exploration, we adapt the codebase from [5] to implement matrix multiplication  
 859 on the `float` datatype, commonly referred to as `sgemm`. Concretely, we compute  $C = \alpha * A \times B + \beta$   
 860 where  $A$ ,  $B$  and  $C$  are matrices with appropriate sizes and  $\alpha$  and  $\beta$  are scalars. As this is a `float`  
 861 benchmark, it *does not need to* use advanced GPU features like asynchrony or tensor cores to achieve  
 862 speed-of-light performance. We will discuss these advanced features in Section 6.2.

863 We implement five variants of the `sgemm` benchmark: (1) a naive version that follows the traditional  
 864 single-program multiple-data pattern, written from the perspective of a thread; (2) a version that  
 865 exploits memory coalescing, still expressed at the thread level; (3) a version that builds on (2)  
 866 by introducing 2D tiling and staging data in shared memory, which requires shifting first to the  
 867 perspective of a block and then to the perspective of a thread, while also requiring block-level  
 868 synchronization; (4) a version that applies 2D tiling with vectorized loads, again written from the  
 869 perspective of a block; and (5) a version that combines the optimizations from (2) through (4) while  
 870 adding an additional level of tiling from the perspective of a warp.

871 We can express all five variants cleanly, and the resulting programs remain close to their CUDA  
 872 counterparts, which adopt a separation between perspectives through disciplined style. Parallax, on  
 873 the other hand, requires programmers to follow this discipline, and enforces it at compile time. The  
 874 only notable difference is that some expressions must be hoisted earlier in the program so that their  
 875 logical introduction corresponds to the required perspective structure. We evaluate the performance  
 876 of these variants in Section 6.3.

877 **Single-pass Parallel Prefix Scan with Decoupled Look-Back.** We also implement scan, a widely  
 878 used parallel primitive, in Parallax. We focus on the prefix-sum scan, which computes, for each  
 879 position in an array, the sum of all elements up to that position. Prefix-sum sits in a different corner of  
 880

883 the GPU design space from matrix multiplication: it is memory intensive, requires careful attention  
 884 to the convergence behavior of different threads, and traditionally requires multiple passes over data.  
 885

886 We implement the single-pass parallel prefix scan with decoupled look-back, introduced by Merrill  
 887 and Garland [21], an elegant algorithm that does not require multiple passes over the input data. In  
 888 this algorithm, each block computes a local prefix sum over its assigned region of the array. Once  
 889 finished, it writes the final element of its region into a global array. Each block then *looks back* to  
 890 accumulate the contributions of prior blocks, allowing them to independently determine the global  
 891 prefix without a full sweep over memory. This decoupled look-back mechanism lets blocks progress  
 at different speeds while still producing correct global results.

892 The algorithm involves several distinct points of convergence. Within each block, work is de-  
 893 composed into fine-grained thread-level and warp-level scans. After producing the local result,  
 894 blocks publish their partial prefix to global memory. Finally, each block waits until enough prefix  
 895 information from earlier blocks becomes available, at which point it accumulates the value and  
 896 completes its section of the global scan.

897 We implement this full strategy in Parallax. Our implementation uses the unsafe feature to  
 898 implement a global-memory spinlock that lets blocks check when the previous block's data is ready.  
 899

900 **6.1.2 Case Study: Can Parallax help build composable functions?** Over the course of our evaluation,  
 901 we found ourselves developing a small library of functions—similar in spirit to the CUB library—  
 902 that we could call to execute operations. In this section, we would like to qualitatively study how  
 903 Parallax can help programmers design libraries that they can compose with confidence.

904 As mentioned in Section 1, CUB occupies a unique design point in the GPU library ecosystem.  
 905 Unlike many other libraries such as cuBLAS [8], cuDNN [9], and cuSPARSE [10], which provide  
 906 global interfaces that users can call, CUB provides a device-side library organized into different levels.  
 907 It makes these levels apparent by prefixing each of its functions with **Device**, **Block**, and **Thread**.  
 908 The single-scan prefix sum in used variants of functions we translated into Parallax already.

909 Currently, we have a library of functions that perform loads functions across block, warp and  
 910 thread; store functions across block, warp, thread, and scans across a block and a warp. Let's turn  
 911 our attention to a particular CUB function—the store function—and examine how it is equivalently  
 912 expressed in Parallax, and how Parallax's type system reifies the assumptions implicit in CUB.

913 In CUB, the store function is implemented as a class, as shown below:

```
914
915 template<typename T, int BlockDimX,
916           int ItemsPerThread, BlockLoadAlgorithm Algorithm = BLOCK_LOAD_DIRECT,
917           int BlockDimY = 1, int BlockDimZ = 1>
918 class BlockStore:
```

919 To use it, users must first instantiate an object of this class, and then call it with shared memory.

```
920
921 using BlockLoad = cub::BlockLoad<int, 128, 4, BLOCK_LOAD_DIRECT>;
922 // Allocate shared memory for BlockLoad
923 __shared__ typename BlockLoad::TempStorage temp_storage;
924 int thread_data[4]; // Thread local data
925 BlockLoad(temp_storage).Load(d_data, thread_data);
```

926 CUB exposes a leaky abstraction, where information about the number of threads, block sizes, and  
 927 other details seeps through:

- 928 (1) The CUB documentation needs to specify the number of threads that the function can assume  
 929 to be available, because within the function, each thread must locate itself in the computation  
 930 and use its **threadIdx.x** accordingly. If the starting **threadIdx.x** is not 0, the function must  
 931 compute its relative ID internally.

- 932 (2) The “item per thread” design is interesting and serves two purposes. The first is related to  
 933 performance: if loops have constant bounds, they can be unrolled, enabling downstream  
 934 optimizations. The second is related to correctness: the function relies on the assumption  
 935 that all threads call the function with an equal number of values to load.  
 936 (3) The CUB documentation also needs to specify that `thread_data` can be data local to each  
 937 thread.  
 938 (4) Finally, the CUB documentation specifies that if shared memory is being overwritten, a  
 939 `--syncthreads()` call must be made to ensure that all reads have completed.

940 On the other hand, the same function has a completely different interface in Parallax:

```
942 1 #include <cuda.h>
943 2 #requires(block[1], thread[32])
944 3 def block_load(input : ptr(const(int)) @ block[1], output : ptr(int) @ thread[1], items_per_thread : int @ block[1]):
```

945 In Parallax, we are able to encapsulate code effectively, reducing the need to communicate numerous  
 946 implementation details through documentation:

- 947 (1) We do not need to pass in the number of threads at all. Whenever Parallax calls a function, it  
 948 threads the relative ID through, so each function can be written locally as if it were running  
 949 alone, rather than having to determine where the thread resides in the global array.  
 950 (2) We do not need to make `item_per_thread` a template argument for *correctness*. Its frequency  
 951 is set at the function signature, so Parallax will never allow a function to be called with a  
 952 value living at a narrower perspective than its signature requires.  
 953 (3) In our interface, `thread_data` is explicitly set to a thread-local value. Since it is not marked  
 954 as `const`, Parallax conservatively assumes it may be written to, and enforces at compile time  
 955 that only `thread[1]` values are passed in.  
 956 (4) Finally, using our synchronization pass outlined in Section 5, a `--syncthreads()` call will  
 957 be inserted automatically if `src` is going to be used for writing.

958 Moreover, our CUB functions is built as a composition of 2 other functions:

```
960 1 #include <cuda.h>
961 2 #requires(thread[32])
962 3 # block_load calls into warp_load
963 4 def warp_load(input : ptr(const(int)) @ thread[32], output : ptr(int) @ thread[1], items_per_thread : int @ thread[32]):
964 5
965 6 # warp_load calls into thread_load
966 7 #include <cuda.h>
967 8 #requires(thread[1])
968 9 def thread_load(input : ptr(const(int)) @ thread[1], output : ptr(int) @ thread[1], items_per_thread : int @ thread[1]):
```

969 If we had mistakenly attempted to call `warp_load` from inside `thread_load`, Parallax would  
 970 return an error, not silently compute fail to meet the contract set out up the function.

## 970 6.2 RQ2: Can Parallax express programs that use advanced GPU features?

971 To test whether Parallax can express programs that use advanced features of modern GPUs, we write  
 972 a matrix multiplication for the `bf16` datatype for the H100.

973 This benchmark is an acid test of our language, as the H100 `bf16` matrix multiplication pushes  
 974 several language features to the extreme. To write a matrix multiplication that can hit peak throughput  
 975 on an H100, we need to write a warp-specialized kernel that uses the Tensor Memory Accelerator  
 976 (TMA), an asynchronous hardware copy engine that can move tiles of data at a time, and uses  
 977 the warp-group-level tensor core instructions, or `wmma`, specifically introduced for the Hopper  
 978 architecture. A high-performance kernel for this matrix-multiplication overlaps computation with  
 979 data movement by pipelining loads.



Fig. 17. Performance of `sgemm` on square matrices as matrix dimension  $M=N=K$  increases.  
 Fig. 18. Performance for prefix sum as input array size increases.  
 Fig. 19. Performance of `hgemm` on square matrices as matrix dimension  $M=N=K$  increases.

The implementation in Parallax looks different from normal CUDA code, particularly in how pipelining is expressed. Since Parallax uses *names* and subsequent `partition` or `claim` constructs to determine the synchronization each region requires, when pipelining, we cannot dynamically change the pipeline slot simply by maintaining an index variable that wraps around based on the pipeline length. Instead, each pipeline slot must be given a separate name so that Parallax can track them independently and actually overlap compute with data-movement. This leads to pipeline slots that must be individually named and forces the load logic to be effectively “unrolled,” since we can no longer iterate over a pipeline-slot index. In turn, this forces that all pipelines in Parallax be statically sized. In practice, these pipelines are statically sized in CUDA anyway because they occupy shared memory, which is a small, finite resource that must be explicitly managed.

Notably, for this benchmark, in addition to `wgmma` and TMA, the program needs to dynamically reallocate registers between producer and consumer warpgroups, an instruction only available for the Hopper. This redistribution is a warpgroup-level collective operation, and Parallax can check it like any other collective operation. Moreover, getting `wgmma` to work did not require introducing a new perspective into the language; `thread[128]` was sufficient. We did, however, need to add a dedicated TMA-style asynchronous data-movement construct, since Parallax must eventually insert the appropriate synchronization for these transfers.

### 6.3 RQ3: Can Parallax match the performance of existing, speed-of-light CUDA code?

In Section 6.1 and Section 6.2, we examined programs that expressed the same computation in multiple ways, expressed operations that rely on multiple points of convergence, and used advanced GPU features. We now discuss the performance of these programs.

The performance of the matrix multiplication benchmarks is shown in , where we demonstrate that Parallax is competitive with cuBLAS [8].

For the prefix scan, we compare our performance to , the library introduced earlier in Section 1, and show that we can reach approximately % of CUB’s peak bandwidth.

Finally, we evaluate our H100 implementation and show that it is competitive with cuBLAS on square sizes, falling between %. We emphasize that this is an exacting benchmark, and achieving this level of performance requires writing large kernels with precise control over low-level features.

## 7 Related Work

Parallax builds on a rich tradition for systems languages for GPU programming and theoretical foundations for reasoning about parallel programs. However, Parallax differs from these systems in a crucial way: it treats collective operations, and therefore composability, as first-class concerns.

**Imperative Languages for GPUs.** CUDA [23], ROCm [2], and OpenCL [36] are imperative, C-style programming languages that expose low-level access to GPU hardware. These languages do not model perspectives for code or data, statically or otherwise; instead, programmers must

orchestrate computation on the machine by describing how individual threads execute code. The subtle and often silent failures permitted by this model are the motivation for Parallax.

Descend [20], a new, Rust-inspired language, uses a type system to track aspects of the compute and memory hierarchy, and, is thus the closest to our work. However, Descend’s main concern is preventing data races, while ours is ensuring that collective operations execute in valid contexts. Moreover, Descend lacks support for many collective operations like tensor cores; in fact, because Descend allows threads to read their own IDs and thus induce data-dependent control flow, adding such support would require major changes to the language. It is also worth noting that while Descend does formalize a type system, it does not attempt to prove any properties about it.

**Functional Languages for GPUs.** Futhark [19], Accelerate [6] and Vertigo [14] are functional array languages with compilers targeting CUDA. These languages expose a high level interface, abstracting away details of the hardware entirely in exchange for stronger safety guarantees. Parallax, however, makes an intentional decision to expose low-level details of the hardware, and thus does not trade performance for safety.

**Tile-Based Kernel DSLs.** More recently, tile-based GPU languages—Triton [38], Pallas [35], Tilus [13], Helion [28]—offer a middle ground between high-level abstraction and low-level control. However, these languages sidestep the question of composability entirely because they restrict programmers to a single vantage point in the GPU compute hierarchy: a block (Triton, Tilus, Helion), or a warpgroup (Pallas). Gluon [39] and is a new low-level, tile-based language. However, it does not check and cannot enforce that collective operations are executed at the correct layer of the hierarchy.

**Task-Based and Scheduling Languages.** Languages like Cypress [41], Halide [29], Fireiron [16] and RISE/ELEVATE [17] expose a mapping specification or scheduling language that lets users modify an existing reference program through external commands. Because they operate by transforming a fixed source-of-truth rather than generating the program directly, they expose a fundamentally different programming model than Parallax.

**Libraries Built on CUDA.** Many GPU libraries, such as CUTLASS [34], CUB [22], and ThunderKittens [30], expose device functions that operate at various levels of the compute hierarchy. These functions are typically organized via C++ namespaces to reflect their intended usage (e.g., warp-level, block-level). However, this organization is purely conventional: it encodes hierarchy through naming rather than enforcing it with a compiler. As a result, correct use requires careful discipline from both the library implementer (to uphold naming and usage invariants) and the user (to correctly interpret them). Any mismatch or subtle misunderstanding between the intended and actual use of these functions goes unchecked by the compiler, leaving room for fragile errors.

**Theoretical Foundations.** The design of Prism is heavily inspired by existing work on dependency tracking [1]. Dependency tracking calculi allow type systems to track how data and code depend on each other, and have commonly been used to implement secure information flow analyses [11]. In Prism, data that lives at a narrow perspective is unable to flow into data living at a broader perspective, and we use dependency tracking to capture this restriction in Prism’s type system.

## 8 Conclusion, Limitations, and Future Work

We have presented Parallax, a new, low-level GPU language that statically guarantees safe usage of compute resources by construction, without sacrificing low-level control. Parallax introduces a new mental model for writing GPU code, which we are excited to make more expressive and ergonomic.

Currently, Parallax does not natively support explicit references to pointers; programmers must rely on built-in constructs to interact with memory. Prior work such as Descend [20], which builds

on Rust’s ownership model [33], provides a promising direction for extending Parallax with more flexible pointer semantics.

We also aim to improve the ergonomics of programming with Parallax, particularly by enhancing the pipelining experience. As discussed in Section 6.2, Parallax currently requires pipeline slots to be explicitly name; we can improve this by adding language support for generating pipeline-style code.

Parallax can be extended to more architectures; in particular, it can accommodate newer GPUs—such as Blackwell—that support coarser-grained tensor-core operations. More broadly, we hope to generalize the model presented here to hierarchical compute environments, including distributed systems.

On the theoretical side, we plan to explore a terminating fragment of Prism and prove the liveness property discussed in Section 4.3: that all threads sharing perspective  $\pi$  eventually reach the parts of a program viewed at that  $\pi$ . This amounts to showing that threads sharing the same perspective execute the same code and observe the same data, which we hope to prove using logical relations, following Turon et al. [40] and Spies et al. [31, 32].

We believe that Parallax is a promising low-level substrate that enables confident composition and can serve as a foundation for building higher-level libraries and abstractions.

## References

- [1] Martín Abadi, Anindya Banerjee, Nevin Heintze, and Jon G. Riecke. 1999. A core calculus of dependency. In *Proceedings of the 26th ACM SIGPLAN-SIGACT symposium on Principles of programming languages* (San Antonio Texas USA, 1999-01). ACM, 147–160. doi:[10.1145/292540.292555](https://doi.org/10.1145/292540.292555)
- [2] Advanced Micro Devices, Inc. 2024. AMD ROCm™ Software. <https://www.amd.com/en/products/software/rocm.html>. Accessed: 2025-11-10.
- [3] Danel Ahman and Matija Pretnar. 2021. Asynchronous effects. *Proceedings of the ACM on Programming Languages* 5, POPL (Jan. 2021), 1–28. doi:[10.1145/3434305](https://doi.org/10.1145/3434305)
- [4] Lars Birkedal, Filip Sieczkowski, and Jacob Thamsborg. 2012. A Concurrent Logical Relation. In *Computer Science Logic (CSL ’12) - 26th International Workshop/21st Annual Conference of the EACSL (2012)*. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 107–121. doi:[10.4230/LIPIcs.CSL.2012.107](https://doi.org/10.4230/LIPIcs.CSL.2012.107)
- [5] Simon Boehm. 2022. SGEMM\_CUDA: Fast CUDA matrix multiplication from scratch. [https://github.com/sboehm/SGEMM\\_CUDA](https://github.com/sboehm/SGEMM_CUDA). GitHub repository. Accessed: 2025-11-10.
- [6] Manuel M.T. Chakravarty, Gabriele Keller, Sean Lee, Trevor L. McDonell, and Vinod Grover. 2011. Accelerating Haskell array codes with multicore GPUs. In *Proceedings of the Sixth Workshop on Declarative Aspects of Multicore Programming* (Austin, Texas, USA) (DAMP ’11). Association for Computing Machinery, New York, NY, USA, 3–14. doi:[10.1145/1926354.1926358](https://doi.org/10.1145/1926354.1926358)
- [7] NVIDIA Corporation. 2025. cub::BlockReduce. [https://nvidia.github.io/cccl/cub/api/classcub\\_1\\_1BlockReduce.html](https://nvidia.github.io/cccl/cub/api/classcub_1_1BlockReduce.html). Accessed: 2025-11-09.
- [8] NVIDIA Corporation. 2025. cuBLAS. <https://docs.nvidia.com/cuda/cublas/index.html> Accessed: 2025-11-09.
- [9] NVIDIA Corporation. 2025. cuDNN. <https://docs.nvidia.com/deeplearning/cudnn/latest/> Accessed: 2025-11-09.
- [10] NVIDIA Corporation. 2025. cuSPARSE. <https://docs.nvidia.com/cuda/cusparse/index.html> Accessed: 2025-11-09.
- [11] Dorothy E. Denning and Peter J. Denning. 1977. Certification of programs for secure information flow. *Commun. ACM* 20, 7 (July 1977), 504–513. doi:[10.1145/359636.359712](https://doi.org/10.1145/359636.359712)
- [12] Edsger Wybe Dijkstra. 1963. Over de sequentialiteit van procesbeschrijvingen. (1963). <https://training-ir7.tdl.org/handle/123456789/594>
- [13] Yaoyao Ding, Bohan Hou, Xiao Zhang, Allan Lin, Tianqi Chen, Cody Yu Hao, Yida Wang, and Gennady Pekhimenko. 2025. Tilos: A Tile-Level GPGPU Programming Language for Low-Precision Computation. arXiv:2504.12984 [cs.LG] <https://arxiv.org/abs/2504.12984>
- [14] C. Elliott. 2005. Vertigo — GPU Compiler & Embedded Language for Graphics Processors. <http://conal.net/Vertigo/>. Accessed: 2025-11-10.
- [15] Azadeh Farzan, Zachary Kincaid, and Andreas Podelski. 2016. Proving Liveness of Parameterized Programs. In *Proceedings of the 31st Annual ACM/IEEE Symposium on Logic in Computer Science*. ACM, New York NY USA, 185–196. doi:[10.1145/2933575.2935310](https://doi.org/10.1145/2933575.2935310)
- [16] Bastian Hagedorn, Archibald Samuel Elliott, Henrik Barthels, Rastislav Bodík, and Vinod Grover. 2020. Fireiron: A Data-Movement-Aware Scheduling Language for GPUs. In *Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques* (Virtual Event, GA, USA) (PACT ’20). Association for Computing Machinery, New York, NY, USA, 71–82. doi:[10.1145/3410463.3414632](https://doi.org/10.1145/3410463.3414632)

- [17] Bastian Hagedorn, Johannes Lenfers, Thomas Kundefinedhler, Xueying Qin, Sergei Gorlatch, and Michel Steuwer. 2020. Achieving high-performance the functional way: a functional pearl on expressing high-performance optimizations as rewrite strategies. *Proc. ACM Program. Lang.* 4, ICFP, Article 92 (Aug. 2020), 29 pages. doi:10.1145/3408974
- [18] Troels Henriksen, Niels G. W. Serup, Martin Elsman, Fritz Henglein, and Cosmin E. Oancea. 2017. Futhark: purely functional GPU-programming with nested parallelism and in-place array updates. In *Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation*. ACM, Barcelona Spain, 556–571. doi:10.1145/3062341.3062354
- [19] Troels Henriksen, Niels G. W. Serup, Martin Elsman, Fritz Henglein, and Cosmin E. Oancea. 2017. Futhark: purely functional GPU-programming with nested parallelism and in-place array updates. *SIGPLAN Not.* 52, 6 (June 2017), 556–571. doi:10.1145/3140587.3062354
- [20] Bastian Köpcke, Sergei Gorlatch, and Michel Steuwer. 2024. Descend: A Safe GPU Systems Programming Language. 8 (2024), 841–864. Issue PLDI. doi:10.1145/3656411
- [21] Duane Merrill and Michael Garland. 2016. *Single-pass Parallel Prefix Scan with Decoupled Look-back*. Technical Report NVR-2016-002. NVIDIA Corporation. Accessed: 2025-11-10.
- [22] NVIDIA Corporation. 2025. *CUB: CUDA Unbound*. <https://docs.nvidia.com/cuda/cub/index.html> CUDA Toolkit Documentation.
- [23] NVIDIA Corporation. 2025. *CUDA C++ Programming Guide*. NVIDIA. <https://docs.nvidia.com/cuda/cuda-c-programming-guide/> Accessed: 2025-11-06.
- [24] NVIDIA Corporation. 2025. CUDA Toolkit 12.3 Downloads (Archive). <https://developer.nvidia.com/cuda-12-3-0-download-archive>. Accessed: 2025-11-11.
- [25] NVIDIA Corporation. 2025. Parallel Thread Execution (PTX) ISA – Asynchronous Warpgroup Level Matrix Multiply-Accumulate Instructions. <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html?highlight=tngen05%2520cp#asynchronous-warpgroup-level-matrix-instructions>. Accessed: 2025-11-10.
- [26] NVIDIA Corporation. 2025. Parallel Thread Execution (PTX) ISA – TensorCore 5th Generation Instructions (tcgen05). <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html?highlight=tngen05%2520cp#tensorcore-5th-generation-instructions>. Accessed: 2025-11-10.
- [27] NVIDIA Corporation. 2025. Parallel Thread Execution (PTX) ISA – Warp Level Matrix Instructions. <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html?highlight=tngen05%2520cp#warp-level-matrix-instructions>. Accessed: 2025-11-10.
- [28] PyTorch Team at Meta. 2025. Helion: A High-Level DSL for Performant and Portable ML Kernels. <https://pytorch.org/blog/helion/>. Accessed: 2025-11-10.
- [29] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. *SIGPLAN Not.* 48, 6 (June 2013), 519–530. doi:10.1145/2499370.2462176
- [30] Benjamin Frederick Spector, Simran Arora, Aaryan Singhal, Arjun Parthasarathy, Daniel Y Fu, and Christopher Re. 2025. ThunderKittens: Simple, Fast, and \$\\textit{Adorable}\$\$ Kernels. In *The Thirteenth International Conference on Learning Representations*. <https://openreview.net/forum?id=0fjfVOSUra>
- [31] Simon Spies, Lennard Gähler, Daniel Gratz, Joseph Tassarotti, Robbert Krebbers, Derek Dreyer, and Lars Birkedal. 2021. Transfinite Iris: resolving an existential dilemma of step-indexed separation logic. In *Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation*. ACM, Virtual Canada, 80–95. doi:10.1145/3453483.3454031
- [32] Simon Spies, Neel Krishnaswami, and Derek Dreyer. 2021. Transfinite step-indexing for termination. *Proceedings of the ACM on Programming Languages* 5, POPL (Jan. 2021), 1–29. doi:10.1145/3434294
- [33] Rust Team. 2025. Rust Programming Language. <https://rust-lang.org/>
- [34] Vijay Thakkar, Pradeep Ramani, Cris Cecka, Aniket Shivam, Honghao Lu, Ethan Yan, Jack Kosaiyan, Mark Hoemmen, Haicheng Wu, Andrew Kerr, Matt Nicely, Duane Merrill, Dustyn Blasig, Aditya Atluri, Fengqi Qiao, Piotr Majcher, Paul Springer, Markus Hohnerbach, Jin Wang, and Manish Gupta. 2023. CUTLASS. <https://github.com/NVIDIA/cutlass>
- [35] The JAX Authors. 2024. Pallas:Mosaic GPU – A JAX Kernel Language for GPUs. <https://docs.jax.dev/en/latest/pallas/gpu/index.html>. Accessed: 2025-11-10.
- [36] The Khronos Group Inc. 2025. OpenCL™ – The Open Standard for Parallel Programming of Heterogeneous Systems. <https://www.khronos.org/opencl/>. Accessed: 2025-11-10.
- [37] Philippe Tillet. 2025. Introducing Triton: Open-source GPU programming for neural networks. <https://openai.com/index/triton/>
- [38] Philippe Tillet, H. T. Kung, and David Cox. 2019. Triton: an intermediate language and compiler for tiled neural network computations. In *Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages* (Phoenix, AZ, USA) (MAPL 2019). Association for Computing Machinery, New York, NY, USA, 10–19. doi:10.1145/3315508.3329973

- 1177 [39] triton-lang. 2025. “01-intro.py” – Introduction to Gluon tutorial in Triton. <https://github.com/triton-lang/triton/blob/main/python/tutorials/gluon/01-intro.py>. Accessed: 2025-11-10.
- 1178 [40] Aaron J. Turon, Jacob Thamsborg, Amal Ahmed, Lars Birkedal, and Derek Dreyer. 2013. Logical relations for fine-grained concurrency. In *Proceedings of the 40th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages* (Rome Italy, 2013-01-23). ACM, 343–356. doi:[10.1145/2429069.2429111](https://doi.org/10.1145/2429069.2429111)
- 1181 [41] Rohan Yadav, Michael Garland, Alex Aiken, and Michael Bauer. 2025. Task-Based Tensor Computations on Modern GPUs. *Proc. ACM Program. Lang.* 9, PLDI, Article 163 (June 2025), 25 pages. doi:[10.1145/3729262](https://doi.org/10.1145/3729262)
- 1183  
1184  
1185  
1186  
1187  
1188  
1189  
1190  
1191  
1192  
1193  
1194  
1195  
1196  
1197  
1198  
1199  
1200  
1201  
1202  
1203  
1204  
1205  
1206  
1207  
1208  
1209  
1210  
1211  
1212  
1213  
1214  
1215  
1216  
1217  
1218  
1219  
1220  
1221  
1222  
1223  
1224  
1225

1226 **A Complete Prism Type System, Semantics, and Syntactic Soundness Proofs**

1227 **A.1 Basic Definitions**

1229                  Hierarchy Levels  $h ::= \text{Grid} \mid \text{Block} \mid \text{Thread}$   
 1230                  Memory Kinds  $l ::= \text{Local} \mid \text{Shared} \mid \text{Global}$   
 1231                  Perspectives  $\pi : h \times \mathbb{N}$   
 1232                  Base Types  $b ::= \text{bool} \mid \text{int} \mid \text{float}$   
 1233                  Types  $\tau ::= b \mid b[]^l \mid \text{Fun}(\Gamma, \pi, m) \mid \text{async } \tau$   
 1234                  Contexts  $\Gamma ::= \cdot \mid \Gamma, x : \pi \tau$   
 1235                  Shared Memory Remaining  $m : \mathbb{N}$

1236  
 1237  
 1238  
 1239  
 1240        Perspectives  $\pi$  are part of an algebra parameterized over some constant values  $T$  (the number of  
 1241        threads per block) and  $B$  (the number of blocks per grid). With these values, we have two isomorphisms:

$$( \text{Block}, 1 ) \cong ( \text{Thread}, T )$$

1242        and

$$( \text{Grid}, 1 ) \cong ( \text{Block}, B )$$

1243  
 1244  
 1245  
 1246        The **group** and **split** operations of Parallax allows us to move along this isomorphism, from left  
 1247        to right. For clarity, in Prism we split these operations into three: a **split** operation that can split  
 1248        perspectives into multiple narrower ones, and a **destruct** operation that directly moves us along  
 1249        the isomorphism, and a **group** that divides our current perspective into equal sized parts.

1250        Perspectives  $\pi$  are also lexicographically ordered in the obvious way.  $hs$  are ordered such that  
 1251         $\text{Thread} \leq \text{Block} \leq \text{Grid}$ , and  $(h_1, n_1) \leq (h_2, n_2)$  iff  $n_1 | n_2$  and  $h_1 \leq h_2$ .

1252        We define scalar multiplication  $i \times h$  of natural numbers with  $hs$ :  $i \times (h, n) = (h, in)$ .

1253        We also define division of perspectives and hierarchy levels of type  $\pi \times \pi \rightarrow \mathbb{N}$ .  $\text{Grid}/\text{Block} = B$   
 1254        and  $\text{Block}/\text{Thread} = T$ . We lift this to perspectives like so:  $(h_1, n_1) / (h_2, n_2) = ((h_1/h_2) \cdot n_1) / n_2$ .

1255        Lastly we define a partial  $\downarrow$  operator on  $hs$  such that  $\downarrow \text{Block} = \text{Thread}$  and  $\downarrow \text{Grid} = \text{Block}$ . Note that  
 1256         $\downarrow \text{Thread}$  is undefined. This operator lifts to  $\pi s$  whose second component is 1 and encodes the leftward  
 1257        component of the isomorphism above:  $\downarrow (\text{Block}, 1) = (\text{Thread}, T)$  and  $\downarrow (\text{Grid}, 1) = (\text{Block}, B)$ .

1258        Note also that the presentation of these rules in the main body of the paper elide the  $m$  portion,  
 1259        which tracks the maximum amount of memory a given computation is allowed to use. In the full  
 1260        system presented here, both the typing rules and the operational semantics carry an additional piece  
 1261        of information tracking allocated memory.

1263 **A.2 Complete Prism Typing Rules**

1264 **A.2.1 Expressions.**

$$\frac{x : \pi \tau \in \Gamma}{\Gamma \vdash^\pi x : \tau} \text{-Var} \quad \frac{}{\Gamma \vdash^\pi n : \text{int}} \text{-Int} \quad \frac{}{\Gamma \vdash^\pi f : \text{float}} \text{-Float}$$

$$\frac{}{\Gamma \vdash^\pi b : \text{bool}} \text{-Bool} \quad \frac{\pi < (\text{Grid}, 1)}{\Gamma \vdash^\pi \text{partition\_id} : \text{int}} \text{-Partition-Id}$$

$$\frac{\Gamma \vdash^\pi' e_1 : \tau[]^l \quad \Gamma \vdash^\pi e_2 : \text{int} \quad l = \text{Global} \text{ or } l = \text{Local} \quad \pi \leq \pi'}{\Gamma \vdash^\pi e_1[e_2] : \tau} \text{-Arr-Access}$$

|                      |                                                                                                                                                                                                                                                                                                                                  |
|----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1275<br>1276<br>1277 | $\frac{\Gamma \vdash^{\pi'} e_1 : \tau[]^{\text{Shared}} \quad \Gamma \vdash^{\pi} e_2 : \text{int} \quad \pi \leq (\text{Block}, 1) \quad \pi \leq \pi'}{\Gamma \vdash^{\pi} e_1[e_2] : \tau} \text{ T-Arr-Access-Shared}$                                                                                                      |
| 1278<br>1279<br>1280 | $\frac{\Gamma \vdash^{\pi} e_1 : \text{int} \quad \Gamma \vdash^{\pi} e_2 : \text{int}}{\Gamma \vdash^{\pi} e_1 \text{ bop } e_2 : \text{int}} \text{ T-Bop} \quad \frac{\Gamma \vdash^{\pi} e_1 : \text{int} \quad \Gamma \vdash^{\pi} e_2 : \text{int}}{\Gamma \vdash^{\pi} e_1 \text{ cmp } e_2 : \text{bool}} \text{ T-Cmp}$ |
| 1281                 | <i>A.2.2 Statements.</i>                                                                                                                                                                                                                                                                                                         |
| 1282<br>1283<br>1284 | $\frac{f : \pi' \text{Fun}(x_i : \tau_i, \pi, m') \in \Gamma \quad \Gamma \vdash^{\pi} e_i : \tau_i \quad m' \leq m}{\Gamma \vdash_m^{\pi} f(e_1, \dots, e_n)} \text{ T-Function-Call}$                                                                                                                                          |
| 1285<br>1286<br>1287 | $\frac{\Gamma \vdash_m^{(h,n_1)} s_1 \quad \Gamma \vdash_m^{(h,n_2)} s_2 \quad n_1, n_2 \text{ align to } n}{\Gamma \vdash_m^{(h,n)} \text{split}(n_1, n_2)\{s_1\}\{s_2\}} \text{ T-Split} \quad \frac{\Gamma \vdash_m^{\downarrow\pi} s}{\Gamma \vdash_m^{\pi} \text{destruct in } s} \text{ T-Destruct}$                       |
| 1288<br>1289<br>1290 | $\frac{\Gamma \vdash_m^{(h,n)} s}{\Gamma \vdash_m^{(h,q \cdot n)} \text{group } q \ s} \text{ T-Group} \quad \frac{x : \pi' \tau \in \Gamma \quad \Gamma \vdash^{\pi} e : \tau \quad \pi' \leq \pi}{\Gamma \vdash_m^{\pi} x = e} \text{ T-Assn}$                                                                                 |
| 1291<br>1292<br>1293 | $\frac{\Gamma \vdash_m^{\pi} \text{init}_{\psi}}{\Gamma \vdash_m^{\pi} \text{dec}_{\psi}} \text{ T-Sync-Init} \quad \frac{\Gamma \vdash_m^{\pi} \text{dec}_{\psi}}{\Gamma \vdash_m^{\pi} \text{dec}_{\psi}} \text{ T-Sync-Dec}$                                                                                                  |
| 1294<br>1295         | $\frac{}{\Gamma \vdash_m^{\pi} \text{wait}_{\psi}} \text{ T-Sync-Wait} \quad \frac{}{\Gamma \vdash_m^{\pi} \text{skip}} \text{ T-Skip} \quad \frac{n \leq m}{\Gamma \vdash_m^{\pi} \text{free } n} \text{ T-Free}$                                                                                                               |
| 1296<br>1297<br>1298 | $\frac{\Gamma \vdash^{\pi} e : \tau \quad \Gamma, x : \pi' \tau \vdash_m^{\pi} s \quad \pi' \leq \pi \quad \tau \text{ not an array type}}{\Gamma \vdash_m^{\pi} x : \tau @ \pi' = e \text{ in } s} \text{ T-Decl}$                                                                                                              |
| 1299<br>1300<br>1301 | $\frac{\Gamma \vdash^{\pi'} e_1 : \tau[]^l \quad \Gamma \vdash^{\pi} e_2 : \text{int} \quad \Gamma \vdash^{\pi} e_3 : \tau \quad l = \text{Global or } l = \text{Local} \quad \pi' \leq \pi}{\Gamma \vdash_m^{\pi} e_1[e_2] = e_3} \text{ T-Arr-Assn}$                                                                           |
| 1302<br>1303<br>1304 | $\frac{\Gamma \vdash^{\pi'} e_1 : \tau[]^{\text{Shared}} \quad \Gamma \vdash^{\pi} e_2 : \text{int} \quad \Gamma \vdash^{\pi} e_3 : \tau \quad \pi \leq (\text{Block}, 1) \quad \pi' \leq \pi}{\Gamma \vdash_m^{\pi} e_1[e_2] = e_3} \text{ T-Arr-Assn-Shared}$                                                                  |
| 1305<br>1306<br>1307 | $\frac{\Gamma \vdash^{\pi} e : \text{bool} \quad \Gamma \vdash_{m_1}^{\pi} s_1 \quad \Gamma \vdash_{m_2}^{\pi} s_2}{\Gamma \vdash_{\max(m_1, m_2)}^{\pi} \text{if } e \text{ then } s_1 \text{ else } s_2} \text{ T-If}$                                                                                                         |
| 1308<br>1309<br>1310 | $\frac{\Gamma \vdash^{\pi} e : \text{bool} \quad \Gamma \vdash_m^{\pi} s}{\Gamma \vdash_m^{\pi} \text{while } e \text{ do } s} \text{ T-While} \quad \frac{\Gamma \vdash_{m_1}^{\pi} s_1 \quad \Gamma \vdash_{m_2}^{\pi} s_2}{\Gamma \vdash_{\max(m_1, m_2)}^{\pi} s_1 ; s_2} \text{ T-Seq}$                                     |
| 1311<br>1312<br>1313 | $\frac{\Gamma, x : \pi' \tau[]^l \vdash_m^{\pi} s \quad l = \text{Global or } l = \text{Local}}{\Gamma \vdash_{m+n \cdot \text{size}(\tau)}^{\pi} x := \text{alloc } l \ \tau \ n \text{ in } s} \text{ T-Alloc}$                                                                                                                |
| 1314<br>1315<br>1316 | $\frac{\Gamma, x : (\text{Block}, 1) \tau[]^{\text{Shared}} \vdash_m^{\pi} s}{\Gamma \vdash_{m+n \cdot \text{size}(\tau)}^{\pi} x := \text{alloc Shared } \tau \ n \text{ in } s} \text{ T-Alloc-Shared}$                                                                                                                        |
| 1317<br>1318<br>1319 | $\frac{\Gamma, y : (h, n/c) \tau[]^l \vdash_m^{(h, n)} s \quad c   n \quad l \neq \text{Local}}{\Gamma, x : (h, n) \tau[]^l \vdash_m^{(h, n)} \text{partition}_{\psi} x \text{ into } y \text{ by } c \text{ in } s} \text{ T-Partition}$                                                                                        |
| 1320<br>1321<br>1322 | $\frac{\Gamma, y : (h, n') \tau[]^l \vdash_m^{(h, n')} s \quad n' \leq n \quad l \neq \text{Local}}{\Gamma, x : (h, n) \tau[]^l \vdash_m^{(h, n)} \text{claim}_{\psi} x \text{ into } y \text{ at } n' \text{ in } s} \text{ T-Claim}$                                                                                           |
| 1323                 |                                                                                                                                                                                                                                                                                                                                  |

$$\begin{array}{c}
1324 \\
1325 \quad \frac{\Gamma, y : \downarrow^\pi \tau[]^l \vdash_m^\pi s \quad l \neq \text{Local}}{\Gamma, x : {}^\pi \tau[]^l \vdash_m^\pi \text{lower}_\psi x \text{ into } y \text{ in } s} \text{ T-Lower} \\
1326 \\
1327 \\
1328 \quad \frac{\Gamma, y : {}^{(\text{Thread},1)} \text{async } \tau[]^l \vdash_m^{(\text{Thread},1)} s}{\Gamma, x : {}^{(\text{Thread},1)} \tau[]^l \vdash_m^{(\text{Thread},1)} \text{async\_partition}_\phi x \text{ into } y \text{ in } s} \text{ T-Async-Partition} \\
1329 \\
1330 \\
1331 \quad \frac{\Gamma, x : {}^{(\text{Thread},1)} \text{async } \tau[], y : {}^{(\text{Thread},1)} \tau[]^l' \vdash_m^{(\text{Thread},1)} \text{async\_memcpy}(x,y)}{\Gamma, x : {}^\pi \tau[]^l, y : {}^\pi \tau[]^l' \vdash_m^\pi \text{memcpy}(x,y)} \text{ T-Async-Memcpy} \\
1332 \\
1333 \\
1334 \quad \frac{}{\Gamma, x : {}^\pi \tau[]^l, y : {}^\pi \tau[]^l' \vdash_m^\pi \text{memcpy}(x,y)} \text{ T-Memcpy} \\
1335 \\
1336 \quad \text{A.3 Complete Prism Semantics} \\
1337 \\
1338 \quad \text{A.3.1 Definitions.} \\
1339 \quad \text{Global Memory } \Sigma ::= \cdot | \Sigma, n \mapsto^\pi v \\
1340 \\
1341 \quad \text{Shared Memory } \sigma ::= \cdot | \sigma, n \mapsto^\pi v \\
1342 \quad \text{Local Memory } \eta ::= \cdot | \eta, n \mapsto^\pi v \\
1343 \\
1344 \quad \text{Block Memory Map } S ::= \forall n \in B, n \mapsto \sigma \\
1345 \\
1346 \quad \text{Thread Memory Map } L ::= \forall n \in T, n \mapsto \eta \\
1347 \\
1348 \quad \text{Synchronization Map } \Psi : \psi \rightarrow \mathbb{N} \rightarrow \mathbb{N} \\
1349 \\
1350 \quad \text{Deferred Computations Map } \Phi : \phi \rightarrow \{s\}
\end{array}$$

1351 In real GPUs, thread IDs are only unique within their block. However, in this calculus for simplicity  
1352 we assume thread IDs are global. One can convert back and forth between this abstracted notion  
1353 of a thread ID and a block-unique ID via addition modulo  $T$ . That is,  $t_{\text{real}} = t_{\text{simplified}} \bmod T$  and  
1354  $t_{\text{simplified}} = t_{\text{real}} + b \cdot T$ .

1355 By convention the names for local and shared and global memory do not conflict, as on the GPU  
1356 they will be separate pointer spaces. Additionally, we freely interchange between using names for  
1357 variables and integer locations.

1358 In the main body of the paper, for simplicity we elide the synchronization map and the deferred  
1359 computation map from the operational semantics, as our theorems do not make any guarantees  
1360 about non-interference. However, as they are part of the full semantics, we include them here  
1361 for completeness. By convention the synchronization and deferred computation maps are a total  
1362 functions, initialized to map to  $\lambda_. 0$  for  $\psi$ s and  $\lambda_. \{s\}$  for  $\phi$ s not explicitly initialize.

1363 The shape of the judgment for a single thread is  $\eta, \sigma, \Sigma, t, b, p, \Psi \vdash_m^\pi s \rightsquigarrow s' \dashv_m \eta', \sigma', \Sigma', \Psi'$ . The  $t$  here is  
1364 the thread ID, the  $b$  is the block ID, and the  $p$  is the perspective ID. The last of these three is modified  
1365 and managed by the rules for **split**, **group** and **destruct**, and tracks the relative position of the  
1366 thread within a group. This semantics is in a small step style.

1367 The shape of the judgment for expressions is  $\eta, \sigma, \Sigma \vdash_\pi^\pi e \Downarrow v$ . The two  $\pi$ s represent the ambient  
1368 compute context (i.e., the context in which resources are being read), while  $\pi'$ , represents the target  
1369 compute context (i.e., the compute context of the variable into which the result of the expression is  
1370 going to be written. This is relevant for computing the value of **partition\_id**, which divides the two  
1371 contexts. As a shorthand, we can divide a perspective by a scalar value like so:  $(h,n)/c = (h,n)/(h,c)$ .

1373 The overall evaluation of a program is expressed as

$$L, S, \Sigma, P, \Psi, \Phi \rightsquigarrow L', S', \Sigma', P', \Psi', \Phi'.$$

1376 In this judgment  $P$  serves as a *thread pool*, mapping pairs of thread and block IDs (which don't change)  
 1377 to statements and memory (which can be updated by stepping). One can think of  $P$  as tracking which  
 1378 program is running on each thread. This steps according to the following rule:

$$\frac{L(t), S(b), \Sigma, t, b, 0, \Psi, \Phi \vdash_m^{(\text{Grid}, 1)} s \rightsquigarrow s' \dashv_{m'} \eta', \sigma', \Sigma', \Psi', \Phi' \quad P(t, b) = (s, m)}{L, S, \Sigma, P, \Psi, \Phi \rightsquigarrow L[t \mapsto \eta'], S[b \mapsto \sigma'], \Sigma', P[(t, b) \mapsto (s', m')], \Psi', \Phi'} \text{ S-Program}$$

1382 For simplicity of notation, we define an **update** operation that searches the three environments  
 1383 for the one that contains the variable being used (by convention, there is no conflict between the  
 1384 environments, as in reality they exist in three separate address spaces). We also define a similar **get**  
 1385 operation that retrieves a variable from memory, and a **rename** operation that remaps a variable  
 1386 with the same value but under a different name.

$$\text{update}(\eta, \sigma, \Sigma, x, v) = (\eta[x \mapsto^\pi v], \sigma, \Sigma) \text{ when } x \in^\pi \eta$$

$$\text{update}(\eta, \sigma, \Sigma, x, v) = (\eta, \sigma[x \mapsto^\pi v], \Sigma) \text{ when } x \in^\pi \sigma$$

$$\text{update}(\eta, \sigma, \Sigma, x, v) = (\eta, \sigma, \Sigma[x \mapsto^\pi v]) \text{ when } x \in^\pi \Sigma$$

$$\text{get}(\eta, \sigma, \Sigma, x) = \eta(x) \text{ when } x \in^\pi \eta$$

$$\text{get}(\eta, \sigma, \Sigma, x) = \sigma(x) \text{ when } x \in^\pi \sigma$$

$$\text{get}(\eta, \sigma, \Sigma, x) = \Sigma(x) \text{ when } x \in^\pi \Sigma$$

$$\text{rename}(\eta, \sigma, \Sigma, x, y, \pi') = (\eta[y \mapsto^\pi \eta(x)], \sigma, \Sigma) \text{ when } x \in^\pi \eta$$

$$\text{rename}(\eta, \sigma, \Sigma, x, y, \pi') = (\eta, \sigma[y \mapsto^\pi \sigma(x)], \Sigma) \text{ when } x \in^\pi \sigma$$

$$\text{rename}(\eta, \sigma, \Sigma, x, y, \pi') = (\eta, \sigma, \Sigma[y \mapsto^\pi \Sigma(x)]) \text{ when } x \in^\pi \Sigma$$

### A.3.2 Perspective Management Rules.

$$\frac{p < n_1 \quad n_1, n_2 \text{ align to } n \quad \eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^{(h, n_1)} s_1 \rightsquigarrow s'_1 \dashv_{m'} \eta', \sigma', \Sigma', \Psi', \Phi'}{\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^{(h, n)} \text{split}(n_1, n_2)\{s_1\}\{s_2\} \rightsquigarrow \text{split}(n_1, n_2)\{s'_1\}\{s_2\} \dashv_{m'} \eta', \sigma', \Sigma', \Psi', \Phi'} \text{ S-Split-Left}$$

$$\frac{p < n_1 \quad n_1, n_2 \text{ align to } n}{\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^{(h, n)} \text{split}(n_1, n_2)\{\text{skip}\}\{s_2\} \rightsquigarrow \text{skip} \dashv_m \eta, \sigma, \Sigma, \Psi, \Phi} \text{ S-Split-Left-Done}$$

$$\frac{p \geq n_1 \quad p < n_1 + n_2 \quad n_1, n_2 \text{ align to } n \quad \eta, \sigma, \Sigma, t, b, p - n_1, \Psi, \Phi \vdash_m^{(h, n_2)} s_2 \rightsquigarrow s'_2 \dashv_{m'} \eta', \sigma', \Sigma', \Psi', \Phi'}{\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^{(h, n)} \text{split}(n_1, n_2)\{s_1\}\{s_2\} \rightsquigarrow \text{split}(n_1, n_2)\{s_1\}\{s'_2\} \dashv_{m'} \eta', \sigma', \Sigma', \Psi', \Phi'} \text{ S-Split-Right}$$

$$\frac{p \geq n_1 \quad p < n_1 + n_2 \quad n_1, n_2 \text{ align to } n}{\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^{(h, n)} \text{split}(n_1, n_2)\{s_1\}\{\text{skip}\} \rightsquigarrow \text{skip} \dashv_m \eta, \sigma, \Sigma, \Psi, \Phi} \text{ S-Split-Right-Done}$$

$$\frac{p \geq n_1 + n_2 \quad n_1, n_2 \text{ align to } n}{\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^{(h, n)} \text{split}(n_1, n_2)\{s_1\}\{s_2\} \rightsquigarrow \text{skip} \dashv_m \eta, \sigma, \Sigma, \Psi, \Phi} \text{ S-Split-None}$$

$$\begin{array}{c}
1422 \\
1423 \quad \frac{\eta, \sigma, \Sigma, t, b, t \text{ mod } T, \Psi, \Phi \vdash_m^{(\text{Thread}, T)} s \rightsquigarrow s' \dashv_m \eta', \sigma', \Sigma', \Psi', \Phi'}{\eta, \sigma, \Sigma, t, b, 0, \Psi, \Phi \vdash_m^{(\text{Block}, 1)} \text{destruct in } s \rightsquigarrow \text{destruct in } s' \dashv_m \eta', \sigma', \Sigma', \Psi', \Phi'} \text{ S-Destruct-Block} \\
1424 \\
1425 \\
1426 \\
1427 \quad \frac{\eta, \sigma, \Sigma, t, b, b \text{ mod } B, \Psi, \Phi \vdash_m^{(\text{Block}, B)} s \rightsquigarrow s' \dashv_m \eta', \sigma', \Sigma', \Psi', \Phi'}{\eta, \sigma, \Sigma, t, b, 0, \Psi, \Phi \vdash_m^{(\text{Grid}, 1)} \text{destruct in } s \rightsquigarrow \text{destruct in } s' \dashv_m \eta', \sigma', \Sigma', \Psi', \Phi'} \text{ S-Destruct-Grid} \\
1428 \\
1429 \\
1430 \quad \frac{\eta, \sigma, \Sigma, t, b, 0, \Psi, \Phi \vdash_m^{\pi} \text{destruct in } \text{skip} \rightsquigarrow \text{skip} \dashv_m \eta, \sigma, \Sigma, \Psi, \Phi}{\eta, \sigma, \Sigma, t, b, 0, \Psi, \Phi \vdash_m^{\pi} \text{destruct in } \text{skip} \rightsquigarrow \text{skip} \dashv_m \eta, \sigma, \Sigma, \Psi, \Phi} \text{ S-Destruct-Done} \\
1431 \\
1432 \\
1433 \quad \frac{\eta, \sigma, \Sigma, t, b, p \text{ mod } n, \Psi, \Phi \vdash_m^{(h, n)} s \rightsquigarrow s' \dashv_m \eta', \sigma', \Sigma', \Psi', \Phi'}{\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^{(h, q \cdot n)} \text{group } q \ s \rightsquigarrow \text{group } q \ s' \dashv_m \eta', \sigma', \Sigma', \Psi', \Phi'} \text{ S-Group} \\
1434 \\
1435 \quad \frac{\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^{\pi} \text{group } q \ \text{skip} \rightsquigarrow \text{skip} \dashv_m \eta, \sigma, \Sigma, \Psi, \Phi}{\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^{\pi} \text{group } q \ \text{skip} \rightsquigarrow \text{skip} \dashv_m \eta, \sigma, \Sigma, \Psi, \Phi} \text{ S-Group-Done} \\
1436 \\
1437
\end{array}$$

A.3.3 *Thread Synchronization*. We define a **size** operation on perspectives to compute the size of a perspective (the number of individual compute resources sharing it). The operation is defined thusly:

$$\begin{array}{c}
1440 \\
1441 \quad \text{size(Thread, } n) = n \\
1442 \quad \text{size(Block, } n) = n \cdot T \\
1443 \quad \text{size(Grid, } n) = n \cdot B \cdot T \\
1444 \\
1445 \\
1446 \\
1447 \quad \frac{\Psi(\psi)(p) = 0}{\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^{\pi} \text{wait}_{\psi} \rightsquigarrow \text{skip} \dashv_m \eta, \sigma, \Sigma, \Psi, \Phi} \text{ S-Sync-Wait-Done} \\
1448 \\
1449 \\
1450 \quad \frac{\Psi(\psi)(p) \neq 0}{\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^{\pi} \text{wait}_{\psi} \rightsquigarrow \text{wait}_{\psi} \dashv_m \eta, \sigma, \Sigma, \Psi, \Phi} \text{ S-Sync-Wait-Spin} \\
1451 \\
1452 \quad \frac{\Psi' = \Psi(\psi)[p \mapsto \Psi(\psi)(p) - 1]}{\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^{\pi} \text{dec}_{\psi} \rightsquigarrow \text{skip} \dashv_m \eta, \sigma, \Sigma, \Psi, \Phi} \text{ S-Sync-Dec} \\
1453 \\
1454 \\
1455 \quad \frac{\Psi(\psi)(p) = 0 \quad \Psi' = \Psi(\psi)[p \mapsto \text{size}(\pi)]}{\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^{\pi} \text{init}_{\psi} \rightsquigarrow \text{skip} \dashv_m \eta, \sigma, \Sigma, \Psi, \Phi} \text{ S-Sync-Init-Zero} \\
1456 \\
1457 \\
1458 \quad \frac{\Psi(\psi)(p) \neq 0}{\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^{\pi} \text{init}_{\psi} \rightsquigarrow \text{skip} \dashv_m \eta, \sigma, \Sigma, \Psi, \Phi} \text{ S-Sync-Init-Nonzero} \\
1459 \\
1460
\end{array}$$

A.3.4 *Asynchrony*.

$$\begin{array}{c}
1461 \\
1462 \quad \frac{\text{rename}(\eta, \sigma, \Sigma, x, y, (\text{Thread}, 1)), t, b, p, \Psi, \Phi \vdash_m^{(\text{Thread}, 1)} s \rightsquigarrow s' \dashv_m \eta', \sigma', \Sigma', \Psi', \Phi'}{\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_{m'}^{(\text{Thread}, 1)} \text{async\_partition}_{\phi} x \text{ into } y \text{ in } s \rightsquigarrow \text{async\_partition}_{\phi} x \text{ into } y \text{ in } s' \dashv_m \eta', \sigma', \Sigma', \Psi', \Phi'} \text{ S-Async-Partition-Congr} \\
1463 \\
1464 \\
1465 \\
1466 \\
1467 \quad \frac{\Phi = \Phi'[\phi \mapsto \Phi'(\phi) \cup \{s\}]}{\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^{(\text{Thread}, 1)} \text{async\_partition}_{\phi} x \text{ into } y \text{ in skip} \rightsquigarrow (\text{async\_partition}_{\phi} x \text{ into } y \text{ in } s) \dashv_m \eta, \sigma, \Sigma, \Psi, \Phi} \text{ S-Async-Partition-Unwind} \\
1468 \\
1469 \\
1470
\end{array}$$

|      |                                                                                                                                                                                                                                 |                               |
|------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------|
| 1471 | $\Phi(\phi) = \emptyset$                                                                                                                                                                                                        | <b>S-Async-Partition-Done</b> |
| 1472 |                                                                                                                                                                                                                                 |                               |
| 1473 | $\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^{\text{Thread}, 1} \text{async\_partition}_\phi x \text{ into } y \text{ in } \text{skip} \rightsquigarrow$                                                                 |                               |
| 1474 | $\text{skip} \dashv_m \eta, \sigma, \Sigma, \Psi, \Phi$                                                                                                                                                                         |                               |
| 1475 |                                                                                                                                                                                                                                 |                               |
| 1476 |                                                                                                                                                                                                                                 | <b>S-Async-Memcpy</b>         |
| 1477 |                                                                                                                                                                                                                                 |                               |
| 1478 | $\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^{\text{Thread}, 1} \text{async\_memcpy}(x, y) \rightsquigarrow$                                                                                                             |                               |
| 1479 | $\text{skip} \dashv_m \eta, \sigma, \Sigma, \Psi, \Phi[\phi \mapsto \Phi(\phi) \cup \{\text{memcpy}(x, y)\}]$                                                                                                                   |                               |
| 1480 |                                                                                                                                                                                                                                 |                               |
| 1481 | $(\eta', \sigma', \Sigma') = \text{update}(\eta, \sigma, \Sigma, x, \text{get}(\eta, \sigma, \Sigma, y))$                                                                                                                       | <b>S-Memcpy</b>               |
| 1482 | $\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^\pi \text{memcpy}(x, y) \rightsquigarrow \text{skip} \dashv_m \eta', \sigma', \Sigma', \Psi, \Phi$                                                                          |                               |
| 1483 | A.3.5 <i>Variables and Memory.</i>                                                                                                                                                                                              |                               |
| 1484 |                                                                                                                                                                                                                                 |                               |
| 1485 | $\eta, \sigma, \Sigma \vdash_{\pi'}^\pi e \Downarrow v \quad \pi' \leq \pi$                                                                                                                                                     | <b>S-Decl</b>                 |
| 1486 | $\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^\pi x : \tau @ \pi' := e \text{ in } s \rightsquigarrow s \dashv_m \eta[x \mapsto^{\pi'} v], \sigma, \Sigma, \Psi, \Phi$                                                    |                               |
| 1487 |                                                                                                                                                                                                                                 |                               |
| 1488 | $\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^\pi \text{free } n \dashv_{m-n} \eta, \sigma, \Sigma, \Psi, \Phi$                                                                                                           | <b>S-Free</b>                 |
| 1489 |                                                                                                                                                                                                                                 |                               |
| 1490 |                                                                                                                                                                                                                                 | <b>S-Alloc-Local</b>          |
| 1491 | $\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^\pi x := \text{alloc Local } \tau n \text{ in } s \rightsquigarrow$                                                                                                         |                               |
| 1492 | $s; \text{free } (n \cdot \text{size}(\tau)) \dashv_{m+n \cdot \text{size}(\tau)} \eta[x \mapsto^\pi \langle x, n \rangle], \sigma, \Sigma, \Psi, \Phi$                                                                         |                               |
| 1493 |                                                                                                                                                                                                                                 |                               |
| 1494 | $\pi = (\text{Block}, 1)$                                                                                                                                                                                                       | <b>S-Alloc-Shared</b>         |
| 1495 | $\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^\pi x := \text{alloc Shared } \tau n \text{ in } s \rightsquigarrow$                                                                                                        |                               |
| 1496 | $s; \text{free } (n \cdot \text{size}(\tau)) \dashv_{m+n \cdot \text{size}(\tau)} \eta, \sigma[x \mapsto^\pi \langle x, n \rangle], \Sigma, \Psi, \Phi$                                                                         |                               |
| 1497 |                                                                                                                                                                                                                                 |                               |
| 1498 |                                                                                                                                                                                                                                 | <b>S-Alloc-Global</b>         |
| 1499 | $\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^\pi x := \text{alloc Global } \tau n \text{ in } s \rightsquigarrow$                                                                                                        |                               |
| 1500 | $s; \text{free } (n \cdot \text{size}(\tau)) \dashv_{m+n \cdot \text{size}(\tau)} \eta, \sigma, \Sigma[x \mapsto^\pi \langle x, n \rangle], \Psi, \Phi$                                                                         |                               |
| 1501 |                                                                                                                                                                                                                                 |                               |
| 1502 | $x \in \pi' \quad \eta, \sigma, \Sigma \quad \eta, \sigma, \Sigma \vdash_{\pi'}^\pi e \Downarrow v \quad (\eta', \sigma', \Sigma') = \text{update}(\eta, \sigma, \Sigma, v) \quad \pi' \leq \pi$                                | <b>S-Assn</b>                 |
| 1503 | $\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^\pi x = e \rightsquigarrow \text{skip} \dashv_m \eta', \sigma', \Sigma', \Psi, \Phi$                                                                                        |                               |
| 1504 |                                                                                                                                                                                                                                 |                               |
| 1505 | $\eta, \sigma, \Sigma, t, b, p \vdash_\pi^\pi e_1 \Downarrow \langle l, n \rangle \quad i < n \quad \pi' \leq \pi$                                                                                                              |                               |
| 1506 | $\eta, \sigma, \Sigma, t, b, p \vdash_\pi^\pi e_2 \Downarrow i \quad x \in \pi' \quad \eta, \sigma, \Sigma$                                                                                                                     |                               |
| 1507 | $\eta, \sigma, \Sigma \vdash_{\pi'}^\pi e_3 \Downarrow v \quad (\eta', \sigma', \Sigma') = \text{update}(\eta, \sigma, \Sigma, l+i, v)$                                                                                         | <b>S-Arr-Assn</b>             |
| 1508 | $\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^\pi e_1[e_2] = e_3 \rightsquigarrow \text{skip} \dashv_m \eta', \sigma', \Sigma', \Psi, \Phi$                                                                               |                               |
| 1509 |                                                                                                                                                                                                                                 |                               |
| 1510 | $s' = s[(y + c \cdot p)/y] \quad (\eta', \sigma', \Sigma') = \text{rename}(\eta, \sigma, \Sigma, x, y, \pi/c)$                                                                                                                  | <b>S-Partition</b>            |
| 1511 | $\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^\pi \text{partition}_\psi x y c \rightsquigarrow \text{init}_\psi; s'; \text{dec}_\psi; \text{wait}_\psi \dashv_m \eta', \sigma', \Sigma', \Psi, \Phi$                      |                               |
| 1512 |                                                                                                                                                                                                                                 |                               |
| 1513 | $(\eta', \sigma', \Sigma') = \text{rename}(\eta, \sigma, \Sigma, x, y, (h, n_1))$                                                                                                                                               | <b>S-Claim</b>                |
| 1514 | $\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^{(h, n_1 + n_2)} \text{claim}_\psi x \text{ into } y \text{ at } n_1 \text{ in } s \rightsquigarrow$                                                                        |                               |
| 1515 | $\text{init}_\psi; \text{split}(n_1, n_2); \{s'\}; \{\text{skip}\}; \text{dec}_\psi; \text{wait}_\psi \dashv_m \eta', \sigma', \Sigma', \Psi, \Phi$                                                                             |                               |
| 1516 |                                                                                                                                                                                                                                 |                               |
| 1517 | $(\eta', \sigma', \Sigma') = \text{rename}(\eta, \sigma, \Sigma, x, y, \downarrow \pi)$                                                                                                                                         |                               |
| 1518 | $\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^\pi \text{lower}_\psi x \text{ into } y \text{ in } s \rightsquigarrow \text{init}_\psi; s; \text{dec}_\psi; \text{wait}_\psi \dashv_m \eta', \sigma', \Sigma', \Psi, \Phi$ | <b>S-Lower</b>                |
| 1519 |                                                                                                                                                                                                                                 |                               |

## A.3.6 Control Flow.

$$\frac{\eta, \sigma, \Sigma \vdash_{\pi}^{\pi} e \Downarrow \text{true}}{\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^{\pi} \text{if } e \text{ then } s_1 \text{ else } s_2 \rightsquigarrow s_1 \dashv_m \eta, \sigma, \Sigma, \Psi, \Phi} \text{ S-If-True}$$

$$\frac{\eta, \sigma, \Sigma \vdash_{\pi}^{\pi} e \Downarrow \text{false}}{\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^{\pi} \text{if } e \text{ then } s_1 \text{ else } s_2 \rightsquigarrow s_2 \dashv_m \eta, \sigma, \Sigma, \Psi, \Phi} \text{ S-If-False}$$

$$\frac{}{\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^{\pi} \text{while } e \text{ do } s \rightsquigarrow \text{if } e \text{ then } (s; \text{while } e \text{ do } s) \text{ else skip} \dashv_m \eta, \sigma, \Sigma, \Psi, \Phi} \text{ S-While}$$

$$\frac{\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^{\pi} s_1 \rightsquigarrow s'_1 \dashv_{m'}^{\pi'} \eta', \sigma', \Sigma', \Psi', \Phi'}{\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^{\pi} s_1; s_2 \rightsquigarrow s'_1; s_2 \dashv_{m'}^{\pi'} \eta', \sigma', \Sigma', \Psi', \Phi'} \text{ S-Seq-First}$$

$$\frac{}{\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^{\pi} \text{skip}; s_2 \rightsquigarrow s_2 \dashv_m \eta, \sigma, \Sigma, \Psi, \Phi} \text{ S-Seq-Done}$$

$$\frac{\Sigma(f) = \{[x_1 : \tau_1, \dots, x_n : \tau_n], s\} \quad \sigma, \Sigma \vdash_{\pi}^{\pi} e_i \Downarrow v_i \quad m' \leq m}{\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^{\pi} f(e_1, \dots, e_n) \rightsquigarrow s \dashv_m \eta[x_i \mapsto v_i], \sigma, \Sigma, \Psi, \Phi} \text{ S-Function-Call}$$

## A.3.7 Expressions.

$$\frac{\pi < (\text{Grid}, 1) \quad \pi' \leq \pi}{\eta, \sigma, \Sigma \vdash_{\pi'}^{\pi} \text{partition\_id} \Downarrow \pi / \pi' - 1} \text{ E-Partition-Id}$$

$$\frac{}{\eta, \sigma, \Sigma \vdash_{\pi}^{\pi} x \Downarrow \text{get}(\eta, \sigma, \Sigma, x)} \text{ E-Var}$$

$$\frac{\eta, \sigma, \Sigma \vdash_{\pi'}^{\pi} e_1 \Downarrow \langle l, n \rangle \quad \eta, \sigma, \Sigma \vdash_{\pi'}^{\pi} e_2 \Downarrow i \quad i < n \quad \pi' \leq \pi}{\eta, \sigma, \Sigma \vdash_{\pi'}^{\pi} e_1[e_2] \Downarrow \text{get}(\eta, \sigma, \Sigma, l+i)} \text{ E-Arr-Access}$$

$$\frac{}{\eta, \sigma, \Sigma \vdash_{\pi'}^{\pi} \text{Int} \Downarrow n} \text{ E-Int} \quad \frac{}{\eta, \sigma, \Sigma \vdash_{\pi'}^{\pi} \text{Bool} \Downarrow b} \text{ E-Bool}$$

$$\frac{\eta, \sigma, \Sigma \vdash_{\pi'}^{\pi} e_1 \Downarrow v_1 \quad \eta, \sigma, \Sigma \vdash_{\pi'}^{\pi} e_2 \Downarrow v_2 \quad v = v_1 \text{ bop } v_2}{\eta, \sigma, \Sigma \vdash_{\pi'}^{\pi} e_1 \text{ bop } e_2 \Downarrow v} \text{ E-Bop}$$

$$\frac{\eta, \sigma, \Sigma \vdash_{\pi'}^{\pi} e_1 \Downarrow v_1 \quad \eta, \sigma, \Sigma \vdash_{\pi'}^{\pi} e_2 \Downarrow v_2 \quad v = v_1 \text{ cmp } v_2}{\eta, \sigma, \Sigma \vdash_{\pi'}^{\pi} e_1 \text{ cmp } e_2 \Downarrow v} \text{ E-Cmp}$$

## A.4 Theorems and Proofs

Note that in this section we assume no out of bounds array accesses. In general Parallax (and by extension Prism) makes no guarantees about array out of bounds.

**A.4.1 More definitions.** As a premise to our type safety theorems, we need to assume we have a well-typed environment, written  $\Gamma \vdash \eta, \sigma, \Sigma$ . We define what this means inductively

$$\frac{}{\eta, \sigma, \Sigma \vdash n : \text{int}} \text{ V-Int} \quad \frac{}{\eta, \sigma, \Sigma \vdash b : \text{bool}} \text{ V-Bool} \quad \frac{}{\eta, \sigma, \Sigma \vdash f : \text{float}} \text{ V-Float}$$

$$\frac{\forall i < n, \eta, \sigma, \Sigma \vdash \text{get}(\eta, \sigma, \Sigma, x + i) : \tau}{\eta, \sigma, \Sigma \vdash \langle x, n \rangle : \tau[]^l} \text{ V-Array} \quad \frac{\Gamma, x_i : \pi \tau_i \vdash_m^{\pi} s \quad \Gamma \vdash \cdot, \cdot, \text{fns } \Sigma}{\eta, \sigma, \Sigma \vdash \{x_i : \tau_i, s\} : \text{Fun}(x_i : \tau_i, \pi, m)} \text{ V-Function}$$

$$\frac{}{\cdot \vdash \eta, \sigma, \Sigma} \text{ G-Empty} \quad \frac{\eta(x) =^{\pi} v \quad \eta, \sigma, \Sigma \vdash v : \text{int}}{\Gamma, x : \pi \text{ int} \vdash \eta, \sigma, \Sigma} \text{ G-Int}$$

$$\begin{array}{c}
 \frac{\eta(x) =^\pi v \quad \eta, \sigma, \Sigma \vdash v : \text{bool}}{\Gamma, x :^\pi \text{bool} \vdash \eta, \sigma, \Sigma} \text{ G-Bool} \quad \frac{\eta(x) =^\pi v \quad \eta, \sigma, \Sigma \vdash v : \text{float}}{\Gamma, x :^\pi \text{float} \vdash \eta, \sigma, \Sigma} \text{ G-Float} \\
 \\ 
 \frac{\eta(x) =^\pi v \quad \eta, \sigma, \Sigma \vdash v : \tau[]}{\Gamma, x :^\pi \tau[]^{\text{Local}} \vdash \eta, \sigma, \Sigma} \text{ G-Local} \quad \frac{\sigma(x) =^\pi v \quad \eta, \sigma, \Sigma \vdash v : \tau[]}{\Gamma, x :^\pi \tau[]^{\text{Shared}} \vdash \eta, \sigma, \Sigma} \text{ G-Shared} \\
 \\ 
 \frac{\Sigma(x) =^\pi v \quad \eta, \sigma, \Sigma \vdash v : \tau[]}{\Gamma, x :^\pi \tau[]^{\text{Global}} \vdash \eta, \sigma, \Sigma} \text{ G-Global} \quad \frac{\Sigma(x) =^\pi v \quad \eta, \sigma, \Sigma \vdash v : \text{Fun}(\Gamma', \pi, m)}{\Gamma, x :^\pi \text{Fun}(\Gamma', \pi, m) \vdash \eta, \sigma, \Sigma} \text{ G-Function}
 \end{array}$$

We can prove a couple simple lemmas about well-typed environments under operations like **rename**, **update**, and **get**.

LEMMA A.1. (*Well-typed get*) If  $\Gamma \vdash \eta, \sigma, \Sigma$  and  $x :^\pi \tau \in \Gamma$  then  $\eta, \sigma, \Sigma \vdash \text{get}(\eta, \sigma, \Sigma, x) : \tau$ .

LEMMA A.2. (*Well-typed rename*) If  $\Gamma, x :^\pi \tau \vdash \eta, \sigma, \Sigma$  then  $\Gamma, x :^\pi \tau, y : \tau' \vdash \text{rename}(\eta, \sigma, \Sigma, x, y, \tau')$ .

LEMMA A.3. (*Well-typed update*) If  $\Gamma \vdash \eta, \sigma, \Sigma$  and  $\eta, \sigma, \Sigma \vdash v : \tau$  then  $\Gamma, x :^\pi \tau \vdash \text{update}(\eta, \sigma, \Sigma, x, v)$ .

We also define a well-formedness precondition on  $p$  with respect to  $\pi$ :

$$(h, n) \vdash p ::= p < n$$

We also define well-formedness for the async stack:

$$\Gamma \vdash \Phi ::= \forall \phi, s \in \Phi(\phi), \Gamma \vdash_m^{(\text{Thread}, 1)} s$$

## A.5 Proofs

LEMMA A.4. (*Expression Safety*) If  $\Gamma \vdash^\pi e : \tau$  and  $\Gamma \vdash \eta, \sigma, \Sigma$  and  $\pi \vdash p$  and  $\pi' \leq \pi$ , then there is some  $v$  such that  $\eta, \sigma, \Sigma \vdash_{\pi'} e \Downarrow v$  and  $v : \tau$ .

PROOF. This proof proceeds by induction on the typing relation for expressions. Despite the fact that this property implies termination, we do not need a logical relation to prove it because the expression language is very simple.

The **T-Int**, **T-Float**, and **T-Bool** cases are trivial, using the rules **E-Int**, **E-Bool** and **E-Float** to compute values. In the case for **T-Partition-Id**, the  $\pi$  premises of the typing rules and our assumption that  $\pi' \leq \pi$  match the premises of the evaluation rule, so this rule is simple as well.

The cases for **T-Bop** and **T-Cmp** follow directly from the inductive hypotheses, assuming a valid and correctly implemented set of binary operators and comparators.

The only interesting cases are **T-Arr-Access** and **T-Arr-Access-Shared**.

In both cases our inductive hypotheses and inversion give us that  $\pi' \leq \pi$ , and  $e_1$  evaluates to a  $\langle x, n \rangle$ , and that all the values between  $x$  and  $x + n$  in the appropriate environment are typed at  $\tau$ . We also know that  $e_2$  evaluates to an integer  $i$ . We assume that all array accesses are in bounds, so  $i < l$ , which is sufficient to use the **E-Arr-Access** rule to complete this case, and the proof.  $\square$

LEMMA A.5. (*Expression Determinism*) If  $\eta, \sigma, \Sigma, t, b, p \vdash_\pi^\pi e \Downarrow v_1$  and  $\eta, \sigma, \Sigma, t, b, p \vdash_\pi^\pi e \Downarrow v_2$  then  $v_1 = v_2$ .

PROOF. Straightforward by induction on the semantic derivation.  $\square$

LEMMA A.6. (*Expression Well-Typedness*) If  $\Gamma \vdash_{\pi'}^\pi e : \tau$  and  $\Gamma \vdash \eta, \sigma, \Sigma$  and  $\pi \vdash p$ , and  $\eta, \sigma, \Sigma, t, b, p \vdash_{\pi'}^\pi e \Downarrow v$ , then  $v : \tau$ .

PROOF. By our expression safety lemma our well-typed expression must evaluate to a well-typed value  $v'$ . By our determinism lemma  $v'$  must be the same as  $v$ , so  $v$  is well-typed.  $\square$

LEMMA A.7. (*Statement Progress*) If  $\Gamma \vdash \eta, \sigma, \Sigma$  and  $\Gamma \vdash_m^\pi s$  and  $\pi \vdash p$ , then either  $s$  is **skip** or there is some  $s'$  such that  $\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^\pi s \rightsquigarrow s' \vdash_m'' \eta', \sigma', \Sigma, \Psi', \Phi'$

PROOF. This proceeds by induction on the typing derivation.

### Perspective Management Rules.

- Case **T-Split**:

In this case we have by our assumption that  $\pi \vdash p$  that  $p < n$ . We also have that  $n_1, n_2$  align to  $n$ , so  $n_1 + n_2 \leq n$ . There are three cases to consider, then: when  $p < n_1$ , when  $p \geq n_1$  and  $p < n_1 + n_2$ , and when  $p \geq n_1 + n_2$ .

In the first case, we have by our inductive hypothesis that  $s$  is either **skip** or that it can step in an  $(h, n_1)$  context. In the former case we can use the **S-Split-Left-Done** rule and in the latter we can use the **S-Split-Left rule**.

The second case is almost symmetric. The only additional work we have to do is to argue that  $(h, n_2) \vdash p - n_1$ , or equivalently that  $p - n_1 < n_2$ . This, however, is immediate from our assumption that  $p < n_1 + n_2$ .

In the last case, we just use the **S-Split-None** rule to step to **skip**.

- Case **T-Destruct**

By our inductive hypothesis, we know that  $s$  can step at  $\downarrow \pi$  if  $\downarrow \pi \vdash p$ . The  $\downarrow$  operation is only defined at **(Block, 1)** or **(Grid, 1)**, so we only need to consider the cases where  $\pi$  is one of those.

In the former case  $p$  becomes  $t \mathbf{mod} T$  while  $\downarrow \pi$  is **(Thread, T)**.  $t \mathbf{mod} T < T$  for any  $t$  so this satisfies the requirement that  $\pi \vdash p$ , which lets us use our inductive hypothesis:  $s$  is either **skip** or can step. If it can step, we can use this to satisfy the premise of **S-Destruct-Block** to step in this case. If it is **skip**, then we use the rule **S-Destruct-Done** to step instead.

The latter case is the same, except using the fact that  $b \mathbf{mod} B < B$  and the **S-Destruct-Grid** rule.

- Case **T-Group**

In this case we have by assumption that  $(h, q \cdot n) \vdash p$ , i.e., that  $p < q \cdot n$ .

In this case we have our IH that if  $(h, n) \vdash p'$  for some  $p'$ , then  $s$  is either **skip** or steps with  $(h, n)$  perspective with  $p'$  as our perspective ID.

We choose  $p'$  to be  $p \mathbf{mod} n$ . This is always  $< n$ , so  $(h, n) \vdash p'$ . This lets us use our IH to get that  $s$  is either **skip** (in which case we can use the **S-Group-Done** rule to step) or itself steps, which lets us use the **S-Group** rule to step.

### Thread Synchronization Rules.

- Case **T-Sync-Wait**

$\Psi(\psi)(p)$  is either zero or it is not. In the former case we use the **S-Sync-Wait-Done** rule and in the latter we use **S-Sync-Wait-Spin**.

- Case **T-Sync-Dec**

We use the **S-Sync-Dec** rule to step.

- Case **T-Sync-Init**

We use the **S-Sync-Init-Zero** or **S-Sync-Init-Nonzero** rules depending on whether  $\Psi(\psi)(p)$  is zero or not.

### Asynchrony Rules.

- Case **T-Async-Partition**

In this case, we have via our IH that if  $\Gamma, y : (\text{Thread}, 1) \text{ async } \tau[]^l \vdash \eta', \sigma', \Sigma'$ , then we can either step  $s$  with **(Thread, 1)** perspective under environments  $\eta'$ ,  $\sigma'$ , and  $\Sigma'$ , or  $s$  is **skip**.

1667 We have by assumption that  $\Gamma, x : (\text{Thread}, 1) \text{ async } \tau[]^l \vdash \eta, \sigma, \Sigma$ . By our environment renaming  
 1668 lemma, this gives us what we need to use our IH with  $(\eta', \sigma', \Sigma')$  as **rename** $(\eta, \sigma, \Sigma, x, y, (\text{Thread}, 1))$ .  
 1669 Thus  $s$  either steps or is **skip**. In the former case we can step with **S-Async-Partition-Congr**,  
 1670 and in the latter we can use either **S-Async-Partition-Unwind** or **S-Async-Partition-Done**  
 1671 depending on whether  $\Phi(\phi)$  is empty or not.

- Case **T-Async-Memcpy**

1672 Immediate via use of the **S-Async-Memcpy** rule.

- Case **T-Memcpy**

1673 Immediate via use of the **S-Memcpy** rule.

1674 *Memory Rules.*

- Case **T-Decl**

1675 Via our lemma about expression type safety and our hypothesis that  $e$  is well-typed, we obtain  
 1676 the premises necessary to use the **S-Decl** rule to step.

- Case **T-Arr-Assn**

1677 Each of  $e_1, e_2$  and  $e_3$  must evaluate to a well-typed value by the expression type safety lemma.  
 1678 In particular, both  $e_1$  evaluates to some  $\langle l, n \rangle$  and  $e_2$  evaluates to some  $i$ . We assume all array  
 1679 accesses are in bounds, so this is sufficient to use the **S-Arr-Assn** rule to step.

- Case **T-Arr-Assn-Shared**

1680 Same as previous case.

- Case **T-Free**

1681 Trivial via the **S-Free** rule.

- Case **T-Partition**

1682 Trivial via the **S-Partition** rule.

- Case **T-Claim**

1683 Trivial via the **S-Claim** rule.

- Case **T-Lower**

1684 Trivial via the **S-Lower** rule.

- Case **T-Alloc**

1685 We assume that  $l$  is not **Shared**, so we can use the **S-Alloc-Local** or **S-Alloc-Global** rule,  
 1686 depending on whether  $l$  is **Local** or **Global**.

- Case **T-Alloc-Shared**

1687 Trivial via the **S-Alloc-Shared** rule.

1688 *Control Rules.*

- Case **T-Skip**

1689 Trivial

- Case **T-While**

1690 Trivial, all while loops step via the **S-While** rule

- Case **T-If**

1691 By our proof of expression type safety, the expression  $e$  steps to either the boolean value true  
 1692 or false. We can thus use either the **S-If-True** or **S-If-False** rules to step.

- Case **T-Seq**

1693 In this case we know by our IH that  $s_1$  is either **skip** or can step. In the former case we use  
 1694 the **S-Seq-Done** rule and in the latter we use the **S-Seq-First** rule.

- Case **T-Function-Call**

1695 In this case we know by our expression safety lemma that each of the arguments will evaluate to  
 1696 a well-typed value. We also have by assumption that  $f$  has a function type, which by inversion

on the **V-Function** rule tells us that it is a closure type. Additionally our assumption that  $\Gamma \vdash \eta, \sigma, \Sigma$  tell us that  $\Sigma$  contains  $f$  at the same type that  $\Gamma$  does. These premises are sufficient to use the **S-Function-Call** rule.

□

**LEMMA A.8. (Statement Preservation)** If  $\Gamma \vdash \eta, \sigma, \Sigma$  and  $\Gamma \vdash_m^\pi s$  and  $\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_{m'}^\pi s \rightsquigarrow s' \dashv_{m''} \eta', \sigma', \Sigma, \Psi', \Phi'$  and  $\pi \vdash p$  and  $m \geq m'$  and  $\Gamma \vdash \Phi$ , then there is some  $\Gamma'$  such that  $\Gamma \subseteq \Gamma'$  and  $\Gamma' \vdash_m^\pi s'$  and  $\Gamma' \vdash \eta', \sigma', \Sigma'$  and  $m \geq m''$  and  $\Gamma' \vdash \Phi'$ .

**PROOF.** We proceed by induction on the derivation of  $\Gamma \vdash_m^\pi s$ .

### Perspective Management Rules.

- Case **T-Split**

In this case we have by assumption that  $(h, n) \vdash p$ ,  $\Gamma \vdash \eta, \sigma, \Sigma$ , and  $\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_{m'}^\pi \text{split}(n_1, n_2) \{s_1\} \{s_2\} \rightsquigarrow s' \dashv_{m''} \eta', \sigma', \Sigma, \Psi', \Phi'$ . Our inductive hypotheses give us that if for any  $p$ , if  $(h, n_1) \vdash p$  and  $s_1$  steps with perspective ID  $p$ , or if  $(h, n_2) \vdash p$  and  $s_2$  steps with perspective ID  $p$  then their results are well typed.

By inversion on our semantic derivation, we are in one of 5 cases.

In the **S-Split-Left** case  $p < n_1$  and  $s_1$  steps to  $s'_1$ . This is sufficient to tell us that  $s'_1$  is well-typed and the output environments of that relation  $\eta', \sigma'$ , and  $\Sigma'$  are all well-typed by  $\Gamma' \supseteq \Gamma$ , and that the memory is properly bounded by the typing rules.

We can thus use the **T-Split-Left** rule to conclude that the result of this case is well-typed.

The **S-Split-Left-Done** rule is trivial via the **T-Skip** rule.

The **Right** cases are symmetric, with the observation that when  $p \geq n_1$  and  $p < n_1 + n_2$  then  $p - n_1 < n_2$ .

The last **S-Split-None** rule is trivial via the **T-Skip** rule.

- Case **T-Destruct**

In this case we have by assumption that  $\downarrow \pi$  is defined, so  $\pi$  is either **(Block,1)** or **(Grid,1)**. We also assume that  $\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_{m'}^\pi \text{destruct in } s \rightsquigarrow s' \dashv_{m''} \eta', \sigma', \Sigma, \Psi', \Phi'$ . We also have by our inductive hypothesis that for any  $p$  such that  $\downarrow \pi \vdash p$  and  $s''$  such that  $s$  steps to  $s''$  at  $p$ , then that step preserved well-typedness.

By inversion on the step relation, we are in one of three cases.

If the rule used **S-Destruct-Block**, then we know that  $s$  steps to  $s''$  and  $p$  is  $t \bmod T$ .  $(\text{Thread}, T) \vdash t \bmod T$  for any  $t$ , so we can use our inductive hypothesis to conclude that the  $s''$  stepped to by  $s$  is well-typed, as are its environments and memory usage. The **T-Destruct** rule then gives us our desired goal.

The **S-Destruct-Grid** case proceeds similarly, while the **S-Destruct-Done** case is trivial.

- Case **T-Group**

In this case we have by assumption that

$\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_{m'}^{(h,q,n)} \text{group } q \ s \rightsquigarrow s' \dashv_{m''} \eta', \sigma', \Sigma, \Psi', \Phi'$ . We have by our inductive hypothesis that  $s$  steps to  $s''$  at some perspective ID  $p'$  and  $(h, n) \vdash p'$ , then  $s''$  is well typed, as are the other outputs of that step.

By inversion on the step relation, we are in one of two cases. The **S-Group-Done** case is trivial, so we shall focus on the **S-Group** case. In this case we have that  $s$  steps to  $s''$  at perspective ID  $p \bmod n$ . It is always the case that  $(h, n) \vdash p \bmod n$  for any  $p$ , so we can use our inductive hypothesis to conclude that  $s''$  is well-typed, as are its output environments and memory usage. From there, it is a simple application of the **T-Group** rule to conclude that **group**  $q$   $s'$  is well-typed, and to finish the case.

**Thread Synchronization Rules.** These rules are all trivial: with one exception all thread synchronization primitives step to skip without changing environment or memory, and are thus obviously well-typed. **S-Sync-Wait-Spin** does not produce skip, but it steps to the same statement as we already assumed typechecks in the premise of the lemma, so is straightforward nonetheless.

If we wanted to say something about deadlock freedom we'd have more work here, but we aren't doing that, so these rules are easy.

### Asynchrony Rules.

- Case **T-Async-Partition** In this case we have that  $s$  is well-typed in a context where  $x$  has been renamed into  $y$ , with  $(\text{Thread},1)$  perspective. We also have that  $\Gamma, x :^{(\text{Thread},1)} \tau[]^l \vdash \eta, \sigma, \Sigma$  and  $\Gamma, x :^{(\text{Thread},1)} \tau[]^l \Phi$ .

By inversion, we are in one of three cases.

In the **S-Async-Partition-Done** case, we are done.

In the **S-Async-Partition-Congr** case, our inductive hypothesis gives us that there is some  $\Gamma'$  such that  $\Gamma, y :^{(\text{Thread},1)} \text{async } \tau[]^l \subseteq \Gamma'$  and  $\Gamma' \vdash \Phi'$  and  $\Gamma' \vdash \text{rename}(\eta, \sigma, \Sigma, x, y, (\text{Thread},1))$  via our well-typed renaming lemma. This lets us use the **T-Async-Partition** rule to check this case, with a choice of  $\Gamma'$  as  $\Gamma', x :^{(\text{Thread},1)} \tau[]^l$ .

In the **S-Async-Partition-Unwind** case, our assumption that  $\Phi$  is well-typed tells us that  $\Gamma, y :^{(\text{Thread},1)} \vdash \{\ (\text{Thread},1)_m s\}$ . Thus, we can use the **T-Async-Partition** rule to type this case.

- Case **T-Async-Memcpy**

In this case the statement and environment typing are trivial, we need only to show that the async stack remains well typed.

In this case we have that  $x$  and  $y$  have the same type at  $(\text{Thread},1)$ . This is sufficient for us to check  $x = y$  at  $(\text{Thread},1)$ , meaning that adding that instruction to the stack maintains its well-typedness.

- Case **T-Memcpy**

Immediate via use of the **S-Memcpy** rule. We just need to show that the environment remains well typed, which we know via our lemmas about **update** and **get**.

### Memory Rules.

- Case **T-Decl**

By inversion, we are using **S-Decl** rule for evaluation. Our well-typed expression lemma gives us that  $v$  is well-typed, so it follows from our assumptions and our lemmas about extending environments that the extended  $\eta$  and  $s$  are well-typed by  $\Gamma, x :^\pi \tau$ .

- Case **T-Free**

Trivial.

- Case **T-Alloc**

By inversion we are either in the **S-Alloc-Local** or **S-Alloc-Global** rules. In either case, we assume that  $\Gamma, x :^\pi \tau[]^l$  checks  $s$ , meaning we can use our extended environment lemmas and the **T-Seq** and **T-Free** rules to check these cases.

- Case **T-Alloc-Shared**

Same as previous case.

- Case **T-Partition**

In this case we have by assumption that  $\Gamma, y :^{(h,n/c)} \tau[]^l s$  and  $l$  is not local and  $c$  divides  $n$ . By inversion on our step relation we must be in the **S-Partition** case, so we have  $\eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^\pi \text{partition}_\psi y cys \rightsquigarrow \text{init}_\psi; s'; \text{dec}_\psi; \text{wait}_\psi \dashv_m \eta', \sigma', \Sigma', \Psi, \Phi$  where  $s' = s[(y + c \cdot p)/y]$  and  $(\eta', \sigma', \Sigma') = \text{rename}(\eta, \sigma, \Sigma, x, y, \pi/c)$ . Via our well-typed renaming lemma contexts we know

that we can check the renamed environments in the extended environment  $\Gamma, y:\pi/c \tau[]^l, x:\pi\tau[]^l$ , and this context is also sufficient to check  $s'$  (via a substitution-preserves-typing lemma that is obvious). Using the **T-Skip** rule this is exactly what we need to show to complete this case, as the thread sync primitives check trivially via their typing rules.

- Case **T-Claim**

Essentially the same as **T-Partition**.

- Case **T-Lower**

Essentially the same as **T-Partition**.

### Control Rules.

- Case **T-If**

We have by assumption that **if**  $e$  **then**  $s_1$  **else**  $s_2$  steps to some  $s'$ , and by inversion we know that either  $e$  evaluates to **true** and  $s'$  is  $s_1$ , or  $e$  evaluates to **false** and  $s'$  is  $s_2$ .

In either case, our inductive hypotheses is sufficient to tell us that these are well-typed. In particular, in both cases our IHs tell us that the amount of memory used by stepping each branch of the **if** is less than the amount of memory computed by the type system for each branch. Because the whole **if** expression checks using the greater of the memory usage of  $m_1$  or  $m_2$  (i.e., the memory usage on each branch), the resulting usage for the whole conditional is also bounded by the type system.

- Case **T-Skip**

Trivial: **skip** does not step

- Case **T-Seq**

In this case we have that  $s_1$  and  $s_2$  are both well-typed (with  $m_1$  memory and  $m_2$  memory respectively), and  $m = \mathbf{max}(m_1, m_2)$ . We also have by inversion that  $s_1$  either steps to **skip** or  $s'_1$ , and our inductive hypothesis tells us that  $s'_1$  is well-typed.

In the former case we can use the **S-Seq-Done** rule to trivially finish the case. In the latter, our IH allows us to finish the case, since  $m_1$  is always  $\leq \mathbf{max}(m_1, m_2)$

- Case **T-While**

By inversion, we have that

$$\begin{aligned} & \eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^\pi \mathbf{while } e \mathbf{ do } s \\ & \rightsquigarrow \mathbf{if } e \mathbf{ then } (s; \mathbf{while } e \mathbf{ do } s) \mathbf{ else skip } \dashv_m \eta, \sigma, \Sigma, \Psi, \Phi. \end{aligned}$$

We also have by assumption that  $e$  and  $s$  are well-typed. With this information, through use of the **T-If**, **T-Seq**, **T-While**, and **T-Skip** rules, we can conclude that the result of this rule is also well-typed.

- Case **T-Function-Call**

By inversion we have that

$$\begin{aligned} & \eta, \sigma, \Sigma, t, b, p, \Psi, \Phi \vdash_m^\pi f(e_1, \dots, e_n) \\ & \rightsquigarrow \mathbf{call } s \mathbf{ with } (x_i : \pi_i \tau_i \mapsto v_i) @ \{\eta, \sigma, \Sigma, m'\} \dashv_m \eta, \sigma, \Sigma, \Psi, \Phi, \end{aligned}$$

and also that  $\Sigma(f) = \{[x_1 : \tau_1, \dots, x_n : \tau_n], s\}$ , and  $\eta, \sigma, \Sigma, t, b, p \vdash^\pi e_i \Downarrow v_i$ , and  $m' \leq m$ .

We also have from the premises of our case that  $f : \pi \mathbf{Fun}(x_1 : \pi_1 \tau_1, \dots, x_n : \pi_n \tau_n, \pi, m') \in \Gamma$ , and  $\Gamma \vdash^\pi e_i : \tau_i$ , and  $m' \leq m$ .

Our lemma for expression well-typedness tells us that that each  $v_i$  is a well-typed value, and our assumption that  $\Gamma \vdash \eta, \sigma, \Sigma$  tells us that  $\Sigma(f)$  is a well-typed function and thus that  $\mathbf{fn}s \Gamma, x_i : \pi_i \tau_i \vdash_{m'}^\pi s$ . We can take the union of this with  $\Gamma$  to produce  $\Gamma, x_i : \pi_i \tau_i \vdash_{m'}^\pi s$ , which clearly checks  $s$  and the output environments.

□

```

1863
1864     @cuda("global")
1865     @requires(grid[1], block[1], thread[1], smem=49000)
1866     def my_sgemm_kernel_1( M : int @grid[1],
1867                           N : int @grid[1],
1868                           K : int @grid[1],
1869                           alpha : float @grid[1],
1870                           A: ptr(const(float)) @grid[1],
1871                           B: ptr(const(float)) @grid[1],
1872                           beta : float @grid[1],
1873                           C: ptr(float) @grid[1],
1874                           block_size : int @grid[1]):
1875         with group(grid[1]):
1876             bid : int @block[1] = id()
1877             with partition(C, dimension=block[1], offset=0) as C_blk:
1878                 with group(block[1]):
1879                     tid : int @thread[1] = bid * block_size + id()
1880                     with partition(C_blk, dimension=thread[1], offset=0) as C_thrd:
1881                         x : const(int) @thread[1] = tid / N
1882                         y : const(int) @thread[1] = tid % N
1883                         with group(thread[1]):
1884                             tmp : float @thread[1] = 0
1885                             for i in range(0, K, 1):
1886                                 tmp += A[x * K + i] * B[i * N + y]
1887                             C_thrd[x * N + y] = alpha * tmp + beta * C[x * N + y]
1888
1889             return
1890
1891
1892
1893
1894
1895
1896
1897
1898
1899
1900
1901
1902
1903
1904
1905
1906
1907
1908
1909
1910
1911
1912
1913
1914
1915
1916
1917
1918
1919
1920
1921
1922
1923
1924
1925
1926
1927
1928
1929
1930
1931
1932
1933
1934
1935
1936
1937
1938
1939
1940
1941
1942
1943
1944
1945
1946
1947
1948
1949
1950
1951
1952
1953
1954
1955
1956
1957
1958
1959
1960
1961
1962
1963
1964
1965
1966
1967
1968
1969
1970
1971
1972
1973
1974
1975
1976
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
2025
2026
2027
2028
2029
2030
2031
2032
2033
2034
2035
2036
2037
2038
2039
2040
2041
2042
2043
2044
2045
2046
2047
2048
2049
2050
2051
2052
2053
2054
2055
2056
2057
2058
2059
2060
2061
2062
2063
2064
2065
2066
2067
2068
2069
2070
2071
2072
2073
2074
2075
2076
2077
2078
2079
2080
2081
2082
2083
2084
2085
2086
2087
2088
2089
2090
2091
2092
2093
2094
2095
2096
2097
2098
2099
2100
2101
2102
2103
2104
2105
2106
2107
2108
2109
2110
2111
2112
2113
2114
2115
2116
2117
2118
2119
2120
2121
2122
2123
2124
2125
2126
2127
2128
2129
2130
2131
2132
2133
2134
2135
2136
2137
2138
2139
2140
2141
2142
2143
2144
2145
2146
2147
2148
2149
2150
2151
2152
2153
2154
2155
2156
2157
2158
2159
2160
2161
2162
2163
2164
2165
2166
2167
2168
2169
2170
2171
2172
2173
2174
2175
2176
2177
2178
2179
2180
2181
2182
2183
2184
2185
2186
2187
2188
2189
2190
2191
2192
2193
2194
2195
2196
2197
2198
2199
2200
2201
2202
2203
2204
2205
2206
2207
2208
2209
2210
2211
2212
2213
2214
2215
2216
2217
2218
2219
2220
2221
2222
2223
2224
2225
2226
2227
2228
2229
2230
2231
2232
2233
2234
2235
2236
2237
2238
2239
2240
2241
2242
2243
2244
2245
2246
2247
2248
2249
2250
2251
2252
2253
2254
2255
2256
2257
2258
2259
2260
2261
2262
2263
2264
2265
2266
2267
2268
2269
2270
2271
2272
2273
2274
2275
2276
2277
2278
2279
2280
2281
2282
2283
2284
2285
2286
2287
2288
2289
2290
2291
2292
2293
2294
2295
2296
2297
2298
2299
2300
2301
2302
2303
2304
2305
2306
2307
2308
2309
2310
2311
2312
2313
2314
2315
2316
2317
2318
2319
2320
2321
2322
2323
2324
2325
2326
2327
2328
2329
2330
2331
2332
2333
2334
2335
2336
2337
2338
2339
2340
2341
2342
2343
2344
2345
2346
2347
2348
2349
2350
2351
2352
2353
2354
2355
2356
2357
2358
2359
2360
2361
2362
2363
2364
2365
2366
2367
2368
2369
2370
2371
2372
2373
2374
2375
2376
2377
2378
2379
2380
2381
2382
2383
2384
2385
2386
2387
2388
2389
2390
2391
2392
2393
2394
2395
2396
2397
2398
2399
2400
2401
2402
2403
2404
2405
2406
2407
2408
2409
2410
2411
2412
2413
2414
2415
2416
2417
2418
2419
2420
2421
2422
2423
2424
2425
2426
2427
2428
2429
2430
2431
2432
2433
2434
2435
2436
2437
2438
2439
2440
2441
2442
2443
2444
2445
2446
2447
2448
2449
2450
2451
2452
2453
2454
2455
2456
2457
2458
2459
2460
2461
2462
2463
2464
2465
2466
2467
2468
2469
2470
2471
2472
2473
2474
2475
2476
2477
2478
2479
2480
2481
2482
2483
2484
2485
2486
2487
2488
2489
2490
2491
2492
2493
2494
2495
2496
2497
2498
2499
2500
2501
2502
2503
2504
2505
2506
2507
2508
2509
2510
2511
2512
2513
2514
2515
2516
2517
2518
2519
2520
2521
2522
2523
2524
2525
2526
2527
2528
2529
2530
2531
2532
2533
2534
2535
2536
2537
2538
2539
2540
2541
2542
2543
2544
2545
2546
2547
2548
2549
2550
2551
2552
2553
2554
2555
2556
2557
2558
2559
2560
2561
2562
2563
2564
2565
2566
2567
2568
2569
2570
2571
2572
2573
2574
2575
2576
2577
2578
2579
2580
2581
2582
2583
2584
2585
2586
2587
2588
2589
2590
2591
2592
2593
2594
2595
2596
2597
2598
2599
2600
2601
2602
2603
2604
2605
2606
2607
2608
2609
2610
2611
2612
2613
2614
2615
2616
2617
2618
2619
2620
2621
2622
2623
2624
2625
2626
2627
2628
2629
2630
2631
2632
2633
2634
2635
2636
2637
2638
2639
2640
2641
2642
2643
2644
2645
2646
2647
2648
2649
2650
2651
2652
2653
2654
2655
2656
2657
2658
2659
2660
2661
2662
2663
2664
2665
2666
2667
2668
2669
2670
2671
2672
2673
2674
2675
2676
2677
2678
2679
2680
2681
2682
2683
2684
2685
2686
2687
2688
2689
2690
2691
2692
2693
2694
2695
2696
2697
2698
2699
2700
2701
2702
2703
2704
2705
2706
2707
2708
2709
2710
2711
2712
2713
2714
2715
2716
2717
2718
2719
2720
2721
2722
2723
2724
2725
2726
2727
2728
2729
2730
2731
2732
2733
2734
2735
2736
2737
2738
2739
2740
2741
2742
2743
2744
2745
2746
2747
2748
2749
2750
2751
2752
2753
2754
2755
2756
2757
2758
2759
2760
2761
2762
2763
2764
2765
2766
2767
2768
2769
2770
2771
2772
2773
2774
2775
2776
2777
2778
2779
2780
2781
2782
2783
2784
2785
2786
2787
2788
2789
2790
2791
2792
2793
2794
2795
2796
2797
2798
2799
2800
2801
2802
2803
2804
2805
2806
2807
2808
2809
2810
2811
2812
2813
2814
2815
2816
2817
2818
2819
2820
2821
2822
2823
2824
2825
2826
2827
2828
2829
2830
2831
2832
2833
2834
2835
2836
2837
2838
2839
2840
2841
2842
2843
2844
2845
2846
2847
2848
2849
2850
2851
2852
2853
2854
2855
2856
2857
2858
2859
2860
2861
2862
2863
2864
2865
2866
2867
2868
2869
2870
2871
2872
2873
2874
2875
2876
2877
2878
2879
2880
2881
2882
2883
2884
2885
2886
2887
2888
2889
2890
2891
2892
2893
2894
2895
2896
2897
2898
2899
2900
2901
2902
2903
2904
2905
2906
2907
2908
2909
2910
2911
2912
2913
2914
2915
2916
2917
2918
2919
2920
2921
2922
2923
2924
2925
2926
2927
2928
2929
2930
2931
2932
2933
2934
2935
2936
2937
2938
2939
2940
2941
2942
2943
2944
2945
2946
2947
2948
2949
2950
2951
2952
2953
2954
2955
2956
2957
2958
2959
2960
2961
2962
2963
2964
2965
2966
2967
2968
2969
2970
2971
2972
2973
2974
2975
2976
2977
2978
2979
2980
2981
2982
2983
2984
2985
2986
2987
2988
2989
2990
2991
2992
2993
2994
2995
2996
2997
2998
2999
2999
3000
3001
3002
3003
3004
3005
3006
3007
3008
3009
3010
3011
3012
3013
3014
3015
3016
3017
3018
3019
3020
3021
3022
3023
3024
3025
3026
3027
3028
3029
3030
3031
3032
3033
3034
3035
3036
3037
3038
3039
3040
3041
3042
3043
3044
3045
3046
3047
3048
3049
3050
3051
3052
3053
3054
3055
3056
3057
3058
3059
3060
3061
3062
3063
3064
3065
3066
3067
3068
3069
3070
3071
3072
3073
3074
3075
3076
3077
3078
3079
3080
3081
3082
3083
3084
3085
3086
3087
3088
3089
3090
3091
3092
3093
3094
3095
3096
3097
3098
3099
3099
3100
3101
3102
3103
3104
3105
3106
3107
3108
3109
3110
3111
3112
3113
3114
3115
3116
3117
3118
3119
3119
3120
3121
3122
3123
3124
3125
3126
3127
3128
3129
3129
3130
3131
3132
3133
3134
3135
3136
3137
3138
3139
3139
3140
3141
3142
3143
3144
3145
3146
3147
3148
3149
3149
3150
3151
3152
3153
3154
3155
3156
3157
3158
3159
3159
3160
3161
3162
3163
3164
3165
3166
3167
3167
3168
3169
3170
3171
3172
3173
3174
3175
3176
3177
3178
3179
3179
3180
3181
3182
3183
3184
3185
3186
3187
3187
3188
3189
3190
3191
3192
3193
3194
3195
3196
3197
3198
3199
3199
3200
3201
3202
3203
3204
3205
3206
3207
3208
3209
3209
3210
3211
3212
3213
3214
3215
3216
3217
3218
3219
3219
3220
3221
3222
3223
3224
3225
3226
3227
3228
3229
3229
3230
3231
3232
3233
3234
3235
3236
3237
3238
3238
3239
3240
3241
3242
3243
3244
3245
3246
3247
3248
3248
3249
3250
3251
3252
3253
3254
3255
3256
3257
3258
3259
3259
3260
3261
3262
3263
3264
3265
3266
3267
3268
3268
3269
3270
3271
3272
3273
3274
3275
3276
3277
3278
3278
3279
3280
3281
3282
3283
3284
3285
3286
3287
3287
3288
3289
3290
3291
3292
3293
3294
3295
3296
3297
3297
3298
3299
3299
3300
3301
3302
3303
3304
3305
3306
3307
3308
3309
3309
3310
3311
3312
3313
3314
3315
3316
3317
3318
3319
3319
3320
3321
3322
3323
3324
3325
3326
3327
3328
3328
3329
3330
3331
3332
3333
3334
3335
3336
3337
3338
3338
3339
3340
3341
3342
3343
3344
3345
3346
3347
3348
3348
3349
3350
3351
3352
3353
3354
3355
3356
3357
3358
3358
3359
3360
3361
3362
3363
3364
3365
3366
3367
3368
3368
3369
3370
3371
3372
3373
3374
3375
3376
3377
3378
3378
3379
3380
3381
3382
3383
3384
3385
3386
3387
3387
3388
3389
3389
3390
3391
3392
3393
3394
3395
3396
3397
3397
3398
3399
3399
3400
3401
3402
3403
3404
3405
3406
3407
3408
3409
3409
3410
3411
3412
3413
3414
3415
3416
3417
3418
3419
3419
3420
3421
3422
3423
3424
3425
3426
3427
3428
3428
3429
3430
3431
3432
3433
3434
3435
3436
3437
3438
3438
3439
3440
3441
3442
3443
3444
3445
3446
3447
3448
3448
3449
3450
3451
3452
3453
3454
3455
3456
3457
3458
3458
3459
3460
3461
3462
3463
3464
3465
3466
3467
3468
3468
3469
3470
3471
3472
3473
3474
3475
3476
3477
3478
3478
3479
3480
3481
3482
3483
3484
3485
3486
3487
3488
3488
3489
3490
3491
3492
3493
3494
3495
3496
3497
3498
3498
3499
3499
3500
3501
3502
3503
3504
3505
3506
3507
3508
3509
3509
3510
3511
3512
3513
3514
3515
3516
3517
3518
3519
3519
3520
3521
3522
3523
3524
3525
3526
3527
3528
3528
3529
3530
3531
3532
3533
3534
3535
3536
3537
3538
3538
3539
3540
3541
3542
3543
3544
3545
3546
3547
3548
3548
3549
3550
3551
3552
3553
3554
3555
3556
3557
3558
3559
3559
3560
3561
3562
3563
3564
3565
3566
3567
3568
3568
3569
3570
3571
3572
3573
3574
3575
3576
3577
3578
3578
3579
3580
3581
3582
3583
3584
3585
3586
3587
3588
3588
3589
3590
3591
3592
3593
3594
3595
3596
3597
3598
3598
3599
3599
3600
3601
3602
3603
3604
3605
3606
3607
3608
3609
3609
3610
3611
3612
3613
3614
3615
3616
3617
3618
3619
3619
3620
3621
3622
3623
3624
3625
3626
3627
3628
3628
3629
3630
3631
3632
3633
3634
3635
3636
3637
3638
3638
3639
3640
3641
3642
3643
3644
3645
3646
3647
3648
3648
3649
3650
3651
3652
3653
3654
3655
3656
3657
3658
3659
3659
3660
3661
3662
3663
3664
3665
3666
3667
3668
3669
3669
3670
3671
3672
3673
3674
3675
3676
3677
3678
3678
3679
3680
3681
3682
3683
3684
3685
3686
3687
3688
3688
3689
3690
3691
3692
3693
3694
3695
3696
3697
3698
3698
3699
3699
3700
3701
3702
3703
3704
3705
3706
3707
3708
3709
3709
3710
3711
3712
3713
3714
3715
3716
3717
3718
3719
3719
3720
3721
3722
3723
3724
3725
3726
3727
3728
3729
3729
3730
3731
3732
3733
3734
3735
3736
3737
3738
3738
3739
3740
3741
3742
3743
3744
3745
3746
3747
3748
3748
3749
3750
3751
3752
3753
3754
3755
3756
3757
3758
3759
3759
3760
3761
3762
3763
3764
3765
3766
3767
3768
3769
3769
3770
3771
3772
3773
3774
3775
3776
3777
3778
3778
3779
3780
3781
3782
3783
3784
3785
3786
3787
3788
3788
3789
3790
3791
3792
3793
3794
3795
3796
3797
3798
3798
3799
3799
3800
3801
3802
3803
3804
3805
3806
3807
3808
3809
3809
3810
3811
3812
3813
3814
3815
3816
3817
3818
3819
3819
3820
3821
3822
3823
3824
3825
3826
3827
3828
3829
3829
3830
3831
3832
3833
3834
3835
3836
3837
3838
3838
3839
3840
3841
3842
3843
3844
3845
3846
3847
3848
3848
3849
3850
3851
3852
3853
3854
3855
3856
3857
3858
3859
3859
3860
3861
3862
3863
3864
3865
3866
3867
3868
3869
3869
3870
3871
3872
3873
3874
3875
3876
3877
3878
3878
3879
3880
3881
3882
3883
3884
3885
3886
3887
3888
3889
3889
3890
3891
3892
3893
3894
3895
3896
3897
3898
3899
3899
3900
3901
3902
3903
3904
3905
3906
3907
3908
3909
3910
3911
3912
3913
3914
3915
3916
3917
3918
3919
3919
3920
3921
3922
3923
3924
3925
3926
3927
3928
3929
3929
3930
3931
3932
3933
3934
3935
3936
3937
3938
3938
3939
3940
3941
3942
3943
3944
3945
3946
3947
3948
3948
3949
3950
3951
3952
3953
3954
3955
3956
3957
3958
3959
3959
3960
3961
3962
3963
3964
3965
3966
3967
3968
3969
3969
3970
3971
3972
3973
3974
3975
3976
3977
3978
3979
3979
3980
3981
3982
3983
3984
3985
3986
3987
3988
3988
3989
3990
3991
3992
3993
3994
3995
3996
3997
3998
3999
3999
4000
4001
4002
4003
4004
4005
4006
4007
4008
4009
4010
4011
4012
4013
4014
4015
4016
4017
4018
4019
4019
4020
4021
4022
4023
4024
4025
4026
4027
4028
4029
4029
4030
4031
4032
4033
4034
4035
4036
4037
4038
4038
4039
4040
4041
4042
4043
4044
4045
4046
4047
4048
4048
4049
4050
4051
4052
4053
4054
4055
4056
4057
4058
4059
4059
4060
4061
4062
4063
4064
4065
4066
4067
4068
4069
4069
4070
4071
4072
4073
4074
4075
4076
4077
4078
4078
4079
4080
4081
4082
4083
4084
4085
4086
4087
4088
4088
4089
4090
4091
4092
4093
4094
4095
4096
4097
4098
4098
4099
4099
4100
4101
4102
4103
4104
4105
4106
4107
4108
4109
4110
4111
4112
4113
4114
4115
4116
4117
4118
4119
4119
4120
4121
4122
4123
4124
4125
4126
4127
4128
4129
4129
4130
4131
4132
4133
4134
4135
4136
4137
4138
4138
4139
4140
4141
4142
4143
4144
4145
4146
4147
4148
4148
4149
415
```



```

1961 1
1962
1963
1964
1965 1 ccuda("device")
1966 2 requires(block[1], thread[32])
1967 3 def block_load(input : ptr(const(int)) block[1],
1968 4           output : ptr(int) thread[4],
1969 5           items_per_thread : int block[1]):
1970 6     with group(block[1]):
1971 7         tid : int thread[32] = id()
1972 8         with partition(input, dimension=thread[32], offset = tid * items_per_thread) as input_thrd:
1973 9             with group(thread[32]):
1974 10                 warp_load(input_thrd, output, items_per_thread)
1975 11
1976 12     return
1977
1978
1979
1980
1981
1982
1983
1984 1 ccuda("device")
1985 2 requires(thread[32])
1986 3 def warp_load(input : ptr(const(int)) thread[32],
1987 4           output : ptr(int) thread[1],
1988 5           items_per_thread : int thread[32]):
1989 6     with group(thread[32]):
1990 7         tid : int thread[1] = id()
1991 8         with partition(input, dimension=thread[1], offset = tid * items_per_thread) as input_thrd:
1992 9             with group(thread[1]):
1993 10                 thread_load(input_thrd, output, items_per_thread)
1994 11
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009

```

```
2010
2011 1 #include <cuda.h>
2012 2 #include <device_launch_parameters.h>
2013 3 #include <thrust/device_ptr.h>
2014 4 #include <thrust/thread_scope.h>
2015 5 #include <thrust/thread_block.h>
2016
2017
2018
2019
2020
2021
2022
2023
2024
2025
2026
2027
2028
2029
2030
2031
2032
2033
2034
2035
2036
2037
2038
2039
2040
2041
2042
2043
2044
2045
2046
2047
2048
2049
2050
2051
2052
2053
2054
2055
2056
2057
2058
```

```
1 #include <cuda.h>
2 #include <device_launch_parameters.h>
3 def thread_store(input : ptr(const(int)) @ thread[1],
4                  output : ptr(int) @ thread[1],
5                  items_per_thread : int @ thread[1]):
6     with group(thread[1]):
7         for i in range(0, items_per_thread, 1):
8             output[i] = input[i]
9     return
```

```

2059
2060 1    __cuda("global")
2061 2    __attribute__ (( __launch_bounds__(128*3) ))
2062 3    __requires(grid[1], block[1], thread[1], warp[12], smem=227000)
2063 4    def cutlase_my_h100_match_no_tail(
2064 5        M: int @ grid[1],
2065 6        N: int @ grid[1],
2066 7        K: int @ grid[1],
2067 8        C: ptr(bf16) @ grid[1],
2068 9        tensorMapA: const(CUTensorMap) @ grid[1],
2069 10       tensorMapB: const(CUTensorMap) @ grid[1]):
2070 11
2071 12    with group(grid[1]):
2072 13        BM : constexpr(int) @ grid[1] = 128
2073 14        BN : constexpr(int) @ grid[1] = 256
2074 15        BK : constexpr(int) @ grid[1] = 64
2075 16        NUM_THREADS : constexpr(int) @ grid[1] = 128*3
2076 17        QSIZE : constexpr(int) @ grid[1] = 4
2077 18        NUM_SM : constexpr(int) @ grid[1] = 128
2078 19
2079 20        WGMMA_M : constexpr(int) @ grid[1] = 64
2080 21        WGMMA_K : constexpr(int) @ grid[1] = 16
2081 22        WGMMA_N : constexpr(int) @ grid[1] = BN
2082 23        num_consumers : constexpr(int) @ grid[1] = (NUM_THREADS / 128) - 1
2083 24        B_WG_M : constexpr(int) @ grid[1] = BM / num_consumers
2084 25
2085 26        TM : constexpr(int) @ grid[1] = 16
2086 27        TN : constexpr(int) @ grid[1] = 8
2087 28        schedule_block : int @ block[1] = id()
2088 29
2089 30        with partition(C, offset=0, dimension=block[1]) as block_C:
2090 31            with group(block[1]):
2091 32
2092 33                # Allocate SA
2093 34                # sA[QSIZE][BK*BM]
2094 35                sA_slot0 : shared(bf16[64* 128], align=128) @ block[1]
2095 36                sA_slot1 : shared(bf16[64* 128]) @ block[1]
2096 37                sA_slot2 : shared(bf16[64* 128]) @ block[1]
2097 38                sA_slot3 : shared(bf16[64* 128]) @ block[1]
2098 39
2099 40                # Allocate SB
2100 41                # sB[QSIZE][BK*BN]
2101 42                sB_slot0 : shared(bf16[64* 256], align=128) @ block[1]
2102 43                sB_slot1 : shared(bf16[64* 256]) @ block[1]
2103 44                sB_slot2 : shared(bf16[64* 256]) @ block[1]
2104 45                sB_slot3 : shared(bf16[64* 256]) @ block[1]
2105 46
2106 47
2107 48                num_blocks_k : const(int) @ block[1] = K / BK
2108 49                wg_idx : int @ warp[4] = id()
2109 50                blk_thrd_id : int @ thread[4] = id()
2110 51
2111 52
2112 53                tid : int @ thread[4] = id() % 128
2113 54                is_producer : bool @ warp[4] = wg_idx == 0
2114 55
2115 56                num_block_m : int @ block[1] = 0
2116 57                num_block_n : int @ block[1] = 0
2117 58
2118 59                with group(warp[4]):
2119 60                    if (is_producer):
2120 61                        pass
2121 62                    else:
2122 63                        wg_idx = wg_idx - 1
2123 64
2124 65                schedule_it : int @ block[1] = 0
2125 66                total_blocks_m : int @ block[1] = (((M + BM) - 1) / BM)
2126 67                total_blocks_n : int @ block[1] = (((N + BN) - 1) / BN)
2127 68                unsafe("assert(CEIL_DIV(M, BM)*TM == 0 & total_blocks_n%TN == 0);")
2128 69
2129 70                while(true):
2130 71                    num : int @ block[1] = ((schedule_it * NUM_SM) + schedule_block)
2131 72                    if (num >= (total_blocks_m * total_blocks_n)):
2132 73                        break
2133 74                    cur_tile : int @ block[1] = (num / (TM * TN))
2134 75                    cur_tile_pos : int @ block[1] = (num % (TM * TN))
2135 76                    num_block_m = (TM * (cur_tile / (total_blocks_n / TN)))
2136 77                    num_block_n = (TN * (cur_tile % (total_blocks_n / TN)))
2137 78                    num_block_m += (cur_tile_pos / TN)
2138 79                    num_block_n += (cur_tile_pos % TN)
2139 80                    schedule_it += 1
2140 81                    d : c_float[1][16][8] @ thread[1]
2141 82
2142 83                    with group(thread[4]):
2143 84                        d_ptr : ptr(float) @ thread[1]
2144 85                        # TODO: ANNOYING TO NOT HAVE TO SET EACH INDIVIDUAL TO 0.
2145 86                        unsafe("d_ptr = (float *)d;")
2146 87                        memset_wrapper(d_ptr, 0, 1 * 16 * 8 * 4)
2147 88
2148 89                    # Now remove these registers...
2149 90                    with group(warp[4]):
2150 91                        if (is_producer):
2151 92                            num_regs : constexpr(int) @ warp[4] = 24 if (num_consumers <= 2) else 32
2152 93                            warpgroup_realloc(tempalte_[num_regs])

```