

# 1 Exo-GPU: Safe, Imperative, User-schedulable Programming 2 for Tensor Cores

3 ANONYMOUS AUTHOR(S)

4 Modern GPUs require not only SIMD-style parallelism but also software-managed concurrency between  
5 compute and data movement to reach maximum performance. Performance engineers must reason about  
6 subdividing work into the hierarchy of computation resources (threads, warps, warpgroups, blocks, clusters),  
7 and, in many cases, also must use asynchronous tensor core and memcpy instructions on different levels  
8 of the memory hierarchy (registers, tensor core accumulators, shared memory, global memory). Unlike  
9 CPUs, where out-of-order execution is managed by hardware and hidden from programmers, GPUs expose  
10 explicit instruction reordering to software through these asynchronous instructions. Well-established GPU  
11 programming languages generally offer either direct low-level control without safety guarantees (e.g., CUDA  
12 C++ inline assembly or intrinsics) or easier-to-analyze, high-level abstractions (e.g., Triton’s tile-based model)  
13 that hide asynchronous instructions in the compiler backend, which may prevent performance engineers from  
14 maximizing performance by tuning critical details.

15 We propose Exo-GPU, an imperative, low-level language that creates minimal abstraction over CUDA. Our  
16 key idea is to treat parallelism and synchronization as mere annotations on sequential code rather than as  
17 fundamental control flow primitives, enabling verification that these constructs do not alter the program  
18 semantics. The benefit is two-fold: programmers can reason about code without hidden control flow or  
19 mutation, while allowing the Exo-GPU compiler to verify *sequential-parallel equivalence*—guaranteeing that  
20 parallel execution is functionally equivalent to its sequential interpretation. We used Exo-GPU to author GEMM  
21 kernels for the H100 GPU, using wmma, TMA, and split-k. Our kernels achieved over 80% of theoretical peak  
22 on large problem sizes, in some cases outperforming the vendor-provided CUBLAS library.

## 23 1 INTRODUCTION

24 Starting with Tensor Cores introduced in NVIDIA’s Volta generation of GPUs [13], the menu of  
25 tensor-specialized features in NVIDIA GPUs has expanded to encompass automated bulk move-  
26 ment of multidimensional data, distributed shared memory, and new asynchronous tensor cores  
27 supporting large-scale (up to  $64 \times 256$ ) matrix accumulation in a single instruction [18]. This  
28 poses ever-increasing challenges for authoring correct, top-performing CUDA code. Not only must  
29 algorithms be rewritten to migrate work from traditional SIMD-style code to target these specialized  
30 instructions, but also, these instructions expose explicit out-of-order instruction execution to the  
31 programmer, moving synchronization that was traditionally the responsibility of the hardware  
32 (e.g., through scoreboarding) to the programmer.

33 High-level kernel programming languages, such as Triton [24], provide a popular approach for  
34 taming this complexity. Triton’s tile-based, automatically-parallelized programming model reduces  
35 the risk of programmer error by moving responsibility for correct usage of bulk data movement and  
36 synchronization instructions into the compiler. However, in practice, groundbreaking algorithms,  
37 such as the original FlashAttention 3 [21], are often written directly in CUDA C++ for greater  
38 control. On the one hand, CUDA C++ provides direct control over features such as asynchronous  
39 data movement and warp specialization required for optimal GPU usage. On the other hand, CUDA  
40 C++ exposes programmers to a variety of potential bugs, which we broadly categorize as:

- 41 (1) *Dataflow Logic Bugs*: Program optimization usually requires changing the storage or com-  
42 putation order of intermediate values, e.g. caching data in a faster memory, or reordering  
43 instructions to hide latency. A flawed optimization, such as an indexing bug in copying a  
44 tile, or exchanging two statements that don’t commute safely, can change the functional  
45 behavior of the program.
- 46 (2) *Instruction Logic Bugs*: Accelerator instructions won’t work as intended if not used as  
47 specified by the hardware vendor. For example, wmma tensor core instructions require data  
48 in a certain input format and uniform execution by a full warpgroup (128 aligned threads).

- 50 (3) *Inter-Thread Synchronization Bugs*: Garden-varietiy synchronization bugs involve two threads  
 51 accessing the same memory location (with one access being a write) without ordering  
 52 enforced by synchronization.  
 53 (4) *Intra-Thread Synchronization Bugs*: If one thread issues an asynchronous (async) instruction,  
 54 and that *same thread* later issues another instruction (async or not) that operates on the same  
 55 memory location, overlapped or reordered instruction execution can undermine correctness.

56 Our goal is to provide a programming model that gives CUDA-like control *without* accepting the  
 57 risk of these bugs as a fact of life. Existing CUDA C++ libraries such as CuTe [14] and ThunderKittens  
 58 [22] provide close-to-the-metal abstractions that minimize the risk of logic bugs, particularly data  
 59 movement and instruction input format bugs; however, none of them support large-scale program  
 60 analysis and synchronization checking of arbitrary input programs.

61 As a starting point for the Exo-GPU language design, we postulate that programs with *sequential*  
 62 *semantics* are far easier to analyze, both by humans and by compilers. We define program behavior  
 63 using standard sequential semantics, treating language features for parallelism and synchronization—  
 64 including parallel for-loops across CUDA thread hierarchies and barrier statements—simply as  
 65 *annotations* on sequential code. This design provides two key benefits: programmers can intuitively  
 66 reason about their code without hidden control flow or side effects, and more importantly, compilers  
 67 can verify whether synchronizations are legal; that is, they can verify these annotations do not  
 68 alter the program behavior relative to its baseline sequential interpretation.

69 The Exo-GPU program, interpreted sequentially, serves as the semantic baseline—defining  
 70 *what* the user intends to compute. Parallelism annotations specify *which threads* execute which  
 71 statement instances, while synchronization constructs specify *how* the user intends to uphold  
 72 baseline behavior. This raises a question: what degree of safety and control does Exo-GPU provide?  
 73 We define safety as *sequential-parallel equivalence*—ensuring that the parallel interpretation of  
 74 a program matches the sequential one. Parallelism is explicitly controlled by the programmer’s  
 75 annotations, rather than being hidden and automated by the compiler. The Exo-GPU compiler’s  
 76 role is limited to *verifying* that the user-provided synchronization ensures equivalence.

77 Through collective analysis (Section 4), we annotate each statement with a *collective tiling*,  
 78 specifying which thread(s) execute each statement instance. This enables static analysis of the  
 79 thread assignment counts, ensuring instructions with thread convergence requirements (e.g. warp  
 80 MMA) are used correctly. To ensure sequential-parallel equivalence, we simulate execution on an  
 81 *abstract machine* (Section 5). The Exo-GPU program converts to an Abstract Machine program  
 82 where data variables store access histories (reads and mutations) with their thread and sync  
 83 instruction *visibility*, rather than numeric values. Synchronization statements become meaningful  
 84 only in the abstract machine, conditionally expanding the visibility of all prior recorded accesses.

85 We built Exo-GPU as an extension to Exo [7, 8], a user-schedulable language providing com-  
 86 posable, checked rewrite rules for transforming sequential programs. Exo lacks GPU support, and  
 87 we significantly extended its frontend language design, static analysis, semantics, and codegen to  
 88 enable safe parallel program transformation. Treating parallel constructs as annotations over a  
 89 sequential semantic baseline enabled us to reuse Exo’s over 60 verified rewrites as-is, which allowed  
 90 us to transform a simple sequential GEMM program into a complex GEMM kernel for the NVIDIA  
 91 Hopper architecture, with every step of transformation verified via a *chain of equivalence*: Exo  
 92 verified equivalence between the original  $p_1$  and the final  $p_n$  while ignoring all the synchronization  
 93 constructs, and Exo-GPU verified functional equivalence between the parallel and sequential inter-  
 94 pretations of  $p_n$ , checking the legality of the parallel and synchronous annotations (Section 2.3).  
 95 Combining Exo’s pre-existing checked rewrite rules with Exo-GPU’s new language features and  
 96 synchronization checking, we created a GEMM kernel achieving over 80% of theoretical peak for  
 97 large problem sizes, in some cases outperforming the vendor-supplied CUBLAS library (Section 6).

## 99    2 EXAMPLES AND SYSTEM OVERVIEW

### 100    2.1 Exo-GPU Language Overview

101    Exo-GPU is built on fundamental design goals of safe, performance transparent, and imperative  
 102    GPU programming. Unlike other GPU abstractions that hide complexity, we make threads and most  
 103    CUDA instructions explicit in the IR. We introduce three first-class frontend language features de-  
 104    rived from our design principles: (i) parallel loops that explicitly map work to blocks/warps/threads,  
 105    (ii) a distributed memory model with sharding and explicit memory space annotations, and (iii)  
 106    explicit synchronization, including fences, split-barriers, and asynchronous operations.  
 107

108    **Parallel loops.** Exo-GPU exposes three kinds of loops. (1)  $\text{seq}(c_{lo}, c_{hi})$  is a sequential loop that  
 109    iterates  $i$  from  $c_{lo}$  to  $c_{hi}-1$ . (2)  $\text{cuda\_tasks}(lo, hi)$  distributes iterations across CUDA clusters; on pre-  
 110    sm\_90 (H100) GPUs a cluster coincides with a single CTA (thread block). (3)  $\text{cuda\_threads}(lo, hi, \text{unit} =$   
 111     $\tau_u$ ) assigns iterations to subsets of threads inside the current cluster. The  $\text{unit}$  takes a parameter  
 112     $\tau_u$  that specifies the shape of the participating thread set but not concrete indices—for example,  
 113     $\tau_u = \text{cuda\_warp}$  denotes contiguous 32-thread groups with thread IDs 0–31, 32–63, 64–95, etc.  
 114    (inclusive), while  $\tau_u = \text{cuda\_thread}$  denotes a single thread (see Figure 11).

115    Each loop body, conditional branch, with-block, or proce-  
 116    dure defines a *program scope*. As part of *collective analysis* (Sec-  
 117    tion 4), we annotate each scope with a *collective tiling*  $\omega \in \Omega$ .  
 118    This describes a mapping between the control variable values  
 119    and the *thread collective* (set of threads) assigned to execute  
 120    statement instances in the scope (i.e. which threads execute  
 121    which loop iterations). We say that a scope is a  $\tau_u$ -scope when  
 122     $\tau_u$  accurately describes the thread sets produced by the mapping. In the inset figure, the out-  
 123    ermost parallel loop  $\text{cuda\_tasks}(0, 37)$  creates scope-1 (which is a cluster scope, and, assum-  
 124    ing  $\text{clusterDim}=1$ , also a CTA scope); nesting  $\text{cuda\_threads}(\dots, \text{unit}=\text{cuda\_warp})$  creates  
 125    scope-2 (warp scope); and nesting  $\text{cuda\_threads}(\dots, \text{unit}=\text{cuda\_thread})$  creates scope-3  
 126    (single-thread scope). Statement instances in scope-3 are executed by one thread.

127    **Distributed Memory:** Exo-GPU exposes explicit allocation statements for data and barrier  
 128    variables, annotated by memory type, e.g.  $x : \text{f32}[8] @ \text{CudaRmem}$  allocates a data array  $x$   
 129    of type f32 and size 8 in GPU register memory. Allocations can be *distributed*: a single logical  
 130    object may be sharded across threads at multiple CUDA levels (e.g., per-thread register shards or  
 131    per-CTA-in-cluster shared-memory shards). Threads must access only their owned shards except  
 132    via explicit communication instructions (e.g., warp shuffles) or TMA multicast. This sharding is  
 133    not annotated explicitly at the allocation, but deduced from the usage pattern of the variable; all  
 134    deductions must be consistent, as enforced by distributed memory analysis (Section 4.1).

135    Inset right illustrates the motivation for this design. If the  
 136    first loop was executed sequentially,  $\text{tmp}$  would hold  $A[31]$   
 137    after the first loop, so the second loop broadcasts that value  
 138    to every  $B[i]$ . Languages like CUDA that implicitly duplicate  
 139    local variables per thread would interpret the same code as a  
 140    vector copy ( $B[i] = A[i]$ ). While not necessarily error-prone,  
 141    this design would be inappropriate for our goal of defining program behavior with sequential  
 142    semantics. Hence, in Exo-GPU, storing a temporary value per-thread requires a sharded allocation  
 143    of  $\text{tmp}$ ; as written, the program is rejected by the analysis because a register-resident  $\text{tmp}$  cannot  
 144    be read by different threads without (for example) a warp shuffle. If a user intends to express a  
 145    vector copy ( $B[i] = A[i]$ ) through register  $\text{tmp}$ ,  $\text{tmp}$  requires explicit *distributed* allocation per  
 146    thread ( $\text{tmp} : \text{f32}[32]$ ) and the accesses must be per-thread ( $\text{tmp}[i]$ ).

```
for blk in cuda_tasks(0, 37):
    # scope-1 (CTA-scope)
    for w in cuda_threads(0, 4,
                           unit=cuda_warp):
        # scope-2 (warp-scope)
        for t in cuda_threads(0, 32,
                           unit=cuda_thread):
            # scope-3 (thread-scope)
```

```
tmp: f32 @ CudaRmem # Register
for i in cuda_threads(0, 32,
                      unit=cuda_thread):
    tmp = A[i]
for i in cuda_threads(0, 32,
                      unit=cuda_thread):
    B[i] = tmp
```

148   **Explicit Synchronization:** Many GPU instructions require additional explicit synchronization,  
 149   but overly conservative barriers can cancel the performance benefits of concurrency. Exo-GPU  
 150   provides two forms of synchronization: a blocking barrier via Fence, and split barriers via paired  
 151   Arrive/Await. As with other statements, all synchronization statements execute in program order.

152   A Fence causes all threads in the current  
 153   thread collective (warp, CTA, or cluster) to ren-  
 154   dezvous. At warp scope it maps to `__syncwarp`;  
 155   at CTA scope it maps to `__syncthreads` (as  
 156   shown in the right-hand code inset). A Fence  
 157   takes two synchronization-timeline parame-  
 158   ters,  $\tau_s^{\text{pre}}$  and  $\tau_s^{\text{post}}$  (Figure 16). These timelines  
 159   specify which memory operations (sync and/or  
 160   async) before and after the fence must be or-  
 161   dered, and the compiler emits the minimal  
 162   combination of synchronization and memory-fence instructions required to satisfy them. The  
 163   `cuda_in_order` timeline specifies that the effects of async instructions are *not* synchronized, but  
 164   only regular synchronous instructions. If we were to, for example, replace the loads into shared  
 165   memory with `cp.async` operations, the first parameter must be set to `Sm80_cp_async` so that the  
 166   copy operations preceding the fence are completed before execution proceeds.

167   Split barriers use a shared barrier-typed variable  $z$ , declared  
 168   as  $z: \text{barrier} @ \pi_b$ , where  $\pi_b$  selects the completion mecha-  
 169   nism (commit group, mbarrier, or cluster sync). Programmers  
 170   express a split barrier statement pair with  $\text{Arrive}(\tau_s^{\text{pre}}) \gg z$   
 171   and  $\text{Await}(z, \tau_s^{\text{post}})$ . The statement pairing is determined by  
 172   the shared barrier variable  $z$ . A matching pair orders memory op-  
 173   erations prior to `Arrive` (filtered by  $\tau_s^{\text{pre}}$ ) with those after `Await`  
 174   (filtered by  $\tau_s^{\text{post}}$ ). The right-hand inset illustrates replacing the  
 175   earlier Fence example with an `mbarrier`-based split barrier.

```
# Code at CTA-scope, with blockDim=128
buf: f32[128] @ CudaSmemLinear
for tid in cuda_threads(0, 128, unit=cuda_thread):
    buf[tid] = gmem[tid]
# __syncthreads
Fence(cuda_in_order, cuda_in_order)
for tid in cuda_threads(0, 128, unit=cuda_thread):
    accum: f32 # Each thread sums up buf[...]
    accum = 0
    for i in seq(0, 128):
        accum += buf[i]
```

```
for tid in cuda_threads(0, 128,
                       unit=cuda_thread):
    ...
bar: barrier @ CudaMbarrier
Arrive(cuda_in_order) >> bar
# ... insert work here
Await(bar, cuda_in_order)
for tid in cuda_threads(0, 128,
                       unit=cuda_thread):
    ...
```

## 177   2.2 Exo-GPU Examples with Erroneous Synchronization

178   To motivate our language design, we build up a series of increasingly subtle, hard-to-identify inter-  
 179   and intra-thread synchronization bugs; all would be rejected by synchronization checking (Sec-  
 180   tion 5). Variables prefixed with `some_*` are allocated outside the code snippet, `some_in_order_op`  
 181   represents non-async work, and `example_tma_multicast` and `example_wgmma` are simplified place-  
 182   holders for actual TMA (tile copy) and wgmma (tensor core) instructions.

183   The first example illustrates an inter-thread synchronization bug involving a cluster with 2 CTAs:

```
B: f32[2, 256, 32] @ CudaSmemLinear # Distributed mem: B[0,:,:] on CTA 0; B[1,:,:] on CTA 1
for cta in cuda_threads(0, 2, unit=cuda_cta_in_cluster):
    some_in_order_op(B[cta, :, :])
    Fence(cuda_in_order, cuda_in_order) # __syncthreads
example_tma_multicast(B[:, :, :], some_gmem[:, :, :]) # Copy some_gmem[:, :] to B[0,:,:] and B[1,:,:]
```

189   The `example_tma_multicast` instruction is implemented by threads in the 2 CTAs cooperating to  
 190   fill each others' shared memory; however, this cooperation does not entail implicit synchronization  
 191   between the two CTAs. Since the `Fence` statement, as placed in CTA-scope, only synchronizes  
 192   threads within a CTA, it is possible that threads in CTA 0 proceed to the multicast instruction,  
 193   filling the SMEM of CTA 1 (`B[1, :, :]`), while threads in CTA 1 are still reading from it while  
 194   executing `some_in_order_op`. Although the TMA instruction is async, this is not relevant to the  
 195   bug illustrated here, as the TMA appears after the in-order operations upon which it races with.

197 The next example illustrates an intra-thread synchronization bug:

```
198 D: f32[2, 2, 64, 256] @ Sm90_RmemMatrixD(64, 256) # Distributed: D[x,y,:,:] on CTA x, warpgroup y
199 for cta in cuda_threads(0, 2, unit=cta_in_cluster):
200     for wg in cuda_threads(0, 2, unit=cuda_warpgroup):
201         example_wgmma(D[cta, wg, :, :], some_A[:, :], some_B[:, :]) # D += matmul(A, B)
202         some_in_order_op(D[cta, wg, :, :])
```

203 This code parallelizes across CTAs and warpgroups, with each warpgroup accessing only its own  
 204 shard D[cta, wg, :, :]. Nevertheless, the code is flawed, as the async example\_wgmma instruction  
 205 isn't waited-for before some\_in\_order\_op attempts to access its output.

206 A more subtle bug arises when async instructions interact with inter-thread synchronization:

```
207 B: f32[2, 256, 32] @ Sm90_SmemSwizzled(128) # Distributed: B[x,:,:] on CTA x
208 D: f32[2, 2, 64, 256] @ Sm90_RmemMatrixD(64, 256) # Distributed: D[x,y,:,:] on CTA x, warpgroup y
209 cg: barrier[2, 2] @ CudaCommitGroup # Distributed: cg[x,y] for CTA x, warpgroup y
210 for cta in cuda_threads(0, 2, unit=cta_in_cluster):
211     for wg in cuda_threads(0, 2, unit=cuda_warpgroup):
212         some_in_order_op_1(B[cta, :, :])
213         example_wgmma(D[cta, wg, :, :], some_A[:, :], B[cta, :, :]) # D += matmul(A, B)
214         Arrive(wgmma_async) >> cg[cta, wg]
215         cs: barrier @ CudaClusterSync
216         Arrive(cuda_in_order) >> cs
217         for cta in cuda_threads(0, 2, unit=cta_in_cluster):
218             for wg in cuda_threads(0, 2, unit=cuda_warpgroup):
219                 Await(cs, cuda_in_order, 0)
220                 some_in_order_op_2(D[cta, wg, :, :])
221                 Await(cs, cuda_in_order, 0)
222                 example_tma_multicast(B[:, :, :], some_gmem[:, :, :]) # Copy some_gmem[:, :] to B[0,:,:] and B[1,:,:]
```

221 The statements connected in blue on the left illustrate safe synchronization in a situation similar  
 222 to our first example: the Arrive/Await pair using cs (cluster sync) ensures all accesses to B by  
 223 some\_in\_order\_op\_1 finish before any thread in the cluster overwrites B using TMA. The state-  
 224 ments connected in blue on the right illustrate safe synchronization in a situation similar to our  
 225 second example: the Arrive/Await pair using cg (commit group) ensures each warpgroup waits  
 226 for its example\_wgmma to finish before accessing D[cta, wg...] in some\_in\_order\_op\_2.

227 However, the bolded access to **B** by example\_wgmma is unsafe. Dotted red arrows illustrate  
 228 synchronization that does *not* happen. Because the Arrive on cs uses the timeline cuda\_in\_order,  
 229 the completion of the Arrive does not depend on the prior async wgmma instructions to retire.  
 230 Meanwhile, the Await on cg (commit group) occurs too late for it to interact transitively with the  
 231 prior cs (cluster sync) Arrive. Therefore, while all *non-async* accesses to B will retire before any  
 232 thread proceeds to the TMA multicast, this isn't the case for the async wgmma instructions, whose  
 233 B operand could be overwritten unexpectedly by another CTA multicasting to its input memory.  
 234 Subtle bugs like this can be difficult to spot in optimized CUDA code.

### 235 2.3 Optimization Process and Equivalence Checking

236 Exo-GPU programmers can either directly write optimized object code or apply a sequence of sched-  
 237 uling transformations to simple initial code. Both approaches grant full control over performance-  
 238 critical details. However, the scheduling approach offers significant practical advantages for per-  
 239 formance engineering: optimization strategies become reusable functions across similar kernels,  
 240 complex transformations decompose into sequences of verified primitives that enable rapid iteration  
 241 with correctness guarantees, and the compiler automatically handles complex index calculations  
 242 for tiling strategies—especially valuable for GPU optimization. Below, we describe Exo-GPU's  
 243 scheduling-based optimization process.

|     |                                                 |                  |                                              |                 |
|-----|-------------------------------------------------|------------------|----------------------------------------------|-----------------|
| 246 | $v : \text{Var} ::= x \in \mathbb{X}$           | data variable    |                                              |                 |
| 247 | $y \in \mathbb{Y}$                              | control variable |                                              |                 |
| 248 | $z \in \mathbb{B}$                              | barrier variable |                                              |                 |
| 249 | $r \in \mathbb{W}$                              | warp variable    |                                              |                 |
| 250 | $e : \text{DExpr} ::= d$                        | data value       | $c : \text{CExpr} ::= n$                     | integer         |
| 251 | $x   x[w^*]$                                    | reads            | $y$                                          | reads           |
| 252 | $e_1 + e_2   e_1 - e_2   e_1 * e_2   e_1 / e_2$ | compound expr    | $c_1 + c_2   c_1 - c_2   c * n$              | affine arith    |
| 253 | $e_z : \text{ZExpr} ::= z   z[w^*]$             | reads            | $c / n   c \bmod n$                          | quasi affine    |
| 254 | $e_w : \text{WExpr} ::= r = (n, n, n)$          | warp config      | $b : \text{BExpr} ::= t$                     | boolean         |
| 255 | $n \in \mathbb{Z}$                              | integer          | $b_1 \text{ and } b_2   b_1 \text{ or } b_2$ | logical ops     |
| 256 | $t \in \text{Bool}$                             | boolean          | $c_1 == c_2   c_1 < c_2   c_1 \leq c_2$      | relational ops  |
| 257 | $d \in \mathbb{D}$                              | data values      | $w : \text{WinCoord} ::= c$                  | point access    |
|     |                                                 |                  | $c_1..c_2$                                   | interval access |

Fig. 1. The syntax of variables and expressions of GPU IR. We use  $^*$  to mean 0 or more.

The scheduling flow in Exo-GPU begins with a simple initial procedure  $p_1$ . The programmer issues a series of *rewrite operations*  $R_1 \dots R_{N-1}$  to create transformed procedures  $p_2 \dots p_N$ .

$$\begin{array}{lll} \text{Rewrites: } & p_1 \xrightarrow{R_1} p_2 \dots \xrightarrow{R_{N-1}} p_N & (\text{Exo-GPU checks}) \quad p_N \\ \text{Behavior: } & \text{seq} [p_1] \equiv \text{seq} [p_2] \dots \equiv \text{seq} [p_N] & \equiv \text{par} [p_N] \end{array}$$

Exo's scheduling rewrite verifies functional equivalence of  $\text{seq} [p_1] \dots \text{seq} [p_N]$  under sequential semantics (denoted  $\text{seq} [\dots]$ ). Exo-GPU adds new scheduling operations for parallel and synchronization constructs, but these non-semantics-altering annotations are ignored under  $\text{seq} [\dots]$  checks. Then Exo-GPU performs additional verification on the final program  $p_N$ . Specifically, synchronization checking ensures  $\text{seq} [p_N] \equiv \text{par} [p_N]$ , where  $\text{par} [\dots]$  denotes parallel semantics. Transitively, this completes the *chain of equivalence* from  $\text{seq} [p_1] \equiv \dots \equiv \text{seq} [p_N] \equiv \text{par} [p_N]$ , guaranteeing that  $p_1$ 's behavior is preserved throughout all transformations.

Equivalence checking relies on three categories of reasoning:

**Exo Rewrites:** Exo's existing rewrites implement dependency analysis-based equivalence checking, which guards against *dataflow logic bugs*. For example, `reorder_stmts`, which reorders two statements  $s_1; s_2$ , is only permitted when the effects of the statements commute.

**Instruction Substitution:** Explicit instruction selection rewrite (replace) lets programmers substitute a code block with an accelerator instruction call. Exo-GPU models instructions via their *behavior* (sequential semantics), per-parameter *memory type* (physical memory and layout), expected collective unit  $\tau_u$ , and *async* annotations. Exo's unification verifies that the replaced code block's behavior matches the instruction, while Exo-GPU's static analysis ensures appropriate threads execute it (Section 4.1); together preventing *instruction logic errors*.

**Exo-GPU Synchronization Check:** Since parallel loops and synchronization statements (Fence, `Arrive`, and `Await`) are ignored during sequential checks, synchronization checking must ensure that their usage is valid and does not alter program behavior or introduce race conditions. We verify this via interpretation on the Abstract Machine (Section 5), which protects the program against both *inter- and intra-thread synchronization bugs* for selected concrete problem sizes.

### 3 GPU IR

GPU IR is the frontend language of Exo-GPU, and examples in Section 2 are written in GPU IR.

#### 3.1 Variables and Expressions

Figure 1 defines the expression syntax of GPU IR. GPU IR explicitly distinguishes between control, data, barrier, and warp expressions and is a static control program. We denote the sets of data, control, barrier, and warp variables by  $\mathbb{X}, \mathbb{Y}, \mathbb{B}, \mathbb{W}$ , respectively.  $x \in \mathbb{X}$  is a data variable,  $y \in \mathbb{Y}$

is a control variable,  $z \in \mathbb{B}$  is a barrier variable, and  $r \in \mathbb{W}$  is a warp variable. The truth values  $\text{Bool} \triangleq \{\text{true}, \text{false}\}$ . are ranged over by a metavariable  $t$ , and  $n$  is the metavariable for  $\mathbb{Z} \triangleq \{\dots, -1, 0, 1, 2, \dots\}$ . The metavariable  $d \in \mathbb{D}$  ranges over data values, including 8-, 32-, and 64-bit integers as well as single- and double-precision floating-points.

Control expressions CExpr encompass integer, scalar read, and (quasi-) affine arithmetic operations. Boolean expressions BExpr define basic logical and relational operations on control expressions. Data expressions DExpr are similarly defined, but operations need not be affine. Barrier Expressions ZExpr are only allowed to read, but not composed. Window coordinates  $w^*$  (we use  $.\ast$  to denote a tuple) used in data and barrier read accesses are specified as a tuple of point-wise or interval control expressions, expressing a windowed access to multi-dimensional arrays. Scalar read is syntactic sugar for an array access of  $w^* = \langle \rangle$  (empty tuple).

### 3.2 Types

All types are fully defined in Appendix A. This section provides a general introduction to the different types in GPU IR. Figure 10 defines data and barrier types. Data types  $\tau_x$  specify precisions for data variables  $\mathbb{X}$ , such as f32 and i32. Barrier types  $\tau_z$  distinguish a non-explicitly guarded barrier from an explicitly guarded barrier ( $\text{barrier}(z)$ ); the explicitly guarded barrier case is required only for mbarriers. A *collective type* ( $\delta \in \Delta$ ) defines a tuple consisting of a domain and a box, each representing  $d$ -dimensional natural numbers  $((\mathbb{N}^d, \mathbb{N}^d))$ . Together, these define a  $d$ -level thread hierarchy within a thread cluster, and a selection of a number of elements on each level. An argument to `cuda_threads` loop  $\tau_u$  is parameterized by a collective type  $\delta$ , which is defined in Figure 11. Figures 12–14 classify three categories of memory spaces:  $\pi_d$  for data memories (such as `CudaRmem` for CUDA registers),  $\pi_z$  for barrier completion mechanisms (such as `mbarrier` or `commit_group`), and  $\pi_w$  for special window aliasing (such as `CUtensorMap`). Similar to  $\tau_u$ ,  $\pi_d$  is parameterized by  $\delta$ .

A qualitative timeline  $\tau_q$  annotates each *runtime* buffer access. We define distinct  $\tau_q$  to distinguish between buffer accesses that require different synchronization mechanisms or memory fences to resolve hazards. For example, we distinguish non-async copies from `cp.async` copies (which require `cp.async.wait_all` or similar to wait for), and we distinguish register and SMEM operands of `wmma` instructions (which require `wmma.fence` and `fence.proxy.async`, respectively). Section 5 will formally define how a runtime buffer access instance  $x[i_1, \dots, i_n]$  tagged with  $q_1$  states that, at the program point and thread, the access to  $x$  occurs under  $q_1$ . Moreover, the sync timelines SyncTL mentioned in Section 2.1, which parameterize synchronization statements, are actually defined as compositions of qualitative timelines (QualTL), as shown in Figure 16.

### 3.3 Programs

Exo-GPU supports two types of procedures,  $p : \text{Proc}$ , which models CPU functions (possibly containing CUDA kernel launches), and  $g : \text{Instr}$ , which models hardware instructions for either CPU or CUDA. Both procedure types share three common components: a statement body that defines sequential behavior, a tuple of control parameters  $y^*$ , and a tuple of data parameters  $x^*$ . Each data parameter in Proc carries two annotations:  $\tau_x$ , specifying the required data precision, and  $\pi_d$ , specifying the required memory type for the passed argument.

Instruction parameters ( $\tau_a$ ) also carry both  $\tau_x$  and  $\pi_d$  annotations, as well as hardware-specific information, namely: out-of-order, convergence, and qualitative timeline information, which all affect only abstract machine semantics for checking (Section 5). The  $\tau_a$  also contains a distributed collective units tuple ( $\tau_u^*$ ) specifying the CUDA thread hierarchy at which each array dimension is sharded (checked by distributed memory analysis, Section 4.1). The instruction ( $g$ ) contains annotations for  $t$  (a CPU/CUDA flag), a collective unit  $\tau_u$  specifying the convergent threads required, and  $\pi_z$  and  $\tau_u^*$ .

|     |                                                                                      |                                                                                                   |
|-----|--------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------|
| 344 | $s : \text{Stmt} ::= s_1; s_2$                                                       | sequencing                                                                                        |
| 345 | for $y$ in seq( $c_{\text{lo}}, c_{\text{hi}}$ ) do $s$                              | sequential loop                                                                                   |
| 346 | for $y$ in cuda_tasks( $c_{\text{lo}}, c_{\text{hi}}$ ) do $s$                       | cuda tasks                                                                                        |
| 347 | for $y$ in cuda_threads( $c_{\text{lo}}, c_{\text{hi}}, \text{unit}=\tau_u$ ) do $s$ | cuda threads                                                                                      |
| 348 | with CudaDeviceFunction( $n, e_w^*$ ) do $s$                                         | CUDA device function block                                                                        |
| 349 | with CudaWarps( $n, n, r$ ) do $s$                                                   | Warp Specialization Block                                                                         |
| 350 | if $b$ then $s$                                                                      | guard                                                                                             |
| 351 | $p(c^*, e^*)$                                                                        | non-instruction subprocedure call                                                                 |
| 352 | $g(c^*, e^*) \gg e_z$                                                                | instruction subprocedure call                                                                     |
| 353 | $x[c^*] = e$                                                                         | write                                                                                             |
| 354 | $x[c^*] += e$                                                                        | reduce                                                                                            |
| 355 | $x = x[c^*] @ \pi_w$                                                                 | window statement                                                                                  |
| 356 | $x : \tau_x[c^*] @ \pi_d$                                                            | Data alloc                                                                                        |
| 357 | $z : \tau_z[c^*] @ \pi_z$                                                            | Barrier alloc                                                                                     |
| 358 | free $x$                                                                             | Data free                                                                                         |
| 359 | free $z$                                                                             | Barrier free                                                                                      |
| 360 | Fence( $\tau_z, \tau_s$ )                                                            | Non-split barriers                                                                                |
| 361 | Arrive( $\tau_s, n$ ) $\gg e_z^*$                                                    | Arrive                                                                                            |
| 362 | Await( $e_z, \tau_s, n$ )                                                            | Await                                                                                             |
| 363 | $\tau_a : \text{Arg} ::= \tau_x$                                                     | data type (precision)                                                                             |
| 364 | $\pi_d$                                                                              | data memory type                                                                                  |
| 365 | out_of_order $\in \text{Bool}$                                                       | out-of-order execution flag                                                                       |
| 366 | convergent $\in \text{Bool}$                                                         | implicit thread sync flag                                                                         |
| 367 | $\tau_u^*$                                                                           | distributed collective units                                                                      |
| 368 | initial_qual_tl $\in \text{QualTL}$                                                  | initial qualitative timeline                                                                      |
| 369 | ext_qual_tl $\in \mathcal{P}(\text{QualTL})$                                         | extended qualitative timeline set                                                                 |
| 370 | atomic_qual_tl $\in \mathcal{P}(\text{QualTL})$                                      | atomic qualitative timeline set                                                                   |
| 371 | $p : \text{Proc} ::= \text{proc } y^*, (x : \tau_x @ \pi_d)^* \text{ do } s$         | $g : \text{Instr} ::= \text{proc } y^*, (x : \tau_a)^*, t, \tau_u, \pi_z, \tau_u^* \text{ do } s$ |

Fig. 2. The syntax of GPU IR statements.

### 3.4 Statements

Figure 2 defines statements  $s \in \text{Stmt}$ . A statement can sequence ( $s_1; s_2$ ), iterate over loops (seq, cuda\_tasks, or cuda\_threads), and guard (if  $b$  then  $s$ ). Data effects include indexed writes  $x[c^*] = e$ , reductions  $x[c^*] += e$ , and window creation  $x = x[c_{\text{lo}} : c_{\text{hi}}] @ \pi_w$ . CudaDeviceFunction and CudaWarps control generation and selection of CUDA threads. Synchronization statements Fence, Arrive, and Await get lowered to CUDA synchronization and proxy fence instructions.

**CUDA Kernel Structure:** All Exo-GPU with `CudaDeviceFunction(...)`:

```
for cta_m in cuda_tasks(...):
    for cta_n in cuda_tasks(...):
        # Device task starts here
        A_smem: f32[128, 32] @ CudaSmemLinear
        # ...
```

code is compiled as CPU code by default. A CudaDeviceFunction block specifies a CPU-side launch of a CUDA kernel. Its body must contain only a single statement: a nest of one or more cuda\_tasks loops; cuda\_tasks loops must not appear elsewhere. Each iteration of the innermost cuda\_tasks loop comprises a device task and is assigned to one CUDA cluster for execution. The *device scope* of a statement is CPU if outside of a CudaDeviceFunction block, and GPU if inside. At CPU scope, all loops must have loop mode seq. The arguments of `CudaDeviceFunction( $n, e_w^*$ )` respectively define the `clusterDim` and, indirectly, the `blockDim` of the CUDA clusters launched. Each warp expression  $e_w : r = (n, n, n)$  defines a group of warps named  $r$ , with the first  $n$  parameter being the number of warps. If nonzero, the latter two  $n$  parameters specify that the warps will adjust the register count down or up, respectively, using a `setmaxnreg.dec` or `setmaxnreg.inc` instruction [18]. We infer that `blockDim` is 32 times the total number of warps named.

393   **Thread Control:** Having described the  
 394   cuda\_tasks loops (inter-cluster parallelism),  
 395   we now describe cuda\_threads loops and  
 396   with CudaWarps (intra-cluster parallelism). A  
 397   cuda\_threads loop takes its the executing  
 398   threads in flight and sub-divides it, assigning  
 399   one subdivision to execute each loop iteration  
 400   (possibly with left-over, inactivated threads).  
 401   The unit parameter adjusts the number of  
 402   threads per iteration. The CudaWarps( $n, n, r$ ) block restricts its body statement to execute only with  
 403   threads residing in the warp variable named by  $r$  (which is defined in the CudaDeviceFunction);  
 404   the numeric arguments further restrict the warps selected.

405   **Proc & Instruction Calls:** Calls to  $p : \text{Proc}$  and  $g : \text{Instr}$  are syntactically similar, distinguished  
 406   only by the callable's type. Exo-GPU's static analysis enforces that calls to  $p : \text{Proc}$  only appear at  
 407   CPU scope, while calls to  $g : \text{Instr}$  appear at CPU or CUDA scope as appropriate for  $g$ . For CUDA  
 408   instructions, calls to  $g$  must only appear at  $\tau_u$ -scope, where  $\tau_u$  is the collective unit specified for  $g$ .  
 409   Additionally, each data parameter must have the correct precision  $\tau_x$  and memory type  $\pi_d$ , and the  
 410   trailing barrier (if required) must have the correct  $\pi_z$ .

411   **Synchronization Statements:** The user specifies synchronization between threads and/or  
 412   between async accelerator instructions by inserting synchronization statements. They take *synchronization timeline* (SyncTL) parameters, defined as compositions of qualitative timelines and a  
 413   transitivity (trnstv?) flag (Figure 16). The transitivity flag controls interaction between synchronization statements, detailed in Section 5.

414   Each  $\tau_s : \text{SyncTL}$  contains two QualTL sets: the full timeline set (indicated by "full" in (Figure  
 415 ), and a temporal timeline set (indicated by "full" or "temp."). A Fence( $\tau_s^{\text{pre}}, \tau_s^{\text{post}}$ ), or paired  
 416 Arrive( $\tau_s^{\text{pre}}, \_$ ) and Await( $\_, \tau_s^{\text{post}}$ ,  $\_$ ) statements, resolves hazards between two accesses to the  
 417 same array element when: (1) the first access precedes synchronization and uses a QualTL from  
 418  $\tau_s^{\text{pre}}$ 's full timeline set, and (2) the second access follows synchronization and uses a QualTL from  
 419  $\tau_s^{\text{post}}$ 's full timeline set (for reads) or temporal timeline set (for writes). In hardware terms, the  
 420 distinction between the full and temporal timeline sets is due to memory fences. If the prior value of  
 421 a buffer element is *overwritten*, without being read, we require only that the overwrite is temporally  
 422 ordered after prior accesses to that buffer element, and memory fences may be elided.

423   **Alloc & Free:** An allocation statement begins the lifetime of a new data or barrier variable. For  
 424   data allocations, the data memory  $\pi_d$  determines the physical backing memory (registers (RMEM),  
 425   shared memory (SMEM), or global memory (GMEM)) and the mapping between multidimensional  
 426   tensor coordinates and 1D memory offsets (Figure 12). It also carries an implicit collective type  $\delta$ ,  
 427   specifying which level of the thread hierarchy at which the physical memory type is allocated (e.g.  
 428   CTA for shared memory allocations); this is consumed by distributed memory analysis (Section 4.1).  
 429   For barrier allocations, the BarrierComp parameter  $\pi_z$  controls the completion mechanism used  
 430   for CUDA code translated from Arrive and Await statements (Figure 13). The explicitly-guarded  
 431   barrier type is relevant only for certain configurations of mbarrier usage. A free statement ends  
 432   the lifetime of a data or barrier variable. Currently, we forbid the user from inserting free statements  
 433   manually; these are added automatically in a separate compiler pass.

434   **Window Statement:** A window statement  $x_{\text{win}} = x_{\text{data}}[w^*] @ \pi_w$  defines an alias to  $x_{\text{data}}$ . A  
 435   window into  $x_{\text{win}}$  can be passed as an instruction argument which requires special window type  
 436    $\pi_w$ . Currently, we only use this feature to construct CUTensorMap objects [18] required for TMA  
 437   instructions; we parameterize the CUTensorMap type with its swizzle mode and box shape.

```
my_warp_config = [
  CudaWarpConfig("producer", 1, 40, 0),
  CudaWarpConfig("unused", 3, 40, 0),
  CudaWarpConfig("consumer", 8, 0, 232)]
with CudaDeviceFunction(clusterDim=2,
  warp_config=my_warp_config):
```

Fig. 3. Launching clusters of 2 CTAs each, with 1 "producer", 3 "unused", and 8 "consumer" warps per CTA (total blockDim=384). The consumer warps have 232 registers per thread, and the others only 40.

## 4 COLLECTIVE ANALYSIS

Collective analysis is a static, forward dataflow analysis over a GPU IR procedure. Its results are consumed by (1) a memory analysis, (2) code generation, and (3) an abstract-machine interpreter. The analysis annotates every GPU lexical scope with a collective tiling  $\omega \in \Omega$ . By default, statements inherit their parent's tiling unchanged; a new tiling is derived only at: device entry (CudaDeviceFunction), thread loops (cuda\_threads), and warp-specialization blocks (CudaWarps). The purpose of this analysis is to statically reason about the mapping between work and the threads *within* clusters, so here we ignore cuda\_tasks loops, which distribute work across clusters. We uniquely identify a thread within a cluster by its *natural thread index*; a set of such indices comprises a *thread collective*  $\mu \in \mathbb{T}$ , where  $\mathbb{T} \triangleq \mathcal{P}(\mathbb{N})$ .

**Definition 4.1** (Natural thread index). The *natural thread index* for a CUDA thread is `cluster_ctarank * blockDim.x + threadIdx.x`. (Exo-GPU only parallelizes on the x dimension).

**Definition 4.2** (Collective tiling). A *collective tiling*  $\omega \in \Omega$  for an  $M$ -dimensional domain is an  $M$ -tuple of *dimension descriptors*

$$\omega = \langle \mathcal{D}_0, \dots, \mathcal{D}_{M-1} \rangle \quad \text{with} \quad \mathcal{D}_m = (D_m, O_m),$$

where  $D_m \in \mathbb{N}$  is the extent of dimension  $m$  and  $O_m \in \mathcal{O}^*$  is an ordered list of (possibly empty) *dimension operators*  $O : \mathbb{Y} \times \mathbb{N}^3$ . We write the thread pitch of dimension  $m$  as  $P_m \triangleq \prod_{i=m+1}^{M-1} D_i$  where  $P_{M-1} = 1$ .

The dimensions of a collective tiling  $\omega$  capture how a cluster's threads are partitioned into hierarchical collectives (cluster, block, warp). We support *reshape* to introduce ad-hoc additional hierarchy (e.g. pairs of warps). From there, each dimension's dimension operators summarize the control variables (identified by  $y \in \mathbb{Y}$ ) that iterate on that dimension. The thread pitch of the dimension describes the distance, in units of natural thread indices, between adjacent elements (e.g. a CTA dimension would have thread pitch `blockDim`).

During forward dataflow analysis, new collective tilings are derived for the three special statements mentioned above, while all other statements inherit their parent's tiling. It produces a child collective tiling through the following transformation:

$$\text{derive}_\omega : (\omega^{\text{env}}, \delta^{\text{stmt}}, y, \text{lo}, \text{hi}, \text{tileCount}) \rightarrow \omega'$$

where  $\omega^{\text{env}}$  is the parent scope's collective tiling and  $\delta^{\text{stmt}}$  is the collective type from the statement. The (optional) iterator  $y$  and bounds  $[\text{lo}, \text{hi}]$  describe a regular slice of linear thread space;  $\text{tileCount} \in \mathbb{N}$  is the number of tiles for that slice.

The `cuda_threads` loop and `CudaWarps` block use  $\text{derive}_\omega$  to create a new collective tiling  $\omega$  based on the one annotating the parent scope. The `CudaDeviceFunction` block derives  $\omega$  based on launch parameters:  $\omega$  is 1-D if `clusterDim=1` (domain  $D_0=\text{blockDim}$ ), otherwise 2-D with  $(D_0, D_1)=(\text{clusterDim}, \text{blockDim})$ ; all operator lists start empty.

The implementation of  $\text{derive}_\omega$  proceeds in two steps:

- (1) *Shape alignment*. Make  $\omega^{\text{env}}$  and  $\delta^{\text{stmt}}$  compatible by reshaping via hierarchical splits so that their thread pitches  $P_m$  align.
- (2) *Single-dimension refinement*. After alignment, at most one dimension  $m$  is newly tiled; the derivation appends exactly one operator to  $O_m$ .

Once a tiling  $\omega$  is derived, it is compiled into a *thread mapping*:  $f_\omega : \Sigma \rightarrow \mathbb{T}$ , where  $\Sigma \triangleq (\mathbb{Y} \rightarrow \mathbb{Z})$  which maps a control environment  $\sigma \in \Sigma$  to a runtime thread collective  $\mu \in \mathbb{T}$ . Equivalently, let  $\mathcal{F} \triangleq \{f \mid f : \Sigma \rightarrow \mathbb{T}\}$  and  $f_\omega \in \mathcal{F}$ . During conversion, each Abstract Machine statement  $s^\#$  is annotated with its thread mapping  $f_{s^\#} \in \mathcal{F}$ . In Section 5.3 we write  $f_{s^\#}$  for the thread mapping associated with statement  $s^\#$ .

## 491 4.1 Distributed Memory Analysis

492 *Distributed memory analysis* is a static, program-wide consistency check that ensures every use of  
 493 CUDA-scoped data and barrier variables is compatible with their memory definitions. Each memory  
 494 type  $\pi_d$  specifies a collective type  $\delta_{\pi_d}$ . The analysis validates that allocation and use agree on which  
 495 threads “own” which memory slices, indexing respects that ownership, and all uses induce the  
 496 same mapping from logical indices to CUDA thread identifiers. Analysis runs in four steps:

- 497 (1) **Desugaring instruction call.** To expose the implicit collective requirement encoded by  
 498 each hardware instruction to the collective analysis in **step 2**, we rewrite instruction calls  
 499  $g(\dots)$  into explicit loops. Each arguments to the instruction call are wrapped around with  
 500 `cuda_threads`, with their collective units specified by  $g.r_u^*$  in the instruction specification.
- 501 (2) **Collective analysis.** Run collective analysis and annotate each scope with  $\omega \in \Omega$ .
- 502 (3) **Triple collection.** For each data or barrier variable, record the *access triple* of all of use  
 503 sites. We use  $\mathcal{T}_v$  to denote a set of those triples for a variable  $v$ ; these sets for all the data and  
 504 barrier variables are the sole inputs to **step 4**. Each triple  $(n^*, \omega, c^*)$  denotes: the subdivided  
 505 dimensions  $n^*$ , the use-site tiling  $\omega$ , and observed index coordinates  $c^*$ . For each use site:
  - 506 • **Data.** Compute  $n^*$  from the memory’s collective type  $\delta_{\pi_d}$  and the use-site tiling  $\omega$ .
  - 507 • **Barriers.** Treat all dimensions as subdivided (every axis participates in distribution).
- 508 (4) **Triple analysis.** For each CUDA-scoped data variable  $v$ , we require: (i) consistency of the  
 509 allocation w.r.t. the memory’s  $\delta_{\pi_d}$ , (ii) consistency of data or barrier accesses across sites  
 510 (by checking all triples in  $\mathcal{T}_v$ ), and (iii) all triples in  $\mathcal{T}_v$  must derive one and only one thread  
 511 pitch tuple that summarizes how incrementing logical indices advances natural thread IDs.

512 As a result, distributed memory analysis rejects programs that (i) mismatch allocation and accesses  
 513 with their memory declaration, (ii) inconsistently index the same data or barrier across sites, or  
 514 (iii) rely on missing/extra CUDA-thread iterators. Passing guarantees that the program’s memory  
 515 behavior is coherent with its collective structure and declared memory types.

## 517 4.2 Code Generation to CUDA Code

518 **Lowering `cuda_threads` Loops:** The corre-    `if (int y = f(blockIdx, threadIdx); g(y)) {`  
 519 spondence between parallel loops and threads                `lower(s); }`

520 has thus far been expressed in terms of thread mappings of type  $\Sigma \rightarrow \mathbb{T}$ , which infer a set of  
 521 active threads from the control environment  $\sigma$ . However, the generated CUDA C++ implements the  
 522 inverse of this mapping, inferring the value of a control variable  $y$  from the thread index instead.  
 523 Each Exo-GPU loop of the form `for y in cuda_threads(lo, hi, unit=t_u) do s` lowers to  
 524 CUDA C++ as shown in the inset right. Here,  $f$  is a function of CUDA’s built-in `blockIdx` and  
 525 `threadIdx` variables, and  $g$  is a boolean condition on  $y$  that disables threads not assigned to execute  
 526 any iteration of the body. Where possible,  $f$  is 0 and  $g$  is 1, to facilitate uniform branching.

527 **Warp Specialization:** When a device function uses multiple warp variables  $r_1, \dots, r_W \in \mathbb{W}$ , we  
 528 generate  $W$ -many CUDA C++ code paths from a single Exo-GPU device function, each specialized  
 529 for (and executed by) the threads assigned to one warp variable. Each path uses PTX’s `setmaxnreg`  
 530 instruction to vary per-thread register counts, and omits all code paths guarded by a `with CudaWarp`  
 531 block holding code not intended for the current warp. This allows the PTX assembler to statically  
 532 verify that register-heavy instructions (e.g. `wgmma`) don’t appear on low-register code paths.

533 **Persistent Kernels and `cuda_tasks` Loops:** The generated CUDA C++ device functions imple-  
 534 ment a persistent kernel design that launches exactly the number of thread block clusters needed  
 535 to saturate the device. Following the approach of CUTLASS grouped kernel schedulers [16], we  
 536 implement a C++ task generator object that generates a series of device task coordinates for thread  
 537 block clusters to execute. The current implementation orders tasks lexicographically and assigns  
 538 them round-robin to thread block clusters for execution. While we leave it as future work to

|     |                                                             |                                                                             |                                            |
|-----|-------------------------------------------------------------|-----------------------------------------------------------------------------|--------------------------------------------|
| 540 | $\tau_v : VL ::=$                                           | fully_ordered   temporally_ordered<br>  unordered   atomic_only   invisible | visibility levels                          |
| 541 | $e^\# : Expr^\# ::= e_d^\# \mid e_z^\#$                     | where $e_d^\# \triangleq x[w^*]$ $e_z^\# \triangleq z[w^*]$                 | barrier and data accesses                  |
| 542 | $s^\# : Stmt^\# ::= s_1^\# ; s_2^\#$                        |                                                                             | sequencing                                 |
| 543 | for $y$ in loop( $c_{lo}, c_{hi}$ ) do $s^\#$               |                                                                             | loop                                       |
| 544 | if $b$ then $s^\#$                                          |                                                                             | guard                                      |
| 545 | IncTaskID                                                   |                                                                             | increment the task ID                      |
| 546 | Alloc( $e^\#$ )                                             |                                                                             | Updates SyncEnv for Alloc                  |
| 547 | RecordRead( $\tau_v, t, e^\#, \tau_q, \tau_q^*, e_z^\#$ )   |                                                                             | Updates SyncEnv with Read                  |
| 548 | RecordMutate( $\tau_v, t, e^\#, \tau_q, \tau_q^*, e_z^\#$ ) |                                                                             | Updates SyncEnv with Mutate                |
| 549 | ClearReads( $e^\#$ )                                        |                                                                             | Remove state added by RecordRead           |
| 550 | ClearMutates( $e^\#$ )                                      |                                                                             | Remove state added by RecordMutate         |
| 551 | Fence( $t, \tau_q^*, \tau_q^*, \tau_q^*$ )                  |                                                                             | Non-split barriers                         |
| 552 | Arrive( $t, \tau_q^*, e_z^\#$ )                             |                                                                             | Arrive                                     |
| 553 | Await( $e_z^\#, \tau_q^*, \tau_q^*, n$ )                    |                                                                             | Await                                      |
| 554 | CheckReads( $\tau_v, t, e^\#, \tau_q^*$ )                   |                                                                             | Checks visibility of Reads in SyncEnv      |
| 555 | CheckMutates( $\tau_v, t, e^\#, \tau_q^*$ )                 |                                                                             | Checks visibility of Mutates in SyncEnv    |
| 556 | CheckBarrier( $e_z^\#$ )                                    |                                                                             | Checks the consistency of arrive and await |
| 557 | $p^\# : Proc^\# ::= proc y^*$                               |                                                                             | arguments are control-only                 |
| 558 | do $s^\#$                                                   |                                                                             |                                            |

559 Fig. 4. Syntax of Abstract Machine IR (AMIR). Variables, types, and expressions ( $c, b, w$ ) use the same definition  
560 as GPU IR. Visibility levels are denoted by the type  $\tau_v \in VL$  and range over the five constructors listed above.

561 provide a mechanism for programmers to customize this ordering, such customization would be  
562 unconstrained by frontend’s quasi-affine indexing restrictions since cuda\_tasks loops impose no  
563 semantic ordering, thereby enabling techniques like Morton swizzling.

564 **Shared Memory Allocations:** Device functions generated by Exo-GPU use only dynamic shared  
565 memory for SMEM allocation. The Exo-GPU compiler deduces the required SMEM allocation size  
566 and statically assigns offsets into this allocation for each SMEM-backed variable. Variables with  
567 non-overlapping lifetimes may share the same memory locations; to ensure safe aliasing, we  
568 prohibit any thread in the cluster from passing an SMEM free statement until all reads and writes  
569 to the freed variable have retired, as enforced by the Abstract Machine (Section 5). This is necessary  
570 because freed SMEM may be immediately reused by any thread in the cluster. We note that the  
571 compiler’s aliasing may be suboptimal, and the cluster-wide safety requirement may be overly  
572 conservative (e.g., for CTA-local variables). Providing programmers explicit control over SMEM  
573 lifetime and aliasing is left as future work.

## 5 ABSTRACT MACHINE

575 To guarantee sequential-parallel equivalence, Exo-GPU performs synchronization checking via  
576 Abstract Machine interpretation. This section introduces the AMIR grammar (Section 5.1), the trans-  
577 lation from GPU IR (Section 5.2), and the AMIR execution semantics (Section 5.3). We translate GPU  
578 IR into AMIR solely for validation: the frontend language remains simple, while synchronization  
579 reasoning is factored into explicit AMIR operations that are orthogonal to value computation.

### 5.1 Abstract Machine Grammar

583 Figure 4 defines AMIR grammar. It follows GPU IR for control  $c$ , booleans  $b$ , window coordi-  
584 nates  $w$ , and all variable/type definitions (Figures 1, 2). AMIR defines only windowed access  
585 expressions, written  $e_d^\# \triangleq x[w^*]$  (data access) and  $e_z^\# \triangleq z[w^*]$  (barrier access). AMIR introduces  
586 visibility levels  $\tau_v \in VL$ , which are totally ordered:  $\text{invisible} < \text{atomic\_only} < \text{unordered} <$   
587  $\text{temporally\_ordered} < \text{fully\_ordered}$ .

| GPU IR                                                                                                                                                          | Abstract Machine IR                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 589 $s_1; s_2$                                                                                                                                                  | $T_s(s_1); T_s(s_2)$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| 590 $\text{for } y \text{ in seq}(c_{\text{lo}}, c_{\text{hi}}) \text{ do } s$                                                                                  | $\text{for } y \text{ in loop}(c_{\text{lo}}, c_{\text{hi}}) \text{ do } T_s(s)$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
| 591 $\text{for } y \text{ in cuda_tasks}(c_{\text{lo}}, c_{\text{hi}}) \text{ do } s$                                                                           | $\text{for } y \text{ in loop}(c_{\text{lo}}, c_{\text{hi}}) \text{ do } (\text{IncTaskID}; T_s(s))$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| 592 $\text{for } y \text{ in cuda_threads}(c_{\text{lo}}, c_{\text{hi}}, \tau_u) \text{ do } s$                                                                 | $\text{for } y \text{ in loop}(c_{\text{lo}}, c_{\text{hi}}) \text{ do } T_s(s)$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
| 593 $x[c^*] \diamond e \text{ where } \diamond \in \{=, +\}$                                                                                                    | $\begin{aligned} &;_{e_1^{\#} \in T_{\text{ex}}(e)} (\text{CheckMutates}(\tau_v^{\text{pre}}, \text{false}, e_1^{\#}, \{T_{\text{qin}}(e_1^{\#})\})) \\ &\quad \text{RecordRead}(\tau_v^{\text{post}}, \text{false}, e_1^{\#}, T_{\text{qin}}(e_1^{\#}), \emptyset, \emptyset) \\ &;_{e_2^{\#} \in T_{\text{ex}}(x[c^*])} (\text{CheckReads}(\tau_v^{\text{pre}}, \text{false}, e_2^{\#}, \{T_{\text{qin}}(e_2^{\#})\})) \\ &\quad \text{CheckMutates}(\tau_v^{\text{pre}}, \text{false}, e_2^{\#}, \{T_{\text{qin}}(e_2^{\#})\}) \\ &\quad \text{ClearReads}(e_2^{\#}); \\ &\quad \text{ClearMutates}(e_2^{\#}); \\ &\quad \text{RecordMutate}(\tau_v^{\text{post}}, \text{false}, e_2^{\#}, T_{\text{qin}}(e_2^{\#}), \emptyset, \emptyset); \end{aligned}$ |
| 594 $, \text{ where } \tau_v^{\text{post}} = \text{fully-ordered and}$                                                                                          | $\tau_v^{\text{pre}} = \text{fully-ordered if } +=, \text{ else temporally-ordered}$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| 595 $602$                                                                                                                                                       | $603$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| 604 $\text{if } b \text{ then } s$                                                                                                                              | $\text{if } b \text{ then } T_s(s)$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| 605 $\text{with CudaDeviceFunction}(n, e_w^*) \text{ do } s$                                                                                                    | $\text{Fence}(\text{true}, \text{Qcs} \cup \{\text{cpu\_in\_order\_qual}\}, \text{Qcs}, \text{Tcs});$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| 606 $T_s(s);$                                                                                                                                                   | $\text{Fence}(\text{true}, \text{Qcs}, \text{Qcs}, \text{Tcs});$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
| 607 $\text{, where Qcs} \triangleq \text{T}_{\text{qfull}}(\text{cuda\_stream\_sync}), \text{Tcs} \triangleq \text{T}_{\text{qtmp}}(\text{cuda\_stream\_sync})$ | $\text{, where Qcs} \triangleq \text{T}_{\text{qfull}}(\text{cuda\_stream\_sync}), \text{Tcs} \triangleq \text{T}_{\text{qtmp}}(\text{cuda\_stream\_sync})$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
| 608 $\text{with CudaWarp}(n, n, r) \text{ do } s$                                                                                                               | $\text{if true then } T_s(s)$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
| 609 $611$                                                                                                                                                       | $612$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| 610 $g(c^*, e^*) \gg e_z$                                                                                                                                       | $\begin{aligned} &;_{e^{\#} \in T_{\text{ex}}(e_z)} \text{RecordRead}(\text{fully-ordered}, \text{false}, e^{\#}, \emptyset, \emptyset, T_e(e_z)) \\ &;_{i=0}^{\cdot \text{len}(e^*)} T_{\text{arg}}(g.p_i, e_i, e_z) \end{aligned}$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| 611 $613$                                                                                                                                                       | $614$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| 612 $x : \tau_x[c^*] @ \pi_d$                                                                                                                                   | $\text{Alloc}(x[c^*])$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| 613 $z : \tau_z[c^*] @ \pi_z$                                                                                                                                   | $\text{Alloc}(z[c^*])$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| 614 $616$                                                                                                                                                       | $617$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| 615 $\text{free } x$                                                                                                                                            | $\begin{aligned} &;_{e^{\#} \in T_{\text{smem}}(x[:*])} (\text{CheckReads}(\tau_v^{\text{free}}, \text{false}, e^{\#}, \{T_{\text{qin}}(e^{\#})\})) \\ &\quad \text{CheckMutates}(\tau_v^{\text{free}}, \text{false}, e^{\#}, \{T_{\text{qin}}(e^{\#})\}); \end{aligned}$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| 616 $618$                                                                                                                                                       | $619$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| 617 $\text{free } z$                                                                                                                                            | $\begin{aligned} &\text{CheckBarriers}(T_e(z[:*])); \\ &;_{e^{\#} \in T_{\text{smem}}(z[:*])} (\text{CheckReads}(\tau_v^{\text{free}}, \text{false}, e^{\#}, \{T_{\text{qin}}(e^{\#})\})) \\ &\quad \text{CheckMutates}(\tau_v^{\text{free}}, \text{false}, e^{\#}, \{T_{\text{qin}}(e^{\#})\}); \end{aligned}$                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| 618 $621$                                                                                                                                                       | $622$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| 619 $\text{Fence}(\tau_s^{\text{pre}}, \tau_s^{\text{post}})$                                                                                                   | $\text{Fence}(T_{\text{tr}}(\tau_s^{\text{pre}}), \text{T}_{\text{qfull}}(\tau_s^{\text{pre}}), \text{T}_{\text{qfull}}(\tau_s^{\text{post}}), \text{T}_{\text{qtmp}}(\tau_s^{\text{post}}))$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
| 620 $623$                                                                                                                                                       | $624$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| 621 $\text{Arrive}(\tau_s, n) \gg e_z^*$                                                                                                                        | $\begin{aligned} &;_{e^{\#} \in T_{\text{ex}}(e_z^*)} \text{RecordRead}(\text{fully-ordered}, \text{false}, e^{\#}, \text{T}_{\text{qarr}}(\tau_s), \emptyset, \emptyset) \\ &\quad \text{Arrive}(T_{\text{tr}}(\tau_s), \text{T}_{\text{qfull}}(\tau_s), T_e(e_z^*)); \end{aligned}$                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| 622 $625$                                                                                                                                                       | $626$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| 623 $\text{Await}(e_z, \tau_s, n)$                                                                                                                              | $\begin{aligned} &;_{e^{\#} \in T_{\text{ex}}(e_z)} \text{RecordRead}(\text{fully-ordered}, \text{false}, e^{\#}, T_{\text{qin}}(e^{\#}), \emptyset, \emptyset) \\ &\quad \text{Await}(T_e(e_z), \text{T}_{\text{qfull}}(\tau_s), \text{T}_{\text{qtmp}}(\tau_s), n); \end{aligned}$                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| 624 $627$                                                                                                                                                       | $628$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| 625 $629$                                                                                                                                                       | <p>Fig. 5. <math>T_s : \text{Stmt} \rightarrow \text{Stmt}^{\#}</math> (statement conversion). All <i>conversion</i> functions use the <math>T_{-}</math> prefix (Figures 18–20). <math>\tau_v^{\text{free}} = \text{temporally-ordered}</math></p>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| 626 $630$                                                                                                                                                       |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| 627 $631$                                                                                                                                                       | <p>AMIR has no hardware-instruction procedures (<math>g</math> in GPU IR) and no subprocedure calls. Procedures <math>p^{\#}</math> take control-only arguments. The statement (Figure 4) consists of: (i) structural statements—sequencing, loop, and guard (ii) recording statements that update the synchronization environment SyncEnv-Alloc, RecordRead, RecordMutate, Fence, Arrive, and Await; and (iii) checking statements that query SyncEnv and fail if the required visibility obligations are not met; otherwise they are no-ops: CheckBarrier, CheckReads, and CheckMutates.</p>                                                                                                                                                                |
| 628 $632$                                                                                                                                                       |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| 629 $633$                                                                                                                                                       |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| 630 $634$                                                                                                                                                       |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| 631 $635$                                                                                                                                                       |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| 632 $636$                                                                                                                                                       |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| 633 $637$                                                                                                                                                       |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |

## 638 5.2 Conversion from GPU IR to Abstract Machine IR

639 We describe how GPU IR expressions (Section 5.2.1), statements (Section 5.2.2), and instruction  
 640 arguments (Section 5.2.3) are converted. Before conversion, we apply two trivial preprocessing  
 641 steps to GPU IR: (i) Inline all window statements, and (ii) Inline all subprocedure calls that occur  
 642 outside any CudaDeviceFunction block. Throughout this section (and in figures) we use the  
 643 following notation. For a finite, ordered variables  $R = \langle r_0, \dots, r_{k-1} \rangle$  and a statement template  
 644  $S(\cdot), ;_{i=0}^k S(x) \triangleq S(r_0) ; S(r_1) ; \dots ; S(r_{k-1})$ . If  $R = \emptyset$ , the result is no-op. The intent is  
 645 purely left-to-right sequential composition at AMIR conversion time. When unambiguous, we write  
 646  $f(A) = \{f(a) \mid a \in A\}$  for the pointwise image of a set.  
 647

648 5.2.1 *Expression Conversion.* Figure 17 defines  $T_e : \text{DExpr} \cup \text{ZExpr} \rightarrow \mathcal{P}(\text{Expr}^\#)$ , which extracts  
 649 the windowed AMIR accesses  $e^\# \in \{x[w^*], z[w^*]\}$ : scalars  $d$  contribute nothing ( $\emptyset$ ); a bare name  $x$   
 650 or  $z$  yields the degenerate accesses  $x[\langle \rangle]$  or  $z[\langle \rangle]$ ; explicit indexings  $x[w^*]$  and  $z[w^*]$  are preserved;  
 651 and for arithmetic  $e_1 \text{ op } e_2$  the sets union,  $T_e(e_1) \cup T_e(e_2)$ .  $T_e$  is set-valued (order and duplicates are  
 652 irrelevant), leaves the value expression unchanged, and merely supplies the converted  $\text{Expr}^\#$  used  
 653 by RecordRead/RecordMutate and subsequent visibility checks.

654 5.2.2 *Statement Conversion.* Figure 5 gives a structural translation  $T_s : \text{Stmt} \rightarrow \text{Stmt}^\#$  from GPU  
 655 IR to AMIR. Helper functions referenced by Figure 5 are defined in Figure 18. Sequencing and  
 656 conditionals are preserved, while all loops are normalized to the abstract loop form. Warp/block  
 657 scopes are dropped and replaced with a guard whose condition is `true`. Assignments  $x[c^*] \diamond e$   
 658 ( $\diamond \in \{=, +=\}$ ) are lowered to explicit read/write effects. We expand source and destination with  $T_e$   
 659 and remove sync-exempt expressions via  $T_{\text{ex}}$ . *Sync-exempt memories* are those whose accesses can  
 660 be assumed safe without checking: CudaGridConstant, CudaClusterSync, and CudaCommitGroup.  
 661 *Shared memories* are the following: CudaBasicSmem, CudaSmemAtomicity16B, CudaSmemLinear,  
 662 Sm90\_SmemSwizzled(B), or CudaMbarrier.

663 Hardware instruction calls ignore control arguments and pair the instruction's parameter definition  
 664 and the arguments via  $T_{\text{arg}}$ . The entry and exit of CudaDeviceFunction is bracketed with fences  
 665 with `cuda_stream_sync`, capturing both the CPU-to-GPU fence implied by the kernel launch, and  
 666 the serialization of the CUDA kernel with respect to prior and subsequent CUDA kernels and  
 667 API calls (Note that Exo-GPU currently uses only one CUDA stream). During the conversion, we  
 668 record the scope of each statement using a global environment  $\text{Scope} : \text{Stmt}^\# \rightarrow \{\text{CPU}, \text{CUDA}\}$ .  
 669 Statements outside the CudaDeviceFunction block are CPU-scoped, while those inside are GPU-  
 670 scoped. The scope mapping is used in Definition 5.2. Fence, Arrive, and Await translate a sync  
 671 timeline  $(\tau_s)$  into a qualitative timeline set  $(\tau_q^*)$ .  $T_{\text{tr}}$  queries whether a sync timeline is transitive;  
 672  $T_{\text{qfull}}$  and  $T_{\text{qtmp}}$  materialize the full and temporal QualTL sets, respectively. Arrive selects its arrival  
 673 qualitative timeline via  $T_{\text{qarr}}$  to sync-check the barrier itself, thereby preventing use-after-free bugs.  
 674

675 5.2.3 *Instruction Argument Conversion.* Figure 19 expands the conversion  $T_{\text{arg}}(g.p_i, e_i, e_z)$  from a  
 676 GPU IR instruction call argument to the AMIR statements. The auxiliary projections and predicates  
 677 referenced in the figure— $T_{\text{ext}}$ ,  $T_{\text{atom}}$ ,  $T_{\text{init}}$ ,  $T_{\text{ooo}}$ , and  $T_{\text{cvg}}$ —are defined in Figure 20. For each non-sync  
 678 exempt expression  $e^\# \in T_{\text{ex}}(e_i)$ , the conversion emits a sequence of checks and effect-recording  
 679 operations which depends on two properties of the parameter  $g.p_i$ : its *access mode* (read-only vs.  
 680 write-only/read-write) and whether it is *atomic*. The access mode of  $g.p_i$  is determined by a simple  
 681 intraprocedural static analysis over instruction body (behavior)  $g.s$ : if  $g.p_i$  is only read (and never  
 682 written or reduced), it is classified as read-only; if it is written but never read or reduced, it is  
 683 write-only; otherwise it is read-write. The atomic case is selected exactly when  $T_{\text{atom}}(g.p_i) \neq \emptyset$ .

684 5.2.4 *Program Conversion.* The GPU IR's program  $\text{proc } y^*, (x : \tau_x)^* ; \text{do } s$  is translated into an  
 685 AMIR program  $\text{proc } y^* ; \text{do } s^\#$  where only control arguments are retained. Non-control arguments  
 686

$$\begin{array}{c}
687 \quad \frac{}{\langle n, \sigma \rangle \Downarrow_{\mathbb{Z}} n} \text{C-INT} \quad \frac{\sigma(y) = n}{\langle y, \sigma \rangle \Downarrow_{\mathbb{Z}} n} \text{C-READ} \quad \frac{\langle c_1, \sigma \rangle \Downarrow_{\mathbb{Z}} n_1 \quad \langle c_2, \sigma \rangle \Downarrow_{\mathbb{Z}} n_2 \quad \text{op} \in \{+, -, *, /\}}{\langle c_1 \text{ op } c_2, \sigma \rangle \Downarrow_{\mathbb{Z}} (n_1 \text{ op } n_2)} \text{C-AFF} \\
688 \\
689 \\
690 \quad \frac{}{\langle t, \sigma \rangle \Downarrow_{\text{Bool}} t} \text{B-CONST} \quad \frac{\langle b_1, \sigma \rangle \Downarrow_{\text{Bool}} t_1 \quad \langle b_2, \sigma \rangle \Downarrow_{\text{Bool}} t_2 \quad \text{op} \in \{\text{and}, \text{or}\}}{\langle b_1 \text{ op } b_2, \sigma \rangle \Downarrow_{\text{Bool}} \overline{\text{op}}(t_1, t_2)} \text{B-LOG} \\
691 \\
692 \\
693 \quad \frac{\langle c_1, \sigma \rangle \Downarrow_{\mathbb{Z}} n_1 \quad \langle c_2, \sigma \rangle \Downarrow_{\mathbb{Z}} n_2 \quad \text{op} \in \{=, <, \leq\}}{\langle c_1 \text{ op } c_2, \sigma \rangle \Downarrow_{\text{Bool}} (n_1 \text{ op } n_2)} \text{B-REL} \quad \frac{\langle c, \sigma \rangle \Downarrow_{\mathbb{Z}} i}{\langle c, \sigma \rangle \Downarrow_I \{i\}} \text{W-POINT} \\
694 \\
695 \\
696 \quad \frac{\langle c_1, \sigma \rangle \Downarrow_{\mathbb{Z}} i_1 \quad \langle c_2, \sigma \rangle \Downarrow_{\mathbb{Z}} i_2}{\langle c_1..c_2, \sigma \rangle \Downarrow_I \{i \in \mathbb{Z} \mid i_1 \leq i \leq i_2\}} \text{W-RANGE} \quad \frac{\forall j \in \{1..n\}. \langle w_j, \sigma \rangle \Downarrow_I W_j}{\langle w_1, \dots, w_n, \sigma \rangle \Downarrow_{I^n} W_1 \times \dots \times W_n} \text{W-TUPLE} \\
697 \\
698 \\
699 \quad \frac{a \in \mathbb{X} \cup \mathbb{B} \quad \langle w_1, \dots, w_n, \sigma \rangle \Downarrow_{I^n} \mathbf{W}}{\langle a[w_1, \dots, w_n], \sigma \rangle \Downarrow_{\text{Acc}} \{a \mapsto \vec{i} \mid \vec{i} \in \mathbf{W}\}} \text{E#-Acc} \\
700 \\
701 \\
702 \quad \text{Fig. 6. The big-step operational semantics for expressions.} \\
703
\end{array}$$

$(x : \tau_x)^*$  are ignored during the conversion. Control variables are preserved, since GPU IR and AMIR share the same control variables set  $\mathbb{Y}$ .  $T_s$  is applied to the program body to convert  $s$  to  $s^*$ .

### 5.3 Abstract Machine Semantics

**Definition 5.1** (Control–Value Environment). Let  $\mathbb{Z}$  be the set of control value and  $\mathbb{Y}$  the set of control variables.  $\sigma \in \Sigma \triangleq \mathbb{Y} \rightarrow \mathbb{Z}$ .

**Definition 5.2** (Global Threads Collective). Let  $\mathbb{I}$  be the set of task IDs and  $\mathbb{T}$  the set of all (local) thread IDs (thread collective). The set of global thread IDs is  $\mathbb{G} \triangleq \mathbb{I} \times \mathbb{T}$ . For each statement  $s^*$ , thread mapping  $f_{s^*} : \Sigma \rightarrow \mathbb{T}$  returns the set of local threads that execute  $s^*$  under  $\sigma$  ( $f_{s^*}$  defined in Section 4). The scope mapping scope : Stmt $^* \rightarrow \{\text{CPU}, \text{GPU}\}$  is initialized during the statement conversion to indicate whether each statement belongs to the CPU or GPU scope. Given  $\sigma$  and  $i \in \mathbb{I}$ , define the corresponding global threads

$$g_{s^*}(\sigma, i) \triangleq \begin{cases} \{(i, t) \mid t \in f_{s^*}(\sigma)\} & \text{if scope}(s) = \text{GPU} \\ \mathbb{G} & \text{otherwise} \end{cases}$$

**Definition 5.3** (Synchronization Environment). Let  $\mathbb{B}$  be the set of barrier names and  $\mathbb{X}$  the set of data variables. Define

$$\mathbb{P} : \mathbb{B} \rightarrow \mathbb{Z}^n \rightarrow \mathcal{P}(\mathbb{N}) \quad \text{and} \quad \mathbb{S} : \mathbb{G} \rightarrow \text{QualTL} \rightarrow \text{VL},$$

where  $\mathbb{P}(b, c^*)$  is the set of observed arrive counts at barrier access  $b[c^*]$ , and  $\mathbb{S}(g, \tau_q)$  is the visibility level for the thread with global ID  $g \in \mathbb{G}$  on a qualitative timeline  $\tau_q$  (Figure 4). Let

$$\mathbb{R} \triangleq \text{QualTL} \times \mathbb{S} \times \mathbb{P}$$

be the set of *visibility records*, written  $(q, f, p)$ . The synchronization environment is the map

$$\rho : (\mathbb{X} \cup \mathbb{B}) \rightarrow \mathbb{Z}^n \rightarrow \mathcal{P}(\mathbb{R}) \times \mathcal{P}(\mathbb{R}) \times \mathbb{N}^2,$$

sending a name and index to a tuple of (read visibility record (r):  $\mathbb{R}$ , mutate visibility record (m):  $\mathbb{R}$ , barrier count (b):  $n^2$ ).

The semantics follows a standard big-step style where judgments take the form  $\langle e^*, \sigma \rangle \Downarrow v$  for expressions and  $\langle s^*, i, \sigma, \rho \rangle \Downarrow i', \sigma', \rho'$  for statements, indicating that expression  $e^*$  evaluates to value  $v$  in state  $\sigma$ , and statement  $s^*$  transforms state  $\sigma$  to  $\sigma'$ ,  $\rho$  to  $\rho'$ , and  $i$  to  $i'$ , respectively.

|     |                                                                                                                                                                          |                                                                                                                                                                           |
|-----|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 736 |                                                                                                                                                                          |                                                                                                                                                                           |
| 737 | $\exists (\_, f, \_) \in \rho(W_{e^*}).r ((t = \text{false} \wedge \exists g \in g_{s^*}(\sigma, \iota), \forall \tau_q \in \tau_q^* \mid f(g, \tau_q) \not\leq \tau_v)$ |                                                                                                                                                                           |
| 738 | $\langle e^#, \sigma \rangle \Downarrow_{\text{Acc}} W_{e^*}$                                                                                                            | $\vee (t = \text{true} \wedge \forall g \in g_{s^*}(\sigma, \iota), \forall \tau_q \in \tau_q^* \mid f(g, \tau_q) \not\leq \tau_v)) \Downarrow_{\text{Bool}} \text{true}$ |
| 739 |                                                                                                                                                                          | $\text{CHECKREADS}$                                                                                                                                                       |
| 740 |                                                                                                                                                                          | $\langle \text{CheckReads}(\tau_v, t, e^#, \tau_q^*), \iota, \sigma, \rho \rangle \Downarrow \text{error}$                                                                |
| 741 | $\exists (\_, f, \_) \in \rho(W_{e^*}).m ((t = \text{false} \wedge \exists g \in g_{s^*}(\sigma, \iota), \forall \tau_q \in \tau_q^* \mid f(g, \tau_q) \not\leq \tau_v)$ |                                                                                                                                                                           |
| 742 | $\langle e^#, \sigma \rangle \Downarrow_{\text{Acc}} W_{e^*}$                                                                                                            | $\vee (t = \text{true} \wedge \forall g \in g_{s^*}(\sigma, \iota), \forall \tau_q \in \tau_q^* \mid f(g, \tau_q) \not\leq \tau_v)) \Downarrow_{\text{Bool}} \text{true}$ |
| 743 |                                                                                                                                                                          | $\text{CHECKMUTATES}$                                                                                                                                                     |
| 744 |                                                                                                                                                                          | $\langle \text{CheckMutates}(\tau_v, t, e^#, \tau_q^*), \iota, \sigma, \rho \rangle \Downarrow \text{error}$                                                              |
| 745 | $\langle e_z^#, \sigma \rangle \Downarrow_{\text{Acc}} W_{e_z^#}$                                                                                                        | $(\exists (a, w) \in \rho(W_{e_z^*}).b \mid a \neq w) \Downarrow_{\text{Bool}} \text{true}$                                                                               |
| 746 |                                                                                                                                                                          | $\text{CHECKBARRIER}$                                                                                                                                                     |
| 747 |                                                                                                                                                                          | $\langle \text{CheckBarrier}(e_z^#), \iota, \sigma, \rho \rangle \Downarrow \text{error}$                                                                                 |

Fig. 7. Big-step rules for access visibility and barrier consistency. Successful checks leave  $(\iota, \sigma, \rho)$  unchanged; any violation reduces the configuration to error state.

5.3.1 *Expression semantics.* Figure 6 gives big-step judgements  $\langle e, \sigma \rangle \Downarrow_{\tau} v$ . Integer and boolean literals evaluate to themselves (C-INT, B-CONST); variables read from the store (C-READ). Arithmetic evaluates both operands to integers and then applies  $\text{op} \in \{+, -, *, /\}$  (C-AFF). Boolean connectives apply the usual truth functions for and/or (B-LOG), and comparisons  $\text{op} \in \{==, <, \leq\}$  compare integer results to yield booleans (B-REL). Windows denote index sets  $\mathcal{I}$ : a point yields  $\{i\}$  (W-POINT); a range  $c_1..c_2$  yields the inclusive set  $\{i \in \mathbb{Z} \mid i_1 \leq i \leq i_2\}$  (W-RANGE); and an  $n$ -tuple of windows denotes the Cartesian product  $W_1 \times \dots \times W_n$  (W-TUPLE). The only nonstandard part is E#-Acc: given  $a \in \mathbb{X} \cup \mathbb{B}$  and a window tuple  $\mathbf{W}$ , it *unpacks* the window into the set of concrete accesses  $\{\{a \mapsto \vec{i} \mid \vec{i} \in \mathbf{W}\} \mid \vec{i} \in \mathbf{W}\}$  of type  $\mathbb{X} \cup \mathbb{B} \rightarrow \mathbb{Z}^*$ .

5.3.2 *Statement semantics.* Figure 21 gives big-step operational semantics for a loop, sequencing, an explicit task-ID increment, and a guard. For  $\text{for } y \text{ in loop}(c_{\text{lo}}, c_{\text{hi}}) \text{ do } s^*$ , the bounds evaluate in  $\sigma$  to integers  $m$  and  $n$ ; the loop ranges over the *half-open* interval  $[m, n]$ : when  $m < n$  (LOOPSTEP) the body executes once with  $y \mapsto m$ , producing  $(\iota', \sigma', \rho')$ , and the loop recurs on  $(m+1) \dots n$ ; when  $m \geq n$  (LOPDONE) it terminates with no effect. Sequencing (SEQ) threads the triple  $(\iota, \sigma, \rho)$  from the first statement into the second. The guard (IF-TRUE/IF-FALSE) evaluates  $b$  in  $\sigma$  and either executes  $s^*$  or behaves as no-op. The only non-standard rule, INCTASKID, increments the task identifier ( $\iota' = \iota + 1$ ) while leaving  $\sigma$  and  $\rho$  unchanged.

Figure 22 shows the rest of statement semantics. Alloc evaluates each extent  $c_i$  to an integer  $n_i$  (under  $\Downarrow_{\mathbb{Z}}$ ), forms the index set  $I = \{(v_1, \dots, v_k) \mid 0 \leq v_i < n_i\}$ , and initializes the per-access visibility record for  $x$  by updating  $\rho$  so that, for every  $\mathbf{n} \in I$ ,  $\rho(x, \mathbf{n}) = (\emptyset, \emptyset, (0, 0))$ ; the thread id  $\iota$  and store  $\sigma$  are unchanged. ClearReads first resolves the accessed locations via  $\langle e^#, \sigma \rangle \Downarrow_{\text{Acc}} W_{e^*}$  and, for each  $(x, \mathbf{n}) \in W_{e^*}$ , replaces  $\rho(x, \mathbf{n}) = (r, m, b) \in \mathbb{R}$  with  $(\emptyset, m, b)$ . ClearMutates is analogous but replaces it with  $(r, \emptyset, b)$ . RECORDREAD and RECORDMUTATE first evaluate the accessor  $e^*$  and  $e_z^*$  to access sets  $W_{e^*}$  and  $W_{e_z^*}$  (via  $\Downarrow_{\text{Acc}}$ ). They then update the visibility map  $\rho$  at each key  $(x, \mathbf{n}) \in W_{e^*}$  using the new record constructor  $\mathcal{R}$  defined in Appendix B.

Fence, Arrive, and Await update  $\rho$  solely via  $\text{lift}(\rho, \lambda)$  (defined in Appendix B), which applies  $\lambda$  pointwise to both the read and mutate visibility sets. In the following,  $\mathcal{W}$  decides when augmentation is permitted,  $\mathcal{A}$  raises visibility to *fully\_ordered* or *temporally\_ordered*. For  $\text{Fence}(t, \tau_q^{\text{pre}*}, \tau_q^{\text{full}*}, \tau_q^{\text{temp}*})$ ,  $\lambda(r) = \mathcal{A}(r, g^{\text{exec}*}, \tau_q^{\text{full}*}, \tau_q^{\text{temp}*})$  iff the witness  $\mathcal{W}(t, g^{\text{exec}*}, \tau_q^{\text{pre}*}, r)$  holds, otherwise  $\lambda(r) = r$ ; thus visibility is raised by joining  $f \vee f_{\text{aug}}$ .  $\text{Arrive}(t, \tau_q^{\text{pre}*}, e_z^{**})$  and  $\text{Await}(e_z^*, \tau_q^{\text{full}*}, \tau_q^{\text{temp}*}, n)$  are defined similarly in Appendix B, updating the access visibility appropriately.



Fig. 8. Performance by varying problem size on a sm\_90a GPU.

5.3.3 *Checks.* Figure 7 gives three pure checks that either signal error or leave  $(\iota, \sigma, \rho)$  unchanged. CHECKREADS evaluates  $e^\#$  to access  $W_{e^\#}$  and raises error when some  $(\_, f, \_) \in \rho(W_{e^\#}).r$  fails to supply visibility at least  $\tau_v$  for all timelines  $\tau_q \in \tau_q^*$ . CHECKMUTATES is analogous but ranges over mutate visibility record  $\rho(W_{e^\#}).m$ . CHECKBARRIER evaluates  $e_z^\#$  and signals error iff some  $(a, w) \in \rho(W_{e_z^\#}).b$  has  $a \neq w$ , i.e., the recorded *arrive* and *await* counts disagree.

5.3.4 *Program semantics.* Finally, abstract machine executes the program proc  $y^*$  ; do  $s^\#$ . The synchronization check is exposed to the user, who invokes the abstract machine interpreter with concrete integer values corresponding to the control variables  $y^*$ . The control-value environment  $\sigma : \mathbb{Y} \rightarrow \mathbb{Z}$  is initialized based on the user's input. The statement  $s^\#$  is then evaluated according to the semantics defined previously. Note that the task id  $\iota$  is initialized to 0, and the synchronization  $\rho$  is initially empty.

## 6 PERFORMANCE RESULTS

We implemented tf32-precision GEMM kernels for the sm\_90a architecture, using wmma and TMA instructions. We also implemented a full-precision fp32 GEMV kernel using warp shuffles and scalar math [12]; this did not use the accelerator instructions. We compared the Exo-GPU kernels to cuBLAS's cublasSgemm, cublasSgemmStridedBatched, and cublasSgemv functions, all with the

834 math mode set to CUBLAS\_TF32\_TENSOR\_OP\_MATH. Our sm\_90a-compatible test hardware contained  
 835 an NVIDIA H100 80GB SXM5 GPU and an AMD EPYC 9454 48-Core Processor, and we used nvcc  
 836 Build cuda\_12.3.r12.3/compiler.33492891\_0 to compile the CUDA C++ sources generated  
 837 by Exo-GPU. For GEMM, a compute-bound workload, theoretical peak is 494.5 TFLOPS [17]. For  
 838 GEMV, a memory-bound workload, we derive the theoretical peak from the memory bandwidth:  
 839  $3.35 \frac{\text{TB}}{\text{s}} \frac{2 \text{ FLOP}}{4 \text{ B}} = 1.675 \frac{\text{TFLOP}}{\text{s}}$ , with one multiply and one add done for each 4 byte tf32 element loaded  
 840 from the matrix (the bandwidth from loading the vector is amortized).

841 We tested each problem size and hardware combination over 105 iterations: 5 warm-up iterations,  
 842 and 100 timed iterations, with the pcg3d hash algorithm [9] used to generate “random” input  
 843 matrices. Each iteration tested, in a randomized order, the cubLAS kernel and all applicable Exo-  
 844 GPU kernel variants for that problem size. In particular, we tested non-split-k and split-k variants of  
 845 the GEMM kernel, with the split-k kernels using cp.reduce.async.bulk (TMA) and split factors  
 846 of 1, 2, 4, 8, and 16. For each (problem size, hardware) combination, we collected 100 time samples  
 847 per tested kernel and reported the mean of the inter-quartile range (middle 50 samples). For each  
 848 plot in Figure 8, we report the best-performing Exo-GPU kernel for each problem size (plotted as  
 849 “Exo-GPU (specialized)”) and the single best-performing Exo-GPU kernel (“Exo-GPU (one kernel)”).

850 For large, square matrix GEMM problem sizes, we see very similar performance between Exo-  
 851 GPU and CUBLAS; this is expected, as Exo-GPU is able to express all optimizations required for a  
 852 peak-performance sm\_90a gemm: warp specialization, cluster & TMA multicast support, wmma  
 853 support, and split-barrier synchronization using commit group, mbarrier, and cluster sync mecha-  
 854 nisms. For more unusual GEMM problem sizes, such as small matrices and K=65536 workloads,  
 855 we see more variation between Exo-GPU and CUBLAS, possibly due to different problem size  
 856 thresholds for observing wave quantization effects. For GEMV, Exo-GPU and CUBLAS follow a  
 857 similar performance trend, with CUBLAS having an almost-consistent performance advantage.

858 **Abstract Machine performance.** We re-  
 859 port the performance of the synchronization  
 860 checking alone, which is implemented as an  
 861 interpreter for the abstract machine (Section 5).  
 862 We benchmarked using only one thread of a laptop  
 863 with an “11th Gen Intel(R) Core(TM)  
 864 i7-11850H @ 2.50GHz” CPU. In Figure 9, we  
 865 report times for concrete square matrix prob-  
 866 lem sizes for the GEMV kernel and the sm\_90a  
 867 GEMM kernel described earlier (both the split-k  
 868 and non-split-k versions). The batch size L is 1.  
 869 We use one untimed warmup run for each prob-  
 870 lem size, and report the average of five timed  
 871 runs, as reported by the timeit module of Python 3.10.12.

872 We observe that runtimes are linear in the number of memory operations interpreted. The  
 873 number of memory operations is cubic in the problem size for GEMM, and quadratic for GEMV.  
 874 We see this pattern followed cleanly by the measured trend (e.g. 2/16/128 seconds for GEMM for  
 875 1024/2048/4096 problem size), indicating no unexpected overhead for large problem sizes. The  
 876 bump for small GEMM problem sizes is due to imperfect division by the per-cluster tile size. Even if  
 877 the expected problem size for a deployed kernel is large, during development of an Exo-GPU kernel,  
 878 it generally suffices to test only with a small problem size (e.g. 768) for immediate feedback on  
 879 bugs, with the option of testing the finalized kernel with the true, larger problem size for additional  
 880 confidence.



Fig. 9. Sync-check runtime by problem size.

## 883 7 RELATED WORK

### 884 7.1 GPU Kernel Authoring Languages

885 While CUDA offers fine-grained control, it remains low-level and error-prone. To address this,  
 886 various approaches have emerged at different abstraction levels.  
 887

888 *Higher-level abstractions:* Tile-level DSLs like Triton [24] and Pallas [1] abstract away low-  
 889 level details by enabling programmers to express computations at the block level, with compilers  
 890 automatically handling the thread-level mapping. However, these abstractions can hide performance-  
 891 critical details. When Triton performance fell far behind state-of-the-art for Blackwell GPUs, they  
 892 had to invent Gluon, a lower-level language that gives users direct control over tile layouts, memory  
 893 allocation, data movement, and asynchrony, essentially abandoning the higher-level automation  
 894 to regain the fine-grained control necessary for peak performance. Rather than starting with  
 895 automation that may later prove inadequate, Exo-GPU deliberately maintains a thin abstraction  
 896 layer over CUDA. Performance-critical constructs, including control flow and data copies across  
 897 the memory hierarchy, are made explicit statements in Exo-GPU’s imperative object code.

898 *Template-based libraries:* CUTLASS [15] and ThunderKittens [22] are C++ template-based li-  
 899 braries that raise the abstraction of GPU programming while preserving performance. Instead of  
 900 hiding CUDA behind rigid abstractions, they provide composable, architecture-aware primitives  
 901 that integrate with raw CUDA code. This approach lets programmers work at the tile level when  
 902 convenient, yet seamlessly drop to thread-level operations when fine control is needed. CuTe [14],  
 903 introduced in CUTLASS 3.0, offers flexible abstraction for defining tensor layouts and transforma-  
 904 tions. While these libraries are effective engineering toolkits, they do not provide safety guarantees  
 905 or whole-program analyses for asynchronous instructions and synchronization statements.

906 Achieving both peak performance and safety remains a fundamental challenge. Existing ap-  
 907 proaches either provide limited safety guarantees or lack support for modern GPU features. For  
 908 instance, Descend [11] introduces Rust-like borrow checking to ensure race freedom but does not  
 909 support key GPU features such as tensor cores and split barriers. The aforementioned tile- and  
 910 template-based languages are practical tools, yet do not offer safety guarantees. This motivated the  
 911 collective analysis (Section 4) and synchronization checking via abstract machine (Section 5).

### 912 7.2 User-schedulable Languages

913 User-schedulable languages (USLs) describe optimization as rewriting programs into new pro-  
 914 grams that compute the same result, but execute much faster. Rather than relying on opaque  
 915 compiler heuristics, programmers directly control these rewrites by specifying scheduling oper-  
 916 ations. Halide [19, 20] popularized the idea of USLs, and subsequent Halide-inspired languages  
 917 explored different target applications and abstraction for a given domain [2, 3, 5–7, 10, 23, 25, 26, 28].  
 918 For instance, TVM offers accelerator abstraction through `tensorize`, TACO separates tensor defi-  
 919 nitions from sparse formats, and Taichi abstracts memory layout and management with `fields`.  
 920

921 Several systems have extended user-scheduling to support modern GPU features. TVM’s Ten-  
 922 sorIR [4] provides intrinsics for manipulating low-level constructs like split barriers, while Cy-  
 923 press [27] lets programmers describe computations as tasks and map them to different resources.  
 924 Exo-GPU differs from these approaches in two key ways. First, Exo-GPU exposes low-level control  
 925 to the programmer; there is little hidden control flow in GPU IR. In this sense, Exo-GPU follows a  
 926 WYSIWYG (what you see is what you get) philosophy: the structure of the program directly reflects  
 927 its execution. This contrasts with Cypress, which hides warp specialization as an internal com-  
 928 piler optimization, and with TensorIR, which focuses on automatically generating schedules with  
 929 lowering-based intrinsics. Second, Exo-GPU guarantees sequential-parallel equivalence through a  
 930 synchronization checker that detects data races, hazards, and other concurrency bugs.

932 

### 7.3 Exo vs. Exo-GPU

933 Exo-GPU fundamentally extends existing the Exo language by introducing parallel semantics. While  
 934 Exo [7, 8] is a high-performance USL that achieves peak performance on many vector machines  
 935 and accelerators, it was designed to exploit only the *implicit parallelism* exposed by hardware.  
 936 Exo’s scheduling reasoning operates exclusively under a sequential execution model. This approach  
 937 suffices for vector machines and matrix accelerators because CPU frontends maintain instruction  
 938 ordering; even with out-of-order execution and reorder buffers, the hardware guarantees that  
 939 parallel execution produces results equivalent to sequential execution. Therefore, the compiler  
 940 needs only to verify program equivalence when interpreting object code sequentially. Even within  
 941 this sequential framework, Exo’s optimizations could exploit implicit hardware parallelism through  
 942 techniques like software pipelining [7, 8].

943 In contrast, GPUs expose *explicit data- and instruction-parallelism* to software. Unlike traditional  
 944 CPUs and accelerators that maintain the illusion of sequential execution through their frontend,  
 945 modern GPUs shift this responsibility entirely to software. For Exo, this shift requires reasoning  
 946 about parallel execution at the object code level and establishing equivalence not just within  
 947 programming models, but across sequential and parallel paradigms. We address this fundamental  
 948 challenge through two key contributions. First, we extend Exo’s object code and scheduling language  
 949 to expose user-controlled parallel loops and synchronization features. Second, we provide safety  
 950 guarantees by establishing sequential-parallel equivalence through the abstract machine. This  
 951 completes the equivalence chain from the simple, initial Exo program through to its optimized,  
 952 parallel implementation in Exo-GPU, ensuring correctness at every transformation step.  
 953

954  
955 

## 8 LIMITATIONS AND FUTURE WORK

956 To optimize Hopper GEMMs (Figure 8), we had to apply two  
 957 rewrite steps that were out-of-scope of Exo’s sequential safety  
 958 checks. These patterns arise when modeling TMA’s asynchronous  
 959 copy instruction, which supports default behavior for out-  
 960 of-bounds reads and writes. Inset right figure shows a minimal  
 961 example that illustrates the problem. Even though removing the  
 962 first guard is safe since the `tmp[6]` and `tmp[7]` are not being read by C in the second loop thanks to  
 963 the guard, Exo’s dependency-based checks treat these writes as semantically relevant and cannot  
 964 prove they are observationally redundant. We marked these transformations as *unsafe* in our sched-  
 965 ules, despite being semantically sound. To address this issue, we need to extend Exo’s sequential  
 966 scheduling rewrites with value-sensitivity.  
 967

968 Our ability to reason about parallelism is incomplete in several ways. Our programming model  
 969 and output CUDA synchronization code is generated based on our understanding of PTX as  
 970 documented by NVIDIA in natural language [18]. We do not show that our generated code is  
 971 correct with respect to any formal CUDA semantics. We currently do not support Blackwell  
 972 Tensor Cores (tcgen05), nor do we support platforms not providing CUDA C++ and inline PTX  
 973 support. Finally, our current synchronization checking approach – translation to abstract machine  
 974 IR followed by software interpretation – supports only concrete problem sizes and limits compiler  
 975 performance. This approach may be improved with static analysis of the abstract machine program.  
 976

977 Nevertheless, we believe that our approach—exposing detailed control to programmers while  
 978 verifying its safety—offers a promising path forward for GPU programming. By treating parallelism  
 979 and synchronization as analyzable annotations on sequential code, Exo-GPU provides both the  
 transparency and static guarantees that high-performance GPU kernels demand.

```
tmp : f32[8] @ tmp_mem
for i in seq(0, 8):
    if i < 6: # removing is safe
        tmp[i] = 0
    for i in seq(0, 8):
        if i < 6:
            C[i] = tmp[i]
```

## 981 REFERENCES

- [1] The JAX authors. 2024. Pallas: a JAX kernel language. <https://docs.jax.dev/en/latest/pallas/index.html>
- [2] Michael Bauer, Sean Treichler, Elliott Slaughter, and Alex Aiken. 2012. Legion: expressing locality and independence with logical regions. In *SC Conference on High Performance Computing Networking, Storage and Analysis, SC '12* (Salt Lake City, UT, USA). IEEE, Piscataway, NJ, USA, 66. <https://doi.org/10.1109/SC.2012.71>
- [3] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An Automated End-to-end Optimizing Compiler for Deep Learning. In *Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation* (Carlsbad, CA, USA) (OSDI'18). USENIX Association, Berkeley, CA, USA, 579–594. <http://dl.acm.org/citation.cfm?id=3291168.3291211>
- [4] Siyuan Feng, Bohan Hou, Hongyi Jin, Wuwei Lin, Junru Shao, Ruihang Lai, Zihao Ye, Lianmin Zheng, Cody Hao Yu, Yong Yu, and Tianqi Chen. 2023. TensorIR: An Abstraction for Automatic Tensorized Program Optimization. In *Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2* (Vancouver, BC, Canada) (ASPLOS 2023). Association for Computing Machinery, New York, NY, USA, 804–817. <https://doi.org/10.1145/3575693.3576933>
- [5] Bastian Hagedorn, Archibald Samuel Elliott, Henrik Barthels, Rastislav Bodík, and Vinod Grover. 2020. Fireiron: A Data-Movement-Aware Scheduling Language for GPUs. In *PACT '20: International Conference on Parallel Architectures and Compilation Techniques, Virtual Event, GA, USA, October 3–7, 2020*. Vivek Sarkar and Hyesoon Kim (Eds.). ACM, 71–82. <https://doi.org/10.1145/3410463.3414632>
- [6] Yuanning Hu, Tzu-Mao Li, Luke Anderson, Jonathan Ragan-Kelley, and Frédo Durand. 2019. Taichi: a language for high-performance computation on spatially sparse data structures. *ACM Trans. Graph.* 38, 6 (2019), 201:1–201:16. <https://doi.org/10.1145/3355089.3356506>
- [7] Yuka Ikarashi, Gilbert Louis Bernstein, Alex Reinking, Hasan Genc, and Jonathan Ragan-Kelley. 2022. Exocompilation for Productive Programming of Hardware Accelerators. In *Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation* (San Diego, CA, USA) (PLDI 2022). Association for Computing Machinery, New York, NY, USA, 703–718. <https://doi.org/10.1145/3519939.3523446>
- [8] Yuka Ikarashi, Kevin Qian, Samir Droubi, Alex Reinking, Gilbert Louis Bernstein, and Jonathan Ragan-Kelley. 2025. Exo 2: Growing a Scheduling Language. In *Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1* (Rotterdam, Netherlands) (ASPLOS '25). Association for Computing Machinery, New York, NY, USA, 426–444. <https://doi.org/10.1145/3669940.3707218>
- [9] Mark Jarzynski and Marc Olano. 2020. A PCC3D Family of Hash Functions. <https://github.com/markjarzynski/PCC3D>
- [10] Fredrik Kjolstad, Shoaib Kamil, Stephen Chou, David Lugato, and Saman Amarasinghe. 2017. The tensor algebra compiler. *Proceedings of the ACM on Programming Languages* 1, OOPSLA (oct 2017), 1–29. <https://doi.org/10.1145/3133901>
- [11] Bastian Köpcke, Sergei Gorlatch, and Michel Steuwer. 2024. Descend: A Safe GPU Systems Programming Language. *Proc. ACM Program. Lang.* 8, PLDI, Article 181 (June 2024), 24 pages. <https://doi.org/10.1145/3656411>
- [12] Bruce Lee. 2024. CUDA HGEMV. [https://github.com/Bruce-Lee-LY/cuda\\_hgemv](https://github.com/Bruce-Lee-LY/cuda_hgemv)
- [13] NVIDIA. 2017. NVIDIA Tesla V100 GPU Architecture. <https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf>
- [14] NVIDIA. 2025. CuTe. <https://docs.nvidia.com/cutlass/media/docs/cpp/cute>
- [15] NVIDIA. 2025. CUTLASS 4.2.1. <https://github.com/NVIDIA/cutlass>
- [16] NVIDIA. 2025. CUTLASS grouped kernel schedules. [https://docs.nvidia.com/cutlass/media/docs/cpp/grouped\\_scheduler.html](https://docs.nvidia.com/cutlass/media/docs/cpp/grouped_scheduler.html)
- [17] NVIDIA. 2025. H100 GPU. <https://www.nvidia.com/en-us/data-center/h100/>
- [18] NVIDIA. 2025. Parallel Thread Execution ISA Version 9.0. <https://docs.nvidia.com/cuda/parallel-thread-execution/>
- [19] Jonathan Ragan-Kelley, Andrew Adams, Sylvain Paris, Marc Levoy, Saman P. Amarasinghe, and Frédo Durand. 2012. Decoupling algorithms from schedules for easy optimization of image processing pipelines. *ACM Trans. Graph.* 31, 4 (2012), 32:1–32:12. <https://doi.org/10.1145/2185520.2185528>
- [20] Jonathan Ragan-Kelley, Andrew Adams, Dillon Sharlet, Connelly Barnes, Sylvain Paris, Marc Levoy, Saman P. Amarasinghe, and Frédo Durand. 2018. Halide: decoupling algorithms from schedules for high-performance image processing. *Commun. ACM* 61, 1 (2018), 106–115. <https://doi.org/10.1145/3150211>
- [21] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. 2024. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. arXiv:2407.08608 [cs.LG] <https://arxiv.org/abs/2407.08608>
- [22] Benjamin F. Spector, Simran Arora, Aaryan Singhal, Daniel Y. Fu, and Christopher Ré. 2024. ThunderKittens: Simple, Fast, and Adorable AI Kernels. arXiv:2410.20399 [cs.LG] <https://arxiv.org/abs/2410.20399>
- [23] Michel Steuwer, Toomas Remmelg, and Christophe Dubach. 2017. LIFT: A functional data-parallel IR for high-performance GPU code generation. In *2017 IEEE/ACM International Symposium on Code Generation and Optimization*

- 1030 (CGO). IEEE, Piscataway, NJ, USA, 74–85. <https://doi.org/10.1109/CGO.2017.7863730>
- 1031 [24] Philippe Tillet, H. T. Kung, and David Cox. 2019. Triton: an intermediate language and compiler for tiled neural  
1032 network computations. In *Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and  
1033 Programming Languages* (Phoenix, AZ, USA) (MAPL 2019). Association for Computing Machinery, New York, NY,  
1034 USA, 10–19. <https://doi.org/10.1145/3315508.3329973>
- 1035 [25] Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodidis, Priya Goyal, Zachary DeVito, William S. Moses,  
1036 Sven Verdoolaege, Andrew Adams, and Albert Cohen. 2018. Tensor Comprehensions: Framework-Agnostic High-  
1037 Performance Machine Learning Abstractions. arXiv:1802.04730 [cs.PL]
- 1038 [26] Anand Venkat, Tharindu Rusira, Raj Barik, Mary Hall, and Leonard Truong. 2019. SWIRL: High-performance many-  
1039 core CPU code generation for deep neural networks. *The International Journal of High Performance Computing  
1040 Applications* 33, 6 (2019), 1275–1289. <https://doi.org/10.1177/1094342019866247>
- 1041 [27] Rohan Yadav, Michael Garland, Alex Aiken, and Michael Bauer. 2025. Task-Based Tensor Computations on Modern  
1042 GPUs. *Proc. ACM Program. Lang.* 9, PLDI, Article 163 (June 2025), 25 pages. <https://doi.org/10.1145/3729262>
- 1043
- 1044
- 1045
- 1046
- 1047
- 1048
- 1049
- 1050
- 1051
- 1052
- 1053
- 1054
- 1055
- 1056
- 1057
- 1058
- 1059
- 1060
- 1061
- 1062
- 1063
- 1064
- 1065
- 1066
- 1067
- 1068
- 1069
- 1070
- 1071
- 1072
- 1073
- 1074
- 1075
- 1076
- 1077
- 1078

## 1079 A GPU IR TYPE DEFINITIONS

---

|      |                                                  |                                                |
|------|--------------------------------------------------|------------------------------------------------|
| 1081 | $\tau_x : \text{DataType} ::= \text{f32}$        | 32-bit floating point                          |
| 1082 | f64                                              | 64-bit floating point                          |
| 1083 | f16                                              | 16-bit floating point                          |
| 1084 | i32                                              | 32-bit integer                                 |
| 1085 | ui16                                             | 16-bit unsigned integer                        |
| 1086 | i8                                               | 8-bit integer                                  |
| 1087 | ui8                                              | 8-bit unsigned integer                         |
| 1088 | $\tau_z : \text{BarrierType} ::= \text{barrier}$ | non-explicitly guarded barrier                 |
| 1089 |                                                  | $\text{barrier}(z)$ explicitly guarded barrier |

---

1090 Fig. 10. List of variable types

1091  
1092

---

|      |                                                          | domain                                                  | box                                |
|------|----------------------------------------------------------|---------------------------------------------------------|------------------------------------|
| 1094 | $\tau_u : \text{CollUnit} ::= \text{standalone\_thread}$ | (1,)                                                    | (1,)                               |
| 1095 | $n_1 * \text{cuda\_thread}$                              | (blockDim,)                                             | (n <sub>1</sub> ,)                 |
| 1096 | $\text{cuda\_quadpair}$                                  | (blockDim/16, 16)                                       | (2, 4)                             |
| 1097 | $n_1 * \text{cuda\_warp}$                                | (blockDim,)                                             | (n <sub>1</sub> * 32,)             |
| 1098 | $n_1 * \text{cuda\_warpgroup}$                           | (blockDim,)                                             | (n <sub>1</sub> * 128,)            |
| 1099 | $n_1 * \text{cuda\_threads\_strided}(n_2, n_3)$          | (blockDim/n <sub>3</sub> , n <sub>3</sub> )             | (n <sub>1</sub> , n <sub>2</sub> ) |
| 1100 | $n_1 * \text{cuda\_warp\_in\_cluster}$                   | (clusterDim, blockDim)                                  | (n <sub>1</sub> , 32)              |
| 1101 | $n_1 * \text{cuda\_cta\_in\_cluster}$                    | (clusterDim * blockDim,)                                | (n <sub>1</sub> * blockDim,)       |
| 1102 | $\text{cuda\_cluster}$                                   | (clusterDim * blockDim,)                                | (clusterDim * blockDim,)           |
| 1103 | $n_1 * \text{cuda\_cta\_in\_cluster\_strided}(n_3)$      | (ClusterDim/n <sub>3</sub> , n <sub>3</sub> , blockDim) | (n <sub>1</sub> , 1, blockDim)     |
| 1104 | $n_1 * \text{cuda\_warp\_in\_cluster\_strided}(n_3)$     | (clusterDim/n <sub>3</sub> , n <sub>3</sub> , blockDim) | (n <sub>1</sub> , 1, 32)           |
| 1105 | $\text{cuda\_agnostic\_sub\_cta}$                        | (clusterDim, blockDim)                                  | (1, _)                             |
|      | $\text{cuda\_agnostic\_intact\_cta}$                     | (clusterDim, blockDim)                                  | (_, blockDim)                      |

---

1106 Fig. 11. Collective units defined in the frontend language. These are parameterized by a collective type  $\delta$ ,  
1107 which consists of the *domain* and *box* shown here.1108  
1109

---

|      |                                                               |                                                             |
|------|---------------------------------------------------------------|-------------------------------------------------------------|
| 1110 | $\pi_d : \text{DataMemory} ::= \text{CudaBasicDeviceVisible}$ | base type for CUDA-visible allocations                      |
| 1111 | CudaBasicSmem                                                 | base type for CUDA SMEM allocations                         |
| 1112 | CudaDeviceVisibleAtomicity16B                                 | base type for 16B-atomicity layout CUDA-visible allocations |
| 1113 | CudaDeviceVisibleLinear                                       | base type for linear-order CUDA-visible allocations         |
| 1114 | CudaGridConstant                                              | CUDA grid constant memory                                   |
| 1115 | CudaGmemLinear                                                | CUDA global memory, linear-order                            |
| 1116 | CudaSmemAtomicity16B                                          | Any CUDA shared memory with 16B-atomicity layout            |
| 1117 | CudaSmemLinear                                                | CUDA shared memory with linear-order layout                 |
| 1118 | CudaRmem                                                      | Per-thread CUDA registers                                   |
| 1119 | Sm80_RmemMatrixA(M, K)                                        | Matrix tile for sm_80+ warp MMA A operand                   |
| 1120 | Sm80_RmemMatrixB(N, K)                                        | Matrix tile for sm_80+ warp MMA B operand                   |
| 1121 | Sm80_RmemMatrixD(M, N)                                        | Matrix tile for sm_80+ MMA accumulator operands             |
| 1122 | Sm90_SmemSwizzled(B)                                          | B-byte swizzled shared memory, in TMA/wmma-accepted format  |
| 1123 | Sm90_RmemMatrixA(M)                                           | Register tile for wmma A operand (not implemented)          |
|      | Sm90_RmemMatrixD(M, N)                                        | Register tile for wmma D operand                            |

---

1124 Fig. 12. CUDA Memory Types – Data Allocations (pre-existing Exo Memory not listed)

1125  
1126  
1127

---

```

1128  $\pi_z : \text{BarrierComp} ::= \text{CudaDeviceBarrier}$  base type for CUDA-allocated barriers
1129 | CudaMbarrier CUDA mbarrier
1130 | CudaCommitGroup CUDA commit_group mechanism
1131 | CudaClusterSync barrier.cluster sync

```

---

Fig. 13. CUDA Barrier (completion) mechanisms

---

```

1135  $\pi_w : \text{SpecialWindow} ::= \text{Sm90\_tensorMap(swizzle, *smem\_box)}$  CUtensorMap parameterized by swizzle mode and box size
1136

```

---

Fig. 14. CUDA Special Window Types (for Window Statement)

---

|                                                          |                                                             |
|----------------------------------------------------------|-------------------------------------------------------------|
| $\tau_q : \text{QualTL} ::= \text{cpu\_in\_order\_qual}$ | Ordinary CPU reads/writes (cpu)                             |
| $\text{cpu\_cuda\_stream\_qual}$                         | Stream-ordered CUDA API calls (strm)                        |
| $\text{cuda\_in\_order\_rmem\_qual}$                     | non-asynchronous CUDA instrs' accesses to registers (cuda1) |
| $\text{cuda\_in\_order\_ram\_qual}$                      | non-asynchronous CUDA instrs' accesses to SMEM/GMEM (cuda2) |
| $\text{Sm80\_cp\_async\_qual}$                           | non-bulk cp.async memory accesses (Sm80)                    |
| $\text{tma\_to\_smem\_async\_qual}$                      | cp.async.bulk GMEM reads, SMEM writes (tmaS)                |
| $\text{tma\_to\_gmem\_async\_qual}$                      | cp.async.bulk SMEM reads, GMEM writes/reduces (tmaG)        |
| $\text{wmma\_async\_rmem\_a\_qual}$                      | wmma.mma_async reads from A register parameter (wgA)        |
| $\text{wmma\_async\_rmem\_d\_qual}$                      | wmma.mma_async accesses to D (accumulator) (wgD)            |
| $\text{wmma\_async\_smem\_qual}$                         | wmma.mma_async access to SMEM, A or B (wgS)                 |
| $\text{wmma\_zero\_qual}$                                | special case used to model the wmma scale-d parameter (wg0) |

---

Fig. 15. List of qualitative timelines

| $\tau_s : \text{SyncTL}$ | trnstv? | cpu   | strm  | cuda1 | cuda2 | Sm80  | tmaS  | tmaG  | wgA   | wgD   | wgS   | wg0   |
|--------------------------|---------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| empty_sync_tl            |         |       |       |       |       |       |       |       |       |       |       |       |
| cpu_in_order             | y       | full  |       |       |       |       |       |       |       |       |       |       |
| cuda_stream_sync         | y       |       | full  |
| cuda_in_order            | y       |       | temp. | full  | full  | temp. |
| cuda_temporal            |         |       | temp. |
| Sm80_cp_async            |         |       |       |       |       | full  |       |       |       |       |       |       |
| Sm80_generic             |         |       |       |       |       | full  | temp. | temp. | temp. | temp. | temp. | temp. |
| tma_to_smem_async        |         |       |       |       |       |       | temp. | temp. | temp. | temp. | temp. | temp. |
| tma_to_gmem_async        |         |       |       |       |       |       | full  | full  | full  | full  | full  | full  |
| wmma_async_smem          |         |       |       |       |       |       |       |       |       |       |       | full  |
| wmma_fence_1             |         |       |       |       |       |       |       |       | full  | full  |       |       |
| wmma_fence_2             |         |       |       |       |       |       |       |       | full  | full  |       |       |
| wmma_async               |         |       |       |       |       |       |       |       | full  | full  | full  |       |
| cuda_async_proxy         |         |       |       |       |       |       |       |       |       | full  |       |       |
| cuda_async_proxy_wmma    |         |       |       |       |       |       |       |       | full  | full  | full  |       |
| cuda_generic_and_async   |         | temp. | full  | full  |       | full  | full  | full  | temp. | temp. | full  | temp. |

Fig. 16. SyncTL are defined as composition of QualTL.

## B ABSTRACT MACHINE CONVERSION AND SEMANTICS DEFINITIONS

A new record Constructor  $\mathcal{R}$  is used to define RecordRead and RecordMutate in Figure 22. RECORDREAD and RECORDMUTATE first evaluate the accessor  $e^\#$  and  $e_z^\#$  to access sets  $W_{e^\#}$  and  $W_{e_z^\#}$  (via  $\Downarrow_{\text{Acc}}$ ). They then update the visibility map  $\rho$  at each key  $(x, n) \in W_{e^\#}$  using the new record constructor  $\mathcal{R}$  defined below: RECORDREAD adds it to the *read* visibility set, setting  $\rho'(x, n).r = \rho(x, n).r \cup \mathcal{R}(\dots)$  while leaving  $\rho(x, n).m$  and  $\rho(x, n).b$  unchanged; RECORDMUTATE adds it to the *mutate* visibility

1177 set, setting  $\rho'(x, \mathbf{n}).m = \rho(x, \mathbf{n}).m \cup \mathcal{R}(\dots)$  while preserving  $\rho(x, \mathbf{n}).r$  and  $\rho(x, \mathbf{n}).b$ . In both cases,  
 1178 neither the control  $i$  nor the store  $\sigma$  changes.

1179 **Definition B.1** (New Record Constructor  $\mathcal{R}$ ). Define  $\mathcal{R}$  as follows:

$$1180 \quad \mathcal{R} : VL \times Bool \times QualTL \times \mathcal{P}(QualTL) \times Expr^\# \times \rho \times \mathbb{G} \longrightarrow \mathcal{P}(\mathbb{R})$$

$$1182 \quad \mathcal{R}(\tau_v, t, \tau_q, \tau_q^*, E^\#, \rho, G) = \begin{cases} \{(\tau_q, S_G, P_{E^\#, \rho})\}, & t = \text{true}, \\ \{(\tau_q, S_{\{g\}}, P_{E^\#, \rho}) \mid g \in G\}, & t = \text{false}, \end{cases}$$

1185 where, for any  $H \subseteq \mathbb{G}$ ,  $S_H \in \mathbb{S}$  and  $P_{E^\#, \rho} \in \mathbb{P}$  are given pointwise by

$$1187 \quad S_H(g', \tau_q') = \max_{VL} \left( \begin{cases} \text{atomic\_only}, & \tau_q' \in \tau_q^*, \\ \text{invisible}, & \text{otherwise} \end{cases}, \begin{cases} \tau_v, & g' \in H \wedge \tau_q' = \tau_q, \\ \text{invisible}, & \text{otherwise} \end{cases} \right),$$

$$1190 \quad P_{E^\#, \rho}(z, \mathbf{n}) = \begin{cases} \{a\}, & (z, \mathbf{n}) \in E^\# \text{ where } (a, \_) = \rho(z, \mathbf{n}).b, \\ \emptyset, & \text{otherwise}. \end{cases}$$

1192 Fence, Arrive, and Await (Figure 22) update  $\rho$  solely via  $\text{lift}(\rho, \lambda)$  (defined below), which applies  
 1193  $\lambda$  pointwise to both the read and mutate visibility sets. For  $\text{Fence}(t, \tau_q^{\text{pre}*}, \tau_q^{\text{full}*}, \tau_q^{\text{temp}*})$ ,  $\lambda(r) =$   
 1194  $\mathcal{A}(r, g^{\text{exec}*}, \tau_q^{\text{full}*}, \tau_q^{\text{temp}*})$  iff the witness  $\mathcal{W}(t, g^{\text{exec}*}, \tau_q^{\text{pre}*}, r)$  holds, otherwise  $\lambda(r) = r$ ; thus  
 1195 visibility is raised by joining  $f \vee f_{\text{aug}}$ . For  $\text{Arrive}(t, \tau_q^{\text{pre}*}, e_z^{\#*})$ , the accessed sets yield a unique  
 1196 home barrier  $\{(z, \mathbf{n})\} = \bigcap_i W_{e_z^{\#*}}$ ; guarded by  $\mathcal{W}$ ,  $\lambda$  adds the existing arrive count  $a$  from  $\rho(z, \mathbf{n})$   
 1197 into  $r.p$  at every  $(z, \mathbf{n}) \in \mathbf{W}$ , and home barrier's arrive count is incremented by  $(a' + 1, w')$ .  
 1198 For  $\text{Await}(e_z^{\#*}, \tau_q^{\text{full}*}, \tau_q^{\text{temp}*}, n)$ , with  $(a, w) = \rho(z, \mathbf{n}).b$ , compute Max and New as in the rule; set  
 1199  $\lambda(r) = \mathcal{A}(r, g_s^*(\rho), \tau_q^{\text{full}*}, \tau_q^{\text{temp}*})$  iff some observed arrive count  $i \in r.p(z, \mathbf{n})$  satisfies  $i \leq \text{Max}$ ,  
 1200 else  $\lambda(r) = r$ ; after lift, update the wait count to  $(a', \text{New})$ . In short,  $\mathcal{W}$  decides when augmentation  
 1201 is permitted,  $\mathcal{A}$  raises visibility to fully\_ordered or temporally\_ordered.

1204 **Definition B.2** (Lift). For  $\lambda : \mathbb{R} \rightarrow \mathbb{R}$ , define  $\text{lift}(\rho, \lambda)$  pointwise by

$$1205 \quad \text{lift}(\rho, \lambda)(x, v) = (\{\lambda(r) \mid r \in R\}, \{\lambda(r) \mid r \in M\}, (a, w)) \quad \text{where } (R, M, (a, w)) = \rho(x, v).$$

1207 **Definition B.3** (Synchronization helpers). Let  $r = (\tau_q^{\text{orig}}, \mathcal{f}, p) \in \mathbb{R}$  with  $\mathcal{f} \in \mathbb{S}$ . For  $\mathcal{f}_1, \mathcal{f}_2 \in \mathbb{S}$ , define  
 1208 the pointwise join  $(\mathcal{f}_1 \vee \mathcal{f}_2)(g, \tau_q) = \max_{VL} \{\mathcal{f}_1(g, \tau_q), \mathcal{f}_2(g, \tau_q)\}$ .

1209 (i) *Witness  $\mathcal{W}$* . For  $t \in \text{Bool}$ ,  $g^{\text{exec}*} \subseteq \mathbb{G}$ , and  $\tau_q^{\text{pre}*} \subseteq \text{QualTL}$ , set

$$1211 \quad \mathcal{W}(t, g^{\text{exec}*}, \tau_q^{\text{pre}*}, r) = \begin{cases} \exists g \in g^{\text{exec}*} \exists \tau_q \in \tau_q^{\text{pre}*} : \mathcal{f}(g, \tau_q) \geq \text{unordered}, & t = \text{true}, \\ (\tau_q^{\text{orig}} \in \tau_q^{\text{pre}*}) \wedge \exists g \in g^{\text{exec}*} : \mathcal{f}(g, \tau_q^{\text{orig}}) \geq \text{unordered}, & t = \text{false}. \end{cases}$$

1214 (ii) *Augment  $\mathcal{A}$* . Given  $g^{\text{exec}*} \subseteq \mathbb{G}$  and  $\tau_q^{\text{full}*}, \tau_q^{\text{temp}*} \subseteq \text{QualTL}$ , define

$$1215 \quad \mathcal{A}(r, g^{\text{exec}*}, \tau_q^{\text{full}*}, \tau_q^{\text{temp}*}) = (\tau_q^{\text{orig}}, \mathcal{f} \vee \mathcal{f}_{\text{aug}}, p),$$

$$1217 \quad \mathcal{f}_{\text{aug}}(g, \tau_q) = \begin{cases} \text{fully\_ordered}, & g \in g^{\text{exec}*} \wedge \tau_q \in \tau_q^{\text{full}*}, \\ \text{temporally\_ordered}, & g \in g^{\text{exec}*} \wedge \tau_q \in \tau_q^{\text{temp}*} \setminus \tau_q^{\text{full}*}, \\ \text{invisible}, & \text{otherwise}. \end{cases}$$

1221

1222

1223

1224

1225

|      | <i>Data expressions</i> $e \in \text{DExpr}$               | $\mathcal{P}(\text{Expr}^\#)$ |
|------|------------------------------------------------------------|-------------------------------|
| 1226 | $d$                                                        | $\emptyset$                   |
| 1227 | $x$                                                        | $\{x[\langle\rangle]\}$       |
| 1228 | $x[w^*]$                                                   | $\{x[w^*]\}$                  |
| 1229 | $e_1 \text{ op } e_2 \quad (\text{op} \in \{+, -, *, /\})$ | $T_e(e_1) \cup T_e(e_2)$      |
| 1230 | <hr/>                                                      |                               |
| 1231 | <i>Barrier expressions</i> $e_z \in \text{ZExpr}$          |                               |
| 1232 | $z$                                                        | $\{z[\langle\rangle]\}$       |
| 1233 | $z[w^*]$                                                   | $\{z[w^*]\}$                  |
| 1234 | <hr/>                                                      |                               |
| 1235 |                                                            |                               |

Fig. 17. Expression conversion  $T_e : \text{DExpr} \cup \text{ZExpr} \rightarrow \mathcal{P}(\text{Expr}^\#)$  collects the distinct reads in an expression and returns them as a set of AMIR reads ( $\text{Expr}^\#$ ).

|      | Type                                                                      | Definition                                                                                                                                                                                                                                                                 |
|------|---------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1240 | $T_{\text{smem}} : \text{Expr} \rightarrow \mathcal{P}(\text{Expr}^\#)$   | $T_{\text{smem}}(e) \triangleq \{e^\# \in T_e(e) \mid \text{isSmem}(e^\#)\}.$                                                                                                                                                                                              |
| 1241 | $T_{\text{ex}} : \text{Expr} \rightarrow \mathcal{P}(\text{Expr}^\#)$     | $T_{\text{ex}}(e) \triangleq \{e^\# \in T_e(e) \mid \neg \text{isEx}(e^\#)\}.$                                                                                                                                                                                             |
| 1242 | $\text{isEx} : \text{Expr}^\# \rightarrow \text{Bool}$                    | $\text{isEx}(e^\#) \triangleq \text{true iff } e^\# \text{ is annotated as sync-exempt memory; false otherwise.}$                                                                                                                                                          |
| 1243 | $\text{isSmem} : \text{Expr}^\# \rightarrow \text{Bool}$                  | $\text{isSmem}(e^\#) \triangleq \text{true iff } e^\# \text{ is annotated as shared memory; false otherwise.}$                                                                                                                                                             |
| 1244 | $T_{\text{tr}} : \text{SyncTL} \rightarrow \text{Bool}$                   | $T_{\text{tr}}(\tau_s) \triangleq \text{true iff } \tau_s \text{ is transitive; false otherwise.}$                                                                                                                                                                         |
| 1245 | $T_{\text{qfull}} : \text{SyncTL} \rightarrow \mathcal{P}(\text{QualTL})$ | $T_{\text{qfull}}(\tau_s) \triangleq \{\tau'_q \in \text{QualTL} \mid \tau'_q \text{ annotated as "full" in } \tau_s\}.$                                                                                                                                                   |
| 1246 | $T_{\text{qtmp}} : \text{SyncTL} \rightarrow \mathcal{P}(\text{QualTL})$  | $T_{\text{qtmp}}(\tau_s) \triangleq \{\tau'_q \in \text{QualTL} \mid \tau'_q \text{ annotated as "temporal" or "full" in } \tau_s\}.$                                                                                                                                      |
| 1247 | $T_{\text{qarr}} : \text{SyncTL} \rightarrow \text{QualTL}$               | $T_{\text{qarr}}(\tau_s) \triangleq \begin{cases} \text{Sm80_cp_async_qual}, & \text{if Sm80_cp_async_qual} \in T_{\text{qfull}}(\tau_s), \\ \text{cuda_in_order_qual}, & \text{otherwise.} \end{cases}$                                                                   |
| 1248 | $T_{\text{qin}} : \text{Expr}^\# \rightarrow \text{QualTL}$               | $T_{\text{qin}}(e^\#) \triangleq \begin{cases} \text{cpu_in_order_qual}, & \text{if } e^\# \text{ is host/CPU scoped,} \\ \text{cuda_in_order_rmem_qual}, & \text{if } e^\# \text{ is register memory,} \\ \text{cuda_in_order_ram_qual}, & \text{otherwise.} \end{cases}$ |
| 1249 | <hr/>                                                                     |                                                                                                                                                                                                                                                                            |
| 1250 |                                                                           |                                                                                                                                                                                                                                                                            |
| 1251 |                                                                           |                                                                                                                                                                                                                                                                            |
| 1252 |                                                                           |                                                                                                                                                                                                                                                                            |
| 1253 |                                                                           |                                                                                                                                                                                                                                                                            |
| 1254 |                                                                           |                                                                                                                                                                                                                                                                            |

Fig. 18. Helper functions used by Fig. 5 statement conversion. All sets are finite.

1255  
1256  
1257  
1258  
1259  
1260  
1261  
1262  
1263  
1264  
1265  
1266  
1267  
1268  
1269  
1270  
1271  
1272  
1273  
1274

| 1275 | Case for $g.p_i$                | Abstract Machine IR                                                                                                                                  |
|------|---------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1276 | Non-atomic <i>read-only</i>     | $\vdots_{e^\# \in T_{\text{ex}}(e_i)} \left( \text{CheckMutates}(\text{full\_ordered}, T_{\text{cvg}}(g.p_i), e^\#, T_{\text{ext}}(g.p_i)); \right.$ |
| 1277 |                                 | $\left. \text{RecordRead}(\tau_v^{\text{post}}, T_{\text{cvg}}(g.p_i), e^\#, T_{\text{init}}(g.p_i), \emptyset, T_e(e_z)) \right)$                   |
| 1278 | Non-atomic <i>non-read-only</i> | $\vdots_{e^\# \in T_{\text{ex}}(e_i)} \left( \text{CheckMutates}(\tau_v^{\text{pre}}, T_{\text{cvg}}(g.p_i), e^\#, T_{\text{ext}}(g.p_i)); \right.$  |
| 1279 |                                 | $\text{CheckReads}(\text{temporal\_ordered}, T_{\text{cvg}}(g.p_i), e^\#, T_{\text{ext}}(g.p_i));$                                                   |
| 1280 |                                 | $\text{ClearReads}(e^\#);$                                                                                                                           |
| 1281 |                                 | $\text{ClearMutates}(e^\#);$                                                                                                                         |
| 1282 |                                 | $\left. \text{RecordMutate}(\tau_v^{\text{post}}, T_{\text{cvg}}(g.p_i), e^\#, T_{\text{init}}(g.p_i), \emptyset, T_e(e_z)) \right)$                 |
| 1283 |                                 | where $\tau_v^{\text{pre}}$ = temporal_ordered if $g.p_i$ is write-only, else full_ordered.                                                          |
| 1284 | Atomic                          | $\vdots_{e^\# \in T_{\text{ex}}(e_i)} \left( \text{CheckMutates}(\text{atomic\_only}, T_{\text{cvg}}(g.p_i), e^\#, T_{\text{ext}}(g.p_i)); \right.$  |
| 1285 |                                 | $\text{CheckReads}(\text{temporal\_ordered}, T_{\text{cvg}}(g.p_i), e^\#, T_{\text{ext}}(g.p_i));$                                                   |
| 1286 |                                 | $\text{ClearReads}(e^\#); (\text{note lack of ClearMutates})$                                                                                        |
| 1287 |                                 | $\left. \text{RecordMutate}(\tau_v^{\text{post}}, T_{\text{cvg}}(g.p_i), e^\#, T_{\text{init}}(g.p_i), T_{\text{atom}}(g.p_i), T_e(e_z)) \right)$    |
| 1288 |                                 |                                                                                                                                                      |
| 1289 |                                 |                                                                                                                                                      |
| 1290 |                                 |                                                                                                                                                      |
| 1291 |                                 |                                                                                                                                                      |

Fig. 19. Argument conversion  $T_{\text{arg}}(g.p_i, e_i, e_z)$ .  $\tau_v^{\text{post}}$  = unordered if  $T_{\text{ooo}}(g.p_i)$ , else full\_ordered.

| 1292 | Type                                                                   | Definition                                                                                                |
|------|------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|
| 1293 | $T_{\text{ext}} : \text{Expr} \rightarrow \mathcal{P}(\text{QualTL})$  | $T_{\text{ext}}(e) \triangleq \{ \tau_q \in \text{QualTL} \mid \tau_q \in e.\text{ext\_qual\_tl} \}.$     |
| 1294 | $T_{\text{atom}} : \text{Expr} \rightarrow \mathcal{P}(\text{QualTL})$ | $T_{\text{atom}}(e) \triangleq \{ \tau_q \in \text{QualTL} \mid \tau_q \in e.\text{atomic\_qual\_tl} \}.$ |
| 1295 | $T_{\text{init}} : \text{Expr} \rightarrow \text{QualTL}$              | $T_{\text{init}}(e) \triangleq e.\text{initial\_qual\_tl}.$                                               |
| 1296 | $T_{\text{ooo}} : \text{Expr} \rightarrow \text{Bool}$                 | $T_{\text{ooo}}(e) \triangleq e.\text{out\_of\_order}$                                                    |
| 1297 | $T_{\text{cvg}} : \text{Expr} \rightarrow \text{Bool}$                 | $T_{\text{cvg}}(e) \triangleq e.\text{convergent}$                                                        |
| 1298 |                                                                        |                                                                                                           |
| 1299 |                                                                        |                                                                                                           |
| 1300 |                                                                        |                                                                                                           |
| 1301 |                                                                        |                                                                                                           |

Fig. 20. Helper functions used by Fig. 19 argument conversion.

$$\begin{array}{c}
 \langle c_{\text{lo}}, \sigma \rangle \Downarrow_{\mathbb{Z}} m \quad \langle c_{\text{hi}}, \sigma \rangle \Downarrow_{\mathbb{Z}} n \quad m < n \quad \langle s, i, \sigma[y \mapsto m], \rho \rangle \Downarrow i', \sigma', \rho' \\
 \langle \text{for } y \text{ in } (m+1) \dots n \text{ do } s^{\#}, i', \sigma', \rho' \rangle \Downarrow i'', \sigma'', \rho'' \\
 \hline
 \langle \text{for } y \text{ in loop}(c_{\text{lo}}, c_{\text{hi}}) \text{ do } s^{\#}, i, \sigma, \rho \rangle \Downarrow i'', \sigma'', \rho'' \quad \text{LoopSTEP}
 \end{array}$$
  

$$\frac{\langle c_{\text{lo}}, \sigma \rangle \Downarrow_{\mathbb{Z}} m \quad \langle c_{\text{hi}}, \sigma \rangle \Downarrow_{\mathbb{Z}} n \quad m \geq n}{\langle \text{for } y \text{ in loop}(c_{\text{lo}}, c_{\text{hi}}) \text{ do } s^{\#}, i, \sigma, \rho \rangle \Downarrow i, \sigma, \rho} \quad \text{LoopDONE}$$
  

$$\frac{\langle s_1^{\#}, i, \sigma, \rho \rangle \Downarrow i', \sigma', \rho' \quad \langle s_2^{\#}, i', \sigma', \rho' \rangle \Downarrow i'', \sigma'', \rho''}{\langle s_1^{\#}; s_2^{\#}, i, \sigma, \rho \rangle \Downarrow i'', \sigma'', \rho''} \quad \text{SEQ} \quad \frac{i' = i + 1}{\langle \text{InctaskID}, i, \sigma, \rho \rangle \Downarrow i', \sigma, \rho} \quad \text{INCTASKID}$$

$$\frac{\langle b, \sigma \rangle \Downarrow_{\text{Bool}} \text{true} \quad \langle s^{\#}, i, \sigma, \rho \rangle \Downarrow i', \sigma', \rho'}{\langle \text{if } b \text{ then } s^{\#}, i, \sigma, \rho \rangle \Downarrow i', \sigma', \rho'} \quad \text{IF-TRUE} \quad \frac{\langle b, \sigma \rangle \Downarrow_{\text{Bool}} \text{false}}{\langle \text{if } b \text{ then } s^{\#}, i, \sigma, \rho \rangle \Downarrow i, \sigma, \rho} \quad \text{IF-FALSE}$$

Fig. 21. Big-step semantics for a loop, sequencing, an explicit task-ID increment, and a guard.

|      |                                                                                                                                                                                                                                                             |
|------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1324 | $c^* = (c_1, \dots, c_k) \quad \langle c_i, \sigma \rangle \Downarrow_{\mathbb{Z}} n_i \text{ for } i = 1..k$                                                                                                                                               |
| 1325 | $I = \{ (v_1, \dots, v_k) \mid 0 \leq v_i < n_i \} \quad \rho' = \rho[(x, \mathbf{n}) \mapsto (\emptyset, \emptyset, (0, 0))]_{\mathbf{n} \in I}$ ALLOC                                                                                                     |
| 1326 | $\langle \text{Alloc}(x[c^*]), \iota, \sigma, \rho \rangle \Downarrow \iota, \sigma, \rho'$                                                                                                                                                                 |
| 1327 |                                                                                                                                                                                                                                                             |
| 1328 | $\langle e^\#, \sigma \rangle \Downarrow_{\text{Acc}} W_{e^\#} \quad \rho' = \rho[(x, \mathbf{n}) \mapsto (\emptyset, \rho(x, \mathbf{n}).m, \rho(x, \mathbf{n}).b)]_{(x, \mathbf{n}) \in W_{e^\#}}$ CLEARREADS                                             |
| 1329 | $\langle \text{ClearReads}(e^\#), \iota, \sigma, \rho \rangle \Downarrow \iota, \sigma, \rho'$                                                                                                                                                              |
| 1330 |                                                                                                                                                                                                                                                             |
| 1331 | $\langle e^\#, \sigma \rangle \Downarrow_{\text{Acc}} W_{e^\#} \quad \rho' = \rho[(x, \mathbf{n}) \mapsto (\rho(x, \mathbf{n}).r, \emptyset, \rho(x, \mathbf{n}).b)]_{(x, \mathbf{n}) \in W_{e^\#}}$ CLEARMUTATES                                           |
| 1332 | $\langle \text{ClearMutates}(e^\#), \iota, \sigma, \rho \rangle \Downarrow \iota, \sigma, \rho'$                                                                                                                                                            |
| 1333 |                                                                                                                                                                                                                                                             |
| 1334 |                                                                                                                                                                                                                                                             |
| 1335 | $\langle e^\#, \sigma \rangle \Downarrow_{\text{Acc}} W_{e^\#} \quad \langle e_z^\#, \sigma \rangle \Downarrow_{\text{Acc}} W_{e_z^\#}$                                                                                                                     |
| 1336 | $\rho' = \rho[(x, \mathbf{n}) \mapsto (\rho(x, \mathbf{n}).r \cup \mathcal{R}(\tau_v, t, \tau_q, \tau_q^*, W_{e_z^\#}, \rho, g_{s^\#}(\sigma, \iota)), \rho(x, \mathbf{n}).m, \rho(x, \mathbf{n}).b)]_{(x, \mathbf{n}) \in W_{e^\#}}$ RECORDREAD            |
| 1337 | $\langle \text{RecordRead}(\tau_v, t, e^\#, \tau_q, \tau_q^*, e_z^\#), \iota, \sigma, \rho \rangle \Downarrow \iota, \sigma, \rho'$                                                                                                                         |
| 1338 |                                                                                                                                                                                                                                                             |
| 1339 | $\langle e^\#, \sigma \rangle \Downarrow_{\text{Acc}} W_{e^\#} \quad \langle e_z^\#, \sigma \rangle \Downarrow_{\text{Acc}} W_{e_z^\#}$                                                                                                                     |
| 1340 | $\rho' = \rho[(x, \mathbf{n}) \mapsto (\rho(x, \mathbf{n}).r, \rho(x, \mathbf{n}).m \cup \mathcal{R}(\tau_v, t, \tau_q, \tau_q^*, W_{e_z^\#}, \rho, g_{s^\#}(\sigma, \iota)), \rho(x, \mathbf{n}).b)]_{(x, \mathbf{n}) \in W_{e^\#}}$ RECORDMUTATE          |
| 1341 | $\langle \text{RecordMutate}(\tau_v, t, e^\#, \tau_q, \tau_q^*, e_z^\#), \iota, \sigma, \rho \rangle \Downarrow \iota, \sigma, \rho'$                                                                                                                       |
| 1342 |                                                                                                                                                                                                                                                             |
| 1343 | $\lambda = (r \mapsto \mathcal{A}(r, g_{s^\#}(\sigma, \iota), \tau_q^{\text{full}*}, \tau_q^{\text{temp}*}) \text{ if } \mathcal{W}(t, g_{s^\#}(\sigma, \iota), \tau_q^{\text{pre}*}, r) \text{ else } r)$ FENCE                                            |
| 1344 | $\langle \text{Fence}(t, \tau_q^{\text{pre}*}, \tau_q^{\text{full}*}, \tau_q^{\text{temp}*}), \iota, \sigma, \rho \rangle \Downarrow \iota, \sigma, \text{lift}(\rho, \lambda)$                                                                             |
| 1345 |                                                                                                                                                                                                                                                             |
| 1346 |                                                                                                                                                                                                                                                             |
| 1347 | $\langle \langle e_{z1}^\#, \dots, e_{zk}^\# \rangle, \sigma \rangle \Downarrow_{\text{Acc}} W_{e_{z1}^\#}, \dots, W_{e_{zk}^\#} \quad \{(z, \mathbf{n})\} = \bigcap_{i=1}^k W_{e_{zi}^\#} \quad \mathbf{W} = \bigcup_{i=1}^k W_{e_{zi}^\#}$                |
| 1348 |                                                                                                                                                                                                                                                             |
| 1349 | $\lambda = (r \mapsto (r.q, r.f, r.p[(z, \mathbf{n}) \mapsto r.p(z, \mathbf{n}) \cup \{a\}]_{z, \mathbf{n} \in \mathbf{W}}^{(a_\sim) = \rho(z, \mathbf{n})}) \text{ if } \mathcal{W}(t, g_{s^\#}(\sigma, \iota), \tau_q^{\text{pre}*}, r) \text{ else } r)$ |
| 1350 | $\rho' = \text{lift}(\rho, \lambda) \quad (R', M', (a', w')) = \rho'(z, \mathbf{n}) \quad \rho'' = \rho'[(z, \mathbf{n}) \mapsto (R', M', (a' + 1, w'))]$ ARRIVE                                                                                            |
| 1351 | $\langle \text{Arrive}(t, \tau_q^{\text{pre}*}, e_z^{\#*}), \iota, \sigma, \rho \rangle \Downarrow \iota, \sigma, \rho''$                                                                                                                                   |
| 1352 |                                                                                                                                                                                                                                                             |
| 1353 | $\langle e_z^\#, \sigma \rangle \Downarrow_{\text{Acc}} \{(z, \mathbf{n})\} \quad (a, w) = \rho(z, \mathbf{n}).b \quad \text{Max} \triangleq \text{if } n \geq 0 \text{ then } a - (n + 1) \text{ else } w + n + 1$                                         |
| 1354 | $\text{New} \triangleq \text{if } n \geq 0 \text{ then } \max(w, \text{Max} + 1) \text{ else } w + 1$                                                                                                                                                       |
| 1355 |                                                                                                                                                                                                                                                             |
| 1356 | $\lambda = (r \mapsto \mathcal{A}(r, g_{s^\#}(\rho), \tau_q^{\text{full}*}, \tau_q^{\text{temp}*}) \text{ if } \exists i. i \leq \text{Max} \wedge i \in r.p(z, \mathbf{n}) \text{ else } r)$                                                               |
| 1357 | $\rho' = \text{lift}(\rho, \lambda) \quad (R', M', (a', \_) = \rho'(z, \mathbf{n}) \quad \rho'' = \rho'[(z, \mathbf{n}) \mapsto (R', M', (a', \text{New}))]$ AWAIT                                                                                          |
| 1358 | $\langle \text{Await}(e_z^\#, \tau_q^{\text{full}*}, \tau_q^{\text{temp}*}, n), \iota, \sigma, \rho \rangle \Downarrow \iota, \sigma, \rho''$                                                                                                               |
| 1359 |                                                                                                                                                                                                                                                             |
| 1360 | Fig. 22. The big-step operational semantics for Abstract Machine statements with sync environment updates.                                                                                                                                                  |
| 1361 |                                                                                                                                                                                                                                                             |
| 1362 |                                                                                                                                                                                                                                                             |
| 1363 |                                                                                                                                                                                                                                                             |
| 1364 |                                                                                                                                                                                                                                                             |
| 1365 |                                                                                                                                                                                                                                                             |
| 1366 |                                                                                                                                                                                                                                                             |
| 1367 |                                                                                                                                                                                                                                                             |
| 1368 |                                                                                                                                                                                                                                                             |
| 1369 |                                                                                                                                                                                                                                                             |
| 1370 |                                                                                                                                                                                                                                                             |
| 1371 |                                                                                                                                                                                                                                                             |
| 1372 |                                                                                                                                                                                                                                                             |