

# nncase: An End-to-End Compiler for Efficient LLM Deployment on Heterogeneous Storage Architectures

Hui Guo\*, Qihang Zheng<sup>†</sup>, Chenghai Huo, Dongliang Guo, Haoqi Yang, Yang Zhang

Canaan Inc.

## Abstract

The efficient deployment of large language models (LLMs) is hindered by memory architecture heterogeneity, where traditional compilers suffer from fragmented workflows and high adaptation costs. We present nncase, an open-source, end-to-end compilation framework designed to unify optimization across diverse targets. Central to nncase is an e-graph-based term rewriting engine that mitigates the phase ordering problem, enabling global exploration of computation and data movement strategies. The framework integrates three key modules: Auto Vectorize for adapting to heterogeneous computing units, Auto Distribution for searching parallel strategies with cost-aware communication optimization, and Auto Schedule for maximizing on-chip cache locality. Furthermore, a buffer-aware Codegen phase ensures efficient kernel instantiation. Evaluations show that nncase outperforms mainstream frameworks like MLC LLM and Intel IPEX on Qwen3 series models and achieves performance comparable to the hand-optimized llama.cpp on CPUs, demonstrating the viability of automated compilation for high-performance LLM deployment. The source code is available at <https://github.com/kendryte/nncase>.

## 1 Introduction

As large language models (LLMs) such as GPT-4 [35] and Llama [44] scale beyond the trillion-parameter mark, they impose exponential demands on computation and storage. This surge presents a fundamental challenge: navigating the heterogeneity of memory architectures. Contemporary deployment landscapes are bifurcated into distributed systems—spanning multi-server clusters and specialized accelerators (e.g., NVIDIA H100 arrays, Cerebras WSE [7, 28])—and single-node setups on standalone devices. Despite their topological differences, these architectures share critical constraints: (1) Deep Memory Hierarchies: Ranging from KB-level SRAM/L1 caches to GB-level global HBM-/DRAM. (2) Heterogeneous computation Units: The coexistence of scalar, vector (e.g. AVX-512 [40]) and matrix (e.g.

Intel AMX [47]) units. (3) The Memory Wall: With the memory bandwidth growing only 20% annually versus a 100% surge in compute power, data movement has become the primary bottleneck for LLM inference.

Efficient deployment requires orchestrating the synergy between memory and computation, specifically targeting hierarchical reuse and heterogeneous unit adaptation. However, conventional compilers often necessitate distinct optimization pipelines for uniform and non-uniform memory architectures. This dichotomy introduces prohibitive adaptation costs and creates technical barriers to global optimization.

To bridge this gap, we present nncase, an open-source end-to-end compilation framework designed to unify LLM deployment across heterogeneous memory architectures. The cornerstone of nncase is its "unified distributed compilation paradigm." By modeling all targets via a Non-Uniform Memory Access (NUMA) abstraction, nncase decouples the compilation workflow from physical topology, achieving a "compile once, adapt everywhere" capability. The framework is underpinned by an e-graph-based term rewriting engine that employs equality saturation. Unlike the greedy strategies of traditional deep learning compilers, our engine circumvents the phase ordering problem, exploring a comprehensive design space without compromising semantic integrity. Furthermore, by integrating a Roofline-based cost model, nncase performs multi-objective optimization to balance memory footprint, latency, and communication overhead.

Using the capacity of the e-graph to represent equivalence, nncase enables three distinct optimization passes:

**1) Auto Vectorize:** Addressing the challenge of mapping tensors to various hardware units, this module introduces MetaPackOperation and FoldNopPack rules. It simultaneously generates multiple packed layout candidates within the e-graph, dynamically adjusting packing factors to eliminate redundant layout permutations. This allows the compiler to strike an optimal balance between data layout conversion and computing unit saturation.

**2) Auto Distribution:** To manage data parallelism, we adopt the SBP (Split, Broadcast, Partial-value) [58] abstraction. Operating on the principle that "nodes with consistent SBP attributes are equivalent," this module embeds the dis-

\*Email: sunnycase@live.cn

<sup>†</sup>Email: zhengqihang0915@qq.com

tributed strategy search space directly into the e-graph. It automatically balances communication costs against computational efficiency, adapting seamlessly to both multi-device clusters and single-node logical distributions.

**3) Auto Schedule:** Addressing the complexity of on-chip memory management, we introduce the nncase Tensor Template (NTT) Library. This library abstracts hardware primitives into atomic microkernels ( $\mu$ kernels). We then decompose scheduling into structural optimization (e.g., Loop Fusion) through Monte Carlo Tree Search (MCTS) [33] and parametric optimization (e.g., Tiling) through Mixed-Integer Nonlinear Programming (MINLP) [5]. This hierarchical approach achieves register-level efficiency while avoiding the prohibitive search times of exhaustive methods.

We evaluated nncase using the Qwen3 [56] model family (0.6B and 1.7B) on an AMD Ryzen 9 5900X platform in single-core and multi-core configurations. Experimental results demonstrate that nncase significantly outperforms mainstream frameworks such as MLC LLM [34] and Intel IPEX [20], while matching or surpassing the performance of the hand-optimized library llama.cpp [13].

In summary, the contributions of this paper include the following:

- We design and open-source **nncase**, a unified compilation framework that bridges the gap between uniform and non-uniform memory architectures.
- We propose an e-graph-based Auto Vectorize mechanism that resolves tensorization challenges across diverse compute units.
- We develop an Auto Distribution method that embeds distributed strategy search into the e-graph, enabling topology-agnostic parallelism.
- We introduce a hierarchical Auto Schedule system backed by the NTT Library, automating the optimization of on-chip memory reuse and instruction throughput.

## 2 Background and Motivation

### 2.1 Data Layout Optimization

Data layout optimization remains a fundamental bottleneck for deep learning compilers targeting heterogeneous architectures. Modern processors have evolved from purely scalar designs to sophisticated hybrids integrating both Vector Processing Units (e.g., AVX-512, NEON [15, 40]) and dedicated Tensor Processing Units (e.g., Intel AMX [47], Arm SME [51], Apple AMX [52]). To harness this heterogeneous compute power, industry optimization strategies generally fall into two paradigms: Kernel-Level Optimization and Graph-Level Layout Propagation.

**Kernel-Level Optimization.** This paradigm treats layout optimization as a local problem isolated to individual operators. Representative frameworks, including vendor libraries (e.g., Intel MKL [21], Arm Compute Library [4]) and traditional compilers (e.g., Halide [36], TVM [9]), focus on aligning data with hardware intrinsics. For element-wise operations, loop vectorization is employed to match SIMD lane width; for compute-intensive operators like GEMM or Convolution, Just-In-Time (JIT) packing mechanisms (pioneered by GotoBLAS [16]) are used to mitigate cache associativity conflicts. *Critique:* While locally optimal, this approach suffers from layout thrashing. By failing to exploit layout reusability across the computation graph, it incurs redundant packing and unpacking overheads at operator boundaries, significantly degrading end-to-end inference latency.

**Graph-Level Layout Propagation.** To address the redundancy of local packing, recent deep learning compilers employ global analysis to coordinate data layouts across the graph. Frameworks like TPP-MLIR [14] and ALT [55] utilize term rewriting systems to propagate preferred layouts between upstream and downstream nodes, fusing layout transformations via constant folding or dead code elimination. *Critique:* Although these methods reduce transformation overheads, they lack the granularity to navigate the Vector-Tensor trade-off in modern hybrid architectures. When Vector Units (requiring 1D layouts) and Tensor Units (requiring blocked 2D layouts) coexist, the optimal strategy is not static. Existing approaches often fail to model the complex cost trade-off between the acceleration gains of 2D tensorization and the overhead of 1D-to-2D layout permutation, leading to sub-optimal execution plans in mixed-precision or mixed-unit scenarios.

### 2.2 Distributed Strategy Search

The distributed strategy search is the cornerstone of unlocking the full potential of multi-core/multi-chip clusters. Its primary objective is to partition large-scale LLM models and map them onto hardware nodes, seeking a global optimal balance between computational efficiency, memory utilization, and communication overhead. However, existing distributed optimization frameworks face fundamental challenges in achieving this unified goal.

**Reliance on Manual Heuristics.** Frameworks like GSPMD and OneFlow [58] rely heavily on manual intervention. GSPMD necessitates user-annotated sharding for critical tensors and propagates these strategies via broadcast mechanisms. Although OneFlow supports automatic derivation, it still requires manual seeding for initial layers. These semi-automatic approaches incur high engineering costs and lack the flexibility to adapt to diverse hardware topologies without human guidance.

**Decoupled Search in Automatic Frameworks.** State-of-the-art automatic search frameworks, such as Alpa [60] and AutoDDL [8], typically employ a hierarchical optimiza-

tion strategy. They decompose the problem into separate stages—optimizing inter-operator parallelism (pipeline) and intra-operator parallelism (tensor sharding) independently. Although this reduces search complexity, it reverses the strong dependencies between computation, memory, and network. For example, a strategy optimized solely for communication minimization might violate local memory capacity constraints. Crucially, these methods lack a unified representation to jointly optimize compute, memory, and network costs. Consequently, they often settle for local sub-optima, failing to find strategies that are globally efficient across all three dimensions in resource-constrained heterogeneous environments.

### 2.3 Computation Kernel Scheduling

The performance of kernel scheduling is the decisive factor in determining the hardware utilization and throughput of compute cores. However, the inherent heterogeneity in instruction sets, on-chip memory hierarchies, and compute unit architectures across different platforms renders the "one-size-fits-all" scheduling strategy impractical. On the one hand, vendor-provided high-performance libraries (e.g., cuDNN, MKL) suffer from inflexibility, failing to cover the diverse and rapidly evolving fusion patterns of LLM operators. On the other hand, manual optimization requires profound architectural expertise and incurs prohibitive engineering costs, making it unscalable for cross-platform deployment.

To tackle these challenges, two categories of auto-scheduling approaches have been proposed. The first is *learning-based scheduling*, represented by Ansor [59] and others [2, 17, 32, 37, 64]. These methods employ random search and machine learning cost models to explore a vast space of loop optimizations (e.g., reordering, tiling, and fusion). Despite their flexibility, they suffer from excessive compilation times, which hinders the rapid deployment and iteration of LLMs. The second category is *analytical modeling-based scheduling* [27, 30, 50, 63], which determines parameters like tile sizes using static mathematical models. Although offering fast compilation, these methods are often limited to integer parameter tuning, lack support for complex operator fusion, and struggle with accurate performance estimation.

Crucially, most existing auto-schedulers treat scalar computation as the atomic scheduling granularity. This abstraction gap makes it difficult to perform fine-grained register-level optimizations—such as precise control over loop unrolling factors to avoid register spills and maximize reuse—thereby failing to match the peak performance achievable by hand-tuned assembly kernels.

### 2.4 Observations and Solutions

Motivated by the limitations of existing approaches and the deployment challenges identified above, we synthesize three

key observations and propose the corresponding solutions implemented in *nncase* (Figure 1):

**Observation 1: Layout Optimization Bottlenecks.** The primary obstacle in data layout optimization is the *phase ordering problem* and the lack of global trade-offs. Traditional term rewriting engines perform destructive updates, irreversibly converting operators to a specific layout. This greedy approach prevents the exploration of alternative layouts (e.g., attempting a 2D pack after a 1D conversion) and lacks a quantitative comparison of conversion overheads versus computation gains.

**Solution:** *nncase* leverages **equality saturation** based on e-graphs to construct a non-destructive rewriting engine. We design a *MetaPackOperation* rule that generates all candidate *pack-compute-unpack* sequences in a single pass, preserving them within the e-graph. By evaluating the theoretical latency of each candidate using a Roofline-based cost model [53] and formulating the extraction as a constraint satisfaction problem (solved through a SAT solver [1]), we achieve global layout optimization that adapts to heterogeneous compute units.

**Observation 2: Suboptimal Distributed Strategies and Memory Risks.** Current distributed strategy searches suffer from two main defects: first, a decoupled cost modeling that optimizes communication while neglecting computation costs; second, the lack of rigorous memory capacity constraints. Existing schemes often prioritize throughput, leading to strategies that may trigger Out-Of-Memory (OOM) errors on resource-constrained devices, requiring tedious manual tuning to find a feasible configuration.

**Solution:** We propose a formal description of distributed strategies based on the SBP abstraction [58]. By defining "nodes with consistent SBP attributes" as equivalent, we map the distributed search space onto the e-graph structure. This allows us to reuse the e-graph's extraction mechanism to **jointly optimize computation and communication costs**. Crucially, we formulate the strategy extraction as a **constrained optimization problem solved by a SAT solver**, where device memory capacity is enforced as a hard constraint, ensuring that the automatically discovered strategy is not only performance-optimal but also strictly adheres to hardware memory limits without user intervention.

**Observation 3: Inefficient Kernel Scheduling.** Existing auto-schedulers struggle with a granularity mismatch: they schedule at the scalar level, missing register-level optimizations, while pure learning-based or analytical methods fail to balance code quality with compilation time.

**Solution:** *nncase* introduces the **Tensor Template Library (NTT)**, which encapsulates manually optimized register-efficient microkernels ( $\mu$ kernels) as atomic scheduling units. This hides low-level hardware complexity. On top of this, we decouple scheduling into two dimensions: *structural optimization* (loop fusion/order) solved via Monte Carlo Tree Search (MCTS) [33], and *parametric optimization* (tile sizes/buffer location) solved via analytical modeling. This hybrid ap-



Figure 1: Overview of nncase.

proach ensures register-level performance while significantly reducing search time compared to traditional methods.

### 3 Nncase Compiler Architecture

Figure 1 presents the overall architecture of nncase. The compilation pipeline is designed to bridge high-level model definitions with low-level hardware primitives through five distinct phases:

- E-Graph-Based Optimization (§3.1):** The workflow begins with ingestion of the computation graph in an e-graph structure (❶). This structure serves as the foundation for two critical tasks: determining the optimal data layout by term rewriting (❷, §3.1.2) and automatically searching for optimal distributed strategies by exploring splitting candidates and communication costs (❸, §3.1.3).
- Auto-Scheduling (§3.2):** For single-core subgraphs (❹), we convert the target subgraph into a tiered tile graph. We then employ a hybrid approach: Monte Carlo Tree Search (MCTS) optimizes the loop structure (§3.2.1), while a Mixed-Integer Nonlinear Programming (MINLP) solver optimizes tiling parameters based on a multi-memory hierarchy analytical model (§??).
- Code Generation (§3.3):** In the final phase (❺), the compiler performs platform-specific memory scheduling and instantiates the runtime environment. The executable is generated by linking register-level high-performance micro-kernels ( $\mu$ kernels) provided by the NTT Library. This phase supports separate compilation flows to ensure maximum adaptation to specific hardware constraints.

| Rewrite Rule Name       | Signature                                                                                |
|-------------------------|------------------------------------------------------------------------------------------|
| CombineBinaryLeftTrans  | $Binary(T_{perm}(A), B) \rightarrow T_{perm}(Binary(A, T_{perm^{-1}}(B)))$               |
| CombineBinaryRightTrans | $Binary(A, T_{perm}(B)) \rightarrow T_{perm}(Binary(T_{perm^{-1}}(A), B))$               |
| CombineUnaryTrans       | $Unary(T_{perm}(A)) \rightarrow T_{perm}(Unary(A))$                                      |
| FoldTwoTrans            | $T_{perm_2}(T_{perm_1}(A)) \rightarrow T_{perm_1}[perm_2[i]](A), 0 \leq i < Len(perm_2)$ |
| FoldNopTrans            | $T_{[0,1,\dots,n]}(A) \rightarrow A$                                                     |

Table 1: Rewrite rules about transpose optimization.

### 3.1 E-Graph Based Optimization

#### 3.1.1 Equality Saturation

Term rewriting is a fundamental paradigm in program optimization, relying on the repeated application of simplification rules [12]. However, traditional sequential rewriting strategies suffer from the *phase ordering problem* [6, 26]: the order in which rewrite rules are applied can destructively modify the graph, locking the optimization into a local optimum.

Figure 2 illustrates a typical example of this problem in deep learning compilers. Consider the computation graph in Figure 2(a), where the objective is to eliminate redundant transpose operations using the rules defined in Table 1. The optimal strategy requires global reasoning about where to push the transpose operators.

- Suboptimal Path (Figure 2(c)):** If the compiler greedily applies *CombineBinaryRightTrans* first, it isolates one transpose operation, preventing further fusion, and leaving a redundant operator in the final graph.
- Optimal Path (Figure 2(b)):** Conversely, applying *CombineBinaryLeftTrans* first, followed by *CombineUnaryTrans*, enables the subsequent application of *FoldTwoTrans* and *FoldNopTrans*. This sequence successfully eliminates all redundant transposes.



Figure 2: A comparison of traditional term rewriting and e-graph based rewriting.

This ambiguity highlights the limitation of greedy rewriting: it is often impossible to pre-determine the optimal rule application order without exploring the entire state space.

To resolve this, we adopt Equality Saturation [41, 42], a non-destructive approach that explores the design space by embedding equality information into an e-graph structure. Instead of overwriting the IR, equality saturation applies all applicable rules simultaneously, growing the e-graph until it reaches a saturated state containing all equivalent versions of the program.

As depicted in Figure 2(d), the e-graph consists of *e-classes* (equivalence classes, shown as dashed boxes) and *e-nodes* (computation nodes). An e-class represents a set of equivalent e-nodes, and each e-node references child e-classes rather than specific nodes. This compact structure allows efficient storage of exponential versions of the program.

Finally, extracting the optimal program from the saturated e-graph (Figure 2(e)) is non-trivial. We formulate this extraction as a Weighted Partial MaxSAT (WPMAXSAT) problem [19]. We utilize a Roofline-based cost model [53]—incorporating metrics such as memory traffic and compute cycles—to assign weights to e-nodes. The solver then selects the subgraph with the minimal total cost, resulting in the fully optimized computation graph shown in Figure 2(f).

### 3.1.2 Auto Vectorize

**Packing** (or Layout Transformation) is a fundamental prerequisite for utilizing heterogeneous computing units. It re-

organizes tensors into hardware-intrinsic layouts—such as blocked formats for Tensor Cores or aligned arrays for Vector Units—to maximize spatial locality and instruction throughput. However, in heterogeneous graphs, distinct units often impose conflicting layout constraints. A naive compilation strategy typically results in redundant data movement, inserting costly *Pack/Unpack* operations between operators with mismatched layout requirements.

To address this, we introduce an e-graph-based vectorization pass driven by two rewrite rules (see Table 2):

1. **MetaPackOperation (Exploration):** Generates candidate packed subgraphs. For a given logical operator, it produces multiple hardware-specific variants (e.g., Blocked layouts for Tensor Cores vs. Flat layouts for Vector Units) wrapped in *Pack/Unpack* nodes.
2. **FoldNopPack (Optimization):** Eliminates redundant layout transformations by fusing inverse operations ( $\text{Pack}(\text{Unpack}(x)) \rightarrow x$ ).

We demonstrate this mechanism using an Attention-like subgraph:  $O = \text{MatMul}(\text{Exp}(\text{MatMul}(Q, K)), V)$ , as shown in Figure 3. This structure represents a typical “Compute-Memory-Compute” pattern where layout conflicts frequently occur:

1. **Layout Conflict:** The first  $\text{MatMul}(Q, K)$  prefers a blocked layout (e.g.  $[M', N']\langle 16, 16 \rangle$ ) for Tensor Core acceleration. The subsequent element-wise  $\text{Exp}$  typi-

Table 2: Rewrite rules for vectorization and layout optimization.

| Rule Name         | Signature                                                                                                              |
|-------------------|------------------------------------------------------------------------------------------------------------------------|
| MetaPackOperation | $Op(\dots) \rightarrow Unpack(PackedOp(Pack(arg_1, lanes_1, axes_1), \dots, Pack(arg_n, lanes_n, axes_n)), axes^{-1})$ |
| FoldNopPack       | $Pack(Unpack(arg, \dots), \dots) \rightarrow arg$                                                                      |

cally expects a flat layout for standard Vector Units. The final  $MatMul(\dots, V)$  again requires a blocked layout.

2. **Space Exploration:** *MetaPackOperation* expands the *Exp* operator. Crucially, it generates a variant that operates directly on the blocked layout (treated the  $16 \times 16$  block as a contiguous vector of length 256).
3. **Layout Propagation:** Since the output of the packed *MatMul* matches the input of this blocked *Exp*, and the output of *Exp* matches the input of the final *MatMul*, the e-graph connects these nodes.
4. **Redundancy Elimination:** The *FoldNopPack* rule identifies and removes the intermediate *Unpack* (after the first *MatMul*) and *Pack* (before the final *MatMul*).

Mathematically, the optimized data flow avoids restoring the logical layout  $[M, N]$ :

$$\begin{aligned}
 \text{Step 1: } & T_1[M', N'] \langle 16, 16 \rangle = \text{MatMul}_{\text{tensor}}(Q_{\text{packed}}, K_{\text{packed}}) \\
 \text{Step 2: } & T_2[M', N'] \langle 16, 16 \rangle = \text{Exp}_{\text{vec}}(T_1) \\
 & \quad // \text{Direct op on blocks} \\
 \text{Step 3: } & O[M', D'] \langle 16, 16 \rangle = \text{MatMul}_{\text{tensor}}(T_2, V_{\text{packed}})
 \end{aligned} \tag{1}$$

By enabling this “pass-through” layout, the compiler extracts a graph where data remain in the optimal hardware format throughout the entire chain, significantly reducing memory access overhead.



Figure 3: e-graph based auto vectorize for the Attention-like subgraph.



Figure 4: Illustration of SBP abstraction and device placement.

### 3.1.3 Auto Distribution

To facilitate adaptation to distributed architecture, we define the distributed strategy search space using the SBP (Split, Broadcast, Partial) abstraction natively supported by OneFlow [58]. This section details the fundamentals of SBP abstraction and the search space construction logic (§3.1.3), followed by the methodology for the construction of the search space based on e-graph and the extraction of optimal strategies (§3.1.3).

**SBP Abstraction and Search Space Definition** The SBP abstraction accurately characterizes the distributed states of tensors across a device cluster. It describes how a  $m$ -dimensional tensor is distributed across an  $n$ -level logical topology using three primitives: *Split*, *Broadcast*, and *Partial* (collectively called SBP), as illustrated in Figure 4. In addition to SBP, *Placement* defines the logical topology of the device cluster. By mapping physical devices to a multi-dimensional array, the array’s axes represent different topological levels. For example, Figure 4(b) depicts a nested topology comprising two groups of devices, each containing two devices.

To formalize the distributed state, the concept of N-Dimensional (ND) SBP is introduced to characterize the tensor distribution in a  $N$ -dimensional *placement*. Specifically, ND SBP is defined as a vector of SBP operations, denoted as  $\mathcal{P}_{\text{nd}} = [\text{sbp}_0, \text{sbp}_1, \dots, \text{sbp}_{N-1}]$ . As shown in Figure 4(a), the strategy for a  $[2, 2]$  tensor in a 1D *placement* is defined by a 1D SBP:

- **Split (S):** Partitions the tensor along a specified axis.
- **Broadcast (B):** Replicates the entire tensor to all devices.
- **Partial (P):** Stores partial values (e.g., partial sums) on each device, requiring a reduction operation to recover the full data.

Crucially, SBP operations across different topological dimensions are orthogonal. As shown in Figure 4(b), a 2D SBP characterizes the strategy under a 2D topology, a paradigm extensible to higher dimensions. Thus, SBP abstraction offers a generalized capability to describe the tensor distribution under arbitrary valid topologies.

Although SBP effectively describes the state of individual tensors, the data flow in a computation graph imposes strict constraints on SBP configurations. The distributed attributes of upstream and downstream tensors cannot be arbitrarily assigned; they must strictly adhere to the operator’s semantic requirements. Consider an element-wise binary addition  $C = A + B$ , where the inputs  $A, B$  have shape [4, 4]. On a two-device cluster, the validity of the distributed execution depends on the consistency of the operands’ SBP states:

$$\begin{aligned} \text{Valid: } & A_{\mathcal{S}(0)} + B_{\mathcal{S}(0)} \rightarrow C_{\mathcal{S}(0)} \\ \text{Invalid: } & A_{\mathcal{S}(0)} + B_{\mathcal{S}(1)} \rightarrow \perp \end{aligned} \quad (2)$$

As demonstrated in Eq. 2, combining a tensor split along the axis 0 ( $\mathcal{S}(0)$ ) with a split along the axis 1 ( $\mathcal{S}(1)$ ) for elemental addition is mathematically undefined without redistribution. We define the set of all valid input-output SBP combinations for an operator as its **SBP Signature**. The global search space for distributed strategies is thus constructed from the composition of these legal SBP Signatures across the entire network.

**Search Space Construction and Extraction** We construct the e-graph based on the principle that nodes sharing identical computation logic and SBP attributes are equivalent. While standard E-Graph rewriting rules could theoretically explore the space, adding distributed equivalents one by one incurs prohibitive compilation overhead. To address this, we propose a dedicated algorithm to directly construct a distributed E-Graph from the logical computation graph.

For a logical operator  $Op(A)$ , its distributed execution is formalized as  $Box(Op(Box(A, \mathcal{P})))$ , where  $Box$  serves as a unified communication primitive. Boxing handles all necessary data movements—such as splitting, broadcasting, aggregation, and reshaping—to satisfy SBP requirements. To manage the mapping between the original logical nodes and the distributed E-Nodes, we introduce the concept of an **E-Cluster**. An E-Cluster is a dictionary that maps a logical node to a set of E-Classes, grouped by their SBP signatures.

The construction process, detailed in Figure 5, proceeds in three distinct phases:

```

1  def BuildEGraph(G):
2      eg, memo = EGraph(), {}
3
4      # 1. Inputs: Create initial Shard Boxing
5      for v in G.Inputs:
6          # Generate base candidates
7          nodes = [Box(v, s) for s in GetSBPs(v)]
8          memo[v] = {}
9          # GroupBySBP returns {sbp: [nodes]}
10         for s, ns in GroupBySBP(nodes).items():
11             memo[v][s] = eg.Add(ns)
12
13     # 2. Compute: Topo-sort & Expansion
14     for v in TopoSort(G):
15         if v in G.Inputs: continue
16
17         # Expand: Reuse + Resharding
18         in_grps = []
19         for inp in v.Inputs:
20             # Reuse existing E-Classes
21             cands = list(memo[inp].values())
22             # Add new Resharding strategies
23             cands += GenReshard(inp)
24             in_grps.append(cands)
25
26         # Cartesian Product & Check
27         nodes = []
28         for args in Product(*in_grps):
29             if IsValid(v, args):
30                 nodes.append(CreateOp(v, args))
31
32         # Add to E-Graph & Union equivalents
33         memo[v] = {}
34         for s, ns in GroupBySBP(nodes).items():
35             ids = [eg.Add(n) for n in ns]
36             memo[v][s] = eg.Union(*ids)
37
38     # 3. Outputs: Aggregate to Host
39     for v in G.Outputs:
40         # Unshare to Host
41         nodes = [Box(n, Host) for n in memo[v].values()]
42         ids = [eg.Add(n) for n in nodes]
43         eg.Union(*ids)
44
45     return eg

```

Figure 5: Pseudo-code for the distributed strategy search space construction.

1. **Input Phase (Lines 5-11):** For each graph input, we generate all feasible distributed strategy candidates. These are instantiated as Boxing nodes and registered in the e-graph. The resulting e-classes are stored in the node’s e-cluster.
2. **Compute Phase (Lines 14-36):** This is the core of the search space expansion. Reusing simply the SBP states of the upstream nodes often leads to a shrinking search space. To mitigate this, we explicitly insert *Resharding Boxing* operations (Line 23). We generate input combinations via a Cartesian product of the inputs’ e-classes. Valid combinations are instantiated as compute nodes. Crucially, nodes resulting in the same output SBP are merged into a single e-class using the *Union* operation (Line 36), ensuring that the graph remains compact.
3. **Output Phase (Lines 39-43):** Finally, to ensure that



Figure 6: Visualization of the e-graph for distributed strategy search.

results can be retrieved by the host, we append *Unshare Boxing* operations to output nodes, unifying all distributed branches into a global result.

**Figure 6(b)** visualizes this structure. Dashed boxes represent e-classes (equivalence classes of same SBP), while solid boxes represent e-clusters (collections of e-classes for one logical node). For example, the *MatMul* node's e-cluster contains multiple e-classes derived from different valid SBP signatures. The search space is fully connected via *Boxing* nodes, allowing the extractor to find the optimal path.

**Cost Model and Constraints** To guide extraction, we incorporate a communication cost model based on the Roofline model and the Alpha-Beta model (latency-bandwidth) used in MPI implementations [43]. The simulated communication time serves as the edge cost in the e-graph. Furthermore, to prevent the heuristic from favoring "Broadcast-heavy" strategies that minimize communication but explode memory usage, we enforce a memory constraint: a strategy is valid only if the sum of memory requirements across all devices does not exceed the cluster's capacity.

### 3.2 Auto Scheduling

We adopt tiles as the fundamental scheduling unit to construct high-performance computation kernels using stacking and nesting combinations [27, 38, 49, 62]. As illustrated in **Figure 7**, we decouple the design space into two orthogonal dimensions: the **Structural Part** and the **Parametric Part**.

The **Structural Part** (y-axis) dictates the macroscopic organization of the kernel, including loop ordering, nesting levels, and parallelism. Each design point in this dimension corresponds to a unique *Tiered Tile Graph*—a hierarchical data structure composed of nested Tile Graphs representing different memory levels. This structure offers two key advantages:

- **Dependency Representation:** It naturally characterizes computational dependencies derived from the original computation graph. The outer loops are mapped to the



Figure 7: The design space of a computation kernel, decoupled into Structural and Parametric dimensions.

root nodes, while the inner loops are mapped to the child nodes, clearly depicting the hierarchy suitable for complex branching and merging scenarios.

- **Fusion Logic:** The nesting structure intuitively maps to operator fusion. As shown in the green dashed box in **Figure 7** (left), the Tile Graph for  $Op_2$  contains two  $L1$ -level subgraphs. This indicates that  $Op_2$  and  $Op_1$  are fused at the  $L2$  memory level. Specifically,  $Op_2$  depends on  $Op_1$ , and intermediate results are transmitted exclusively within the  $L2$  and inner memory levels, avoiding redundant off-chip memory access.

We attach polyhedral information to each Tile Graph to define the iteration domain mapping between the current node and its parent. For example, the affine map shown in the upper left of **Figure 7**,  $R_{Op_1}^{Op_2^2} : ((d0,t0), (d1,t1), (d2,t2)) \rightarrow ((d0,t0), (d1,t1))$ , implies that the  $L1$ -level loops of  $Op_1$  are sliced from the first two dimensions of the parent  $Op_2$ .

The **Parametric Part** (x-axis) focuses on hardware-specific configurations, primarily *Tile Size* and *Buffer Placement*. These parameters directly impact data movement volume, cache utilization, and instruction efficiency.

- **Tile Size:** Determines the granularity of data processing. Excessively large tiles cause cache thrashing due to capacity evictions, while overly small tiles lead to poor bandwidth utilization and cache fragmentation.
- **Buffer Placement:** Determining where the intermediate data resides. For a fixed tile size, allocating buffers

in lower-level memory (when capacity allows) enables high-speed data reuse.

**Figure 7** (bottom) demonstrates the impact of these parameters. For the same structural graph, the red box configuration  $([1, 1, 1])$  suffers from poor locality. In contrast, the green box configuration  $([2, 32, 8])$  processes larger data blocks and allocates  $A1$  and  $B1$  buffers at the  $L2$  level for  $L1$  caching, significantly improving reuse efficiency.

**Hybrid Search Strategy.** Based on the orthogonality of these two dimensions, we propose a bi-level hybrid search framework to navigate this vast design space efficiently. As detailed in the following subsections, we employ **Monte Carlo Tree Search (MCTS)** as the outer loop to explore the discrete structural space (§3.2.1), determining the optimal loop order and fusion strategy. For each candidate structure visited by MCTS, we utilize **Mixed-Integer Nonlinear Programming (MINLP)** as the inner loop evaluation step (§3.2.2). This analytical solver rapidly identifies optimal tile sizes and buffer placements, providing a high-quality performance upper bound to guide the MCTS process.

### 3.2.1 MCTS-based Structural Search

The determination of the Structural Part involves defining the nested associations and loop execution orders within the Tiered Tile Graph. For a subgraph with  $n$  operators, the combinatorial space of potential nested structures and loop permutations grows exponentially. We formulate this as a **sequential decision-making problem**: at each step, a transformation (e.g., fusion or reordering) is applied to the current graph state. These decisions are highly correlated—early structural choices constrain future optimization opportunities—making the problem analogous to a complex game tree search.

To navigate this high-dimensional and non-convex design space efficiently, we employ the **Monte Carlo Tree Search (MCTS)** algorithm. MCTS balances exploration and exploitation by constructing a search tree where nodes represent Tiered Tile Graph states, and edges represent transformation actions. Unlike greedy approaches, MCTS can look ahead to evaluate how current structural changes impact final performance.

**Formal Representation** We adopt the tile-centric notation [62] to formally characterize the state of a Tiered Tile Graph. A node at the memory level  $n$  is defined recursively as:

$$Op^n = \{l_1^n, l_2^n, \dots\} (Op_a^{n-1}, Op_b^{n-1}, \dots) \quad (3)$$

where  $\{l_i^n\}$  denotes the loop set at level  $n$ , and  $(Op_a^{n-1}, \dots)$  denotes the topologically sorted list of child sub-graphs. For

instance, the Tiered Tile Graph in **Figure 7** is represented as:

$$\text{Level 0: } Op_0^0 = \{\} (\text{MatMul}), \quad Op_1^0 = \{\} (\text{Exp}),$$

$$Op_2^0 = \{\} (\text{MatMul})$$

$$\text{Level 1: } Op_0^1 = \{i^1, k^1, l^1\} (Op_0^0), \quad Op_1^1 = \{i^1, l^1\} (Op_1^0),$$

$$Op_2^1 = \{i^1, l^1, j^1\} (Op_2^0)$$

$$\text{Level 2: } Op_0^2 = \{i^2, k^2, l^2\} (Op_0^1),$$

$$Op_2^2 = \{i^2, l^2, j^2\} (Op_1^1, Op_2^1)$$

**Search Mechanics** The MCTS process iterates through four phases: *Selection*, *Expansion*, *Simulation*, and *Backpropagation*. The root node is initialized as the raw computation subgraph. During the *Expansion* phase, we define two atomic actions to traverse the structural space:

- **merge (src, dst, level)**: Implements *operator fusion*. It migrates all subgraphs from the source node (*src*) to the destination node (*dst*) at a specified memory level, merging their loop scopes. Taking the operation  $\text{merge}(1, 2, 2)$  in the example above, the state transition at Level 2 is as follows:

$$Op_1^2 = \{i^2, l^2\} (Op_1^1), \quad Op_2^2 = \{i^2, l^2, j^2\} (Op_2^1) \\ \xrightarrow{\text{merge}(1, 2, 2)} \quad Op_2^2 = \{i^2, l^2, j^2\} (Op_1^1, Op_2^1)$$

This action effectively fuses  $Op_1$  and  $Op_2$  at Level 2, as  $Op_1^1$  becomes a child of  $Op_2^2$ .

- **reorder (i, n, loops)**: Implements *loop permutation*. It reconfigures the execution order of the loop set for a specific tile  $Op_i^n$ .

**Analytical Simulation** A critical divergence from standard MCTS lies in the *Simulation* phase. Instead of random rollouts, we employ a deterministic solver, specifically the *Parametric Optimization via MINLP* module (detailed in §3.2.2), as the evaluator. For a given structural state (leaf node), this model solves for the optimal *Parametric Part* (e.g., tile sizes) and returns the estimated latency as the reward. This hybrid approach—using MCTS for structural search and an analytical solver for parametric tuning—significantly reduces the search space and ensures high-quality evaluation signals without expensive hardware measurements.

### 3.2.2 Parametric Optimization

Once the Structural Part is determined, searching for the optimal parameter configuration is critical for execution performance. Performance is governed by complex interactions between factors such as cache capacity, buffer size, and tile size. Relying on empirical tuning is inefficient and unreliable. To systematically address this, we adopt an analytical modeling approach using **mixed integer nonlinear programming (MINLP)**.

MINLP is selected for two reasons: (1) it naturally handles integer variables (e.g., tile sizes) and nonlinear relationships inherent in hardware execution; (2) it supports complex constraints (e.g., memory limits) and multi-objective optimization. Compared to other methods, MINLP offers a flexible framework to accurately capture the "variable abstraction - static analysis - constraint definition - objective optimization" workflow.

**Variable Definition.** The variables represent the tuning parameters and are the primary subjects of the solution process.

- (1) **Tile Size:** Each loop within a tiled code segment has an independent tile size. We define TileVar as a continuous integer variable:

$$\text{TileVar}_{l,op,i} \in \mathbb{Z}^+ \quad (4)$$

where  $l$  denotes the memory level,  $op$  the operator index, and  $i$  the loop index.

- (2) **Buffer Placement:** We decouple buffer creation from storage to enable data reuse in faster caches. A boolean variable Place characterizes this:

$$\text{Place}_{cl,op,b,e,sl} \in \{0, 1\}, \quad 0 \leq sl \leq cl \quad (5)$$

Here,  $cl$  is the creation level,  $b$  is the buffer ID,  $e$  is the placement entry (position relative to loops), and  $sl$  is the actual storage level. We restrict storage to levels at or below the creation level.

**Static Analysis.** Performance metrics are derived by traversing the Tiered Tile Graph based on the defined variables.

- (1) **Backward Extent:** While the Tile Graph records polyhedral information, the actual iteration domain depends on tile sizes. We define Extent recursively:

$$\begin{aligned} \text{Extent}_{l,op,e,i} &= \text{Extent}_{l-1,op,0,i} \times \delta \\ \delta &= \begin{cases} \text{TileVar}_{l,op,i}, & \text{if } i \geq e \wedge \mathcal{R}(\text{loop}^l, \text{loop}^{l-1}) \\ 1, & \text{otherwise} \end{cases} \end{aligned} \quad (6)$$

where  $\mathcal{R}$  determines if loops at adjacent levels map to the same iteration variable.

- (2) **Buffer Size:** The size of buffer  $b$  is determined by its access relation  $\mathcal{A}_{op}^b$  applied to the extent:

$$\text{Size}_{l,op,b,e} = \prod \mathcal{A}_{op}^b(\text{Extent}_{l,op,b,e}) \quad (7)$$

- (3) **Trip Count:** The total number of iterations is the product of tile sizes from the current level down to the bottom:

$$\text{Trip}_{l,op,e} = \begin{cases} \text{Trip}_{l+1,op,last} & e = 0 \\ \text{Trip}_{l,op,e+1} \times \text{TileVar}_{l,op,e} & e > 0 \end{cases} \quad (8)$$

- (4) **Data Traffic:** We categorize buffers into read-only ( $\mathcal{B}_{read}$ ) and read-write ( $\mathcal{B}_{write}$ ) sets. The total data volume at memory level  $l$  is:

$$\text{Writes}_l = \sum_{cl,op,e} \sum_{b \in \mathcal{B}_{write}} \Phi(cl, op, b, e, l)$$

$$\text{Reads}_l = \sum_{cl,op,e} \sum_{b \in \mathcal{B}_{read}} \Phi(cl, op, b, e, l)$$

$$\text{where } \Phi = \text{Place}_{cl,op,b,e,l} \times \text{Size}_{l,op,b,e} \times \text{Trip}_{l,op,e} \quad (9)$$

This model approximates data updates by multiplying loop counts with buffer sizes, avoiding the complexity of partial update analysis.

**Constraints.** Constraints ensure the legality of the solution.

- (1) **Domain Bounds:** The tiled loops must cover the original iteration domain. We enforce this by checking the shape at the top level ( $L_{top}$ ):

$$\mathcal{A}_{op}^b(\text{Extent}_{L_{top},op,b,0}) = \text{Shape}_{op}^b \quad (10)$$

- (2) **Buffer Placement:** (a) I/O buffers must reside in the outermost memory:

$$\text{Place}_{L_{top},op,b,0,L_{top}} = 1 \quad (11)$$

- (b) The innermost cache (level 0) must store the active buffer for computation:

$$\sum_{cl,op,e} \text{Place}_{cl,op,b,e,0} = 1 \quad (12)$$

- (c) *Unique Storage:* A buffer is stored exactly once per valid level. (d) *Fusion:* Fused operators must reuse buffers at the fusion level ( $L_{fuse}$ ). For a reused buffer  $b$ :

$$\sum_{sl=0}^{L_{fuse}} \sum_e \text{Place}_{L_{fuse},op,b,e,sl} = 1 \quad (13)$$

Child nodes sharing this buffer must respect its dynamic storage level.

- (3) **Memory Capacity:** Total buffer usage at any storage level  $sl$  must not exceed hardware capacity  $C_{sl}$ :

$$\sum_{cl,op,b,e} \text{Place}_{cl,op,b,e,sl} \times \text{Size}_{cl,op,b,e} \leq C_{sl} \quad (14)$$

**Objective Function.** The goal is to minimize total execution time, modeled as the maximum of memory and computation time (assuming parallelism).

The computation time ( $T_{comp}$ ) is derived from a linear regression model ( $\mu\text{KernelTime}$ ) of the micro-kernel, scaled

by iteration counts. Memory time ( $T_{mem}$ ) is the data traffic divided by bandwidth ( $BW_l$ ).

$$T_{comp} = \sum_{op,b} \mu KT(op, \dots) \times \text{Trip}_{0,op,last}$$

$$T_{mem} = \sum_l \frac{\text{Writes}_l + \text{Reads}_l}{BW_l} \quad (15)$$

The final objective function solved by OR-Tools is:

$$\text{minimize } \max(T_{mem}, T_{comp}) \quad (16)$$

### 3.3 Code Generation

The code generation phase serves as the pivotal bridge between the front-end graph optimizations and the final executable binary. Its primary objective is to translate the optimized computation graph into efficient, hardware-aware C++ source code. As illustrated in Figure 8, the generation process relies on two critical components to maximize runtime performance:

- **Buffer Scheduling:** This module addresses memory hierarchy constraints by optimizing allocation strategies and maximizing buffer reuse.
- **Nncase Tensor Template Library (NTT):** This component facilitates the instantiation of the kernel template, leveraging high-performance register-level pkernels to inject underlying hardware support into the generated code.

The following subsections detail the specific implementations of these two mechanisms.

#### 3.3.1 Buffer Schedule

The core objective of Buffer Schedule is to minimize runtime memory redundancy and access overhead. This is achieved through a two-stage process: **Bufferization** (logical-to-physical mapping) and **Memory Planning** (address allocation).

**Bufferization and Alias Analysis** Before allocation, the compiler transforms the logical tensors into physical buffer constraints. A key optimization in this stage is **Alias Analysis**. The compiler identifies operators with view semantics, such as Reshape, Slice, and Squeeze, where the output tensor is merely a different view of the input data. Instead of allocating new storage, these outputs are marked as aliases, sharing the underlying physical memory of their inputs. This strategy enforces *zero-copy* execution for shape transformations, significantly reducing memory footprint.



Figure 8: Generated C++ code utilizing NTT Library primitives.

**Memory Planning** Once physical buffers are defined, the system assigns optimal memory addresses:

- **Intermediate Variables:** By traversing the computation graph to perform liveness analysis, the system obtains the lifecycle and size of each non-aliased buffer. The allocation is modeled as a classic Bin Packing problem [24, 25, 31]. An SAT solver is utilized to find an optimal arrangement, maximizing memory reuse by overlapping buffers that are not live simultaneously.
- **Constants:** For static weights, the system reuses the SBP (Split-Broadcast-Partial) attributes determined by the Auto Distribution module. Constants are pre-split and pinned to the dedicated local storage of each computing node. This pre-allocation strategy ensures high-speed local access and eliminates redundant data transmission in distributed scenarios.

#### 3.3.2 Nncase Tensor Template Library

The nncase Tensor Template Library (NTT) serves as the foundational infrastructure for generated code, implemented as a header-only **C++20 template library**. Unlike traditional runtime operator libraries, NTT leverages modern C++ Metaprogramming (TMP) and Concepts to minimize runtime overhead through **Zero-cost Abstraction**.

**Hybrid Shape and Stride System** A distinguishing feature of NTT is its hybrid management of tensor metadata. As demonstrated by the usage of `ntt::fixed_shape_v` and `ntt::make_shape`, the library allows mixing static compile-time dimensions with dynamic runtime dimensions.

- **Static Inference:** For dimensions known at compile-time (e.g., model weights or fixed-size buffers), shape inference and stride calculations are performed entirely during the C++ compilation phase. This allows the compiler to resolve the offset calculations into immediate values, eliminating runtime arithmetic instructions.
- **Dynamic Compatibility:** For variable dimensions, the library seamlessly falls back to runtime computation, ensuring flexibility without compromising the performance of static parts.

**Micro-kernels and Vectorization** NTT provides a suite of architecture-aware micro-kernels tailored for core operators (e.g., MatMul, Conv2D). By exposing explicit vector types (e.g., `ntt::vector<float, 8>`) and memory layout primitives (e.g., pack/unpack), NTT enables the Codegen module to precisely control data reuse patterns and SIMD instruction generation. This design effectively avoids cache invalidation and guides the backend compiler (GCC/Clang) to generate optimal register spill/fill sequences.

**Distributed Extension (NTTD)** To support multi-device execution, the Distributed Tensor Template Library (NTTD) extends NTT with topology-aware abstractions.

- **Type-Based Sharding:** Distributed attributes such as Mesh topology and SBP (Split-Broadcast-Partial) policies are encoded directly into C++ template arguments (e.g., `shard_policy::S<2>`). This ensures that communication primitives are statically bound, avoiding runtime policy dispatch overhead.
- **Seamless Integration:** NTTD acts as the executable backend for the Auto Distribution module. The abstract tensor splitting strategies are mapped directly to NTTD’s template configurations, enabling efficient overlapping of computation and communication on specific hardware topologies (e.g., NUMA, Chiplet clusters).

## 4 Evaluation

This section evaluates the end-to-end inference performance of Large Language Models (LLMs) on a representative CPU platform using *nncase*. The experiments aim to verify its advantages over mainstream compilation frameworks and validate the effectiveness of the proposed “unified distributed compilation for heterogeneous storage architecture” design.

We employ the Qwen3 [56] series models, specifically Qwen3-0.6B (supporting the precision F32/F16) and Qwen3-1.7B (the precision F16). The evaluation was conducted on an AMD Ryzen 9 5900X 12-core processor equipped with 128GB (4 × 32GB) DDR4-3600 memory, running Ubuntu

24.04 LTS (GCC 14.2). To ensure a fair comparison, all frameworks were configured with default optimizations and utilized the highest supported instruction set (AVX2).

We compare *nncase*<sup>1</sup> against three state-of-the-art frameworks: MLC LLM [34]<sup>2</sup>, *llama.cpp* [13]<sup>3</sup>, and Intel IPEX [20]<sup>4</sup>. Performance was evaluated under varying CPU thread configurations (1T, 4T, 8T) to assess scalability in both single-core and multi-core scenarios. Following industry standards, we set the batch size to 1 with an 8-token prompt. The primary metric is *token throughput* (tokens/s), calculated based on the total duration of the decoding stage. Each test was repeated 100 times to report average values. Figure 9 and Figure 10 illustrate the performance comparisons in different precisions and scales. Detailed analyzes are provided in the following subsections.

### 4.1 Single-Core Scenario

This subsection evaluates the fundamental code generation quality of *nncase* using Qwen3-0.6B (F32/F16) and Qwen3-1.7B (F16) in a single-thread (1T) configuration. Although *nncase* employs a “unified distributed compilation” architecture, it treats a single core as the atomic unit of its distributed abstraction. Consequently, the performance in this scenario directly reflects the efficacy of the **Auto-Vectorization** and **Auto-Scheduling** modules, where the upper bound is determined by the compiler’s ability to maximize on-chip memory reuse and compute unit utilization (e.g., AVX2 saturation).

As illustrated in Figure 9, the results exhibit a consistent performance hierarchy across different model scales and precisions: *llama.cpp* > *nncase* > *Intel IPEX* ≫ *MLC LLM*. *llama.cpp* maintains the lead due to its extensive hand-written kernels and manual scheduling, serving as the theoretical performance ceiling. However, among the end-to-end compilation frameworks, *nncase* demonstrates superior efficiency.

#### Quantitative Analysis:

- **Qwen3-0.6B (F32):** *nncase* achieves 8.7 tokens/s. Although this trails the hand-optimized *llama.cpp* (10.61 tokens/s) by approximately 18%, it notably outperforms *Intel IPEX* (7.58 tokens/s) by **15%**.
- **Qwen3-0.6B (F16):** Reducing precision significantly increases throughput. *nncase* reaches 13.87 tokens/s, a **59%** improvement over its F32 performance. This narrows the gap with *llama.cpp* (17.21 tokens/s) and extends the lead over *Intel IPEX* (10.22 tokens/s).
- **Qwen3-1.7B (F16):** As the model scale increases, the pressure of the memory bandwidth intensifies. *nncase* maintains its ranking with 5.09 tokens/s, which is 19% lower than *llama.cpp*, but remains **21% higher** than

<sup>1</sup>Commit: fbe2f7ba9

<sup>2</sup>Commit: 862a7311

<sup>3</sup>Tag: b5753

<sup>4</sup>Version: v2.8.0



Figure 9: Comparison of LLM token generation throughput among different frameworks in the single-core scenario (1T). *llama.cpp* represents the hand-optimized baseline, while others represent compiler-based approaches.



Figure 10: Comparison of LLM token generation throughput in multi-core scenarios (4T/8T). *nncase* demonstrates superior scalability, surpassing hand-optimized libraries.

*Intel IPEX*. *MLC LLM* struggles in this configuration, achieving only 0.2 tokens/s.

These results validate that while there is still a marginal gap compared to heavily manually optimized libraries, *nncase*' automated optimization pipeline generates significantly more efficient code than other state-of-the-art compilers (IPEX and MLC) in a strictly constrained single-core environment.

## 4.2 Multi-Core Scenario

This subsection evaluates the efficacy of the *Auto Distribution* module by scaling the workload to multi-core configurations (4T and 8T). Unlike traditional shared-memory parallelism, *nncase* treats multiple cores as “multi-nodes,” applying distributed partition strategies to minimize synchronization overhead.

As shown in Figure 10, *nncase* exhibits the best scalability and overall performance, successfully overtaking the hand-optimized *llama.cpp* in multi-core scenarios.

### Performance Analysis:

- **Scalability & Throughput (Qwen3-0.6B-F16):** In the 4T configuration, *nncase* achieves 23.5 tokens/s, slightly outperforming *llama.cpp* (23.2 tokens/s) and significantly leading *Intel IPEX* (15.52 tokens/s). As the core

count increases to 8T, the performance of all frameworks hits the memory bandwidth wall. Nevertheless, *nncase* maintains the lead with 23.98 tokens/s.

- **Scaling Efficiency (Qwen3-1.7B-F16):** The advantage of *nncase* becomes more pronounced with larger models. Comparing 4T with the 1T baseline (Sec. 4.1), *nncase* achieves a **performance gain** of 74%, while *llama.cpp* only gains 32%. This allows *nncase* (8.85 tokens/s) to surpass *llama.cpp* (8.34 tokens/s) and *Intel IPEX* (6.93 tokens/s).

**Architectural Discussion:** The superior multi-core performance of *nncase* stems from its **Unified Distributed Compilation** paradigm.

1. **Static vs. Dynamic Scheduling:** Competitors like *llama.cpp* and *Intel IPEX* rely on OpenMP [11] for thread-level parallelism. This approach introduces runtime overhead due to dynamic task scheduling and frequent fork-join synchronization barriers. In contrast, *nncase*' *Auto Distribution* module performs static task partitioning and core mapping at compile time. This eliminates runtime scheduling overhead and ensures deterministic communication patterns.
2. **Granularity Optimization:** By viewing cores as distributed nodes, the compiler optimizes the computation granularity to balance load and minimize inter-core communication, allowing for a near-linear release of parallel computing power before hitting hardware bandwidth limits.

These results confirm that applying distributed system methodologies to single-machine multi-core compilation yields higher parallel efficiency than traditional shared-memory threading models.

## 5 Related Work

### 5.1 Graph Rewriting Optimization

Graph rewriting is crucial to enhancing execution efficiency in deep learning compilers. Existing approaches generally fall into two categories: *rule-based superoptimization* and *equality saturation*.

Rule-based methods optimize graphs through specific transformation rules and search strategies. TASO [22] automates rule generation and verifies correctness through theorem provers, using cost-driven backtracking for search. Others, such as OCGGS [23], PET [48], and OLLIE [61], extend this by relaxing performance constraints or deriving tensor algebra transformations. TPP [14] facilitates efficient vectorization through cache-aware packing primitives. However, these methods suffer from the *phase ordering problem*, where the sequence of optimizations impacts the final result.

Equality saturation addresses phase ordering by maintaining equivalent program versions in an e-graph. Diospyros [46] and Glenside [39] apply this to hardware adaptation, optimizing vectorization and access patterns. To improve the search efficiency of e-graphs, Tensat [57] and MCTS-GE [18] introduce techniques like Monte Carlo Tree Search.

**Limitations & Our Approach:** Despite these advances, existing e-graph methods rarely support **layout optimization for heterogeneous computing units** or link data locality with **adaptive memory allocation**. Most treat layout and memory as post-optimization passes. *nncase* bridges this gap by encoding layout and storage constraints directly into rewriting rules. By integrating a precise cost model within the e-graph, *nncase* simultaneously optimizes algebraic structure, data layout, and buffer placement.

## 5.2 Distribution Strategy Search

Automatic distributed parallelism research focuses on strategy representation and search algorithms. Regarding representation, OneFlow [58] and AutoDDL [8] utilize SBP abstractions (Split, Broadcast, Partial-Value) to manage resource dependencies. GSPMD [54] supports hybrid parallelism but relies on manual annotations. For search strategies, Alpa [60] constructs a hierarchical space for inter-operator and intra-operator parallelism, while Unity [45] proposes joint optimization of algebraic transformations and parallelization. In complex scenarios, systems such as PaddleDist [3] and Colossal-Auto [29] address adaptive training and fault tolerance.

**Limitations & Our Approach:** A critical limitation of current schemes is the **decoupling** of distributed strategy search from graph rewriting. Most systems (e.g., Alpa) perform the distribution search as an independent pass, preventing joint optimization with operator fusion or layout changes. *nncase* overcomes this by reusing the e-graph to construct a unified search space. This allows distributed strategies to be co-optimized with graph rewriting without manual configuration, while explicitly modeling communication costs and storage constraints for heterogeneous hardware.

## 5.3 Scheduling Optimization

Scheduling optimization balances code performance against search time. Approaches are broadly categorized into *learning-based* methods and *analytical modeling-based* methods.

Learning-based methods (e.g. AutoTVM [10], Ansol [59], MetaSchedule [37]) use machine learning to explore vast search spaces. Although they can approach optimal performance, they suffer from extremely long compilation times. Analytical methods (e.g. Welder [38], TileFlow [62], Chimera [63]) use mathematical models to quickly solve parameters such as tile sizes. However, they often oversimplify the hardware model. For example, Tuna [50] is limited to

scalar loops, and tools such as Welder often prioritize data movement over computational overhead, failing to accurately model the trade-off between cache size and memory footprint in complex hierarchies.

**Limitations & Our Approach:** Existing analytical models often lack the precision to handle the complex constraints of specialized AI accelerators. *nncase* addresses this by employing **MINLP (Mixed-Integer Non-Linear Programming)** modeling. Unlike prior works, *nncase* explicitly characterizes buffer locations and incorporates precise computational overhead into the cost evaluation, enabling efficient generation of high-performance schedules with significantly reduced compilation time.

## 6 Conclusion and Future Work

This paper introduces *nncase*, an end-to-end compilation framework designed to overcome the challenges of deploying LLMs on heterogeneous storage architectures. By constructing a unified distributed compilation framework, *nncase* bridges the gap between high-level model definitions and low-level hardware constraints. Key innovations include the e-graph-based Auto Vectorize and Auto Distribution modules, together with a hierarchical Auto Schedule module that synergizes MCTS with MINLP and integrates a specialized NTT library. These components collectively address critical bottlenecks in heterogeneous unit adaptation, distributed strategy search, and on-chip storage management. Experimental evaluations on modern multi-core platforms (e.g., AMD Ryzen 9 5900X) demonstrate that *nncase* achieves superior inference performance for Qwen3 series models compared to state-of-the-art frameworks like Intel IPEX and MLC LLM, validating the efficacy of its unified design philosophy.

Looking ahead, we plan to advance this research in three strategic directions:

1. **Holistic Performance Characterization:** We will expand the scope of evaluation to cover a wider range of model sizes, hardware platforms, and deployment scenarios. This includes a rigorous analysis of compilation overhead and memory footprint to verify the robustness and scalability of *nncase* in complex production environments.
2. **Cross-Architecture Adaptation (SIMT Support):** We aim to extend the compiler’s backend to support SIMT architectures. By enhancing the Auto Vectorize and Auto Schedule modules, we will enable seamless deployment on massively parallel devices such as GPUs, achieving true cross-platform portability.
3. **Computation-Communication Overlap:** We will focus on the automatic fusion and scheduling of communication and computation kernels. By refining dependency analysis and implementing advanced pipelining strategies, we aim to achieve effective latency hiding, thereby

maximizing hardware utilization and end-to-end inference throughput in distributed settings.

## References

- [1] or-tools. <https://github.com/google/or-tools>, 2013.
- [2] Andrew Adams, Karima Ma, Luke Anderson, Riyadh Baghdadi, Tzu-Mao Li, Michaël Gharbi, Benoit Steiner, Steven Johnson, Kayvon Fatahalian, Frédo Durand, and Jonathan Ragan-Kelley. Learning to optimize halide with tree search and random programs. *ACM Trans. Graph.*, 38(4), July 2019.
- [3] Yulong Ao, Zhihua Wu, Dianhai Yu, Weibao Gong, Zhiqing Kui, Minxu Zhang, Zilingfeng Ye, Liang Shen, Yanjun Ma, Tian Wu, Haifeng Wang, Wei Zeng, and Chao Yang. End-to-end adaptive distributed training on paddlepaddle, 2021.
- [4] ARM Ltd. *Arm Compute Library Reference Manual*.
- [5] Pietro Belotti, Jon Lee, Leo Liberti, Francois Margot, and Andreas Wachter. Branching and bounds tightening techniques for non-convex minlp. *Optimization Methods Software*, 24(4–5):597–634, August 2009.
- [6] M. E. Benitez and J. W. Davidson. A portable global optimizer and linker. In *Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation*, PLDI ’88, page 329–338, New York, NY, USA, 1988. Association for Computing Machinery.
- [7] Cerebras Systems. Third generation 5nm Wafer Scale Engine (WSE-3) powers industry’s most scalable AI supercomputers, 2024. Press release, March 13, 2024.
- [8] Jinfan Chen, Shigang Li, Ran Guo, Jinhui Yuan, and Torsten Hoefler. Autoddl: Automatic distributed deep learning with near-optimal bandwidth cost. *IEEE Trans. Parallel Distrib. Syst.*, 35(8):1331–1344, August 2024.
- [9] T. Chen et al. TVM: An automated end-to-end optimization stack for deep learning. *SSP 2018*, 2018.
- [10] Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. Learning to optimize tensor programs. In *Proceedings of the 32nd International Conference on Neural Information Processing Systems*, NIPS’18, page 3393–3404, Red Hook, NY, USA, 2018. Curran Associates Inc.
- [11] L. Dagum and R. Menon. Openmp: an industry standard api for shared-memory programming. *IEEE Computational Science and Engineering*, 5(1):46–55, 1998.
- [12] Nachum Dershowitz. A taste of rewrite systems. In *Functional Programming, Concurrency, Simulation and Automated Reasoning: International Lecture Series 1991–1992, McMaster University, Hamilton, Ontario, Canada*, page 199–228, Berlin, Heidelberg, 1993. Springer-Verlag.
- [13] Georgi Gerganov. llama.cpp. <https://github.com/ggml-org/llama.cpp>, 2022.
- [14] Renato Golin, Lorenzo Chelini, Adam Siemieniuk, Kavitha Madhu, Niranjan Hasabnis, Hans Pabst, Evangelos Georganas, and Alexander Heinecke. Towards a high-performance ai compiler with upstream mlir, 2024.
- [15] John Goodacre. *Technology Preview: The ARMv8 Architecture*. ARM Ltd, United Kingdom, November 2011. White paper.
- [16] Kazushige Goto and Robert A. van de Geijn. Anatomy of high-performance matrix multiplication. *ACM Trans. Math. Softw.*, 34(3), May 2008.
- [17] Dejan Grubisic, Bram Wasti, Chris Cummins, John Mellor-Crummey, and Aleksandar Zlateski. Looptune: Optimizing tensor computations with reinforcement learning, 2023.
- [18] Jakob Hartmann, Guoliang He, and Eiko Yoneki. Optimizing tensor computation graphs with equality saturation and monte carlo tree search. In *Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques*, PACT ’24, page 40–52, New York, NY, USA, 2024. Association for Computing Machinery.
- [19] Mike He, Haichen Dong, Sharad Malik, and Aarti Gupta. Improving term extraction with acyclic constraints. *Programming Language Design and Implementation*, 2023.
- [20] intel. Intel extension for pytorch. <https://github.com/intel/intel-extension-for-pytorch>, 2022.
- [21] Intel Corporation. *Intel Math Kernel Library Reference Manual*.
- [22] Zhihao Jia, Oded Padon, James Thomas, Todd Warszawski, Matei Zaharia, and Alex Aiken. Taso: optimizing deep learning computation with automatic generation of graph substitutions. In *Proceedings of the 27th ACM Symposium on Operating Systems Principles*, SOSP ’19, page 47–62, New York, NY, USA, 2019. Association for Computing Machinery.
- [23] Zhihao Jia, James Thomas, Todd Warszawski, Mingyu Gao, Matei Zaharia, and Alex Aiken. Optimizing dnn computation with relaxed graph substitutions. In A. Talwalkar, V. Smith, and M. Zaharia, editors, *Proceedings of Machine Learning and Systems*, volume 1, pages 27–39, 2019.

- [24] Edward G. Coffman Jr., Gábor Galambos, Silvano Martello, and Daniele Vigo. Bin packing approximation algorithms: Combinatorial analysis. In Ding-Zhu Du and Panos M. Pardalos, editors, *Handbook of Combinatorial Optimization*, pages 151–207. Springer, 1999.
- [25] Bernhard Korte and Jens Vygen. *Combinatorial Optimization: Theory and Algorithms*. Springer Publishing Company, Incorporated, 5th edition, 2012.
- [26] Prasad Kulkarni, Wankang Zhao, Hwashin Moon, Kyunghwan Cho, David Whalley, Jack Davidson, Mark Bailey, Yunheung Paek, and Kyle Gallivan. Finding effective optimization phase sequences. In *Proceedings of the 2003 ACM SIGPLAN Conference on Language, Compiler, and Tool for Embedded Systems, LCTES ’03*, page 12–23, New York, NY, USA, 2003. Association for Computing Machinery.
- [27] Rui Li, Aravind Sukumaran-Rajam, Richard Veras, Tze Meng Low, Fabrice Rastello, Atanas Rountev, and P. Sadayappan. Analytical cache modeling and tilesize optimization for tensor contractions. In *Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’19*, New York, NY, USA, 2019. Association for Computing Machinery.
- [28] Sean Lie. Cerebras architecture deep dive: First look inside the hardware/software co-design for deep learning. *IEEE Micro*, 43(3):18–30, 2023.
- [29] Yuliang Liu, Shenggui Li, Jiarui Fang, Yanjun Shao, Boyuan Yao, and Yang You. Colossal-auto: Unified automation of parallelization and activation checkpoint for large-scale models, 2023.
- [30] Tze Meng Low, Francisco D. Igual, Tyler M. Smith, and Enrique S. Quintana-Orti. Analytical modeling is enough for high-performance blis. *ACM Trans. Math. Softw.*, 43(2), August 2016.
- [31] Silvano Martello and Paolo Toth. Lower bounds and reduction procedures for the bin packing problem. *Discret. Appl. Math.*, 28(1):59–70, 1990.
- [32] Massinissa Merouani, Khaled Afif Boudaoud, Iheb Nassem Aouadj, Nassim Tchoulak, Islem Kara Bernou, Hamza Benyamina, Fatima Benbouzid-Si Tayeb, Karima Benatchba, Hugh Leather, and Riyadh Baghdadi. Looper: A learned automatic code optimizer for polyhedral compilers, 2025.
- [33] Nicholas Metropolis and S. Ulam. The monte carlo method. *Journal of the American Statistical Association*, 44(247):335–341, 1949. PMID: 18139350.
- [34] MLC team. MLC-LLM, 2023-2025.
- [35] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrej Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael

- Petrov, Henrique Ponde de Oliveira Pinto, Michael, Pokorny, Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. Gpt-4 technical report, 2024.
- [36] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. *SIGPLAN Not.*, 48(6):519–530, June 2013.
- [37] Junru Shao, Xiyou Zhou, Siyuan Feng, Bohan Hou, Ruihang Lai, Hongyi Jin, Wuwei Lin, Masahiro Masuda, Cody Hao Yu, and Tianqi Chen. Tensor program optimization with probabilistic programs. In *Proceedings of the 36th International Conference on Neural Information Processing Systems*, NIPS ’22, Red Hook, NY, USA, 2022. Curran Associates Inc.
- [38] Yining Shi, Zhi Yang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Ziming Miao, Yuxiao Guo, Fan Yang, and Lidong Zhou. Welder: Scheduling deep learning memory access via tile-graph. In *17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23)*, pages 701–718, 2023.
- [39] Gus Henry Smith, Andrew Liu, Steven Lyubomirsky, Scott Davidson, Joseph McMahan, Michael Taylor, Luis Ceze, and Zachary Tatlock. Pure tensor program rewriting via access patterns (representation pearl). In *Proceedings of the 5th ACM SIGPLAN International Symposium on Machine Programming*, MAPS 2021, page 21–31, New York, NY, USA, 2021. Association for Computing Machinery.
- [40] Avinash Sodani, Roger Gramunt, Jesus Corbal, Ho-Seop Kim, Krishna Vinod, Sundaram Chinthamani, Steven Hutsell, Rajat Agarwal, and Yen-Chen Liu. Knights landing: Second-generation intel xeon phi product. *IEEE Micro*, 36(2):34–46, 2016.
- [41] Michael Stepp, Ross Tate, and Sorin Lerner. Equality-based translation validator for llvm. In *Proceedings of the 23rd International Conference on Computer Aided Verification*, CAV’11, page 737–742, Berlin, Heidelberg, 2011. Springer-Verlag.
- [42] Ross Tate, Michael Stepp, Zachary Tatlock, and Sorin Lerner. Equality saturation: a new approach to optimization. In *Proceedings of the 36th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages*, POPL ’09, page 264–276, New York, NY, USA, 2009. Association for Computing Machinery.
- [43] Rajeev Thakur, Rolf Rabenseifner, and William Gropp. Optimization of collective communication operations in mpich. *IJHPCA*, 19:49–66, 01 2005.
- [44] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023.
- [45] Colin Unger, Zhihao Jia, Wei Wu, Sina Lin, Mandeep Baines, Carlos Efrain Quintero Narvaez, Vinay Ramakrishnaiah, Nirmal Prajapati, Patrick S. McCormick, Jamal Mohd-Yusof, Xi Luo, Dheevatsa Mudigere, Jongsoo Park, Misha Smelyanskiy, and Alexander Aiken. Unity: Accelerating dnn training through joint optimization of algebraic transformations and parallelization. In *USENIX Symposium on Operating Systems Design and Implementation*, 2022.
- [46] Alexa VanHattum, Rachit Nigam, Vincent T. Lee, James Bornholt, and Adrian Sampson. Vectorization for digital signal processors via equality saturation. In *Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems*, ASPLOS ’21, page 874–886, New York, NY, USA, 2021. Association for Computing Machinery.
- [47] Raj R. Varada, Rohini Krishnan, Ajith Subramonia, Rathish Chandran, Kalyana Chakravarthy, Utpal D. Desai, Sumedha Limaye, Puneesh Puri, David R. Mulvihill, Mike Bichan, Martin Koolhaas, Vijayalakshmi Ramachandran, and Srinivasu Kendle. 2.3 granite rapids-d:

- Intel xeon 6 soc for vran, edge, networking, and storage. In *2025 IEEE International Solid-State Circuits Conference (ISSCC)*, volume 68, pages 48–50, 2025.
- [48] Haojie Wang, Jidong Zhai, Mingyu Gao, Zixuan Ma, Shizhi Tang, Liyan Zheng, Yuanzhi Li, Kaiyuan Rong, Yuanyong Chen, and Zhihao Jia. PET: Optimizing tensor programs with partially equivalent transformations and automated corrections. In *15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21)*, pages 37–54. USENIX Association, July 2021.
- [49] Lei Wang, Lingxiao Ma, Shijie Cao, Quanlu Zhang, Jilong Xue, Yining Shi, Ningxin Zheng, Ziming Miao, Fan Yang, Ting Cao, et al. Ladder: Enabling efficient low-precision deep learning computing through hardware-aware tensor transformation. In *18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)*, pages 307–323, 2024.
- [50] Yao Wang, Xingyu Zhou, Yanming Wang, Rui Li, Yong Wu, and Vin Sharma. Tuna: A static analysis approach to optimizing deep neural networks, 2021.
- [51] Martin Weidman. Introducing the scalable matrix extension for the armv9-a architecture. Technical report, ARM Ltd, 2021.
- [52] Finn Wilkinson and Simon McIntosh-Smith. An initial evaluation of arm’s scalable matrix extension. In *2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)*, pages 135–140, 2022.
- [53] Samuel Williams, Andrew Waterman, and David Patterson. Roofline: an insightful visual performance model for multicore architectures. *Commun. ACM*, 52(4):65–76, April 2009.
- [54] Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly, Marcello Maggioni, Ruoming Pang, Noam Shazeer, Shibo Wang, Tao Wang, Yonghui Wu, and Zhifeng Chen. Gspmd: General and scalable parallelization for ml computation graphs, 2021.
- [55] Zhiying Xu, Jiafan Xu, Hongding Peng, Wei Wang, Xiaoliang Wang, Haoran Wan, Haipeng Dai, Yixu Xu, Hao Cheng, Kun Wang, and Guihai Chen. Alt: Breaking the wall between data layout and loop optimizations for deep learning compilation. In *Proceedings of the Eighteenth European Conference on Computer Systems, EuroSys ’23*, page 199–214, New York, NY, USA, 2023. Association for Computing Machinery.
- [56] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengan Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yingger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, and Zihan Qiu. Qwen3 technical report, 2025.
- [57] Yichen Yang, Phitchaya Mangpo Phothilimtha, Yisu Remy Wang, Max Willsey, Sudip Roy, and Jacques Pienaar. Equality saturation for tensor graph superoptimization, 2021.
- [58] Jinhuai Yuan, Xinqi Li, Cheng Cheng, Juncheng Liu, Ran Guo, Shenghang Cai, Chi Yao, Fei Yang, Xiaodong Yi, Chuan Wu, Haoran Zhang, and Jie Zhao. Oneflow: Redesign the distributed deep learning framework from scratch, 2022.
- [59] Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, and Ion Stoica. Ansor: generating high-performance tensor programs for deep learning. In *Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation, OSDI’20, USA, 2020*. USENIX Association.
- [60] Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P Xing, et al. Alpa: Automating inter-and intra-operator parallelism for distributed deep learning. In *16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)*, pages 559–578, 2022.
- [61] Liyan Zheng, Haojie Wang, Jidong Zhai, Muyan Hu, Zixuan Ma, Tuowei Wang, Shizhi Tang, Lei Xie, Kezhao Huang, and Zhihao Jia. Ollie: Derivation-based tensor program optimizer, 2022.
- [62] Size Zheng, Siyuan Chen, Siyuan Gao, Liancheng Jia, Guangyu Sun, Runsheng Wang, and Yun Liang. Tileflow: A framework for modeling fusion dataflow via tree-based analysis. In *Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’23*, page 1271–1288, New York, NY, USA, 2023. Association for Computing Machinery.

- [63] Size Zheng, Siyuan Chen, Peidi Song, Renze Chen, Xiuhong Li, Shengen Yan, Dahua Lin, Jingwen Leng, and Yun Liang. Chimera: An analytical optimizing framework for effective compute-intensive operators fusion. In *2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)*, pages 1113–1126, 2023.
- [64] Hongyu Zhu, Ruofan Wu, Yijia Diao, Shanbin Ke, Haoyu Li, Chen Zhang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Wei Cui, Fan Yang, Mao Yang, Lidong Zhou, Asaf Cidon, and Gennady Pekhimenko. ROLLER: Fast and efficient tensor compilation for deep learning. In *16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)*, pages 233–248, 2022.