

# Scalable Synthesis of distributed LLM workloads through Symbolic Tensor Graphs

Changhai Man<sup>†\*</sup>, Joongun Park<sup>†\*</sup>, Hanjiang Wu<sup>†</sup>, Huan Xu<sup>†</sup>, Srinivas Sridharan<sup>‡</sup>, Tushar Krishna<sup>†</sup>  
 Georgia Institute of Technology<sup>†</sup>, Nvidia Inc.<sup>‡</sup>

{cman8, jpark3234, hwu419, hxu398}@gatech.edu, srisridharan@nvidia.com, tushar@ece.gatech.edu

**Abstract**—Optimizing the performance of large language models (LLMs) on large-scale AI training and inference systems requires a scalable and expressive mechanism to model distributed workload execution. Such modeling is essential for pre-deployment system-level optimizations (e.g., parallelization strategies) and hardware design-space explorations. While recent efforts have proposed collecting execution traces from real systems, access to large-scale infrastructure remains limited to major cloud providers. Moreover, traces capturing execution on a specific platform cannot be easily adapted to study alternate software and/or hardware configurations, especially at scale. We introduce STAGE\*\*, a framework that synthesizes high-fidelity execution graphs to accurately model distributed AI workloads (including LLMs and MoEs). STAGE supports a comprehensive set of parallelization strategies, allowing users to systematically explore a wide spectrum of model architectures and system configurations. STAGE demonstrates its scalability by synthesizing high-fidelity LLM traces spanning over 32K GPUs, while preserving tensor-level accuracy in compute, memory, and communication. STAGE will be publicly available to facilitate further research in distributed machine learning systems.

## I. INTRODUCTION

The rapid growth of machine learning models, especially Large Language Models (LLMs), including GPT [6], Llama [59], DeepSeek [12], and Mistral [29], has revolutionized the field of machine learning, driving massive advancements in natural language processing and generative AI. However, the scale and complexity of LLMs have introduced unprecedented computational challenges. These models often require massive amounts of computation and memory [41], [64], not only during training but also for inference, necessitating distributed AI systems. Several such systems exist in practice today, including NVIDIA HGX [46], Google TPU [72], Amazon Trainium [7], Cerebras CS-3 [8], and others. Using compute, memory, and communication resources efficiently in these systems is crucial for performance [52], [67]. The need for scalable and efficient distributed training is only growing, as evidenced by the recently released Llama 4 model that leverages a Mixture-of-Experts (MoE) architecture with up to 2 trillion parameters [38], pushing the limits of current AI system infrastructure.

Standardized benchmarks play a crucial role in our community, serving two key purposes: optimizing the performance of current AI systems and guiding the design choices for next-generation systems. Efforts like MLPerf [53] have been leading the way in identifying representative benchmarks in the domain of AI. Unfortunately, deploying the full software stack of distributed AI benchmarks for the sole purpose of running optimization and design-space exploration (DSE) studies is prohibitive in practice, as it requires extensive framework (PyTorch/JAX/TensorFlow) expertise and continued access to large-scale systems. Furthermore, it is extremely difficult to isolate hardware versus software bottlenecks, and compute versus memory versus network behaviors.

TABLE I  
 NUMBER OF OPERATIONS WITHIN SINGLE EPOCH PER GPU. (BATCH SIZE: 128 FOR DEEPSEEK, 32 FOR OTHERS)

| Model        | # of Param. | # of GPU | # of Comp. | # of Comm. |
|--------------|-------------|----------|------------|------------|
| GPT-3        | 175B        | 32       | 156,317    | 30,978     |
| LLaMA-3      | 70B         | 16       | 164,099    | 38,434     |
| Mixtral      | 8x22B       | 32       | 24,102     | 3,180      |
| DeepSeek-MoE | 16B         | 8        | 76,111     | 1,867      |

Acknowledging the aforementioned challenges, recent efforts [25], [57] have proposed the idea of Execution traces (ET) as a mechanism to capture the *coarse-grain (i.e., operator-level) compute and communication dependence behavior* during AI training. In particular, MLCommons Chakra [57] has introduced specific support within PyTorch to trace the dependence graph (with timing) of distributed AI workloads. Selective replay of the ETs [35], and analysis of the captured metadata (type, size and data volume) can help expose computation, memory, and communication bottlenecks, in turn guiding optimization tools.

While ETs are expected to play a crucial role in AI system design, we believe that ETs alone are insufficient for guiding optimization and DSE for the following reasons:

- **High cost and limited accessibility:** Generating ETs requires large-scale infrastructure—often hundreds or thousands of GPUs—accessible only to a few hyperscalers. Further, even when ETs are collected, privacy and proprietary constraints may prevent them from being shared broadly with the research community.
- **Tied to AI platform:** ETs from real systems are inherently tied to the system they were collected on, with platform-specific software optimizations and hardware bindings baked in. This limits scalability and generality when studying larger and more diverse systems. As Table I shows, even a single training epoch of a mid-sized LLM involves tens of thousands of operations per GPU, making trace analysis and scaling a nontrivial task. Efforts to scale ETs [10], [25] have focused on mimicking pre-existing system and model behaviors rather than enabling exploration of diverse configurations or novel parallelization strategies.

\*Equal contribution.

\*\*Symbolic Tensor grAph GEnerator

Fig. 1. Overview of STAGE

- **Tied to AI model:** In the arms race of AI models, LLM architectures continue to evolve rapidly, driven by innovations such as MoEs [12], [30], attention mechanism variants [3], [15], [62], and state space models [21] that aim to improve model accuracy and training efficiency. This can render ETs from real systems obsolete in a matter of months.

These challenges point to a growing need for a more agile framework for distributed AI workload generation that can flexibly adapt to emerging AI model structures and support fast iteration across diverse hardware platform architectures. To this end, we present STAGE, a novel framework for generating high-fidelity, scalable, and configurable execution graphs for distributed LLM workloads. Fig. 1 shows the overall flow of STAGE. At the front-end, STAGE accepts user-defined input workloads in tensor format and supports both predefined model templates and customized inputs for future extensibility. A key innovation in STAGE is the use of a symbolic tensor representation to build a graph that compactly captures distributed ML workloads, enabling scalability by describing their shared computational structure while flexibly incorporating variations in tensor dimensions. Our abstraction enables flexible tensor partitioning and systematic support for all major parallelization strategies and their arbitrary combinations, including hypothetical configurations beyond those seen in existing systems. Once the distributed execution graph is constructed, STAGE converts it into a schema that can either be fed to a downstream simulator or augment a collection of real-system ETs for system optimization and analysis.

The key contributions of this paper are as follows:

- **Symbolic Representation for Diverse AI Model Architectures:** STAGE uses symbolic operations to abstract and generalize LLMs, enabling graph-based workload generation across a wide range of model architectures including dense (e.g., LLaMA, GPT), MoE (e.g., DeepSeek, Mixtral), and state-space-style (e.g., Mamba).
- **Comprehensive Parallelism Modeling:** STAGE systematically supports all viable combinations of parallelism with a novel producer-consumer-based communication matcher. It enables exhaustive exploration of parallelization configurations for diverse systems.
- **Compute, Memory, and Network Modeling:** STAGE accurately models computation, memory, and communication at tensor granularity by analyzing tensor dimensions, lifetimes, and synchronization behavior. This fine-grained modeling enables deeper insights into bottlenecks and resource utilization.

- **Validation with Real-World Traces:** STAGE generates execution graphs that model computation, communication, and memory behavior, and we validate their fidelity using real traces collected on systems ranging from a single GPU to production-scale 128-GPU H100/H200 HGX clusters executing large-scale LLM training workloads.<sup>†</sup>
- **Scalable and Open-Source Framework:** STAGE can synthesize training traces for models running on 32K GPUs in less than 30 minutes without compromising accuracy. This enables fast and scalable system analysis. The framework will be publicly released to support the research community.

## II. BACKGROUND AND RELATED WORKS

### A. Large Language Models (LLMs)

Large Language Models (LLMs) are neural language models that have been scaled to an unprecedented level, motivated by the effectiveness demonstrated by the scaling law [31]. These models, trained on various datasets, span billions or even trillions of parameters [51]. The size and complexity of LLMs create substantial computational and memory demands and necessitate advanced parallelization techniques, together with their corresponding collective communications, to make training feasible and efficient. Most popular LLMs today, such as LLaMA [59] and GPT [6], adopt a decoder-only transformer. This architecture is composed of stacked layers, each of which includes a series of main operations such as LayerNorm, Multihead Attention, and MLP. Within the Multihead Attention and MLP layers, the computation is further decomposed into finer-grained operations, such as matrix multiplications (e.g., Linear and MatMul), activation functions (e.g., Softmax and GeLU), and regularization components like Dropout and LayerNorm.

### B. Distributed Training with Multi-dimensional Parallelization Strategies

To support large-scale LLM training, sufficient memory is required to store both model weights and input activations, along with adequate computational resources to complete training within a reasonable timeframe. Consequently, the following parallelization strategies are used in practice.

- **Data Parallelism (DP):** Splits input data across devices with replicated weights; synchronizes gradients after the backward pass [51].
- **Fully Sharded Data Parallelism (FSDP):** Shards both input batches and model parameters across devices, reducing memory use but adding communication to gather parameters during training [70].

<sup>†</sup>Upon acceptance of this paper, we will release these traces as publicly available complementary resources.



Fig. 2. An Example of Execution Trace for GPU Operations.

- **Tensor Parallelism (TP):** Shards model weights across devices while replicating input data; requires AllReduce to exchange activations after each layer.
- **Sequence Parallelism (SP):** Splits input sequences into tokens; complements TP by replacing AllReduce with more efficient AllGather and ReduceScatter.
- **Pipeline Parallelism (PP):** Divides model into stages and pipelines microbatches for concurrent execution across devices.
- **Expert Parallelism (EP):** For MoE models, uses AllToAll to route tokens to specialized experts after attention layers.

Each strategy introduces unique patterns for computation, memory access, and network communication in a large-scale system [51], [56]. To maximize efficiency and scalability, LLM frameworks today combine multiple parallelism strategies, such as DP, TP, SP, PP, and EP, within a single model. More details are covered in Section III-B.

### C. Execution Traces

Modern large-scale machine learning systems consist of interleaved compute and communication operations, often with complex execution order and data flow. Execution traces (ETs) provide a structured view of these operations by capturing actual runtime behavior. They record the sequence of operations along with metadata such as device type, execution time, and memory usage. Tools like PyTorch profiler [49], Kineto [58], PARAM [55], and Chakra [40] collect these traces at various abstraction levels.

Graph-based formats are particularly useful. They represent compute and communication operations as nodes, and encode data and control dependencies as edges. This structure enables analysis of execution order, critical paths, operator overlap, and performance bottlenecks. Fig. 2 illustrates a simplified execution trace graph depicting GPU computations, communications, and their inter-operation dependencies. Operation-specific parameters, such as the tensor size of a GeMM operation, are encoded as attributes within each node. Following the control and data dependencies from left to right reveals which operations must run sequentially and which can execute in parallel across different tensor objects. Labeled tensor sizes are updated following each computation or communication operation between dependent tensors.
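As a rough illustration of this graph-based view, the following Python sketch builds a tiny execution graph and finds its critical path. The `Node` class, the operation names, and the durations are hypothetical and do not follow the Chakra schema; they only show how nodes, attributes, and dependency edges support this kind of analysis.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Node:
    """One operation in a toy execution trace (illustrative only)."""
    name: str
    kind: str                                  # "COMP" or "COMM"
    duration_us: float                         # made-up runtime
    deps: list = field(default_factory=list)   # names of predecessor nodes
    attrs: dict = field(default_factory=dict)  # e.g. GeMM sizes, message bytes


def critical_path(nodes):
    """Return (length, path) of the longest dependency chain in the DAG."""
    by_name = {n.name: n for n in nodes}
    order = TopologicalSorter({n.name: n.deps for n in nodes}).static_order()
    finish, prev = {}, {}
    for name in order:
        node = by_name[name]
        start = max((finish[d] for d in node.deps), default=0.0)
        finish[name] = start + node.duration_us
        # Remember which predecessor finished last, if any.
        prev[name] = max(node.deps, key=lambda d: finish[d], default=None)
    last = max(finish, key=finish.get)
    length, path = finish[last], []
    while last is not None:
        path.append(last)
        last = prev[last]
    return length, path[::-1]


trace = [
    Node("gemm0", "COMP", 120.0, attrs={"m": 4096, "n": 4096, "k": 1024}),
    Node("allreduce0", "COMM", 80.0, deps=["gemm0"], attrs={"bytes": 32 << 20}),
    Node("gemm1", "COMP", 100.0, deps=["gemm0"]),
    Node("add0", "COMP", 10.0, deps=["allreduce0", "gemm1"]),
]
length, path = critical_path(trace)
print(length, path)  # 230.0 ['gemm0', 'gemm1', 'add0']
```

In this toy trace, `allreduce0` and `gemm1` depend only on `gemm0`, so they can overlap, and the communication stays off the critical path.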

## III. MOTIVATION

In this section, we discuss the specific challenges we target with STAGE.

TABLE II  
CATEGORIZATION OF MODERN LLM COMPONENTS AND PARALLELISM STRATEGIES WITH NATIVE SUPPORT IN STAGE

| Category            | Component               | Origin Source                        |
|---------------------|-------------------------|--------------------------------------|
| Attention Mechanism | Multi-head              | Transformer [63]                     |
|                     | Group Query Attention   | LLaMA [59]                           |
|                     | Multi-latent            | DeepSeek-V2 [1]                      |
|                     | State Space Model       | Mamba [21]                           |
| Feedforward Network | Up-down FFN             | GPT [6]                              |
|                     | Gate-up-down FFN        | LLaMA [59]                           |
| Normalization       | RMSNorm                 | LLaMA [59]                           |
|                     | Elem-wise Norm          | BERT [16]                            |
| Mixture-of-Experts  | MoE                     | Gshard [33], Switch Transformer [18] |
|                     | MoE with Shared Experts | DeepSeek-MoE [12]                    |
| Sharding Strategy   | Data Parallelism        | PyTorch DDP [34]                     |
|                     | Tensor Parallelism      | Megatron-LM [56]                     |
|                     | Pipeline Parallelism    | GPipe [26], PipeDream [23]           |
|                     | FSDP (ZeRO-3)           | DeepSpeed [51], PyTorch-FSDP [70]    |
|                     | Expert Parallelism      | Switch Transformer [18]              |

### A. Challenge 1: Limitations of Real-System ETs

Access to high-fidelity workloads is crucial for optimization and DSE efforts. However, obtaining real ETs is often prohibitive in practice due to the high computational and financial cost of running LLMs over large clusters. Moreover, security concerns and data-sharing limitations prevent organizations that do have the resources from making internal ETs publicly available. Furthermore, even if system designers/optimizers have access to real ETs, their properties are inherently tied to the model architecture, parallelism strategy, and underlying hardware platform (e.g., fused operators depend on compiler support within the platform, compute and communication volumes are tied to the system size, and so on). This makes it challenging to extend the ETs or perform DSE for hypothetical future platforms.

### B. Challenge 2: Modeling LLM Architecture and Parallelism Diversity

The wide range of LLM architectures and parallelization strategies makes synthetic modeling of distributed ML workloads particularly challenging. Table II summarizes commonly used model components and parallel strategies optimized for specific training or inference objectives. In practice, LLM developers rarely rely on a single model architecture or parallel strategy. Instead, they often combine multiple design components, resulting in compositional and complex workloads.

**Diverse and Rapidly Evolving Model Architectures.** Modern LLM architectures exhibit substantial diversity, significantly increasing complexity in workload modeling. For instance, LLaMA [59] incorporates Group Query Attention (GQA) in its attention mechanism alongside a unique three-layer feed-forward network, differing substantially from traditional GPT architectures. More recent models, such as DeepSeek-R1 [14], further increase complexity by employing MoE layers with shared experts and Multi-head Latent Attention (MLA). Additionally, non-transformer architectures, exemplified by Mamba [21], replace conventional attention with selective state-space models. Emerging hybrid architectures combining transformer and state-space models, such as Zamba [19] and Jamba [37], further compound the complexity, underscoring the difficulty of accurately and systematically capturing diverse model behaviors.

**Complexity and Variability in Parallelization Strategies.** Practical deployments of LLMs often employ hybrid parallel strategies to optimize system performance and resource utilization. Common training frameworks such as Megatron [56], NeMo [24], and HuggingFace Accelerate [22] frequently integrate multi-dimensional parallelization, combining data parallelism, tensor parallelism, and pipeline parallelism. Further optimizations, such as sequence parallelism, are also increasingly adopted in combination with tensor parallelism to enhance communication efficiency. More recently, FSDP has become crucial for reducing memory overhead. Notably, fully sharded context parallelism, which combines weight sharding with context parallelism, can be a valid parallel strategy even though it is not commonly recognized in current practice. These hybrid strategies create an expansive design space that existing frameworks struggle to systematically represent and evaluate, highlighting a critical gap in current workload modeling capabilities.

**Infeasibility of Manually Defining Distributed Workloads.** Several simulation frameworks (such as Calculon [27], Mad-Max [25], SimAI [65]) rely on customized templates or analytical first-order equations to address the challenge of describing AI workloads. While this approach enables fast analysis and can be easily evaluated on a CPU-only system, these frameworks are often over-optimized for specific target workloads and require a deep understanding of the codebase to extend. Moreover, their analytical nature limits the ability to capture realistic system and hardware behaviors, such as compute–communication dependencies. This limitation poses a barrier for researchers and practitioners from architecture or hardware backgrounds who aspire to prototype new designs on the latest AI workloads. Bridging this gap is therefore critical to improving simulation fidelity and enabling hardware and software optimizations for LLM training.

## IV. STAGE: SYMBOLIC TENSOR GRAPH GENERATOR

STAGE addresses the challenges discussed in Section III by representing LLM workloads based on symbolic abstractions. Users simply specify high-level parameters such as model size and parallelism degrees, and STAGE automatically generates execution graphs capturing computation, communication, and memory behavior. By simplifying workload modeling while preserving key execution characteristics, STAGE bridges the gap between synthetic and real-world traces, supporting scalable and systematic design exploration.

### A. STAGE Overview

Fig. 3 provides a high-level overview of STAGE, illustrating its workflow from model specification to workload simulation. ① Symbolic Tensor Graphs (STG): The flow starts from a set of templates of commonly used modules in LLMs. These modules are integrated into the STAGE framework in the Symbolic Tensor Graph Intermediate Representation (STG IR) format. ② STAGE then assembles these modules into the whole model by repeating and connecting them into one large STG. ③ With the assembled model, STAGE distributes the workload from a single undistributed graph to multiple accelerators through tensor-level and graph-level distribution. STAGE analyzes the communication required by each parallel strategy based on the tensor/graph shardings. ④ Finally, STAGE interprets the STG IR and generates a directed acyclic graph (DAG) with explicit operation dependencies for downstream tasks.

Fig. 3. STAGE Generation Flow Overview.

### B. Workload definition

STAGE is designed to be both simple to use and highly flexible, providing a systematic pipeline from model specification to workload generation.

1) **Input: Model and Module Templates:** To ensure ease of use, STAGE requires only two user inputs: the target model (e.g., GPT, LLaMA) and a selection of module templates (e.g., MHA, FFN, MoE) in STG IR format. This design allows users to generate symbolic tensor graphs without manually specifying the entire model structure. In addition, STAGE supports user-defined operations beyond the built-in templates and models, enabling researchers to extend the framework with custom computations. This flexibility is essential for supporting future system-level optimizations and accommodating emerging model architectures.

2) **Output: Execution Graph:** By default, STAGE leverages the *Chakra* schema [40] since it is being standardized by MLCommons [39]. This schema captures the dependencies between compute and communication tasks, which are essential for identifying bottlenecks, critical paths, and opportunities for computation-communication overlap during distributed training. Using execution graphs to explicitly model task dependencies is widely adopted in both workload benchmarking [34], [57] and workload modeling [17], [36]. While Chakra is the default, STAGE can be adapted to other output formats by introducing suitable translation modules.

### C. Symbolic Tensor Representation

STAGE introduces the *Symbolic Tensor Graph* (STG) as an intermediate representation (IR) to model ML workloads. STG abstracts tensor shapes, operations, and distribution strategies symbolically, enabling efficient reuse across workloads that share the same graph structure but differ in dimensions (e.g., batch size).



Fig. 4. Using Symbolic Tensor Representation to Annotate MultiHead Attention with Sequence and Tensor Parallelism.

**Symbolic Tensor Format:** Tensors are represented as:

Tensor[Shape @ Hidden]

Here, Shape includes symbolic dimensions such as Batch (B) and Sequence (S) and may also contain *partition symbols*, such as data parallelism (dp), tensor parallelism (tp) or sequence parallelism (sp). The optional *Hidden* (H) field denotes partial sums across devices. In the ML context, the hidden dimension typically corresponds to the model’s embedding size or other feature dimensions. For instance, a tensor  $x$  with dp is represented as  $x[B/dp, H]$  with the batch dimension sharded across devices.

Fig. 4 illustrates how tensor representations are used to model multihead attention with sp and tp. For clarity, the input tensor assumes a batch size of 1, and key intermediate tensors undergoing shape transformations are highlighted in grey.

**Tensor-Level Distribution Types:** STAGE defines three symbolic distribution semantics:

- *Duplicated*: Full copy on all devices.
- *Partition*: Tensor is disjointly sharded across devices along a specific dimension.
- *PartialSum*: Each device holds a partial result; reduction is required.

These distribution types can be composed to represent complex parallelization strategies. For example, the following notation combines *dp*, *sp*, and *tp*:

$X[B/dp, S/sp, H] @ 1/tp$

Here, *dp* applies to the batch dimension, *sp* to the sequence dimension, and *tp* at the end indicates that the tensor is in *PartialSum* form across the hidden dimension.
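To make the notation concrete, here is a minimal Python sketch of a symbolic tensor that prints in the format above and resolves to a per-device shape once symbols are bound. The `Dim` and `SymTensor` classes and the numeric bindings are hypothetical illustrations, not the STG IR used by STAGE.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Dim:
    size: str                   # symbolic extent, e.g. "B"
    part: Optional[str] = None  # partition symbol, e.g. "dp"; None if unsharded

    def __str__(self):
        return f"{self.size}/{self.part}" if self.part else self.size


@dataclass(frozen=True)
class SymTensor:
    name: str
    dims: tuple                     # tuple of Dim
    partial: Optional[str] = None   # symbol over which values are partial sums

    def __str__(self):
        shape = ", ".join(str(d) for d in self.dims)
        suffix = f" @ 1/{self.partial}" if self.partial else ""
        return f"{self.name}[{shape}]{suffix}"

    def local_shape(self, env):
        """Per-device shape once every symbol is bound to an integer."""
        shape = []
        for d in self.dims:
            ways = env[d.part] if d.part else 1
            if env[d.size] % ways:
                raise ValueError(f"{d.size} is not divisible by {d.part}")
            shape.append(env[d.size] // ways)
        return tuple(shape)


x = SymTensor("X", (Dim("B", "dp"), Dim("S", "sp"), Dim("H")), partial="tp")
env = {"B": 32, "S": 4096, "H": 8192, "dp": 4, "sp": 2, "tp": 8}
print(x)                   # X[B/dp, S/sp, H] @ 1/tp
print(x.local_shape(env))  # (8, 2048, 8192)
```

The PartialSum marker does not change the local shape: each of the `tp` devices holds a full-size slice of values that still need to be summed.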

**Symbolic Operations:** Operators are expressed using a concise format:

`output = op[op_attr](input1, input2, ...)`

For example, matrix multiplication is written symbolically as:

$y = \text{einsum}[bm, mn \rightarrow bn](x, w)$

Here,  $x$  has shape  $[b, m]$ ,  $w$  has shape  $[m, n]$ , and the output  $y$  has shape  $[b, n]$ . STAGE adopts *einsum* to express all tensor multiplications, allowing representation of preserved, reduced, and shared dimensions. By encoding partitioning strategies directly into symbolic tensor shapes, STAGE offers a unified abstraction that captures both computation and parallel execution, serving as the foundation for STG construction and downstream simulation.
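The einsum form also makes shape inference mechanical. The sketch below, a simplified illustration rather than STAGE's implementation, infers the symbolic output shape and the reduced dimensions from an einsum specification with single-letter labels. The function name `einsum_shape` is hypothetical.

```python
def einsum_shape(spec, *shapes):
    """Infer the symbolic output shape of an einsum such as 'bm,mn->bn'.

    Returns (output_shape, reduced_dims), both as lists of symbols.
    """
    inputs, output = spec.replace(" ", "").split("->")
    operands = inputs.split(",")
    if len(operands) != len(shapes):
        raise ValueError("number of operands does not match the spec")
    binding = {}
    for labels, shape in zip(operands, shapes):
        if len(labels) != len(shape):
            raise ValueError(f"rank mismatch for '{labels}': {shape}")
        for label, dim in zip(labels, shape):
            # A label must refer to the same symbolic size everywhere.
            if binding.setdefault(label, dim) != dim:
                raise ValueError(f"label '{label}' bound to {binding[label]} and {dim}")
    reduced = [binding[l] for l in binding if l not in output]
    return [binding[l] for l in output], reduced


linear = einsum_shape("bm,mn->bn", ["B", "H"], ["H", "4H"])
scores = einsum_shape("bhsd,bhtd->bhst", ["B", "A", "S", "D"], ["B", "A", "T", "D"])
print(linear)  # (['B', '4H'], ['H'])
print(scores)  # (['B', 'A', 'S', 'T'], ['D'])
```

Labels that appear in the output are preserved dimensions, while labels that appear only in the inputs are reduced. This distinction is what later determines whether a sharded dimension yields a partitioned output or a partial sum.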

TABLE III

TENSOR-LEVEL DISTRIBUTION IN A LINEAR LAYER. WE USE  $[H, 4H]$  TO DENOTE THE WEIGHT MATRIX FOR THE UP-PROJECTION, A COMMON OPERATION IN MLPs.

| Parallel Strategy                                                                  | Symbolic Tensor Representation                             |
|------------------------------------------------------------------------------------|------------------------------------------------------------|
| No Parallel                                                                        | $x[B, H]$<br>$w[H, 4H]$<br>$y[B, 4H]$                      |
| Data-Parallel ( <i>dp</i> )                                                        | $x[B/dp, H]$<br>$w[H, 4H]$<br>$y[B/dp, 4H]$                |
| Tensor-Parallel (Row) ( <i>tp</i> )                                                | $x[B, H/tp @ 1]$<br>$w[H/tp, 4H @ 1]$<br>$y[B, 4H @ 1/tp]$ |
| Tensor-Parallel (Column) ( <i>tp</i> )                                             | $x[B, H]$<br>$w[H, 4H/tp]$<br>$y[B, 4H/tp]$                |
| Fully Sharded Data Parallel ( <i>fsdp</i> )                                        | $x[B/fsdp, H]$<br>$w[H/fsdp, 4H]$<br>$y[B/fsdp, 4H]$       |
| Hybrid-Parallel ( <i>hp</i> )<br>(Column Tensor Parallel<br>w/ Activation Sharded) | $x[B/hp, H]$<br>$w[H, 4H/hp]$<br>$y[B, 4H/hp]$             |

Table III enumerates common distribution techniques for a linear layer in an LLM, along with a hybrid one that shows the flexibility of the notation. With this systematic definition, STAGE reduces the need for user intervention while still allowing users to define custom distribution strategies. Section VI-E discusses how conventional parallel strategies can be defined with STAGE.
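The first four rows of Table III follow from one simple rule for $y = \text{einsum}[bm, mn \rightarrow bn](x, w)$: preserved dimensions keep their partition symbol, and a partitioned contracted dimension turns the output into a PartialSum. The Python sketch below encodes that rule under our own simplifying assumptions; `linear_output` and `fmt` are hypothetical helpers. Rows whose operands disagree on the contracted dimension, such as the FSDP row, are rejected here because they need a redistribution step first.

```python
def linear_output(x, w):
    """Distribution of y = einsum[bm,mn->bn](x, w).

    x and w are lists of (size, partition) pairs; partition is None when the
    dimension is not sharded. Returns (y_dims, partial_symbol).
    """
    (b, b_part), (m_x, m_part_x) = x
    (m_w, m_part_w), (n, n_part) = w
    if m_x != m_w:
        raise ValueError("contracted sizes differ")
    if m_part_x != m_part_w:
        raise ValueError("operands disagree on the contracted dimension; "
                         "redistribute one of them first")
    # Sharding the contracted dimension leaves each device with a partial sum.
    return [(b, b_part), (n, n_part)], m_part_x


def fmt(dims, partial=None):
    body = ", ".join(f"{s}/{p}" if p else s for s, p in dims)
    return f"[{body}]" + (f" @ 1/{partial}" if partial else "")


none = fmt(*linear_output([("B", None), ("H", None)], [("H", None), ("4H", None)]))
dp = fmt(*linear_output([("B", "dp"), ("H", None)], [("H", None), ("4H", None)]))
row = fmt(*linear_output([("B", None), ("H", "tp")], [("H", "tp"), ("4H", None)]))
col = fmt(*linear_output([("B", None), ("H", None)], [("H", None), ("4H", "tp")]))
print(none)  # [B, 4H]
print(dp)    # [B/dp, 4H]
print(row)   # [B, 4H] @ 1/tp
print(col)   # [B, 4H/tp]
```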

### D. Workload Distributor

In STAGE, distributed workloads are handled using two approaches: (1) *Tensor-level distribution* where each machine holds shards of a tensor and collaborates to execute a single operator and (2) *Graph-level distribution* where each machine is responsible for a portion of the computation graph and exchanges data via send-receive pairs when information flows across graph partitions. Depending on the deployed parallelization strategy, STAGE employs dedicated components to implement either the tensor-level or graph-level distribution approach.

1) **Tensor-level distributor:** Tensor-level distribution transforms the initial tensor representations with the corresponding parallel dimensions, enabling efficient workload distribution across multiple devices. However, these strategies inherently introduce the need for *collective communication*, which is essential for maintaining consistency and data alignment between devices during computation. Accurately modeling these communication patterns is crucial to reflect the real-world behavior of parallel workloads. STAGE encodes tensor shardings in the assembled models. The tensor-level distributor then applies the corresponding parallel strategies and uses the Collective Communication Matcher to analyze and generate the collective communications they require.



Fig. 5. Tensor Distribution Mismatch: After applying tensor-level distribution



Fig. 6. Collective Communication can be divided into two steps: Pull + Push. Note that Slice\* is a special case on a single machine.

In Fig. 5, we illustrate how propagating the initial parallelization across an undistributed compute graph (which avoids manually defining every sharded tensor) can create tensor distribution mismatches, where the producer and consumer of a tensor expect different sharding layouts. In STAGE, before applying tensor-level distribution, we first propagate shapes through the compute graph to infer each tensor’s shape. We then apply the tensor distribution separately for each operator and re-propagate the shapes. At this point, tensor $x_1$ has a different distribution depending on the view. From the producer view, $x_1 = \text{einsum}(x_0, w_0)$, the output has distribution $[a, c @ 1/tp]$. From the consumer view, $x_2 = \text{einsum}(x_1, w_1)$, the input is expected as $[a, c]$. This mismatch necessitates a collective AllReduce operation to aggregate the partial sums across devices.

2) **Collective Communication Matcher:** To handle the diverse communication requirements arising from various parallelization strategies, STAGE uses a communication matcher to systematically identify and encode the required communication operations. The matcher analyzes the distribution patterns of tensors across devices and selects the appropriate collective communication operations based on the relationship between the distribution at the *producer* that produces the tensor and the distribution at the *consumer* that consumes it. This producer-consumer model is divided into two conceptual steps, *Pull* and *Push*, as Fig. 6 shows.

In the *Pull* step, data is gathered from all devices in the producer distribution to assemble a complete tensor. In the *Push* step, this tensor is then distributed to devices according to the consumer distribution. To bridge these two steps, we introduce a virtual graph node that serves as an intermediate conceptual connector, enabling flexible mix-and-match between *Pull* and *Push* without adding any computation overhead to the final execution graph.

TABLE IV  
EXAMPLES OF MATCHED COLLECTIVE COMMUNICATIONS. THE SYMBOLIC TENSOR NOTATIONS ARE DEFINED IN SECTION IV-D.

| Producer Tensor Distribution | Matched Coll-Comm        | Consumer Tensor Distribution |
|------------------------------|--------------------------|------------------------------|
| $[B/dp, S, H@1/tp]$          | ReduceScatter            | $[B/dp, S, H/tp]$            |
| $[B/dp, S, H@1/tp]$          | AllToAll                 | $[B, S/dp, H@1/tp]$          |
| $[B/dp, S, H@1/tp]$          | AllGather                | $[B, S, H@1/tp]$             |
| $[B/dp, S, H@1/tp]$          | AllReduce                | $[B/dp, S, H]$               |
| $[B/dp, S, H@1/tp]$          | ReduceScatter + AllToAll | $[B/tp, S, H/dp]$            |
| $[B/dp, S, H@1/tp]$          | AllReduce + AllGather    | $[B, S, H]$                  |

For *Pull*, the process of reconstructing the complete tensor from different distributions is defined as follows:

- *Duplicated*: each device already holds a complete copy of the tensor. As a result, the head node does not need to communicate with other devices, and the step is *No Communication*.
- *Partition*: the tensor is divided into shards across devices. The head node gathers all shards from the devices and assembles the complete tensor through a process referred to as *Gather*.
- *PartialSum*: while similar to Partition, the aggregation involves summing the values across devices instead of concatenation. This operation is commonly known as *Reduce*.

On the other hand, for *Push*, the process of distributing the tensor to devices is described as follows:

- *Duplicated*: the tensor is replicated from the virtual head node to all devices using a *Broadcast* operation.
- *Partition*: each device receives its corresponding shard of the tensor through an operation called *Scatter*.
- *PartialSum*: Generally not used, as distributing a full tensor as partial sums is uncommon in practice.

Table IV summarizes the resulting communication patterns with examples of how the collective communication matcher maps producer and consumer distributions to collectives. By integrating a matching algorithm based on these pull-push principles, STAGE also identifies patterns that were previously overlooked but can arise from arbitrary tensor distribution schemes.
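As an illustration of the pull-push matching, the following minimal Python sketch fuses one *Pull* and one *Push* primitive per parallel axis into a collective. The `Kind` enum, `match_axis`, `match`, and the fusion rules (for example, treating Gather followed by Scatter on a different tensor dimension as AllToAll) are assumptions made for this example, not STAGE's actual implementation; they are chosen so that the sketch reproduces the rows of Table IV.

```python
from enum import Enum


class Kind(Enum):
    DUPLICATED = "duplicated"
    PARTITION = "partition"
    PARTIAL_SUM = "partial_sum"


# Pull/Push primitives per distribution kind, following the lists in the text.
PULL = {Kind.DUPLICATED: None, Kind.PARTITION: "Gather", Kind.PARTIAL_SUM: "Reduce"}
PUSH = {Kind.DUPLICATED: "Broadcast", Kind.PARTITION: "Scatter"}

# Assumed fusion of (Pull, Push) pairs into a single collective.
FUSED = {
    ("Gather", "Broadcast"): "AllGather",
    ("Gather", "Scatter"): "AllToAll",
    ("Reduce", "Broadcast"): "AllReduce",
    ("Reduce", "Scatter"): "ReduceScatter",
}


def match_axis(producer, consumer):
    """Match one parallel axis; each state is (Kind, sharded tensor dim or None)."""
    if producer == consumer:
        return None  # identical distribution on this axis: nothing to do
    consumer_kind = consumer[0]
    if consumer_kind is Kind.PARTIAL_SUM:
        raise ValueError("pushing into PartialSum is not supported in this sketch")
    pull, push = PULL[producer[0]], PUSH[consumer_kind]
    if pull is None:
        return None  # duplicated producer: each device keeps or slices its local copy
    return FUSED[(pull, push)]


def match(producer, consumer):
    """Return [(axis, collective)] for every axis whose distribution changes."""
    matched = []
    for axis in producer:
        collective = match_axis(producer[axis], consumer[axis])
        if collective is not None:
            matched.append((axis, collective))
    return matched


# Producer [B/dp, S, H@1/tp]: dim 0 sharded over dp, partial sums over tp.
producer = {"dp": (Kind.PARTITION, 0), "tp": (Kind.PARTIAL_SUM, None)}
# Consumer [B/tp, S, H/dp]: dim 0 sharded over tp, dim 2 sharded over dp.
consumer = {"dp": (Kind.PARTITION, 2), "tp": (Kind.PARTITION, 0)}
print(match(producer, consumer))
```

Treating each parallel axis independently is what lets composite rows such as ReduceScatter + AllToAll fall out of the same four fused rules.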

**3) Graph-Level Distributor:** Graph-level distribution plays a critical role in modeling parallel strategies, particularly pipeline parallelism. Unlike tensor-level distribution, which distributes individual operators, graph-level distribution divides the compute graph into multiple subgraphs and assigns these subgraphs to different devices.

In STAGE, a graph distribution is defined by multiple lists of nodes, where each list contains the nodes of one subgraph. Furthermore, for specific parallel strategies such as pipeline parallelism, we predefine a rule-based script that partitions the workload into multiple stages by dividing the model evenly by layer.

By partitioning the graph into subgraphs, we create some cross-graph edges, which indicate where the tensor moves from one machine to another. STAGE inserts send/recv pairs by identifying the rank of the source and destination nodes on each side of cross-graph edges.
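The graph-level distribution can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the helper names `partition_layers` and `insert_send_recv`, the node naming scheme, and the choice to give the remainder layers to the earliest stages are all hypothetical and not taken from STAGE.

```python
def partition_layers(num_layers, num_stages):
    """Split layer indices 0..num_layers-1 into contiguous, near-equal stages."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)  # earlier stages absorb the remainder
        stages.append(list(range(start, start + size)))
        start += size
    return stages


def insert_send_recv(edges, stage_of):
    """Rewrite every cross-stage edge (src, dst) as src -> send -> recv -> dst."""
    rewritten = []
    for src, dst in edges:
        src_rank, dst_rank = stage_of[src], stage_of[dst]
        if src_rank == dst_rank:
            rewritten.append((src, dst))  # edge stays inside one subgraph
            continue
        send = f"send[{src}->{dst}]@rank{src_rank}"
        recv = f"recv[{src}->{dst}]@rank{dst_rank}"
        rewritten += [(src, send), (send, recv), (recv, dst)]
    return rewritten


stages = partition_layers(num_layers=8, num_stages=4)
stage_of = {layer: rank for rank, layers in enumerate(stages) for layer in layers}
edges = [(i, i + 1) for i in range(7)]  # a simple chain of eight layers
graph = insert_send_recv(edges, stage_of)
```

In this toy chain, only the three edges that cross a stage boundary are rewritten; all other edges are left untouched.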

#### E. Graph Instantiation: Symbolic to Numeric Conversion

At the final stage of the STAGE pipeline, the STG is transformed into fully instantiated execution graphs. In this step, STAGE replaces symbolic tensor shapes, operations, and communication patterns with concrete numeric values (such as batch size, sequence length, or hidden size), producing a detailed per-node representation of tensor sizes, communication volumes, and operator types. Once specified, these values are automatically propagated through the STG, resulting in a complete and consistent execution graph.
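To make the symbolic-to-numeric step concrete, here is a minimal Python sketch that instantiates a symbolic shape such as $[B/dp, S, H/tp]$. The shape encoding (a list of `(size_symbol, shard_symbols)` pairs), the helper names, the numeric values, and the 2-byte element size are assumptions for illustration only; STAGE's internal representation may differ.

```python
from math import prod


def instantiate_shape(symbolic_shape, values):
    """Replace symbols with numbers; each dim is (size_symbol, shard_symbols)."""
    dims = []
    for size_symbol, shard_symbols in symbolic_shape:
        size = values[size_symbol]
        shards = prod(values[s] for s in shard_symbols)  # prod(()) == 1 for unsharded dims
        if size % shards != 0:
            raise ValueError(f"{size_symbol}={size} is not divisible by {shards}")
        dims.append(size // shards)
    return dims


def tensor_bytes(shape, bytes_per_element=2):
    """Per-device tensor size, assuming 2-byte (e.g., 16-bit) elements."""
    return prod(shape) * bytes_per_element


# Symbolic activation [B/dp, S, H/tp] with hypothetical numeric values.
activation = [("B", ("dp",)), ("S", ()), ("H", ("tp",))]
values = {"B": 128, "S": 2048, "H": 4096, "dp": 8, "tp": 8}

shape = instantiate_shape(activation, values)  # [16, 2048, 512]
size = tensor_bytes(shape)                     # 33554432 bytes, i.e., 32 MiB
```

Because the graph stores only symbols, re-running the same substitution with a different `values` dictionary yields a new configuration without rebuilding the graph.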

For advanced use cases, STAGE also supports plugging in real-world values collected from profiling tools such as the PyTorch profiler or Kineto [50]. These real values can be selectively injected into the symbolic graph to guide the instantiation process, enabling hybrid scenarios where partial traces are extended or scaled. This feature allows users to maintain high fidelity to real system behaviors while still benefiting from the scalability of symbolic modeling.

By separating graph construction from value instantiation, STAGE offers both scalability and adaptability, supporting systematic simulation and analysis workflows across a broad design space.

## V. VALIDATION

To ensure the fidelity of STAGE-generated workloads, we conducted a comprehensive comparison with real ETs.

### A. Methodology

Execution traces were collected from a system equipped with 128 NVIDIA H100 GPUs (SXM5) across 16 servers, each hosting 8 GPUs. The system was configured with NVIDIA NeMo 24.07, CUDA 12.5, and PyTorch 2.5.0. Each server was powered by dual Intel Sapphire Rapids CPUs (32-core, 2.8 GHz) and DDR5 DRAM. We modified the NVIDIA NeMo [44] framework to integrate PyTorch’s profiling features and enable Chakra trace collection [40]. This setup employed the CUDA Profiling Tools Interface (CUPTI) [43] to capture kernel execution timelines and operator-level activity, offering detailed insights into computational and communication operations as well as device-memory usage. For validation, we focused on three aspects: (1) peak device-memory usage, (2) computation operators and volume, and (3) communication operators and volume.

### B. Memory Footprint Validation

For memory-footprint validation, we fed STAGE-synthesized graphs to ASTRA-sim [66], extending its native Chakra-format workload feeder<sup>‡</sup> to track memory usage over the simulation lifetime. Our modifications enable ASTRA-sim to utilize tensor metadata (e.g., name, size) from STAGE graphs when generating tensor read/write events. These events are then post-processed to determine each tensor’s lifetime, from creation to last use, assuming garbage collection immediately thereafter.

<sup>‡</sup>ASTRA-sim natively consumes execution traces in the Chakra format.

TABLE V  
PEAK PER-GPU MEMORY ANALYSIS.

| Model        | Hardware         | Parallelization  | Measured | Synthesized | Error Rate* |
|--------------|------------------|------------------|----------|-------------|-------------|
| GPT-3 5B     | one 8-H100-HGX   | FSDP=8           | 18.1 GB  | 16.1 GB     | 5.5%        |
| GPT-3 5B     | one 8-H100-HGX   | TP=8             | 15.4 GB  | 13.7 GB     | 4.5%        |
| GPT-3 5B     | one 8-H100-HGX   | PP=8             | 17.5 GB  | 15.2 GB     | 7.4%        |
| GPT-3 175B   | four 8-H100-HGX  | TP=32            | 118.9 GB | 115.2 GB    | 2.3%        |
| LLaMA-3 70B  | two 8-H100-HGX   | TP=16            | 94.3 GB  | 92.1 GB     | 1.3%        |
| Mixtral8x7B  | eight 8-H100-HGX | TP=4, EP=8, PP=4 | 15.8 GB  | 16.07 GB    | 1.7%        |
| Mixtral8x7B  | four 8-H100-HGX  | EP=8, PP=4       | 56.8 GB  | 58.55 GB    | 3.0%        |
| Mixtral-144E | four 8-H100-HGX  | EP=8, TP=2, DP=2 | 26.6 GB  | 27.4 GB     | 2.9%        |

\*We remove the CUDA initialization footprint for error estimate.

Table V compares per-device peak memory usage across different hardware configurations, models, and parallelization strategies, using both measured traces and STAGE-synthesized execution graphs. On average, the simulated peak memory usage is about 2GB lower than the measured value. This discrepancy primarily arises from PyTorch’s CUDA initialization, which consumes roughly 1GB of VRAM, and from delays in actual tensor garbage collection. After excluding this initialization overhead, the memory footprint predicted by STAGE accounts for approximately 97% of the measured footprint on average. This error is within acceptable bounds for our targeted large-scale simulations, and the error rate decreases as model size increases (Table V).
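The lifetime-based peak computation used for the synthesized numbers can be sketched as a simple sweep over allocation and free events. This is a minimal illustration, not the modified ASTRA-sim feeder: the `(size, first_use, last_use)` tuple format and the function name `peak_memory` are assumptions, and frees are applied right after the last use, mirroring the immediate garbage collection assumed in the text.

```python
def peak_memory(tensors):
    """Peak live bytes given (size_bytes, first_use_time, last_use_time) per tensor."""
    events = []
    for size, first_use, last_use in tensors:
        events.append((first_use, 0, size))   # allocate at creation
        events.append((last_use, 1, -size))   # free immediately after the last use
    # At equal timestamps, allocations (tag 0) sort before frees (tag 1),
    # so a tensor is still counted as live during its last use.
    events.sort()
    live = peak = 0
    for _, _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak


# Three hypothetical tensors: (size in bytes, creation time, last-use time).
trace = [(4, 0, 2), (2, 1, 3), (8, 2, 4)]
print(peak_memory(trace))  # 14: all three tensors are live at time 2
```

A real allocator frees memory later than this idealized model, which is consistent with the synthesized values tending to sit below the measured ones for the dense models in Table V.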

### C. Compute Validation

We categorize compute into four types: (1) *GeMM*, (2) *Attention*, (3) *ElementWise*, and (4) *Other*.

**Operator Significance.** Fig. 7 shows the runtime of each operator category for the different instances we measured on real systems. *GeMMs* and *Attentions* contribute the most to overall performance, followed by *ElementWise* for some models. *Others* primarily includes computations for data transformation and memory management, and is negligible in terms of relative performance cost (less than 1% on average, and at most 5% in one case).

**Operator Counts.** Table VI compares the operation counts per GPU during a single training epoch. The results show strong alignment between the real-world traces and STAGE-synthesized workloads (4.6% error rate for a large model like GPT3-175B) for the performance-heavy compute operators such as *GeMMs* and *Attentions*. For *Other* operators, STAGE provides two data transformation ops, transpose and reshape, which are synthesized automatically based on the compute shapes. PyTorch, on the other hand, includes several additional operators for memory allocation and data transfers that show up in the real traces. These lead to a mismatch in raw counts, but they have negligible impact on overall runtime (Fig. 7) and are highly vendor-specific, so we decided not to include them in STAGE (though they can be easily added). Other mismatches, such as those for *ElementWise* operators, arise mostly because these operators are usually fused together, or with other compute operators, in the real traces. This is highly device/vendor-specific information that gets embedded in real traces, while STAGE has the

TABLE VI  
NUMBER OF KEY OPERATION OCCURRENCES PER GPU FOR A SINGLE EPOCH (MEASURED / SYNTHESIZED)

| Model       | GPUs | Parallelization              | Micro Batch / Batch | Computation     |               |                 |                  | Communication  |                 |               |                 |                 |
|-------------|------|------------------------------|---------------------|-----------------|---------------|-----------------|------------------|----------------|-----------------|---------------|-----------------|-----------------|
|             |      |                              |                     | GeMM            | Attn          | ElementWise     | Others           | P2P            | AllReduce       | All2All       | AllGather       | ReduceScatter   |
| GPT-3.5B    | 8    | TP=8, w/ SP                  | 1 / 128             | 37,248 / 37,632 | 6,144 / 6,144 | 24,832 / 28,160 | 165,207 / 21,504 | 0 / 0          | 514 / 256       | 0 / 0         | 18,816 / 18,432 | 12,416 / 12,288 |
|             | 8    | PP=8                         | 1 / 128             | 4,608 / 4,992   | 768 / 768     | 3,072 / 3,584   | 21,475 / 2,688   | 136 / 128      | 2 / 0           | 0 / 0         | 0 / 0           | 0 / 0           |
|             | 8    | FSDP=8                       | 8 / 128             | 4,656 / 4,704   | 768 / 768     | 3,104 / 3,520   | 9,400 / 2,688    | 0 / 0          | 177 / 32        | 0 / 0         | 784 / 768       | 400 / 384       |
| GPT-3.175B  | 32   | TP=32 w/ SP                  | 1 / 128             | 36,960 / 37,056 | 6,144 / 6,144 | 24,640 / 27,776 | 88,573 / 21,504  | 0 / 0          | 130 / 64        | 0 / 0         | 18,528 / 18,432 | 12,320 / 12,288 |
|             | 64   | TP = 4, DP = 2, PP = 8, w/SP | 1 / 128             | 9,216 / 9,408   | 1,536 / 1,536 | 3,072 / 6,976   | 26,615 / 5,376   | 67 / 64        | 67 / 896        | 0 / 0         | 4,683 / 4,608   | 3,078 / 3,072   |
|             | 8    | TP=4, PP=2                   | 1 / 32              | 24,576 / 24,960 | 4,096 / 4,096 | 12,288 / 18,560 | 46,080 / 14,336  | 130 / 128      | 2 / 128         | 0 / 0         | 12,416 / 12,288 | 8,320 / 8,192   |
| Llama-3 70B | 8    | TP=8                         | 1 / 128             | 49,536 / 49,920 | 8,192 / 8,192 | 24,832 / 37,376 | 85,634 / 28,672  | 0 / 0          | 16,897 / 16,640 | 0 / 0         | 0 / 128         | 0 / 0           |
|             | 16   | TP=4, PP=2, DP=2             | 1 / 128             | 12,288 / 12,480 | 2,048 / 2,048 | 6,144 / 9,280   | 23,054 / 7,168   | 65 / 64        | 4,161 / 5,248   | 0 / 0         | 16 / 0          | 16 / 0          |
| Mixtral 8x7 | 128  | TP=8, PP=4                   | 1 / 128             | 1,920 / 1,968   | 256 / 256     | 512 / 1,680     | 10,801 / 2,688   | 16 / 16        | 131 / 16        | 512 / 512     | 659 / 1152      | 531 / 512       |
|             | 32   | EP=8, PP=4                   | 1 / 128             | 1,920 / 1,968   | 256 / 256     | 512 / 1,680     | 11,028 / 2,688   | 16 / 16        | 2 / 0           | 512 / 512     | 9 / 0           | 9 / 0           |
| DeepSeek 8E | 8    | EP=8                         | 1 / 128             | 27,456 / 25,632 | 896 / 896     | 6,784 / 896     | 46,938 / 21,952  | 0 / 0          | 18 / 0          | 1,728 / 1,792 | 23 / 1,344      | 23 / 0          |
|             | 32   | EP=8, TP=2, DP=2             | 1 / 128             | 13,116 / 12,908 | 224 / 224     | 1,443 / 1,384   | 22,567 / 14,356  | 0 / 0          | 131 / 114       | 448 / 448     | 576 / 562       | 458 / 450       |

TABLE VII  
COMMUNICATION BREAKDOWN PER GPU FOR A SINGLE EPOCH (MEASURED / SYNTHESIZED)

| Model            | GPUs | Parallelization         | Micro Batch / Batch | Communication Volume (MB) |                       |                         |                         |                         | Total Error |  |  |
|------------------|------|-------------------------|---------------------|---------------------------|-----------------------|-------------------------|-------------------------|-------------------------|-------------|--|--|
|                  |      |                         |                     | Send                      | Receive               | AllReduce               | AllGather               | ReduceScatter           |             |  |  |
| GPT-3.5B         | 8    | TP=8, w/ SP             | 1 / 128             | 0.000 / 0.000             | 0.000 / 0.000         | 1075,126 / 1073,742     | 19730,006 / 19327,353   | 104152,957 / 103079,215 | 0.23%       |  |  |
|                  | 8    | PP=8                    | 1 / 128             | 1073,742 / 1073,742       | 1073,742 / 1073,742   | 206,045 / 206,045       | 0.000 / 0.000           | 0.000 / 0.000           | 0.000%      |  |  |
|                  | 8    | FSDP=8                  | 8 / 128             | 0.000 / 0.000             | 0.000 / 0.000         | 0.000 / 0.000           | 19761,349 / 20401,095   | 80760,930 / 78383,153   | 0.346%      |  |  |
| GPT-3.175B       | 32   | TP=32, w/ SP            | 1 / 128             | 0.000 / 0.000             | 0.000 / 0.000         | 812,605 / 805,306       | 14571,012 / 14495,515   | 310042,952 / 309237,645 | 0.055%      |  |  |
|                  | 64   | TP=4, DP=2, PP=8, w/ SP | 1 / 128             | 13287,555 / 13287,555     | 13287,555 / 13287,555 | 1767,211 / 1384,120     | 29393,682 / 28991,029   | 77309,411 / 77309,411   | 0.043%      |  |  |
| LLaMA-3 70B      | 8    | TP=4, DP=2, PP=8, w/ SP | 1 / 128             | 1073,742 / 1073,742       | 1073,742 / 1073,742   | 0.000 / 0.000           | 104152,957 / 103210,287 | 279172,874 / 275008,979 | 0.265%      |  |  |
|                  | 8    | TP=8                    | 1 / 128             | 0.000 / 0.000             | 0.000 / 0.000         | 558315,340 / 587068,342 | 0.000 / 0.000           | 0.000 / 0.000           | 0.985%      |  |  |
| Mixtral 8x7      | 16   | TP=4, DP=2, PP=2        | 1 / 128             | 2147,484 / 2147,484       | 2147,484 / 2147,484   | 139552,883 / 138425,733 | 0.000 / 0.000           | 0.000 / 0.000           | 2.980%      |  |  |
|                  | 128  | TP=4, EP=8, PP=4        | 1 / 128             | 4496,293 / 4362,076       | 4496,293 / 4362,076   | 0.329 / 16,384*         | 3825,205 / 3590,357     | 13153,337 / 17716,740   | 2.755%      |  |  |
| DeepSeek-MoE 64E | 8    | EP=8                    | 1 / 128             | 44767,773 / 45097,157     | 44836,544 / 45097,157 | 0.000 / 0.000           | 142,359 / 3758,785*     | 1138,870 / 1138,870     | 0.945%      |  |  |
|                  | 32   | EP=8, TP=2, DP=2        | 1 / 128             | 1981,809 / 1961,082       | 1981,809 / 1961,082   | 8,356 / 16,662          | 1720,713 / 1814,284     | 2954,887 / 3025,804     | 1.501%      |  |  |

\*Attn here is the fused flash-attention kernel, which includes multiple GEMMs and activation.

flexibility to implement arbitrary fusion strategies, if that is required by the user.

**Compute Time.** To evaluate the compute time of each operator, we use a compute model that combines a look-up table of benchmarked operators with a roofline model calibrated on those benchmarks. Because the model is built from real-system benchmarking and profiling, its high accuracy is limited to the benchmarked system; this is sufficient for our validation. Table VIII compares the total compute time for each operator category per GPU during a single training epoch. For most workloads, the error rate ranges from 0.3% to 15.0%, with an average error rate of 4.25% across all cases. For reference, Calculon [27] achieves an average error of 3.65% across only 4 model architectures and does not report a detailed per-operator breakdown. MADMAX [25] reports a 15.34% error on Llama-70B, which is higher than our average error of 7.7% across 3 different setups for Llama-3 70B. In comparison, STAGE covers more models with better accuracy, because it uses a graph-based representation and models workloads at a finer granularity.
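A compute model of this kind can be sketched as a table lookup with a roofline fallback. The sketch below is illustrative only: the function names, the lookup key format, the hardware numbers, and the single `efficiency` calibration factor are hypothetical, and the calibration actually used for the validation may be more detailed.

```python
def roofline_time(flops, bytes_moved, peak_flops, mem_bandwidth, efficiency=1.0):
    """Roofline estimate in seconds: the slower of the compute and memory bounds."""
    compute_time = flops / (peak_flops * efficiency)  # efficiency = calibration factor
    memory_time = bytes_moved / mem_bandwidth
    return max(compute_time, memory_time)


def operator_time(key, flops, bytes_moved, lookup_table, hardware):
    """Prefer a benchmarked entry; fall back to the calibrated roofline model."""
    if key in lookup_table:
        return lookup_table[key]
    return roofline_time(flops, bytes_moved, **hardware)


# Hypothetical hardware description and benchmark table (not measured values).
hardware = {"peak_flops": 1.0e15, "mem_bandwidth": 3.0e12, "efficiency": 0.7}
lookup_table = {("gemm", 4096, 4096, 4096): 2.1e-4}

m = n = k = 8192
gemm_flops = 2 * m * n * k                # multiply-accumulate counted as 2 FLOPs
gemm_bytes = 2 * (m * k + k * n + m * n)  # 2-byte elements, each matrix touched once
estimate = operator_time(("gemm", m, n, k), gemm_flops, gemm_bytes, lookup_table, hardware)
```

Benchmarked shapes return their measured time directly, while unseen shapes are extrapolated from the roofline bounds.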

### D. Communication Validation

We categorize communication into five types: point-to-point (P2P), AllReduce, All2All, AllGather, and ReduceScatter.

**Operator Counts.** From Table VI, we observe that the total number of communication operators shows a 97.1% match, on average, between measured and synthesized traces. STAGE tends to generate slightly more operators in some configurations. After inspection, we found that most of these differences are caused by communication fusion, where the real system may combine two communications of the same type to reduce initialization cost. This mismatch disappears in the communication volume validation that follows, because fusion affects neither the communication volume nor the overall runtime.

**Communication Volume.** Table VII compares the communication volume for each communication type. Because CUDA implements *AllToAll* by decomposing it into *Sends* and *Receives*, and the Kineto trace only records the communication volume of the decomposed operations, we also decompose STAGE's *AllToAll* in the table for better alignment. The results show strong alignment between the real-world traces and STAGE-synthesized workloads in terms of communication volumes,

TABLE VIII  
OPERATOR TOTAL COMPUTE TIME [MS] (MEASURED / SYNTHESIZED)

| Model             | GPUs | Parallelization              | Micro Batch / Batch | GeMM                  | Attn*               | ElementWise         | Others            | Total Error |           |         |           |
|-------------------|------|------------------------------|---------------------|-----------------------|---------------------|---------------------|-------------------|-------------|-----------|---------|-----------|
| GPT-3.5B          | 8    | TP=8, w/ SP                  | 1 / 128             | 2187,038 / 2060,398   | 210,781 / 197,372   | 106,930 / 96,945    | 50,706 / 44,567   | 6.7%        |           |         |           |
|                   | 8    | PP=8                         | 1 / 128             | 1307,873 / 1413,572   | 183,963 / 197,372   | 96,951 / 100,088    | 88,020 / 66,955   | 5.9%        |           |         |           |
|                   | 8    | FSDP=8                       | 8 / 128             | 1834,065 / 1771,409   | 432,135 / 432,135   | 182,106 / 173,221   | 144,921 / 101,096 | 4.6%        |           |         |           |
| GPT-3.175B        | 32   | TP=32 w/ SP                  | 1 / 128             | 3719,369 / 3690,618   | 444,097 / 444,097   | 164,995 / 173,138   | 164,294 / 109,184 | 1.7%        |           |         |           |
|                   | 64   | TP = 4, DP = 2, PP = 8, w/SP | 1 / 128             | 6697,358 / 6685,758   | 266,667 / 266,667   | 61,357 / 116,860    | 224,446 / 155,882 | 0.3%        |           |         |           |
| Llama-3           | 8    | TP=4, PP=2                   | 1 / 32              | 8913,064 / 8775,129   | 4401,427 / 4399,241 | 524,799 / 487,029   | 343,963 / 281,810 | 1.7%        |           |         |           |
|                   | 8    | TP=8                         | 1 / 128             | 12156,512 / 10993,011 | 5126,283 / 5126,283 | 1896,772 / 1810,958 | 599,802 / 435,314 | 7.4%        |           |         |           |
| Mixtral 8x7       | 16   | TP=4, PP=2, DP=2             | 1 / 128             | 4222,051 / 3635,444   | 2197,434 / 1922,335 | 540,688 / 508,906   | 172,728 / 122,494 | 14.2%       |           |         |           |
|                   | 128  | TP=4, EP=8, PP=4             | 1 / 128             | 444,677 / 508,821     | 43,585 / 43,585     | 222,835 / 197,003   | 47,692 / 32,516   | 3%          |           |         |           |
| DeepSeek-MoE 64E  | 8    | EP=8                         | 1 / 128             | 1688,050 / 1967,291   | 266,427 / 266,427   | 182,059 / 184,090   | 165,123 / 120,834 | 9.8%        |           |         |           |
|                   | 32   | EP=8, TP=2, DP=2             | 1 / 128             | 1015,328 / 1213,253   | 89,478 / 89,478     | 111,846 / 152,674   | 182,876 / 171,558 | 15.0%       |           |         |           |
| DeepSeek-MoE 144E | 32   | EP=8, TP=2, DP=2             | 1 / 128             | 136,410 / 152,649     | 13,192 / 13,192     | 19,526 / 26,035     | 38,589 / 34,914   | 8.8%        |           |         |           |

which is necessary for modeling the distributed behavior.

## VI. EVALUATION

We present a suite of design space exploration (DSE) case studies to showcase the value of STAGE for co-design. Unless specified otherwise, all experiments use the ASTRA-sim [66] simulator to model diverse systems.

### A. Impact of Parallelism Strategies

We demonstrate how STAGE can be used to explore the complex design space of parallelization strategies and model optimization techniques, and we highlight some observations. These case studies are not intended to be comprehensive; they can be extended for the deeper research that STAGE enables.

**1) Observation: Different models prefer different parallelism strategies:** This experiment demonstrates that no single parallelism strategy fits all models; each model and system may prefer different strategies. This underscores the need for STAGE to generate and explore diverse parallel strategies.

To make a comparison, we simulate a system with 64 H100s connected in an 8x8 NVLink+IB network, and we run DSE on two different model setups: (1) a large model with a small batch, PaLM-540B [11] with batch=64, and (2) a small model with a large batch, LLaMa3.2-1B [20] with batch=2048. Fig. 8a and Fig. 8b show the peak memory usage versus overall runtime for each setup. The data point shapes indicate whether weight sharding is applied, colors denote DP/TP/CP configurations, and the pipeline parallelism (PP) degree is calculated as $pp = \text{GPUs}/(dp \cdot tp \cdot cp)$, where a higher $pp$ is shown as a darker data point.
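The design space swept in such an experiment can be enumerated directly from the relation between the PP degree and the DP/TP/CP degrees. The following Python sketch is one way to do so; restricting each degree to powers of two and the function name `enumerate_strategies` are assumptions made for illustration.

```python
from itertools import product


def enumerate_strategies(num_gpus, degrees=(1, 2, 4, 8, 16, 32, 64)):
    """List every (dp, tp, cp) combination and derive pp = GPUs / (dp * tp * cp)."""
    strategies = []
    for dp, tp, cp in product(degrees, repeat=3):
        used = dp * tp * cp
        if num_gpus % used == 0:  # pp must be a whole number of stages
            strategies.append({"dp": dp, "tp": tp, "cp": cp, "pp": num_gpus // used})
    return strategies


strategies = enumerate_strategies(64)
print(len(strategies))  # 84 candidate configurations for 64 GPUs
```

Each candidate, combined with a weight-sharding on/off flag, would correspond to one data point in a plot like Fig. 8.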

For the small-batch-large-model case, we can observe two things: (i) Configurations with higher data parallelism are faster but require more memory, while tensor-parallel configurations are slower but require less memory, presenting a trade-off between runtime and memory capacity. (ii) Weight sharding can greatly reduce the memory footprint, at the cost of a little extra time.

For the small-model-large-batch case, we can see a clear distinction from the previous case. (i) Memory footprint and runtime no longer appear to be a trade-off, as data parallelism can achieve both lower runtime and a smaller memory footprint. (ii) Weight sharding has less effect, because there are few large weights left that are worth sharding.

From the two cases we can see that different models and training setups can result in different preferences in parallel strategies, and real-world cases can be much more complicated. Fig. 8c shows the results of running a LLaMa-70B model with batch=1024 on a 1024 H100 system, which combines features of both previous cases. First, we observe the effect of weight sharding on reducing the memory footprint. Second, the best strategies for reducing the memory footprint are not pure data/tensor/context/pipeline parallelism, but more likely a mixture of different parallel strategies, shown as mixed colors at the bottom of the figure. Third, data parallelism is still the best strategy for reducing overall runtime. However, this is only achievable when memory capacity is sufficient. In our case, the high-DP strategies are achievable on both the 80GB and 40GB versions of H100. However, if we further constrain the memory capacity to 24GB, the optimal choice becomes a composite parallel strategy ($dp=64$, $tp=4$, $cp=4$, w/ FSDP).

**2) Observation: Different hardware prefers different parallelism strategies:** Fig. 9 presents results for various parallel strategies under different hardware configurations. We fix the network topology to an 8x8 2D torus, varying the bandwidth distribution across each dimension as well as the HBM capacity. The total bandwidth of each GPU is constrained to be the same for all setups. From the figure, we observe that under certain hardware configurations, the optimal parallel strategy shifts from pure data parallelism to hybrid strategies due to hardware constraints. This highlights the necessity of DSE, enabled by STAGE, for achieving optimal deployment on specific systems.

**3) Observation: More communication might not mean more runtime:** The overlap between communication and compute is also important. From the previous DSE experiments we see that FSDP can greatly reduce the memory footprint in some cases while not impacting runtime much. However, with FSDP the weights are re-assembled every time they are used. One would expect this to cause a large amount of communication and greatly increase the overall runtime, which is at odds with our results.

To investigate, Fig. 10 visualizes the overlap ratio between compute and communication versus the overall runtime. Dashed lines pair the data points with the same parallel degrees, with and without weight sharding. From the figure we can see that in most cases where FSDP has an effect, the compute-communication overlap increases, which indicates that the extra communication introduced by FSDP can be hidden behind compute. Furthermore, we observe that the runtime actually decreases in most cases, possibly because the weight optimizer is also sharded and distributed, so each node has less compute to do.
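The text does not spell out how the overlap ratio is defined. One plausible definition, assumed here purely for illustration, is the fraction of communication time during which some compute kernel is also running on the same device. Under that assumption, and with the hypothetical helper names `merge` and `overlap_ratio`, it can be computed from per-device timelines as follows:

```python
def merge(intervals):
    """Merge overlapping (start, end) intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def overlap_ratio(compute, comm):
    """Fraction of communication time that is hidden behind compute (assumed definition)."""
    compute, comm = merge(compute), merge(comm)
    total_comm = sum(end - start for start, end in comm)
    if total_comm == 0:
        return 0.0
    hidden, i, j = 0.0, 0, 0
    while i < len(compute) and j < len(comm):
        lo = max(compute[i][0], comm[j][0])
        hi = min(compute[i][1], comm[j][1])
        if hi > lo:
            hidden += hi - lo
        # Advance whichever interval finishes first.
        if compute[i][1] < comm[j][1]:
            i += 1
        else:
            j += 1
    return hidden / total_comm


# Communication runs from t=5 to t=15; compute covers t=0 to t=10.
print(overlap_ratio(compute=[(0, 10)], comm=[(5, 15)]))  # 0.5
```

A ratio close to 1 means that nearly all communication is hidden, which is the regime in which added FSDP traffic would not lengthen the runtime.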

**4) Observation: Activation recomputation offers a promising trade-off:** For a specific model and parallel strategy, we can use STAGE to generate workloads with and without activation recomputation [20], [32]. For LLaMa-7B with batch=1, TP=8, and SP, Fig. 11 demonstrates that activation recomputation reduces peak memory but increases runtime. This lowered memory footprint could potentially enable an increased number of data-parallel dimensions, which could be advantageous according to the previous analysis.



Fig. 7. Timing breakdown for different operators



Fig. 8. Peak Memory Usage vs Runtime across configurations.



Fig. 9. Runtime on different HBM capacity and Network Bandwidth. Llama70B @ 64 × H100



Fig. 10. Compute-Comms Overlap vs Runtime, PaLM-540B @ 64 H100

### B. Scalability Studies with STAGE

Next, we show how STAGE can be used for scalability studies on distributed ML systems, particularly how the communications are involved when we scale up the system with different parallel strategies.

**Target System Setup:** The simulated system is constructed using NVIDIA DGX systems, where each box contains 8 H100 GPUs interconnected via NVLink. Sixteen boxes form a pod interconnected through a local ring, and multiple pods are connected via a ring topology. Our experiments span system sizes from 512 GPUs up to 16K GPUs.<sup>§</sup>

<sup>§</sup>Because we target large-scale systems, the native ASTRA-sim simulation triggers an out-of-memory failure. We enhanced the workload feeder to offload trace processing to the hard drive with caching.



Fig. 11. Memory w/ and w/o Activation Recomputation



Fig. 12. Time Breakdown for Scalability Studies

**Study1 - Data Parallel:** We analyze how data parallelism impacts performance with a fixed microbatch size per GPU (i.e., *weak scaling*), simulating scenarios where batch size is scaled out for more stable convergence and improved training. Using LLaMA-70B with PP=4, we keep the per-GPU batch size at 8 and scale DP. Fig. 12a presents the breakdown of computation and communication times. As expected, compute time stays constant because the per-device batch size is fixed, and it contributes only a minor share of the overall runtime. As DP scales, communication overhead increases and eventually converges, matching the behavior of data-parallel ring all-reduce.
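The converging behavior follows from the standard bandwidth-cost model of ring all-reduce, in which each rank sends $2(N-1)$ chunks of size $S/N$. A minimal Python sketch is given below; the message size, link bandwidth, and the optional per-step latency term are hypothetical values and assumptions for illustration, not parameters of the simulated system.

```python
def ring_allreduce_time(num_ranks, message_bytes, bandwidth, step_latency=0.0):
    """Standard ring all-reduce cost: 2(N-1) steps, each moving a chunk of S/N bytes."""
    if num_ranks == 1:
        return 0.0
    steps = 2 * (num_ranks - 1)        # reduce-scatter phase + all-gather phase
    chunk = message_bytes / num_ranks
    return steps * (step_latency + chunk / bandwidth)


message_bytes = 1e9   # hypothetical gradient size in bytes
bandwidth = 1e11      # hypothetical link bandwidth in bytes/s
for dp in (2, 8, 64, 1024):
    print(dp, ring_allreduce_time(dp, message_bytes, bandwidth))
# With step_latency=0, the time rises toward, but never reaches,
# 2 * message_bytes / bandwidth = 0.02 s.
```

With a non-zero `step_latency`, the cost would instead keep growing linearly in the number of ranks, so the plateau appears only when the bandwidth term dominates.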

**Study2 - Tensor Parallel:** We evaluate tensor parallelism's impact on training PaLM-540B [11] (DP=4, CP=4, micro-batch=256), scaling TP w/ SP from 4 to 1024 GPUs to simulate faster training (i.e., *strong scaling*). As shown in Fig. 12b, compute time decreases with more GPUs, while communication time remains nearly constant. This is because tensor parallelism with sequence parallelism mainly uses ring reduce-scatter. As the TP degree grows, group size and communication steps increase, but per-device communication volume decreases, keeping total communication time stable. Furthermore, compute time reductions taper off at scale, causing scalability to plateau, especially beyond 2048 GPUs.

TABLE IX  
DECODE AND PREFILLING PERFORMANCE ACROSS DIFFERENT EP CONFIGURATIONS.

| Phase          | Decode  |         |         | Prefilling |           |           |
|----------------|---------|---------|---------|------------|-----------|-----------|
| Cluster Size   | 36      | 72      | 144     | 36         | 72        | 144       |
| Batch Size     | 512     | 1024    | 2048    | 512        | 1024      | 2048      |
| # Tokens       | 512     | 1024    | 2048    | 524,288    | 1,048,576 | 2,097,152 |
| Step Time (ms) | 227.483 | 187.483 | 163.681 | 2051.994   | 2866.145  | 3723.360  |
| Throughput*    | 62.520  | 75.859  | 86.890  | 7097.270   | 5081.235  | 3911.401  |

\*Throughput here is number of tokens processed per second, per GPU.

### C. Real System Application Study: Deepseek-R1 Inference System

In this section, we demonstrate STAGE's ability to model a real-world LLM application, the Deepseek-R1 Inference System [13]. This system introduces a prefilling-decoding disaggregation architecture. The key point is that the decoding and prefilling phases behave very differently and call for different parallelism strategies for optimal performance.

In our experiment, we assume a system of 144 GPUs, which can be partitioned as 4 clusters of 36 GPUs, 2 clusters of 72 GPUs, or a single cluster of 144 GPUs. Within each cluster we employ pure expert-parallelism for MoE layers, and pure data-parallelism for the remaining layers. We assume that the total batch size, aggregated over all clusters, is fixed at 2048. Table IX reports the decoding and prefilling performance under different EP configurations.

From a performance-efficiency perspective, higher throughput is preferable. The table indicates that prefilling generally benefits from a lower degree of EP, as it typically involves sufficient sequence length and batch size, making it compute-bound; reducing the EP degree also helps lower all-to-all communication overhead. In contrast, decoding involves short sequences per step and thus benefits from larger effective batch sizes, which favor larger clusters and higher EP degrees to maximize throughput.
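The throughput row of Table IX follows directly from the token counts, step times, and cluster sizes in the same table. The short Python sketch below recomputes it (tokens per second, per GPU); the helper name and the dictionary layout are ours, while the numbers are copied from Table IX.

```python
def tokens_per_second_per_gpu(tokens, step_time_ms, gpus):
    """Per-GPU throughput: tokens processed in one step, per second, per GPU."""
    return tokens / (step_time_ms / 1000.0) / gpus


# (cluster size, tokens per step, step time in ms), taken from Table IX.
table_ix = {
    "decode": [(36, 512, 227.483), (72, 1024, 187.483), (144, 2048, 163.681)],
    "prefill": [(36, 524_288, 2051.994), (72, 1_048_576, 2866.145), (144, 2_097_152, 3723.360)],
}

throughput = {
    phase: [tokens_per_second_per_gpu(tokens, ms, gpus) for gpus, tokens, ms in rows]
    for phase, rows in table_ix.items()
}
for phase, values in throughput.items():
    print(phase, [round(v, 3) for v in values])
```

The recomputed values agree with the Throughput row up to rounding and make the opposite trends explicit: per-GPU decode throughput rises with cluster size, while per-GPU prefill throughput falls.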

### D. STAGE performance

We now evaluate the performance of STAGE, including runtime and memory footprint, for workload generation at different scales. We show that STAGE can greatly shorten the time needed to obtain these graph workloads for simulation. We ran the tests on a Linux server with 4x Intel Xeon E7-8880v4 @ 2.2GHz and 354 GiB of DDR3 @ 1333MHz memory.

We evaluated STAGE across a wide range of GPU scales to assess how generation time grows with model and system size. As shown in Fig. 13, runtime increases non-linearly due to the



Fig. 13. STAGE Runtime Scaling with Number of GPUs

TABLE X  
STATE-SPACE MODEL

| Inputs                                                                                         | Output          |
|------------------------------------------------------------------------------------------------|-----------------|
| $x[B/p1, S, D/p2]$                                                                             |                 |
| $wdt1[D/p2, R], wdt2[R, D/p2]$                                                                 |                 |
| $A[D/p2, P], B[B/p1, S, P]$                                                                    |                 |
| $C[B/p1, S, P], D[D/p2]$                                                                       | $y[B/p1, S, D]$ |
| Compute:                                                                                       |                 |
| $dt1[B/p1, S, R] = \text{AllReduce}(\text{einsum}[bsd, de \rightarrow bse](x, wdt1))$          |                 |
| $dt[B/p1, S, D/p2] = \text{einsum}[bse, ed \rightarrow bsd](dt1, wdt2)$                        |                 |
| $dA[B/p1, S, D/p2, P] = \text{einsum}[dp, bsd \rightarrow bsdp](A, dt)$                        |                 |
| $dB[B/p1, S, D/p2, P] = \text{einsum}[bsp, bsd \rightarrow bsdp](B, dt)$                       |                 |
| $\delta\text{eta}B[B/p1, S, D/p2, P] = \text{einsum}[bsdp, bsd \rightarrow bsdp](\delta B, x)$ |                 |
| $hs[B/p1, S, D/p2, P] = \text{pscan}[dim=1](dA, \delta\text{eta}B)$                            |                 |
| $y0[B/p1, S, D/p2] = \text{einsum}[bsdp, bsp \rightarrow bsd](hs, C)$                          |                 |
| $y[B/p1, S, D/p2] = \text{einsum}[bsd, d](y0, D)$                                              |                 |

expanding parallel configuration space, yet STAGE remains highly efficient. At 32K GPUs, it generates graphs for a 540B dense LLM in just 28 minutes. For more complex models like Mixtral-8x7B, with added expert parallelism, generation remains practical at around 50 minutes. Memory usage stays below 500MB in all cases. In contrast, real-system trace generation can take tens of minutes per node and scale to hours on large clusters.

### E. Discussion: General Modeling with STAGE

While the evaluation in this work focuses on conventional LLMs (including MoEs), the symbolic representation employed by STAGE is not inherently limited to LLMs. The framework’s design allows it to generalize to any tensor-computation workload, whether from ML or from other fields. Here we illustrate this flexibility through several application cases.

**Emerging Model Architecture: State Space Model (SSM) [21]:** SSMs are emerging as a compelling alternative to traditional transformer architectures in LLMs, primarily because their computational and memory complexity is linear in sequence length, which allows efficient handling of long sequences. To showcase the flexibility of STAGE, Table X shows how users can model a state-space model with STAGE, where  $p1$  and  $p2$  denote the data-parallel and tensor-parallel degrees, respectively.
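The dataflow of Table X can be sketched on a single device with NumPy, which is a convenient way to check that the einsum signatures compose. This is only an illustration under stated assumptions, not STAGE code: it sets $p1 = p2 = 1$ (so the AllReduce is a no-op), replaces the parallel scan with a sequential loop implementing the recurrence $h_t = dA_t \odot h_{t-1} + u_t$ (our reading of `pscan`), and interprets the final `einsum[bsd, d]` as a per-channel scaling. The function name and the variable `u` (the second pscan input) are hypothetical.

```python
import numpy as np


def ssm_forward(x, wdt1, wdt2, A, Bm, C, D):
    """Single-device sketch of the Table X dataflow (assumes p1 = p2 = 1).

    Shapes: x[B,S,D], wdt1[D,R], wdt2[R,D], A[D,P], Bm[B,S,P], C[B,S,P], D[D].
    """
    dt1 = np.einsum("bsd,de->bse", x, wdt1)    # low-rank projection
    dt = np.einsum("bse,ed->bsd", dt1, wdt2)   # back to model width
    dA = np.einsum("dp,bsd->bsdp", A, dt)
    dB = np.einsum("bsp,bsd->bsdp", Bm, dt)
    u = np.einsum("bsdp,bsd->bsdp", dB, x)     # second input of the scan

    # Sequential stand-in for pscan along the sequence dimension (assumption).
    hs = np.zeros_like(u)
    h = np.zeros_like(u[:, 0])
    for t in range(u.shape[1]):
        h = dA[:, t] * h + u[:, t]
        hs[:, t] = h

    y0 = np.einsum("bsdp,bsp->bsd", hs, C)
    # Assumed meaning of einsum[bsd, d]: scale each channel by D.
    return np.einsum("bsd,d->bsd", y0, D)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    b, s, d, r, p = 2, 5, 4, 3, 6  # illustrative sizes
    y = ssm_forward(
        rng.standard_normal((b, s, d)),
        rng.standard_normal((d, r)),
        rng.standard_normal((r, d)),
        rng.standard_normal((d, p)),
        rng.standard_normal((b, s, p)),
        rng.standard_normal((b, s, p)),
        rng.standard_normal(d),
    )
    print(y.shape)
```

Every intermediate keeps the batch and sequence axes in front, which is why the table can annotate them uniformly with $B/p1$ and leave $S$ unpartitioned.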

**Emerging Parallel Strategies:** STAGE supports modeling emerging parallel strategies, as long as they can be expressed symbolically over tensors. Table XI illustrates Fully-Sharded Tensor Parallel (FSTP), a strategy symmetric to FSDP that is based on tensor parallelism (TP) with activation sharding. Although this parallel strategy does not currently exist in practice, STAGE is capable of modeling it. While it may not be beneficial for current models, it could become useful for future models that can fully exploit it, and STAGE can serve as an efficient tool for rapid prototyping.

TABLE XI  
FULLY-SHARDED TENSOR PARALLEL

| Inputs                                                                      | Output                                   |
|-----------------------------------------------------------------------------|------------------------------------------|
| X[Batch/ $\text{dp}$ , D1/ $\text{tp}$ ]                                    | Y[Batch/ $\text{dp}$ , D2/ $\text{tp}$ ] |
| W[D1/ $\text{tp}$ , D2]                                                     |                                          |
| <b>Compute:</b>                                                             |                                          |
| X*[Batch/ $\text{dp}$ , D1]=AllGather[ $\text{tp}$ ] (X)                    |                                          |
| Y*[Batch/ $\text{dp}$ , D2@1/ $\text{tp}$ ]=einsum[bm, mn->bn] (X*, W)      |                                          |
| Y[Batch/ $\text{dp}$ , D2/ $\text{tp}$ ]=ReduceScatter[ $\text{tp}$ ] (Y*)  |                                          |

TABLE XII  
2-LAYER MLP, INTERLEAVED DP/TP

| Inputs                                                 | Output        |
|--------------------------------------------------------|---------------|
| X0[Batch/ $\text{p}$ , D1]                             |               |
| W1[D1, D2]                                             | X2[Batch, D3] |
| W2[D2/ $\text{p}$ , D3]                                |               |
| <b>Compute:</b>                                        |               |
| X1[Batch/ $\text{p}$ , D2]=einsum[bm, mn->bn] (X0, W1) |               |
| X1*[Batch, D2/ $\text{p}$ ]=AllToAll(X1)               |               |
| X2[Batch, D3]=einsum[bm, mn->bn] (X1*, W2)             |               |

**Layerwise Flexible Parallel Strategies:** STAGE supports modeling of non-standard parallelization strategies that combine different forms of parallelism across layers. Table XII shows a two-layer MLP where data parallelism is applied to the first layer and tensor parallelism to the second. This strategy requires a collective communication operation, specifically *AllToAll* between the two layers. STAGE automatically detects such mismatches and inserts the required communication, streamlining the modeling of complex hybrid strategies.
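The resharding step in Table XII can be emulated on one machine by treating a Python list as the set of ranks. The sketch below is a minimal NumPy illustration, not STAGE's implementation: `all_to_all` and `interleaved_mlp` are hypothetical names, and the final summation over ranks is our assumption about how the per-rank partial products of the second layer are combined into X2[Batch, D3].

```python
import numpy as np


def all_to_all(shards):
    """Convert batch-sharded [Batch/p, D2] blocks into feature-sharded [Batch, D2/p].

    shards[r] is the block held by rank r. Rank r sends its c-th column
    block to rank c, and each rank stacks what it receives in source order.
    """
    p = len(shards)
    pieces = [np.split(block, p, axis=1) for block in shards]
    return [
        np.concatenate([pieces[src][dst] for src in range(p)], axis=0)
        for dst in range(p)
    ]


def interleaved_mlp(x0, w1, w2, p):
    """Layer 1 is data-parallel (W1 replicated); layer 2 is tensor-parallel."""
    x0_shards = np.split(x0, p, axis=0)  # X0[Batch/p, D1]
    w2_shards = np.split(w2, p, axis=0)  # W2[D2/p, D3]

    x1 = [np.einsum("bm,mn->bn", xs, w1) for xs in x0_shards]  # X1[Batch/p, D2]
    x1_star = all_to_all(x1)                                    # X1*[Batch, D2/p]

    # Each rank contracts over its slice of D2, yielding a partial sum.
    partial = [
        np.einsum("bm,mn->bn", xs, ws) for xs, ws in zip(x1_star, w2_shards)
    ]
    return sum(partial)  # assumed reduction across ranks


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = rng.standard_normal((8, 6))  # Batch=8, D1=6 (illustrative sizes)
    w1 = rng.standard_normal((6, 4))  # D2=4
    w2 = rng.standard_normal((4, 5))  # D3=5
    x2 = interleaved_mlp(x0, w1, w2, p=4)
    print(np.allclose(x2, x0 @ w1 @ w2))
```

The check against the dense product `x0 @ w1 @ w2` confirms that moving the partition from the batch axis to the feature axis leaves the result unchanged.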

**Non-ML workload:** Tensor-Train decomposition [47] is a mathematical technique used by the computational physics and mechanics communities to make large problems tractable through approximation. It decomposes an  $n$ -dimensional tensor into a chain of products of  $n - 2$  tensors called kernels, and the original tensor is recovered on the fly when it is needed.

Table XIII shows a Tensor-Train decomposition example with tensor-level distribution applied along specific dimensions ( $\text{p1}$  and  $\text{p2}$ ) using STAGE. In this example, the first and second dimensions of each kernel tensor are distributed, except for G1, since partitioning it across three dimensions would make the resulting einsum computation unmappable in practice.
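The recovery chain of Table XIII can be checked numerically without any parallelism. The following NumPy sketch is an illustration under assumptions rather than STAGE output: `tt_recover` and `sharded_contract` are hypothetical names, and the einsum subscripts are chosen to match the tensor ranks listed in the table. The second helper shows why a reduction collective is needed once the contracted rank dimension is partitioned: each shard contributes only a partial sum.

```python
import numpy as np


def tt_recover(g1, g2, g3, g4):
    """Rebuild X[R0, M1, M2, M3, M4, R4] from four 3-D kernels."""
    t1 = np.einsum("amb,bnc->amnc", g1, g2)
    t2 = np.einsum("amnc,cod->amnod", t1, g3)
    return np.einsum("amnod,dpe->amnope", t2, g4)


def sharded_contract(g1, g2, p1):
    """Contract G1 and G2 with the shared rank R1 split over p1 shards.

    Each shard yields a partial result; summing them stands in for the
    reduction collective (an assumption about the intended semantics).
    """
    g1_shards = np.split(g1, p1, axis=2)  # G1[R0, M1, R1/p1]
    g2_shards = np.split(g2, p1, axis=0)  # G2[R1/p1, M2, R2]
    return sum(
        np.einsum("amb,bnc->amnc", a, b) for a, b in zip(g1_shards, g2_shards)
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    g1 = rng.standard_normal((2, 3, 4))  # illustrative ranks and mode sizes
    g2 = rng.standard_normal((4, 3, 5))
    g3 = rng.standard_normal((5, 2, 3))
    g4 = rng.standard_normal((3, 4, 2))
    x = tt_recover(g1, g2, g3, g4)
    print(x.shape)
    print(np.allclose(sharded_contract(g1, g2, p1=2),
                      np.einsum("amb,bnc->amnc", g1, g2)))
```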

## VII. RELATED WORKS

**Benchmarking for Distributed Training.** DeepBench [5] and MLPerf [53] offer standardized metrics for evaluating the performance of training and inference tasks. While these tools excel at providing reproducible benchmarks, they do not expose detailed profiling data. PyTorch Execution Observer [49] and NVIDIA CUPTI [45] provide performance profiles of training systems. However, they require an actual run to collect traces. Moreover, the generated execution traces lack annotations for optimizations and dependencies, which are essential for profiling system architectures. PyTorch FX [54] can capture static model behavior as a dependency graph at compile time, but it lacks post-execution information and requires optimized code for analysis. In contrast, STAGE automatically partitions the operators, generating an updated computational graph that incorporates the appropriate parallelization annotations and dependencies.

TABLE XIII  
RECOVERY OF 6-D TENSOR-TRAIN DECOMPOSITION, WITH PARALLELIZATION

| Inputs                                                                                        | Output                                                |
|-----------------------------------------------------------------------------------------------|-------------------------------------------------------|
| G1[R0, M1, R1/ $\text{p1}$ ]                                                                  |                                                       |
| G2[R1/ $\text{p1}$ , M2/ $\text{p2}$ , R2]                                                    | X[R0/ $\text{p1}$ , M1, M2, M3, M4/ $\text{p2}$ , R4] |
| G3[R2/ $\text{p1}$ , M3/ $\text{p2}$ , R3]                                                    |                                                       |
| G4[R3/ $\text{p1}$ , M4/ $\text{p2}$ , R4]                                                    |                                                       |
| <b>Compute:</b>                                                                               |                                                       |
| T1[R0, M1, M2/ $\text{p2}$ , R2@1/ $\text{p1}$ ] = einsum[amb, bnc->amnc] (G1, G2)            |                                                       |
| T1*[R0, M1, M2, R2/ $\text{p1}$ ] = ReduceScatter[AllGather](T1)                              |                                                       |
| T2[R0, M1, M2, M3/ $\text{p2}$ , R3@1/ $\text{p1}$ ] = einsum[amnc, cod->amnod] (T1, G3)      |                                                       |
| T2*[R0, M1, M2, M3, R3/ $\text{p1}$ ] = ReduceScatter[AllGather](T2)                          |                                                       |
| X[R0, M1, M2, M3, M4/ $\text{p2}$ , R4@1/ $\text{p1}$ ] = einsum[amnod, dpe->amnope] (T2, G4) |                                                       |
| X*[R0/ $\text{p1}$ , M1, M2, M3, M4/ $\text{p2}$ , R4] = ReduceScatter(X)                     |                                                       |

**Performance Modeling for Distributed Training.** Recent efforts on performance modeling such as vTrain [4], MADMAX [25], and Calculon [27] have significantly advanced the community’s understanding of distributed LLM workloads through detailed analytical modeling or trace-driven simulation. However, these frameworks share a common limitation in flexibility and configurability, which makes it difficult to systematically explore emerging models such as MoEs and state-space models in detail. vTrain, for example, primarily focuses on operators explicitly profiled from real systems, limiting its extensibility to new or custom models. In this context, STAGE aims not to compete with these frameworks but to complement them by providing a flexible and configurable workload generation mechanism.

**Tensor Representation for System-level Optimizations.** Tensor representation is commonly utilized for system-level optimization of deep learning models [9], [61], [68], enabling computational graph optimizations for frameworks including PyTorch [48] and TensorFlow [2]. Techniques such as operator fusion leverage tensor representations to enhance parallel processing and memory efficiency [42], [69]. FlexFlow [28] and Unity [60] employ system-level compilation to determine effective parallelization strategies in distributed settings, while Mist [71] recently proposed symbolic tensor representations specifically for memory parallelism. In contrast, we propose a symbolic tensor graph that systematically annotates key operators with parallelization dimensions to guide runtime optimization for large-scale LLM training.

## VIII. CONCLUSION

We introduce STAGE, a framework for generating high-fidelity workload graphs for distributed LLM training. It provides practitioners with a robust tool for system-level design exploration and scalable benchmarking in future AI infrastructure research. The symbolic tensor graph allows for a structured representation of parallelization strategies, moving beyond ad-hoc methods and enabling the exploration of previously unattainable system configurations. Our validation against real-world traces and scalability up to 32K GPUs demonstrate its effectiveness and practicality.

## REFERENCES

- [1] “Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model.” [Online]. Available: <https://arxiv.org/abs/2405.04434>

- [2] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: <https://www.tensorflow.org/>

[3] J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai, “Gqa: Training generalized multi-query transformer models from multi-head checkpoints,” 2023. [Online]. Available: <https://arxiv.org/abs/2305.13245>

[4] J. Bang, Y. Choi, M. Kim, Y. Kim, and M. Rhu, “vtrain: A simulation framework for evaluating cost-effective and compute-optimal large language model training,” *arXiv preprint arXiv:2312.12391*, 2023.

[5] S. Belloni, D. Ritter, M. Schröder, and N. Rörup, “Deepbench: Benchmarking json document stores,” ser. DBTest ’22. New York, NY, USA: Association for Computing Machinery, 2022, p. 1–9. [Online]. Available: <https://doi.org/10.1145/3531348.3532176>

[6] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” 2020. [Online]. Available: <https://arxiv.org/abs/2005.14165>

[7] N. Bshara, “Aws trainium: The journey for designing and optimization full stack ml hardware,” in *Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3*, ser. ASPLOS ’24. New York, NY, USA: Association for Computing Machinery, 2024, p. 4. [Online]. Available: <https://doi.org/10.1145/3620666.3655592>

[8] Cerebras Systems, Inc., “CS-3 System,” <https://www.cerebras.ai/system>, n.d., accessed: 2025-08-01.

[9] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, M. Cowan, H. Shen, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, “Tvm: An automated end-to-end optimizing compiler for deep learning,” 2018. [Online]. Available: <https://arxiv.org/abs/1802.04799>

[10] J. Cho, M. Kim, H. Choi, and J. Park, “Llmbservsim: A simulation infrastructure for llm inference serving systems.”

[11] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H. Michalewski, X. Garcia, V. Misra, K. Robinson, L. Fedus, D. Zhou, D. Ippolito, D. Luan, H. Lim, B. Zoph, A. Spiridonov, R. Sepassi, D. Dohan, S. Agrawal, M. Omernick, A. M. Dai, T. S. Pillai, M. Pellatt, A. Lewkowycz, E. Moreira, R. Child, O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta, M. Diaz, O. Firat, M. Catasta, J. Wei, K. Meier-Hellstern, D. Eck, J. Dean, S. Petrov, and N. Fiedel, “Palm: Scaling language modeling with pathways,” 2022. [Online]. Available: <https://arxiv.org/abs/2204.02311>

[12] D. Dai, C. Deng, C. Zhao, R. X. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y. Wu, Z. Xie, Y. K. Li, P. Huang, F. Luo, C. Ruan, Z. Sui, and W. Liang, “Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models,” 2024. [Online]. Available: <https://arxiv.org/abs/2401.06066>

[13] deepseek ai. (2025, Feb.) Deepseek v3/r1 inference system overview. GitHub: Open Infra Index, Day 6 of 2025 Open Source Week. Accessed: 2025-10-20. [Online]. Available: <https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md>

[14] DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, X. Zhang, X. Yu, Y. Wu, Z. F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao, A. Liu, B. Xue, B. Wang, B. Wu, B. Feng, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, D. Dai, D. Chen, D. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Bao, H. Xu, H. Wang, H. Ding, H. Xin, H. Gao, H. Qu, H. Li, J. Guo, J. Li, J. Wang, J. Chen, J. Yuan, J. Qiu, J. Li, J. L. Cai, J. Ni, J. Liang, J. Chen, K. Dong, K. Hu, K. Gao, K. Guan, K. Huang, K. Yu, L. Wang, L. Zhang, L. Zhao, L. Su, L. Chen, L. Sun, L. Wang, X. Song, X. Zhou, X. Wang, X. Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Y. Zhang, Y. Xu, Y. Li, Y. Zhao, Y. Sun, Y. Wang, Y. Yu, Y. Zhang, Y. Shi, Y. Xiong, Y. He, Y. Piao, Y. Wang, Y. Tan, Y. Ma, Y. Liu, Y. Guo, Y. Ou, Y. Wang, Y. Gong, Y. Zou, Y. He, Y. Xiong, Y. Luo, Y. You, Y. Liu, Y. Zhou, Y. X. Zhu, Y. Xu, Y. Huang, Y. Li, Y. Zheng, Y. Zhu, Y. Ma, Y. Tang, Y. Zha, Y. Yan, Z. Z. Ren, Z. Ren, Z. Sha, Z. Fu, Z. Xu, Z. Xie, Z. Zhang, Z. Hao, Z. Ma, Z. Yan, Z. Wu, Z. Gu, Z. Zhu, Z. Liu, Z. Li, Z. Xie, Z. Song, Z. Pan, Z. Huang, Z. Xu, Z. Zhang, and Z. Zhang, “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” 2025. [Online]. Available: <https://arxiv.org/abs/2501.12948>

[15] DeepSeek-AI, A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, D. Dai, D. Guo, D. Yang, D. Chen, D. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Bao, H. Xu, H. Wang, H. Zhang, H. Ding, H. Xin, H. Gao, H. Li, H. Qu, J. L. Cai, J. Liang, J. Guo, J. Ni, J. Li, J. Wang, J. Chen, J. Chen, J. Yuan, J. Qiu, J. Li, J. Song, K. Dong, K. Hu, K. Gao, K. Guan, K. Huang, K. Yu, L. Wang, L. Zhang, L. Xu, L. Xia, L. Zhao, L. Wang, L. Zhang, M. Li, M. Wang, M. Zhang, M. Zhang, M. Tang, M. Li, N. Tian, P. Huang, P. Wang, P. Zhang, Q. Wang, Q. Zhu, Q. Chen, Q. Du, R. J. Chen, R. L. Jin, R. Ge, R. Zhang, R. Pan, R. Wang, R. Xu, R. Zhang, R. Chen, S. S. Li, S. Lu, S. Zhou, S. Chen, S. Wu, S. Ye, S. Ye, S. Ma, S. Wang, S. Zhou, S. Yu, S. Zhou, S. Pan, T. Wang, T. Yun, T. Pei, T. Sun, W. L. Xiao, W. Zeng, W. Zhao, W. An, W. Liu, W. Liang, W. Gao, W. Yu, W. Zhang, X. Q. Li, X. Jin, X. Wang, X. Bi, X. Liu, X. Wang, X. Shen, X. Chen, X. Zhang, X. Chen, X. Nie, X. Sun, X. Wang, X. Cheng, X. Liu, X. Xie, X. Liu, X. Yu, X. Song, X. Shan, X. Zhou, X. Yang, X. Li, X. Su, X. Lin, Y. K. Li, Y. Q. Wang, Y. X. Wei, Y. X. Zhu, Y. Zhang, Y. Xu, Y. Xu, Y. Huang, Y. Li, Y. Zhao, Y. Sun, Y. Li, Y. Wang, Y. Yu, Y. Zheng, Y. Zhang, Y. Shi, Y. Xiong, Y. He, Y. Tang, Y. Piao, Y. Wang, Y. Tan, Y. Ma, Y. Liu, Y. Guo, Y. Wu, Y. Ou, Y. Zhu, Y. Wang, Y. Gong, Y. Zou, Y. He, Y. Zhu, Y. Xiong, Y. Ma, Y. Yan, Y. Luo, Y. You, Y. Liu, Y. Zhou, Z. F. Wu, Z. Z. Ren, Z. Ren, Z. Sha, Z. Fu, Z. Xu, Z. Huang, Z. Zhang, Z. Xie, Z. Zhang, Z. Hao, Z. Gou, Z. Ma, Z. Yan, Z. Shao, Z. Xu, Z. Wu, Z. Zhang, Z. Li, Z. Gu, Z. Zhu, Z. Liu, Z. Li, Z. Xie, Z. Song, Z. Gao, and Z. Pan, “Deepseek-v3 technical report,” 2025. [Online]. Available: <https://arxiv.org/abs/2412.19437>

[16] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” 2019. [Online]. Available: <https://arxiv.org/abs/1810.04805>

[17] J. Duan, X. Li, P. Xu, X. Zhang, S. Yan, Y. Liang, and D. Lin, “Proteus: Simulating the performance of distributed dnn training,” 2023. [Online]. Available: <https://arxiv.org/abs/2306.02267>

[18] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” 2022. [Online]. Available: <https://arxiv.org/abs/2101.03961>

[19] P. Glorioso, Q. Anthony, Y. Tokpanov, J. Whittington, J. Pilault, A. Ibrahim, and B. Millidge, “Zamba: A compact 7b ssm hybrid model,” 2024. [Online]. Available: <https://arxiv.org/abs/2405.16712>

[20] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, A. Yang, A. Fan, A. Goyal, A. Hartshorn, A. Yang, A. Mitra, A. Sravankumar, A. Korenev, A. Hinsvark, A. Rao, A. Zhang, A. Rodriguez, A. Gregerson, A. Spataru, B. Roziere, B. Biron, B. Tang, B. Chern, C. Caucheteux, C. Nayak, C. Bi, C. Marra, C. McConnell, C. Keller, C. Touret, C. Wu, C. Wong, C. C. Ferrer, C. Nikolaidis, D. Allonsius, D. Song, D. Pintz, D. Livshits, D. Wyatt, D. Esiobu, D. Choudhary, D. Mahajan, D. Garcia-Olano, D. Perino, D. Hupkes, E. Lakomkin, E. AlBadawy, E. Lobanova, E. Dinan, E. M. Smith, F. Radenovic, F. Guzmán, F. Zhang, G. Synnaeve, G. Lee, G. L. Anderson, G. Thattai, G. Nail, G. Mialon, G. Pang, G. Cucurell, H. Nguyen, H. Korevaar, H. Xu, H. Touvron, I. Zarov, I. A. Ibarra, I. Kloumann, I. Misra, I. Evtimov, J. Zhang, J. Copet, J. Lee, J. Geffert, J. Vranes, J. Park, J. Mahadeokar, J. Shah, J. van der Linde, J. Billock, J. Hong,

- J. Lee, J. Fu, J. Chi, J. Huang, J. Liu, J. Wang, J. Yu, J. Bitton, J. Spisak, J. Park, J. Rocca, J. Johnstun, J. Saxe, J. Jia, K. V. Alwala, K. Prasad, K. Upasani, K. Plawiak, K. Li, K. Heafield, K. Stone, K. El-Arini, K. Iyer, K. Malik, K. Chiu, K. Bhalla, K. Lakhotia, L. Rantala-Yearly, L. van der Maaten, L. Chen, L. Tan, L. Jenkins, L. Martin, L. Madaan, L. Malo, L. Blecher, L. Landzaat, L. de Oliveira, M. Muzzi, M. Pasupuleti, M. Singh, M. Paluri, M. Kardas, M. Tsimpoukelli, M. Oldham, M. Rita, M. Pavlova, M. Kambadur, M. Lewis, M. Si, M. K. Singh, M. Hassan, N. Goyal, N. Torabi, N. Bashlykov, N. Bogochev, N. Chatterji, N. Zhang, O. Duchenne, O. Celebi, P. Alrassy, P. Zhang, P. Li, P. Vasic, P. Weng, P. Bhargava, P. Dubal, P. Krishnan, P. S. Koura, P. Xu, Q. He, Q. Dong, R. Srinivasan, R. Ganapathy, R. Calderer, R. S. Cabral, R. Stojnic, R. Raileanu, R. Maheswari, R. Girdhar, R. Patel, R. Sauvestre, R. Polidoro, R. Sumbaly, R. Taylor, R. Silva, R. Hou, R. Wang, S. Hosseini, S. Chennabasappa, S. Singh, S. Bell, S. S. Kim, S. Edunov, S. Nie, S. Narang, S. Raparthy, S. Shen, S. Wan, S. Bhosale, S. Zhang, S. Vandenhende, S. Batra, S. Whitman, S. Sootla, S. Collot, S. Gururangan, S. Borodinsky, T. Herman, T. Fowler, T. Sheasha, T. Georgiou, T. Scialom, T. Speckbacher, T. Mihaylov, T. Xiao, U. Karn, V. Goswami, V. Gupta, V. Ramamathan, V. Kerkez, V. Gonguet, V. Do, V. Vogeti, V. Albiero, V. Petrovic, W. Chu, W. Xiong, W. Fu, W. Meers, X. Martinet, X. Wang, X. Wang, X. E. Tan, X. Xia, X. Xie, X. Jia, X. Wang, Y. Goldschlag, Y. Gaur, Y. Babaei, Y. Wen, Y. Song, Y. Zhang, Y. Li, Y. Mao, Z. D. Couder, Z. Yan, Z. Chen, Z. Papakipos, A. Singh, A. Srivastava, A. Jain, A. Kelsey, A. Shajnfeld, A. Gangidi, A. Victoria, A. Goldstand, A. Menon, A. Sharma, A. Boesenberg, A. Baevski, A. Feinstein, A. Kallet, A. Sangani, A. Teo, A. Yunus, A. Lupu, A. Alvarado, A. Caples, A. Gu, A. Ho, A. Poult, A. Ryan, A. Ramchandani, A. Dong, A. Franco, A. Goyal, A. Saraf, A. Chowdhury, A. 
Gabriel, A. Bharambe, A. Eisenman, A. Yazdan, B. James, B. Maurer, B. Leonhardi, B. Huang, B. Loyd, B. D. Paola, B. Paranjape, B. Liu, B. Wu, B. Ni, B. Hancock, B. Wasti, B. Spence, B. Stojkovic, B. Gamido, B. Montalvo, C. Parker, C. Burton, C. Mejia, C. Liu, C. Wang, C. Kim, C. Zhou, C. Hu, C.-H. Chu, C. Cai, C. Tindal, C. Feichtenhofer, C. Gao, D. Civin, D. Beaty, D. Kreymer, D. Li, D. Adkins, D. Xu, D. Testuggine, D. David, D. Parikh, D. Liskovich, D. Foss, D. Wang, D. Le, D. Holland, E. Dowling, E. Jamil, E. Montgomery, E. Presani, E. Hahn, E. Wood, E.-T. Le, E. Brinkman, E. Arcuate, E. Dunbar, E. Smothers, F. Sun, F. Kreuk, F. Tian, F. Kokkinos, F. Ozgenel, F. Caggioni, F. Kanayet, F. Seide, G. M. Florez, G. Schwarz, G. Bader, G. Swee, G. Halpern, G. Herman, G. Sizov, G. Guangyi, G. Zhang, G. Lakshminarayanan, H. Inan, H. Shojanazeri, H. Zou, H. Wang, H. Zha, H. Habeeb, H. Rudolph, H. Suk, H. Aspren, H. Goldman, H. Zhan, I. Damlaj, I. Molybog, I. Tufanov, I. Leontiadis, I.-E. Veliche, I. Gat, J. Weissman, J. Geboski, J. Kohli, J. Lam, J. Asher, J.-B. Gaya, J. Marcus, J. Tang, J. Chan, J. Zhen, J. Reizenstein, J. Teboul, J. Zhong, J. Jin, J. Yang, J. Cummings, J. Carvill, J. Shepard, J. McPhie, J. Torres, J. Ginsburg, J. Wang, K. Wu, K. H. U. K. Saxena, K. Khandelwal, K. Zand, K. Matosich, K. Veeraraghavan, K. Michelen, K. Li, K. Jagadeesh, K. Huang, K. Chawla, K. Huang, L. Chen, L. Garg, L. A. L. Silva, L. Bell, L. Zhang, L. Guo, L. Yu, L. Moshkovich, L. Wehrstedt, M. Khabsa, M. Avalani, M. Bhatt, M. Mankus, M. Hasson, M. Lennie, M. Reso, M. Groshev, M. Naumov, M. Lathi, M. Keneally, M. Liu, M. L. Seltzer, M. Valko, M. Restrepo, M. Patel, M. Vyatskov, M. Samvelyan, M. Clark, M. Macey, M. Wang, M. J. Hermoso, M. Metanat, M. Rastegari, M. Bansal, N. Santhanam, N. Parks, N. White, N. Bawa, N. Singh, N. Egebo, N. Usunier, N. Mehta, N. P. Laptev, N. Dong, N. Cheng, O. Chernoguz, O. Hart, O. Salpekar, O. Kalinli, P. Kent, P. Parekh, P. Saab, P. Balaji, P. 
Rittner, P. Bontrager, P. Roux, P. Dollar, P. Zvyagina, P. Ratanchandani, P. Yuvraj, Q. Liang, R. Alao, R. Rodriguez, R. Ayub, R. Murthy, R. Nayani, R. Mitra, R. Parthasarathy, R. Li, R. Hogan, R. Battey, R. Wang, R. Howes, R. Rinott, S. Mehta, S. Siby, S. J. Bondu, S. Datta, S. Chugh, S. Hunt, S. Dhillon, S. Sidorov, S. Pan, S. Mahajan, S. Verma, S. Yamamoto, S. Ramaswamy, S. Lindsay, S. Lindsay, S. Feng, S. Lin, S. C. Zha, S. Patil, S. Shankar, S. Zhang, S. Zhang, S. Wang, S. Agarwal, S. Sajuyigbe, S. Chintala, S. Max, S. Chen, S. Kehoe, S. Satterfield, S. Govindaprasad, S. Gupta, S. Deng, S. Cho, S. Virk, S. Subramanian, S. Choudhury, S. Goldman, T. Remez, T. Glaser, T. Best, T. Koehler, T. Robinson, T. Li, T. Zhang, T. Matthews, T. Chou, T. Shaked, V. Vontimitta, V. Ajayi, V. Montanez, V. Mohan, V. S. Kumar, V. Mangla, V. Ionescu, V. Poenaru, V. T. Mihailescu, V. Ivanov, W. Li, W. Wang, W. Jiang, W. Bouaziz, W. Constable, X. Tang, X. Wu, X. Wang, X. Wu, X. Gao, Y. Kleinman, Y. Chen, Y. Hu, Y. Jia, Y. Qi, Y. Li, Y. Zhang, Y. Zhang, Y. Adi, Y. Nam, Y. Wang, Y. Zhao, Y. Hao, Y. Qian, Y. Li, Y. He, Z. Rait, Z. DeVito, Z. Rosnbrick, Z. Wen, Z. Yang, Z. Zhao, and Z. Ma, "The llama 3 herd of models," 2024. [Online]. Available: <https://arxiv.org/abs/2407.21783>
- [21] A. Gu and T. Dao, "Mamba: Linear-time sequence modeling with selective state spaces," 2024. [Online]. Available: <https://arxiv.org/abs/2312.00752>
- [22] S. Gugger, L. Debut, T. Wolf, P. Schmid, Z. Mueller, S. Mangrulkar, M. Sun, and B. Bossan, "Accelerate: Training and inference at scale made simple, efficient and adaptable." <https://github.com/huggingface/accelerate>, 2022.
- [23] A. Harlap, D. Narayanan, A. Phanishayee, V. Seshadri, N. Devanur, G. Ganger, and P. Gibbons, "Pipedream: Fast and efficient pipeline parallel dnn training," 2018. [Online]. Available: <https://arxiv.org/abs/1806.03377>
- [24] E. Harper, S. Majumdar, O. Kuchaiev, J. Li, Y. Zhang, E. Bakhturina, V. Noroozi, S. Subramanian, N. Koluguri, J. Huang, F. Jia, J. Balam, X. Yang, M. Livne, Y. Dong, S. Naren, and B. Ginsburg, "Nemo: a toolkit for conversational ai and large language models," <https://github.com/NVIDIA/NeMo>, 2024. [Online]. Available: <https://nvidia.github.io/NeMo/>
- [25] S. Hsia, A. Golden, B. Acun, N. Ardalani, Z. DeVito, G.-Y. Wei, D. Brooks, and C.-J. Wu, "Mad-max beyond single-node: Enabling large machine learning model acceleration on distributed systems," in *2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)*. IEEE, 2024, pp. 818–833.
- [26] Y. Huang, Y. Cheng, A. Bapna, O. Firat, M. X. Chen, D. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu, and Z. Chen, "Gpipe: Efficient training of giant neural networks using pipeline parallelism," 2019. [Online]. Available: <https://arxiv.org/abs/1811.06965>
- [27] M. Isaev, N. McDonald, L. Dennison, and R. Vuduc, "Calculon: a methodology and tool for high-level co-design of systems and large language models," in *Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis*, 2023, pp. 1–14.
- [28] Z. Jia, M. Zaharia, and A. Aiken, "Beyond data and model parallelism for deep neural networks," 2018. [Online]. Available: <https://arxiv.org/abs/1807.05358>
- [29] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, "Mistral 7b," 2023. [Online]. Available: <https://arxiv.org/abs/2310.06825>
- [30] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M.-A. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. L. Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, "Mixtral of experts," 2024. [Online]. Available: <https://arxiv.org/abs/2401.04088>
- [31] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, "Scaling laws for neural language models," 2020. [Online]. Available: <https://arxiv.org/abs/2001.08361>
- [32] V. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro, "Reducing activation recomputation in large transformer models," 2022. [Online]. Available: <https://arxiv.org/abs/2205.05198>
- [33] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen, "Gshard: Scaling giant models with conditional computation and automatic sharding," 2020. [Online]. Available: <https://arxiv.org/abs/2006.16668>
- [34] S. Li, Y. Zhao, R. Varma, O. Salpekar, P. Noordhuis, T. Li, A. Paszke, J. Smith, B. Vaughan, P. Damania, and S. Chintala, "Pytorch distributed: Experiences on accelerating data parallel training," 2020. [Online]. Available: <https://arxiv.org/abs/2006.15704>
- [35] M. Liang, W. Fu, L. Feng, Z. Lin, P. Panakanti, S. Zheng, S. Sridharan, and C. Delimitrou, "Mystique: Enabling accurate and scalable generation of production ai benchmarks," 2023. [Online]. Available: <https://arxiv.org/abs/2301.04122>
- [36] M. Liang, H. T. Kassa, W. Fu, B. Coutinho, L. Feng, and C. Delimitrou, "Lumos: Efficient performance modeling and estimation for large-scale llm training," 2025. [Online]. Available: <https://arxiv.org/abs/2504.09307>

- [37] O. Lieber, B. Lenz, H. Bata, G. Cohen, J. Osin, I. Dalmedigos, E. Safahi, S. Meirom, Y. Belinkov, S. Shalev-Shwartz, O. Abend, R. Alon, T. Asida, A. Bergman, R. Glogzman, M. Gokhman, A. Manevich, N. Ratner, N. Rozen, E. Shwartz, M. Zusman, and Y. Shoham, "Jamba: A hybrid transformer-mamba language model," 2024. [Online]. Available: <https://arxiv.org/abs/2403.19887>
- [38] Meta AI. (2025, Apr.) The llama 4 herd: The beginning of a new era of natively multimodal ai innovation. Accessed: 2025-04-22. [Online]. Available: <https://ai.meta.com/blog/llama-4-multimodal-intelligence/>
- [39] MLCommons, "Chakra working group," <https://mlcommons.org/working-groups/research/chakra/>, 2023.
- [40] MLCommons, "Chakra schema," <https://github.com/mlcommons/chakra/wiki/Chakra-Schema>, 2024.
- [41] D. Nguyen, W. Yang, R. Anand, Y. Yang, and B. Mirzasoleiman, "Minibatch coresets for memory-efficient training of large language models," in *arXiv:2407.19580 [cs.LG]*, 2024.
- [42] W. Niu, J. Guan, Y. Wang, G. Agrawal, and B. Ren, "Dnnfusion: accelerating deep neural networks execution with advanced operator fusion," in *Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation*, ser. PLDI 2021. New York, NY, USA: Association for Computing Machinery, 2021, p. 883–898. [Online]. Available: <https://doi.org/10.1145/3453483.3454083>
- [43] NVIDIA, "Nvidia cupti - cuda profiling tools interface," 2024, accessed: 2024-11-23. [Online]. Available: <https://developer.nvidia.com/cupti>
- [44] NVIDIA, "Nvidia nemo - open-source toolkit for conversational ai," 2024, accessed: 2024-11-23. [Online]. Available: <https://www.nvidia.com/en-us/ai-data-science/products/nemo/>
- [45] NVIDIA Corporation, "Cuda profiling tools interface (cupti)," <https://developer.nvidia.com/cupti>, 2024, accessed: 2024-11-21.
- [46] NVIDIA Corporation, "NVIDIA HGX Platform," <https://www.nvidia.com/en-us/data-center/hgx/>, n.d., accessed: 2025-08-01.
- [47] I. V. Oseledets, "Tensor-train decomposition," *SIAM Journal on Scientific Computing*, vol. 33, no. 5, pp. 2295–2317, 2011.
- [48] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "Pytorch: An imperative style, high-performance deep learning library," 2019. [Online]. Available: <https://arxiv.org/abs/1912.01703>
- [49] PyTorch Contributors, "Pytorch profiler recipe," <https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html>, 2024, accessed: 2024-11-21.
- [50] PyTorch Team, "Kineto: Performance profiling library for pytorch," <https://github.com/pytorch/kineto>, 2025, accessed: 2025-07-31.
- [51] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, "Zero: Memory optimizations toward training trillion parameter models," in *SC20: International Conference for High Performance Computing, Networking, Storage and Analysis*. IEEE, 2020, pp. 1–16.
- [52] S. Rashidi, W. Won, S. Srinivasan, S. Sridharan, and T. Krishna, "Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models," in *Proceedings of the 49th Annual International Symposium on Computer Architecture (ISCA '22)*, 2022, pp. 581–596.
- [53] V. J. Reddi, C. Cheng, D. Kanter, P. Mattson, G. Schmuelling, C.-J. Wu, B. Anderson, M. Breughe, M. Charlebois, W. Chou, R. Chukka, C. Coleman, S. Davis, P. Deng, G. Diamos, J. Duke, D. Fick, J. S. Gardner, I. Hubara, S. Idgunji, T. B. Jablin, J. Jiao, T. S. John, P. Kanwar, D. Lee, J. Liao, A. Lokhmotov, F. Massa, P. Meng, P. Micikevicius, C. Osborne, G. Pekhimenko, A. T. R. Rajan, D. Sequeira, A. Sirasao, F. Sun, H. Tang, M. Thomson, F. Wei, E. Wu, L. Xu, K. Yamada, B. Yu, G. Yuan, A. Zhong, P. Zhang, and Y. Zhou, "Mlperf inference benchmark," 2020. [Online]. Available: <https://arxiv.org/abs/1911.02549>
- [54] J. K. Reed, Z. DeVito, H. He, A. Ussery, and J. Ansel, "Torch.fx: Practical program capture and transformation for deep learning in python," 2022. [Online]. Available: <https://arxiv.org/abs/2112.08429>
- [55] Facebook Research, "Param: A trace abstraction for ml workloads," <https://github.com/facebookresearch/param>, 2023, accessed: 2025-04-14.
- [56] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, "Megatron-lm: Training multi-billion parameter language models using model parallelism," 2020. [Online]. Available: <https://arxiv.org/abs/1909.08053>
- [57] S. Sridharan, T. Heo, L. Feng, Z. Wang, M. Bergeron, W. Fu, S. Zheng, B. Coutinho, S. Rashidi, C. Man *et al.*, "Chakra: Advancing performance benchmarking and co-design using standardized execution traces," *arXiv preprint arXiv:2305.14516*, 2023.
- [58] PyTorch Team, "Kineto: A cpu+gpu profiling library for pytorch," <https://github.com/pytorch/kineto>, 2023, accessed: 2025-04-14.
- [59] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, "Llama: Open and efficient foundation language models," 2023. [Online]. Available: <https://arxiv.org/abs/2302.13971>
- [60] C. Unger, Z. Jia, W. Wu, S. Lin, M. Baines, C. E. Q. Narvaez, V. Ramakrishnaiah, N. Prajapati, P. McCormick, J. Mohd-Yusof, X. Luo, D. Mudigere, J. Park, M. Smelyanskiy, and A. Aiken, "Unity: Accelerating DNN training through joint optimization of algebraic transformations and parallelization," in *16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)*. Carlsbad, CA: USENIX Association, Jul. 2022, pp. 267–284. [Online]. Available: <https://www.usenix.org/conference/osdi22/presentation/unger>
- [61] N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W. S. Moses, S. Verdoolaege, A. Adams, and A. Cohen, "Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions," 2018. [Online]. Available: <https://arxiv.org/abs/1802.04730>
- [62] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," 2023. [Online]. Available: <https://arxiv.org/abs/1706.03762>
- [63] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," 2023. [Online]. Available: <https://arxiv.org/abs/1706.03762>
- [64] W. Wang, M. Ghobadi, K. Shakeri, Y. Zhang, and N. Hasani, "Rail-only: A Low-Cost High-Performance Network for Training LLMs with Trillion Parameters," in *Proceedings of the 2024 IEEE Symposium on High-Performance Interconnects (HOTI)*, 2024.
- [65] X. Wang, Q. Li, Y. Xu, G. Lu, D. Li, L. Chen, H. Zhou, L. Zheng, S. Zhang, Y. Zhu, Y. Liu, P. Zhang, K. Qian, K. He, J. Gao, E. Zhai, D. Cai, and B. Fu, "SimAI: Unifying architecture design and performance tuning for Large-Scale large language model training with scalability and precision," in *22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25)*. Philadelphia, PA: USENIX Association, Apr. 2025, pp. 541–558. [Online]. Available: <https://www.usenix.org/conference/nsdi25/presentation/wangxizheng-simai>
- [66] W. Won, T. Heo, S. Rashidi, S. Sridharan, S. Srinivasan, and T. Krishna, "Astra-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale," in *2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)*. IEEE, 2023, pp. 283–294.
- [67] W. Won, T. Heo, S. Rashidi, S. Sridharan, S. Srinivasan, and T. Krishna, "Astra-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale," 2023. [Online]. Available: <https://arxiv.org/abs/2303.14006>
- [68] C. Xia, J. Zhao, Q. Sun, Z. Wang, Y. Wen, T. Yu, X. Feng, and H. Cui, "Optimizing deep learning inference via global analysis and tensor expressions," in *Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1*, ser. ASPLOS '24. New York, NY, USA: Association for Computing Machinery, 2024, pp. 286–301. [Online]. Available: <https://doi.org/10.1145/3617232.3624858>
- [69] J. Zhao, X. Gao, R. Xia, Z. Zhang, D. Chen, L. Chen, R. Zhang, Z. Geng, B. Cheng, and X. Jin, "Apollo: Automatic partition-based operator fusion through layer by layer optimization," in *Proceedings of Machine Learning and Systems*, D. Marculescu, Y. Chi, and C. Wu, Eds., vol. 4, 2022, pp. 1–19. [Online]. Available: <https://proceedings.mlsys.org/paper_files/paper_2022/file/e175e8a86d28d935be4f43719651f86d-Paper.pdf>
- [70] Y. Zhao, A. Gu, R. Varma, L. Luo, C.-C. Huang, M. Xu, L. Wright, H. Shojanazeri, M. Ott, S. Shleifer, A. Desmaison, C. Balioglu, P. Damania, B. Nguyen, G. Chauhan, Y. Hao, A. Mathews, and S. Li, "Pytorch fsdp: Experiences on scaling fully sharded data parallel," 2023. [Online]. Available: <https://arxiv.org/abs/2304.11277>
- [71] Z. Zhu, C. Giannoula, M. Andoovereed, Q. Su, K. Mangalam, B. Zheng, and G. Pekhimenko, "Mist: Efficient distributed training of large language models via memory-parallelism co-optimization," in *Proceedings of the Twentieth European Conference on Computer Systems*, ser. EuroSys '25. ACM, Mar. 2025, pp. 1298–1316. [Online]. Available: <http://dx.doi.org/10.1145/3689031.3717461>
- [72] Y. Zu, A. Ghaffarkhah, H.-V. Dang, B. Towles, S. Hand, S. Huda, A. Bello, A. Kolbasov, A. Rezaei, D. Du, S. Lacy, H. Wang, A. Wisner, C. Lewis, and H. Bahini, "Resiliency at scale: Managing Google's TPUv4 machine learning supercomputer," in *21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)*. Santa Clara, CA: USENIX Association, Apr. 2024, pp. 761–774. [Online]. Available: <https://www.usenix.org/conference/nsdi24/presentation/zu>