

# A Survey of High-Level Modeling and Simulation Methods for Modern Machine Learning Workloads

MICRO 2026 Submission – Confidential Draft – Do NOT Distribute!!

## Abstract

As machine learning workloads grow in scale and complexity, architects and system designers need fast, accurate methods to predict their performance across diverse hardware platforms. This survey analyzes modeling and simulation methods for predicting the performance of ML workloads, covering analytical models, cycle-accurate simulators, trace-driven approaches, and ML-augmented hybrid techniques. We survey over 50 tools and methods from architecture and systems venues published between 2016 and 2026, spanning DNN accelerator modeling (Timeloop, MAESTRO, Sparseloop), GPU simulation (GPGPU-Sim, Accel-Sim, NeuSight), distributed training simulation (ASTRA-sim, Lumos, SimAI), and LLM inference serving (VIDUR, Frontier, AMALI). We organize the literature along two primary dimensions: methodology type (analytical, simulation, ML-augmented, hybrid) and target platform (accelerators, GPUs, distributed systems, edge devices). We additionally characterize tools by workload coverage, prediction targets, and independently verified accuracy. Our analysis reveals that hybrid approaches combining analytical structure with learned components achieve the best accuracy-speed trade-offs, while pure analytical models offer superior interpretability for design space exploration. We conduct hands-on reproducibility evaluations and report independently measured accuracy, finding significant gaps between paper-reported and independently verified numbers. We identify key open challenges including cross-workload generalization beyond CNNs, composition of kernel-level predictions to end-to-end accuracy, and support for emerging architectures. This survey gives practitioners guidance for selecting appropriate modeling tools and gives researchers a roadmap for advancing ML workload performance prediction.

## Keywords

performance modeling, machine learning workloads, simulation, computer architecture, design space exploration, survey

## 1 Introduction

Machine learning workloads, spanning training and inference for CNNs, transformers, mixture-of-experts models, and graph neural networks, have become the dominant consumers of compute across datacenters and edge devices. Architects and system designers need fast, accurate performance predictions to navigate vast design spaces, select parallelization strategies, provision serving infrastructure, and optimize hardware-software co-design. Yet ML workloads pose unique modeling challenges: they exhibit diverse computational patterns (dense matrix operations in attention layers, sparse accesses in GNNs, communication-bound collective operations in distributed training) across an increasingly heterogeneous hardware landscape of GPUs, TPUs, custom accelerators, and multi-device clusters.

A rich ecosystem of modeling and simulation tools has emerged to address these challenges, spanning a methodological spectrum from analytical models to cycle-accurate simulators to ML-augmented hybrid approaches. Analytical frameworks like Timeloop [45] and MAESTRO [34] model DNN accelerator performance through closed-form data movement analysis, reaching 5–10% error at microsecond-scale evaluation times. Cycle-accurate simulators like GPGPU-Sim [3] and Accel-Sim [29] provide detailed GPU modeling but require hours per workload. Trace-driven simulators like ASTRA-sim [62] and VIDUR [2] target distributed training and LLM serving at system scale. ML-augmented approaches like NeuSight [38] learn performance functions from profiling data, achieving 2.3% error on GPU kernel prediction. Each methodology occupies a distinct point in the accuracy-speed-generality trade-off space.

Despite this rich tool landscape, no comprehensive survey organizes these methods from the perspective of the ML workload practitioner. Existing surveys focus on ML *techniques* for performance modeling [56] or on specific hardware targets [45], leaving practitioners without guidance on which tools suit their needs across the full modeling spectrum. This survey fills that gap.

We make the following contributions:

- A **methodology-centric taxonomy** organizing tools along two primary dimensions: methodology type (analytical, simulation, ML-augmented, hybrid) and target platform (DNN accelerators, GPUs, distributed systems, edge devices), with additional characterization by workload coverage, prediction targets, and verified accuracy.
- A **systematic survey** of over 50 modeling tools and methods from architecture venues (MICRO, ISCA, HPCA, ASPLOS) and systems venues (MLSys, OSDI, NSDI) published between 2016 and 2026, using documented selection criteria.
- A **comparative analysis** examining trade-offs between accuracy, speed, generalization, and interpretability, with independently measured accuracy where feasible rather than relying solely on paper-reported numbers.
- **Hands-on reproducibility evaluations** of representative tools with a 10-point rubric, and identification of **open challenges** including the CNN-to-transformer generalization gap, kernel-to-end-to-end error composition, and emerging accelerator support.

The remainder of this paper is organized as follows. Section 2 describes our survey methodology. Section 3 provides background on ML workload characteristics and modeling fundamentals. Section 4 presents our classification taxonomy. Section 5 surveys approaches organized by target platform. Section 6 offers comparative analysis across key dimensions. Section 7 discusses open challenges and future directions. Section 8 presents hands-on reproducibility evaluations. Section 9 concludes.



**Figure 1: Evolution of performance modeling tools for ML workloads (2016–2026).** Early analytical frameworks (Eyeriss, Paleo) gave way to systematic accelerator modeling (Timeloop, MAESTRO) and distributed training simulation (ASTRA-sim). ML-augmented approaches (TVM, Habitat, NeuSight) learn performance functions from data. Recent work targets LLM-specific modeling (VIDUR, AMALI, Frontier) and large-scale training prediction (Lumos).

Figure 1 illustrates the evolution of performance modeling tools for ML workloads, from early analytical frameworks through simulators to modern hybrid approaches.

## 2 Survey Methodology

We follow a systematic methodology for identifying, selecting, and classifying papers in this survey.

**Search strategy.** We searched ACM Digital Library, IEEE Xplore, Semantic Scholar, and arXiv using terms including “performance modeling DNN,” “DNN accelerator simulator,” “LLM inference prediction,” “distributed training simulation,” “neural network latency estimation,” and “ML workload performance.” We additionally performed backward/forward citation tracking from seminal works (Timeloop, ASTRA-sim, NeuSight) and monitored proceedings of target venues.

**Target venues.** Architecture: MICRO, ISCA, HPCA, ASPLOS. Systems: MLSys, OSDI, SOSP, NSDI. Related: NeurIPS, ICML, MobiSys, DAC, ISPASS.

**Inclusion criteria.** Papers must (1) propose or evaluate a tool or method for predicting performance of ML workloads (training or inference), (2) target at least one hardware platform (GPU, accelerator, distributed system, or edge device), and (3) include quantitative evaluation of prediction accuracy or modeling fidelity.

**Exclusion criteria.** We exclude (1) papers using ML for non-performance tasks (e.g., power estimation without latency), (2) papers modeling general-purpose (non-ML) workloads exclusively, and (3) papers without quantitative evaluation.

**Selection process.** Our initial search yielded 287 candidate papers. After title/abstract screening against inclusion criteria, 118 remained. Full-text review reduced the set to 53 papers that met all criteria. We additionally include 12 foundational works (gem5, roofline model, DRAMSim, etc.) as context for understanding the modeling landscape.

**Time period.** We cover papers published between 2016 and 2026, with foundational works from earlier years included for context.

**Classification.** We classify each paper along two primary dimensions: *methodology type* (analytical, cycle-accurate simulation, trace-driven simulation, ML-augmented, or hybrid) and *target platform* (DNN accelerator, GPU, distributed system, edge device, or CPU). Secondary dimensions include workload coverage, prediction targets, and reported accuracy metrics.

## 3 Background

This section provides background on the characteristics of ML workloads that make performance modeling challenging, and reviews the fundamental approaches used to model them.

### 3.1 ML Workload Characteristics

ML workloads present unique performance modeling challenges compared to general-purpose programs.

**Computational structure.** ML workloads are composed of well-defined operators (convolutions, matrix multiplications, attention layers, normalization) with statically known shapes and data types. This regularity enables analytical modeling of compute and data movement, unlike branch-heavy general-purpose code. However, modern architectures like mixture-of-experts (MoE) and dynamic inference introduce input-dependent control flow that complicates static analysis.

**Memory hierarchy sensitivity.** DNN accelerators employ specialized memory hierarchies with explicit data orchestration. The mapping of tensor operations to hardware (dataflow, tiling, loop ordering) critically determines performance. For LLM inference, KV cache management dominates memory behavior, with cache sizes scaling linearly with sequence length and batch size [35].
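The linear scaling of the KV cache can be made concrete with a back-of-the-envelope estimate. The sketch below assumes a standard decoder-only transformer that stores one key and one value vector per token, per layer, per KV head. The function name and the example model dimensions are illustrative assumptions, not values taken from any surveyed tool.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    Assumes one key and one value vector (hence the factor 2) per token,
    per layer, per KV head. bytes_per_elem=2 corresponds to FP16/BF16.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


if __name__ == "__main__":
    # Hypothetical model: 32 layers, 32 KV heads, head dimension 128, FP16.
    base = kv_cache_bytes(32, 32, 128, seq_len=2048, batch_size=1)
    print(f"seq=2048, batch=1: {base / 2**30:.2f} GiB")  # 1.00 GiB

    # Doubling either sequence length or batch size doubles the footprint.
    print(kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1) / base)  # 2.0
    print(kv_cache_bytes(32, 32, 128, seq_len=2048, batch_size=2) / base)  # 2.0
```

Real serving systems add paging, quantization, and sharing on top of this estimate, so it is best read as an upper-level sizing rule rather than a prediction of measured memory use.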

**Scale and distribution.** Large model training distributes computation across thousands of GPUs using data, tensor, pipeline, and expert parallelism [13]. Performance depends on the interplay between compute, memory bandwidth, and network communication—requiring system-level modeling beyond single-device prediction.

**Distinct inference phases.** LLM inference exhibits qualitatively different phases: prefill (compute-bound, processing the full prompt) and decode (memory-bound, generating tokens autoregressively) [47]. Effective modeling must capture both phases and their interaction under batched serving [1, 65].

### 3.2 Modeling Methodologies

We classify modeling approaches into four categories that form the primary axis of our taxonomy.

**Analytical models** express performance as closed-form functions of workload and hardware parameters. The roofline model [61] bounds throughput by  $P = \min(\pi, \beta \cdot I)$ , where  $\pi$  is peak compute,  $\beta$  is memory bandwidth, and  $I$  is operational intensity. For DNN accelerators, Timeloop [45] analytically computes data movement costs across memory hierarchies for any valid mapping. Analytical models provide microsecond evaluation and full interpretability, but require manual derivation per architecture and struggle with dynamic microarchitectural effects.
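As a minimal illustration of the roofline bound $P = \min(\pi, \beta \cdot I)$, the sketch below evaluates it for a hypothetical device. The peak compute and bandwidth figures are made-up round numbers chosen for easy arithmetic, and the function names are our own.

```python
def roofline_throughput(peak_flops, mem_bandwidth, intensity):
    """Attainable throughput P = min(pi, beta * I), in FLOP/s.

    peak_flops    -- pi, peak compute in FLOP/s
    mem_bandwidth -- beta, memory bandwidth in bytes/s
    intensity     -- I, operational intensity in FLOP/byte
    """
    return min(peak_flops, mem_bandwidth * intensity)


def ridge_point(peak_flops, mem_bandwidth):
    """Operational intensity at which a kernel turns compute-bound."""
    return peak_flops / mem_bandwidth


if __name__ == "__main__":
    # Hypothetical device: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
    PI, BETA = 100e12, 1e12
    print(ridge_point(PI, BETA))                # 100.0 FLOP/byte
    print(roofline_throughput(PI, BETA, 10.0))  # memory-bound: 1e13 FLOP/s
    print(roofline_throughput(PI, BETA, 200.0)) # compute-bound: 1e14 FLOP/s
```

Kernels with intensity below the ridge point are limited by memory bandwidth, and those above it by peak compute.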

**Cycle-accurate simulators** model hardware cycle by cycle at the microarchitectural level. gem5 [5] (CPUs), GPGPU-Sim [3] (GPUs), and Accel-Sim [29] (modern GPUs) offer high fidelity but incur a 1000–10000× slowdown, making them impractical for full ML workload evaluation. Sampling techniques (SimPoint [53], SMARTS [64]) reduce simulation time but were designed for general-purpose workloads and may not capture ML-specific patterns.

**Trace-driven simulation** uses execution traces as input rather than full binary execution, enabling faster evaluation. ASTRA-sim [62] models distributed training using Chakra execution traces [54] with pluggable compute, memory, and network backends. VIDUR [2] provides discrete-event simulation for LLM serving using kernel-level profiles. This approach trades some fidelity for orders-of-magnitude speedup over cycle-accurate simulation.
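The event-driven idea behind such serving simulators can be sketched in a few lines. The example below is a deliberately simplified illustration, not VIDUR's design: it assumes a single replica that serves one request at a time in arrival order, with constant per-token prefill and decode costs standing in for profiled kernel latencies. All names and numbers are hypothetical.

```python
import heapq


def simulate_fifo_serving(requests, prefill_s_per_token, decode_s_per_token):
    """Simulate one replica serving requests first-come, first-served.

    requests -- list of (arrival_s, prompt_tokens, output_tokens)
    Returns the end-to-end latency of each request, in input order.
    No batching, KV cache limits, or scheduling policies are modeled.
    """
    # Event queue of arrivals, ordered by simulated time.
    events = [(arrival, idx) for idx, (arrival, _, _) in enumerate(requests)]
    heapq.heapify(events)

    server_free_at = 0.0
    latencies = [0.0] * len(requests)
    while events:
        arrival, idx = heapq.heappop(events)
        _, prompt_tokens, output_tokens = requests[idx]
        # A request starts when it has arrived and the replica is idle.
        start = max(arrival, server_free_at)
        service = (prompt_tokens * prefill_s_per_token
                   + output_tokens * decode_s_per_token)
        server_free_at = start + service
        latencies[idx] = server_free_at - arrival
    return latencies


if __name__ == "__main__":
    # Hypothetical costs: 0.1 ms per prefill token, 20 ms per decoded token.
    reqs = [(0.0, 1000, 100), (0.5, 500, 50)]
    for lat in simulate_fifo_serving(reqs, 1e-4, 0.02):
        print(f"{lat:.2f} s")
```

The second request waits behind the first, so its latency includes queueing delay on top of its own service time. Effects like this are what single-kernel predictors cannot capture.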

**ML-augmented approaches** learn performance functions from profiling data. These range from simple models (random forests in nn-Meter [68], XGBoost in TVM [10]) to deep learning (NeuSight [38]) and meta-learning (HELP [37]). The quality of profiling data depends on hardware performance counter infrastructure—PAPI [6] provides a portable API for accessing hardware counters across architectures, while LIKWID [57] offers lightweight topology-aware measurement for x86 multicore environments. These tools underpin the data collection pipeline for ML-augmented approaches, though their coverage of GPU-specific counters (e.g., tensor core utilization, warp scheduling events) remains limited. ML-augmented approaches can capture complex non-linear relationships that elude analytical treatment, but require training data and may not generalize beyond their training distribution.

### 3.3 Problem Formulation

Performance modeling maps workload  $\mathcal{W}$  and hardware  $\mathcal{H}$  to a performance metric  $y$ :  $\hat{y} = f(\mathcal{W}, \mathcal{H}; \theta)$ . Workloads are represented at operator level (layer parameters), graph level (computation graphs), IR level (compiler representations), or trace level (recorded runtime behavior). Hardware is characterized by specifications, performance counters, or learned embeddings.

**Prediction targets** include latency (execution time), throughput (samples/second), energy (Joules per inference), and memory footprint. Multi-objective formulations enable Pareto-optimal design selection.
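To illustrate Pareto-optimal selection over two prediction targets, the following sketch filters a set of candidate designs by predicted latency and energy, where lower is better for both. The design names and numbers are invented for the example.

```python
def pareto_front(designs):
    """Return names of designs not dominated in (latency, energy).

    designs -- list of (name, latency, energy); lower is better for both.
    A design is dominated if another is no worse in both metrics and
    strictly better in at least one.
    """
    front = []
    for name, lat, en in designs:
        dominated = any(
            (l2 <= lat and e2 <= en) and (l2 < lat or e2 < en)
            for _, l2, e2 in designs
        )
        if not dominated:
            front.append(name)
    return front


if __name__ == "__main__":
    # Hypothetical predictions: (design, latency in ms, energy in mJ).
    candidates = [("A", 1.0, 5.0), ("B", 2.0, 3.0),
                  ("C", 2.5, 3.5), ("D", 4.0, 1.0)]
    print(pareto_front(candidates))  # ['A', 'B', 'D']
```

Design C is dropped because B is better on both latency and energy. The remaining designs represent genuine trade-offs.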

**Accuracy metrics** vary across the literature: MAPE (scale-invariant relative error), RMSE (penalizes large deviations), and rank correlation (Kendall’s  $\tau$ ) for design space ordering. Direct comparison across papers is limited by differences in benchmarks, hardware targets, and evaluation protocols—a challenge we discuss in Section 6.
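The three metrics answer different questions, which a small example makes visible. The sketch below implements MAPE, RMSE, and Kendall's $\tau$ (the tau-a variant, without tie correction) in plain Python. The latency values are invented, and the predictor is assumed to overestimate every measurement by exactly 50%.

```python
import math
from itertools import combinations


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(p - a) / abs(a)
                       for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, in the units of the inputs."""
    return math.sqrt(sum((p - a) ** 2
                         for a, p in zip(actual, predicted)) / len(actual))


def kendall_tau(actual, predicted):
    """Kendall tau-a: (concordant - discordant) / total pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(actual)), 2):
        sign = (actual[i] - actual[j]) * (predicted[i] - predicted[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(actual) * (len(actual) - 1) // 2
    return (concordant - discordant) / n_pairs


if __name__ == "__main__":
    measured = [10.0, 20.0, 40.0]    # hypothetical latencies in ms
    predicted = [15.0, 30.0, 60.0]   # uniform 50% overestimate
    print(mape(measured, predicted))
    print(rmse(measured, predicted))
    print(kendall_tau(measured, predicted))
```

Here the MAPE is 50% while the rank correlation is perfect. A predictor with poor absolute accuracy can therefore still order design points correctly, which is often sufficient for design space exploration and autotuning.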

## 4 Taxonomy

We organize the literature along two primary dimensions: *methodology type* and *target platform*. Figure 2 illustrates how these dimensions intersect. This methodology-centric taxonomy helps practitioners select appropriate tools for their use case and helps researchers identify gaps in the current landscape.

### 4.1 By Methodology Type

The choice of methodology reflects fundamental trade-offs between accuracy, evaluation speed, generality, and interpretability.

**4.1.1 Analytical Models.** Analytical models express performance as closed-form functions of workload and hardware parameters. For DNN accelerators, Timeloop [45] models data movement across memory hierarchies for any valid loop-nest mapping, staying within 5–10% of RTL results at a 2000× speedup. MAESTRO [34] provides data-centric dataflow analysis using intuitive directives. Sparseloop [63] extends this analysis to sparse tensor operations. Paleo [48] pioneered layer-wise analytical modeling for DNNs, decomposing networks into compute and communication components for distributed training prediction. AMALI [9] targets LLM inference on GPUs through improved memory hierarchy modeling.

**Figure 2: Two-dimensional taxonomy for ML workload performance modeling.** The primary axis is methodology type (how performance is predicted); the secondary axis is target platform. Arrows show dominant pairings: analytical models for accelerators, cycle-accurate simulation for GPUs/CPUs, trace-driven simulation for distributed systems, and ML-augmented approaches for edge devices.

Analytical models provide microsecond evaluation, full interpretability, and “what-if” design analysis. Their limitation is that they require manual derivation per architecture and may miss complex dynamic effects (e.g., memory contention, scheduling variability). AMALI’s 23.6% MAPE illustrates the challenge: improved memory hierarchy treatment reduces GPU LLM inference modeling error from 127% to 23.6%, but significant residual error remains from unmodeled dynamic effects. This residual reflects the fundamental difficulty of analytical approaches for complex workloads, not a quality issue with the tool [9].

**4.1.2 Cycle-Accurate Simulation.** Cycle-accurate simulators model hardware cycle by cycle at the microarchitectural level, providing the highest fidelity. gem5 [5] (CPUs), GPGPU-Sim [3] (GPUs), and Accel-Sim [29] (modern NVIDIA GPUs, SASS-level trace-driven) achieve 0.90–0.97 IPC correlation. PyTorchSim [30] integrates PyTorch 2 with NPU simulation supporting custom RISC-V ISA and systolic arrays.

The primary limitation is speed: simulating a single ResNet-50 inference may require hours, making these tools impractical for design space exploration of ML workloads. Simulation sampling techniques (SimPoint [53], SMARTS [64], LoopPoint [51]) accelerate general-purpose workload simulation but are not specifically validated for ML workload patterns. Recent work on dissecting modern GPU cores [25] has reduced Accel-Sim’s error to 13.98% MAPE by reverse-engineering undocumented microarchitectural details.

**4.1.3 Trace-Driven Simulation.** Trace-driven approaches use recorded execution traces rather than full binary execution, enabling system-level modeling at practical speeds. ASTRA-sim [62] models distributed training end-to-end using Chakra execution traces [54], with pluggable compute, memory, and network backends, and reports 5–15% error versus real clusters. Echo [7] simulates distributed training at scale using analytical compute models with network simulation. Lumos [41] targets LLM training performance through trace-driven modeling, achieving 3.3% error on H100 GPUs.

For LLM inference serving, VIDUR [2] provides discrete-event simulation capturing prefill/decode phases, KV cache management, and request scheduling (Orca [65], Sarathi [1] strategies) with <5% error. Frontier [18] extends to MoE and disaggregated inference with stage-centric simulation. SimAI [59] provides full-stack LLM training simulation achieving 98.1% alignment with production results at Alibaba Cloud scale.

These tools occupy a practical middle ground: fast enough for design exploration, detailed enough to capture system-level interactions that analytical models miss. Note that some tools in this category use ML internally (e.g., VIDUR uses random forests for kernel latency prediction), blurring the boundary with hybrid approaches.

**4.1.4 ML-Augmented Models.** ML-augmented approaches learn performance functions entirely from profiling data, without embedding analytical domain knowledge. nn-Meter [68] uses random forest ensembles with kernel-level feature engineering for edge device latency prediction. LitePred [16] scales to 85 edge platforms using VAE-based intelligent sampling and transfer learning. HELP [37] formulates cross-hardware prediction as meta-learning, achieving adaptation with just 10 samples on new devices. TVM [10] and Ansor [69] use XGBoost/MLP cost models to guide compiler autotuning, with the TenSet dataset [70] (52M records) enabling pre-trained models.

ML-augmented approaches excel when sufficient profiling data is available and the training distribution matches deployment conditions. However, they may fail silently outside their training distribution (e.g., CNN-trained models applied to transformers) and provide limited interpretability for design insight. nn-Meter’s paper-reported <1% MAPE cannot be independently verified, as the tool’s pre-trained predictors fail with current scikit-learn versions due to pickle serialization changes. This is a cautionary example of how ML-augmented approaches can become irreproducible.

**4.1.5 Hybrid Analytical+ML Models.** Hybrid approaches combine analytical structure with learned components, achieving both interpretability and high accuracy. The analytical component provides a physics-based prior; the ML component learns residual corrections.
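The prior-plus-correction pattern can be sketched with deliberately simple stand-ins. In the example below, the analytical prior is a roofline-style latency bound, and the learned component is an ordinary least-squares linear correction fitted to measurements. The surveyed tools use richer learners such as MLPs, so this shows only the pattern and not any tool's method. The device parameters and the synthetic "measurements" are assumptions.

```python
def analytical_latency(flops, bytes_moved, peak_flops, mem_bandwidth):
    """Roofline-style prior: time is bounded by compute or memory traffic."""
    return max(flops / peak_flops, bytes_moved / mem_bandwidth)


def fit_linear_correction(priors, measured):
    """Fit measured ~= a * prior + b by ordinary least squares."""
    n = len(priors)
    mean_x = sum(priors) / n
    mean_y = sum(measured) / n
    sxx = sum((x - mean_x) ** 2 for x in priors)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(priors, measured))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def hybrid_latency(prior, a, b):
    """Apply the learned correction on top of the analytical prior."""
    return a * prior + b


if __name__ == "__main__":
    PEAK, BW = 100e12, 1e12  # hypothetical device: 100 TFLOP/s, 1 TB/s
    kernels = [(1e9, 1e8), (2e10, 1e8), (1e9, 1e9), (5e10, 2e8)]
    priors = [analytical_latency(f, b, PEAK, BW) for f, b in kernels]
    # Synthetic measurements: 20% inefficiency plus 5 us launch overhead.
    measured = [1.2 * p + 5e-6 for p in priors]
    a, b = fit_linear_correction(priors, measured)
    print(f"scale={a:.3f}, offset={b * 1e6:.2f} us")
    print(hybrid_latency(priors[0], a, b))
```

Because the learned part only has to model what the prior misses, it needs far less data than a predictor trained from scratch, and the prior keeps predictions sensible outside the training range.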

NeuSight [38] uses tile-based prediction mirroring CUDA’s execution model with MLP prediction heads, achieving 2.3% error on GPT-3 inference across H100, A100, and V100 GPUs. Concorde [44] fuses compositional analytical models with learned corrections for CPU performance, achieving 2% CPI error while running five orders of magnitude faster than gem5. Habitat [66] decomposes execution into analytically-modeled compute and memory components that scale with hardware parameters. ArchGym [33] connects ML optimization algorithms to analytical simulators for design space exploration.

The latency predictor study [15] demonstrates that hybrid approaches with transfer learning achieve 22.5% average improvement over baselines. Note that accuracy comparisons between hybrid and pure approaches require care: ArchGym’s reported 0.61% RMSE measures surrogate-vs-simulator fidelity, not real hardware accuracy, and should not be directly compared with NeuSight’s 2.3% MAPE against measured hardware [33].

### 4.2 By Target Platform

The target platform determines what performance effects must be modeled and constrains which methodologies are applicable.

**DNN Accelerators** (systolic arrays, dataflow architectures) are best served by analytical models (Timeloop, MAESTRO, Sparseloop) due to their regular, statically analyzable memory hierarchies and explicit dataflow control.

**GPUs** span the full methodology spectrum, from cycle-accurate (GPGPU-Sim, Accel-Sim) through analytical (AMALI, roofline [27, 61]) to hybrid (NeuSight, Habitat), reflecting the complexity of SIMT execution, warp scheduling, and memory coalescing.

**Distributed systems** are primarily served by trace-driven simulation (ASTRA-sim, VIDUR, Lumos, SimAI, Frontier) because system-level interactions (collective communication, pipeline parallelism, scheduling) cannot be captured by single-device models.

**Edge/mobile devices** are dominated by ML-augmented approaches (nn-Meter, LitePred, HELP) because the diversity of edge hardware makes per-device analytical modeling impractical.

**CPUs** for ML workloads are less studied because most ML training and inference runs on GPUs/accelerators. Concorde and GRANITE [56] target CPU performance but focus on general-purpose workloads rather than ML-specific patterns.

## 5 Survey of Approaches

This section surveys performance modeling tools for ML workloads, organized by target platform. For each platform, we examine the modeling challenges, describe the available tools across methodology types, and critically analyze their strengths and limitations. Table 1 provides a comprehensive comparison.

### 5.1 DNN Accelerator Modeling

DNN accelerators employ specialized dataflows and memory hierarchies optimized for tensor operations. The regularity of DNN computations makes this domain particularly amenable to analytical modeling.

**Analytical frameworks** dominate accelerator modeling. Timeloop [45] analytically computes data reuse, latency, and energy from loop-nest representations, staying within 5–10% of RTL simulation at a 2000× speedup. It provides reference outputs for standard accelerator designs (Eyeriss [11], Simba) with deterministic results, a key reproducibility strength. MAESTRO [34] offers data-centric dataflow directives that simplify specification but is less precise than Timeloop for detailed energy modeling. Sparseloop [63] extends to sparse tensor operations, critical for efficient transformer inference where attention matrices exhibit structured sparsity.

**Simulation approaches.** PyTorchSim [30] integrates PyTorch 2 with NPU simulation supporting custom RISC-V ISA and systolic arrays, bridging the gap between ML frameworks and hardware simulation.

**ML-augmented design.** ArchGym [33] connects ML optimization algorithms to analytical simulators for design space exploration. Its reported 0.61% RMSE measures how faithfully the ML surrogate reproduces the simulator’s predictions, not accuracy against real hardware. This distinction matters: surrogate fidelity enables fast DSE but does not validate the underlying simulator’s accuracy.

**Table 1: Summary of surveyed performance modeling tools for ML workloads, organized by target platform.** Methodology: A=Analytical, S=Simulation, T=Trace-driven, M=ML-augmented, H=Hybrid. \*Measures surrogate-vs-simulator fidelity, not real hardware error. †Reported value unverifiable due to reproducibility issues. ‡No accuracy baseline against real hardware reported.

| Tool                                   | Platform    | Method | Prediction target  | Reported error | Speed      | Key capability          |
|----------------------------------------|-------------|--------|--------------------|----------------|------------|-------------------------|
| *DNN Accelerator Modeling*             |             |        |                    |                |            |                         |
| Timeloop [45]                          | NPU         | A      | Latency/Energy     | 5–10%          | μs         | Loop-nest DSE           |
| MAESTRO [34]                           | NPU         | A      | Latency/Energy     | 5–15%          | μs         | Data-centric directives |
| Sparseloop [63]                        | NPU         | A      | Sparse tensors     | 5–10%          | μs         | Compression modeling    |
| PyTorchSim [30]                        | NPU         | S      | Cycle-accurate     | N/A‡           | Hours      | PyTorch 2 integration   |
| ArchGym [33]                           | Multi       | H      | Multi-objective    | 0.61%\*        | ms         | ML-aided DSE            |
| *GPU Performance Modeling*             |             |        |                    |                |            |                         |
| Accel-Sim [29]                         | GPU         | S      | Cycle-accurate     | 10–20%         | Hours      | SASS trace-driven       |
| GPGPU-Sim [3]                          | GPU         | S      | Cycle-accurate     | 10–20%         | Hours      | CUDA workloads          |
| AMALI [9]                              | GPU         | A      | LLM inference      | 23.6%          | ms         | Memory hierarchy        |
| NeuSight [38]                          | GPU         | H      | Kernel/E2E latency | 2.3%           | ms         | Tile-based prediction   |
| Habitat [66]                           | GPU         | H      | Training time      | 11.8%          | Per-kernel | Wave scaling            |
| *Distributed Training and LLM Serving* |             |        |                    |                |            |                         |
| ASTRA-sim [62]                         | Distributed | T      | Training time      | 5–15%          | Minutes    | Collective modeling     |
| SimAI [59]                             | Distributed | T      | Training time      | 1.9%           | Minutes    | Full-stack simulation   |
| Lumos [41]                             | Distributed | T      | LLM training       | 3.3%           | Minutes    | H100 training           |
| VIDUR [2]                              | GPU cluster | T      | LLM serving        | <5%            | Seconds    | Prefill/decode phases   |
| Frontier [18]                          | Distributed | T      | MoE inference      | —              | Minutes    | Stage-centric sim.      |
| TrioSim [39]                           | Multi-GPU   | T      | DNN training       | N/A‡           | Minutes    | Lightweight multi-GPU   |
| *Edge Device Modeling*                 |             |        |                    |                |            |                         |
| nn-Meter [68]                          | Edge        | M      | Latency            | <1%†           | ms         | Kernel detection        |
| LitePred [16]                          | Edge        | M      | Latency            | 0.7%           | ms         | 85-platform transfer    |
| HELP [37]                              | Multi       | M      | Latency            | 1.9%           | ms         | 10-sample adaptation    |
| *Compiler Cost Models*                 |             |        |                    |                |            |                         |
| TVM [10]                               | GPU         | M      | Schedule perf.     | ~15%           | ms         | Autotuning guidance     |
| Ansor [69]                             | GPU         | M      | Schedule perf.     | ~15%           | ms         | Program sampling        |
| TLP [67]                               | GPU         | M      | Tensor program     | <10%           | ms         | Transformer cost model  |

**Emerging accelerators.** Processing-in-memory (PIM) architectures present fundamentally different modeling challenges. uPIMulator [26] provides cycle-accurate PIM simulation for UPMEM devices. AttAcc [46] and NeuPIMs [23] target PIM-based attention acceleration for transformers, while PAISE [36] addresses PIM-accelerated LLM inference scheduling. These tools remain in early stages compared to the mature DNN accelerator modeling ecosystem.

### 5.2 GPU Performance Modeling

GPUs dominate ML training and inference, making accurate GPU performance prediction critical. GPU modeling must account for SIMT execution, warp scheduling, memory coalescing, and workload-dependent occupancy effects.

**Cycle-accurate simulation.** GPGPU-Sim [3] and Accel-Sim [29] achieve 0.90–0.97 IPC correlation through detailed microarchitectural modeling. Recent work reverse-engineering modern GPU cores [25] has reduced Accel-Sim’s error to 13.98% MAPE by modeling previously undocumented features. However, a 1000–10000× slowdown makes these tools impractical for full ML workloads at production scale.

**Analytical models.** The roofline model [61] provides a useful upper bound but misses occupancy and memory hierarchy effects. Roofline-LLM [27] extends roofline analysis to LLM inference. AMALI [9] reduces GPU LLM inference MAPE from 127% (prior analytical baselines) to 23.6% through improved memory hierarchy modeling. The residual 23.6% error reflects the fundamental difficulty of analytically modeling GPU dynamic behavior (warp scheduling, L2 cache contention, bank conflicts) rather than a quality limitation.

**Hybrid learned models.** NeuSight [38] introduces tile-based prediction that mirrors CUDA’s execution model, achieving 2.3% MAPE on GPT-3 inference across H100, A100, and V100 GPUs. Habitat [66] decomposes execution into analytically-modeled compute and memory components using wave scaling analysis, achieving 11.8% error across GPU generations. Note that direct comparison between NeuSight and Habitat requires caution: NeuSight evaluates on 2023–2025 hardware (H100) with LLM workloads, while Habitat was designed for earlier GPUs with CNN/RNN workloads. The reported “50× improvement” therefore reflects different evaluation conditions rather than purely methodological advances.

**LLM-specific modeling.** LLM execution exhibits qualitatively different prefill (compute-bound) and decode (memory-bound) phases [71]. VIDUR [2] provides discrete-event simulation for LLM serving systems, capturing request scheduling strategies (Orca [65], Sarathi [1]) with <5% error. LIFE [17] offers hardware-agnostic analytical LLM inference modeling. HERMES [4] targets heterogeneous multi-stage inference pipelines. Emerging work uses LLMs themselves for GPU kernel performance prediction and optimization: Omniwise [21] achieves 90% of predictions within 10% error on AMD MI250/MI300X, while SwizzlePerf [58] uses hardware-aware LLMs for spatial optimization of GPU kernels, achieving up to 2.06× speedup through swizzling. Together, these results suggest that LLMs can both predict *and* optimize kernel performance.
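The prefill/decode distinction above can be illustrated with a back-of-the-envelope arithmetic intensity calculation for one linear layer. This is a sketch under stated assumptions: the function name, the 2-byte element size, and the choice to count only weight and activation traffic (no caching, no attention, no KV cache) are illustrative simplifications, not the model used by any surveyed tool.

```python
def linear_layer_intensity(num_tokens, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte for y = x @ W with x of shape (num_tokens, d_in)
    and W of shape (d_in, d_out), assuming each operand is moved once."""
    flops = 2 * num_tokens * d_in * d_out  # one multiply and one add per MAC
    weight_bytes = d_in * d_out * bytes_per_elem
    activation_bytes = num_tokens * (d_in + d_out) * bytes_per_elem
    return flops / (weight_bytes + activation_bytes)


# Decode processes one new token per request; prefill processes the whole prompt.
decode_intensity = linear_layer_intensity(num_tokens=1, d_in=4096, d_out=4096)
prefill_intensity = linear_layer_intensity(num_tokens=2048, d_in=4096, d_out=4096)
```

Under these assumptions the decode step performs just under one FLOP per byte, because the full weight matrix is read to process a single token, while a 2048-token prefill reuses the same weights across all tokens and reaches an intensity of 1024. Batching decode requests raises `num_tokens` and moves decode toward the compute-bound regime.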

**Compiler cost models.** TVM [10] and Ansor [69] use ML cost models (XGBoost, MLP) to guide autotuning, achieving ~15% MAPE. The TenSet dataset [70] (52M records) enables pre-trained models that accelerate autotuning 10×. TLP [67] uses deep learning for tensor program cost modeling. SynPerf [60] takes a complementary approach, synthesizing high-performance GPU kernels via pipeline decomposition: it uses performance models to guide kernel generation rather than merely evaluating existing kernels. These tools prioritize ranking accuracy for schedule selection over absolute error.
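The difference between ranking accuracy and absolute error can be made concrete with a small sketch. The code below implements MAPE and Kendall's tau (the tau-a variant, assuming no tied values) in plain Python; the function names and the four-schedule example are illustrative assumptions rather than data from TVM, Ansor, or TenSet.

```python
from itertools import combinations


def mape(measured, predicted):
    """Mean absolute percentage error, in percent."""
    errors = [abs(p - m) / m for m, p in zip(measured, predicted)]
    return 100.0 * sum(errors) / len(errors)


def kendall_tau(measured, predicted):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    pairs = list(combinations(range(len(measured)), 2))
    score = 0
    for i, j in pairs:
        agreement = (measured[i] - measured[j]) * (predicted[i] - predicted[j])
        if agreement > 0:
            score += 1
        elif agreement < 0:
            score -= 1
    return score / len(pairs)


# Hypothetical latencies (ms) of four candidate schedules, and a cost model
# that overestimates every one of them by exactly 2x.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 4.0, 6.0, 8.0]
```

Here the cost model has a MAPE of 100% yet a Kendall's tau of 1.0, so it would still select the fastest schedule. This is why a cost model used for autotuning can be useful despite a large absolute error, and why its MAPE is not directly comparable to that of a latency predictor.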

### 5.3 Distributed Training and LLM Serving

Distributed systems introduce communication overhead, synchronization barriers, and parallelism strategy choices. Performance depends on the interplay between compute, memory, and network, which requires system-level modeling.

**Training simulation.** ASTRA-sim [62] provides end-to-end distributed training simulation using Chakra execution traces [54], with validated HGX-H100 configurations and pluggable network backends. It achieves 5–15% error versus real clusters and enables exploration of parallelization strategies at scale. SimAI [59] provides full-stack LLM training simulation at Alibaba Cloud scale, achieving 98.1% alignment with production results. Lumos [41] targets LLM training through trace-driven modeling, achieving 3.3% error on H100 GPUs by capturing gradient accumulation, optimizer states, and activation checkpointing. Echo [7] simulates distributed training at scale. TrioSim [39] provides lightweight multi-GPU simulation. PRISM [19] uses probabilistic models for training performance prediction at 10K+ GPU scale.

**Scaling and parallelism.** The choice of parallelism strategy (data, tensor, pipeline, expert) critically impacts performance. Paleo [48] pioneered analytical estimation of training time by decomposing workloads into compute and communication components. MAD Max [24] extends analytical modeling to distributed systems. The Llama 3 scaling study [13] documents 4D parallelism at 16K H100 GPUs, providing ground truth for simulator validation. Sailor [55] addresses automated parallelism selection over heterogeneous clusters.
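The compute-plus-communication decomposition described above can be sketched for the simplest case, data-parallel training with a ring all-reduce. This is not Paleo's actual model: the function names, the bandwidth-only communication cost (no per-message latency), and the scalar `overlap` parameter are assumptions made for illustration.

```python
def ring_allreduce_time_s(grad_bytes, num_workers, link_bytes_per_s):
    """Bandwidth term of a ring all-reduce: each worker sends and receives
    2 * (N - 1) / N times the gradient size."""
    if num_workers == 1:
        return 0.0
    volume = 2.0 * (num_workers - 1) / num_workers * grad_bytes
    return volume / link_bytes_per_s


def step_time_s(compute_time_s, grad_bytes, num_workers, link_bytes_per_s, overlap=0.0):
    """Per-step time as compute plus the communication that is not hidden.
    `overlap` is the assumed fraction of communication overlapped with compute."""
    comm = ring_allreduce_time_s(grad_bytes, num_workers, link_bytes_per_s)
    return compute_time_s + (1.0 - overlap) * comm
```

For example, with 8 workers, 8 GB of gradients, and 100 GB/s links, the all-reduce term is about 0.14 s per step. Whether that matters depends on how much of it overlaps with the backward pass, which is one reason trace-driven simulators model the schedule explicitly instead of assuming a fixed overlap.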

**Inference serving.** VIDUR [2] simulates LLM inference serving with scheduling strategies (vLLM [35], Orca [65], Sarathi [1]) without requiring GPU hardware. Frontier [18] extends to MoE and disaggregated inference with stage-centric simulation. ThrottLL’em [28] models GPU throttling for energy-efficient LLM inference. Recent LLM inference optimizations also change the performance characteristics that simulators must capture. MEDUSA [8] introduces speculative decoding with multiple decoding heads, transforming sequential token-by-token generation into parallel verification and fundamentally altering the compute-to-memory ratio that models like VIDUR assume. POD-Attention [22] achieves full prefill-decode overlap, breaking the assumption of sequential phase execution that underlies most analytical LLM inference models. AQUA [52] demonstrates 20× improved responsiveness through network-accelerated KV cache offloading across GPU domains, introducing network latency into what was previously a local memory management problem. These optimizations illustrate a moving-target challenge: performance models must track not just hardware evolution but also algorithmic innovations that restructure execution patterns. Even so, the serving simulators above collectively enable infrastructure planning and scheduling algorithm comparison at scale.

**Memory system interactions.** Memory increasingly dominates ML performance. KV cache management is the dominant LLM serving challenge; vLLM’s PagedAttention [35] achieves 2–4× throughput improvement. VIDUR models cache allocation and eviction at the system level. Memory simulators have evolved through multiple generations: DRAMSim2 [50] established cycle-accurate DDR2/DDR3 simulation validated against manufacturer Verilog models, while DRAMSim3 [40] added thermal modeling and HMC support. Ramulator [32] introduced extensible support across DDRx, LPDDRx, GDDRx, and HBM standards, with Ramulator 2 [42] adding DDR5, HBM3, and RowHammer mitigation modeling. These simulators are primarily integrated with CPU/GPU simulators rather than used standalone for ML workloads, but they provide critical memory subsystem fidelity for tools like Accel-Sim that model data movement bottlenecks in ML training.
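To show why KV cache management dominates LLM serving, the sketch below estimates the cache footprint of one request and the number of fixed-size blocks a paged allocator would need for it. The function names, the example model shape, and the 16-token block size are illustrative assumptions; they are not taken from vLLM or VIDUR.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies: a key and a value vector
    per KV head, per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def blocks_needed(seq_len, block_tokens=16):
    """Fixed-size cache blocks required to hold seq_len tokens (ceiling division)."""
    return -(-seq_len // block_tokens)


# Hypothetical model: 32 layers, 32 KV heads of dimension 128, 2-byte elements.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
request_bytes = per_token * 4096  # one request with a 4096-token context
```

Under these assumptions each token occupies 0.5 MiB and a single 4096-token request occupies 2 GiB, so a few dozen concurrent long requests can exhaust accelerator memory. Allocating in blocks bounds the wasted space per request to less than one block, at the cost of an indirection on every cache access.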

### 5.4 Edge Device Modeling

Edge devices impose strict power, memory, and latency constraints. The diversity of edge hardware (mobile CPUs, GPUs, NPUs, DSPs) makes per-device analytical modeling impractical, leading to ML-augmented approaches.

nn-Meter [68] uses random forest ensembles with kernel-level feature engineering, reporting <1% MAPE. However, this claim is currently unverifiable: the tool’s pre-trained predictors fail with modern scikit-learn versions due to pickle serialization changes, scoring only 3/10 in our reproducibility evaluation. LitePred [16] scales to 85 edge platforms using VAE-based intelligent sampling and transfer learning, achieving 0.7% MAPE with under one hour of adaptation per device. HELP [37] formulates cross-hardware prediction as meta-learning with MAML-style adaptation, achieving 1.9% MAPE with just 10 measurement samples on new devices. ESM [43] provides a framework for building effective surrogate models for hardware-aware neural architecture search.

The latency predictor study [15] provides the most systematic evaluation across approaches, showing that transfer learning yields a 22.5% average improvement and up to 87.6% on challenging cross-platform transfers.

### 5.5 Cross-Cutting Challenges

Several challenges cut across platform categories.

**CNN-to-transformer gap.** Nearly all reported accuracy numbers are measured on CNN workloads. Performance on transformers, MoE, and diffusion models is less well characterized. NeuSight is a notable exception, evaluating on GPT-3 inference, but most tools lack validated transformer support.

**Kernel-to-end-to-end composition.** Many tools predict kernel-level performance (nn-Meter, NeuSight), but composing kernel predictions into accurate end-to-end estimates is an unsolved problem. Memory allocation, kernel launch overhead, and inter-operator data movement introduce errors that compound across layers.

**Static vs. profiling-based approaches.** A fundamental practical divide exists between tools that predict from static specifications only (Timeloop, MAESTRO, Paleo) and those requiring runtime profiling data (Habitat, nn-Meter, HELP). Static approaches enable pre-silicon evaluation and NAS; profiling-based approaches achieve higher accuracy on existing hardware. This distinction is often more practically relevant than the analytical-vs-ML divide.

## 6 Comparison and Analysis

We analyze trade-offs across methodology types along four dimensions: accuracy, speed, generalization, and interpretability. Table 2 summarizes key characteristics.

### 6.1 Accuracy by Problem Difficulty

Rather than comparing accuracy numbers directly (which is misleading across different benchmarks, metrics, and hardware), we organize results by problem difficulty.

**Accelerator dataflow modeling** is the most amenable to accurate prediction because computations are regular and memory access patterns are statically determined. Timeloop achieves 5–10% error against RTL through purely analytical means.

**Single-GPU kernel prediction** for known architectures achieves 2–12% error through hybrid approaches (NeuSight, Habitat) that embed hardware-specific inductive biases.

**Distributed system-level prediction** achieves 2–15% error through trace-driven simulation (SimAI 1.9%, Lumos 3.3%, ASTRA-sim 5–15%), reflecting the challenge of modeling compute-communication interaction.

**Cross-platform edge prediction** achieves 0.7–2% error (LitePred, HELP) but requires per-device profiling data, trading generality for accuracy.

**GPU analytical modeling** remains the most difficult, with AMALI’s 23.6% representing the current state of the art for purely analytical GPU LLM inference prediction, a problem where dynamic microarchitectural effects resist closed-form treatment.

### 6.2 Generalization Challenges

**Workload generalization.** Nearly all reported accuracy numbers are measured on CNN workloads. Cross-workload-type transfer (CNN→transformer) remains largely unvalidated. NeuSight is a notable exception, evaluating on LLM workloads, but most edge device predictors (nn-Meter, LitePred, HELP) are validated primarily on CNNs.

**Hardware generalization.** Three strategies show promise: meta-learning (HELP with 10-sample adaptation), feature-based transfer (LitePred across 85 devices), and analytical decomposition (Habitat separating compute/memory scaling). Cross-family transfer (GPU→TPU→PIM) remains unsolved.

**Temporal generalization.** Software stack evolution (framework updates, driver changes, compiler optimizations) invalidates trained models over time. No surveyed tool addresses continual learning for evolving software environments.

### 6.3 Interpretability and Design Insight

A key advantage of analytical models is actionable design insight. Timeloop identifies data movement bottlenecks; MAESTRO reveals suboptimal dataflow choices; VIDUR exposes scheduling inefficiencies. These insights directly guide design decisions.

ML-augmented approaches (nn-Meter, HELP) provide feature importance rankings but limited causal understanding. Hybrid approaches (NeuSight, Concorde) offer partial interpretability through their analytical components. The interpretability gap is practically significant: architects need to understand *why* a design is slow, not just predict *that* it is slow.

## 7 Open Challenges and Future Directions

### 7.1 Workload Coverage Gaps

Existing tools are primarily validated on CNN workloads. Transformers, mixture-of-experts (MoE), diffusion models, and dynamic inference patterns (e.g., AI agents with tool use [31]) remain underrepresented in validation benchmarks. LLM serving introduces variable sequence lengths (128–128K tokens) and dynamic batching that challenge static models. Scaling law prediction [12, 20] connects model size to performance but does not address hardware-specific modeling.

### 7.2 The Composition Problem

Many tools predict kernel-level or operator-level performance, but composing these predictions into accurate end-to-end estimates is an unsolved problem. Memory allocation overhead, kernel launch latency, inter-operator data movement, and framework scheduling introduce compounding errors. For distributed systems, the composition extends across devices with communication overhead and synchronization. No surveyed tool provides validated composition guarantees.
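A small numerical sketch shows how composition error can arise even when every kernel prediction is exact. The function name and the figures in the usage note are hypothetical, and a single fixed overhead per kernel stands in for the launch, allocation, and scheduling costs listed above.

```python
def composition_error_pct(kernel_times_s, per_kernel_overhead_s):
    """Percent by which a sum of kernel times underestimates end-to-end time
    when each kernel also incurs a fixed, unmodeled overhead."""
    predicted = sum(kernel_times_s)
    actual = predicted + per_kernel_overhead_s * len(kernel_times_s)
    return 100.0 * (actual - predicted) / actual
```

With 1000 kernels of 20 µs each and an assumed 5 µs of unmodeled overhead per kernel, the summed prediction is 20 ms against an actual 25 ms, a 20% underestimate produced entirely by composition. The relative error grows as kernels get shorter, which is the regime of small-batch decode.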

### 7.3 Emerging Hardware Support

PIM architectures [23, 26, 36, 46], neuromorphic processors, and analog compute present fundamentally different modeling challenges. Existing frameworks (Timeloop, MAESTRO) assume conventional memory hierarchies; PIM blurs the compute-memory boundary. Chiplet-based designs and disaggregated architectures introduce new interconnect modeling requirements.

### 7.4 Integration with Design Flows

Compiler integration (TVM, Ansor) needs uncertainty quantification for exploration-exploitation trade-offs. Architecture exploration (ArchGym) requires active learning for sample efficiency.

Table 2: Comparative analysis of representative tools across key dimensions. Accuracy figures are as reported in original papers; direct comparison is limited by differences in benchmarks, workloads, hardware targets, and evaluation protocols.

<sup>†</sup>Unverifiable. <sup>\*</sup>Surrogate fidelity, not hardware accuracy.

| Tool           | Methodology  | Accuracy (reported) | Setup Cost         | Generalization    | Interpretability | Eval. Speed   |
|----------------|--------------|---------------------|--------------------|-------------------|------------------|---------------|
| Timeloop [45]  | Analytical   | 5–10%               | Arch spec only     | Any accelerator   | High             | $\mu\text{s}$ |
| MAESTRO [34]   | Analytical   | 5–15%               | Arch spec only     | Any accelerator   | High             | $\mu\text{s}$ |
| AMALI [9]      | Analytical   | 23.6% MAPE          | None               | GPU LLM inference | High             | ms            |
| Accel-Sim [29] | Simulation   | 10–20%              | GPU binary         | GPU-specific      | High             | Hours         |
| ASTRA-sim [62] | Trace-driven | 5–15%               | Execution trace    | Configurable      | Medium           | Minutes       |
| VIDUR [2]      | Trace-driven | <5%                 | Kernel profiles    | LLM-specific      | High             | Seconds       |
| SimAI [59]     | Trace-driven | 1.9%                | Full-stack setup   | LLM training      | Medium           | Minutes       |
| Lumos [41]     | Trace-driven | 3.3%                | Execution trace    | LLM training      | Medium           | Minutes       |
| nn-Meter [68]  | ML-augmented | <1% <sup>†</sup>    | 1K samples/kernel  | Device-specific   | Medium           | ms            |
| LitePred [16]  | ML-augmented | 0.7% MAPE           | 100 samples/device | 85+ devices       | Low              | ms            |
| HELP [37]      | ML-augmented | 1.9% MAPE           | 10 samples/device  | Cross-platform    | Low              | ms            |
| TVM [10]       | ML-augmented | ~15% MAPE           | 10K+               | Operator-level    | Medium           | ms            |
| NeuSight [38]  | Hybrid       | 2.3% MAPE           | Pre-trained        | Cross-GPU         | Medium           | ms            |
| Habitat [66]   | Hybrid       | 11.8% MAPE          | Online profiling   | Cross-GPU         | Medium           | Per-kernel    |
| ArchGym [33]   | Hybrid       | 0.61% RMSE<sup>\*</sup> | Simulation runs | Arch-specific     | Medium           | ms            |
| Concorde [44]  | Hybrid       | 2% CPI              | Training corpus    | Cross-$\mu$arch   | Medium           | ms            |

LLM serving needs real-time prediction within microseconds; VIDUR provides offline simulation but online adaptation remains challenging. FlashAttention [14] and other hardware-aware algorithm optimizations change the performance landscape faster than models can be retrained.

### 7.5 Reproducibility and Trust

Our evaluation reveals a critical gap between reported accuracy and independently verifiable results. nn-Meter’s claimed <1% MAPE is unverifiable because the tool cannot be run. Accuracy claims without reproducible evaluation are of limited value to practitioners. The community would benefit from standardized benchmarks with common workloads, hardware targets, and evaluation protocols.

### 7.6 Threats to Validity

**Selection bias.** Our literature search focused on top architecture venues (MICRO, ISCA, HPCA, ASPLOS) and systems venues (MLSys, OSDI, SOSP, NSDI), potentially under-representing work from application-specific venues, industry reports, or non-English publications.

**Tool evaluation scope.** Our reproducibility evaluation covers five tools (Timeloop, ASTRA-sim, VIDUR, nn-Meter, NeuSight), selected for coverage across methodology types and availability of open-source implementations. Results may not generalize to proprietary tools.

**Metrics comparability.** Accuracy figures use different metrics (MAPE, RMSE, Kendall’s $\tau$), benchmarks, and hardware targets. Tables 1 and 2 report metrics as stated in original papers; cross-paper comparisons should be interpreted with caution.

### 7.7 Future Directions

We identify five high-priority research opportunities: (1) **Transformer/MoE-aware tools**: current tools are validated on CNNs; attention and expert routing have distinct performance characteristics. (2) **Validated composition**: methods to compose kernel predictions into end-to-end estimates with bounded error. (3) **Unified energy-latency-memory prediction**: most tools focus on latency; edge and datacenter deployment need energy and memory modeling. (4) **Temporal robustness**: benchmarks for evaluating model accuracy under software stack evolution. (5) **Unified tooling**: no single tool addresses all needs; Docker-first deployment, portable model formats (ONNX), and composable modeling engines with standard workload representations (Chakra [54]) could reduce fragmentation.

Table 3: Reproducibility evaluation scores (10-point rubric).

| Tool      | Setup | Reprod. | Usability | Total  |
|-----------|-------|---------|-----------|--------|
| Timeloop  | 3     | 4       | 2         | 9/10   |
| ASTRA-sim | 2.5   | 3       | 3         | 8.5/10 |
| VIDUR     | 2.5   | 3.5     | 3         | 9/10   |
| nn-Meter  | 2     | 0       | 1         | 3/10   |
| NeuSight  | 2     | 3       | 2.5       | 7.5/10 |

## 8 Experimental Evaluation

We conducted hands-on reproducibility evaluations of five representative tools using a 10-point rubric: Setup (3 pts: Docker availability, clean installation, quick start), Reproducibility (4 pts: reference outputs, determinism, examples), and Usability (3 pts: API clarity, interpretability, maintenance). Table 3 summarizes results.

**Key findings.** Docker-first tools (Timeloop, ASTRA-sim, VIDUR) scored 8.5+ by isolating dependencies; we executed all three on both x86\_64 and aarch64 without issues. Timeloop provides reference outputs for all examples (Eyeriss, Simba) with deterministic results. ASTRA-sim includes validated HGX-H100 configurations; VIDUR enables scheduler comparison (vLLM, Orca, Sarathi) without GPU hardware. NeuSight’s tile-based hybrid approach achieves 2.3% error on LLM workloads.

**Critical anti-pattern.** nn-Meter’s pre-trained predictors fail with current scikit-learn due to pickle format changes—a cautionary example of ML model serialization fragility. Projects should prefer portable formats (ONNX) or pin exact dependency versions.
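One way to avoid this failure mode, sketched here under simplifying assumptions, is to store predictor parameters in a plain, versioned data format instead of pickling library objects. The example uses JSON and a hypothetical linear predictor purely for illustration; it is not nn-Meter's format, and tree ensembles such as random forests would need a richer schema or a standard such as ONNX. The function names and the `format_version` field are assumptions.

```python
import json


def save_predictor(path, weights, bias, feature_names):
    """Write predictor parameters as plain JSON, independent of any ML library."""
    payload = {
        "format_version": 1,
        "feature_names": list(feature_names),
        "weights": list(weights),
        "bias": bias,
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f)


def load_predictor(path):
    """Load parameters and return a predict(features) function.
    Fails loudly on an unknown format version instead of mis-parsing it."""
    with open(path, encoding="utf-8") as f:
        payload = json.load(f)
    if payload.get("format_version") != 1:
        raise ValueError("unsupported predictor format version")
    weights, bias = payload["weights"], payload["bias"]

    def predict(features):
        return bias + sum(w * x for w, x in zip(weights, features))

    return predict
```

Because the file contains only numbers and names, it stays loadable across library upgrades, and the explicit version check turns a silent incompatibility into a clear error.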

**Best practices:** (1) Provide Docker images to isolate dependencies; (2) Document Python version requirements; (3) Include reference outputs for validation; (4) Use portable model formats; (5) Pin dependency versions.

## 9 Conclusion

This survey analyzed over 50 tools and methods for modeling and predicting performance of ML workloads, spanning analytical models, cycle-accurate simulators, trace-driven simulation, and ML-augmented hybrid approaches.

**Key findings:** (1) Analytical frameworks (Timeloop, MAESTRO) remain the most effective for DNN accelerator design space exploration, offering microsecond evaluation with full interpretability. (2) Trace-driven simulators (ASTRA-sim, VIDUR, SimAI, Lumos) have emerged as the practical choice for system-level modeling of distributed training and LLM serving, achieving 2–15% accuracy at practical speeds. (3) Hybrid approaches combining analytical structure with learned components (NeuSight, Concorde) achieve the best accuracy on GPU kernel prediction (2–3% error). (4) LLM workloads require specialized modeling for prefill/decode phases, KV cache management, and multi-stage inference pipelines. (5) Reproducibility varies dramatically across tools—Docker-first tools score 8.5+ on our rubric while tools relying on serialized ML models risk becoming unusable.

**Gaps and opportunities:** Most tools are validated on CNN workloads; transformer/MoE validation is sparse. The kernel-to-end-to-end composition problem remains unsolved. Emerging hardware (PIM, chiplets) lacks mature modeling support. The community needs standardized benchmarks for cross-tool accuracy comparison.

As ML workloads grow in scale and diversity, accurate performance prediction becomes critical for efficient hardware design, system provisioning, and serving infrastructure planning. This survey provides practitioners a guide for selecting appropriate tools and researchers a roadmap for advancing the field.

## References

- [1] Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. In *Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI)*. 117–134.
- [2] Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramjee. 2024. VIDUR: A Large-Scale Simulation Framework for LLM Inference. In *Proceedings of Machine Learning and Systems (MLSys)*. 1–15.
- [3] Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. 2009. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In *Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)*. 163–174. <https://doi.org/10.1109/ISPASS.2009.4919648>
- [4] Abhimanyu Rajeshkumar Bambhaniya et al. 2025. HERMES: Understanding and Optimizing Multi-Stage AI Inference Pipelines. *arXiv preprint arXiv:2504.09775* (2025).
- [5] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. 2011. The gem5 Simulator. *ACM SIGARCH Computer Architecture News* 39, 2 (2011), 1–7. <https://doi.org/10.1145/2024716.2024718>
- [6] Shirley Browne, Jack Dongarra, Nathan Garner, George Ho, and Philip Mucci. 2000. A Portable Programming Interface for Performance Evaluation on Modern Processors. *International Journal of High Performance Computing Applications* 14, 3 (2000), 189–204. <https://doi.org/10.1177/10943420001400303> PAPI: portable API for hardware performance counters, foundational tool for performance analysis.
- [7] Kai Cai, Wei Miao, Junyu Zhu, Jiaxu Chen, Hao Shan, Huanyu Li, and Chi Zhang. 2024. Echo: Simulating Distributed Training At Scale. *arXiv preprint arXiv:2412.12487* (2024).
- [8] Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. 2024. MEDUSA: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. In *Proceedings of the 41st International Conference on Machine Learning (ICML)*. 1–15.
- [9] Zheng Cao et al. 2025. AMALI: An Analytical Model for Accurately Modeling LLM Inference on Modern GPUs. In *Proceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA)*. 1–14. <https://doi.org/10.1145/3695053.3731064> Reduces GPU LLM inference MAPE from 127.56% to 23.59% vs GCoM baseline.
- [10] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In *Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI)*. 578–594.
- [11] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2016. Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. In *Proceedings of the 43rd International Symposium on Computer Architecture (ISCA)*. 367–379. <https://doi.org/10.1109/ISCA.2016.40>
- [12] Leshem Choshen, Yang Zhang, and Jacob Andreas. 2025. A Hitchhiker’s Guide to Scaling Law Estimation. In *Proceedings of the 42nd International Conference on Machine Learning (ICML)*. 1–25. Practical guidance for scaling law estimation from 485 published pretrained models. IBM/MIT.
- [13] Weiwei Chu, Xinfeng Xie, Jiecao Yu, Jie Wang, Pavan Balaji, Ching-Hsiang Chu, Jongsoo Park, et al. 2025. Scaling Llama 3 Training with Efficient Parallelism Strategies. In *Proceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA)*. 1–15. 4D parallelism for Llama 3 405B on 16K H100 GPUs. Achieves 400 TFLOPS/GPU. Meta.
- [14] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In *Advances in Neural Information Processing Systems (NeurIPS)*, Vol. 35. 16344–16359.
- [15] Lukasz Dudziak, Thomas Chau, Mohamed S. Abdelfattah, Royston Lee, Hyeji Kim, and Nicholas D. Lane. 2024. Latency Predictors for Neural Architecture Search. In *Proceedings of Machine Learning and Systems (MLSys)*. 1–14.
- [16] Yang Feng, Zhehai Li, Jiacheng Yang, and Yunxin Liu. 2024. LitePred: Transferable and Scalable Latency Prediction for Hardware-Aware Neural Architecture Search. In *Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI)*. 1–18.
- [17] Paraskevas Gavriilidis et al. 2025. LIFE: Forecasting LLM Inference Performance via Hardware-Agnostic Analytical Modeling. *arXiv preprint arXiv:2508.00904* (2025). Hardware-agnostic analytical model for LLM inference performance forecasting.
- [18] Siddharth Ghosh et al. 2025. Frontier: Simulating the Next Generation of LLM Inference Systems. *arXiv preprint arXiv:2508.03148* (2025). Stage-centric simulator for MoE and disaggregated LLM inference, models expert parallelism and cross-cluster routing.
- [19] Alicia Golden et al. 2025. PRISM: Probabilistic Runtime Insights and Scalable Performance Modeling for Large-Scale Distributed Training. *arXiv preprint arXiv:2510.15596* (2025). Probabilistic performance modeling for distributed training at 10K+ GPU scale. Meta.
- [20] Alexander Hägele, Elie Bakouch, Atli Kosson, Loubna Ben Allal, Leandro Von Werra, and Martin Jaggi. 2024. Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations. In *Advances in Neural Information Processing Systems (NeurIPS)*, Vol. 37. Spotlight. Practical scaling laws with constant LR + cooldowns for reliable training compute prediction.
- [21] Ameer Haj-Ali et al. 2025. Omniwise: Predicting GPU Kernels Performance with LLMs. *arXiv preprint arXiv:2506.20886* (2025). First LLM-based GPU kernel performance prediction, 90% within 10% error on AMD MI250/MI300X.
- [22] Yanbin Hao et al. 2025. POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference. In *Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)*. 1–15. Full overlap between prefill and decode phases for LLM inference.

- [23] Guseul Heo, Sangyeop Lee, Jaehong Cho, Hyunmin Choi, Sanghyeon Lee, Hyungkyu Ham, Gwangsun Kim, Divya Mahajan, and Jongse Park. 2024. NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inference. In *Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)*. 1–17. NPU-PIM heterogeneous architecture for LLM inference with performance modeling. KAIST/Georgia Tech.
- [24] Samuel Hsia, Kartik Chandra, and Kunle Olukotun. 2024. MAD Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems. In *Proceedings of the 51st Annual International Symposium on Computer Architecture (ISCA)*. 753–766. <https://doi.org/10.1109/ISCA59077.2024.00064>
- [25] Rodrigo Huerta, Mojtaba Aba Shoushtary, Jose-Lorenzo Cruz, and Antonio Gonzalez. 2025. Dissecting and Modeling the Architecture of Modern GPU Cores. In *Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture (MICRO)*. 369–384. Reverse-engineers modern NVIDIA GPU cores, improves Accel-Sim to 13.98% MAPE. UPC Barcelona.
- [26] Bongjoon Hyun, Taehun Kim, Dongjae Lee, and Minsoo Rhu. 2024. Pathfinding Future PIM Architectures by Demystifying a Commercial PIM Technology. In *Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA)*. 1–15. uPIMulator: cycle-accurate PIM simulation framework for UPMEM. KAIST.
- [27] Ryota Imai, Kentaro Harada, Ryo Sato, and Toshio Nakaike. 2024. Roofline-Driven Machine Learning for Large Language Model Performance Prediction. NeurIPS Workshop on Machine Learning for Systems (2024).
- [28] Andreas Kosmas Kakolyris, Dimosthenis Masouros, Petros Vavaroutsos, Sotirios Xydis, and Dimitrios Soudris. 2025. throttLLM: Predictive GPU Throttling for Energy Efficient LLM Inference Serving. In *Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA)*. 1–14. Achieves up to 43.8% lower energy consumption for LLM inference.
- [29] Mahmoud Khairy, Zhecheng Shen, Tor M. Aamodt, and Timothy G. Rogers. 2020. Accel-Sim: An Extensible Simulation Framework for Validated GPU Modeling. In *Proceedings of the 47th International Symposium on Computer Architecture (ISCA)*. 473–486. <https://doi.org/10.1109/ISCA45697.2020.00047>
- [30] Jungho Kim et al. 2025. PyTorchSim: A Comprehensive, Fast, and Accurate NPU Simulation Framework. In *Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture (MICRO)*. 1–14. <https://doi.org/10.1145/3725843.3756045> PyTorch 2-integrated NPU simulator with custom RISC-V ISA and Tile-Level Simulation.
- [31] Jiin Kim, Byeongjun Shin, Jinha Chung, and Minsoo Rhu. 2026. The Cost of Dynamic Reasoning: Demystifying AI Agents and Test-Time Scaling from an AI Infrastructure Perspective. In *Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA)*. 1–14. HPCA 2026 (Jan 31–Feb 4, 2026, Las Vegas). First comprehensive system-level analysis of AI agents; quantifies resource usage, latency, and datacenter power consumption.
- [32] Yoongu Kim, Weikun Yang, and Onur Mutlu. 2016. Ramulator: A Fast and Extensible DRAM Simulator. *IEEE Computer Architecture Letters* 15, 1 (2016), 45–49. <https://doi.org/10.1109/LCA.2015.2414456> Fast extensible DRAM simulator supporting DDRx, LPDDRx, GDDRx, WIOx, HBMx standards.
- [33] Srivatsan Krishnan, Amir Yazdanbakhsh, Shvetank Prakash, Norman P. Jouppi, Jignesh Parmar, Hyoukjun Kim, James Laudon, and Chandrakant Narayanaswami. 2023. ArchGym: An Open-Source Gymnasium for Machine Learning Assisted Architecture Design. In *Proceedings of the 50th International Symposium on Computer Architecture (ISCA)*. 1–16. <https://doi.org/10.1145/357931.3589049>
- [34] Hyoukjun Kwon, Prasanth Chatarasi, Michael Barber, Michael Pellauer, Angshuman Parashar, and Tushar Krishna. 2019. MAESTRO: A Data-Centric Approach to Understand Reuse, Performance, and Hardware Cost of DNN Mappings. In *Proceedings of the 52nd IEEE/ACM International Symposium on Microarchitecture (MICRO)*. 1–14. <https://doi.org/10.1145/3352460.3358292>
- [35] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In *Proceedings of the 29th Symposium on Operating Systems Principles (SOSP)*. 611–626. <https://doi.org/10.1145/3600006.3613165>
- [36] Hyojung Lee, Daehyun Baek, Jimyoung Son, Jeun Choi, Kiyo Moon, and Minsung Jang. 2025. PAISE: PIM-Accelerated Inference Scheduling Engine for Transformer-based LLM. In *Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA)*. 1–14. PIM-based LLM inference scheduling. 48.3% speedup, 11.5% power reduction. Samsung.
- [37] Hayeon Lee, Sewoong Lee, Song Chong, and Sung Ju Hwang. 2021. HELP: Hardware-Adaptive Efficient Latency Prediction for NAS via Meta-Learning. In *Advances in Neural Information Processing Systems (NeurIPS)*, Vol. 34. 27016–27028.
- [38] Seunghyun Lee, Amar Phanishayee, and Divya Mahajan. 2025. NeuSight: GPU Performance Forecasting via Tile-Based Execution Analysis. In *Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)*. 1–15.
- [39] Jianbo Li et al. 2025. TrioSim: A Lightweight Simulator for Large-Scale DNN Workloads on Multi-GPU Systems. In *Proceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA)*. 1–13. Multi-GPU DNN simulation with lightweight approach for distributed training analysis.
- [40] Shang Li, Zhiyuan Yang, Dhiraj Reddy, Ankur Srivastava, and Bruce Jacob. 2020. DRAMsim3: A Cycle-Accurate, Thermal-Capable DRAM Simulator. *IEEE Computer Architecture Letters* 19, 2 (2020), 106–109. <https://doi.org/10.1109/LCA.2020.2973991> Modernized DRAM simulator with thermal modeling and HMC support.
- [41] Wenxuan Liang et al. 2025. Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training. In *Proceedings of Machine Learning and Systems (MLSys)*. 1–16. Trace-driven performance modeling achieving 3.3% error on H100 GPUs for LLM training.
- [42] Haocong Luo, Yahya Can Tuğrul, F. Nisa Bostancı, Ataberk Olgun, A. Giray Yağlıkçı, and Onur Mutlu. 2023. Ramulator 2.0: A Modern, Modular, and Extensible DRAM Simulator. *IEEE Computer Architecture Letters* 22, 2 (2023), 129–132. <https://doi.org/10.1109/LCA.2023.3333759> Modular DRAM simulator with DDR5, LPDDR5, HBM3, GDDR6 support and RowHammer mitigation modeling.
- [43] Azaz-Ur-Rehman Nasir, Samroz Ahmad Shoaila, Muhammad Abdullah Hanif, and Muhammad Shafique. 2025. ESM: A Framework for Building Effective Surrogate Models for Hardware-Aware Neural Architecture Search. In *Proceedings of the 62nd ACM/IEEE Design Automation Conference (DAC)*. 1–6. 97.6% accuracy surrogate model framework for HW-aware NAS.
- [44] Amir Nasr-Esfahany et al. 2025. Concorde: Fast and Accurate CPU Performance Modeling with Compositional Analytical-ML Fusion. In *Proceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA)*. 1–15. Hybrid analytical-ML approach achieving 2% CPI error while running 5 orders of magnitude faster than gem5.
- [45] Angshuman Parashar, Priyanka Raina, Yakun Sophia Shao, Yu-Hsin Chen, Victor A. Ying, Anurag Mukkara, Rangharajan Venkatesan, Brucek Khailany, Stephen W. Keckler, and Joel Emer. 2019. Timeloop: A Systematic Approach to DNN Accelerator Evaluation. In *Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)*. 304–315. <https://doi.org/10.1109/ISPASS.2019.00042>
- [46] Jaehyun Park, Jaewan Choi, Kwanhee Kyung, Michael Jaemin Kim, Yongsuk Kwon, Nam Sung Kim, and Jung Ho Ahn. 2024. AttAcc! Unleashing the Power of PIM for Batched Transformer-based Generative Model Inference. In *Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)*. 1–16. PIM-based accelerator for batched transformer attention. Seoul National University/UIUC.
- [47] Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. In *Proceedings of the 51st Annual International Symposium on Computer Architecture (ISCA)*. 118–132. <https://doi.org/10.1109/ISCA59077.2024.00019> Best Paper Award.
- [48] Hang Qi, Evan R. Sparks, and Ameet Talwalkar. 2017. Paleo: A Performance Model for Deep Neural Networks. In *Proceedings of the 5th International Conference on Learning Representations (ICLR)*. <https://openreview.net/forum?id=SyVVJ85lg>
- [49] Saeed Rashidi, Srinivas Srinivasan, Kazem Hamedani, and Tushar Krishna. 2020. ASTRA-SIM: Enabling SW/HW Co-Design Exploration for Distributed DL Training Platforms. In *Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)*. 81–92. <https://doi.org/10.1109/ISPASS48437.2020.00018>
- [50] Paul Rosenfeld, Elliott Cooper-Balis, and Bruce Jacob. 2011. DRAMSim2: A Cycle Accurate Memory System Simulator. *IEEE Computer Architecture Letters* 10, 1 (2011), 16–19. <https://doi.org/10.1109/L-CA.2011.4> Widely-used cycle-accurate DDR2/DDR3 memory simulator validated against manufacturer Verilog models.
- [51] Alen Sabu, Harish Patil, Ameer Haj-Ali, and Trevor E. Carlson. 2022. LoopPoint: Checkpoint-driven Sampled Simulation for Multi-threaded Applications. In *Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA)*. 606–618. <https://doi.org/10.1109/HPCA53966.2022.00052> Extends sampling to multi-threaded applications with 2.3% error and up to 800x speedup.
- [52] Zhuomin Shen, Jaeho Kim, et al. 2025. AQUA: Network-Accelerated Memory Offloading for LLMs in Scale-Up GPU Domains. In *Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)*. 1–16. <https://doi.org/10.1145/3676641.3715983> Improves LLM inference responsiveness by 20x through network-accelerated memory offloading.
- [53] Timothy Sherwood, Erez Perelman, Greg Hamerly, and Brad Calder. 2002. Automatically Characterizing Large Scale Program Behavior. In *Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)*. 45–57. <https://doi.org/10.1145/605397.605403> Introduces SimPoint: automatic selection of representative simulation points using k-means clustering.

- [54] Srinivas Sridharan, Taekyung Heo, Jinwoo Choi, Garyfallia Yu, Saeed Rashidi, William Won, Zhaodong Meng, and Tushar Krishna. 2023. Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces. *arXiv preprint arXiv:2305.14516* (2023).
- [55] Foteini Strati, Zhendong Zhang, George Manos, Ixeia Sanchez Periz, Qinghao Hu, Tiancheng Chen, Berk Buzcu, Song Han, Pamela Delgado, and Ana Klimovic. 2025. Sailor: Automating Distributed Training over Dynamic, Heterogeneous, and Geo-distributed Clusters. In *Proceedings of the 31st ACM Symposium on Operating Systems Principles (SOSP)*. 1–18. Automated distributed training with runtime/memory simulation over heterogeneous resources. ETH Zurich/MIT.
- [56] Ondrej Sykora, Alexis Rucker, Charith Mendis, Rajkishore Barik, Phitchaya Mangpo Phothilimthana, and Saman Amarasinghe. 2022. GRANITE: A Graph Neural Network Model for Basic Block Throughput Estimation. In *Proceedings of the IEEE International Symposium on Workload Characterization (IISWC)*. 1–13. <https://doi.org/10.1109/IISWC55918.2022.00014>
- [57] Jan Treibig, Georg Hager, and Gerhard Wellein. 2010. LIKWID: A Lightweight Performance-Oriented Tool Suite for x86 Multicore Environments. In *Proceedings of the 39th International Conference on Parallel Processing Workshops (ICPPW)*. 207–216. <https://doi.org/10.1109/ICPPW.2010.38> Lightweight tools for thread/cache topology, affinity, and performance counter measurement.
- [58] Adrian Tschanz, Mohamed Awad, et al. 2025. SwizzlePerf: Hardware-Aware LLMs for GPU Kernel Performance Optimization. *arXiv preprint arXiv:2508.20258* (2025). LLM-based spatial optimization for GPU kernels, up to 2.06x speedup via swizzling.
- [59] Xizheng Wang, Qingxu Li, Yichi Xu, Gang Lu, Heyang Zhou, Sen Zhang, Yikai Zhu, Yang Liu, Pengcheng Zhang, Kun Qian, et al. 2025. SimAI: Unifying Architecture Design and Performance Tuning for Large-Scale LLM Training with Scalability and Precision. In *Proceedings of the 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI)*. 1–18. Full-stack LLM training simulator achieving 98.1% alignment with real-world results. Alibaba Cloud/Tsinghua.
- [60] Zixian Wang et al. 2025. SynPerf: Synthesizing High-Performance GPU Kernels via Pipeline Decomposition. *arXiv preprint* (2025). Under review.
- [61] Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: An Insightful Visual Performance Model for Multicore Architectures. *Commun. ACM* 52, 4 (2009), 65–76. <https://doi.org/10.1145/1498765.1498785>
- [62] William Won, Taekyung Heo, Saeed Rashidi, Saeed Talati, Srinivas Srinivasan, and Tushar Krishna. 2023. ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-Model Training at Scale. In *Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)*. 283–294. <https://doi.org/10.1109/ISPASS57527.2023.00035>
- [63] Yannan Nellie Wu, Joel Emer, and Vivienne Sze. 2022. Sparseloop: An Analytical Approach to Sparse Tensor Accelerator Modeling. In *Proceedings of the 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)*. 1–15. <https://doi.org/10.1109/MICRO56248.2022.00078>
- [64] Roland E. Wunderlich, Thomas F. Wenisch, Babak Falsafi, and James C. Hoe. 2003. SMARTS: Accelerating Microarchitecture Simulation via Rigorous Statistical Sampling. In *Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA)*. 84–97. <https://doi.org/10.1109/ISCA.2003.1206991> Statistical sampling achieving 0.64% CPI error with 35x speedup over detailed simulation.
- [65] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. ORCA: A Distributed Serving System for Transformer-Based Generative Models. In *Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI)*. 521–538.
- [66] Geoffrey X. Yu, Yubo Gao, Pavel Golber, and Asaf Cidon. 2021. Habitat: A Runtime-Based Computational Performance Predictor for Deep Neural Network Training. In *Proceedings of the USENIX Annual Technical Conference (ATC)*. 503–521.
- [67] Yi Zhai, Yu Cheng Wang, Peng Jiang, and Congming Kang. 2023. TLP: A Deep Learning-based Cost Model for Tensor Program Tuning. In *Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)*. 833–845. <https://doi.org/10.1145/3575693.3575736>
- [68] Li Lyna Zhang, Shihao Han, Jianyu Wei, Ningxin Zheng, Ting Cao, Yuqing Yang, and Yunxin Liu. 2021. nn-Meter: Towards Accurate Latency Prediction of Deep-Learning Model Inference on Diverse Edge Devices. In *Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys)*. 81–93. <https://doi.org/10.1145/3458864.3467882> Best Paper Award.
- [69] Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, and Ion Stoica. 2020. Ansor: Generating High-Performance Tensor Programs for Deep Learning. In *Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI)*. 863–879.
- [70] Lianmin Zheng, Ruochen Liu, Junru Shao, Tianqi Chen, Joseph E. Gonzalez, Ion Stoica, and Zhihao Zhang. 2021. TenSet: A Large-scale Program Performance Dataset for Learned Tensor Compilers. In *Advances in Neural Information Processing Systems (NeurIPS)*, Vol. 34. 29876–29888.
- [71] Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. In *Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI)*. 1–18.
