

# A Survey of High-Level Modeling and Simulation Methods for Modern Machine Learning Workloads

MICRO 2026 Submission – Confidential Draft – Do NOT Distribute!!

Anonymous Author(s)

Under Review

Anonymous

## Abstract

As machine learning workloads grow in scale and complexity, architects need fast, accurate methods to predict performance across diverse hardware. This survey analyzes 22 tools from 53 papers across architecture and systems venues (2016–2026), covering analytical models, trace-driven simulators, and ML-augmented hybrid techniques for DNN accelerators, GPUs, distributed training, and LLM inference serving. We organize the literature by methodology type, target platform, and abstraction level, revealing a pervasive CNN-validation bias and finding that hybrid approaches achieve the best accuracy-speed trade-offs. We conduct hands-on evaluations of representative tools, independently measuring their performance and accuracy rather than relying on original claims; tools with Docker-first deployment remain reproducible, while those relying on serialized ML models become unusable. We identify open challenges including cross-workload generalization, kernel-to-end-to-end error composition, and emerging architecture support, providing practitioners tool selection guidance and researchers a roadmap for the field.

## Keywords

ML workload performance prediction, DNN accelerator modeling, GPU simulation, distributed training simulation, LLM inference serving, design space exploration, survey

## 1 Introduction

Machine learning workloads have become the dominant consumers of compute across datacenters and edge devices. Training and inference for CNNs, transformers, mixture-of-experts models, and LLMs demand hardware ranging from Google’s TPU [33, 34] to custom accelerators, creating a heterogeneous landscape where architects must predict performance before committing to costly hardware decisions.

The shift toward domain-specific architectures [25] makes performance prediction both more important and more difficult. Design space exploration, parallelization selection, and hardware-software co-design all require fast, accurate performance models—yet ML workloads pose unique challenges: diverse computational patterns (dense matrix operations, sparse accesses, communication-bound collectives) across GPUs, TPUs, custom accelerators, and multi-device clusters.

A rich ecosystem of modeling tools has emerged. Analytical models (Timeloop [54], MAESTRO [42]) evaluate in microseconds with 5–15% error. Trace-driven simulators (ASTRA-sim [78], VIDUR [3])

replay execution traces for system-level modeling. Hybrid approaches (NeuSight [46]) combine analytical structure with learned components to achieve 2.3% error. Yet no comprehensive survey organizes these methods for the practitioner who must select a tool for a specific task. Existing surveys focus on ML *techniques* for modeling [71] or specific hardware [54]; this survey fills that gap with a methodology-centric view.

We make the following contributions:

- A **methodology-centric taxonomy** organizing tools along three dimensions: methodology type, target platform, and abstraction level, with a coverage matrix identifying research gaps and a workload analysis exposing CNN-validation bias.
- A **systematic survey** of 22 tools from 53 papers across architecture venues (MICRO, ISCA, HPCA, ASPLOS) and systems venues (MLSys, OSDI, NSDI) published 2016–2026.
- A **comparative analysis** of accuracy–speed trade-offs with careful qualification of reported claims.
- **Hands-on evaluations** where we independently measure tool performance and accuracy, and identification of **open challenges** including the CNN-to-transformer generalization gap, kernel-to-end-to-end error composition, and emerging accelerator support.

Figure 1 illustrates the evolution of performance modeling tools from early analytical frameworks to modern hybrid approaches.

## 2 Survey Methodology

We searched ACM Digital Library, IEEE Xplore, Semantic Scholar, and arXiv using terms related to ML performance modeling, with backward/forward citation tracking from seminal works. Target venues include architecture (MICRO, ISCA, HPCA, ASPLOS), systems (MLSys, OSDI, SOSP, NSDI), and related (NeurIPS, MobiSys, DAC, ISPASS). Papers must propose or evaluate a tool for predicting ML workload performance with quantitative evaluation; we exclude non-performance tasks and general-purpose workloads. From 287 initial candidates, title/abstract screening yielded 118 papers; full-text review reduced the set to 53 that met all criteria, supplemented by 12 foundational works for context. We cover 2016–2026 and classify each paper by *methodology type* (analytical, simulation, trace-driven, ML-augmented, hybrid), *target platform*, and *abstraction level* (kernel, model, system).

**Related surveys.** Prior surveys address adjacent topics: Rakhshanfar and Zarandi [60] survey ML for processor DSE; Sze et al. [72] treat DNN hardware design (the foundation for Timeloop/MAESTRO); simulators such as GPGPU-Sim [4], gem5 [6], and SST [64] have been extensively used as validation targets in the performance



**Figure 1: Evolution of performance modeling tools for ML workloads (2016–2026).** Colors indicate publication venue category. Early analytical frameworks (Eyeriss, Paleo) gave way to systematic accelerator modeling (Timeloop, MAESTRO) and distributed training simulation (ASTRA-sim). Recent work targets LLM-specific modeling (VIDUR, Frontier) and hybrid approaches (NeuSight).

modeling literature; and MLPerf [51, 63] standardizes *measurement* rather than *prediction*. Early ML accelerator modeling (2014–2018) established foundational approaches: DianNao [11] introduced analytical dataflow modeling for dedicated accelerators, Eyeriss [13] systematized row-stationary dataflow analysis, and Paleo [58] pioneered layer-wise analytical estimation. This survey differs by spanning the full methodology spectrum across all major platforms with hands-on evaluations. The closest prior work, Dudziak et al. [17], compares edge device predictors for NAS; we broaden to the full landscape.

### 3 Background

#### 3.1 ML Workload Characteristics

ML workloads are expressed as computation graphs whose operator shapes are statically known and amenable to analytical modeling. Frameworks such as PyTorch [56] and TensorFlow [1] compile these graphs for execution, though MoE and dynamic inference introduce input-dependent control flow. Performance depends on tensor-to-memory mapping (dataflow, tiling), KV cache management for LLM inference [43], and at scale, compute–memory–network interactions across data, tensor, pipeline, and expert parallelism [15]. LLM inference splits into compute-bound prefill and memory-bound decode phases [57], both modeled under batched serving [2, 80].

#### 3.2 Modeling Methodologies

We classify approaches into five categories. **Analytical models** express performance as closed-form functions (e.g., the roofline model [77]), offering microsecond evaluation but requiring per-architecture derivation. **Cycle-accurate simulators** (GPGPU-Sim [4], Accel-Sim [37]) achieve high fidelity at 1000–10000× slowdown, serving primarily as validation oracles for the high-level methods that are the focus of this survey. **Trace-driven simulators** (ASTRA-sim [78], VIDUR [3]) trade fidelity for orders-of-magnitude speedup. **ML-augmented approaches** learn from profiling data (nn-Meter [83]) but may not generalize beyond training distributions. **Hybrid approaches** combine analytical structure with

learned components (NeuSight [46], Habitat [81]), aiming to balance accuracy, speed, and interpretability.

### 3.3 Problem Formulation

Performance modeling predicts a target metric (latency, throughput, energy, or memory footprint) given a workload description and hardware configuration. Workloads are represented at the operator, graph, IR, or trace level; hardware is characterized by specifications, performance counters, or learned embeddings. Ground-truth measurements typically rely on hardware performance counters accessed via PAPI [7] or LIKWID [73]. Accuracy metrics—MAPE, RMSE, and rank correlation—vary across the literature, and differences in benchmarks, hardware targets, and evaluation protocols limit direct comparison (Section 6).

### 4 Taxonomy

We organize the literature along three dimensions. The *primary axis* is methodology type—how a tool predicts performance—because methodology determines the fundamental trade-offs between accuracy, speed, interpretability, and data requirements. The *secondary axes* are target platform and abstraction level, which together determine the scope and applicability of each tool. We additionally characterize tools by workload coverage, exposing a pervasive CNN-validation bias in the literature.

Table 1 provides a unified view combining the coverage matrix (number of surveyed tools per methodology–platform cell) with trade-off profiles, with empty cells highlighting research gaps. The dominant pairings are: analytical models for accelerators, cycle-accurate simulation for GPUs/CPUs, trace-driven simulation for distributed systems, and ML-augmented approaches for edge devices.

Table 1 reveals three structural gaps: (1) trace-driven *execution replay* simulation (as distinct from instruction-trace-driven cycle-accurate simulation such as Accel-Sim) is used exclusively for distributed systems; (2) edge devices are served only by ML-augmented approaches, lacking hybrid alternatives; (3) no ML-augmented tool

**Table 1: Methodology taxonomy: coverage matrix and trade-off profile.** Platform columns show the number of surveyed tools per cell; 0 indicates an explicit research gap. Speed, data requirements, and interpretability determine practical applicability; the failure mode column identifies the primary condition under which each methodology breaks down.

| Methodology    | DNN<br>Accel. | GPU | Distrib.<br>Systems | Edge/<br>Mobile | CPU | Eval.<br>Speed | Data<br>Req. | Interp. | Failure<br>Mode |
|----------------|---------------|-----|---------------------|-----------------|-----|----------------|--------------|---------|-----------------|
| Analytical     | 3             | 3   | 2                   | 0               | 0   | $\mu$ s        | None         | High    | Dynamic effects |
| Cycle-Accurate | 1             | 2   | 0                   | 0               | 1   | Hours          | Binary       | High    | Scale           |
| Trace-Driven   | 0             | 0   | 7                   | 0               | 0   | Min.           | Traces       | Med.    | Trace fidelity  |
| ML-Augmented   | 0             | 3   | 0                   | 3               | 1   | ms             | Profiling    | Low     | Distrib. shift  |
| Hybrid         | 1             | 2   | 0                   | 0               | 1   | ms             | Mixed        | Med.    | Training domain |



**Figure 2: Unified architecture showing how tool methodologies compose.** Analytical engines and ML cost models feed into hybrid approaches, while system-level orchestrators (trace-driven) assemble component predictions into end-to-end estimates. Cycle-accurate simulators primarily serve as validation oracles.

targets distributed systems directly. Methodologies cluster into two speed regimes: sub-millisecond (analytical, ML-augmented, hybrid) for DSE, and minutes-to-hours (simulation, trace-driven) for validation.

Figure 2 illustrates how tools from different methodology types compose: analytical engines provide fast base estimates, ML components learn residual corrections, and trace-driven simulators orchestrate system-level execution.

#### 4.1 Primary Axis: Methodology Type

The choice of methodology determines fundamental trade-offs between accuracy, evaluation speed, data requirements, and interpretability, as summarized in Table 1; Section 5 provides detailed per-tool analysis.

**Analytical models** (Timeloop [54]: 5–10% vs. RTL; MAESTRO [42]; Sparseloop [79]; AMALI [10]) provide microsecond evaluation and full interpretability but require per-architecture derivation (AMALI’s 23.6% MAPE illustrates GPU dynamic effects). **Cycle-accurate simulators** (GPGPU-Sim [4], Accel-Sim [37]: 0.90–0.97 IPC; PyTorch-Sim [38]) are impractical for DSE at 1000–10000× slowdown [4, 37]. **Trace-driven simulators** (ASTRA-sim [78]: 5–15%; VIDUR [3]:



**Figure 3: Abstraction level hierarchy and the composition problem.** Tools operate at one of three levels; composing predictions across levels accumulates error. Error ranges are representative values from surveyed papers.

<5%; SimAI [75]; Frontier [20]) replay execution traces for system-level modeling. **ML-augmented models** (nn-Meter [83]; LitePred [18]; HELP [45]; TVM [12]/Ansor [84]) learn from profiling data but risk silent distribution shift. **Hybrid** approaches (NeuSight [46]: 2.3% MAPE; Habitat [81]; ArchGym [41]) combine analytical priors with learned corrections [17].

#### 4.2 Secondary Axes: Platform and Abstraction Level

Platform constrains methodology: **accelerators** use analytical models; **GPUs** span all types; **distributed systems** require trace-driven simulation; **edge devices** use ML-augmented approaches; **CPUs** [53, 71] are least studied. Abstraction level determines composition errors (Figure 3): kernel-level tools achieve 2–3% error, model-level 5–12%, and system-level 5–15%, with errors propagating through the chain.

#### 4.3 Workload Coverage

Table 2 characterizes the workload types on which each tool has been validated, exposing a pervasive CNN-validation bias.

Figure 4 quantifies this CNN-validation bias: of the 14 surveyed tools, 9 (64%) include CNN validation, while only 1 tool validates on MoE workloads and none validates on diffusion models. The table reveals that no surveyed tool has been validated on diffusion models or dynamic inference workloads [39], only Frontier [20] has validated MoE support, and no single tool offers

**Table 2: Workload validation coverage.** ✓ = validated in the original paper; ○ = partial or indirect validation; — = no validation. Nearly all tools report accuracy on CNN workloads; transformer and MoE coverage is sparse. Empty columns (diffusion, dynamic inference) represent workload types with no validated performance modeling tools.

| Tool      | CNN | Trans-<br>former | LLM<br>Train | MoE | Diff. |
|-----------|-----|------------------|--------------|-----|-------|
| Timeloop  | ✓   | ○                | —            | —   | —     |
| MAESTRO   | ✓   | —                | —            | —   | —     |
| NeuSight  | ✓   | ✓                | —            | —   | —     |
| Habitat   | ✓   | —                | —            | —   | —     |
| AMALI     | —   | ✓                | —            | —   | —     |
| ASTRA-sim | ✓   | ○                | ✓            | —   | —     |
| VIDUR     | —   | ✓                | —            | —   | —     |
| SimAI     | —   | —                | ✓            | —   | —     |
| Lumos     | —   | —                | ✓            | —   | —     |
| Frontier  | —   | ✓                | —            | ✓   | —     |
| nn-Meter  | ✓   | —                | —            | —   | —     |
| LitePred  | ✓   | —                | —            | —   | —     |
| HELP      | ✓   | —                | —            | —   | —     |
| TVM/Ansor | ✓   | ○                | —            | —   | —     |



**Figure 4: Workload validation coverage across surveyed tools.** CNN validation dominates (64% of tools), while MoE and diffusion models have minimal or no validated prediction tools, highlighting a critical generalization gap.

validated transformer prediction across the full kernel-to-system stack. Practitioners working with non-CNN workloads must accept unvalidated predictions, collect their own validation data, or fall back to measurement.

## 5 Survey of Approaches

This section surveys performance modeling tools for ML workloads, organized by target platform, examining modeling challenges, available tools, and their strengths and limitations. Table 3 provides a comprehensive comparison.

### 5.1 DNN Accelerator Modeling

The analytical tractability of DNN accelerator modeling stems from the regularity of computation [72], building on early characterization pioneered by DianNao [11]. A convolution layer maps to a

seven-deep nested loop over batch, output channel, input channel, and spatial dimensions; Timeloop [54] enumerates mappings of these loops to a spatial-temporal hardware hierarchy, computing data reuse at each memory level as the ratio of loop bounds. This exhaustive search finds the optimal dataflow in microseconds (5–10% error, 2000× speedup) because the search space, though combinatorially large, admits efficient pruning: any mapping that exceeds a memory level’s capacity is immediately discarded. MAESTRO [42] achieves similar modeling with a more compact “data-centric” representation that specifies which data dimension is stationary at each level, trading enumeration completeness for specification simplicity—but sacrificing Timeloop’s ability to model per-PE utilization, explaining why Timeloop achieves tighter error bounds on architectures with irregular PE arrays. SCALE-Sim [66] complements both by providing cycle-accurate systolic array simulation for validation. Sparseloop [79] extends Timeloop’s analysis to sparse tensors by introducing format-specific access count models for compression formats (CSR, bitmap)—the key challenge being that sparse data access patterns depend on the data values, requiring statistical or format-aware modeling rather than purely geometric analysis. PyTorchSim [38] integrates PyTorch 2 with NPU simulation but lacks real-hardware validation; ArchGym [41] connects ML surrogates to simulators (0.61% RMSE vs. simulator, not hardware). Accelerator modeling is the most mature subdomain, with Timeloop as the de facto DSE standard. The key gap is silicon validation; emerging PIM tools [26, 31, 44, 55] also lack hardware validation.

## 5.2 GPU Performance Modeling

GPUs dominate ML training and inference, requiring models for SIMT execution, warp scheduling, memory coalescing, and occupancy effects.

**Cycle-accurate simulation.** GPGPU-Sim [4] and Accel-Sim [37] achieve 0.90–0.97 IPC correlation but at 1000–10000× slowdown; reverse-engineering [30] improved Accel-Sim to 13.98% MAPE. These simulators integrate with memory subsystem models—from DRAMSim2 [65] and Ramulator [40] to their modern successors DRAMSim3 [48] and Ramulator 2.0 [50]—for accurate DRAM timing, critical for memory-bound LLM inference.

**Analytical and hybrid models.** AMALI [10] models GPU performance through memory hierarchy analysis (L1, L2, HBM data movement volumes); the roofline model [77] provides upper bounds, with recent LLM-specific extensions [32]. NeuSight [46] achieves 2.3% on GPT-3 by decomposing kernels into tiles that mirror CUDA’s thread block tiling; Habitat [81] achieves 11.8% cross-GPU transfer via wave scaling.

The accuracy disparity across these approaches reflects a fundamental architectural distinction. Accelerator models achieve 5–10% error because DNN accelerator execution is deterministic: loop nest orderings fully determine data movement, and spatial architectures have predictable pipeline behavior. GPU execution introduces three sources of non-determinism that progressively degrade analytical accuracy: (1) warp scheduling decisions that depend on runtime resource availability, (2) memory coalescing patterns that depend on address alignment, and (3) L2 cache contention that depends on co-running kernels. AMALI’s 23.6% MAPE reflects the cost of

**Table 3: Summary of surveyed performance modeling tools for ML workloads, organized by target platform. Methodology: A=Analytical, S=Simulation, T=Trace-driven, M=ML-augmented, H=Hybrid. \*Accuracy measures surrogate-vs-simulator fidelity, not real hardware error. †Reported accuracy unverifiable due to reproducibility issues. ‡No accuracy baseline against real hardware reported.**

| Tool                                        | Platform    | Method | Target             | Accuracy         | Speed      | Key Capability          |
|---------------------------------------------|-------------|--------|--------------------|------------------|------------|-------------------------|
| <i>DNN Accelerator Modeling</i>             |             |        |                    |                  |            |                         |
| Timeloop [54]                               | NPU         | A      | Latency/Energy     | 5–10%            | μs         | Loop-nest DSE           |
| MAESTRO [42]                                | NPU         | A      | Latency/Energy     | 5–15%            | μs         | Data-centric directives |
| Sparseloop [79]                             | NPU         | A      | Sparse tensors     | 5–10%            | μs         | Compression modeling    |
| PyTorchSim [38]                             | NPU         | S      | Cycle-accurate     | N/A <sup>‡</sup> | Hours      | PyTorch 2 integration   |
| ArchGym [41]                                | Multi       | H      | Multi-objective    | 0.61%*           | ms         | ML-aided DSE            |
| <i>GPU Performance Modeling</i>             |             |        |                    |                  |            |                         |
| Accel-Sim [37]                              | GPU         | S      | Cycle-accurate     | 10–20%           | Hours      | SASS trace-driven       |
| GPGPU-Sim [4]                               | GPU         | S      | Cycle-accurate     | 10–20%           | Hours      | CUDA workloads          |
| AMALI [10]                                  | GPU         | A      | LLM inference      | 23.6%            | ms         | Memory hierarchy        |
| NeuSight [46]                               | GPU         | H      | Kernel/E2E latency | 2.3%             | ms         | Tile-based prediction   |
| Habitat [81]                                | GPU         | H      | Training time      | 11.8%            | Per-kernel | Wave scaling            |
| <i>Distributed Training and LLM Serving</i> |             |        |                    |                  |            |                         |
| ASTRA-sim [78]                              | Distributed | T      | Training time      | 5–15%            | Minutes    | Collective modeling     |
| SimAI [75]                                  | Distributed | T      | Training time      | 1.9%             | Minutes    | Full-stack simulation   |
| Lumos [49]                                  | Distributed | T      | LLM training       | 3.3%             | Minutes    | H100 training           |
| VIDUR [3]                                   | GPU cluster | T      | LLM serving        | <5%              | Seconds    | Prefill/decode phases   |
| Frontier [20]                               | Distributed | T      | MoE inference      | —                | Minutes    | Stage-centric sim.      |
| TrioSim [47]                                | Multi-GPU   | T      | DNN training       | N/A <sup>‡</sup> | Minutes    | Lightweight multi-GPU   |
| <i>Edge Device Modeling</i>                 |             |        |                    |                  |            |                         |
| nn-Meter [83]                               | Edge        | M      | Latency            | <1% <sup>†</sup> | ms         | Kernel detection        |
| LitePred [18]                               | Edge        | M      | Latency            | 0.7%             | ms         | 85-platform transfer    |
| HELP [45]                                   | Multi       | M      | Latency            | 1.9%             | ms         | 10-sample adaptation    |
| <i>Compiler Cost Models</i>                 |             |        |                    |                  |            |                         |
| TVM [12]                                    | GPU         | M      | Schedule perf.     | ~15%             | ms         | Autotuning guidance     |
| Ansor [84]                                  | GPU         | M      | Schedule perf.     | ~15%             | ms         | Program sampling        |
| TLP [82]                                    | GPU         | M      | Tensor program     | <10%             | ms         | Transformer cost model  |

ignoring these dynamic effects; NeuSight’s 2.3% captures them implicitly through tile-level profiling that matches the GPU’s thread block scheduling granularity—predicting per-tile latency and aggregating, which naturally captures occupancy and memory locality that per-operator models miss.

**LLM-specific and compiler models.** VIDUR [3] simulates LLM serving at <5% error; LIFE [19], HERMES [5], Omniwise [23], and SwizzlePerf [74] target inference. TVM [12]/Ansor [84] (~15% MAPE), TLP [82] (<10%), and SynPerf [76] target compiler auto-tuning [85].

### 5.3 Distributed Training and LLM Serving

Distributed systems require modeling communication, synchronization, and parallelism strategies [29, 59, 68]. ASTRA-sim [78] achieves 5–15% error via Chakra traces [69]; SimAI [75] reaches 1.9% at Alibaba scale; Echo [8] scales simulation to 10K+ devices; Lumos [49] 3.3% on H100s; PRISM [21] provides prediction intervals at 10K+ GPUs. Paleo [58] pioneered analytical estimation; MAD Max [28] and Sailor [70] extend it; Llama 3 [15] provides validation ground truth at 16K GPUs. The speed hierarchy among distributed system simulators directly reflects their modeling granularity. VIDUR achieves second-scale simulation by modeling LLM

serving at the request level—each prefill and decode phase is a single simulated event with profiled duration—sacrificing visibility into intra-phase behavior. ASTRA-sim operates at the collective communication level, replaying Chakra execution traces to model compute-communication overlap, which requires simulating each collective operation individually (minutes per configuration). SimAI further decomposes collectives to the NCCL algorithm level, modeling chunk-based ring and tree reduction protocols, which adds fidelity for network congestion effects at the cost of additional simulation time. The practical implication: practitioners exploring serving configurations (scheduler, batch size, model placement) should start with VIDUR’s fast simulation, then validate promising configurations with ASTRA-sim or SimAI for network-sensitive scenarios.

For inference serving, VIDUR [3] models scheduling with vLLM [43]; DistServe [86] disaggregates prefill and decode for goodput optimization; Frontier [20] targets MoE; POD-Attention [24] and AQUA [67] address prefill-decode overlap and memory offloading respectively; ThrottLL’eM [35] models power effects; speculative decoding [9] creates a moving target for all simulators.

## 5.4 Edge Device Modeling

Hardware diversity makes per-device analytical modeling impractical. nn-Meter [83] claims <1% MAPE but is unverifiable due to dependency failures (Section 7); LitePred [18] achieves 0.7% across 85 platforms; HELP [45] reaches 1.9% with 10-sample meta-learning. ESM [52] finds well-tuned random forests match deep learning surrogates, and transfer learning provides 22.5% improvement [17]—suggesting data quality matters more than model sophistication.

## 5.5 Cross-Cutting Themes

Three architectural insights emerge from our analysis.

*First, structural decomposition that mirrors hardware execution consistently outperforms black-box approaches.* Timeloop’s loop nests reflect systolic array dataflow, NeuSight’s tiles mirror CUDA thread block scheduling, and VIDUR’s prefill/decode split captures the GPU’s distinct compute- vs. memory-bound regimes. In each case, the modeling abstraction succeeds because it aligns with the hardware boundary that dominates performance: data reuse at the PE array level, occupancy at the SM level, and phase-level batching at the system level. By contrast, AMALI’s single-pass memory hierarchy model misses dynamic scheduling effects that cross these boundaries, explaining its higher error. Notably, tools with *verifiable* accuracy (e.g., Timeloop’s reference outputs, VIDUR’s Docker-based reproduction) appear more widely adopted than tools reporting high but unverifiable accuracy (e.g., nn-Meter).

*Second, the most critical architectural features for accurate ML modeling differ by platform.* For accelerators, data reuse (determined by dataflow and tiling) dominates because accelerator designs deliberately eliminate dynamic effects. For GPUs, thread block occupancy and memory coalescing are most critical because the SIMD execution model introduces contention not present in systolic arrays. For distributed systems, collective communication topology and pipeline bubble overhead dominate because compute per device is well-characterized but inter-device interaction is not. This explains why no single methodology works across all platforms.

*Third, a persistent accuracy–generality–speed trade-off* explains methodological diversity: cycle-accurate simulators (Accel-Sim, GPGPU-Sim) maximize accuracy by modeling microarchitectural state but sacrifice speed; analytical models (Timeloop, MAESTRO) maximize speed by exploiting structural regularity but sacrifice accuracy on dynamic workloads; ML-augmented approaches (nn-Meter, LitePred) achieve both speed and accuracy on trained distributions but sacrifice generality to unseen architectures. Sub-domain maturity correlates with economic incentive: accelerator DSE is most mature (irreversible chip errors), distributed training is fastest-growing (million-dollar runs), and edge modeling has weakest reproducibility.

## 6 Comparison and Analysis

We analyze trade-offs across methodology types along accuracy and speed dimensions (see Table 3 for per-tool details); generalization and interpretability challenges are deferred to Section 8. Figure 5 visualizes the accuracy–speed trade-off space, revealing three distinct clusters of tools.



**Figure 5: Accuracy–speed trade-off across surveyed tools.** Each point represents a tool’s reported MAPE vs. evaluation speed (log-scale categories). Hybrid (NeuSight) and trace-driven (SimAI) approaches achieve the best accuracy–speed balance.

## 6.1 Accuracy by Problem Difficulty

We organize accuracy results by inherent problem difficulty rather than comparing across incompatible benchmarks (Figure 6). Accelerator dataflow modeling is most tractable (Timeloop: 5–10%) because systolic arrays exhibit deterministic data movement—the absence of dynamic scheduling, speculative execution, and shared caches eliminates the primary sources of modeling error. Single-GPU kernel prediction achieves 2–12% via hybrid methods (NeuSight, Habitat), where the accuracy gap relative to accelerators arises from GPU-specific dynamic effects: warp scheduling decisions that depend on runtime register pressure, memory coalescing efficiency that varies with access patterns, and shared memory bank conflicts that are input-dependent. Distributed systems reach 2–15% (SimAI 1.9%, ASTRA-sim 5–15%), where the dominant error source shifts from compute modeling to communication topology and collective algorithm selection. Cross-platform edge prediction achieves 0.7–2% but requires per-device profiling; GPU analytical modeling remains hardest (AMALI: 23.6%). Setup costs vary dramatically: analytical models require only architecture specifications, ML-augmented approaches need 10–10K profiling samples per device, and cycle-accurate simulators require hardware-specific binaries or traces.

## 6.2 Practitioner Tool Selection

Tool selection depends on the target platform, acceptable error margin, and available setup time; Figure 7 provides a decision flowchart. For *accelerator DSE*, use Timeloop or MAESTRO for microsecond-speed exhaustive search with interpretable bottleneck feedback; Sparseloop extends this to sparse workloads. For *GPU evaluation*, NeuSight offers the best accuracy–speed balance for LLMs; use Accel-Sim when microarchitectural detail is needed, accepting the



Figure 6: Reported accuracy (MAPE) of surveyed tools, grouped by methodology type. Range midpoints used where ranges are reported. Cross-tool comparison is approximate due to differing benchmarks, workloads, and hardware targets.



Figure 7: Tool selection decision flowchart. Practitioners choose based on target platform, then refine by workload characteristics and resource constraints. Colors indicate methodology type: blue=analytical, green=hybrid, orange=trace-driven, purple=ML-augmented.

1000× slowdown. For *distributed systems*, use VIDUR for LLM serving configuration and ASTRA-sim or SimAI for training parallelism at scale; MAD Max provides fast analytical estimates when trace collection is impractical. For *edge devices*, LitePred offers the broadest platform coverage, while HELP excels with minimal profiling data. Practitioners should consider tools with Docker-first deployment (VIDUR, Timeloop, ASTRA-sim) over tools with unpinned dependencies, as our evaluation shows containerized tools are consistently easier to reproduce.

Table 4: Reproducibility evaluation results. Tools are ranked by overall assessment. <sup>†</sup>Timeloop CLI works but Python bindings fail.

| Tool                  | Setup | Reprod. | Usability | Total  |
|-----------------------|-------|---------|-----------|--------|
| VIDUR                 | 2.5   | 3.5     | 3         | 9/10   |
| Timeloop <sup>†</sup> | 3     | 4       | 2         | 9/10   |
| ASTRA-sim             | 2.5   | 3       | 3         | 8.5/10 |
| NeuSight              | 2     | 3       | 2.5       | 7.5/10 |
| nn-Meter              | 2     | 0       | 1         | 3/10   |

Table 5: VIDUR simulation results for Llama-2-7B inference serving on a simulated A100 GPU (100 requests each, seed=42). Schedulers use different arrival rates, so latency values are not directly comparable; results illustrate VIDUR’s scheduling model fidelity. All metrics from our own experiments.

| Metric              | vLLM   | Sarathi |
|---------------------|--------|---------|
| Requests            | 100    | 100     |
| QPS (Poisson)       | 2.0    | 4.0     |
| Avg E2E latency (s) | 0.170  | 0.174   |
| P99 E2E latency (s) | 0.285  | 0.294   |
| Avg TTFT (s)        | 0.027  | 0.028   |
| Avg TPOT (s)        | 0.0093 | 0.0095  |
| Requests preempted  | 28     | 0       |

## 7 Experimental Evaluation

We conducted hands-on evaluations of five tools spanning methodology types: Timeloop (analytical), ASTRA-sim (trace-driven, distributed), VIDUR (trace-driven, LLM serving), nn-Meter (ML-augmented, edge), and NeuSight (hybrid, GPU).

**Environment.** All evaluations ran on Apple M2 Ultra (aarch64, 192 GB RAM) using Docker containers where provided—no GPU hardware was available, so we cannot validate absolute accuracy claims. We assess each tool on setup difficulty, output reproducibility, and usability. Table 4 summarizes results.

### 7.1 Per-Tool Results

**VIDUR** (9/10). We simulated Llama-2-7B on a simulated A100 under two scheduler configurations (Table 5). Note that the two schedulers were run at different arrival rates (QPS 2.0 vs. 4.0), so their latency metrics are *not directly comparable*; rather, the results illustrate VIDUR’s ability to model distinct scheduling behaviors. The vLLM scheduler preempted 28% of requests even at lower load, while Sarathi’s chunked prefetch [2] avoided preemptions entirely despite 2× higher arrival rate—consistent with the KV-cache management differences described in the respective papers [2, 43].

**Timeloop** (9/10). Docker CLI produces deterministic, bit-identical outputs for Eyeriss-like configurations; reference outputs enable hardware-free verification. Python bindings fail (`ImportError: libbarvinok.so.23`).

**ASTRA-sim** (8.5/10). We ran collective microbenchmarks and ResNet-50 training at 2–8 GPUs (Table 6). Reduce-Scatter takes

Table 6: ASTRA-sim quantitative results from our experiments on the HGX-H100 configuration. Top: collective microbenchmarks (8 NPUs, 1MB). Bottom: ResNet-50 data-parallel training scaling.

| Collective Microbenchmarks (8 NPUs, 1 MB) |         |              |
|-------------------------------------------|---------|--------------|
| Collective                                | Cycles  | Ratio vs. AR |
| All-Reduce                                | 57,426  | 1.000        |
| All-Gather                                | 44,058  | 0.767        |
| Reduce-Scatter                            | 28,950  | 0.504        |
| All-to-All                                | 114,000 | 1.985        |

  

| ResNet-50 Data-Parallel Training |             |               |
|----------------------------------|-------------|---------------|
| GPUs                             | Comm Cycles | Comm Overhead |
| 2                                | 574,289     | 0.05%         |
| 4                                | 1,454,270   | 0.13%         |
| 8                                | 3,307,886   | 0.30%         |



Figure 8: Reproducibility assessment for evaluated tools. Docker-first tools (VIDUR, Timeloop, ASTRA-sim) are consistently reproducible, while tools relying on serialized ML models (nn-Meter) become unusable. The dashed line separates Docker-based from non-Docker deployments.

half the time of All-Reduce (consistent with half the data); communication overhead scales  $5.76\times$  for 4× more GPUs, matching ring All-Reduce scaling.

**NeuSight** (7.5/10). Tile-based decomposition mirrors CUDA tiling for dense operations; irregular workloads had limited examples.

**nn-Meter** (3/10). After four attempts (>4h), no predictions ran: pickle-serialized predictors (scikit-learn 0.23.1) are incompatible with current versions. The claimed <1% MAPE is **unverifiable**.

Figure 8 visualizes the reproducibility assessment, showing that Docker-first tools are consistently easier to reproduce.

## 7.2 Lessons and Threats to Validity

Five lessons emerge: (1) **Docker-first deployment** correlates with high reproducibility in our sample (all three Docker-based tools produced verifiable outputs; nn-Meter without Docker did not), though our sample of five tools limits causal conclusions. (2) **ML model serialization is fragile**—nn-Meter’s pickle-based predictors became unusable within two years. (3) **Reference outputs enable**



Figure 9: Workload coverage of surveyed tools by publication period. The shift toward transformer and LLM workloads accelerates from 2023, but MoE and diffusion models remain largely uncharacterized.

**trust without hardware**—Timeloop and ASTRA-sim include verifiable baselines. (4) **Scale-limited evaluation understates system tools**—our 2–8 GPU tests show only 0.30% communication overhead, far below production scales [15]. (5) **Reproducible accuracy claims should be weighted higher** than unrepeatable ones.

**Threats.** Our venue-focused search may under-represent industry and non-English publications; we exclude proprietary tools (NVIDIA Compute, internal TPU models); and accuracy metrics vary across papers (MAPE, RMSE, Kendall’s  $\tau$ ), limiting direct comparison.

## 8 Open Challenges and Future Directions

**Generalization gaps.** *Workload:* CNN→transformer transfer is largely unvalidated (NeuSight excepted); MoE, diffusion [39], and dynamic inference lack validated tools; scaling laws [14, 22, 27, 36] predict loss but not latency. Figure 9 shows the shift toward LLM workloads since 2023. *Hardware:* cross-family transfer (GPU→TPU→PIM)<sup>10</sup> remains unsolved despite meta-learning (HELP) and feature-based transfer (LitePred). *Temporal:* software stack evolution silently invalidating models is addressed by no tool.

**The composition problem.** Composing kernel-level predictions into end-to-end estimates is unsolved (Figure 10): NeuSight’s 2.3% kernel MAPE yields  $\sim 10\times$  higher variance at model level ( $\sigma_{\text{model}} \approx \sigma_{\text{kernel}} \cdot \sqrt{N}$ ), and correlated errors can compound linearly. VIDUR sidesteps this by profiling entire prefill/decode phases.

**Emerging hardware and reproducibility.** PIM [26, 31, 44, 55], chiplets, and disaggregated designs blur memory hierarchy assumptions; FlashAttention [16] changes the landscape faster than models retrain. No MLPerf [51, 63] equivalent exists for performance *prediction*.

**Future directions:** (1) validated non-CNN tools; (2) bounded composition error; (3) unified energy-latency-memory prediction [62]; (4) temporal robustness benchmarks; (5) Docker-first deployment with portable formats (ONNX, Chakra [69]).



**Figure 10: Error composition across abstraction levels.** Kernel-level predictions (2–3% each) accumulate through hidden overheads (kernel launch, memory allocation, data movement, synchronization) that are not captured by kernel-level tools, yielding 5–12% model-level error. System-level errors add communication and scheduling overhead.

## 9 Conclusion

This survey analyzed 22 tools in depth for predicting ML workload performance, organized by methodology type, target platform, and abstraction level. Key findings: (1) *No single methodology dominates*—analytical models offer microsecond interpretable evaluation, trace-driven simulators provide 2–15% system-level error, and hybrid approaches achieve the best accuracy–speed balance (NeuSight: 2.3% MAPE), with the right choice depending on the practitioner’s priorities. (2) *LLM workloads demand specialized modeling*—prefill/decode distinctions, KV cache management, and dynamic batching require purpose-built tools (VIDUR, Frontier) rather than CNN-era extensions. (3) *Reproducibility is a practical bottleneck*—Docker-first tools remain reproducible while tools relying on serialized ML models have become unusable. (4) *Accuracy claims require scrutiny* due to varying benchmarks and metrics.

The most pressing gaps are CNN-to-transformer generalization, kernel-to-end-to-end composition, emerging hardware support (PIM, chiplets), and reproducibility failures. As ML workloads grow in scale and diversity, this survey provides practitioners guidance for tool selection and researchers a roadmap for advancing the field.

## References

- [1] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A System for Large-Scale Machine Learning. In *Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI)*. 265–283.
- [2] Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramachandran. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. In *Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI)*. 117–134.
- [3] Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramachandran. 2024. VIDUR: A Large-Scale Simulation Framework for LLM Inference. In *Proceedings of Machine Learning and Systems (MLSys)*. 1–15.
- [4] Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. 2009. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In *Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)*. 163–174. <https://doi.org/10.1109/ISPASS.2009.4919648>
- [5] Abhimanyu Rajeshkumar Bambhaniya et al. 2025. HERMES: Understanding and Optimizing Multi-Stage AI Inference Pipelines. *arXiv preprint arXiv:2504.09775* (2025). Heterogeneous multi-stage LLM inference simulator with analytical modeling.
- [6] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoib, Nilay Vaish, Mark D. Hill, and David A. Wood. 2011. The gem5 Simulator. *ACM SIGARCH Computer Architecture News* 39, 2 (2011), 1–7. <https://doi.org/10.1145/2024716.2024718>
- [7] Shirley Browne, Jack Dongarra, Nathan Garner, George Ho, and Philip Mucci. 2000. A Portable Programming Interface for Performance Evaluation on Modern Processors. *International Journal of High Performance Computing Applications* 14, 3 (2000), 189–204. <https://doi.org/10.1177/109434200001400303> PAPI: portable API for hardware performance counters, foundational tool for performance analysis.
- [8] Kai Cai, Wei Miao, Junyu Zhu, Jiaxu Chen, Hao Shan, Huanyu Li, and Chi Zhang. 2024. Echo: Simulating Distributed Training At Scale. *arXiv preprint arXiv:2412.12487* (2024).
- [9] Tianqi Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. 2024. MEDUSA: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. In *Proceedings of the 41st International Conference on Machine Learning (ICML)*. 1–15.
- [10] Zheng Cao et al. 2025. AMALI: An Analytical Model for Accurately Modeling LLM Inference on Modern GPUs. In *Proceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA)*. 1–14. <https://doi.org/10.1145/3695053.3731064> Reduces GPU LLM inference MAPE from 127.56% to 23.59% vs GCoM baseline.
- [11] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning. In *Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)*. 269–284. <https://doi.org/10.1145/2541940.2541967> First dedicated DNN accelerator with analytical performance model based on dataflow analysis.
- [12] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In *Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI)*. 578–594.
- [13] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2016. Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. In *Proceedings of the 43rd International Symposium on Computer Architecture (ISCA)*. 367–379. <https://doi.org/10.1109/ISCA.2016.40>
- [14] Leshem Choshen, Yang Zhang, and Jacob Andreas. 2025. A Hitchhiker’s Guide to Scaling Law Estimation. In *Proceedings of the 42nd International Conference on Machine Learning (ICML)*. 1–25. Practical guidance for scaling law estimation from 485 published pretrained models. IBM/MIT.
- [15] Weiwei Chu, Xinfeng Xie, Jiecao Yu, Jie Wang, Pavan Balaji, Ching-Hsiang Chu, Jongsoo Park, et al. 2025. Scaling Llama 3 Training with Efficient Parallelism Strategies. In *Proceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA)*. 1–15. 4D parallelism for Llama 3 405B on 16K H100 GPUs. Achieves 400 TFLOPS/GPU. Meta.
- [16] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In *Advances in Neural Information Processing Systems (NeurIPS)*, Vol. 35. 16344–16359.
- [17] Lukasz Dudziak, Thomas Chau, Mohamed S. Abdelfattah, Royson Lee, Hyeji Kim, and Nicholas D. Lane. 2024. Latency Predictors for Neural Architecture Search. In *Proceedings of Machine Learning and Systems (MLSys)*. 1–14.
- [18] Yang Feng, Zhehao Li, Jiacheng Yang, and Yunxin Liu. 2024. LitePred: Transferable and Scalable Latency Prediction for Hardware-Aware Neural Architecture Search. In *Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI)*. 1–18.
- [19] Paraskevas Gavriilidis et al. 2025. LIFE: Forecasting LLM Inference Performance via Hardware-Agnostic Analytical Modeling. *arXiv preprint arXiv:2508.00904* (2025). Hardware-agnostic analytical model for LLM inference performance forecasting.
- [20] Siddharth Ghosh et al. 2025. Frontier: Simulating the Next Generation of LLM Inference Systems. *arXiv preprint arXiv:2508.03148* (2025). Stage-centric simulator for MoE and disaggregated LLM inference, models expert parallelism and cross-cluster routing.
- [21] Alicia Golden et al. 2025. PRISM: Probabilistic Runtime Insights and Scalable Performance Modeling for Large-Scale Distributed Training. *arXiv preprint arXiv:2510.15596* (2025). Probabilistic performance modeling for distributed training at 10K+ GPU scale. Meta..
- [22] Alexander Hagele, Elie Bakouch, Atli Kosson, Loubna Ben Allal, Leandro Von Werra, and Martin Jaggi. 2024. Scaling Laws and Compute-Optimal Training

- 1045      Beyond Fixed Training Durations. In *Advances in Neural Information Processing Systems (NeurIPS)*, Vol. 37. Spotlight. Practical scaling laws with constant LR +  
1046      cooldowns for reliable training compute prediction.
- 1047      [23] Ameer Haj-Ali et al. 2025. Omniwise: Predicting GPU Kernels Performance with  
1048      LLMs. *arXiv preprint arXiv:2506.20886* (2025). First LLM-based GPU kernel  
1049      performance prediction, 90% within 10% error on AMD MI250/MI300X.
- 1050      [24] Yanbin Hao et al. 2025. POD-Attention: Unlocking Full Prefill-Decode Overlap  
1051      for Faster LLM Inference. In *Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)*. 1–15. Full overlap between prefill and decode phases for LLM inference.
- 1052      [25] John L. Hennessy and David A. Patterson. 2019. A New Golden Age for Computer  
1053      Architecture. *Commun. ACM* 62, 2 (2019), 48–60. <https://doi.org/10.1145/3282307>
- 1054      Turing Award Lecture: domain-specific architectures and the end of Dennard scaling.
- 1055      [26] Guseul Heo, Sangyeop Lee, Jaehong Cho, Hyunmin Choi, Sanghyeon Lee,  
1056      Hyungkyu Ham, Gwangsun Kim, Divya Mahajan, and Jongse Park. 2024. Ne-  
1057      uPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing. In  
1058      *Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)*. 1–17. NPU-  
1059      PIM heterogeneous architecture for LLM inference with performance modeling.  
KAIST/Georgia Tech..
- 1060      [27] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya,  
1061      Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes  
1062      Welbl, Aidan Clark, et al. 2022. Training Compute-Optimal Large Language  
1063      Models. *arXiv preprint arXiv:2203.15556* (2022). Chinchilla scaling laws: compute-  
1064      optimal training requires scaling data proportionally to model size.
- 1065      [28] Samuel Hsia, Kartik Chandra, and Kunle Olukotun. 2024. MAD Max Beyond  
1066      Single-Node: Enabling Large Machine Learning Model Acceleration on Dis-  
1067      tributed Systems. In *Proceedings of the 51st Annual International Symposium on Computer Architecture (ISCA)*. 753–766. <https://doi.org/10.1109/ISCA59077.2024.00064>
- 1068      [29] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Xu  
1069      Chen, Hyoukjoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng  
1070      Chen. 2019. GPipe: Efficient Training of Giant Neural Networks using Pipeline  
1071      Parallelism. In *Advances in Neural Information Processing Systems (NeurIPS)*, Vol. 32. 103–112.
- 1072      [30] Rodrigo Huerta, Mojtaba Abaie Shoushtary, Jose-Lorenzo Cruz, and Antonio  
1073      Gonzalez. 2025. Dissecting and Modeling the Architecture of Modern GPU Cores.  
In *Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture (MICRO)*. 369–384. Reverse-engineers modern NVIDIA GPU cores, improves  
1074      Accel-Sim to 13.98% MAPE. UPC Barcelona..
- 1075      [31] Bongjoon Hyun, Taehun Kim, Dongjae Lee, and Minsoo Rhu. 2024. Pathfind-  
1076      ing Future PIM Architectures by Demystifying a Commercial PIM Technology.  
In *Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA)*. 1–15. uPIMulator: cycle-accurate PIM simulation  
1077      framework for UPMEM. KAIST..
- 1078      [32] Ryota Imai, Kentaro Harada, Ryo Sato, and Toshio Nakaike. 2024. Roofline-  
1079      Driven Machine Learning for Large Language Model Performance Prediction.  
1080      *NeurIPS Workshop on Machine Learning for Systems* (2024).
- 1081      [33] Norman P. Jouppi, Doe Hyun Yoon, George Kurian, Sheng Li, Nishant Patil,  
1082      James Laudon, Cliff Young, and David Patterson. 2023. TPU v4: An Optically  
1083      Reconfigurable Supercomputer for Machine Learning with Hardware Support for  
1084      Embeddings. *Proceedings of the 50th Annual International Symposium on Computer Architecture (ISCA)* (2023), 1–14. <https://doi.org/10.1145/3579371.3589350> 4096-  
1085      chip pods with 3D optical interconnect; up to 1.7x/2.1x faster than TPU v3.
- 1086      [34] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal,  
1087      Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borber, et al. 2017. In-Datacen-  
1088      ter Performance Analysis of a Tensor Processing Unit. In *Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA)*. 1–12. <https://doi.org/10.1145/3079856.3080246> First dedicated ML inference  
1089      accelerator; 15–30x over CPUs/GPUs on CNN inference.
- 1090      [35] Andreas Kosmas Kakolyris, Dimosthenis Masouros, Petros Vavaroutsos, Sotirios  
1091      Xydis, and Dimitrios Soudris. 2025. throttLLM-eM: Predictive GPU Throttling for  
1092      Energy Efficient LLM Inference Serving. In *Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA)*. 1–14. Achieves  
1093      up to 43.8% lower energy consumption for LLM inference.
- 1094      [36] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess,  
1095      Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. *arXiv preprint arXiv:2001.08361*  
1096      (2020). Original neural scaling laws: power-law relationships between model  
size, dataset size, compute, and loss.
- 1097      [37] Mahmoud Khairy, Zhesheng Shen, Tor M. Aamodt, and Timothy G. Rogers. 2020. Accel-Sim: An Extensible Simulation Framework for Validated GPU Modeling. In  
1098      *Proceedings of the 47th International Symposium on Computer Architecture (ISCA)*. 473–486. <https://doi.org/10.1109/ISCA45697.2020.00047>
- 1099      [38] Jungho Kim et al. 2025. PyTorchSim: A Comprehensive, Fast, and Accurate  
1100      NPU Simulation Framework. In *Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture (MICRO)*. 1–14. <https://doi.org/10.1145/3725843>
- 1101      [39] Jin Kim, Byeongjun Shin, Jinha Chung, and Minsoo Rhu. 2026. The Cost of  
1102      Dynamic Reasoning: Demystifying AI Agents and Test-Time Scaling from an AI  
1103      Infrastructure Perspective. In *Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA)*. 1–14. HPCA 2026 (Jan 31–  
1104      Feb 4, 2026, Las Vegas). First comprehensive system-level analysis of AI agents;  
1105      quantifies resource usage, latency, and datacenter power consumption.
- 1106      [40] Yoongu Kim, Weikun Yang, and Onur Mutlu. 2016. Ramulator: A Fast and  
1107      Extensible DRAM Simulator. *IEEE Computer Architecture Letters* 15, 1 (2016), 45–  
1108      49. <https://doi.org/10.1109/LCA.2015.2414456> Fast extensible DRAM simulator  
1109      supporting DDR3, LPDDR3, GDDR3, WIO3, HBM3 standards.
- 1110      [41] Srivatsan Krishnan, Amir Yazdanbakhsh, Shvetank Prakash, Norman P.  
1111      Jouppi, Jignesh Parmar, Hyoukjun Kim, James Laudon, and Chandrakant  
1112      NarayananSwami. 2023. ArchGym: An Open-Source Gymnasium for Machine  
1113      Learning Assisted Architecture Design. In *Proceedings of the 50th International Symposium on Computer Architecture (ISCA)*. 1–16. <https://doi.org/10.1145/3579371.3589049>
- 1114      [42] Hyoukjun Kwon, Prasanth Chataras, Michael Sarber, Michael Pellauer, Angshuman  
1115      Parashar, and Tushar Krishna. 2019. MAESTRO: A Data-Centric Approach  
1116      to Understand Reuse, Performance, and Hardware Cost of DNN Mappings. In  
1117      *Proceedings of the 52nd IEEE/ACM International Symposium on Microarchitecture (MICRO)*. 1–14. <https://doi.org/10.1145/3352460.3358292>
- 1118      [43] Woosuk Kwon, Zhuohuan Li, Siyun Zhuang, Ying Sheng, Lianmin Zheng,  
1119      Cody Hau Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient  
1120      Memory Management for Large Language Model Serving with PagedAttention.  
In *Proceedings of the 29th Symposium on Operating Systems Principles (SOSP)*. 611–626. <https://doi.org/10.1145/3600006.3613165>
- 1121      [44] Hyojung Lee, Daehyeon Baek, Jimyoung Son, Jeun Choi, Kihyo Moon, and  
1122      Minsung Jang. 2025. PAISE: PIM-Accelerated Inference Scheduling Engine for  
1123      Transformer-based LLM. In *Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA)*. 1–14. PIM-based LLM  
1124      inference scheduling. 48.3% speedup, 11.5% power reduction. Samsung..
- 1125      [45] Hayeon Lee, Sewoong Lee, Song Chong, and Sung Ju Hwang. 2021. HELP:  
1126      Hardware-Adaptive Efficient Latency Prediction for NAS via Meta-Learning.  
In *Advances in Neural Information Processing Systems (NeurIPS)*, Vol. 34. 27016–  
1127      27028.
- 1128      [46] Seunghyun Lee, Amar Phanishayee, and Divya Mahajan. 2025. NeuSight: GPU  
1129      Performance Forecasting via Tile-Based Execution Analysis. In *Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)*. 1–15.
- 1130      [47] Jianbo Li et al. 2025. TrioSim: A Lightweight Simulator for Large-Scale DNN  
1131      Workloads on Multi-GPU Systems. In *Proceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA)*. 1–13. Multi-GPU DNN simulation  
1132      with lightweight approach for distributed training analysis.
- 1133      [48] Shang Li, Zhiyuan Yang, Dhiraj Reddy, Ankur Srivastava, and Bruce Jacob. 2020.  
1134      DRAMsim3: A Cycle-Accurate, Thermal-Capable DRAM Simulator. *IEEE Computer Architecture Letters* 19, 2 (2020), 106–109. <https://doi.org/10.1109/LCA.2020.2973991> Modernized DRAM simulator with thermal modeling and HMC  
1135      support.
- 1136      [49] Wenxuan Liang et al. 2025. Lumos: Efficient Performance Modeling and Esti-  
1137      mation for Large-scale LLM Training. In *Proceedings of Machine Learning and Systems (MLSys)*. 1–16. Trace-driven performance modeling achieving 3.3% error  
1138      on H100 GPUs for LLM training.
- 1139      [50] Haocong Luo, Yahya Can Tugrul, F. Nisa Bostancı, Ataberk Olgun, A. Giray  
1140      Yaglikeci, and Onur Mutlu. 2023. Ramulator 2.0: A Modern, Modular, and Extensi-  
1141      ble DRAM Simulator. *IEEE Computer Architecture Letters* 22, 2 (2023), 129–132.  
<https://doi.org/10.1109/LCA.2023.3333759> Modular DRAM simulator with DDR5,  
1142      LPDDR5, HBM3, GDDR6 support and RowHammer mitigation modeling.
- 1143      [51] Peter Mattson, Christine Cheng, Cody Coleman, Greg Diamos, Paulius Micke-  
1144      vicius, David Patterson, Hanlin Tang, Gu-Yeon Wei, Peter Bailis, Victor Bittorf,  
1145      et al. 2020. MLPerf Training Benchmark. In *Proceedings of Machine Learning and Systems (MLSys)*. 336–349. Standard ML training benchmark suite cover-  
1146      ing image classification, object detection, NLP, recommendation, reinforcement  
1147      learning.
- 1148      [52] Azaz-Ur-Rehman Nasir, Samroz Ahmad Shoaib, Muhammad Abdullah Hanif, and  
1149      Muhammad Shafique. 2025. ESM: A Framework for Building Effective Surrogate  
1150      Models for Hardware-Aware Neural Architecture Search. In *Proceedings of the 62nd ACM/IEEE Design Automation Conference (DAC)*. 1–6. 97.6% accuracy  
1151      surrogate model framework for HW-aware NAS.
- 1152      [53] Amir Nasr-Esfahany et al. 2025. Concorde: Fast and Accurate CPU Performance  
1153      Modeling with Compositional Analytical-ML Fusion. In *Proceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA)*. 1–15. Hybrid  
1154      analytical-ML approach achieving 2% CPI error at 5 orders of magnitude faster  
1155      than gem5.
- 1156      [54] Angshuman Parashar, Priyanka Raina, Yakun Sophia Shao, Yu-Hsin Chen,  
1157      Victor A. Ying, Anurag Muber, Rangharajan Venkatesan, Brucek Khailany,  
1158      Stephen W. Keckler, and Joel Emer. 2019. Timeloop: A Systematic Approach  
1159      1160

- 1161 to DNN Accelerator Evaluation. In *Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)*. 304–315.  
 1162 <https://doi.org/10.1109/ISPASS.2019.00042>
- 1163 [55] Jaehyun Park, Jaewan Choi, Kwanhee Kyung, Michael Jaemin Kim, Yongsuk  
 1164 Kwon, Nam Sung Kim, and Jung Ho Ahn. 2024. AttAcc! Unleashing the Power of  
 1165 PIM for Batched Transformer-based Generative Model Inference. In *Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)*. 1–16. PIM-based accelerator for  
 1166 batched transformer attention. Seoul National University/UIUC.
- 1167 [56] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory  
 1168 Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019.  
 1169 PyTorch: An Imperative Style, High-Performance Deep Learning Library. In *Advances in Neural Information Processing Systems (NeurIPS)*, Vol. 32. 8024–8035.
- 1170 [57] Pratyush Patel, Esha Choukse, Chaojie Zhang, Aakanksha Shah, Íñigo Goiri,  
 1171 Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient Generative LLM  
 1172 Inference Using Phase Splitting. In *Proceedings of the 51st Annual International Symposium on Computer Architecture (ISCA)*. 118–132. <https://doi.org/10.1109/ISCA59077.2024.00019> Best Paper Award.
- 1173 [58] Hang Qi, Evan R. Sparks, and Ameet Talwalkar. 2017. Paleo: A Performance  
 1174 Model for Deep Neural Networks. In *Proceedings of the 5th International Conference on Learning Representations (ICLR)*. <https://openreview.net/forum?id=SyVJ85lg>
- 1175 [59] Samyam Rajbhandari, Jeff Rasley, Olatunji Rber, and Yuxiong He. 2020. ZeRO:  
 1176 Memory Optimizations Toward Training Trillion Parameter Models. In *Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC)*. 1–16. <https://doi.org/10.1109/SC41405.2020.00024>  
 1177 DeepSpeed ZeRO optimizer partitioning for memory-efficient distributed training.
- 1178 [60] Mehdi Rakhshanfar and Aliakbar Zarandi. 2021. A Survey on Machine Learning-based  
 1179 Design Space Exploration for Processor Architectures. *Journal of Systems Architecture* 121 (2021), 102339. <https://doi.org/10.1016/j.sysarc.2021.102339>
- 1180 [61] Saeed Rashidi, Srinivas Srinivasan, Kazem Hamedani, and Tushar Krishna. 2020. ASTRA-SIM: Enabling SW/HW Co-Design Exploration for Distributed  
 1181 DL Training Platforms. In *Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)*. 81–92. <https://doi.org/10.1109/ISPASS48437.2020.00018>
- 1182 [62] Vijay Janapa Reddi et al. 2025. MLPerf Power: Benchmarking the Energy Efficiency of Machine Learning Inference. In *Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA)*. 1–14. Energy efficiency benchmarking for ML inference workloads.
- 1183 [63] Vijay Janapa Reddi, Christine Cheng, David Kanter, Peter Mattson, Guenther Schmuelling, Carole-Jean Wu, Brian Anderson, Maxim Breeshevko, Mark Duber, et al. 2020. MLPerf Inference Benchmark. In *Proceedings of the 47th International Symposium on Computer Architecture (ISCA)*. 446–459. <https://doi.org/10.1109/ISCA45697.2020.00045> Standard ML inference benchmark suite with server and offline scenarios.
- 1184 [64] Arun F. Rodrigues, K. Scott Hemmert, Brian W. Barrett, Chad Kersey, Ron Oldfield, Marlo Weston, R. Risen, Jeanine Cook, Paul Rosenfeld, Elliott Cooper-Balis, and Bruce Jacob. 2012. The Structural Simulation Toolkit. In *ACM SIGMETRICS Performance Evaluation Review*, Vol. 38. 37–42. <https://doi.org/10.1145/1964218.1964225> Modular framework for system-level simulation, widely used for HPC and interconnect modeling.
- 1185 [65] Paul Rosenfeld, Elliott Cooper-Balis, and Bruce Jacob. 2011. DRAMSim2: A Cycle Accurate Memory System Simulator. *IEEE Computer Architecture Letters* 10, 1 (2011), 16–19. <https://doi.org/10.1109/L-CA.2011.4> Widely-used cycle-accurate DDR/DDR3 memory simulator validated against manufacturer Verilog models.
- 1186 [66] Ananda Samajdar, Yuhao Zhu, Paul Whatmough, Matthew Mattina, and Tushar Krishna. 2019. A Systematic Methodology for Characterizing Scalability of DNN Accelerators using SCALE-Sim. In *Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)*. 58–68. <https://doi.org/10.1109/ISPASS.2019.00016> Cycle-accurate systolic array simulator for DNN accelerator DSE.
- 1187 [67] Zhuomin Shen, Jaeho Kim, et al. 2025. AQUA: Network-Accelerated Memory Offloading for LLMs in Scale-Up GPU Domains. In *Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)*. 1–16. <https://doi.org/10.1145/3676641.3715983> Improves LLM inference responsiveness by 20x through network-accelerated memory offloading.
- 1188 [68] Mohammad Shoeybi, Mostafa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2020. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. In *arXiv preprint arXiv:1909.08053*. Intra-layer tensor parallelism for large language model training.
- 1189 [69] Srinivas Sridharan, Taekyung Heo, Jinwoo Choi, Garyfallia Yu, Saeed Rashidi, William Won, Zhaodong Meng, and Tushar Krishna. 2023. Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces. *arXiv preprint arXiv:2305.14516* (2023).
- 1190 [70] Foteini Strati, Zhendong Zhang, George Manos, Ixeia Sanchez Periz, Qinghao Hu, Tiancheng Chen, Berk Buzcu, Song Han, Pamela Delgado, and Ana Klimovic. 2025. Sailor: Automating Distributed Training over Dynamic, Heterogeneous, and Geo-distributed Clusters. In *Proceedings of the 30th ACM Symposium on Operating Systems Principles (SOSP)*. 1–18. Automated distributed training with runtime/memory simulation over heterogeneous resources. ETH Zurich/MIT.
- 1191 [71] Ondrej Sykora, Alexis Rucker, Charith Mendis, Rajkishore Barik, Phitchaya Mangop Pothilimthana, and Saman Amarasinghe. 2022. GRANITE: A Graph Neural Network Model for Basic Block Throughput Estimation. In *Proceedings of the IEEE International Symposium on Workload Characterization (ISWC)*. 1–13. <https://doi.org/10.1109/ISWC55918.2022.00014>
- 1192 [72] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer. 2017. Efficient Processing of Deep Neural Networks: A Tutorial and Survey. In *Proceedings of the IEEE*, Vol. 105. 2295–2329. <https://doi.org/10.1109/JPROC.2017.2761740> Canonical DNN accelerator taxonomy covering dataflows, data reuse, and energy efficiency.
- 1193 [73] Jan Treibig, Georg Hager, and Gerhard Wellein. 2010. LIKWID: A Lightweight Performance-Oriented Tool Suite for x86 Multicore Environments. In *Proceedings of the 39th International Conference on Parallel Processing Workshops (ICPPW)*. 207–216. <https://doi.org/10.1109/ICPPW.2010.38> Lightweight tools for thread/cache topology, affinity, and performance counter measurement.
- 1194 [74] Adrian Tschan, Mohamed Awad, et al. 2025. SwizzlePerf: Hardware-Aware LLMs for GPU Kernel Performance Optimization. *arXiv preprint arXiv:2508.20258* (2025). LLM-based spatial optimization for GPU kernels, up to 2.06x speedup via swizzling.
- 1195 [75] Xizheng Wang, Qingxu Li, Yichi Xu, Gang Lu, Heyang Zhou, Sen Zhang, Yikai Zhu, Yang Liu, Pengcheng Zhang, Kun Qian, et al. 2025. SimAI: Unifying Architecture Design and Performance Tuning for Large-Scale LLM Training with Scalability and Precision. In *Proceedings of the 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI)*. 1–18. Full-stack LLM training simulator achieving 98.1% alignment with real-world results. Alibaba Cloud/Tsinghua.
- 1196 [76] Zixian Wang et al. 2025. SynPerf: Synthesizing High-Performance GPU Kernels via Pipeline Decomposition. *arXiv preprint* (2025). Under review.
- 1197 [77] Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: An Insightful Visual Performance Model for Multicore Architectures. *Commun. ACM* 52, 4 (2009), 65–76. <https://doi.org/10.1145/1498765.1498785>
- 1198 [78] William Won, Taekyung Heo, Saeed Rashidi, Saeed Talati, Srinivas Srinivasan, and Tushar Krishna. 2023. ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-Model Training at Scale. In *Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)*. 283–294. <https://doi.org/10.1109/ISPASS57527.2023.00035>
- 1199 [79] Yannan Nellie Wu, Joel Emer, and Vivienne Sze. 2022. Sparseloop: An Analytical Approach to Sparse Tensor Accelerator Modeling. In *Proceedings of the 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)*. 1–15. <https://doi.org/10.1109/MICRO56248.2022.00078>
- 1200 [80] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. ORCA: A Distributed Serving System for Transformer-Based Generative Models. In *Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI)*. 521–538.
- 1201 [81] Geoffrey X. Yu, Yubo Gao, Pavel Golber, and Asaf Cidon. 2021. Habitat: A Runtime-Based Computational Performance Predictor for Deep Neural Network Training. In *Proceedings of the USENIX Annual Technical Conference (ATC)*. 503–521.
- 1202 [82] Yi Zhai, Yu Cheng Wang, Peng Jiang, and Congming Kang. 2023. TLP: A Deep Learning-based Cost Model for Tensor Program Tuning. In *Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)*. 833–845. <https://doi.org/10.1145/3575693.3575736>
- 1203 [83] Li Lyra Zhang, Shihao Han, Jianyu Wei, Ningxin Zheng, Ting Cao, Yuqing Yang, and Yunxin Liu. 2021. nn-Meter: Towards Accurate Latency Prediction of Deep-Learning Model Inference on Diverse Edge Devices. In *Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys)*. 81–93. <https://doi.org/10.1145/3458864.3467882> Best Paper Award.
- 1204 [84] Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhao, Koushik Sen, Joseph E. Gonzalez, and Ion Stoica. 2020. Ansor: Generating High-Performance Tensor Programs for Deep Learning. In *Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI)*. 863–879.
- 1205 [85] Lianmin Zheng, Ruochen Liu, Junru Shao, Tianqi Chen, Joseph E. Gonzalez, Ion Stoica, and Zhihao Zhang. 2021. TenSet: A Large-scale Program Performance Dataset for Learned Tensor Compilers. In *Advances in Neural Information Processing Systems (NeurIPS)*, Vol. 34. 29876–29888.
- 1206 [86] Yinmin Zhong, Shengyu Liu, Junda Chen, Jianyu Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. In *Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI)*. 1–18.

1219  
 1220  
 1221  
 1222  
 1223  
 1224  
 1225  
 1226  
 1227  
 1228  
 1229  
 1230  
 1231  
 1232  
 1233  
 1234  
 1235  
 1236  
 1237  
 1238  
 1239  
 1240  
 1241  
 1242  
 1243  
 1244  
 1245  
 1246  
 1247  
 1248  
 1249  
 1250  
 1251  
 1252  
 1253  
 1254  
 1255  
 1256  
 1257  
 1258  
 1259  
 1260  
 1261  
 1262  
 1263  
 1264  
 1265  
 1266  
 1267  
 1268  
 1269  
 1270  
 1271  
 1272  
 1273  
 1274  
 1275