



# Debunking the CUDA Myth Towards GPU-based AI Systems

Evaluation of the Performance and Programmability of Intel's Gaudi NPU for AI Model Serving

Yunjae Lee\*

KAIST

Daejeon, Republic of Korea

[yunjae408@kaist.ac.kr](mailto:yunjae408@kaist.ac.kr)

Eunyeong Cho

KAIST

Daejeon, Republic of Korea

[eunyeong5433@kaist.ac.kr](mailto:eunyeong5433@kaist.ac.kr)

Hyungjun Kim

SqueezeBits

Seoul, Republic of Korea

[hyungjun.kim@squeezebits.com](mailto:hyungjun.kim@squeezebits.com)

Ranggi Hwang

KAIST

Daejeon, Republic of Korea

[ranggi.hwang@kaist.ac.kr](mailto:ranggi.hwang@kaist.ac.kr)

Juntaek Lim\*

KAIST

Daejeon, Republic of Korea

[juntaek0425@kaist.ac.kr](mailto:juntaek0425@kaist.ac.kr)

Huijong Jeong

SqueezeBits

Seoul, Republic of Korea

[huijong.jeong@squeezebits.com](mailto:huijong.jeong@squeezebits.com)

Joonhyung Lee

NAVER Cloud

Seongnam, Republic of Korea

[joonhyung.lee@navercorp.com](mailto:joonhyung.lee@navercorp.com)

Se Jung Kwon

NAVER Cloud

Seongnam, Republic of Korea

[sejung.kwon@navercorp.com](mailto:sejung.kwon@navercorp.com)

Minsoo Rhu†

KAIST

Daejeon, Republic of Korea

[mrhu@kaist.ac.kr](mailto:mrhu@kaist.ac.kr)

Jehyeon Bang

KAIST

Daejeon, Republic of Korea

[jehyeon.bang@kaist.ac.kr](mailto:jehyeon.bang@kaist.ac.kr)

Taesu Kim

SqueezeBits

Seoul, Republic of Korea

[taesu.kim@squeezebits.com](mailto:taesu.kim@squeezebits.com)

Jinseop Im

KAIST

Daejeon, Republic of Korea

[jinseop.im@kaist.ac.kr](mailto:jinseop.im@kaist.ac.kr)

Dongsoo Lee

NAVER Cloud

Seongnam, Republic of Korea

[dongsoo.lee@navercorp.com](mailto:dongsoo.lee@navercorp.com)

## Abstract

This paper presents a comprehensive evaluation of Intel Gaudi NPUs as an alternative to NVIDIA GPUs, which are currently the de facto standard in AI system design. First, we create microbenchmarks to compare Intel Gaudi-2 with NVIDIA A100, showing that Gaudi-2 achieves competitive performance not only in primitive AI compute, memory, and communication operations but also in executing several important AI workloads end-to-end. We then assess Gaudi NPU's programmability by discussing several software-level optimization strategies for implementing critical FBGEMM operators and vLLM, evaluating their efficiency against GPU-optimized counterparts. Results indicate that Gaudi-2 achieves energy efficiency comparable to A100, though there are notable areas for improvement in terms of software maturity. Overall, we conclude that, with effective integration into high-level AI frameworks, Gaudi NPUs could challenge NVIDIA GPU's dominance in the AI server market, though further improvements are necessary to fully compete with NVIDIA's robust software ecosystem.

## CCS Concepts

• Computer systems organization → Neural networks.

## Keywords

Artificial intelligence, domain-specific architecture, GPU, NPU

## ACM Reference Format:

Yunjae Lee, Juntaek Lim, Jehyeon Bang, Eunyeong Cho, Huijong Jeong, Taesu Kim, Hyungjun Kim, Joonhyung Lee, Jinseop Im, Ranggi Hwang, Se Jung Kwon, Dongsoo Lee, and Minsoo Rhu. 2025. Debunking the CUDA Myth Towards GPU-based AI Systems: Evaluation of the Performance and Programmability of Intel's Gaudi NPU for AI Model Serving. In *Proceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA '25)*, June 21–25, 2025, Tokyo, Japan. ACM, New York, NY, USA, 17 pages. <https://doi.org/10.1145/3695053.3731050>

\*Both authors contributed equally to this research.

†Corresponding author: Minsoo Rhu ([mrhu@kaist.ac.kr](mailto:mrhu@kaist.ac.kr))

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

ISCA '25, Tokyo, Japan

© 2025 Copyright held by the owner/author(s).

ACM ISBN 979-8-4007-1261-6/25/06

<https://doi.org/10.1145/3695053.3731050>

## 1 Introduction

*In the past few years there have been many studies claiming GPUs deliver substantial speedups (10×–1,000×) over multi-core CPUs on throughput computing kernels. . . . After applying optimizations appropriate for both CPUs and GPUs the performance gap between NVIDIA GTX280 and Intel Core i7 960 narrows to only 2.5×*

“Debunking the 100X GPU vs. CPU Myth:  
An Evaluation of Throughput Computing on  
CPU and GPU”, Intel, ISCA, 2010 [45]

Intel’s latency-optimized processor architectures have dominated the computing industry for decades, serving as the foundation for executing a wide range of applications. However, as GPU computing gained prominence in the late 2000s, NVIDIA, then the underdog, began to challenge Intel’s long-standing dominance in the server market. With the rise of AI, throughput-optimized processor architectures spearheaded by NVIDIA GPUs dethroned Intel, establishing NVIDIA’s CUDA software ecosystem as the de facto standard for training and deploying AI models.

One might argue that domain-specific architectures for AI, also known as Neural Processing Units (NPUs), present competitive alternatives to GPUs. However, cases of successfully utilizing NPUs for deploying AI services are limited to a handful of hyperscalers that can amortize the enormous development costs of NPUs by serving millions to billions of customers with their AI offerings (e.g., Google’s TPU [37], Meta’s MTIA [15], Amazon’s Inferentia [2]). As a result, most AI services deployed today are built using NVIDIA GPUs. The primary advantage of NVIDIA GPUs over commercially available NPUs is their ease of programming with CUDA. The flexible programming interface of CUDA, along with the rich software ecosystem built around GPU-accelerated backend libraries (e.g., cuBLAS, cuDNN, cuSPARSE, cuSOLVER, cuDF, cuVS [55–58, 68, 69]), allows developers to easily implement and optimize new AI models created by AI practitioners (e.g., state-space models like Mamba [1]). While some cloud service providers like Google and Amazon do offer NPUs for developers in the form of “AI-as-a-Service” [3, 20], these platforms provide only limited access and programmability for the backend NPUs, such as Google TPU [19] and Amazon Inferentia [2]. This limitation makes it challenging to perform low-level kernel implementation and performance optimization specific to the target backend NPU architecture.

Given this landscape, Intel’s Gaudi NPU [24] is noteworthy for several reasons. First, Gaudi NPUs come with a native programming language called TPC-C (the CUDA equivalent for Gaudi) [33] as well as low-level compute primitives that ease the implementation of compute kernels targeting the NPU’s compute engines. Second, the performance of end-to-end AI applications utilizing these compute kernels is (according to Intel’s claims) comparable to, and in some cases better than, that of NVIDIA GPUs. Third, Gaudi NPUs are currently widely available for purchase, allowing researchers to thoroughly characterize this new NPU device vs. NVIDIA GPUs.

To this end, this paper presents a detailed characterization of Intel’s Gaudi NPU for AI model serving, assessing whether Intel, now the underdog, can pose a tangible threat to NVIDIA’s seemingly unassailable dominance in the AI computing market. A thorough understanding of this emerging NPU architecture and its applicability to various AI workloads can offer valuable insights for programmers, AI service providers, and computer architects working on next-generation NPU designs. As such, we conduct a comprehensive analysis of the Gaudi NPU from multiple dimensions, evaluating not just its raw performance but also its programmability for facilitating performance optimization and AI model development.

**Table 1: Comparison of NVIDIA A100 and Intel Gaudi-2.**

|                                      |               | NVIDIA A100                           | Intel Gaudi-2         | Ratio        |
|--------------------------------------|---------------|---------------------------------------|-----------------------|--------------|
| <b>Compute</b>                       | TFLOPS (BF16) | 312 (Tensor Cores)<br>39 (SIMD Cores) | 432 (MME)<br>11 (TPC) | 1.4×<br>0.3× |
| <b>Memory</b>                        | HBM type      | HBM2E                                 | HBM2E                 | -            |
|                                      | HBM capacity  | 80 GB                                 | 96 GB                 | 1.2×         |
|                                      | HBM bandwidth | 2 TB/sec                              | 2.46 TB/sec           | 1.2×         |
|                                      | SRAM capacity | 40 MB (L2 Cache)                      | 48 MB (Shared)        | 1.2×         |
| <b>Communication (Bidirectional)</b> |               | 600 GB/sec (NVLink)                   | 600 GB/sec (RoCE)     | 1.0×         |
| <b>Power</b>                         |               | 400 Watts                             | 600 Watts             | 1.5×         |

**(Performance)** To enable detailed experimental studies and analyses, we first develop a set of microbenchmarks [79] targeting Gaudi NPUs to stress-test their ability to maximize performance in several key compute, memory, and communication primitives. In this work, we use Intel’s second-generation Gaudi NPU (Gaudi-2) and NVIDIA’s A100 GPU as comparison points<sup>1</sup>, as both processors are manufactured using TSMC’s 7nm technology node and are supported by an HBM2E-based memory subsystem, providing comparable performance (Table 1). Our microbenchmarking revealed that Gaudi-2 demonstrates competitive performance in important primitive AI operations, particularly for tasks involving regular compute and memory accesses. However, Gaudi-2 did fall short of A100 in certain scenarios involving fine-grained data accesses and collective communications across a small number of processors.

After confirming Gaudi NPU’s competitiveness against NVIDIA GPUs in primitive AI operations through our microbenchmark-based analysis, we next evaluate both systems at the end-to-end AI application level [79]. Specifically, we focus on recommendation systems (RecSys) and large language models (LLMs), as these two are among the most widely deployed AI models in cloud environments. Our analysis of end-to-end AI applications revealed that Gaudi-2 achieves 28% lower energy-efficiency for RecSys but 50% higher energy-efficiency for LLMs compared to the A100.

**(Programmability)** We also present case studies on utilizing the Gaudi NPU’s programming interface to optimize its performance. Specifically, we discuss software-level optimizations [74, 79] that can be employed to develop Gaudi NPU-optimized versions of FBGEMM’s Batched Embedding Table [66] and vLLM [42], which enable high-performance model serving for RecSys and LLMs, respectively. Initially, publicly available Gaudi-optimized software for RecSys embedding layers and vLLM showed underwhelming results, achieving only 37% and 6% of the performance seen in their GPU-optimized counterparts. However, through various software-level optimizations applied at the low-level TPC-C (Batched Embedding Table) and high-level PyTorch (vLLM), we show that the performance-optimized Gaudi-2 can achieve 80% and 101% of A100’s performance running end-to-end RecSys and LLM applications based on the state-of-the-art FBGEMM and vLLM.

Overall, we conclude that the Gaudi NPU has significant potential to emerge as a contender to NVIDIA GPUs for AI model serving, challenging NVIDIA’s dominance in the AI computing industry. Most AI practitioners use high-level AI frameworks like PyTorch or TensorFlow for model development. As long as NPU chip vendors effectively support these frameworks with optimized low-level backend libraries, our analysis suggests that NVIDIA’s CUDA programming system might not be as formidable a “moat” in the AI server market. In other words, our conclusion is that the strength of NVIDIA GPU-based AI systems lies in its rich software ecosystem, rather than in CUDA itself. However, it is important not to misconstrue our assessment as an overly optimistic outlook on Intel’s Gaudi NPUs. NVIDIA’s stronghold in AI still remains robust, and we believe that Gaudi NPUs still face several key challenges that should be addressed to effectively compete with NVIDIA’s established position, which we further discuss in Section 5.

<sup>1</sup>The hardware and software architecture of Intel’s recently announced Gaudi-3 is virtually identical to that of Gaudi-2 (but with limited availability), except that Gaudi-3 offers higher compute and memory throughput, thanks to its chiplet-based design.

To summarize our **key contributions**:

- To the best of our knowledge, this work is the first to provide a detailed characterization of Intel’s Gaudi NPUs compared to NVIDIA GPUs, examining not only their performance but, more crucially, their programmability<sup>2</sup>.
- We implement a set of microbenchmarks to conduct a comparative study with NVIDIA GPUs, analyzing the potential, limitations, and bottlenecks of the Gaudi NPU architecture.
- To assess Gaudi NPU’s programmability, we discuss important software-level optimization strategies for developing Gaudi-optimized versions of DLRM [53] and vLLM [42]. Through this exercise, we discuss the strengths and weaknesses of the Gaudi NPU’s software architecture.

## 2 Background

### 2.1 Intel Gaudi Hardware Architecture

**(Compute)** The Gaudi processor architecture is designed based on a *heterogeneous* compute paradigm, integrating two key components (Figure 1): Matrix Multiplication Engines (MMEs) and fully programmable Tensor Processing Cores (TPCs). Gaudi-2 features two MMEs and 24 TPCs, which together provide high throughput by pipelining computations between the MMEs and TPCs.

The MME is a large, output-stationary systolic array with a  $256 \times 256$  MAC (Multiply-Accumulate) structure [40], designed to handle general matrix multiplication (GEMM) workloads, such as fully connected, convolutional, and batched GEMM layers. The MME is designed to be highly *configurable* in order to maximize the utilization of its MAC array. Specifically, the two MMEs in Gaudi-2, originally composed of two separate  $256 \times 256$  MAC units, can be dynamically reconfigured at runtime as a single  $512 \times 256$  MAC unit, a single  $1024 \times 128$  MAC unit, and others, depending on the shape of the input and output matrices of the GEMM operation. The optimal MME configuration for each target GEMM is determined by the Gaudi graph compiler, which we discuss in Section 3.2. Similar to NVIDIA’s Tensor Cores [61], the MME is a co-processor purposefully designed to accelerate matrix multiplications. As such, the MME is not directly programmable, meaning users cannot alter its functionality and can only utilize it for matrix multiplications.

Unlike the MME, the TPC is a highly programmable, VLIW (Very Long Instruction Word)-based processor designed to execute multiple types of instructions in parallel. Each instruction type is processed by dedicated units that handle load/store operations and scalar/vector operations (Figure 1), enabling efficient parallel execution. The SIMD (Single Instruction, Multiple Data) vector unit can handle 2048-bit wide vector operations. This makes the TPC highly effective for various data-parallel tasks in AI, particularly for nonlinear and non-matrix-based computations, such as vector gather-scatter operations or activation functions.

**Figure 1: High-level overview of Intel’s Gaudi NPU architecture.**

In terms of performance, the MME in Gaudi-2 delivers up to 432 TFLOPS of throughput for BF16 (brain floating point 16-bit [81]) operations. The TPCs provide an additional 11 TFLOPS for BF16. In comparison, NVIDIA’s A100 offers 312 TFLOPS for matrix operations (using Tensor Cores) and 39 TFLOPS for vector operations (using SIMD Cores). In aggregate, Gaudi-2 delivers approximately 1.26× the compute throughput of A100 (Table 1).
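The aggregate figure follows directly from the peak numbers in Table 1. A minimal sketch of the arithmetic is shown below; the variable and function names are chosen here purely for illustration.

```python
# Peak BF16 throughput in TFLOPS, taken from Table 1.
gaudi2 = {"matrix": 432, "vector": 11}   # MME, TPC
a100 = {"matrix": 312, "vector": 39}     # Tensor Cores, SIMD Cores

def ratio(unit):
    """Gaudi-2 peak throughput relative to A100 for one unit type."""
    return gaudi2[unit] / a100[unit]

# Sum matrix and vector peaks per device, then compare the totals.
aggregate_ratio = sum(gaudi2.values()) / sum(a100.values())

print(f"matrix:    {ratio('matrix'):.2f}x")
print(f"vector:    {ratio('vector'):.2f}x")
print(f"aggregate: {aggregate_ratio:.2f}x")
```

The per-unit ratios round to the 1.4× and 0.3× entries in Table 1, and the summed peaks give the 1.26× aggregate. Note that these are peak numbers only and say nothing about achieved utilization.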

**(Memory)** Gaudi-2 features 96 GB of HBM2E, delivering a bandwidth of 2.45 TB/sec (Table 1). High memory bandwidth plays a crucial role in AI workloads, particularly in memory-bound tasks such as the embedding vector gathers in RecSys [53, 80] and the decoding stages of LLMs [10, 64, 77]. The A100 offers 2 TB/sec of bandwidth, so Gaudi-2 provides approximately 20% higher maximum memory throughput. Regarding on-chip storage, Gaudi-2 includes 48 MB of on-chip SRAM, referred to as shared memory, which serves as a scratchpad for the Gaudi graph compiler. This shared memory acts as temporary storage, facilitating data movement between the MMEs, TPCs, and DMA engines to maximize both on-chip data reuse and hardware utilization. Each TPC has its own local memory (used as a scratchpad within the TPC), divided into scalar and vector memory banks. The scalar memory in a Gaudi-2 TPC is 1 KB in size and is accessed in 4-byte aligned chunks, while the vector memory is 80 KB and is accessed in 128- or 256-byte chunks. These local memories are private to each TPC, ensuring fast, dedicated memory operations without interference from other TPCs. In contrast, global memory (including the on-chip shared memory and off-chip HBM) is accessible to the entire system with a minimum access granularity of 256 bytes.
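To illustrate why the 256-byte minimum access granularity matters for fine-grained accesses, the following sketch estimates what fraction of the fetched bytes is actually useful when a request is smaller than, or not a multiple of, one chunk. The helper name `useful_fraction` and the example request sizes are assumptions made for illustration, and the model ignores caching and misalignment.

```python
import math

GLOBAL_MEM_GRANULARITY = 256  # bytes, minimum global-memory access size on Gaudi-2

def useful_fraction(request_bytes, granularity=GLOBAL_MEM_GRANULARITY):
    """Fraction of transferred bytes that were actually requested.

    Assumes the request starts at a chunk boundary (a simplification).
    """
    chunks = math.ceil(request_bytes / granularity)
    return request_bytes / (chunks * granularity)

for size in (64, 128, 256, 384, 1024):
    print(f"{size:5d} B request -> {useful_fraction(size):.0%} of fetched bytes useful")
```

Under this simple model, a 64-byte request uses only a quarter of the bytes moved, whereas requests that are multiples of 256 bytes use all of them.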

**(Communication)** Intel’s HLS-Gaudi-2 server [25] is integrated with eight Gaudi-2 chips. Each Gaudi-2 is equipped with $24 \times 100$ GbE RoCEv2 [23] ports, providing a maximum bandwidth of 2.4 Tbps when all eight chips participate in collective communication. Of the 24 RoCE ports, 21 are dedicated to direct, point-to-point (P2P) inter-chip communication, with each pair of Gaudi-2 chips connected by three 100 GbE links. Since any given pair of Gaudi-2 is connected via P2P links, the effective bandwidth depends on the number of Gaudi-2 chips involved in the collective communication. For example, when two Gaudi-2 chips communicate, only 300 Gbps of communication bandwidth is available ($3 \times 100$ GbE links), which is just 1/8 of the maximum 2.4 Tbps bandwidth.

<sup>2</sup>The set of benchmarks used in this paper to characterize Intel Gaudi has been open-sourced at <https://github.com/VIA-Research/Intel-Gaudi-AI-benchmarks>.

```

1 # 64x64 FP32 Matrix multiply-add
2 # dev = GPU('cuda') or Gaudi('hpu')
3 A = torch.rand((64,64), device = dev)
4 B = torch.randn((64,64), device = dev)
5 C = torch.ones((64,64), device = dev)
6 D = torch.zeros((64,64), device = dev)
7
8 # Compute D = A x B + C
9 if dev == 'cuda':
10    D = torch.wmma_cuda(A, B, C, D)
11
12 elif dev == 'hpu':
13    # Launch GEMM on MME
14    result = torch.matmul(A, B)
15
16    # Launch ADD on TPC
17    D = torch.add_tpc(result, C, D)

```

(a)

```

1 // For simplicity, we ignore matrix tiling
2 __global__ void wmma_cuda (float *A, float *B,
3                           float *C, float *D) {
4     // Declare matrices and load them from global memory
5     wmma::fragment<wmma::matrix_a, ...> frag_a;
6     wmma::load_matrix_sync(frag_a, A, ...);
7
8     // Compute A x B using Tensor Cores
9     wmma::mma_sync(frag_acc, frag_a, frag_b, frag_acc);
10
11    // ADD C matrix to the intermediate result
12    for (int i = 0; i < frag_c.num_elements; i++) {
13        frag_c.x[i] = frag_acc.x[i] + frag_c.x[i];
14    }
15    // Store the result
16    wmma::store_matrix_sync(D, frag_c, ...);
17 }

```

(b)

```

1 void add_tpc(tensor inputA, tensor inputB, tensor outputC) {
2     // Get index space information
3     int5 InputCoord, OutputCoord;
4     int depthStart, depthEnd = get_index_space_information();
5     int widthStart, widthEnd = get_index_space_information();
6
7     // A single step in the depth dimension is 256B / 4B (FP32) = 64
8     int depthStep = 64; int depth_dimension = 0;
9     int width_dimension = 1;
10
11    // Declare input/output for 256-byte FP32 vectors (=float64)
12    float64 x, y, result;
13
14    for (int d = depthStart; d < depthEnd; d += depthStep) {
15        InputCoord[depth_dimension] = d;
16        OutputCoord[depth_dimension] = d;
17
18        // Unroll factor is set as 4
19        #pragma unroll(4)
20        for (int w = widthStart; w < widthEnd; w += 1) {
21            InputCoord[width_dimension] = w;
22            OutputCoord[width_dimension] = w;
23
24            // Fetches 256-byte vector from global memory
25            x = v_f32_ld_tnsr(InputCoord, inputA);
26            y = v_f32_ld_tnsr(InputCoord, inputB);
27
28            // Element-wise vector add operation
29            result = v_f32_add_b(x, y);
30
31            // Stores the 256-byte vector to global memory
32            v_f32_st_tnsr(OutputCoord, outputC, result);
33        ...
34    }

```

(c)

**Figure 2: Example code showing how a matrix multiply-add operation can be programmed for execution on an NVIDIA GPU and Intel Gaudi.** (a) At the PyTorch level, the corresponding low-level kernels are executed based on the device type (i.e., GPU (“cuda”, line 9) or Gaudi (“hpu”, line 12)). (b) In NVIDIA’s CUDA, the matrix multiply-add operation is performed within a single kernel launch using the WMMA APIs, which leverage Tensor Cores alongside SIMD Cores. (c) In Gaudi, the GEMM operation can only be handled at the PyTorch level (line 14 in (a)), so a TPC-C kernel (`add_tpc`) is called at the PyTorch level to execute the subsequent add operation (line 17 in (a)).

**Figure 3: High-level overview of the TPC programming model.** Example assumes a program performing an element-wise vector addition operation, partitioned into two dimensions of index space, with each partition being executed by a single TPC. The loop within the TPC program is assumed to be unrolled by a factor of “4” to maximize both instruction-level parallelism and memory-level parallelism.

NVIDIA’s DGX A100 server is integrated with a network switch (NVSwitch [54]) that enables all GPUs within the node to communicate simultaneously at the total NVLink bandwidth. Unlike P2P connections, where multiple processors must split bandwidth, NVSwitch ensures that each GPU can transfer data at maximum speed, regardless of how many GPUs are involved in communication. Consequently, for AI model serving scenarios that do not utilize all eight Gaudi-2 chips, the way Gaudi-2’s P2P bandwidth scales with the number of participating chips can limit communication bandwidth, affecting system-wide performance.
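The dependence of Gaudi-2’s P2P bandwidth on the number of participating chips can be sketched as follows. This is a simplified model that counts only the 21 P2P ports (three 100 GbE links per chip pair), so the eight-chip value stays below the 2.4 Tbps total of all 24 ports; it also ignores protocol overhead. The function name is hypothetical.

```python
LINK_GBPS = 100          # one 100 GbE RoCE port
LINKS_PER_PAIR = 3       # links between any pair of Gaudi-2 chips
TOTAL_PORTS = 24         # RoCE ports per Gaudi-2 chip

def p2p_bandwidth_gbps(num_chips):
    """Per-chip P2P bandwidth when `num_chips` chips communicate."""
    if not 2 <= num_chips <= 8:
        raise ValueError("an HLS-Gaudi-2 server holds eight chips")
    peers = num_chips - 1
    return peers * LINKS_PER_PAIR * LINK_GBPS

peak = TOTAL_PORTS * LINK_GBPS  # 2400 Gbps across all 24 ports
for n in (2, 4, 8):
    bw = p2p_bandwidth_gbps(n)
    print(f"{n} chips: {bw} Gbps per chip ({bw / peak:.1%} of {peak} Gbps)")
```

With two chips the model gives 300 Gbps, i.e., 1/8 of 2.4 Tbps, whereas an NVSwitch-based system offers each GPU its full NVLink bandwidth regardless of how many GPUs participate.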

### 2.2 Intel Gaudi Software Architecture

**(Programming model)** The GPU programming system follows the Single Instruction Multiple Thread (SIMT) model, which falls under the Single Program Multiple Data (SPMD) paradigm. To support efficient SIMT execution, GPUs include unique microarchitectural support, such as a large register file for fine-grained, massive multi-threading [17], dynamic branch divergence resolution [16, 70, 73], and warp-wide memory coalescing [8, 51]. In contrast, Gaudi lacks these features and instead uses a *single-threaded* programming model optimized for data-level rather than thread-level parallelism.

In NVIDIA’s CUDA, programmers can leverage the WMMA (Warp Matrix Multiply and Accumulate) APIs [48] to directly utilize Tensor Cores alongside conventional SIMD Cores for computation within low-level CUDA kernels. Furthermore, these operations can be exposed through a higher-level PyTorch API (line 10 of Figure 2(a) and line 9 of Figure 2(b)). In contrast, the Gaudi SDK currently restricts direct access to the MME units, allowing programmers to explicitly control only the TPCs. Access to MME units is limited to the PyTorch level (Figure 2(a), line 14), which prevents programmers from directly optimizing performance involving both MME and TPC at a lower level. This limitation can be addressed through the Gaudi graph compiler’s optimization pass, which we will discuss later.

In Gaudi’s TPC programming model, the workload is partitioned across different TPCs, which all execute the same TPC program. Workload distribution is performed by partitioning the *index space* (equivalent to the CUDA *grid*), enabling each TPC to process different data independently. The index space can have up to five dimensions, and each member of the index space corresponds to an indivisible unit of work processed by a single TPC.
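As a rough illustration of index-space partitioning, the sketch below splits a two-dimensional index space into contiguous ranges, one per TPC. The partitioning policy (an even split along the width dimension) and all names are assumptions made here; the actual assignment of index-space members to TPCs is decided by the Gaudi software stack and may differ.

```python
NUM_TPCS = 24  # Gaudi-2 has 24 TPCs

def partition_index_space(depth, width, num_tpcs=NUM_TPCS):
    """Split a (depth, width) index space along the width dimension.

    Returns (tpc_id, (depth_start, depth_end), (width_start, width_end))
    tuples with half-open ranges. Each index-space member is an
    indivisible unit of work, so members are never split between TPCs.
    """
    base, extra = divmod(width, num_tpcs)
    parts, start = [], 0
    for tpc in range(num_tpcs):
        size = base + (1 if tpc < extra else 0)
        if size == 0:
            continue  # more TPCs than members: this TPC stays idle
        parts.append((tpc, (0, depth), (start, start + size)))
        start += size
    return parts

parts = partition_index_space(depth=1, width=100)
print(len(parts), parts[0], parts[-1])
```

Every TPC runs the same program on its own range, much as each CUDA thread block processes its own portion of the grid.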

Figure 3 illustrates how a single TPC conducts a simple element-wise vector addition operation (pseudo-code shown in Figure 2(c)). A TPC program is typically structured using a for-loop, which iteratively executes *vector-wide* “Load→Compute→Store” operations. For optimal performance, TPC programmers are advised to carefully partition the index space across the TPCs while maximizing the performance of individual TPCs. Two best practices are recommended. First, to maximize memory bandwidth utilization, the TPC’s data access granularity should be aligned to 256 bytes, as this is the minimum access granularity for global memory (e.g., in line 8 of Figure 2(c) and in Figure 3, a single step in the depth dimension, which determines the memory access granularity, is sized at 256 bytes). Second, to fully utilize the TPC processor, it is recommended that programmers manually *unroll* the for-loop to maximize both instruction-level and memory-level parallelism (e.g., a single step in the width dimension, which determines the magnitude of loop unrolling and thus the number of parallel operations, is sized at 4). This recommendation stems from the fact that TPC instructions have an average architectural latency of 4 processor cycles (i.e., the effect of executing a TPC instruction is reflected in the architectural state 4 cycles later) [27]. By unrolling four TPC instructions within a given iteration of a for-loop, the processor pipeline can be better utilized and memory access latency can be better hidden. To reduce the burden of manual loop unrolling, the TPC programming system provides preprocessor directives such as “#pragma unroll” (line 19 in Figure 2(c)).
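The two practices can be mimicked in plain Python to show the shape of such a loop. This is only an emulation of the control flow in Figure 2(c), not TPC-C code: lists stand in for global memory, and the function name and flat one-dimensional layout are assumptions made for illustration.

```python
VECTOR_BYTES = 256       # global-memory access granularity
UNROLL = 4               # chosen to match the ~4-cycle instruction latency

def vector_add(a, b, elem_bytes=4):
    """Element-wise add using 256-byte vector steps, unrolled by 4."""
    lanes = VECTOR_BYTES // elem_bytes          # 64 lanes for FP32
    assert len(a) == len(b) and len(a) % (lanes * UNROLL) == 0
    out = [0.0] * len(a)
    for base in range(0, len(a), lanes * UNROLL):
        # Group the four independent loads, then the adds, then the
        # stores, so no operation depends on the one issued just before.
        offs = [base + u * lanes for u in range(UNROLL)]
        xs = [a[o:o + lanes] for o in offs]                    # Load
        ys = [b[o:o + lanes] for o in offs]
        rs = [[x + y for x, y in zip(xv, yv)]                  # Compute
              for xv, yv in zip(xs, ys)]
        for o, r in zip(offs, rs):                             # Store
            out[o:o + lanes] = r
    return out

n = 64 * 4 * 2
result = vector_add([1.0] * n, [2.0] * n)
print(result[:3], len(result))
```

Python itself gains nothing from this ordering; the point is that each iteration exposes four independent Load→Compute→Store chains, which is what gives a VLIW pipeline room to overlap them.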

**(Graph compiler)** Intel’s Gaudi NPU comes with a software suite called Intel Gaudi SDK [31], which is tightly integrated with PyTorch and TensorFlow. This integration allows developers to utilize familiar tools while taking advantage of Gaudi NPU’s hardware acceleration. Gaudi SDK includes a *graph compiler* that converts AI models into a format optimized for execution on Gaudi NPUs. The graph compiler applies high-level model optimizations, such as operator and kernel fusion. For example, an MLIR [44]-based operation fuser selects arbitrary subgraphs of element-wise, reduction, and normalization operations, then JIT-fuses and compiles them into TPC kernels [7, 30, 40]. This improves performance by reducing memory bandwidth usage and enabling tensor shape-aware optimizations, unlike shape-agnostic kernel libraries. The graph compiler also performs hardware-specific optimizations, which include MME configuration adjustments for high MME utilization (Figure 1, further discussed in Section 3.2) and operator pipelining between MME and TPC. When an MME operation is followed by a TPC operation (e.g., GEMM followed by an activation function), the graph compiler breaks them into smaller, independent sub-operations to enable pipelined execution. This approach helps hide latency and reduce overall execution time. The graph compiler orchestrates data transfers between the MME and TPC for pipelined execution, using on-chip shared memory as an intermediate buffer to minimize redundant off-chip memory accesses.
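The benefit of splitting an MME operation and its dependent TPC operation into sub-operations can be approximated with a textbook two-stage pipeline model. The timings below are made-up numbers, and the model ignores scheduling and data-transfer overheads; it is not a description of how the graph compiler actually chooses its split.

```python
def serial_time(t_mme, t_tpc):
    """TPC work starts only after the whole MME operation finishes."""
    return t_mme + t_tpc

def pipelined_time(t_mme, t_tpc, num_chunks):
    """Two-stage pipeline over `num_chunks` equal, independent sub-operations."""
    mme_chunk = t_mme / num_chunks
    tpc_chunk = t_tpc / num_chunks
    # The first chunk flows through both stages; each later chunk adds
    # only the time of the slower stage.
    return mme_chunk + tpc_chunk + (num_chunks - 1) * max(mme_chunk, tpc_chunk)

t_mme, t_tpc = 80.0, 40.0   # hypothetical latencies in microseconds
for k in (1, 2, 4, 8):
    print(f"{k} chunk(s): {pipelined_time(t_mme, t_tpc, k):.1f} us "
          f"(serial: {serial_time(t_mme, t_tpc):.1f} us)")
```

In this idealized model the total time approaches that of the slower engine as the number of sub-operations grows, which is the latency-hiding effect described above.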

While lowering the target model graph into Gaudi NPU executable operations is crucial for performance optimization, the programmer, unfortunately, has no control over the graph compiler’s optimization process. In other words, users cannot modify the behavior of the graph compiler nor dictate when a particular graph compiler optimization pass should be activated or not.

## 3 Characterizing Gaudi NPU Performance

### 3.1 Motivation and Evaluation Methodology

**(Motivation)** AI practitioners utilize high-level AI software frameworks like PyTorch for model development. Consequently, a competitive AI software ecosystem should provide not only highly optimized, low-level backend libraries that accelerate performance-critical primitive AI operations (e.g., GEMM, vector gather-scatter, collective communication), but, more critically, it should also deliver the performance benefits of hardware acceleration at the end-to-end AI application level. To this end, the primary objective of our characterization is twofold. First, we develop microbenchmarks using Intel Gaudi SDK and our custom-designed TPC-C kernels to evaluate Gaudi’s ability to achieve high performance in key primitive AI operations (Table 2). Second, we set out to explore whether Gaudi can deliver competitive performance at the end-to-end AI application level. In our end-to-end performance characterization, we use recommendation systems (RecSys) and large language models (LLMs), as they represent the two most widely deployed AI models in today’s datacenters while exhibiting very different compute and memory characteristics (Table 3).

**Table 2: Evaluated microbenchmarks.**

| Microbenchmark |                          | System  | Implementation   |
|----------------|--------------------------|---------|------------------|
| Compute        | GEMM                     | Gaudi-2 | PyTorch API      |
|                |                          | A100    | PyTorch API      |
|                | non-GEMM                 | Gaudi-2 | TPC-C            |
|                |                          | A100    | CUDA             |
| Memory         | Vector gather-scatter    | Gaudi-2 | TPC-C            |
|                |                          | A100    | CUDA             |
| Communication  | Collective communication | Gaudi-2 | Intel HCCL [28]  |
|                |                          | A100    | NVIDIA NCCL [59] |

**Table 3: Evaluated end-to-end AI workloads.**

| Model           | Configuration | Embedding layer                                    | MLP layer                                      | Interaction layer |
|-----------------|---------------|----------------------------------------------------|------------------------------------------------|-------------------|
| DLRM-DCNv2 [52] | RM1           | # tables: 10<br># embeddings: 1M<br># gathers: 10  | Bottom: 512-256-64<br>Top: 1024-1024-512-256-1 | # layers: 3       |
|                 | RM2           | # tables: 20<br># embeddings: 1M<br># gathers: 100 | Bottom: 256-64-64<br>Top: 128-64-1             | # layers: 2       |

| Model          | Configuration | Embedding layer         | Transformer layers                                                                                           |
|----------------|---------------|-------------------------|--------------------------------------------------------------------------------------------------------------|
| Llama-3.1 [12] | 8B            | # vocabularies: 128,256 | # layers: 32<br># heads for query: 32<br># heads for key, value: 8<br>hidden/intermediate size: 4,096/14,336 |
|                | 70B           | # vocabularies: 128,256 | # layers: 80<br># heads for query: 64<br># heads for key, value: 8<br>hidden/intermediate size: 8,192/28,672 |

**(Methodology)** All experiments discussed in the rest of this paper are conducted using an HLS-Gaudi-2 server (which contains eight Gaudi-2 chips connected via RoCE) and a DGX A100 server (which contains eight A100 GPUs connected via NVSwitch and NVLink) (Table 1). On the software side, we use Intel Gaudi Software v1.18.0, which is based on PyTorch 2.4, along with the TPC-C SDK for custom Gaudi kernel development. For the GPU system evaluation, we use PyTorch 2.4 and CUDA 12.4 for GPU kernel development. All experimental results presented in this paper assume the BF16 data type, except when we evaluate end-to-end RecSys models which utilize FP32. The non-RecSys evaluation results for FP32 and their key takeaways were practically identical to those for BF16, so we omit presenting those results for brevity.

For our microbenchmark analysis targeting primitive AI operations (Section 3.2 to Section 3.4), we focus on absolute performance and resource utilization. When evaluating GEMM operations, we use the PyTorch API (which defaults to cuBLAS [55] as its backend library on the GPU), which selects the best-performing GEMM implementation for each target device. For end-to-end AI workload analysis (Section 3.5), we evaluate both performance and energy-efficiency. When evaluating



**Figure 4:** Roofline model showing the achieved TFLOPS of Gaudi-2 and A100 (BF16). The irregularly-shaped GEMMs, represented by triangle markers, have the $N$ dimension size fixed at 16. Performance is measured when both Gaudi-2 and A100 are configured to operate at their maximum possible frequency.



**Figure 5:** Compute utilization when the GEMM operations are (a) square-shaped ( $M=K=N$ ) and (b) irregularly-shaped ( $M$  and  $K$  are relatively larger than the fixed  $N$ ). In all the thermal plots presented in this paper, warmer colors indicate higher values. In (a), we leave the GEMM shapes that do not satisfy  $M=K=N$  vacant.

the energy-efficiency of end-to-end AI workloads, each system's power consumption is measured using `nvidia-smi` [60] for A100 and `hl-smi` [32] for Gaudi-2.
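Power readings polled from tools such as `nvidia-smi` or `hl-smi` have to be integrated over the run to obtain energy. The Python sketch below shows one way to do this with the trapezoidal rule. It assumes the readings have already been collected as timestamps and watts; the function names and the work metric are hypothetical and are not the measurement scripts used in this study.

```python
from typing import Sequence


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate sampled power (watts) over time (seconds) with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        # Area of the trapezoid between two consecutive samples.
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def energy_efficiency(work_items: float, energy_j: float) -> float:
    """Work completed per joule (e.g., queries or tokens per joule)."""
    return work_items / energy_j
```

Dividing the amount of completed work by the integrated energy gives a per-joule efficiency that can be compared across the two devices.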

### 3.2 Primitive “Compute” Operations

**(MME for GEMM operations)** We first analyze the efficiency of Gaudi MMEs in conducting GEMM operations. GEMM involves multiplying two matrices, matrix  $A$  of size  $(M \times K)$  and  $B$  of size  $(K \times N)$ , to produce a result matrix  $C$  of size  $(M \times N)$ . In Figure 4, we compare the performance of various  $(M, K, N)$  GEMM shapes on Gaudi-2 and A100 by plotting the achieved TFLOPS using a roofline model. To simplify our discussion, we classify GEMM operations into two types: (1) *square-shaped* GEMM (represented by square markers), where the dimensions  $M$ ,  $K$ , and  $N$  are all equal; and (2) *irregularly-shaped* GEMM (represented by triangle markers), where dimension  $N$  is set to a relatively small value compared to  $M$  and  $K$ , resulting in input matrices  $A$  and  $B$  that are tall and skinny, exhibiting the properties of memory-bound GEMV operations. As shown in Figure 4, Gaudi-2 consistently outperforms A100 across



**Figure 6:** Examples of how an irregularly-shaped GEMM is handled (a) by a typical output-stationary systolic array and (b) by Gaudi MME's systolic array *with reconfigurability*.



**Figure 7:** (a) How the geometry of the MME systolic array (i.e., width ($MME_{width}$) and height ($MME_{height}$) of MME) is configured based on the $(M, N)$ of GEMM while $K$ is fixed to 16,384, and (b) the corresponding compute utilization measured on Gaudi-2. To illustrate how MME's configurability enhances compute utilization, (c) compares the *measured* compute utilization when the MME executes these GEMMs (black bars) against the theoretical, *calculated* compute utilization when the same GEMMs are executed by a non-configurable, output-stationary $(256 \times 256 \times 2)$ systolic array with the same maximum FLOPS as Gaudi-2, as depicted in Figure 6(a) (white bars). Results in (c) assume $M=K=16,384$ while varying $N$. It is worth pointing out that the gray-colored MME configuration in (a) activates only a *subset* of the maximum $(256 \times 256 \times 2)$ MAC array. Intel does not disclose specific details on the MME's microarchitecture or its configurability; one possible explanation is that for smaller GEMM shapes, the MME power-gates the inactive portions of the MAC array to save energy.

all $(M, K, N)$ GEMM shapes we explore in this study. Notably, Gaudi-2 achieves 429 TFLOPS when $M=K=N=8,192$, reaching 99.3% of its peak compute throughput (Table 1). Part of Gaudi-2's higher absolute GEMM performance is due to its superior hardware specifications; its MME provides a maximum of 432 TFLOPS, which is roughly 40% higher than the 312 TFLOPS offered by A100's Tensor Cores. Therefore, we also compare how *efficiently* the two processors utilize their hardware resources by measuring their compute “utilization” during GEMM executions.
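The quantities used in this comparison can be sketched in a few lines of Python. The sketch assumes the common convention of counting $2MKN$ floating-point operations per GEMM, and an idealized memory traffic in which each of $A$, $B$, and $C$ moves to or from memory exactly once. The function names are illustrative only.

```python
def gemm_flops(m: int, k: int, n: int) -> int:
    """One multiply and one add per (m, k, n) triple."""
    return 2 * m * k * n


def gemm_min_bytes(m: int, k: int, n: int, bytes_per_elem: int = 2) -> int:
    """Idealized traffic: read A (M x K) and B (K x N), write C (M x N) once. BF16 = 2 bytes."""
    return bytes_per_elem * (m * k + k * n + m * n)


def arithmetic_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte under the idealized traffic model above."""
    return gemm_flops(m, k, n) / gemm_min_bytes(m, k, n, bytes_per_elem)


def achieved_tflops(m: int, k: int, n: int, seconds: float) -> float:
    """Throughput implied by a measured GEMM execution time."""
    return gemm_flops(m, k, n) / seconds / 1e12


def compute_utilization(achieved: float, peak: float) -> float:
    """Ratio of achieved to peak throughput (same units)."""
    return achieved / peak


square = arithmetic_intensity(8192, 8192, 8192)   # thousands of FLOPs per byte
skinny = arithmetic_intensity(8192, 8192, 16)     # under 16 FLOPs per byte
util = compute_utilization(429.0, 432.0)          # figures quoted in the text
```

Under these assumptions, a square GEMM has an arithmetic intensity in the thousands of FLOPs per byte, whereas fixing $N=16$ drops it below 16, which is why the irregularly-shaped GEMMs behave like memory-bound GEMV operations.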

In Figure 5, we measure the ratio of *achieved* TFLOPS to *peak* TFLOPS to quantify the compute utilization of GEMM operations. The results indicate that Gaudi-2 not only achieves higher absolute

TFLOPS (Figure 4) but also outperforms A100 in terms of compute utilization. Across all evaluated data points, Gaudi-2 achieves on average 4.5% higher compute utilization than A100 (up to 32% higher when $M=K=N=2,048$). These results ran counter to our initial expectations because large systolic arrays, as employed in Gaudi MMEs, are known to suffer from low MAC utilization when the GEMM operation is irregularly shaped and, therefore, not optimally aligned with the geometry of the systolic array [18, 38, 65]. For example, in a typical output-stationary systolic array, when the GEMM’s $M$ and $N$ dimensions are smaller than the height and width of the systolic array, the MAC units can experience significant underutilization (Figure 6(a)). As discussed in Section 2, however, Gaudi’s MME can dynamically *reconfigure* the geometry of its systolic array (i.e., height and width dimensions) to better align with the target GEMM’s $(M,K,N)$ shape, significantly improving the utilization of its MAC units (Figure 6(b)). To better understand this behavior, we use the Intel Gaudi Profiler to reverse-engineer how the graph compiler and runtime system manage the MME’s GEMM execution, which provides hints on how the MME geometry is dynamically configured in relation to the target $(M,K,N)$ GEMM shape. In Figure 7, we summarize the results of our reverse-engineering, showing how the geometry of the MME systolic array is configured as a function of the input GEMM’s $M$ and $N$ dimension sizes while fixing $K=16,384$ (Figure 7(a)) and how this configuration translates into the MME’s compute utilization (Figure 7(b)). Compared to a typical non-configurable, output-stationary systolic array (Figure 7(c)), the configurable MME architecture improves compute utilization by up to 15%.

**Key takeaway #1:** *When performing GEMM, Gaudi-2 achieved both higher absolute performance and greater compute utilization than A100. This superior GEMM performance and efficiency can be attributed not only to Gaudi-2’s higher max compute throughput but, more importantly, to the configurability of the Gaudi-2 MME, which enables its systolic array to flexibly adapt its geometry to be most optimal for the target GEMM’s  $(M,K,N)$  shape.*
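To make the benefit of a reconfigurable geometry concrete, the following toy model estimates the MAC utilization of an output-stationary array as the fraction of allocated output slots that hold useful results. This is a simplified sketch and not Intel's actual mechanism, which is undisclosed: it ignores pipeline fill and drain, treats the fixed baseline as a single 256×512 array, and uses a hypothetical list of candidate geometries that all contain the same $256 \times 256 \times 2$ MACs.

```python
from math import ceil
from typing import List, Tuple


def os_utilization(m: int, n: int, height: int, width: int) -> float:
    """Useful outputs divided by MAC slots allocated across all output tiles."""
    tiles = ceil(m / height) * ceil(n / width)
    return (m * n) / (tiles * height * width)


def best_geometry(m: int, n: int, geometries: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Pick the (height, width) pair that maximizes utilization for an (M, N) output."""
    return max(geometries, key=lambda g: os_utilization(m, n, g[0], g[1]))


# Hypothetical geometries, each with 256 * 256 * 2 = 131,072 MACs.
CANDIDATE_GEOMETRIES = [
    (256, 512), (512, 256), (1024, 128), (2048, 64), (128, 1024), (64, 2048),
]

fixed = os_utilization(16384, 16, 256, 512)               # non-configurable baseline
chosen = best_geometry(16384, 16, CANDIDATE_GEOMETRIES)   # tallest, narrowest option
```

In this model, a tall-and-skinny output ($M=16,384$, $N=16$) leaves most columns of the fixed array idle, while a taller and narrower geometry with the same MAC count recovers a large share of that loss. Square outputs that are multiples of the array dimensions are fully utilized under every geometry.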

**(TPC for non-GEMM operations)** We now evaluate the performance of Gaudi-2’s TPC in conducting vector operations, which are critical in performing AI operations like activation functions. Our microbenchmarks for non-GEMM operations are designed based on the STREAM [49] benchmark suite, which measures sustainable compute throughput and memory bandwidth for element-wise vector operations. Algorithm 1 summarizes these three microbenchmarks: ADD, SCALE, and TRIAD. These microbenchmarks access two (SCALE) or three (ADD, TRIAD) arrays in a *streaming* fashion. The number of floating-point operations per element is 1 for both ADD and SCALE (addition and multiplication, respectively) and 2 for TRIAD (multiplication followed by addition). We implement these microbenchmarks using TPC-C, where each TPC executes the ADD, SCALE, or TRIAD operation over a dedicated set of array elements (a total of 24 million elements) assigned to its specific index space (Figure 3). We utilize these microbenchmarks to demonstrate how performance can be optimized by applying the two TPC programming best practices discussed in Section 2.2, namely (1) aligning the data access granularity to 256 bytes, and (2) unrolling loops to maximize parallelism.

---

**Algorithm 1** Microbenchmarks for evaluating non-GEMM operations

---

```
procedure ADD(a, b, c, N)
  for i = 0 to N - 1 do
    c[i] ← a[i] + b[i]
  end for
end procedure

procedure SCALE(a, b, scalar, N)
  for i = 0 to N - 1 do
    b[i] ← scalar × a[i]
  end for
end procedure

procedure TRIAD(a, b, c, scalar, N)
  for i = 0 to N - 1 do
    c[i] ← scalar × a[i] + b[i]
  end for
end procedure
```

---

We first show how a *single* TPC’s performance is improved by applying the two aforementioned best practices. In Figure 8(a), we measure the compute throughput by varying our microbenchmarks’ data access granularity from 2 to 2,048 bytes, *without* loop unrolling. The results clearly show a significant performance drop when the data access granularity is set lower than 256 bytes, which is Gaudi’s minimum memory access granularity. At data access granularities of 256 bytes or higher, the overall throughput saturates at around 55 GFLOPS for TRIAD and around 30 GFLOPS for SCALE and ADD. From this point, we explore how much further performance improvement can be achieved by unrolling the for-loop in Algorithm 1 (discussed in Section 2.2). As shown in Figure 8(b), the compute throughput of SCALE improves remarkably, while ADD and TRIAD achieve only slight improvements as the loop unrolling factor increases. Both TRIAD and ADD load vectors from two arrays, resulting in two load instructions and one compute instruction (`v_bf16_mac_b` for TRIAD and `v_bf16_add_b` for ADD) for each loop iteration. In contrast, SCALE loads from a single array and thus requires only one load instruction and one compute instruction (`v_bf16_mul_b`) per loop iteration, providing more pipelining opportunities to hide the 4-cycle TPC processor latency and benefiting more from loop unrolling. Overall, these results highlight the importance of loop unrolling in TPC-C kernels to exploit instruction-level and memory-level parallelism for high performance.
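The structure of the three kernels and of the unrolling transformation can be illustrated with a plain Python sketch. This is not TPC-C code: Python scalars stand in for the vectors a TPC would process per iteration, and no speedup should be expected from running it. The function names are hypothetical; the only figures taken from the text are the 256-byte access granularity and the BF16 element size.

```python
from typing import List

BF16_BYTES = 2
# Number of BF16 elements covered by one 256-byte access.
ELEMS_PER_256B_ACCESS = 256 // BF16_BYTES


def add(a: List[float], b: List[float]) -> List[float]:
    return [x + y for x, y in zip(a, b)]


def scale(a: List[float], scalar: float) -> List[float]:
    return [scalar * x for x in a]


def triad(a: List[float], b: List[float], scalar: float) -> List[float]:
    return [scalar * x + y for x, y in zip(a, b)]


def scale_unrolled4(a: List[float], scalar: float) -> List[float]:
    """SCALE with the loop body unrolled by a factor of 4."""
    n = len(a)
    out = [0.0] * n
    main = n - n % 4
    for i in range(0, main, 4):
        # Four independent loads are issued back to back, so their latencies
        # can overlap instead of being paid one after another.
        x0, x1, x2, x3 = a[i], a[i + 1], a[i + 2], a[i + 3]
        out[i] = scalar * x0
        out[i + 1] = scalar * x1
        out[i + 2] = scalar * x2
        out[i + 3] = scalar * x3
    for i in range(main, n):  # remainder when n is not a multiple of 4
        out[i] = scalar * a[i]
    return out
```

The unrolled variant produces the same result as the reference loop; what changes is that each iteration exposes several independent load and compute operations that hardware can pipeline.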

Once these optimizations are in place for a single TPC, we scale up the number of TPCs by weak scaling the ADD, SCALE, and TRIAD workloads (Figure 8(c)). As shown, all three microbenchmarks achieve scalable improvements in compute throughput until the number of TPCs reaches between 11 and 15, eventually saturating at approximately 330, 530, and 670 GFLOPS for ADD, SCALE, and TRIAD, respectively. These values are significantly lower than Gaudi’s peak compute throughput of 11 TFLOPS, which is provided by its 24 TPCs (Table 1). This limitation arises because all three microbenchmarks have very low operational intensity (i.e., 1/6 operations/byte for ADD, 1/4 operations/byte for SCALE, and 2/6 operations/byte for TRIAD), so off-chip memory bandwidth limits their overall performance.
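The operational intensities quoted above follow directly from the kernel definitions with 2-byte BF16 elements, and a simple roofline bound shows why the kernels saturate far below the 11 TFLOPS peak. In the sketch below, the memory bandwidth value is an assumption made for illustration only (Table 1 is the authoritative source), and the function names are hypothetical.

```python
# (floating-point operations per element, number of arrays touched per element)
KERNELS = {"ADD": (1, 3), "SCALE": (1, 2), "TRIAD": (2, 3)}

PEAK_GFLOPS = 11_000.0      # 11 TFLOPS TPC peak, as stated in the text
ASSUMED_BW_GBS = 2_450.0    # ASSUMPTION: illustrative off-chip bandwidth in GB/s


def operational_intensity(kernel: str, bytes_per_elem: int = 2) -> float:
    """Operations per byte of memory traffic for one element of the output."""
    flops, arrays = KERNELS[kernel]
    return flops / (arrays * bytes_per_elem)


def roofline_gflops(oi: float, peak_gflops: float, bw_gbs: float) -> float:
    """Attainable throughput is capped by either compute or memory bandwidth."""
    return min(peak_gflops, oi * bw_gbs)


bounds = {
    name: roofline_gflops(operational_intensity(name), PEAK_GFLOPS, ASSUMED_BW_GBS)
    for name in KERNELS
}
```

Under the assumed bandwidth, all three bounds fall well below 1 TFLOPS, and the measured saturation points of roughly 330, 530, and 670 GFLOPS sit beneath their respective bounds, consistent with a memory-bound regime.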

Consequently, to explore how much Gaudi-2 and A100 can saturate the peak compute throughput of their vector engines (11 TFLOPS with Gaudi TPCs and 39 TFLOPS with A100 SIMD Cores), the experiments in Figure 8(d,e,f) artificially increase the operational intensity of ADD, SCALE, and TRIAD. Specifically, starting



**Figure 8: Compute throughput of ADD, TRIAD, and SCALE over a vector with 24 million scalar elements (BF16).** Panels (a) and (b) show the effects of (a) data access granularity and (b) unrolling factor on a single TPC’s throughput. In (c), we scale out the number of TPCs by weak scaling the three microbenchmarks. The left axis in (d, e, and f) shows the compute throughput when the operational intensity of (d) ADD, (e) SCALE, and (f) TRIAD is artificially increased. The right axis in (d, e, and f) shows the compute utilization when Gaudi-2’s and A100’s compute throughput peaks at its saturation point.

from the default STREAM benchmark configuration, we gradually increase the operational intensity (defined as the number of operations performed relative to the number of memory accesses) up to the point where the achieved compute throughput becomes saturated. At lower operational intensities, the three microbenchmarks become memory-bound, so Gaudi-2 demonstrates slightly higher compute throughput than A100 thanks to its 20% higher memory bandwidth (Table 1). At higher operational intensities where the workloads become compute-bound, A100 achieves much higher compute throughput because of its 3.5× higher vector computation power. In terms of compute utilization, Gaudi-2’s compute throughput saturates at approximately 5.5 TFLOPS, 5.5 TFLOPS, and 10.9 TFLOPS for the ADD, SCALE, and TRIAD operations, reaching 50%, 50%, and 99% of its maximum 11 TFLOPS throughput, respectively. In comparison, A100’s compute throughput saturates at around 19.4 TFLOPS, 19.4 TFLOPS, and 38.2 TFLOPS for the ADD, SCALE, and TRIAD operations, similarly reaching 50%, 50%, and 98% of its maximum 39 TFLOPS throughput, respectively (Figure 8(d,e,f)).

**Key takeaway #2:** Due to the 3.5× performance gap in vector math throughput, Gaudi-2 falls short of A100 in terms of absolute non-GEMM performance. However, in terms of compute “efficiency”, Gaudi-2 is comparable to A100 across all evaluated microbenchmarks, demonstrating the competitiveness of its design.



**Figure 9: Memory bandwidth utilization of (a) vector gather and (b) scatter operations.** The x-axis shows the proportion of vectors accessed among the total 4M vectors, i.e., the fraction of vectors either gathered from or scattered to random memory locations.

### 3.3 Primitive “Memory” Operations

Our characterization of non-GEMM primitive compute operations, using the STREAM benchmark suite, confirmed the competitiveness of the Gaudi-2 memory system in handling *streaming* memory access patterns. Another key aspect of memory system performance is its ability to manage *random* memory accesses. Inspired by the design philosophy behind the GUPS (Giga Updates Per Second [47]) benchmark suite, we developed our microbenchmarks to measure the memory system’s performance in handling vector gather-scatter operations. These operations involve reading (vector gather) and writing (vector scatter) large amounts of data at random memory locations, which is highly memory-intensive and exhibits low data locality. The performance of these vector gather-scatter operations is particularly relevant for AI workloads like RecSys and LLMs, which require frequent embedding table lookups [6, 39, 41, 43, 46].

To this end, our microbenchmarks perform vector gathers from and, similarly, vector scatters to random locations within a 2D vector array. This 2D vector array consists of 4 million vectors, with vector sizes ranging from 16 bytes to 2,048 bytes. In Figure 9, we show the memory bandwidth utilization of Gaudi-2 and A100 during vector gather-scatter operations. In general, Gaudi-2 achieves competitive memory bandwidth utilization when the vector size is $\geq 256$ bytes, which is its minimum memory access granularity. For instance, Gaudi-2 achieves on average 64% memory bandwidth utilization for vector gathers of $\geq 256$ bytes, which is only slightly lower than A100’s 72% average memory bandwidth utilization. However, for vector sizes smaller than 256 bytes, Gaudi-2 exhibits a significant drop in memory bandwidth utilization, achieving on average only 15% for vector gathers of $\leq 128$ bytes vs. A100’s average of 36%, a 2.4× gap in memory performance.

We speculate that the primary reason for A100’s superior vector gather-scatter performance is as follows. Several prior studies focusing on reverse-engineering NVIDIA GPU microarchitecture [36, 50] observed that its last-level cache either uses a cache line size of 32 bytes or a 32-byte sectored cache, suggesting that the minimum data access granularity for off-chip memory is also optimized for

32-byte data transfers [9, 71, 75]. This design enables NVIDIA GPUs to fetch 32, 64, and 128 bytes from off-chip memory with minimal memory bandwidth waste, unlike Gaudi-2, which inevitably wastes bandwidth for data transfer sizes smaller than 256 bytes.

**Key takeaway #3:** *Gaudi-2 provides competitive memory performance for regular data transfers with streaming access patterns. However, for random memory accesses like vector gather-scatter, Gaudi-2's performance falls short of A100's when the data transfer size is smaller than its 256-byte minimum access granularity.*
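The effect of the minimum access granularity on small gathers can be approximated with a simple model: each gathered vector forces the memory system to fetch a whole number of granules, and only the requested bytes are useful. The sketch below assumes vectors aligned to the granule size and ignores caching effects; the function name is hypothetical.

```python
from math import ceil


def useful_fraction(vector_bytes: int, granularity_bytes: int) -> float:
    """Share of fetched bytes that the gather actually requested."""
    fetched = ceil(vector_bytes / granularity_bytes) * granularity_bytes
    return vector_bytes / fetched


# Compare a 256-byte granularity with a 32-byte sector for small vectors.
comparison = {
    size: (useful_fraction(size, 256), useful_fraction(size, 32))
    for size in (16, 32, 64, 128, 256)
}
```

Under this model, a 64-byte vector uses only a quarter of each 256-byte fetch but all of the bytes fetched at a 32-byte granularity, which matches the direction of the gap observed in Figure 9 for vectors smaller than 256 bytes.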

### 3.4 Primitive “Communication” Operations

Recent large-scale AI models, like RecSys and LLMs, require multiple GPU or NPU devices for model serving, which necessitates frequent collective communications, such as AllReduce and Reduce-Scatter. In this subsection, we use the collective communication libraries developed by Intel and NVIDIA (HCCL [28] and NCCL [59], respectively) to characterize the performance of six representative collective communication operations. Both Intel’s HLS-Gaudi-2 and NVIDIA’s DGX A100 server nodes provide an aggregate of 300 GB/sec of intra-node communication bandwidth. In Figure 10, we use the bus bandwidth utilization suggested by NCCL [62] to compare the communication performance of both systems as the number of participating devices varies from 2 to 8 devices. When all eight devices participate in communication, Gaudi-2 shows higher bus bandwidth utilization than A100 for 5 of the 6 collective communication patterns evaluated. However, as the number of communicating devices decreases, Gaudi-2 experiences an almost linear decline in bus bandwidth utilization, unlike A100, whose bus bandwidth utilization remains relatively stable regardless of the number of communicating devices. As discussed in Section 2.1, NVIDIA’s DGX A100 server is equipped with an all-to-all network switch (NVSwitch) that enables all GPUs within the server node to communicate simultaneously, leveraging the full aggregate intra-node NVLink bandwidth. In contrast, Intel’s HLS-Gaudi-2 server directly connects each pair of Gaudi-2 devices using P2P links, so the effective collective communication bandwidth scales proportionally with the number of devices involved. This setup explains the gradual decrease in Gaudi-2’s bus bandwidth utilization as the number of devices used for collective communication decreases.

**Key takeaway #4:** *The system-level collective communication performance of the Gaudi-2-based system falls short of A100 when only a subset of devices communicates, not because of limitations in the Gaudi-2 processor architecture itself, but because it lacks the all-to-all network switch provisioned in NVIDIA’s DGX A100 system. This switch enables A100 GPUs to more flexibly exploit intra-node network bandwidth, regardless of the number of devices involved in communication, a feature that is currently missing in Intel Gaudi-2 systems.*
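The bus bandwidth metric and the scaling behavior described above can be sketched as follows. The per-collective correction factors reflect our understanding of the NCCL performance-test notes, and the link model is a deliberate simplification that assumes a fully connected mesh in which each device splits its links evenly among its peers. All names are hypothetical.

```python
# Factor that converts algorithm bandwidth (bytes / time) into bus bandwidth.
BUS_BW_FACTOR = {
    "allreduce": lambda n: 2 * (n - 1) / n,
    "allgather": lambda n: (n - 1) / n,
    "reducescatter": lambda n: (n - 1) / n,
    "alltoall": lambda n: (n - 1) / n,
    "reduce": lambda n: 1.0,
    "broadcast": lambda n: 1.0,
}


def bus_bandwidth_gbs(op: str, size_bytes: float, seconds: float, n: int) -> float:
    """Bus bandwidth in GB/s for a collective over n devices."""
    alg_bw = size_bytes / seconds / 1e9
    return alg_bw * BUS_BW_FACTOR[op](n)


def mesh_link_fraction(n: int, total_devices: int = 8) -> float:
    """Share of a device's P2P links that reach the other n - 1 participants."""
    return (n - 1) / (total_devices - 1)
```

In this model, a two-device collective on an eight-device mesh can use only one seventh of each device's links, whereas a switched topology can direct a device's full bandwidth to any subset of peers.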

### 3.5 End-to-End Application-level Analysis

**(Models and backend software)** We now evaluate Gaudi-2 and A100 at the end-to-end AI application level, focusing on RecSys and LLMs. RecSys incorporates a heterogeneous mix of sparse and dense layers, including frontend embedding layers (which perform *embedding lookups* where multiple embedding vectors are “gathered” from embedding tables) and backend MLP layers. Consequently,



**Figure 10: Bus bandwidth utilization of Gaudi-2 and A100 for collective communication operations for data sizes ranging from 2 KB to 32 MB: (a) AllReduce, (b) AllGather, (c) Reduce-Scatter, (d) AlltoAll, (e) Reduce and (f) Broadcast operations.**

we evaluate two RecSys model configurations based on DLRM-DCNv2 [80] from the latest MLPerf benchmark suite [52] (Table 3): the compute-intensive RM1, where feature interaction and bottom/top MLP layers are dominant, and the memory-intensive RM2, where embedding layers are dominant. Because Intel Gaudi SDK currently lacks support for multi-device RecSys serving (a feature that is natively supported in TorchRec [35] for serving RecSys over multi-GPUs), we focus on single-device RecSys serving for Gaudi-2.

As for LLMs, we evaluate Llama-3.1-8B-Instruct and Llama-3.1-70B-Instruct for single- and multi-device serving [12]. Both systems can employ KV caches and FlashAttention [11] by using NVIDIA’s TensorRT-LLM [63] for A100 and Intel’s optimum-habana [22] for Gaudi-2 as their backend engines. Finally, we use a synthetic dataset with the input token length fixed at 100 and the output token length swept from 25 to 400 to examine performance under fixed input-output lengths. Dynamic LLM model serving scenarios with variable input-output token lengths [42, 82] are evaluated in Section 4.2. NVIDIA’s CUDA Graphs [21] and Intel’s HPU Graphs [29] were used as a performance tuning knob whenever appropriate, and we report the highest performance achieved.

**(Single-device model serving)** At the time of this writing, the embedding layer implemented in Intel’s Gaudi SDK [31] achieved, on average, 37% of the performance of its GPU counterpart in our RecSys configurations. To assess the potential of Gaudi-2’s hardware architecture without being constrained by the limitations of the Gaudi SDK’s current implementations, we designed a custom version of the embedding layer using TPC-C. This design incorporates several key performance optimization strategies employed in the GPU-optimized FBGEMM library (Section 4.1 details our implementation strategy), and we utilize it as Gaudi’s backend kernel library for all our evaluations in end-to-end RecSys model serving.

In Figure 11, we show Gaudi-2’s end-to-end speedup (top) and energy-efficiency (bottom) over A100 for the two RecSys models, RM1 and RM2. Overall, Gaudi-2 experiences an average performance degradation of 22% and 18% for RM1 and RM2, respectively. While Gaudi-2 achieves higher performance with wide embedding



Figure 11: Gaudi-2’s improvement in (a) performance and (b) energy-efficiency over A100 when RM1, RM2 are served on a single device.

vectors and large batch sizes (maximum 1.36$\times$ speedup), which can be attributed to Gaudi-2’s higher compute throughput (Table 1), it generally falls short of A100 for all other model configurations. For instance, in the memory-intensive RM2, Gaudi-2 exhibits a significant performance drop for embedding vector sizes below 256 bytes, regardless of batch size (a maximum of 70% performance loss). This is primarily due to Gaudi-2’s 256-byte minimum memory access granularity and its reduced memory bandwidth utilization for a small number of vector gather operations, as highlighted in Section 3.3. In terms of power consumption, Gaudi-2 consumed an average of 12% more absolute power than A100 in RM1 and RM2. This result aligns with our expectations, given that Gaudi-2’s TDP is 50% higher than A100’s (Table 1). Overall, because Gaudi-2 also has higher average latency, its end-to-end energy consumption is on average 28% higher than A100’s for RM1 and RM2.

As for LLMs, the leftmost heatmap in Figure 12(a) shows Gaudi-2’s speedup over A100 when serving Llama-3.1-8B-Instruct over a single device. Across all batch sizes and output token lengths, Gaudi-2 consistently outperforms A100, achieving an average speedup of 1.47$\times$ (with a maximum 1.70$\times$ speedup). This performance advantage generally stems from Gaudi-2’s high peak FLOPS and memory bandwidth, benefiting both the compute-bound prefill stage and the memory-bound decoding stage (Figure 12(b)). Although Gaudi-2’s theoretical performance advantage over A100 is 1.4$\times$ in GEMM throughput and 1.2$\times$ in memory bandwidth (Table 1), Gaudi-2 achieves an even greater speedup due to its superior compute utilization across various GEMM shapes, as discussed in Section 3.2. The leftmost heatmap in Figure 13 shows a single Gaudi-2’s energy-efficiency improvement over A100. In general, Gaudi-2 exhibited lower power consumption than A100 for small batch sizes, while its power consumption increased with larger batches and eventually surpassed A100 at the highest batch sizes. On average, Gaudi-2 exhibited only 1% higher power consumption than A100, despite its 50% higher TDP. As shown in Figure 7(a), for small GEMM shapes, Gaudi-2 activates only a subset of its large MME units, so a possible explanation for Gaudi-2’s lower power consumption over small batches is that it more aggressively power-gates its circuitry via DVFS. Overall, Gaudi-2’s higher absolute performance



Figure 12: (a) Gaudi-2’s improvement in performance over A100 when the Llama-3.1 models are served on a single device and on multiple devices. (b) For the Llama-3.1-8B-Instruct model served on a single device, we divide the latency into prefill and decoding stages, keeping the batch size fixed at 64. The left graph shows the latency breakdown when the input length is fixed at 100 while varying the output length. The right graph shows the breakdown when the output length is fixed at 100 while the input length is varied.



Figure 13: Gaudi-2’s improvement in energy-efficiency over A100 when the Llama-3.1 models are served on a single device and on multiple devices.

and comparable power consumption yield an average 48% higher energy-efficiency than A100 for single-device LLM serving.

**(Multi-device model serving)** The three rightmost heatmaps in Figure 12(a) illustrate Gaudi-2’s speedup when serving the Llama-3.1-70B-Instruct model across 2, 4, and 8 devices using tensor-parallelism (TP) [72]. The multi-device LLM serving results generally align with the trends observed in the single-device Llama-3.1-8B-Instruct serving, with average speedups of 1.29$\times$, 1.32$\times$, and 1.35$\times$ for 2, 4, and 8 devices, respectively. Interestingly, the speedup increases with the number of devices used, which we attribute to the collective communication performance provided by Intel’s HLS-Gaudi-2 server. As discussed in Section 3.4, the use of P2P links for communication makes the performance of all-reduce (the collective communication primitive used in tensor-parallelism) increase in proportion to the number of devices involved in collective communication, enabling higher system-level performance as more devices are employed for multi-device LLM serving. Energy-efficiency trends, illustrated in the three rightmost heatmaps in Figure 13, generally align with those observed in single-device serving. Across all multi-device LLM serving configurations, Gaudi-2 consistently demonstrates lower



Figure 14: Block diagrams illustrating (a) **SingleTable** embedding lookup operation, which processes embedding tables sequentially with loop unrolling to improve throughput, and (b) **BatchedTable** embedding lookup operation, which consolidates multiple tables into a single large table, enhancing memory bandwidth utilization at lower batch sizes by treating each table with offset-based indexing.

average power consumption, consuming around 88% of the power of A100. Overall, Gaudi-2 achieves energy-efficiency improvements of 1.48 $\times$ , 1.51 $\times$ , and 1.56 $\times$  over A100 for 2, 4, and 8 devices.

**Key takeaway #5:** *LLM serving is primarily dominated by matrix multiplications, so Gaudi-2 demonstrates higher energy-efficiency than A100, achieving an average 48% and 52% improvement for single- and multi-device serving, respectively. However, RecSys relies on vector gathers and small MLP layers, so Gaudi-2 generally lags behind A100, with an average 20% performance slowdown and an average 28% drop in energy-efficiency.*
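The relationship between speedup, power, and energy-efficiency used in this subsection can be written down in a few lines. For the same amount of work, energy is power multiplied by latency, so the energy-efficiency gain is the speedup divided by the relative power. The helper name below is hypothetical.

```python
def energy_efficiency_gain(speedup: float, relative_power: float) -> float:
    """Work per joule relative to the baseline, given speedup and power ratio."""
    if relative_power <= 0:
        raise ValueError("relative_power must be positive")
    return speedup / relative_power


# Average speedups and power ratios quoted in the text.
single_device = energy_efficiency_gain(1.47, 1.01)
multi_device = [energy_efficiency_gain(s, 0.88) for s in (1.29, 1.32, 1.35)]
```

Plugging in the averages quoted above gives roughly 1.46$\times$ for single-device serving and 1.47$\times$, 1.50$\times$, and 1.53$\times$ for 2, 4, and 8 devices, close to but slightly below the reported figures. Presumably the reported values average per-configuration ratios, which need not equal the ratio of the averages.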

## 4 Characterizing Gaudi NPU Programmability

In this section, we present case studies on utilizing the Gaudi NPU’s programming model and its software stack to conduct performance optimizations for RecSys and LLM serving systems.

### 4.1 Performance Optimization at the Low-level TPC-C: A DLRM Case Study

Meta’s TorchRec library [35] is built on FBGEMM’s GPU-optimized embedding lookup operator, which reduces CUDA kernel launch overhead by *batching* multiple embedding tables’ vector gather operations into a *single* CUDA kernel execution (referred to as **BatchedTable**). Currently, Intel’s Gaudi SDK does not support TorchRec, so its embedding lookup implementation does not batch vector gather operations across multiple tables. Instead, each TPC kernel launch processes only a *single* table’s embedding vector gathers (henceforth referred to as **SingleTable**), resulting in  $N$  separate TPC kernel launches for  $N$  embedding table lookups.

To evaluate Gaudi’s programmability for low-level performance optimizations, we implemented both the **SingleTable**<sup>3</sup> and the **BatchedTable** embedding lookup operators [79] for Gaudi-2 using TPC-C. Our approach incorporates several optimizations tailored for embedding lookups, as follows. The **SingleTable** operator performs embedding lookups individually for each table. The TPC-C kernel’s for-loop is unrolled by a factor of 4 over embedding table lookup indices to maximize memory-level parallelism (i.e., four embedding vector gathers per TPC are concurrently initiated for



Figure 15: Memory bandwidth utilization of embedding lookup operations, using the embedding layer configuration from RM2 (Table 3). (a) Utilization is normalized to **SingleTable** with a vector size fixed at 256 bytes while varying the number of tables. (b, c, d) Utilization when varying both embedding vector sizes and batch sizes.

each for-loop iteration, Figure 14(a)). The gathered embedding vectors are stored inside TPC’s local memory to minimize data movement. Additionally, we distribute workloads (i.e., `offsetsPerTable` in Figure 14(a)) across multiple TPC units to maximize chip-wide memory-level parallelism.

Despite these optimizations in our **SingleTable** operator, a challenge remains: with low batch sizes, a single TPC unit cannot fully utilize memory bandwidth because the workload per TPC is limited to embedding vector lookups within a *single* embedding table. Even when multiple embedding tables are subject to embedding lookups, memory bandwidth remains underutilized because embedding lookups across multiple tables are performed sequentially through separate TPC-C kernel launches (i.e., memory bandwidth utilization does not increase with a larger number of tables, Figure 15(a)). To address this issue, our **BatchedTable** operator *fuses* embedding lookups from multiple tables into a single TPC-C kernel. Similar to FBGEMM’s CUDA-optimized **BatchedTable**, our TPC-C **BatchedTable** implementation treats multiple tables as one large table, using a separate offset to indicate the starting index location of each table (`tableOffsets` in Figure 14(b)). This approach requires passing indices and offsets for all tables to the TPC-C kernel in a single call. Consequently, our **BatchedTable** achieves significantly higher memory bandwidth utilization compared to **SingleTable** as the number of tables increases, as shown in Figure 15(a). It is worth noting that, with larger batch sizes, the performance gap between **SingleTable** and **BatchedTable** diminishes, as **SingleTable** can

<sup>3</sup>As mentioned in Section 3.5, the embedding lookup operator provided with Gaudi SDK (based on the **SingleTable** approach) achieved 37% of the performance of its GPU-optimized FBGEMM counterpart. Our custom TPC-C **SingleTable** embedding lookup provides an average 60% higher performance than this Gaudi SDK version.



Figure 16: High-level overview of the PagedAttention implementation and its execution timeline in (a) baseline  $vLLM_{base}$  and (b) performance-optimized  $vLLM_{opt}$ .

exploit more parallelism across different batches to improve memory bandwidth utilization (Figure 15(b) and (c)).
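The difference between the two operators can be illustrated with a functional Python sketch. It models only the indexing scheme, not the TPC-C kernels themselves: one function call stands in for one kernel launch, pooling is omitted, and apart from `tableOffsets` (mirrored here as `table_offsets`) all names are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Table = List[Vector]


def single_table_lookup(table: Table, indices: List[int]) -> List[Vector]:
    """Gather rows from one table (stands in for one kernel launch)."""
    return [table[i] for i in indices]


def lookup_per_table(tables: List[Table], indices: List[List[int]]) -> List[List[Vector]]:
    """SingleTable style: N tables require N separate launches."""
    return [single_table_lookup(t, idx) for t, idx in zip(tables, indices)]


def fuse_tables(tables: List[Table]) -> Tuple[Table, List[int]]:
    """Concatenate all tables and record the starting row of each one."""
    fused: Table = []
    table_offsets: List[int] = []
    for t in tables:
        table_offsets.append(len(fused))
        fused.extend(t)
    return fused, table_offsets


def batched_table_lookup(
    fused: Table, table_offsets: List[int], indices: List[List[int]]
) -> List[List[Vector]]:
    """BatchedTable style: a single launch handles the lookups of every table."""
    return [
        [fused[table_offsets[t_id] + i] for i in idx]
        for t_id, idx in enumerate(indices)
    ]
```

Both paths return identical vectors; the batched version simply exposes all tables' gathers to one kernel, which is what lets it keep more memory requests in flight at low batch sizes.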

Overall, our Gaudi-2 BatchedTable achieves an average memory bandwidth utilization of 34.2% and a peak utilization of 70.5%, representing a 1.52 $\times$  improvement over SingleTable. In comparison, A100 demonstrates an average memory bandwidth utilization of 38.7% with a peak of 81.8% (Figure 15(d)). As shown in our vector gather-scatter microbenchmark experiments (Figure 9), Gaudi-2 shows sub-optimal performance in fine-grained vector gathers so BatchedTable (Gaudi-2) experiences a noticeable performance drop for vector sizes below 256 bytes, with an average utilization of 12.0%. In contrast, the A100 sustains much higher performance at these lower vector sizes, with an average utilization of 25.3%.

**Key takeaway #6:** *This case study confirmed that the TPC-C programming system provides a sufficient level of flexibility for low-level performance optimizations. Compared to state-of-the-art FBGEMM-based A100 executions, our Gaudi-2 optimized kernel for embedding layers achieved, on average, 95% of the throughput of A100 for large embedding vector sizes ( $\geq 256$  bytes) but only 47% for small vectors ( $< 256$  bytes). The noticeable performance degradation for small vectors primarily stems from A100’s superior hardware architecture (which better supports fine-grained memory accesses) rather than from the differences in the programming models.*

## 4.2 Performance Optimization at the High-level PyTorch: A vLLM Case Study

Serving LLMs over batched requests poses unique challenges due to the dynamic nature of input-output sequences across different requests. These variations can result in GPU memory fragmentation, which reduces the maximum batch size that the serving system can support, lowering throughput. To address this issue, the widely adopted vLLM [42] employs *PagedAttention*, which divides the key-value (KV) cache into smaller blocks and allocates them on demand rather than pre-allocating memory that would otherwise remain unused. This strategy effectively mitigates memory fragmentation, significantly increasing the maximum batch size. In batched LLM serving, the attention layers [78] experience a significant increase in latency as batch size grows, so supporting high-performance PagedAttention is critical for vLLM.
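The on-demand block allocation described above can be sketched as a small bookkeeping class. This is a simplified illustration rather than vLLM's actual block manager: the class name `PagedKVAllocator`, its methods, and the block size used in the example are hypothetical.

```python
from typing import Dict, List


class PagedKVAllocator:
    """Toy KV cache allocator that hands out fixed-size blocks on demand."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}  # request -> physical block ids
        self.num_tokens: Dict[str, int] = {}

    def append_token(self, request_id: str) -> int:
        """Reserve room for one more token and return the block that holds it."""
        table = self.block_tables.setdefault(request_id, [])
        n = self.num_tokens.get(request_id, 0)
        # A new block is taken only when the request has no block yet
        # or its last block is completely full.
        if n == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("KV cache is out of free blocks")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = n + 1
        return table[-1]

    def release(self, request_id: str) -> None:
        """Return all blocks of a finished request to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


allocator = PagedKVAllocator(num_blocks=4, block_size=2)
for _ in range(3):            # request "a" generates three tokens
    allocator.append_token("a")
allocator.append_token("b")   # request "b" generates one token
```

In this sketch a request never holds more than one partially filled block, so unused memory per request is bounded by the block size instead of by the maximum sequence length.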

While vLLM natively supports a CUDA-optimized PagedAttention kernel [42] for GPU-based LLM serving systems, implementing PagedAttention for Gaudi-based systems presents several unique challenges. This is because the current Gaudi SDK lacks low-level APIs that allow programmers to directly control the operation of the MME units within the user-programmed TPC-C kernel. In NVIDIA’s CUDA, programmers can utilize the WMMA (Warp Matrix Multiply and Accumulate) APIs [48] to directly utilize GPU’s Tensor Cores (alongside the normal CUDA Cores) for computation within the low-level CUDA kernel. However, Gaudi programmers can only access the MME units at the PyTorch level, whose functionality is limited to the built-in, pre-compiled MME-optimized kernels provided with the Intel Gaudi SDK. Consequently, performance optimizations involving the MME units must be conducted at the PyTorch level, underscoring the role of the Gaudi graph compiler. This poses a unique challenge for implementing Gaudi-optimized PagedAttention, as it demands efficient coordination of GEMM operations and KV cache management for high performance. This contrasts with our DLRM case study in Section 4.1, where the primary performance optimization target was the embedding lookup operator, which consists of vector operations that can take advantage of the programmable TPC vector unit through custom low-level kernel implementations. In this case study, we discuss performance optimization strategies [74] that can be employed at the PyTorch level for implementing Gaudi-optimized PagedAttention. Specifically, we discuss Gaudi’s PyTorch level programmability and its interaction with the Gaudi graph compiler to manage low-level hardware behavior to maximize LLM serving throughput.

Figure 16(a) illustrates the baseline implementation of the PagedAttention mechanism in the Gaudi vLLM fork [34] (hereafter referred to as  $vLLM_{base}$ ). This approach uses a 2D tensor, BlockTable, to store the indices of KV cache blocks required by each query. When multiple requests within a single batch have varying sequence lengths, BlockTable is padded with zeros for queries with shorter sequence lengths (e.g., the blue-colored query in Figure 16(a)), leading to unnecessary gathering of KV cache blocks by the TPC units. This redundant gathering of KV cache blocks, caused by zero-padded indices in BlockTable, results in inefficient utilization of Gaudi’s compute and memory resources. Furthermore, recall that efficient use of Gaudi requires the TPC and MME to operate in parallel to hide latency, necessitating pipelined execution of the TPC-based KV cache block gather operations and MME-based GEMM operations.  $vLLM_{base}$  first gathers the scattered KV cache blocks into a contiguous memory region and then executes the FusedSDPA [26] kernel, which is a Gaudi-optimized implementation of FlashAttention [11] (the functionality of which is equivalent to PyTorch’s scaled\_dot\_product\_attention [67]). We observe that such an implementation is not optimized for the block-based nature of PagedAttention, as it prevents the Gaudi graph compiler from effectively pipelining its operation across MME and TPC, resulting in further performance degradation.

**Figure 17:** (a and b) Effectiveness of  $v\text{LLM}_{opt}$  on improving the performance of PagedAttention vs.  $v\text{LLM}_{base}$  (results are normalized to  $v\text{LLM}_{base}$ ): (a) we vary the input sequence length and batch size and measure output token generation latency (the fraction of zero-padded indices is 0% in this experiment), and (b) under the sequence length=4K and batch size=32 datapoint in (a), we vary the proportion of zero-padded indices in BlockTable from 10 to 90% to evaluate the effect of redundant KV cache block gathers. In (c), we show how much PagedAttention’s throughput improves with  $v\text{LLM}_{opt}$  by comparing it to the A100, with results normalized to the A100. In (d) and (e), we sweep the maximum decode stage batch size [82] and present (d) the changes in end-to-end serving throughput and (e) the observed mean TTFT (Time-To-First-Token) and mean TPOT (Time-Per-Output-Token) values. The results are collected on a single  $v\text{LLM}_{opt}$  based Gaudi-2 and A100. To properly reflect LLM serving system’s dynamism and variable output length, we used the Dynamic-Sonnet dataset [13].
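The benefit of overlapping TPC-based gathers with MME-based GEMMs, which the baseline fails to obtain, can be illustrated with a toy two-stage latency model. This is only a back-of-the-envelope sketch: it assumes the work splits into equal slices with fixed per-slice gather and GEMM times, and the function names and numbers are hypothetical. It does not describe how the Gaudi graph compiler actually schedules operations.

```python
def serial_latency(num_slices: int, gather_time: float, gemm_time: float) -> float:
    """All gathers and GEMMs run back to back with no overlap."""
    return num_slices * (gather_time + gemm_time)


def pipelined_latency(num_slices: int, gather_time: float, gemm_time: float) -> float:
    """Gather of slice i+1 overlaps with the GEMM of slice i (two-stage pipeline)."""
    if num_slices == 0:
        return 0.0
    # Fill the pipeline with the first gather, then the slower stage sets the
    # pace for the remaining slices, and the last GEMM drains the pipeline.
    slower_stage = max(gather_time, gemm_time)
    return gather_time + (num_slices - 1) * slower_stage + gemm_time


# Hypothetical per-slice times in arbitrary units.
no_overlap = serial_latency(num_slices=4, gather_time=2.0, gemm_time=3.0)
with_overlap = pipelined_latency(num_slices=4, gather_time=2.0, gemm_time=3.0)
```

Under this model the pipelined schedule approaches the cost of the slower stage alone as the number of slices grows, which is why exposing independent slices to the compiler matters.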

To alleviate the impact of redundant KV cache block gathers, a performance-optimized vLLM ( $v\text{LLM}_{opt}$ ) replaces the 2D BlockTable with a 1D tensor named BlockList (Figure 16(b)). By concatenating only the *effectual* KV cache block indices of each request into this BlockList, gathers caused by zero-padding are eliminated, fetching only the KV cache blocks needed for each query. Additionally, by restructuring the query tensor shape to align with the BlockList-based KV cache blocks,  $v\text{LLM}_{opt}$  performs batched GEMM across the gathered KV cache blocks, followed by the corresponding Softmax operation and others. With this approach adopted in  $v\text{LLM}_{opt}$ , we observe that the graph compiler more effectively partitions the TPC-based KV cache block gather operations and the MME-based batched GEMM operations into independent sub-operation slices, enabling efficient pipelined execution across TPC and MME and significantly improving hardware utilization.
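The difference between the padded 2D BlockTable and the flattened 1D BlockList can be made concrete with a short sketch. The helper names and the per-block `owner` list (which records the request each block belongs to) are hypothetical and chosen for illustration. The Gaudi vLLM fork operates on PyTorch tensors and may track this mapping differently.

```python
from typing import List, Sequence, Tuple


def build_block_table(blocks_per_request: Sequence[Sequence[int]],
                      pad_id: int = 0) -> List[List[int]]:
    """Pad every request's block ids to the longest request (2D BlockTable)."""
    width = max(len(blocks) for blocks in blocks_per_request)
    return [list(blocks) + [pad_id] * (width - len(blocks))
            for blocks in blocks_per_request]


def build_block_list(blocks_per_request: Sequence[Sequence[int]]
                     ) -> Tuple[List[int], List[int]]:
    """Concatenate only the effectual block ids (1D BlockList).

    Also returns, for each entry, the index of the request that owns it.
    """
    block_list: List[int] = []
    owner: List[int] = []
    for request, blocks in enumerate(blocks_per_request):
        block_list.extend(blocks)
        owner.extend([request] * len(blocks))
    return block_list, owner


def redundant_gather_fraction(blocks_per_request: Sequence[Sequence[int]]) -> float:
    """Fraction of BlockTable gathers that only fetch padding."""
    table = build_block_table(blocks_per_request)
    total = sum(len(row) for row in table)
    useful = sum(len(blocks) for blocks in blocks_per_request)
    return (total - useful) / total


# Three hypothetical requests with 4, 1, and 2 KV cache blocks.
requests = [[3, 7, 1, 4], [5], [2, 6]]
block_table = build_block_table(requests)
block_list, owner = build_block_list(requests)
padding_fraction = redundant_gather_fraction(requests)
```

In this example the BlockTable issues 12 gathers, 5 of which fetch padding, whereas the BlockList issues only the 7 gathers that are actually needed.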

Figure 17 shows  $v\text{LLM}_{opt}$ ’s effect in improving PagedAttention and end-to-end LLM performance. On average,  $v\text{LLM}_{opt}$  achieves a 7.4 $\times$  improvement in PagedAttention throughput over  $v\text{LLM}_{base}$  when the fraction of zero-padded indices is 0% (Figure 17(a)). This result highlights the efficacy of  $v\text{LLM}_{opt}$ ’s PyTorch-level performance optimizations and the graph compiler’s effectiveness in pipelining PagedAttention’s operations across MME and TPC. Furthermore, the experiment in Figure 17(b) shows that PagedAttention’s throughput improves by up to 55.7 $\times$  (average 21 $\times$ ) as the fraction of zero-padded indices increases, emphasizing the importance of eliminating redundant operations. Despite these improvements,  $v\text{LLM}_{opt}$  still falls short of A100, achieving an average of 45% of A100’s PagedAttention throughput (Figure 17(c)). However, due to Amdahl’s law and Gaudi-2’s performance gains in GEMM operations in MLP layers (Section 3.2), the  $v\text{LLM}_{opt}$ -based Gaudi-2 demonstrates end-to-end performance similar to A100 (Figure 17(d)). It also shows comparable sensitivity to SLO (service level objective) oriented metrics, i.e., the trade-off between TTFT (Time-To-First-Token) and TPOT (Time-Per-Output-Token), when sweeping the inference server’s maximum batch size (Figure 17(e)).
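The Amdahl's-law argument can be written out as a short worked equation. The symbols $f$, $s_{\text{attn}}$, and $s_{\text{rest}}$ are introduced here only for illustration: $f$ denotes the fraction of A100's end-to-end time spent in PagedAttention, $s_{\text{attn}}$ is Gaudi-2's PagedAttention throughput relative to A100, and $s_{\text{rest}}$ is its relative throughput on everything else. Only $s_{\text{attn}} = 0.45$ comes from the measurements above. The value of $f$ used below is a hypothetical example, not a measured quantity.

$$
\frac{T_{\text{Gaudi-2}}}{T_{\text{A100}}} = \frac{f}{s_{\text{attn}}} + \frac{1-f}{s_{\text{rest}}}, \qquad s_{\text{attn}} = 0.45 \;\Rightarrow\; \frac{1}{s_{\text{attn}}} \approx 2.2
$$

Setting the ratio to 1 gives the relative throughput required on the remaining operations for end-to-end parity (valid for $f < s_{\text{attn}}$):

$$
s_{\text{rest}} = \frac{1-f}{1 - f/s_{\text{attn}}}, \qquad \text{e.g., } f = 0.1 \;\Rightarrow\; s_{\text{rest}} = \frac{0.9}{1 - 0.1/0.45} \approx 1.16
$$

Under this illustrative assumption, a modest advantage on the GEMM-dominated remainder would be enough to offset the 2.2 $\times$  slower PagedAttention.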

**Key takeaway #7:** *This case study showed that, while Gaudi SDK’s current lack of support for directly programming MMEs within the low-level TPC-C kernel imposes restrictions on programmer flexibility, the graph compiler can still effectively capture the appropriate level of parallelism to better utilize its compute resources when programmed properly at the PyTorch level. Consequently, while the performance-optimized Gaudi vLLM achieved 45% of the PagedAttention throughput of a GPU-optimized vLLM, Gaudi-2’s end-to-end LLM performance was shown to be competitive with A100.*

## 5 Discussion and Future Work

**(Discussion)** Section 4.2 discussed the programming challenges associated with the black-box nature of Gaudi SDK, particularly its lack of low-level APIs for directly programming the MME units. Our vLLM case study demonstrated that PyTorch-level programming, combined with graph compiler-optimized operator scheduling, can achieve end-to-end LLM performance competitive with GPUs. Nevertheless, there still exists a 2.2 $\times$  gap in PagedAttention’s performance vs. the GPU-optimized version, and the absence of control and programming interfaces for Gaudi’s hardware resources posed challenges in fully understanding the logic behind the graph compiler’s optimization passes. For instance, Gaudi’s reliance on Intel’s proprietary graph compiler, coupled with the lack of a direct programming interface to the MMEs, creates challenges for implementing low-level optimizations such as the kernel fusion techniques used in FlashAttention [11]. That said, Gaudi’s approach of raising the level of programming abstraction to simplify the development of high-performance AI kernels is in line with recent industry trends. For example, OpenAI’s Triton [76] also employs a Python-based programming model that abstracts many low-level GPU programming details, streamlining the developer experience by handing over much of the performance optimization to Triton’s compiler stack. Overall, our experience with Gaudi so far suggests that the applicability of this emerging NPU device would be greatly improved by better support for low-level programming interfaces to key compute engines like MMEs, as well as more thorough documentation of Gaudi’s graph compiler functionality and optimization passes.

**(Future Work)** Because NVIDIA GPUs are the de facto standard in AI systems, we focused on comparing Gaudi against NVIDIA GPUs, leaving out the comparison with AMD GPUs [4] or other NPUs [2, 37]. Possible future work includes evaluating Gaudi against these alternative platforms. Additionally, Intel claims that Gaudi NPUs are competitive with NVIDIA GPUs for training large-scale AI models requiring hundreds to thousands of devices. Analyzing Gaudi’s competitive edge against NVIDIA GPUs in training scenarios is part of our immediate future work. Furthermore, AMD’s recently announced Strix Halo (Ryzen AI Max) processor [5] integrates Zen CPU cores, RDNA GPU architecture, and XDNA 2 NPU into a single SoC, offering a distinct alternative to both NVIDIA GPUs and Intel Gaudi NPUs. Unlike Gaudi, which is designed primarily for large-scale distributed AI workloads, Strix Halo targets efficient on-device AI inference with its XDNA 2 NPU, while also providing a powerful integrated GPU for mixed AI and graphics workloads. Future work could explore how Strix Halo’s unified memory architecture and heterogeneous compute capabilities compare to Gaudi NPUs and NVIDIA GPUs for AI serving.

## 6 Related Work

Emani et al. [14] and Zhang et al. [83] analyzed Gaudi NPU’s performance with an emphasis on LLMs. These prior works neither provide a detailed comparative analysis against GPUs, especially from a computer architect’s perspective, nor compare the two from an energy efficiency standpoint. To the best of our knowledge, this work is the first and most comprehensive characterization of Gaudi NPU across multiple dimensions, using microbenchmarking, end-to-end energy analysis, and importantly, programming case studies to characterize its performance as well as programmability.

## 7 Conclusion

This paper evaluates Intel Gaudi NPUs as an alternative to NVIDIA GPUs, concluding that Gaudi NPUs have the potential to become a strong contender to NVIDIA GPUs for AI model serving. AI practitioners utilize high-level frameworks like PyTorch for model development. Our analysis suggests that, as long as Intel properly supports AI frameworks with performance-optimized backend libraries, the CUDA programming system itself might not be as formidable a “moat”. However, we emphasize that our current assessment should not be interpreted as an overly optimistic outlook on Gaudi NPUs. NVIDIA’s dominance in AI remains robust due to its comprehensive software ecosystem, and we believe that Gaudi would benefit from better support for low-level programming interfaces that facilitate more flexible programming experiences.

## Acknowledgments

This work was partly supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.RS-2024-00438851, (SW Starlab) High-performance Privacy-preserving Machine Learning System and System Software), (No.RS-2024-00395134, DPU-Centric Datacenter Architecture for Next-Generation AI Devices), (No.RS-2024-00402898, Simulation-based High-speed/High-Accuracy Data Center Workload/System Analysis Platform), (No.RS-2024-00457882, AI Research Hub Project), (No.RS-2025-02214652, Development of SoC Technology for AI Semiconductor-Converged Pooled Storage/Memory), (No.RS-2025-02264029, Implementation and Validation of an AI Semiconductor-Based Data Center Composable Cluster Infrastructure, 30%), and IITP under the Graduate School of Artificial Intelligence Semiconductor (IITP-2025-RS-2023-00256472) grant funded by the Korea government (MSIT). We also appreciate the support from the NAVER-Intel Co-Lab. This work was conducted by KAIST and reviewed by both NAVER and Intel. Minsoo Rhu is the corresponding author.

## References

- [1] Albert Gu and Tri Dao. 2024. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. <https://github.com/state-spaces/mamba>.
- [2] Amazon Web Services. 2024. AWS Inferentia. <https://aws.amazon.com/ai/machine-learning/inferentia/>.
- [3] Amazon Web Services. 2024. Machine Learning (ML) on AWS. <https://aws.amazon.com/ai/machine-learning/>.
- [4] AMD. 2024. AMD MI300. <https://www.amd.com/en/products/accelerators/instinct/mi300.html>.
- [5] AMD. 2025. AMD Ryzen AI Max Series Processors. <https://www.amd.com/content/dam/amd/en/documents/partner-hub/ryzen-ai-max-series-how-to-sell-guide-competitive.pdf>.
- [6] Bahar Asgari, Ramyad Hadidi, Jiashen Cao, Da Eun Shim, Sung-Kyu Lim, and Hyesoon Kim. 2021. FAFNIR: Accelerating Sparse Gathering by Using Efficient Near-Memory Intelligent Reduction. In *Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA)*.
- [7] Jayaram Bobba, Tzachi Cohen, Dibyendu Das, Sergei Grechanik, and Dafna Mordechai. 2024. Speeding Up Intel Gaudi Deep Learning Accelerators Using an MLIR-Based Compiler. <https://llvm.org/devmtg/2024-10/slides/quicktalks/Bobba-SpeedingUpIntelGaudi.pdf>.
- [8] Niladri Chatterjee, Mike O’Connor, Gabriel H. Loh, Nuwan Jayasena, and Rajeev Balasubramonian. 2014. Managing DRAM Latency Divergence in Irregular GPGPU Applications. In *Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC)*.
- [9] Esha Choukse, Michael B. Sullivan, Mike O’Connor, Mattan Erez, Jeff Pool, David Nellans, and Stephen W. Keckler. 2020. Buddy Compression: Enabling Larger Memory for Deep Learning and HPC Workloads on GPUs. In *Proceedings of the International Symposium on Computer Architecture (ISCA)*.
- [10] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeonjae Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellatt, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. PaLM: Scaling Language Modeling with Pathways. In *arXiv preprint arXiv:2204.02311*.
- [11] Tri Dao. 2023. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. In *arXiv preprint arXiv:2307.08691*.
- [12] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Srivastava, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Tourret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esibou, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Elhab AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Graeme Nail, Gregoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junting Jia, Kalyan Vasuden Alwala, Kartikeya
Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, Khalid El-Arini, Krithika Iyer, Kshitiz Malik, Kuenley Chiu, Kunal Bhalla, Lauren Rantala-Yearly, Laurens van der Maaten, Lawrence Chen, Liang Tan, Liz Jenkins, Louis Martin, Lovish Madaan, Lubo Malo, Lukas Blecher, Lukas Landzaat, Luke de Oliveira, Madeline Muzzi, Mahesh Pasupuleti, Mannat Singh, Manohar Paluri, Marcin Kardas, Mathew Oldham, Mathieu Rita, Maya Pavlova, Melanie Kambadur, Mike Lewis, Min Si, Mitesh Kumar Singh, Mona Hassan, Naman Goyal, Narjes Torabi, Nikolay Bashlykov, Nikolay Bogoychev, Niladri Chatterji, Olivier Duchenne, Onur Çelebi, Patrick Alrassy, Pengchuan Zhang, Pengwei Li, Petar Vasic, Peter Weng, Prajwal Bhargava, Pratik Dubal, Praveen Krishnan, Punit Singh Koura, Puxin Xu, Qing He, Qingxiao Dong, Ragavan Srinivasan, Raj Ganapathy, Ramon Calderer, Ricardo Silveira Cabral, Robert Stojnic, Roberta Raileanu, Rohit Girdhar, Rohit Patel, Romain Sauvestre, Ronnie Polidor, Roshan Sumbaly, Ross Taylor, Ruan Silva, Rui Hou, Rui Wang, Saghar Hosseini, Sahana Chennabappa, Sanjay Singh, Sean Bell, Sehyoun Sonia Kim, Sergey Edunov, Shaoliang Nie, Sharan Narang, Sharath Raparthi, Sheng Shen, Shengye Wan, Shruti Bhosale, Shun Zhang, Simon Vandenhende, Soumya Batra, Spencer Whitman, Sten Sootla, Stephane Collot, Suchin Gururangan, Sydney Borodinsky, Tamar Herman, Tara Fowler, Tarek Sheasha, Thomas Georgiou, Thomas Scialom, Tobias Speckbacher, Todor Mihaylov, Tong Xiao, Ujjwal Karn, Vedanuj Goswami, Vibhor Gupta, Vignesh Ramamathan, Viktor Kerkez, Vincent Gonguet, Virginie Do, Vish Vogeti, Vladan Petrovic, Weiwei Chu, Wenhan Xiong, Wenying Fu, Whitney Meers, Xavier Martinet, Xiaodong Wang, Xiaoqing Ellen Tan, Xinfeng Xie, Xuchao Jia, Xuewei Wang, Yaelle Goldschlag, Yashesh Gaur, Yasmine Babaei, Yi Wen, Yiwen Song, Yuchen Zhang, Yue Li, Yuning Mao, Zacharie Delpierre Couder, Zheng Yan, Zhengxing Chen, Zoe Papakipos, Aaditya Singh, Aaron Grattafiori, Abha Jain, Adam Kelsey, Adam Shajineld, Adithya 
Gangidi, Adolfo Victoria, Ahuva Goldstand, Ajay Menon, Ajay Sharma, Alex Boesenber, Alex Vaughan, Alexei Baevski, Allie Feinstein, Amanda Kallet, Amit Sangani, Anam Yunus, Andrei Lupu, Andres Alvarado, Andrew Caples, Andrew Gu, Andrew Ho, Andrew Poulton, Andrew Ryan, Ankit Ramchandani, Annie Franco, Aparajita Saraf, Arkabandhu Chowdhury, Ashley Gabriel, Ashwin Bharambe, Assaf Eisenman, Azadeh Yazdani, Beau James, Ben Maurer, Benjamin Leonhardi, Bernice Huang, Beth Loyd, Beto De Paola, Bhargavi Paranjape, Bing Liu, Bo Wu, Boyu Ni, Braden Hancock, Bram Wasti, Brandon Spence, Brani Stojkovic, Brian Gamido, Britt Montalvo, Carl Parker, Carly Burton, Catalina Mejia, Changhan Wang, Changkyu Kim, Chao Zhou, Chester Hu, Ching-Hsiang Chu, Chris Cai, Chris Tindal, Christoph Feichtenhofer, Damion Civin, Dana Beatty, Daniel Kreymer, Daniel Li, Danny Wyatt, David Adkins, David Xu, Davide Testuggine, Delia David, Devi Parikh, Diana Liskovich, Didem Foss, Dingkang Wang, Duc Le, Dustin Holland, Edward Dowling, Eissa Jamil, Elaine Montgomery, Eleonora Presani, Emily Hahn, Emily Wood, Erik Brinkman, Esteban Arcuate, Evan Dunbar, Evan Smothers, Fei Sun, Felix Kreuk, Feng Tian, Firat Ozgenel, Francesco Caggioni, Francisco Guzmán, Frank Kanayt, Frank Seide, Gabriela Medina Florez, Gabriella Schwarz, Gada Badeer, Georgia Sweene, Gil Halpern, Govind Thattai, Grant Herman, Grigory Sizov, Guangyi (Jack) Zhang, Guna Lakshminarayanan, Hamid Shojanazeri, Han Zou, Hannah Wang, Hanwen Zha, Haroun Habeeb, Harrison Rudolph, Helen Suk, Henry Aspregos, Hunter Goldman, Ibrahim Damlaj, Igor Molybog, Igor Tufanov, Irina-Elena Veliche, Itai Gat, Jake Weissman, James Geboski, James Kohli, Japhet Asher, Jean-Baptiste Gaya, Jeff Marcus, Jeff Tang, Jennifer Chan, Jenny Zhen, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi Yang, Joe Cummings, Jon Carvill, Jon Shepard, Jonathan McPhie, Jonathan Torres, Josh Ginsburg, Junjie Wang, Kai Wu, Kam Hou U, Karan Saxena, Karthik Prasad, Kartikay 
Khandelwal, Katayoun Zand, Kathy Matosich, Kaushik Veeraraghavan, Kelly Michelena, Keqian Li, Kun Huang, Kunal Chawla, Kushal Lakhota, Kyle Huang, Lailin Chen, Lakshya Garg, Lavender A, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo, Licheng Yu, Liron Moshkovich, Luca Wehrstedt, Madian Khabsa, Manav Avalani, Manish Bhatt, Maria Tsimpoukelli, Martyns Mankus, Matan Hasson, Matthew Lennie, Matthias Reso, Maxim Groshev, Maxim Naumov, Maya Lathi, Meghan Keneally, Michael L. Seltzer, Michal Valko, Michelle Restrepo, Mihir Patel, Mik Vyatskov, Mikayel Samvelyan, Mike Clark, Mike Macey, Mike Wang, Miquel Jubert Hermoso, Mo Metanat, Mohammad Rastegari, Munish Bansal, Nandhini Santhanam, Natascha Parks, Natasha White, Navyata Bawa, Nayan Singhal, Nick Egebo, Nicolas Usunier, Nikolay Pavlovich Laptev, Ning Dong, Ning Zhang, Norman Cheng, Oleg Chernoguz, Olivia Hart, Omkar Salpekar, Ozlem Kalinli, Parkin Kent, Parth Parekh, Paul Saab, Pavan Balaji, Pedro Rittner, Philip Bontrager, Pierre Roux, Piotr Dollar, Polina Zvyagina, Prashant Ratanchandani, Prithish Yuvaraj, Qian Liang, Rachad Alao, Rachel Rodriguez, Rafi Ayub, Raghavam Murthy, Raghu Nayani, Rahul Mitra, Raymond Li, Rebekkah Hogan, Robin Battey, Rocky Wang, Rohan Maheswari, Russ Howes, Ruty Rinott, Sai Jayesh Bondu, Samyak Datta, Sara Chugh, Sara Hunt, Sargun Dhillon, Sasha Sidorov, Satadru Pan, Saurabh Verma, Seiji Yamamoto, Sharadh Ramaswamy, Shaun Lindsay, Shaun Lindsay, Sheng Feng, Shenghai Lin, Shengxin Cindy Zha, Shiva Shankar, Shuqiang Zhang, Shuqiang Zhang, Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala, Stephanie Max, Stephen Chen, Steve Kehoe, Steve Satterfield, Sudarshan Govindaprasad, Sumit Gupta, Sungmin Cho, Sunny Virk, Suraj Subramanian, Sy Choudhury, Sydney Goldman, Tal Remez, Tamar Glaser, Tamara Best, Thilo Kohler, Thomas Robinson, Tianhe Li, Tianjun Zhang,
Tim Matthews, Timothy Chou, Tzook Shaked, Varun Vontimitta, Victoria Ajayi, Victoria Montanez, Vijai Mohan, Vinay Satish Kumar, Vishal Mangla, Victor Albiero, Vlad Ionescu, Vlad Poenaru, Vlad Tiberiu Mihailescu, Vladimir Ivanov, Wei Li, Wencheng Wang, Wenwen Jiang, Wes Bouaziz, Will Constable, Xiaocheng Tang, Xiaofang Wang, Xiaojian Wu, Xiaolan Wang, Xide Xia, Xilun Wu, Xinbo Gao, Yanjun Chen, Ye Hu, Ye Jia, Ye Qi, Yenda Li, Yilin Zhang, Ying Zhang, Yossi Adi, Youngjin Nam, Yu (Sid) Wang, Yuchen Hao, Yundi Qian, Yuqi He, Zach Rait, Zachary DeVito, Zef Rosnbrick, Zhaoedu Wen, Zhenyu Yang, and Zhiwei Zhao. 2024. The Llama 3 Herd of Models. In *arXiv preprint arXiv:2407.21783*.

- [13] Dynamic Sonnet Dataset. 2024. <https://huggingface.co/datasets/squeezebits/dynamic_sonnet_llama3>.
- [14] Murali Emani, Sam Foreman, Varuni Sastry, Zhen Xie, Siddhisanket Raskar, William Arnold, Rajeev Thakur, Venkatram Vishwanath, Michael E. Papka, Sanjiv Shanmugavelu, Darshan Gandhi, Hengyu Zhao, Dun Ma, Kiran Ranganath, Rick Weisner, Jiunn-yew Chen, Yuting Yang, Natalia Vassilieva, Bin C. Zhang, Sylvia Howland, and Alexander Tsyplikhin. 2024. Toward a Holistic Performance Evaluation of Large Language Models Across Diverse AI Accelerators. In *IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)*.
- [15] Amin Firoozshahian, Joel Coburn, Roman Leventstein, Rakesh Nattoji, Ashwin Kamath, Olivia Wu, Gurdeepak Greval, Harish Aepala, Bhasker Jakka, Bob Dreyer, Adam Hutchin, Utku Diril, Krishnakumar Nair, Ehsan K. Ardestani, Martin Schatz, Yuchen Hao, Rakesh Komuravelli, Kunming Ho, Sameer Abu Asal, Joe Shajrawi, Kevin Quinn, Nagesh Sreedhara, Pankaj Kansal, Willie Wei, Dheepak Jayaraman, Linda Cheng, Pritam Chopda, Eric Wang, Ajay Bikumandla, Arun Karthik Sengottuvel, Krishna Thottempudi, Ashwin Narasimha, Brian Dodds, Cao Gao, Jiyuan Zhang, Mohammed Al-Sanabani, Ana Zehtabioskuie, Jordan Fix, Hangchen Yu, Richard Li, Kaustubh Gondkar, Jack Montgomery, Mike Tsai, Saritha Dwarakapuram, Sanjay Desai, Nili Avidan, Poorvaja Ramani, Karthik Narayanan, Ajit Mathews, Sethu Gopal, Maxim Naumov, Vijay Rao, Krishna Noru, Harikrishna Reddy, Prahlad Venkatapuram, and Alexis Bjorlin. 2023. MTIA: First Generation Silicon Targeting Meta’s Recommendation Systems. In *Proceedings of the International Symposium on Computer Architecture (ISCA)*.
- [16] Wilson W.L. Fung, Ivan Sham, George Yuan, and Tor M. Aamodt. 2007. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. In *Proceedings of the International Symposium on Microarchitecture (MICRO)*.
- [17] Mark Gebhart, Daniel R. Johnson, David Tarjan, Stephen W. Keckler, William J. Dally, Erik Lindholm, and Kevin Skadron. 2011. Energy-Efficient Mechanisms for Managing Thread Context in Throughput Processors. In *Proceedings of the International Symposium on Computer Architecture (ISCA)*.
- [18] Soroush Ghodrati, Byung Hoon Ahn, Joon Kyung Kim, Sean Kinzer, Brahmendra Reddy Yatham, Navateja Alla, Hardik Sharma, Mohammad Alian, Eiman Ebrahimi, Nam Sung Kim, Cliff Young, and Hadi Esmaeilzadeh. 2020. Planaria: Dynamic Architecture Fission for Spatial Multi-Tenant Acceleration of Deep Neural Networks. In *Proceedings of the International Symposium on Microarchitecture (MICRO)*.
- [19] Google Cloud. 2024. Tensor Processing Units (TPU). <https://cloud.google.com/tpu>.
- [20] Google Cloud. 2024. Vertex AI. <https://cloud.google.com/vertex-ai>.
- [21] Alan Gray. 2019. Getting Started with CUDA Graphs. <https://developer.nvidia.com/blog/cuda-graphs/>.
- [22] Hugging Face. 2022. Optimum for Intel Gaudi Accelerators. <https://github.com/huggingface/optimum-habana>.
- [23] InfiniBand Trade Association. 2024. RDMA over Converged Ethernet. <https://www.roceinitiative.org/>.
- [24] Intel. 2023. Habana Gaudi-2 White Paper. <https://www.intel.com/content/www/us/en/content-details/784827/gaudi-2-white-paper.html>.
- [25] Intel. 2023. Intel HLS-Gaudi2 AI Accelerator Server. <https://habana.ai/wp-content/uploads/2023/10/HLS-Gaudi2_Datasheet_10_23.pdf>.
- [26] Intel. 2024. Fused Scaled Dot Product Attention for Gaudi. <https://docs.habana.ai/en/latest/PyTorch/Reference/Python_Packages.html#hpex-kernels-fusedsdpa>.
- [27] Intel. 2024. Gaudi TPC Architectural Overview. <https://docs.habana.ai/en/latest/TPC/TPC_User_Guide/Processor_Architectural_Overview.html>.
- [28] Intel. 2024. Habana Collective Communications Library (HCCL). <https://github.com/HabanaAI/hccl_demo>.
- [29] Intel. 2024. Habana HPU Graphs. <https://docs.habana.ai/en/latest/PyTorch/Reference/Python_Packages.html#hpu-graph-apis>.
- [30] Intel. 2024. Intel Gaudi 3 AI Accelerator Technical Paper. <https://www.intel.com/content/www/us/en/content-details/817486/intel-gaudi-3-ai-accelerator-white-paper.html>.
- [31] Intel. 2024. Intel Gaudi Software Suite. <https://habana.ai/intel-gaudi-software/>.
- [32] Intel. 2024. System Management Interface Tool (hl-smi). <https://docs.habana.ai/en/latest/Management_and_Monitoring/Embedded_System_Tools_Guide/System_Management_Interface_Tool.html>.
- [33] Intel. 2024. TPC-C Language. <https://docs.habana.ai/en/latest/TPC/TPC_User_Guide/TPC_C_Language.html>.
- [34] Intel. 2024. vLLM Fork for Gaudi. <https://github.com/HabanaAI/vllm-fork>.
- [35] Dmytro Ivchenko, Dennis Van Der Staay, Colin Taylor, Xing Liu, Will Feng, Rahul Kindi, Anirudh Sudarshan, and Shahin Sefati. 2022. TorchRec: a PyTorch Domain Library for Recommendation Systems. In *Proceedings of the ACM Conference on Recommender Systems (RecSys)*.
- [36] Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P Scarpazza. 2018. Dissecting the NVIDIA Volta GPU Architecture Via Microbenchmarking. In *arXiv preprint arXiv:1804.06826*.
- [37] Norm Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, Clifford Young, Xiang Zhou, Zongwei Zhou, and David A Patterson. 2023. TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings. In *Proceedings of the International Symposium on Computer Architecture (ISCA)*.
- [38] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre Luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Haghmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemth Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In *Proceedings of the International Symposium on Computer Architecture (ISCA)*.
- [39] Wang-Cheng Kang and Julian McAuley. 2018. Self-Attentive Sequential Recommendation. In *Proceedings of the IEEE International Conference on Data Mining (ICDM)*.
- [40] Roman Kaplan. 2024. Intel Gaudi 3 AI Accelerator: Architected for Gen AI Training and Inference. In *Hot Chips: A Symposium on High Performance Chips*.
- [41] Liu Ke, Udit Gupta, Benjamin Youngjae Cho, David Brooks, Vikas Chandra, Utku Diril, Amin Firoozshahian, Kim Hazelwood, Bill Jia, Hsien-Hsin S. Lee, Meng Li, Bert Maher, Dheevatsa Mudigere, Maxim Naumov, Martin Schatz, Mikhail Smelyanskiy, Xiaodong Wang, Brandon Reagen, Carole-Jean Wu, Mark Hempstead, and Xuan Zhang. 2020. RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing. In *Proceedings of the International Symposium on Computer Architecture (ISCA)*.
- [42] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In *Proceedings of the ACM Symposium on Operating Systems Principles (SOSP)*.
- [43] Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2019. TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning. In *Proceedings of the International Symposium on Microarchitecture (MICRO)*.
- [44] Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko. 2021. MLIR: Scaling Compiler Infrastructure for Domain Specific Computation. In *International Symposium on Code Generation and Optimization (CGO)*.
- [45] Victor W. Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singh, and Pradeep Dubey. 2010. Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU. In *Proceedings of the International Symposium on Computer Architecture (ISCA)*.
- [46] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In *Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS)*.
- [47] Piotr R Luszczek, David H Bailey, Jack J Dongarra, Jeremy Kepner, Robert F Lucas, Rolf Rabenseifner, and Daisuke Takahashi. 2006. The HPC Challenge (HPCC) Benchmark Suite. In *Proceedings of the ACM International Conference on Supercomputing (ICS)*.
- [48] Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S. Vetter. 2018. NVIDIA Tensor Core Programmability, Performance & Precision. In *IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)*.
- [49] John D McCalpin. 1995. Memory Bandwidth and Machine Balance in Current High Performance Computers. *IEEE Computer Society Technical Committee on Computer Architecture Newsletter* 2, 19-25 (1995).
- [50] Xinxin Mei and Xiaowen Chu. 2017. Dissecting GPU Memory Hierarchy Through Microbenchmarking. *IEEE Transactions on Parallel and Distributed Systems* 28, 1 (2017), 72-86.
- [51] Jiayuan Meng, David Tarjan, and Kevin Skadron. 2010. Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance. In *Proceedings of the International Symposium on Computer Architecture (ISCA)*.
- [52] MLPerf. 2024. MLCommons (MLPerf) Inference Benchmarks for Recommendation Task. <https://github.com/mlcommons/inference/tree/master/recommendation/dlrm_v2/pytorch>.
- [53] Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaram, Jongsoon Park, Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alison G. Azzolini, Dmytro Dzhulgakov, Andrey Mallevich, Ilia Cherniavskii, Yinghai Lu, Raghuraman Krishnamoorthi, Ansha Yu, Volodymyr Kondratenko, Stephanie Pereira, Xianjie Chen, Wenlin Chen, Vijay Rao, Bill Jia, Liang Xiong, and Misha Smelyanskiy. 2019. Deep Learning Recommendation Model for Personalization and Recommendation Systems. In *arXiv preprint arXiv:1906.00091*.
- [54] NVIDIA. 2018. NVSwitch: Leveraging NVLink to Maximum Effect. <https://developer.nvidia.com/blog/nvswitch-leveraging-nvlink-to-maximum-effect/>.
- [55] NVIDIA. 2024. cuBLAS. <https://developer.nvidia.com/cublas>.
- [56] NVIDIA. 2024. cuDNN. <https://developer.nvidia.com/cudnn>.
- [57] NVIDIA. 2024. cuSOLVER. <https://developer.nvidia.com/cusolver>.
- [58] NVIDIA. 2024. cuSPARSE. <https://developer.nvidia.com/cusparse>.
- [59] NVIDIA. 2024. NVIDIA Collective Communications Library (NCCL). <https://developer.nvidia.com/nccl>.
- [60] NVIDIA. 2024. NVIDIA System Management Interface. <https://developer.download.nvidia.com/compute/DCGM/docs/nvidia-smi-367.38.pdf>.
- [61] NVIDIA. 2024. NVIDIA Tensor Cores. <https://www.nvidia.com/en-us/data-center/tensor-cores/>.
- [62] NVIDIA. 2024. Performance Reported by NCCL Tests. <https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md>.
- [63] NVIDIA. 2024. TensorRT-LLM. <https://github.com/NVIDIA/TensorRT-LLM>.
- [64] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training Language Models to Follow Instructions with Human Feedback. In *Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS)*.
- [65] Beomsik Park, Ranggi Hwang, Dongho Yoon, Yoonhyuk Choi, and Minsoo Rhu. 2022. DiVA: An Accelerator for Differentially Private Machine Learning. In *Proceedings of the International Symposium on Microarchitecture (MICRO)*.
- [66] PyTorch. 2024. FBGEMM GPU Python API. <https://pytorch.org/FBGEMM/fbgemm_gpu-python-api/the_ops_training.html>.
- [67] PyTorch. 2024. Scaled Dot Product Attention (SDPA) Python API. <https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html>.
- [68] RAPIDS. 2024. cuDF: GPU DataFrames. <https://github.com/rapidsai/cudf>.
- [69] RAPIDS. 2024. cuVS: Vector Search and Clustering on the GPU. <https://github.com/rapidsai/cuvs>.
- [70] Minsoo Rhu and Mattan Erez. 2013. The Dual-Path Execution Model for Efficient GPU Control Flow. In *Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA)*.
- [71] Minsoo Rhu, Michael Sullivan, Jingwen Leng, and Mattan Erez. 2013. A Locality-Aware Memory Hierarchy for Energy-Efficient GPU Architectures. In *Proceedings of the International Symposium on Microarchitecture (MICRO)*.
- [72] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. In *arXiv preprint arXiv:1909.08053*.
- [73] Mojtaba Abaie Shoushtary, Jordi Tubella Murgadas, and Antonio Gonzalez. 2024. Control Flow Management in Modern GPUs. In *arXiv preprint arXiv:2407.02944*.
- [74] SqueezeBits. 2025. <https://blog.squeezebots.com/intel-gaudi-3-performance-evaluation-with-synapseai-v119-39839>.
- [75] Guillaume Thomas-Collignon and Vishal Mehta. 2020. Optimizing CUDA Applications for NVIDIA A100 GPU. <https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s21819-optimizing-applications-for-nvidia-ampere-gpu-architecture.pdf>.
- [76] Philippe Tillet, H. T. Kung, and David Cox. 2019. Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. In *Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL)*.
- [77] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. In *arXiv preprint arXiv:2302.13971*.
- [78] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In *Proceedings of the International Conference on Neural Information Processing Systems (NIPS)*.
- [79] VIA Research. 2025. <https://github.com/VIA-Research/Intel-Gaudi-AI-benchmarks>.

- [80] Ruoxi Wang, Rakesh Shivanna, Derek Cheng, Sagar Jain, Dong Lin, Lichan Hong, and Ed Chi. 2021. DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-Scale Learning to Rank Systems. In *Proceedings of the International Conference on World Wide Web (WWW)*.
- [81] Shibo Wang and Pankaj Kanwar. 2019. BFloat16: The Secret to High Performance on Cloud TPUs. <https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus?hl=en>.
- [82] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. ORCA: A Distributed Serving System for Transformer-Based Generative Models. In *Proceedings of USENIX Symposium on Operating Systems Design and Implementation (OSDI)*.
- [83] Chengming Zhang, Baixi Sun, Xiaodong Yu, Zhen Xie, Weijian Zheng, Kamil Iskra, Pete Beckman, and Dingwen Tao. 2023. Benchmarking and In-Depth Performance Study of Large Language Models on Habana Gaudi Processors. In *Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis*.