

# Multi-GPU Quantum Circuit Simulation and the Impact of Network Performance

W. Michael Brown\*

*NVIDIA, Santa Clara, CA, USA*

Anurag Ramesh

*Davidson School of Chemical Engineering, Purdue University, West Lafayette, IN, USA*

Thomas Lubinski

*QED-C Technical Advisory Committee - Standards, Arlington, VA, USA*

*Quantum Circuits Inc., New Haven, CT, USA*

Thien Nguyen

*NVIDIA, Santa Clara, CA, USA*

David E. Bernal Neira

*Davidson School of Chemical Engineering, Purdue University, West Lafayette, IN, USA*

---

## Abstract

As is intrinsic to the fundamental goal of quantum computing, classical simulation of quantum algorithms is notoriously demanding in resource requirements. Nonetheless, simulation is critical to the success of the field and a requirement for algorithm development and validation, as well as hardware design. GPU-acceleration has become standard practice for simulation, and due to the exponential scaling inherent in classical methods, multi-GPU simulation can be required to achieve representative system sizes. In this case, inter-GPU communications can bottleneck performance. In this work, we present the introduction of MPI into the QED-C Application-Oriented Benchmarks to facilitate benchmarking on HPC systems. We review the advances in interconnect technology and the APIs for multi-GPU communication. We benchmark using a variety of interconnect paths, including the recent NVIDIA Grace Blackwell NVL72 architecture that represents the first product to expand high-bandwidth GPU-specialized interconnects across multiple nodes. We show that while improvements to GPU architecture have led to speedups of over 4.5X across the last few generations of GPUs, advances in interconnect performance have had a larger impact with over 16X performance improvements in time to solution for multi-GPU simulations.

**Keywords:** Quantum computing, State vector, Simulation, GPU, CUDA-Q, MPI, Interconnect

---

## 1. Introduction

Quantum computing has seen significant advances over the last decade, with qubit counts sufficient to run algorithms of interest, the realization of the first logical qubit for fault-tolerant computing [1], and the first claims of quantum advantage [2]. The development of new quantum algorithms capable of exploiting the advantages of quantum computing is as important as advances in the hardware itself. While quantum computers are now more capable, classical simulation of these quantum algorithms remains critical to advancing the field. Experiments on quantum computers are time-consuming, with many challenges resulting from the need for calibration and noise mitigation. Fault-tolerant quantum computers might require orders of magnitude more physical qubits to achieve reliable error correction, and we expect simulation to continue to play a critical role well into the fault-tolerant era. Of course, simulation is still used in pre-silicon assessment and post-silicon debugging for the development of new classical computers. Those who have been involved in the supercomputing race will be very familiar with the need to validate bleeding-edge hardware and software based on expected behavior, a challenge amplified in quantum computing due to the wide variety of novel engineering designs.

---

\*Corresponding Author.

Email addresses: michbrown@nvidia.com (W. Michael Brown), rames102@purdue.edu (Anurag Ramesh), tlubinski@quantumcircuits.com (Thomas Lubinski), thiennguyen@nvidia.com (Thien Nguyen), dbernaln@purdue.edu (David E. Bernal Neira)

Rigorous performance evaluation is critical as quantum systems transition from research prototypes to practical computational tools. Benchmarking efforts span multiple levels of the quantum computing stack, from component-level hardware characterization [3–8] and system-level performance evaluation [9–13] to compiler and toolchain measurement [14, 15] and algorithmic performance analysis [16–21]. While each provides valuable insights within its focus area, application-oriented benchmarks are essential for evaluating end-to-end system performance on representative workloads. In this work, we utilize the QED-C Application-Oriented Performance Benchmarks for Quantum Computing [18, 22–26], an evolving suite of quantum algorithms and applications designed to assess performance on real-world problems. To enable scalable benchmarking on HPC platforms, we introduce support for MPI [27] into the QED-C framework for distributed multi-GPU evaluation. This implementation-agnostic approach supports performance assessment across quantum programming frameworks, providing insights relevant to both near-term simulation and the development of fault-tolerant quantum systems.

GPU-acceleration can provide dramatic speedups for classical simulation and has become standard practice [28, 29] with popular simulation frameworks such as Qiskit [30], PennyLane [31], Cirq [32], and CUDA-Q [33] supporting GPUs. At maximum GPU memory capacity, we typically measure speedups of 2–3 orders of magnitude when comparing state-of-the-art CPUs and GPUs on a single socket, due to better-optimized algorithms and higher peak hardware throughput. While techniques for gate fusion [29, 34–36] can result in near-peak utilization of main memory bandwidth (BW) and floating-point throughput when using a single GPU, network communications can become the bottleneck when distributing a simulation across multiple GPUs. This is relevant as state-vector simulation, the most generally applicable simulation approach, scales exponentially in memory and computational requirements with the qubit count.

For 32-bit floating-point precision, the memory requirement for the state vector in gibibytes (GiB) is given by $2^{n-27}$ for $n$ qubits, since each of the $2^n$ complex amplitudes occupies 8 bytes. With this encoding, current GPUs are limited to simulating around 34 qubits (128 GiB state vector) when constrained by a single GPU’s memory. While host memory can be used to supplement storage, significant HPC resources can become necessary, where multiple GPUs across multiple nodes are required either to attain the aggregate memory requirements for a given qubit count or to achieve reasonable time-to-solution for the simulation of many circuits (for example, with iterative algorithms, noisy simulation, AI-guided applications, etc.). In these cases, the performance of data movement becomes a concern as data is transferred between distributed memory domains across different GPUs or potentially between the host memory and the GPU.
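As a quick check of the scaling above, the following Python sketch evaluates the state-vector footprint for a given qubit count. It assumes single-precision complex amplitudes (two 32-bit floats, 8 bytes per amplitude); the function names are illustrative and do not belong to any library.

```python
def state_vector_gib(n_qubits: int, bytes_per_amplitude: int = 8) -> float:
    """Memory in GiB for a full state vector of 2**n_qubits amplitudes."""
    return (2 ** n_qubits) * bytes_per_amplitude / 2 ** 30


def max_qubits(memory_gib: float, bytes_per_amplitude: int = 8) -> int:
    """Largest qubit count whose state vector fits in memory_gib."""
    n = 0
    while state_vector_gib(n + 1, bytes_per_amplitude) <= memory_gib:
        n += 1
    return n


# Each additional qubit doubles the footprint.
for n in (30, 34, 40):
    print(f"{n} qubits -> {state_vector_gib(n):g} GiB")
```

Because the footprint doubles with every qubit, doubling the aggregate GPU memory adds only a single qubit to the largest system that fits.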

The interconnect technologies for network communications between GPUs have undergone significant advancements since the introduction of GPU acceleration to HPC. While the first systems required a path involving the host memory, support for remote direct memory access (RDMA), first marketed by NVIDIA as “GPUDirect”, was introduced to allow direct memory access between multiple GPUs on the same node or from a network interface controller (NIC) for internode communications [37]. To overcome the limitations of the PCI Express bus initially used for intranode multi-GPU communications, high-bandwidth GPU-specialized interconnects were first introduced under the marketing name NVLink [38] (NVL), with Infinity Fabric as the counterpart on AMD systems. With NVLink 4, systems combining NVIDIA Grace CPUs with GPUs feature a specialized coherent interconnect between the CPU and GPU, called NVLink-C2C, that provides the same peak bidirectional bandwidth. While these specialized interconnects have been limited to communications within a node, NVIDIA has recently released Grace Blackwell NVL72 [39] as the first generally available system to expand this high-BW all-to-all interconnect across multiple nodes, referred to here as multi-node NVL (MNNVL). In the case of NVL72, up to 72 GPUs across multiple nodes can be in the same NVL domain.

Table 1 lists the peak injection bandwidths for the interconnect options available at the time of writing across the last few generations of GPUs. The options differ dramatically in performance, spanning more than an order of magnitude. While there are many aspects of network performance that are beyond the scope of this paper, for multi-GPU distributed state-vector simulation, the bisection bandwidth of the aggregate inter-GPU interconnect will typically be the primary performance concern.
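To give a feel for the spread in Table 1, the sketch below estimates an idealized lower bound on the time needed to move a fixed payload at each peak rate. It is only an illustration: it divides the byte count by the listed peak bidirectional bandwidth and ignores latency, protocol overhead, and topology. The 64 GiB payload is an arbitrary assumed value, and the helper name is hypothetical.

```python
# Peak bidirectional bandwidths in GB/s, copied from Table 1.
PEAK_BW_GB_PER_S = {
    "Slingshot 11": 25,
    "Connect-X 7": 50,
    "PCIe 4.0": 64,
    "PCIe 5.0": 128,
    "NVLink 3": 600,
    "NVLink 4": 900,
    "NVLink 5": 1800,
}


def ideal_transfer_seconds(n_bytes: float, bw_gb_per_s: float) -> float:
    """Idealized transfer time: bytes divided by peak bandwidth."""
    return n_bytes / (bw_gb_per_s * 1e9)


payload_bytes = 64 * 2 ** 30  # assumed payload of 64 GiB
for name, bw in PEAK_BW_GB_PER_S.items():
    seconds = ideal_transfer_seconds(payload_bytes, bw)
    print(f"{name:>12}: {seconds:6.2f} s")
```

Measured transfer times will be longer than these bounds, but the ratios between entries show why the choice of interconnect path matters for distributed simulation.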

In this work, we introduce MPI support into the QED-C benchmark suite. We describe some of the different APIs that can be used to exploit high-performance inter-GPU communication. We present benchmarking results using different interconnect and API options. In addition to demonstrating the new benchmark capabilities and showcasing the performance achievable with MNNVL, we expect the work to be helpful for understanding the requirements and options available for software developers and system architects with quantum computing workloads.

## 2. Methods

### 2.1. Terminology

For readers new to quantum computing, we provide brief definitions of the terminology used throughout this text.

| Interconnect               | Peak Bidirectional BW | GPU-GPU | GPU-NIC | GPU-CPU | Inter-Node |
|----------------------------|-----------------------|---------|---------|---------|------------|
| PCIe 4.0                   | 64 GB/s               | X       | X       | X       |            |
| PCIe 5.0                   | 128 GB/s              | X       | X       | X       |            |
| PCIe 6.0                   | 256 GB/s              | X       | X       | X       |            |
| NVLink 3                   | 600 GB/s              | X       |         |         |            |
| NVLink 4                   | 900 GB/s              | X       |         | *       |            |
| NVLink C2C                 | 900 GB/s              |         |         | X       |            |
| MI350X Infinity Fabric[40] | 153.6 GB/s            | X       |         |         |            |
| Slingshot 11 [41]          | 25 GB/s               |         |         |         | X          |
| Connect-X 7                | 50 GB/s               |         |         |         | X          |
| NVLink 5                   | 1800 GB/s             | X       |         |         | X          |

Table 1: Peak Bidirectional BW for Various Interconnects and NICs Involved in Multi-GPU Communication. \*The 4th generation of NVL was technically the first to support internode communication; however, test systems supporting internode communications were never made generally available.

*Classical Bits and Qubits.* A classical bit can exist exclusively in one of two states, 0 or 1. In contrast, a quantum bit, also called a *qubit*, can occupy a coherent superposition of these basis states. Formally, a qubit is represented as $\alpha|0\rangle + \beta|1\rangle$, where $\alpha, \beta \in \mathbb{C}$ are complex amplitudes and $|\alpha|^2 + |\beta|^2 = 1$. Upon measurement, the superposition probabilistically collapses to either $|0\rangle$ or $|1\rangle$, with probabilities $|\alpha|^2$ and $|\beta|^2$, respectively.

*Programs and Quantum Circuits.* In classical computing, algorithms are typically expressed as sequences of deterministic instructions that act on binary variables, manipulating them using Boolean logic and gate operations. Quantum computing adopts a related abstraction in which computation is described through quantum circuits composed of unitary operations and measurements. Unitary operations are performed through unitary operators/gates. For example, we say an operator $A$ acting on a Hilbert space $\mathcal{H}$ is unitary if its complex-conjugate transpose $A^\dagger$ satisfies $A^\dagger A = AA^\dagger = I$, where $I$ is the identity operator. All unitary operations are reversible ($A^{-1} = A^\dagger$), and therefore there is a unique inverse operation that returns the system to its original state. Table 2 summarizes the relationship between classical programs and quantum programs using a set of high-level conceptual analogies.

Table 2: Conceptual comparison between classical and quantum programs.

| Feature           | Classical                     | Quantum                                                    |
|-------------------|-------------------------------|------------------------------------------------------------|
| Basic Unit        | Bit (value 0 or 1)            | Qubit (value 0, 1, or a superposition)                     |
| Computation Model | Logic gates on bits           | Unitary quantum gates on qubits                            |
| Core Principle    | Deterministic Boolean logic   | Probabilistic evolution via superposition and entanglement |
| Processing Power  | Processes one state at a time | Evolves many basis states in superposition                 |

*Quantum Circuit Depth.* The *depth* of a quantum circuit refers to the number of sequential layers of quantum gates when arranged so that operations acting on disjoint sets of qubits are executed in parallel. Circuit depth plays a role analogous to computational time in classical settings: deeper circuits enable more expressive transformations but accumulate greater noise on contemporary quantum hardware. Consequently, depth is a key metric in the design and feasibility of near-term quantum algorithms.
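The layering in this definition can be made concrete with a short sketch. The code below assumes a circuit is given as a list of gates in program order, each gate represented only by the tuple of qubit indices it acts on, and places every gate in the earliest layer that follows all previous gates on those qubits. The representation and function name are illustrative assumptions, not the method used by any particular framework.

```python
from typing import Dict, Iterable, Sequence


def circuit_depth(gates: Iterable[Sequence[int]]) -> int:
    """Number of layers when each gate is scheduled as early as possible."""
    last_layer: Dict[int, int] = {}  # qubit index -> last layer it is busy in
    depth = 0
    for qubits in gates:
        # A gate must come after every earlier gate that shares a qubit with it.
        layer = 1 + max((last_layer.get(q, 0) for q in qubits), default=0)
        for q in qubits:
            last_layer[q] = layer
        depth = max(depth, layer)
    return depth


# Two single-qubit gates run in parallel, followed by a two-qubit gate;
# the gate on qubit 2 fits into the first layer.
example = [(0,), (1,), (0, 1), (2,)]
print(circuit_depth(example))
```

A chain of two-qubit gates such as `[(0, 1), (1, 2), (2, 3)]` cannot be parallelized at all, so its depth equals its gate count.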

*State-Vector Simulation.* Classical simulation of quantum algorithms commonly relies on the *state-vector* representation, in which the full quantum state of an $n$-qubit system is stored as a complex vector of dimension $2^n$. Each quantum gate is applied by multiplying the state vector by the corresponding unitary operator, extended to the full register through a tensor product with identity operators on the qubits it does not act on. The exponential scaling of the state-vector size imposes significant computational and memory demands.
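In practice, simulators rarely build the full $2^n \times 2^n$ operator. A minimal sketch of the usual alternative for a single-qubit gate is shown below: the gate is applied to pairs of amplitudes whose indices differ only in the target bit. The code assumes that qubit 0 is the least-significant bit of the basis-state index and uses plain Python lists; it is meant to illustrate the access pattern rather than reproduce any library's implementation.

```python
import math


def apply_single_qubit_gate(state, gate, target):
    """Apply a 2x2 gate to qubit `target` of a state vector.

    Assumption: qubit 0 is the least-significant bit of the basis index.
    """
    stride = 1 << target
    out = list(state)
    for i in range(len(state)):
        if (i & stride) == 0:   # visit each amplitude pair exactly once
            j = i | stride      # partner index differs only in the target bit
            a0, a1 = state[i], state[j]
            out[i] = gate[0][0] * a0 + gate[0][1] * a1
            out[j] = gate[1][0] * a0 + gate[1][1] * a1
    return out


s = 1 / math.sqrt(2)
HADAMARD = [[s, s], [s, -s]]

state = [1, 0, 0, 0]                                  # two qubits in |00>
state = apply_single_qubit_gate(state, HADAMARD, 1)   # superposition on qubit 1
print(state)
```

The distance between paired amplitudes is $2^{\text{target}}$, so once the vector is split across GPUs, gates on high-index qubits pair amplitudes that live in different memory domains.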

*Measurement and Sampling.* Quantum measurement maps the final quantum state to a classical bitstring drawn according to the underlying probability distribution determined by the state’s amplitudes. Because a single circuit execution yields only one sample, quantum algorithms typically require repeated execution (*shots*) to estimate expectation values, objective functions, or probability distributions with sufficient statistical accuracy.
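As an illustration of shot-based sampling, the sketch below draws bitstrings from a small state vector using only the Python standard library. The function name and the optional `seed` argument (added for reproducibility) are assumptions made for this example.

```python
import math
import random
from collections import Counter


def sample_counts(amplitudes, shots, seed=None):
    """Draw `shots` bitstrings with probabilities |amplitude|**2."""
    n_qubits = (len(amplitudes) - 1).bit_length()
    probabilities = [abs(a) ** 2 for a in amplitudes]
    rng = random.Random(seed)
    outcomes = rng.choices(range(len(amplitudes)),
                           weights=probabilities, k=shots)
    return Counter(format(o, f"0{n_qubits}b") for o in outcomes)


# Bell state (|00> + |11>) / sqrt(2): only "00" and "11" can be observed.
s = 1 / math.sqrt(2)
counts = sample_counts([s, 0, 0, s], shots=1000, seed=7)
print(counts)
```

The observed frequencies only approximate the underlying probabilities, and the statistical error shrinks roughly as the inverse square root of the number of shots.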

### 2.2. Benchmarks

In this work, we use the quantum phase estimation (QPE) [42] benchmark as it represents a fundamental algorithmic component in quantum computing applications and is amenable to weak-scaling studies due to the simple scaling behavior of gate count with the number of qubits. For a more sophisticated benchmark, we also evaluate strong scaling using a 33-qubit Transverse-field Ising model [43] Hamiltonian from HamLib [44] with 2D periodic boundary conditions on a  $3 \times 22$  triangular lattice.

*Quantum Phase Estimation (QPE).* QPE is a foundational algorithmic primitive used in quantum simulation, amplitude estimation, and eigenvalue problems. Given a unitary operator $U$ and one of its eigenstates $|\psi\rangle$, the objective is to estimate the eigenphase $\phi$ such that

$$U|\psi\rangle = e^{2\pi i \phi} |\psi\rangle.$$

The QPE circuit consists of two main components: (i) a register of qubits on which controlled powers of unitary gate  $U$  are applied (e.g.,  $U, U^2, U^4, \dots$ ), and (ii) an inverse Quantum Fourier Transform (QFT) that extracts the phase information into the computational basis. The QED-C [45] implementation samples phases that are exactly representable with  $k$  qubits (i.e.,  $\phi = n/2^k$ ), enabling clean fidelity comparisons. Increasing  $k$  increases both circuit width and depth while preserving algorithmic regularity. This makes QPE a valuable reference for evaluating circuit-generation time, transpilation overhead, execution time, and fidelity as a function of system size. Within the QED-C framework, average execution times, circuit depths, and fidelity between the measured and ideal phase distributions are recorded as key performance metrics.
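To see why exactly representable phases give clean fidelity comparisons, the sketch below evaluates the textbook ideal outcome distribution of QPE with $k$ counting qubits, in which outcome $m$ has amplitude $2^{-k}\sum_{t=0}^{2^k-1} e^{2\pi i t(\phi - m/2^k)}$. This is a direct evaluation of that formula under noiseless assumptions and is not taken from the QED-C implementation; the function name is illustrative.

```python
import cmath


def qpe_probabilities(phi: float, k: int):
    """Ideal probability of each k-bit outcome m for eigenphase phi."""
    size = 2 ** k
    probabilities = []
    for m in range(size):
        amplitude = sum(
            cmath.exp(2j * cmath.pi * t * (phi - m / size))
            for t in range(size)
        ) / size
        probabilities.append(abs(amplitude) ** 2)
    return probabilities


exact = qpe_probabilities(5 / 8, k=3)   # phase representable with 3 bits
inexact = qpe_probabilities(0.3, k=3)   # phase not representable with 3 bits
print([round(p, 3) for p in exact])
print([round(p, 3) for p in inexact])
```

For the exactly representable phase all probability falls on a single bitstring, whereas the second case spreads probability over several outcomes, which would complicate a fidelity-based comparison.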

*HamLib Transverse-Field Ising Model.* HamLib [44] provides a standardized set of Hamiltonians, including condensed matter physics models like the Heisenberg model, the Fermi-Hubbard model, and the Transverse-field Ising model (TFIM), chemistry models of molecules such as  $H_2$  and  $CH_4$ , and combinatorial optimization models. The library is designed for reproducible benchmarking of quantum algorithms and different backends. The size of the Hamiltonians ranges from a few qubits to large-scale instances with hundreds or even thousands of qubits.

In this work, we study the time evolution of the TFIM [43] implemented using Suzuki-Trotter [46, 47] decomposition, producing deep circuits with nontrivial interaction structure that closely resemble realistic quantum-simulation workloads. The TFIM is used extensively in benchmarking quantum-simulation workloads. The Hamiltonian for a system of  $N$  particle spins is given by

$$H = -J \sum_{\langle i,j \rangle} \sigma_i^z \sigma_j^z - h \sum_{i=1}^N \sigma_i^x$$

where  $J$  is the magnetic coupling strength,  $h$  is the transverse-magnetic field amplitude,  $\sigma^{x,z}$  are Pauli operators, and  $\langle i, j \rangle$  denotes pairs of interacting particles.
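The structure of this Hamiltonian can be illustrated by listing its Pauli terms. The sketch below does so for a periodic one-dimensional chain, which is a deliberately simpler geometry than the triangular lattice used in the benchmark; the function name and the Pauli-string convention (leftmost character is spin 0) are assumptions made for this example and do not reflect the HamLib data format.

```python
def tfim_chain_terms(n_spins: int, J: float = 1.0, h: float = 1.0):
    """Return (coefficient, pauli_string) pairs for a periodic 1D TFIM."""
    if n_spins < 3:
        raise ValueError("a periodic chain needs at least 3 spins")
    terms = []
    for i in range(n_spins):              # nearest-neighbour ZZ couplings
        label = ["I"] * n_spins
        label[i] = "Z"
        label[(i + 1) % n_spins] = "Z"    # wrap around for periodic boundaries
        terms.append((-J, "".join(label)))
    for i in range(n_spins):              # transverse-field X terms
        label = ["I"] * n_spins
        label[i] = "X"
        terms.append((-h, "".join(label)))
    return terms


for coefficient, pauli in tfim_chain_terms(4, J=1.0, h=0.5):
    print(f"{coefficient:+.1f} * {pauli}")
```

In a Suzuki-Trotter decomposition, each of these terms contributes a rotation gate per step, so the number of bonds in the lattice directly affects circuit depth.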

Because the Hamiltonians directly depend on qubit count, there can be significant fluctuations in the circuit depth, and therefore, we only evaluate the 33-qubit TFIM Hamiltonian at a fixed size. The HamLib TFIM benchmark then enables assessment of strong-scaling behavior, examining fidelity degradation, execution time growth, and resource requirements for physically meaningful many-body systems. Together, the QPE and HamLib benchmarks provide complementary perspectives on algorithmic performance across different scaling regimes.

For the benchmarks here, noiseless sampling was performed with 1000 shots, and 10 circuits were timed for each data point. The timing for the first circuit was excluded as a warm-up run. For HamLib, 10 steps using method 3 in the QED-C benchmark implementation were enforced. For QPE, the benchmark was launched with MPI as:

---

```
python -m mpi4py qft_benchmark.py -non -s 1000 \
    -c 10 -n $NQUBITS -w -a cudaq
```

---

and for HamLib as:

---

```
python -m mpi4py \
    hamlib_simulation_benchmark.py -non -s 1000 \
    -c 11 -n 33 -w -m 3 -steps 10 -a cudaq
```

---

### 2.3. Quantum Simulation Software

We evaluate performance using NVIDIA CUDA-Q [48], a framework for heterogeneous quantum-classical computing. CUDA-Q puts forward a programming model, in C++ and Python, comprising types, concepts, syntax, and semantics, which facilitates the integration of quantum processing units (QPUs) and high-performance simulation backends into conventional computing workflows.

In CUDA-Q, QPU code is defined as a quantum kernel, which is annotated for a custom compilation pipeline, based on MLIR, to produce an executable payload for the quantum coprocessor/simulator backend. There are various hardware QPUs and simulators available in CUDA-Q. Specifically, both state vector and tensor network-based simulators, based on the cuQuantum library [36], are available in CUDA-Q. Each backend, either QPU or simulator, is assigned a unique target name, allowing users to interchange backends in a hardware-agnostic manner.

The CUDA-Q simulator backend used in this work, named “nvidia”, is a high-performance state-vector simulator that supports multi-node multi-GPU distribution and can leverage host memory. As the parallel algorithms for distributed state-vector simulation are primarily implemented within the cuQuantum library, the results here are more generally applicable to other frameworks in addition to CUDA-Q (at the time of writing, multi-GPU support in Qiskit, PennyLane, Cirq, and some others is provided solely via cuQuantum).

The software versions used were CUDA-Q 0.12 with cuQuantum 25.9.1 and QED-C Application Oriented Benchmarks Git Hash “dd5e45a6a329”.

### 2.4. APIs for multi-GPU Communication

The recommended approach for exploiting multi-GPU parallelism is with high-level libraries for inter-GPU communication. These include GPU-aware MPI, libraries based on SHMEM [49], or newer libraries that have been heavily optimized for AI workloads. For NVL72, the newest releases of Unified Communication X [50] (UCX) underlying MPI, NVSHMEM, and NVIDIA Collective Communications Library (NCCL) have all added optimized support. Additionally, these libraries can be used in conjunction with each other; both NVSHMEM and NCCL can be used along with MPI in the same application. For “ninja” level optimizations, lower-level APIs can be used for finer control of inter-GPU communications. For GPUDirect and NVLink, these include the CUDA interprocess communication (IPC) API and the virtual memory management (VMM) API available through the CUDA Driver. For AMD GPUs, the direct peer GPU memory access functionality is exposed by replacing the characters “cuda” with “hip” for functions in the NVIDIA-designed IPC API. For MNNVL, the IPC API is not supported, and the VMM API should be used.

Quantum circuit simulators have been implemented using a PGAS model supporting NVSHMEM [51], and MNNVL is designed to be efficient with a global address space across GPUs. AI-optimized libraries, such as NCCL, can offer performance benefits due to both business prioritization and the potential freedom from optimization limitations imposed by other standards. Both of these APIs can be particularly beneficial where fine-grain communication can be exploited for performance without involvement of the GPU host. Nonetheless, MPI remains most prevalent in HPC and is the high-level API used for internode communication in cuQuantum and CUDA-Q.

In GPU-aware MPI, GPU memory pointers can be passed directly to MPI functions, and the runtime will choose the optimized transport suitable for the inter-GPU interconnects on the system. For MNNVL, this is transparent; however, for typical GPU allocations performed with `cudaMalloc`, zero-copy transfers are not supported, and communications will be buffered. This can result in a performance impact in terms of latency. In order to achieve zero-copy transfers, developers have the option to use fabric-qualified memory allocations through the VMM API or stream-ordered allocations with a fabric-qualified memory pool. For the former, fabric-qualified memory allocation can be achieved (using CUDA 12.5 or later) with:

---

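```
#include <cuda.h>
#include <cuda_runtime.h>

// Allocate fabric-qualified memory through the CUDA VMM (driver) API.
// device == true places the allocation in GPU memory; device == false
// places it in host memory on the NUMA node associated with the GPU.
cudaError_t cudaMallocFabric(void** mptr,
                             size_t size, bool device) {
    int deviceId = 0;
    cudaError_t status = cudaGetDevice(&deviceId);
    if (status != cudaSuccess) return status;
    CUresult e = CUDA_SUCCESS;  // status of driver API calls

    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.requestedHandleTypes =
        CU_MEM_HANDLE_TYPE_FABRIC;
    prop.location.type =
        CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = deviceId;

    int cpuNumaNodeId = -1;
    if (device == false) {
        cuDeviceGetAttribute(&cpuNumaNodeId,
                             CU_DEVICE_ATTRIBUTE_HOST_NUMA_ID,
                             deviceId);
        prop.location.type =
            CU_MEM_LOCATION_TYPE_HOST_NUMA;
        prop.location.id = cpuNumaNodeId;
    }

    // Round the request up to the allocation granularity.
    size_t granularity = 0;
    e = cuMemGetAllocationGranularity(
        &granularity, &prop,
        CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size_t alignedSize = (size + granularity - 1) &
                         ~(granularity - 1);

    // Create the physical allocation and export a fabric handle.
    CUmemGenericAllocationHandle gHandle = 0;
    e = cuMemCreate(&gHandle, alignedSize,
                    &prop, 0);

    CUmemFabricHandle fHandle;
    e = cuMemExportToShareableHandle(&fHandle,
            gHandle, CU_MEM_HANDLE_TYPE_FABRIC, 0);

    // Reserve a virtual address range and map the allocation into it.
    CUdeviceptr ptr;
    e = cuMemAddressReserve(&ptr, alignedSize,
                            granularity, 0, 0);
    e = cuMemMap(ptr, alignedSize, 0,
                 gHandle, 0);

    // Grant the local GPU read/write access to the mapping.
    CUmemAccessDesc accessDesc = {};
    accessDesc.flags =
        CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    accessDesc.location.type =
        CU_MEM_LOCATION_TYPE_DEVICE;
    accessDesc.location.id = deviceId;
    e = cuMemSetAccess(ptr, alignedSize,
                       &accessDesc, 1);

    *mptr = (void *)ptr;
    return (e == CUDA_SUCCESS) ? cudaSuccess
                               : cudaErrorUnknown;
}

// Counterpart to cudaMallocFabric; named so that it does not collide
// with the CUDA runtime's own cudaFree.
cudaError_t cudaFreeFabric(void *devPtr) {
    CUmemGenericAllocationHandle gHandle = 0;
    CUresult e = cuMemRetainAllocationHandle(&gHandle,
                                             devPtr);

    if (e == CUDA_SUCCESS) {
        size_t size;
        CUdeviceptr base;
        CUdeviceptr cuDevPtr = (CUdeviceptr)devPtr;
        e = cuMemGetAddressRange(&base, &size,
                                 cuDevPtr);
        // Release twice: once for the reference taken by
        // cuMemRetainAllocationHandle, once for cuMemCreate.
        e = cuMemRelease(gHandle);
        e = cuMemRelease(gHandle);
        e = cuMemUnmap(cuDevPtr, size);
        e = cuMemAddressFree(cuDevPtr, size);
    }
    return (e == CUDA_SUCCESS) ? cudaSuccess
                               : cudaErrorUnknown;
}
```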

---

Note that full error handling for `e` has been omitted from these examples and that even when running on only a single node, the `cuMemCreate` function can fail if the system does not have properly configured IMEX channels [52]. An alternative approach to the VMM API is to allocate MPI buffers using stream-ordered memory allocators (`cudaMallocAsync()`, `cudaFreeAsync()`) with a memory pool that has been created with the `CU_MEM_HANDLE_TYPE_FABRIC` property (see `cudaMemPoolCreate()`). In addition to CUDA-runtime support, this also carries the typical advantages for stream-ordered allocation, including low overhead where it is difficult to avoid allocation within a loop.

As with any application using CUDA-aware MPI, CUDA-Q and cuQuantum support NVL and MNNVL so long as support in the MPI communications library is configured correctly. However, a more optimized low-level implementation can also be exploited, configured in CUDA-Q with the `CUDAQ_GPU_FABRIC` environment variable. When enabled, MPI is still used for sharing of memory handles, some synchronization, and in the case of only intranode NVL, internode communications. However, most of the inter-GPU communications within the NVL domain are performed using an optimized implementation with the low-level APIs.

### 2.5. Multi-node Benchmarking and MPI

MPI support in cuQuantum facilitates either decreasing time to solution for a given system size or increasing the system size beyond what will fit in the memory of a single GPU or node. Both are achieved by distributing the state vector into smaller sub-statevectors across multiple GPUs to exploit increased memory capacity, BW, and compute throughput [36]. As this distribution requires intensive communication that can occur between any pairs of GPUs (see [53] for details on communication patterns), we demonstrate here that network communications can quickly become a bottleneck as scaling occurs.
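The origin of this communication can be sketched with a common textbook layout in which the highest-order index bits of the state vector select the GPU. Under that assumption, a gate on one of the "local" qubits touches only amplitudes held by a single GPU, while a gate on a "global" qubit pairs amplitudes held by two different ranks. The code below is an illustration of that layout only; the actual distribution and index-bit swapping strategy in cuQuantum may differ (see [53]), and the function names are hypothetical.

```python
def partition_qubits(n_qubits: int, n_gpus: int):
    """Split qubits into (local, global) counts for a power-of-two GPU count."""
    if n_gpus < 1 or n_gpus & (n_gpus - 1):
        raise ValueError("this sketch assumes a power-of-two number of GPUs")
    n_global = n_gpus.bit_length() - 1   # log2(n_gpus)
    n_local = n_qubits - n_global
    if n_local < 0:
        raise ValueError("more GPUs than amplitudes")
    return n_local, n_global


def partner_rank(rank: int, target: int, n_local: int):
    """Rank holding the paired amplitudes for a gate on `target`.

    Returns None when the target qubit is local and no exchange is needed.
    """
    if target < n_local:
        return None
    return rank ^ (1 << (target - n_local))


n_local, n_global = partition_qubits(36, 8)
print(n_local, n_global)
print(partner_rank(0, 35, n_local))   # gate on the highest qubit
print(partner_rank(5, 10, n_local))   # gate on a local qubit
```

Because the partner depends on which global qubit a gate targets, a deep circuit can end up exchanging data between every pair of GPUs.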

cuQuantum supports supplementing GPU memory with host memory; however, it is expected that this feature will come with a performance impact. For example, when simulating with two additional qubits using host memory, one can expect a fourfold increase in simulation time compared to using only GPU memory. This is simply because the simulation is performed on 1/4 of the number of GPUs. In practice, the performance hit can be significantly worse, as all the data in host memory must be moved to the GPU for computation at a bandwidth that may be limited by host memory itself or the bus to the GPU. While this feature is certainly valuable for fully exploiting the limits of a given system or job size, here we limited the scope to focus on GPU scalability for high performance.

At the CUDA-Q level, MPI support for simulation is enabled through the “`mgpu`” backend option. In some cases, this support is completely transparent to the user - running with MPI will automatically distribute the state vector. Common functions for sampling and calculation of expectation values will automatically return the correct values for each MPI task. For more sophisticated workflows, CUDA-Q includes an API for common MPI functions or users can directly make use of the MPI API through libraries within C++ or Python.

The QED-C Application-Oriented Benchmark Suite is an open-source framework designed to facilitate community contributions and extensibility. The modular architecture supports multiple quantum programming APIs, including Qiskit and CUDA-Q, enabling straightforward integration of new capabilities such as the MPI functionality described here. Here, we describe the implementation details of the MPI extensions to the Python-based benchmarks, with complete source code available in the QED-C Application-Oriented Benchmarks repository [26].

The first change is to enable transparent support for running on systems that may or may not include MPI libraries. Here, we made the decision to enable MPI only if the `mpi4py` module, available for installation through PyPI, is loaded. Thus, to run with MPI, users need only to add the MPI launch wrapper and load the `mpi4py` module (e.g. `mpirun -np $nGPUs python3 -m mpi4py script_name.py`). CUDA-Q/cuQuantum will automatically assign available GPUs on a node to MPI tasks in a round-robin manner.

We then conditionally implement some common functions, for example:

---

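```
import sys

# Enable MPI only when mpi4py has already been loaded
# (e.g. via `python -m mpi4py`); otherwise use serial stubs.
if "mpi4py" not in sys.modules:
    def leader():
        return True
    def barrier():
        return
    def bcast(data):
        return data
else:
    from mpi4py import MPI
    rank = MPI.COMM_WORLD.Get_rank()
    def leader():
        return rank == 0
    def barrier():
        MPI.COMM_WORLD.barrier()
        return
    def bcast(data):
        return MPI.COMM_WORLD.bcast(data,
                                     root=0)
```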

---

A common mistake for beginners with MPI in Python is to let all MPI tasks compete to write to the same file. In the QED-C benchmarks, for example, it is necessary to protect the downloading of HamLib data so that only a single MPI leader performs the download, using synchronization and broadcast so that other MPI tasks do not proceed until the data is available. Of course, the same issue will occur with output intended for the screen. For the QED-C benchmarks, we were able to use a minimally invasive approach by redirecting the output of all but one MPI task:

---

```
import os, sys
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
if rank > 0:
    # Silence every task except the leader.
    f = open(os.devnull, 'w')
    sys.stdout = f
```

---
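The leader-only download described above can be sketched in the same style. The example below is a minimal illustration, not the code from the QED-C repository: `load_once` and `fake_download` are hypothetical names, and the download is replaced by a stand-in that returns a small dictionary. As in the benchmarks, MPI is used only if `mpi4py` has already been loaded.

```python
import sys

if "mpi4py" in sys.modules:
    from mpi4py import MPI

    def leader():
        return MPI.COMM_WORLD.Get_rank() == 0

    def bcast(data):
        return MPI.COMM_WORLD.bcast(data, root=0)
else:
    def leader():
        return True

    def bcast(data):
        return data


def load_once(fetch):
    """Run `fetch` on the leader only and share the result with every task."""
    data = fetch() if leader() else None
    # Non-leader tasks wait inside bcast until the leader's data arrives.
    return bcast(data)


def fake_download():
    # Hypothetical stand-in for fetching a Hamiltonian from disk or network.
    return {"name": "tfim", "qubits": 33}


hamiltonian = load_once(fake_download)
print(hamiltonian["qubits"])
```

Broadcasting the loaded object, rather than having each task reread a shared file, also avoids a burst of simultaneous file-system accesses at scale.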

A final important implementation note is the use of synchronization for benchmarking. That is, we enforce a barrier before starting timing so that statistics are not influenced by implicit synchronization over variable startup costs or conditional execution. MPI support has been added to a subset of the QED-C benchmarks, and we are currently expanding this coverage.
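A minimal sketch of such a timing loop is given below. It assumes a user-supplied callable `run_circuit` and discards the first measurement as a warm-up, mirroring the procedure described for the benchmarks; the function names are illustrative and the code is not taken from the QED-C implementation.

```python
import sys
import time

if "mpi4py" in sys.modules:
    from mpi4py import MPI

    def barrier():
        MPI.COMM_WORLD.barrier()
else:
    def barrier():
        return


def time_circuits(run_circuit, n_circuits, warmup=1):
    """Time each circuit after a barrier, dropping the warm-up runs."""
    times = []
    for index in range(n_circuits):
        barrier()                        # align all tasks before the clock starts
        start = time.perf_counter()
        run_circuit(index)
        elapsed = time.perf_counter() - start
        if index >= warmup:              # skip warm-up measurements
            times.append(elapsed)
    return times


# Placeholder workload standing in for a circuit execution.
timings = time_circuits(lambda index: sum(range(10_000)), n_circuits=5)
print(f"mean time over {len(timings)} circuits: "
      f"{sum(timings) / len(timings):.6f} s")
```

Without the barrier, a task that starts late would spend part of its measured time waiting inside the first collective operation, inflating the reported execution time.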

In order to exploit low-level optimizations for NVLink in cuQuantum, the CUDA-Q environment variable `CUDAQ_GPU_FABRIC` can be used to specify the network configuration. This can be set to `NONE` if NVLink is not available, `NVL` if intranode NVLink is available, `MNNVL` for running within a single MNNVL domain, or a number that specifies the size of MNNVL domains. The latter can be used, for example, when running across multiple NVL72 racks where some communication must involve InfiniBand.

### 2.6. Benchmark Systems and Methodology

We can control the interconnect paths and APIs used at runtime with environment variables. In all cases, we set `OMPI_MCA_pml="ucx"` so that Open MPI will use UCX for communications. For systems with MNNVL, `UCX_CUDA_IPC_ENABLE_MNNVL = {"yes", "no"}` is used to control whether MPI internode communications exploit MNNVL or only occur over the InfiniBand network. (As of UCX 1.9, the default for this environment variable is "try", such that MNNVL will be used if available.) Within cuQuantum, we can compare CUDA-aware MPI implementations to low-level API implementations using the `CUDAQ_GPU_FABRIC` variable. For comparing PCIe to NVL intranode, different host systems are used. The impact of InfiniBand GPUDirect RDMA is evaluated with the `UCX_IB_GPU_DIRECT_RDMA = {"y", "n"}` environment variable. For all benchmarks, MPI tasks were affinitized to $1/N$ of the logical cores on the system along with the corresponding NUMA memory; this step is critical for performance for these workloads on configurations where significant data moves through host memory interconnects to the NIC.

For CUDA-Q version 0.12 used here, we also set the `CUDAQ_GLOBAL_INDEX_BITS` environment variable for multi-node simulations when `CUDAQ_GPU_FABRIC` is set to `NVL`. This environment variable controls the topology for hierarchical communication where some groups of MPI tasks have higher communication BW. Two values were given with the first set to  $\log_2(N)$  and the second to  $\log_2(P/N)$  where  $N$  is the number of MPI tasks within an NVL domain and  $P$  is the total number of MPI tasks. For later versions of CUDA-Q, setting this environment variable is unnecessary as the configuration is automatic based on the setting for `CUDAQ_GPU_FABRIC`.
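The two values follow directly from the task counts, as the sketch below shows. It assumes that both $N$ and $P$ are powers of two and that $N$ divides $P$; the function name is hypothetical, and the code only computes the integers without asserting how they are written into the environment variable.

```python
def global_index_bits(tasks_per_domain: int, total_tasks: int):
    """Return (log2(N), log2(P / N)) for N tasks per NVL domain and P tasks."""
    def is_power_of_two(value: int) -> bool:
        return value >= 1 and value & (value - 1) == 0

    if not (is_power_of_two(tasks_per_domain) and is_power_of_two(total_tasks)):
        raise ValueError("this sketch assumes power-of-two task counts")
    if total_tasks % tasks_per_domain:
        raise ValueError("total tasks must be a multiple of tasks per domain")

    intra_domain_bits = tasks_per_domain.bit_length() - 1
    inter_domain_bits = (total_tasks // tasks_per_domain).bit_length() - 1
    return intra_domain_bits, inter_domain_bits


# Example: 4 GPUs per node with intranode NVLink, 16 MPI tasks in total.
print(global_index_bits(4, 16))
```

The first value counts the index bits whose exchanges stay within the high-bandwidth domain, and the second counts those that must cross the slower internode network.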

The primary system used for the studies is "Genesis", an in-house GB200 NVL72 rack. The system has 4 GB200 "Blackwell" GPUs and 2 NVIDIA Grace CPUs per node with 72 GPUs connected by all-to-all MNNVL generation 5 fabric. In addition to MNNVL, the system is configured with one ConnectX-7 (NDR InfiniBand) 400Gb port per GPU with a fully-balanced, non-blocking fat tree topology. For the previous generation systems, we use "Hopper" to refer to a single x86 server with a single 80GB HBM3 H100 GPU. "Ampere" refers to the generation before "Hopper" with eight 80GB A100 GPUs in a single x86 server connected by NVL. "Ampere-PCI" refers to a single x86 server with eight 80GB A100 GPUs with a PCI-express interconnect. As a baseline multi-node system using A100 GPUs, we use the Perlmutter system at the National Energy Research Scientific Computing Center (NERSC). Perlmutter has 4 A100 GPUs per node with one Slingshot 11 "Cassini" NIC per GPU using a 3-hop dragonfly topology for the internode network [41]. Over 7000 GPUs are available on the entire system. Further details of the systems are available in Table 3.

|                            | Ampere-PCI         | Ampere             | Perlmutter       | Hopper             | Genesis              |
|----------------------------|--------------------|--------------------|------------------|--------------------|----------------------|
| GPU                        | A100               | A100               | A100             | H100               | GB200                |
| GPU Memory                 | 80GB HBM2e         | 80GB HBM2e         | 80GB HBM2e       | 80GB HBM3          | 192GB HBM3e          |
| GPUs / Node                | 8                  | 8                  | 4                | 1                  | 4                    |
| Intranode GPU Interconnect | PCIe 4.0           | NVL 3              | NVL 3            | N/A                | NVL 5                |
| CPU                        | 2S AMD EPYC 7742   | 2S AMD EPYC 7742   | 1S AMD EPYC 7763 | 1S AMD EPYC 7413   | 2S NVIDIA Grace      |
| CPU Memory                 | 1TB DDR4 3200      | 2TB DDR4 3200      | 256GB DDR4 3200  | 256GB DDR4 3200    | 960GB LPDDR5         |
| CPU-GPU Interconnect       | PCIe 4.0           | PCIe 4.0           | PCIe 4.0         | PCIe 5.0           | NVL 4 C2C            |
| Internode Interconnect     | N/A                | N/A                | HPE Slingshot 11 | N/A                | NVL 5 and ConnectX-7 |
| CUDA Driver Version        | 570.133.20         | 570.133.20         | 550.163.01       | 580.95.05          | 580.82.07            |
| OS                         | Ubuntu 22.04.5 LTS | Ubuntu 22.04.5 LTS | SUSE SLES 15 SP5 | Ubuntu 24.04.3 LTS | Ubuntu 24.04.1 LTS   |

Table 3: Benchmark system specifications

## 3. Results

Single GPU generational speedups for the 33-qubit QPE and HamLib benchmarks are shown in Figure 1 with approximately 1.8X speedup going from the NVIDIA Ampere to Hopper generations and 2.2-2.4X from Hopper to Blackwell. Multi-GPU weak scaling performance is shown in Figure 2, starting with 33 qubits and increasing this number by one with each doubling of GPU counts. This represents the largest simulations fitting into GPU memory on Perlmutter. Due to the higher memory capacity of the GB200 GPUs, we could simulate between 34 and 40 qubits on the Genesis system with better parallel efficiency. However, this would obfuscate the comparison to previous generations. The ideal weak scaling performance is based on simulation time for a single GB200 GPU, normalized to account for the increasing gate count for larger circuits with more qubits.
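The weak-scaling setup can be sketched numerically. The helper names below are hypothetical, and the 8 bytes per amplitude (single-precision complex) is an assumption that this section does not state; it is chosen because a 33-qubit state vector then occupies 64 GiB, which fits in an 80GB A100. The efficiency function follows the gate-count normalization described above and is meant to be fed measured timings, none of which are reproduced here.

```python
def weak_scaling_qubits(num_gpus: int, base_qubits: int = 33) -> int:
    """One extra qubit for every doubling of the GPU count."""
    if num_gpus < 1 or num_gpus & (num_gpus - 1):
        raise ValueError("num_gpus must be a power of two")
    return base_qubits + num_gpus.bit_length() - 1


def state_vector_gib_per_gpu(num_qubits: int, num_gpus: int,
                             bytes_per_amplitude: int = 8) -> float:
    """Per-GPU share of the 2**n amplitudes, in GiB (precision is an assumption)."""
    return (2 ** num_qubits) * bytes_per_amplitude / num_gpus / 2 ** 30


def weak_scaling_efficiency(t_single: float, gates_single: int,
                            t_multi: float, gates_multi: int) -> float:
    """Ideal time is the single-GPU time scaled by the growth in gate count."""
    ideal = t_single * gates_multi / gates_single
    return ideal / t_multi


for gpus in (1, 2, 4, 8, 16, 32, 64):
    qubits = weak_scaling_qubits(gpus)
    print(gpus, qubits, state_vector_gib_per_gpu(qubits, gpus))
```

Under these assumptions the per-GPU state-vector footprint stays constant across the sweep, which is what makes the comparison a weak-scaling one.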

The speedup with MNNVL compared to InfiniBand on Genesis ranges from 2.8-4.1X, going from 2 to 16 nodes. With MNNVL, there is an initial decrease in parallel efficiency to 73% at 4 GPUs. This is due to a significant fraction of data movement shifting from 8 TB/s HBM3e memory to the 1.8 TB/s (bidirectional) MNNVL links. This balances out, and parallel efficiency remains steady at 67-73% up to 64 GPUs. With InfiniBand, there is a significant decrease in performance going from 4 GPUs within an intranode NVL domain to 8 GPUs across nodes over an interconnect path with significantly lower BW. As the addition of this path significantly impacts the bisection BW, the benefit of intranode NVL can be diminished (see where "Ampere-PCI" meets Perlmutter).

Figure 1: Single-GPU generational speedups for CUDA-Q simulation of the 33-qubit QPE and HamLib circuits on GPUs from Ampere, Hopper, and Genesis systems. Absolute Ampere measurements were 5.2s and 72.6s for the respective benchmarks.

Figure 2: Weak-scaling performance for the QPE benchmark on various systems. Genesis-MPI uses CUDA-aware MPI algorithms for MNNVL, whereas Genesis-CUDA uses the low-level VMM API. Genesis-IB and Genesis-IB-RDMA disable MNNVL with the former also disabling RDMA from the NIC. The number of qubits ranges from 33 on a single GPU to 39 on 64 GPUs.

Strong-scaling performance for the 33-qubit QPE benchmark is shown in Figure 3. In this case, MNNVL performance is 2.7-3.6X faster than InfiniBand on the Genesis system, and performance is monotonically increasing up to 64 GPUs. Again, with InfiniBand, there is a significant decrease in performance at the internode threshold. After this initial shift in data movement over InfiniBand at 8 GPUs, performance again continues to improve with the addition of more aggregate BW in both the internode and intranode paths. However, the strong scaling efficiency is significantly impacted. Again, we see that intranode NVL benefits can be diminished.
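For reference, strong-scaling speedup and efficiency for a fixed problem size can be computed as in the sketch below. The function name is hypothetical, and the timings in the usage line are placeholders chosen for easy arithmetic, not measurements from this work.

```python
from typing import Tuple


def strong_scaling(t_ref: float, gpus_ref: int,
                   t: float, gpus: int) -> Tuple[float, float]:
    """Return (speedup, efficiency) relative to a reference run of the same circuit.

    speedup    = t_ref / t
    efficiency = speedup / (gpus / gpus_ref); 1.0 means ideal scaling.
    """
    if min(t_ref, t) <= 0 or min(gpus_ref, gpus) < 1:
        raise ValueError("times must be positive and GPU counts at least 1")
    speedup = t_ref / t
    efficiency = speedup * gpus_ref / gpus
    return speedup, efficiency


# Placeholder timings only: 100 s on 1 GPU versus 25 s on 8 GPUs.
print(strong_scaling(100.0, 1, 25.0, 8))
```

A run can show monotonically increasing speedup while its efficiency keeps falling, which is the situation described for the InfiniBand configurations.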

Figure 3: Strong-scaling performance for the 33-qubit QPE benchmark on various systems. Genesis-MPI uses CUDA-aware MPI algorithms for MNNVL, whereas Genesis-CUDA uses the low-level VMM API. Genesis-IB and Genesis-IB-RDMA disable MNNVL with the former also disabling RDMA from the NIC.

For the 33-qubit HamLib benchmark, we observe lower network sensitivity (Figure 4). Performance improves monotonically in all cases, however, and MNNVL still outperforms InfiniBand by 1.5-3X on Genesis. Although bisection BW is the primary concern here, we still observe a significant performance decrease when disabling RDMA between the GPUs and InfiniBand NICs (Genesis-IB-RDMA vs Genesis-IB). In the strong-scaling cases, the performance hit ranges from 13% to 59%; with weak scaling, it can be as high as 68%. In comparing MNNVL algorithms in cuQuantum, implementations with the low-level VMM API significantly outperform CUDA-aware MPI, and the impact increases at higher GPU counts. For HamLib, the advantage ranged from 1.06X to 1.25X between 2 and 16 nodes. For QPE with strong scaling, the range was 1.11-1.41X, and for weak scaling, the range was 1.11-1.61X. We expect algorithmic differences to contribute more to this gap than the benefits of zero-copy transfers without buffering. For single-node benchmarks, the NVL interconnect significantly outperformed PCI-express with up to 2.5X performance for the HamLib benchmark and 3.5X for QPE.

In comparing Genesis to Perlmutter, intranode speedups with 1-4 GPUs were balanced and did not significantly deviate from those achieved with a single GPU. Disproportionate advances in the internode interconnect with MNNVL yield larger speedups as the node count increases. For multiple nodes, strong-scaling performance on Genesis was between 9.7X and 15X higher for QPE and between 6X and 8.1X higher for HamLib. Weak-scaling performance was 13X higher at 36 qubits and 16.2X-16.8X higher at larger qubit counts. Figure 5 illustrates the speedups compared to Perlmutter, with the best simulation times for each configuration tested here.



Figure 4: Strong-scaling performance for the 33-qubit HamLib benchmark on various systems. Genesis-MPI uses CUDA-aware MPI algorithms for MNNVL, whereas Genesis-CUDA uses the low-level VMM API. Genesis-IB and Genesis-IB-RDMA disable MNNVL with the former also disabling RDMA from the NIC.



Figure 5: Speedup in circuit simulation time with 64-GPU Perlmutter performance as the baseline.

## 4. Conclusion

Ignoring software optimizations, over the last four years, we have seen over 4X improvement for single-GPU quantum circuit simulation with three generations of NVIDIA GPU HW. The addition of MPI support into the QED-C benchmarks has enabled benchmarking on HPC systems using significantly higher qubit counts with better time to solution. While the potential impact on quantum computing research is already impressive at 4X, the benchmarks show considerably larger gains in multi-GPU performance, with over 16X better performance over the last three years (comparing the Grace Blackwell NVL72 system to the Perlmutter system with HPE Slingshot 11).

Although the benefits of GPU-NIC RDMA may have been unclear on early systems, it is now a best practice to enable this configuration. While this is particularly true where network latency is concerned, we still observed significant benefits with the GB200 networked through ConnectX-7 NDR InfiniBand. In our testing, we observed the expected improvements for single-node simulation using NVL as compared to PCIe for the intranode GPU interconnect. However, this benefit can be diminished when bisection BW becomes limited at the internode interconnect. For this reason, MNNVL represents a significant advance for state-vector simulation. As low-level algorithms in cuQuantum performed significantly better for MNNVL, users should enable them through the `CUDAQ_GPU_FABRIC` setting for best performance.

Gate fusion techniques in state-vector simulation can enable balanced arithmetic intensity with GPUs, utilizing near-peak main memory BW and floating-point throughput. Even with MNNVL, there remains a significant gap in peak BW for main memory versus the inter-GPU interconnect. For this reason, it is expected that a highly optimized single-GPU implementation will incur some loss in parallel efficiency; we anticipate that further optimizations for concurrent communications will be non-trivial due to increased main memory contention. Nonetheless, there is headroom for improvement, and systems such as Genesis, with fast coherent interconnects between the host and the GPU, along with multiple internode interconnects, offer the potential for sophisticated algorithms to best exploit the system.

While our focus here has been on HW and APIs for communication, algorithmic advances will also be critical. In this regard, sophisticated optimizations for the NVIDIA Grace-Hopper architecture were employed in the Jülich universal quantum computer simulator, demonstrating a capability for 50-qubit simulations when using the entire supercomputer [54]. Of course, there are many other examples with ongoing work in-house and elsewhere [55–61]. We expect continued improvements, both in algorithms and HW, that will prove vital in advancing the field of quantum computing.

## 5. Acknowledgments

We thank Takuma Yamaguchi and Akshay Venkatesh for their expert reviews of this work.

This research used the resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231 using NERSC award NERSC DDR-ERCAP0034389.

AR and DEBN acknowledge the support of the Davidson School of Chemical Engineering by the Center for Quantum Technologies under the Industry-University Cooperative Research Center Program at the US National Science Foundation under Grant No. 2224960.

## References

- [1] D. Bluvstein, S. J. Evered, A. A. Geim, S. H. Li, H. Zhou, T. Manovitz, S. Ebadi, M. Cain, M. Kalinowski, D. Hangleiter, et al., Logical quantum processor based on reconfigurable atom arrays, *Nature* 626 (7997) (2024) 58–65.
- [2] Google Quantum AI and Collaborators, Observation of constructive interference at the edge of quantum ergodicity, *Nature* 646 (8086) (2025) 825–830.
- [3] T. Proctor, K. Young, A. D. Baczewski, R. Blume-Kohout, Benchmarking quantum computers, *Nat. Rev. Phys.* 7 (2) (2025) 105–118. doi:10.1038/s42254-024-00796-z. URL <https://www.nature.com/articles/s42254-024-00796-z>
- [4] A. Hashim, L. B. Nguyen, N. Goss, B. Marinelli, R. K. Naik, T. Chistolini, J. Hines, J. Marceaux, Y. Kim, P. Gokhale, T. Tomesh, S. Chen, L. Jiang, S. Ferracin, K. Rudinger, T. Proctor, K. C. Young, I. Siddiqi, R. Blume-Kohout, Practical introduction to benchmarking and characterization of quantum computers, *PRX Quantum* 6 (2025) 030202. doi:10.1103/PRXQuantum.6.030202. URL <https://link.aps.org/doi/10.1103/PRXQuantum.6.030202>
- [5] A. W. Cross, L. S. Bishop, S. Sheldon, P. D. Nation, J. M. Gambetta, Validating quantum computers using randomized model circuits, *Physical Review A* 100 (3) (2019) 032328.
- [6] T. Q. Team, Measuring quantum volume (Aug 2021). URL <https://qiskit.org/textbook/ch-quantum-hardware/measuring-quantum-volume.html>
- [7] C. H. Baldwin, K. Mayer, N. C. Brown, C. Ryan-Anderson, D. Hayes, Re-examining the quantum volume test: Ideal distributions, compiler optimizations, confidence intervals, and scalable resource estimations, *Quantum* 6 (2022) 707.
- [8] E. Pelofske, A. Bärtschi, S. Eidenbenz, Quantum volume in practice: What users can expect from nisq devices, *IEEE Transactions on Quantum Engineering* 3 (2022) 1–19.
- [9] T. Proctor, K. Rudinger, K. Young, E. Nielsen, R. Blume-Kohout, Measuring the capabilities of quantum computers, *Nature Physics* 18 (1) (2022) 75–79.
- [10] R. Blume-Kohout, K. C. Young, A volumetric framework for quantum computer benchmarks, *Quantum* 4 (2020) 362.
- [11] T. Proctor, S. Seritan, E. Nielsen, K. Rudinger, K. Young, R. Blume-Kohout, M. Sarovar, Establishing trust in quantum computations, arXiv preprint arXiv:2204.07568 (2022).
- [12] A. Wack, H. Paik, A. Javadi-Abhari, P. Jurcevic, I. Faro, J. M. Gambetta, B. R. Johnson, Quality, speed, and scale: three key attributes to measure the performance of near-term quantum computers (2021). doi:10.48550/ARXIV.2110.14108. URL <https://arxiv.org/abs/2110.14108>
- [13] A. Wack, H. Paik, A. Javadi-Abhari, P. Jurcevic, I. Faro, J. M. Gambetta, B. R. Johnson, Quality, speed, and scale: three key attributes to measure the performance of near-term quantum computers (2021), arXiv preprint arXiv:2110.14108 (2021).
- [14] Y. Kharkov, E. Mikhantiev, A. Kotelnikov, Arline benchmarks: Automated benchmarking platform for quantum compilers, arXiv preprint arXiv:2202.14025 (2022).
- [15] P. D. Nation, A. A. Saki, S. Brandhofer, L. Bello, S. Garion, M. Treinish, A. Javadi-Abhari, Benchmarking the performance of quantum computing software for quantum circuit creation, manipulation and compilation, *Nature Computational Science* (2025) 1–9.
- [16] T. Tomesh, P. Gokhale, V. Omole, G. S. Ravi, K. N. Smith, J. Viszlai, X.-C. Wu, N. Hardavellas, M. R. Martonosi, F. T. Chong, Supermarq: A scalable quantum benchmark suite, in: 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2022, pp. 587–603. doi:10.1109/HPCA53966.2022.00050.
- [17] A. Li, S. Stein, S. Krishnamoorthy, J. Ang, Qasmbench: A low-level quantum benchmark suite for nisq evaluation and simulation, *ACM Transactions on Quantum Computing* 4 (2) (Feb. 2023). doi:10.1145/3550488. URL <https://doi.org/10.1145/3550488>
- [18] T. Lubinski, S. Johri, P. Varosy, J. Coleman, L. Zhao, J. Necaise, C. H. Baldwin, K. Mayer, T. Proctor, Application-oriented performance benchmarks for quantum computing, *IEEE Transactions on Quantum Engineering* 4 (2023) 1–32. doi:10.1109/TQE.2023.3253761.
- [19] J. R. Finžgar, P. Ross, L. Hölscher, J. Klepsch, A. Luckow, Quark: A framework for quantum computing application benchmarking, in: 2022 IEEE International Conference on Quantum Computing and Engineering (QCE), 2022, pp. 226–237. doi:10.1109/QCE53715.2022.00042.

- [20] S. Boixo, S. V. Isakov, V. N. Smelyanskiy, R. Babbush, N. Ding, Z. Jiang, M. J. Bremner, J. M. Martinis, H. Neven, Characterizing quantum supremacy in near-term devices, *Nature Physics* 14 (6) (2018) 595–600. doi:10.1038/s41567-018-0124-x. URL <http://dx.doi.org/10.1038/s41567-018-0124-x>
- [21] D. E. Bernal Neira, R. Brown, P. Sathe, F. Wudarski, M. Pavone, E. Rieffel, D. Venturelli, Benchmarking the operation of quantum heuristics and Ising machines: scoring parameter setting strategies on optimization applications, *Quantum Machine Intelligence* 7 (2) (2025) 1–12.
- [22] T. Lubinski, C. Coffrin, C. McGeoch, P. Sathe, J. Apanavicius, D. E. B. Neira, Optimization applications as quantum performance benchmarks (2023). arXiv:2302.02278, doi:10.48550/arXiv.2302.02278. URL <https://arxiv.org/abs/2302.02278>
- [23] T. Lubinski, J. J. Goings, K. Mayer, S. Johri, N. Reddy, A. Mehta, N. Bhatia, S. Rappaport, D. Mills, C. H. Baldwin, L. Zhao, A. Barbosa, S. Maity, P. S. Mundada, Quantum algorithm exploration using application-oriented performance benchmarks (2024). arXiv:2402.08985. URL <https://arxiv.org/abs/2402.08985>
- [24] A. Chatterjee, S. Rappaport, A. Giri, S. Johri, T. Proctor, D. E. B. Neira, P. Sathe, T. Lubinski, A comprehensive cross-model framework for benchmarking the performance of quantum hamiltonian simulations (2024). arXiv:2409.06919. URL <https://arxiv.org/abs/2409.06919>
- [25] S. Niu, E. Kökcü, S. Johri, A. Ramesh, A. Chatterjee, D. E. Bernal Neira, D. Camps, T. Lubinski, A practical framework for assessing the performance of observable estimation in quantum simulation (2025). arXiv:2504.09813. URL <https://arxiv.org/abs/2504.09813>
- [26] QED-C, Application-Oriented Benchmarks for Quantum Computing (2019). URL <https://github.com/SRI-International/QC-App-Oriented-Benchmarks>
- [27] Multiple, Message passing interface - high performance computing (2022). URL <https://hpc.nmsu.edu/discovery/mpi/introduction/>
- [28] M. Vallero, P. Rech, F. Vella, State of practice: evaluating gpu performance of state vector and tensor network methods, *Future Generation Computer Systems* (2025) 107927.
- [29] J. Faj, I. Peng, J. Wahlgren, S. Markidis, Quantum computer simulations at warp speed: Assessing the impact of GPU acceleration: A case study with IBM Qiskit AER, NVIDIA Thrust & cuQuantum, in: 2023 IEEE 19th International Conference on e-Science (e-Science), IEEE, 2023, pp. 1–10.
- [30] A. Javadi-Abhari, M. Treinish, K. Krsulich, C. J. Wood, J. Lishman, J. Gacon, S. Martiel, P. D. Nation, L. S. Bishop, A. W. Cross, et al., Quantum computing with qiskit, arXiv preprint arXiv:2405.08810 (2024).
- [31] V. Bergholm, J. Izaac, M. Schuld, C. Gogolin, S. Ahmed, V. Ajith, M. S. Alam, G. Alonso-Linaje, B. AkashNarayanan, A. Asadi, et al., PennyLane: Automatic differentiation of hybrid quantum-classical computations, arXiv preprint arXiv:1811.04968 (2018).
- [32] C. Developers, Cirq: Python package for writing, manipulating, and running quantum circuits on quantum computers and simulators., Zenodo, 2025. doi:10.5281/ZENODO.4062499. URL <https://zenodo.org/doi/10.5281/zenodo.4062499>
- [33] J.-S. Kim, A. McCaskey, B. Heim, M. Modani, S. Stanwyck, T. Costa, CUDA Quantum: The platform for integrated quantum-classical computing, in: 2023 60th ACM/IEEE Design Automation Conference (DAC), IEEE, 2023, pp. 1–4.
- [34] M. Smelyanskiy, N. P. Sawaya, A. Aspuru-Guzik, qHipster: The quantum high performance software testing environment, arXiv preprint arXiv:1601.07195 (2016).
- [35] C. Zhang, Z. Song, H. Wang, K. Rong, J. Zhai, HyQuas: hybrid partitioner based quantum circuit simulation system on GPU, in: Proceedings of the 35th ACM International Conference on Supercomputing, 2021, pp. 443–454.
- [36] H. Bayraktar, A. Charara, D. Clark, S. Cohen, T. Costa, Y.-L. L. Fang, Y. Gao, J. Guan, J. Gunnels, A. Haidar, et al., cuquantum sdk: A high-performance library for accelerating quantum science, in: 2023 IEEE International Conference on Quantum Computing and Engineering (QCE), Vol. 1, IEEE, 2023, pp. 1050–1061.
- [37] NVIDIA Corporation, NVIDIA’s next generation CUDA compute architecture: Kepler GK110/210, <https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-product-literature/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf>, [Accessed 31-10-2025] (2012).
- [38] NVIDIA Corporation, NVIDIA Tesla P100 The most advanced datacenter accelerator ever built featuring Pascal GP100, the world's fastest gpu, <https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf>, [Accessed 31-10-2025] (2016).
- [39] NVIDIA Corporation, NVIDIA Blackwell architecture technical brief, <https://resources.nvidia.com/en-us-blackwell-architecture>, [Accessed 31-10-2025] (2025).
- [40] AMD Corporation, AMD Instinct MI300 series cluster reference architecture guide, <https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/other/instinct-mi300-series-cluster-reference-guide.pdf>, [Accessed 31-10-2025] (2025).
- [41] National Energy Research Scientific Computing Center, Perlmutter architecture, <https://docs.nersc.gov/systems/perlmutter/architecture/>, [Accessed 31-10-2025] (2023).
- [42] A. Shukla, P. Vedula, Towards practical quantum phase estimation: A modular, scalable, and adaptive approach (2025). arXiv:2507.22460. URL <https://arxiv.org/abs/2507.22460>
- [43] R. J. Elliott, P. Pfeuty, C. Wood, Ising model with a transverse field, Phys. Rev. Lett. 25 (1970) 443–446. doi:10.1103/PhysRevLett.25.443. URL <https://link.aps.org/doi/10.1103/PhysRevLett.25.443>
- [44] N. P. Sawaya, D. Marti-Dafcik, Y. Ho, D. P. Tabor, D. E. B. Neira, A. B. Magann, S. Premaratne, P. Dubey, A. Matsuura, N. Bishop, W. A. d. Jong, S. Benjamin, O. Parekh, N. Tubman, K. Klymko, D. Camps, HamLib: A library of Hamiltonians for benchmarking quantum algorithms and hardware, Quantum 8 (2024) 1559. doi:10.22331/q-2024-12-11-1559. URL <https://doi.org/10.22331/q-2024-12-11-1559>
- [45] T. Lubinski, J. J. Goings, K. Mayer, S. Johri, N. Reddy, A. Mehta, N. Bhatia, S. Rappaport, D. Mills, C. H. Baldwin, et al., Quantum algorithm exploration using application-oriented performance benchmarks, arXiv preprint arXiv:2402.08985 (2024).
- [46] M. Suzuki, Fractal decomposition of exponential operators with applications to many-body theories and Monte Carlo simulations, Physics Letters A 146 (6) (1990) 319–323. doi:10.1016/0375-9601(90)90962-N. URL <https://www.sciencedirect.com/science/article/pii/037596019090962N>
- [47] S. Dragoi, Analysis of the trotter method for hamiltonian simulation, arXiv preprint (2022).
- [48] The CUDA-Q development team, CUDA-Q (2025). URL <https://github.com/NVIDIA/cuda-quantum>
- [49] R. Bariuso, A. Knies, SHMEM's User's Guide, Cray Research, Inc., Minneapolis, Minnesota, 1994.
- [50] P. Shamis, M. G. Venkata, M. G. Lopez, M. B. Baker, O. Hernandez, Y. Itigin, M. Dubman, G. Shainer, R. L. Graham, L. Liss, et al., UCX: an open source framework for HPC network APIs and beyond, in: 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects, IEEE, 2015, pp. 40–43.
- [51] A. Li, B. Fang, C. Granade, G. Prawiroatmodjo, B. Heim, M. Roetteler, S. Krishnamoorthy, SV-Sim: scalable PGAS-based state vector simulation of quantum circuits, in: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–14.
- [52] NVIDIA Corporation, NVIDIA IMEX Service for NVLink Networks, <https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/imexchannels.html>, [Accessed 31-10-2025] (2025).
- [53] T. Jones, B. Koczor, S. C. Benjamin, Distributed simulation of statevectors and density matrices, arXiv preprint arXiv:2311.01512 (2023).
- [54] H. D. Raedt, J. Kraus, A. Herten, V. Mehta, M. Bode, M. Hrywniak, K. Michielsen, T. Lippert, Universal quantum simulation of 50 qubits on Europe's first exascale supercomputer harnessing its heterogeneous CPU-GPU architecture (2025). arXiv:2511.03359. URL <https://arxiv.org/abs/2511.03359>
- [55] C. Jiao, W. Zhang, L. Shen, Communication optimizations for state-vector quantum simulator on CPU+ GPU clusters, in: Proceedings of the 52nd International Conference on Parallel Processing, 2023, pp. 203–212.
- [56] A. Rezaei, L. Jaulmes, M. Bahna, O. T. Brown, A. Barbalace, Low-level and NUMA-aware optimization for high-performance quantum simulation, arXiv preprint arXiv:2506.09198 (2025).
- [57] Y. Teranishi, S. Hiraoka, W. Mizukami, M. Okita, F. Ino, Lazy qubit reordering for accelerating parallel state-vector-based quantum circuit simulation, ACM Transactions on Quantum Computing 6 (4) (2025) 1–33.
- [58] A. J. Gangapuram, A. Läuchli, C. Hempel, Benchmarking quantum computer simulation software packages: State vector simulators, SciPost Physics Core 7 (4) (2024) 075.
- [59] G. Stenzel, S. Zielinski, M. Kölle, P. Altmann, J. Nüßlein, T. Gabor, Qandle: Accelerating state vector simulation using gate-matrix caching and circuit splitting, arXiv preprint arXiv:2404.09213 (2024).
- [60] S. Westrick, P. Liu, B. Kang, C. McDonald, M. Rainey, M. Xu, J. Arora, Y. Ding, U. A. Acar, Grafeyn: Efficient parallel sparse simulation of quantum circuits, in: 2024 IEEE International Conference on Quantum Computing and Engineering (QCE), Vol. 1, IEEE, 2024, pp. 1132–1142.
- [61] J. Adamski, J. P. Richings, O. T. Brown, Energy efficiency of quantum statevector simulation at scale, in: Proceedings of the SC'23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, 2023, pp. 1871–1875.